# node-etl-pipeline

An experiment with Node streams, implementing a data-processing pipeline as a CLI.

The input data is an event log of pageviews: gzipped `.tsv` files with the header `date`, `time`, `userid`, `url`, `ip`, `useragent`.

The pipeline has four stages (a rough end-to-end sketch follows the list):

1. Read

   - Read `.gz` files from disk (input directory passed in via the CLI)
   - Extract / unzip
   - Parse the TSV (using `csv-parse`, which returns an array for each record)
   - Convert each array to an object for more readable access

2. Enrich

3. Aggregate

   - Build a map of `{ country/city: numberOfEvents }`
   - Filter events so that each `userId` is counted only once
   - Build a similar map of `{ browser/os: numberOfUsers }`
   - Iterate over those maps to find the top 5 in each category

4. Store

   - Write the final stream somewhere (output file passed in via the CLI)
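
A minimal sketch of how these stages can be wired together with Node's `stream.pipeline`, `zlib` and `csv-parse` is shown below. It processes a single file and counts events per `url`; the file paths, field handling and aggregation key are illustrative assumptions, not the actual code in this repo.

```js
// sketch.js — illustrative only; names and exact options are assumptions,
// not the implementation in this repo.
const fs = require('fs');
const zlib = require('zlib');
const { Transform, pipeline } = require('stream');
const { promisify } = require('util');
const parse = require('csv-parse'); // csv-parse v4 export; v5 exposes { parse } instead

const pipelineAsync = promisify(pipeline);

// Header of the input TSV, as described above.
const FIELDS = ['date', 'time', 'userid', 'url', 'ip', 'useragent'];

async function processFile(inputPath, outputPath) {
  const countsByUrl = new Map(); // stand-in for the country/city and browser/os maps

  await pipelineAsync(
    // 1. Read: stream the gzipped file from disk and unzip it.
    fs.createReadStream(inputPath),
    zlib.createGunzip(),
    // Parse the TSV; by default csv-parse emits an array per record.
    parse({ delimiter: '\t' }),
    // Convert each array into an object keyed by the header fields.
    // (2. Enrich would slot in here as another Transform.)
    new Transform({
      objectMode: true,
      transform(row, _enc, cb) {
        if (row[0] === 'date') return cb(); // skip the header row
        cb(null, Object.fromEntries(FIELDS.map((field, i) => [field, row[i]])));
      },
    }),
    // 3. Aggregate: count events per key, then emit the top 5 when the input ends.
    new Transform({
      objectMode: true,
      transform(record, _enc, cb) {
        countsByUrl.set(record.url, (countsByUrl.get(record.url) || 0) + 1);
        cb(); // nothing is pushed downstream until flush
      },
      flush(cb) {
        const top5 = [...countsByUrl.entries()]
          .sort((a, b) => b[1] - a[1])
          .slice(0, 5);
        this.push(JSON.stringify(top5, null, 2));
        cb();
      },
    }),
    // 4. Store: write the final result to the output file.
    fs.createWriteStream(outputPath)
  );
}

processFile('./data/example.tsv.gz', './top-urls.json').catch(console.error);
```

In the actual pipeline the input directory and output file are passed in via the CLI rather than hard-coded, and every `.gz` file in that directory is read.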

## How to run

- Install the latest version of Node (14), e.g. using nvm
- Install yarn
- Run `yarn` to install dependencies
- Drop your gzip files in the `data` dir
- Run `yarn generate ./data`

The CLI is built using commander, so you can also run `yarn cli` to see the available options and get help.
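
For context, a commander-based CLI definition looks roughly like the sketch below; the option names and descriptions are illustrative guesses, not necessarily the flags this project defines (run `yarn cli` to see those).

```js
// cli.js — illustrative only; option names and descriptions are assumptions.
const { program } = require('commander');

program
  .name('node-etl-pipeline')
  .description('Process gzipped pageview logs')
  .requiredOption('-i, --input <dir>', 'directory containing the .gz input files')
  .option('-o, --output <file>', 'file to write the aggregated results to');

program.parse(process.argv);

// opts() returns the parsed values, e.g. { input: './data', output: './out.json' }
console.log(program.opts());
```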

At the moment the scalability bottleneck is the filtering of unique users: the map of `userId`s is held in memory. In production that could be replaced with a key/value store such as Redis or DynamoDB.
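
One way to make that swap straightforward is to hide the unique-user check behind a tiny interface, roughly like the sketch below (the names are illustrative, not the repo's actual code):

```js
// Illustrative only: the unique-user check behind a minimal interface, so the
// in-memory set can later be replaced by a shared store such as Redis.
class InMemoryUserSet {
  constructor() {
    this.seen = new Set();
  }

  // Resolves true the first time a userId is seen, false afterwards.
  async addIfNew(userId) {
    if (this.seen.has(userId)) return false;
    this.seen.add(userId);
    return true;
  }
}

// A Redis-backed version could implement the same method with SADD, which
// returns 1 only when the member was not already present
// (e.g. with ioredis: (await redis.sadd('users', userId)) === 1).
```

The filtering transform would then await `addIfNew(record.userid)` and only count a user when it resolves true.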

## Next steps

So far I've spent about a day on this; with more time I'd look to implement the following:

- More tests
- Add config params, e.g. start/end date
- Replace the in-memory user map with a k/v store, e.g. Redis
- Deploy / host it somewhere