A pet project that demonstrates my Data Engineering skills. The goal of the project is to better understand your skills and experience in the Data and Software Engineering field and to see how you approach certain questions and challenges.
- You can use any programming language (our recommendation is Python)
- The solution should be flexible, stable and scalable, and should ensure good code quality (we want to bring this into production as a next step)
- We don't expect you to spin up any costly machines in the cloud or elsewhere to process Gigabytes or Terabytes of data!
- Unless stated otherwise, you can use any tool from one of the major cloud providers or any other system that you like
- Ingestion:
  - Load one of the datasets from below onto your laptop or any other system (e.g. HDFS, Database, etc.). Keep it simple, only load as much data as you can process with your system of choice (e.g. load a week's or a month's worth of data).
  - Download a weather dataset for the same timeframe as the above dataset, so that you can later join the two datasets.
- Preparation/Data Cleansing:
  - How can we ensure the quality of the data? What checks could be implemented? Implement a simple method to ensure the date/timestamp is in the right format throughout the datasets.
- Processing:
  - Join the two datasets, so that we know the weather for each entry in the main dataset.
- Analysis (optional):
  - Run a small analysis of your choice on the data (e.g. table, chart, map, etc.). Here are some ideas for the analysis part:
    - Impact of weather on the ridership of taxis, bikes, etc.
    - How many customers per day, hour or weekday?
    - ...
- Datasets:
  - New York City Bike Share: https://www.citibikenyc.com/system-data (Stream + History)
  - New York City Taxi Trips: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page (History)
  - Feel free to pick another interesting dataset if you like
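
The date/timestamp check requested under Preparation/Data Cleansing could look roughly like the following. This is a minimal sketch in plain Python, not the project's actual implementation; the format string and the names `EXPECTED_FORMAT`, `parse_timestamp` and `find_invalid_timestamps` are assumptions made for illustration.

```python
from datetime import datetime
from typing import Iterable, List, Optional

# Assumed target format; the real datasets may use a different one.
EXPECTED_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_timestamp(value: str, fmt: str = EXPECTED_FORMAT) -> Optional[datetime]:
    """Return the parsed timestamp, or None if the value does not match fmt."""
    try:
        return datetime.strptime(value.strip(), fmt)
    except (ValueError, AttributeError):
        # ValueError: wrong format or impossible date; AttributeError: value is not a string.
        return None


def find_invalid_timestamps(values: Iterable[str], fmt: str = EXPECTED_FORMAT) -> List[str]:
    """Collect every value that fails the format check."""
    return [v for v in values if parse_timestamp(v, fmt) is None]


# Made-up sample values: one valid, one in another format, one impossible date.
sample = ["2022-04-30 08:15:00", "30.04.2022 08:15", "2022-04-31 10:00:00"]
invalid = find_invalid_timestamps(sample)
print(invalid)  # ['30.04.2022 08:15', '2022-04-31 10:00:00']
```

Using `strptime` rather than a regular expression also rejects values that have the right shape but are not real dates, such as April 31.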
Project Name | Data Engineer Pet Project |
---|---|
Category | Pet project |
Description | Processing and analyzing two datasets |
Tech stack | Hadoop HDFS, PySpark |
Office | None |
Status | Done |
ETA | 30.09.2022 |
Name | Role | Type | Availability | Location | Time Zone |
---|---|---|---|---|---|
Karim Safiullin | Data Engineer | on bench | full time | Germany, Ilmenau | CEST (UTC +2) |
- write plan to develop app (plan + architecture) - 2h
- find, download and take a look at the datasets - 2h
- set up hadoop/yarn/spark locally on laptop - 2h
- upload to db/hdfs locally if resources are available - 3h
- set up environment, logging, application skeleton etc. - 4h
- implement cleansing logic (date format etc.) - 4h
- join the 2 datasets using PySpark or another technology - 3h
- find anomalies, dependencies, statistics and other things - 5h
- implement unit tests - 3h
Time estimate: 28h. Actual time spent: 35h.
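
The join step from the plan can be illustrated as follows. This is a sketch in plain Python rather than PySpark, so that it runs without a cluster, and it is not the project's actual code. It assumes hourly weather observations and the column names `ride_id`, `started_at`, `observed_at` and `temperature_c`, all of which are illustrative and may not match the real datasets.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, List, Optional

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format for both datasets


def hour_key(timestamp: str) -> datetime:
    """Truncate a timestamp string to the start of its hour."""
    return datetime.strptime(timestamp, TS_FORMAT).replace(minute=0, second=0)


def join_trips_with_weather(trips: List[dict], weather: List[dict]) -> List[dict]:
    """Left join: every trip is kept; the temperature is None when no weather row matches."""
    weather_by_hour: Dict[datetime, dict] = {
        hour_key(row["observed_at"]): row for row in weather
    }
    joined = []
    for trip in trips:
        match: Optional[dict] = weather_by_hour.get(hour_key(trip["started_at"]))
        temperature = match["temperature_c"] if match else None
        joined.append({**trip, "temperature_c": temperature})
    return joined


# Made-up sample rows, for illustration only.
trips = [
    {"ride_id": "a1", "started_at": "2022-04-30 08:15:42"},
    {"ride_id": "b2", "started_at": "2022-04-30 08:59:59"},
    {"ride_id": "c3", "started_at": "2022-04-30 11:05:00"},
]
weather = [
    {"observed_at": "2022-04-30 08:00:00", "temperature_c": 12.5},
    {"observed_at": "2022-04-30 09:00:00", "temperature_c": 14.0},
]

joined = join_trips_with_weather(trips, weather)
# A very small "dependency" statistic: number of trips per observed temperature.
trips_per_temperature = Counter(row["temperature_c"] for row in joined)
print(joined)
print(trips_per_temperature)
```

In PySpark the same idea would typically be a left join on a timestamp column truncated to the hour. A left join is used here so that trips without a matching weather row stay visible instead of silently disappearing.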
The data lake storage is organized into several stages.
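
As a rough illustration of the staged layout, the sketch below builds the two HDFS locations that appear in the commands in this README: a per-dataset `landing` directory and a `public` directory for reports. Any other stages are not shown in this README and are deliberately left out rather than guessed. The helper names `landing_path` and `public_path` are hypothetical.

```python
from pathlib import PurePosixPath

# Root taken from the example commands in this README; adjust to your own HDFS user.
HDFS_ROOT = PurePosixPath("/user/karim")


def landing_path(dataset: str) -> PurePosixPath:
    """Directory where raw files of a dataset are placed first."""
    return HDFS_ROOT / dataset / "landing"


def public_path(report: str) -> PurePosixPath:
    """Directory where final reports are published."""
    return HDFS_ROOT / "public" / report


print(landing_path("citibike"))     # /user/karim/citibike/landing
print(public_path("bike_weather"))  # /user/karim/public/bike_weather
```

Keeping the path construction in one place means the stage names only have to change in a single module if the layout is reorganized.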
(with my comments/reflections)
- Clone the repository:
https://github.com/BondaiKa/data_engineer_pet_project
- Install the packages locally or in a virtualenv:
pip install -r requirements.txt
- Download the datasets locally, setting the date (in YYYY-mm-dd format):
- Citibike
python3 data-engineer-pet-project-cli.py load-citibike-dataset-locally-cli --date "2022-04-30"
- Weather: download it manually from the weather dataset link
- Copy the data from the local filesystem to HDFS, as in the example below:
hdfs dfs -put /Volumes/Samsung_T5/datasets/citibike/202204-citibike-tripdata.csv /user/karim/citibike/landing
- After that, you can run the whole pipeline with the command below:
bash scripts/run-all.sh
- To run a particular command, use data-engineer-pet-project-cli.py, as in the example below:
python3 data-engineer-pet-project-cli.py load-citibike-dataset-locally-cli --date "2022-04-30"
- To copy a report from HDFS to the local filesystem, run a command like the one below:
hadoop fs -copyToLocal /user/karim/public/bike_weather/202204_temperature_dependency_bike_weather.csv .
- You can use Jupyter Notebook or Google Colab to analyze the final reports.
- First, I tried to download the Citibike dataset directly to HDFS, but I could not fix a problem with the Hadoop settings. Hadoop rejected the requests with errors like
ConnectionError or Connection reset by peer
So I skipped this step and loaded the data from local storage to HDFS manually using the hdfs dfs command.
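
The manual upload step could be scripted along these lines. This is only a sketch: the file name pattern is inferred from the example `hdfs dfs -put` command in this README, and the helper names `citibike_file_name` and `hdfs_put_command` are hypothetical. The command is only built and printed here; actually running it requires a machine with a working Hadoop client.

```python
from datetime import date
from typing import List


def citibike_file_name(day: date) -> str:
    """Monthly file name, following the pattern seen in the example command."""
    return f"{day:%Y%m}-citibike-tripdata.csv"


def hdfs_put_command(local_dir: str, hdfs_dir: str, day: date) -> List[str]:
    """Build the argument list for copying the monthly file from local storage to HDFS."""
    local_file = f"{local_dir.rstrip('/')}/{citibike_file_name(day)}"
    return ["hdfs", "dfs", "-put", local_file, hdfs_dir]


command = hdfs_put_command(
    "/Volumes/Samsung_T5/datasets/citibike",
    "/user/karim/citibike/landing",
    date(2022, 4, 30),
)
print(" ".join(command))
# hdfs dfs -put /Volumes/Samsung_T5/datasets/citibike/202204-citibike-tripdata.csv /user/karim/citibike/landing

# To execute it on a machine with Hadoop installed:
# import subprocess
# subprocess.run(command, check=True)
```

Passing the command as a list instead of a single shell string avoids quoting problems with paths that contain spaces.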