
:shipit: Data Engineer Pet Project :shipit:

A pet project that showcases my Data Engineering skills. The goal of the project is to give a better understanding of my skills and experience in the Data and Software Engineering field, as well as to show how I approach certain questions and challenges.

Agenda

Description

Requirements/Restrictions ❗

  • You can use any programming language (our recommendation is Python)
  • The solution should be flexible, stable and scalable as well as ensure a good code quality (we want to bring this into production as a next step)
  • We don't expect you to spin up any costly machines in the cloud or elsewhere to process Gigabytes or Terabytes of data!
  • Unless stated otherwise, you can use any tool from one of the major cloud providers or any other system that you like

Assignment:

  • Ingestion:
    • Load one of the datasets from below onto your laptop or any other system (e.g. HDFS, Database, etc.). Keep it simple, only load as much data as you can process with your system of choice (e.g. load a week or month worth of data).
    • Download a weather dataset for the same timeframe as the above dataset, so that you can later join the two datasets
  • Preparation/Data Cleansing:
    • How can we ensure the quality of the data? What checks could be implemented? Implement a simple method to ensure the date/timestamp is in the right format throughout the datasets (see the sketch after this list).
  • Processing:
    • Join the two datasets, so that we know the weather for each entry in the main dataset.
  • Analysis (optional):
    • Run a small analysis of your choice on the data (e.g. table, chart, map, etc.). Here are some ideas for the analysis part:
      • Impact of weather on the ridership of taxis, bikes, etc.
      • How many customers per day, hour or weekday?
      • ...
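For the date/timestamp check mentioned in the Preparation/Data Cleansing item, here is a minimal sketch, assuming PySpark and a hypothetical `started_at` string column (the real file and column names depend on the chosen dataset):

```python
# Minimal sketch of a date-format check; the file path and the column
# name `started_at` are assumptions, not part of the original assignment.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("date-format-check").getOrCreate()
df = spark.read.csv("202204-citibike-tripdata.csv", header=True)

# to_timestamp() yields NULL for values that do not match the expected
# format, so malformed rows can be counted and dropped.
parsed = df.withColumn(
    "started_at_ts",
    F.to_timestamp("started_at", "yyyy-MM-dd HH:mm:ss"),
)
invalid = parsed.filter(F.col("started_at_ts").isNull())
print(f"Rows with a malformed timestamp: {invalid.count()}")
clean = parsed.filter(F.col("started_at_ts").isNotNull())
```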

Possible Datasets:

Documentation

Data Engineer Pet Project

Project Name: Data Engineer Pet Project
Category: Pet project
Description: Processing and analysis of two datasets
Tech stack: Hadoop HDFS, PySpark
Office: None
Status: Done
ETA: 30.09.2022

Team

Name: Karim Safiullin
Role: Data Engineer
Type: on bench
Availability: full time
Location: Germany, Ilmenau
Time Zone: CEST (UTC +2)

Project implementation plan

  1. Write a plan to develop the app (plan + architecture) - 2h
  2. Find, download and take a first look at the datasets - 2h
  3. Set up Hadoop/YARN/Spark locally on a laptop - 2h
  4. Upload the data to a DB/HDFS locally if resources are available - 3h
  5. Set up the environment, logging, the foundation of the application, etc. - 4h
  6. Implement the cleansing logic (date format, etc.) - 4h
  7. Join the 2 datasets using PySpark or other technologies - 3h (a join sketch follows this list)
  8. Find anomalies, dependencies, statistics and other insights - 5h
  9. Implement unit tests - 3h
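As an illustration of step 7, here is a rough sketch of the join, assuming PySpark DataFrames with hypothetical column names and HDFS paths (the project's actual schema and layout may differ):

```python
# Rough sketch of joining trips with daily weather; all paths and column
# names below are assumptions for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bike-weather-join").getOrCreate()

trips = spark.read.parquet("hdfs:///user/karim/citibike/staging")    # assumed path
weather = spark.read.parquet("hdfs:///user/karim/weather/staging")   # assumed path

# Join on the calendar date so every trip row carries that day's weather.
trips = trips.withColumn("trip_date", F.to_date("started_at_ts"))
weather = weather.withColumnRenamed("date", "trip_date")
joined = trips.join(weather, on="trip_date", how="left")
joined.write.mode("overwrite").parquet("hdfs:///user/karim/bike_weather")  # assumed output path
```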

Report

Time estimate: 28h
Actual time spent: 35h

Implementation

The data lake storage is organised into several stages.

(datalake_schema diagram)

(with my comments/reflections)
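The HDFS commands in the Installation section reference a landing area and a public area; as a rough, assumed sketch of how such stage paths could be built in code (the intermediate stage name is hypothetical):

```python
# Assumed helper for building HDFS stage paths; only `landing` and `public`
# appear in the commands below, `staging` is a hypothetical intermediate layer.
from pathlib import PurePosixPath

HDFS_ROOT = PurePosixPath("/user/karim")

def stage_path(dataset: str, stage: str) -> str:
    """Build a path such as /user/karim/citibike/landing."""
    return str(HDFS_ROOT / dataset / stage)

print(stage_path("citibike", "landing"))      # raw files as downloaded
print(stage_path("citibike", "staging"))      # hypothetical cleansed layer
print(stage_path("public", "bike_weather"))   # published reports
```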

Installation

  1. Clone the repository:
        https://github.com/BondaiKa/data_engineer_pet_project
  2. Install the packages locally or inside a virtualenv:
        pip install -r requirements.txt
    
  3. Load the datasets locally, setting a date in YYYY-mm-dd format:
    • Citibike
       python3 data-engineer-pet-project-cli.py load-citibike-dataset-locally-cli --date "2022-04-30"
  4. Put the data from local storage into HDFS, for example:
        hdfs dfs -put /Volumes/Samsung_T5/datasets/citibike/202204-citibike-tripdata.csv /user/karim/citibike/landing
  5. After that, you can run the whole pipeline with the command below:
        bash scripts/run-all.sh
  6. To run a particular command, use data-engineer-pet-project-cli.py, for example:
        python3 data-engineer-pet-project-cli.py  load-citibike-dataset-locally-cli --date "2022-04-30"
  7. To get the reports, copy them from HDFS to local storage, for example:
        hadoop fs -copyToLocal /user/karim/public/bike_weather/202204_temperature_dependency_bike_weather.csv .
  8. You can use Jupyter Notebook or Google Colab to analyze the final reports.
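For example, a minimal sketch for exploring the exported report locally, assuming pandas/matplotlib and hypothetical column names (`temperature`, `trip_count`):

```python
# Minimal sketch for a local look at the exported report; the column names
# `temperature` and `trip_count` are assumptions, not the real schema.
import pandas as pd
import matplotlib.pyplot as plt

report = pd.read_csv("202204_temperature_dependency_bike_weather.csv")
report.plot.scatter(x="temperature", y="trip_count")
plt.title("Bike trips vs. temperature, April 2022")
plt.show()
```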

History

  • Initially, I tried to download the Citi Bike dataset directly into HDFS, but I could not fix a problem with the Hadoop settings: Hadoop rejected the requests with errors such as ConnectionError or "Connection reset by peer". So I skipped this step and loaded the data from local storage into HDFS manually using the hdfs dfs command.

Analysis

Dependency between bike trips and temperature (see the temperature dependency chart).
