Data Engineer Pet Project

A pet project that demonstrates my data engineering skills. The goal of the project, as stated in the assignment, is to better understand your skills and experience in the Data and Software Engineering field, and to see how you approach certain questions and challenges.

Agenda

Description
Documentation
Implementation

Description

Requirements/Restrictions ❗

You can use any programming language (our recommendation is Python)
The solution should be flexible, stable and scalable, and should ensure good code quality (we want to bring this into production as a next step)
We don't expect you to spin up any costly machines in the cloud or elsewhere to process Gigabytes or Terabytes of data!
Unless stated otherwise, you can use any tool from one of the major cloud providers or any other system that you like

Assignment:

Ingestion:
- Load one of the datasets from below onto your laptop or any other system (e.g. HDFS, Database, etc.). Keep it simple: only load as much data as you can process with your system of choice (e.g. load a week's or a month's worth of data).
- Download a weather dataset for the same timeframe as the above dataset, so that you can later join the two datasets
Preparation/Data Cleansing:
- How can we ensure quality of the data? What checks could be implemented? Implement a simple method to ensure the date/timestamp is in the right format throughout the datasets.
Processing:
- Join the two datasets, so that we know the weather for each entry in the main dataset.
Analysis (optional):
- Run a small analysis of your choice on the data (e.g. table, chart, map, etc.). Here are some ideas for the analysis part:
  - Impact of weather on the ridership of taxis, bikes, etc.
  - How many customers per day, hour or weekday?
  - ...
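
The date/timestamp check from the Preparation step can be sketched in plain Python. This is only a minimal illustration, not the project's implementation: the format string, the column name `started_at` and the helper names are assumptions, and the real datasets may need a different format.

```python
from datetime import datetime

# Assumed timestamp format; adjust it to the actual dataset.
EXPECTED_FORMAT = "%Y-%m-%d %H:%M:%S"


def is_valid_timestamp(value, fmt=EXPECTED_FORMAT):
    """Return True if `value` parses with `fmt`, otherwise False."""
    try:
        datetime.strptime(value, fmt)
        return True
    except (TypeError, ValueError):
        # TypeError covers missing values (None), ValueError covers bad formats.
        return False


def split_by_timestamp_validity(rows, column, fmt=EXPECTED_FORMAT):
    """Split rows (dicts) into (valid, rejected) based on the timestamp in `column`."""
    valid, rejected = [], []
    for row in rows:
        target = valid if is_valid_timestamp(row.get(column), fmt) else rejected
        target.append(row)
    return valid, rejected


# Hypothetical sample rows, only to make the sketch runnable.
rows = [
    {"ride_id": "a1", "started_at": "2022-04-30 08:15:00"},
    {"ride_id": "a2", "started_at": "30/04/2022 08:15"},
    {"ride_id": "a3", "started_at": None},
]
valid, rejected = split_by_timestamp_validity(rows, "started_at")
print(len(valid), "valid,", len(rejected), "rejected")
```

Keeping the rejected rows instead of dropping them silently makes it possible to report how much data failed the check.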

Possible Datasets:

New York City Bike Share: https://www.citibikenyc.com/system-data (Stream + History)
New York City Taxi Trips: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page (History)
Feel free to pick another interesting dataset if you like

Documentation

Data Engineer Pet Project

| Project Name | Data Engineer Pet Project |
|---|---|
| Category | Pet project |
| Description | Processing and analyzing two datasets |
| Tech stack | Hadoop HDFS, PySpark |
| Office | None |
| Status | Done |
| ETA | 30.09.2022 |

Team

| Name | Role | Type | Availability | Location | Time Zone |
|---|---|---|---|---|---|
| Karim Safiullin | Data Engineer | on bench | full time | Germany, Ilmenau | CEST (UTC +2) |

Project implementation plan

write a plan for developing the app (plan + architecture) - 2h
find, download and inspect the datasets - 2h
set up Hadoop/YARN/Spark locally on a laptop - 2h
upload the data to a DB/HDFS locally if resources are available - 3h
set up the environment, logging, application skeleton, etc. - 4h
implement the cleansing logic (date format, etc.) - 4h
join the 2 datasets using PySpark or another technology - 3h
find anomalies, dependencies, statistics and other things - 5h
implement unit tests - 3h
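
The join step from the plan can be illustrated with a small plain-Python sketch. It assumes hourly weather observations and joins each trip to the observation for the hour in which the trip started. The column names (`started_at`, `observed_at`, `temperature_c`), the timestamp format and the sample rows are hypothetical. In PySpark the same idea would typically be a left join on a derived hour column.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def hour_key(timestamp):
    """Truncate a timestamp string to the start of its hour."""
    parsed = datetime.strptime(timestamp, TS_FORMAT)
    return parsed.replace(minute=0, second=0)


def join_trips_with_weather(trips, weather):
    """Left-join trips to hourly weather; unmatched trips get temperature None."""
    weather_by_hour = {hour_key(w["observed_at"]): w for w in weather}
    joined = []
    for trip in trips:
        match = weather_by_hour.get(hour_key(trip["started_at"]))
        temperature = match["temperature_c"] if match else None
        joined.append({**trip, "temperature_c": temperature})
    return joined


# Hypothetical sample rows, only to make the sketch runnable.
trips = [
    {"ride_id": "a1", "started_at": "2022-04-30 08:15:00"},
    {"ride_id": "a2", "started_at": "2022-04-30 11:59:59"},
]
weather = [
    {"observed_at": "2022-04-30 08:00:00", "temperature_c": 12.5},
    {"observed_at": "2022-04-30 09:00:00", "temperature_c": 14.0},
]
joined = join_trips_with_weather(trips, weather)
for row in joined:
    print(row["ride_id"], row["temperature_c"])
```

A left join keeps every trip even when no weather observation exists for its hour, so missing weather data shows up as `None` instead of silently shrinking the main dataset.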

Report

Time estimate: 28h. Actual time spent: 35h.

Implementation

The data lake storage is organized into several stages.

(with my comments/reflections)

Installation

Clone the repository:

    https://github.com/BondaiKa/data_engineer_pet_project

Install the dependencies, either system-wide or inside a virtualenv:
```
    pip install -r requirements.txt
```
Load the datasets locally, passing the date in YYYY-mm-dd format:
- Citibike
```
   python3 data-engineer-pet-project-cli.py load-citibike-dataset-locally-cli --date "2022-04-30"
```
- Weather: download the file manually from the weather dataset link

Copy the data from the local filesystem to HDFS, for example:

    hdfs dfs -put /Volumes/Samsung_T5/datasets/citibike/202204-citibike-tripdata.csv /user/karim/citibike/landing

After that, you can run the whole pipeline with the command below:
```
    bash scripts/run-all.sh
```

To run a particular command, use data-engineer-pet-project-cli.py, for example:

    python3 data-engineer-pet-project-cli.py load-citibike-dataset-locally-cli --date "2022-04-30"

To retrieve the reports, copy them from HDFS to the local filesystem, for example:

    hadoop fs -copyToLocal /user/karim/public/bike_weather/202204_temperature_dependency_bike_weather.csv .

You can use Jupyter Notebook or Google Colab to analyze the final reports.
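
As a starting point for such an analysis, here is a minimal sketch that averages the trip count per temperature bucket. The column names `temperature` and `trip_count` are assumptions about the report layout, and the inline rows are made up so the example runs on its own; they are not project results.

```python
import csv
import io
from collections import defaultdict

# Made-up rows so the sketch runs on its own; NOT real project results.
# The column names "temperature" and "trip_count" are assumptions.
SAMPLE_REPORT = """temperature,trip_count
4.5,100
8.0,150
12.5,300
13.0,340
"""


def average_trips_per_bucket(csv_file, bucket_size=5):
    """Average trip_count per temperature bucket of `bucket_size` degrees."""
    totals = defaultdict(lambda: [0.0, 0])  # bucket start -> [sum, row count]
    for row in csv.DictReader(csv_file):
        bucket = int(float(row["temperature"]) // bucket_size) * bucket_size
        totals[bucket][0] += float(row["trip_count"])
        totals[bucket][1] += 1
    return {bucket: total / n for bucket, (total, n) in sorted(totals.items())}


averages = average_trips_per_bucket(io.StringIO(SAMPLE_REPORT))
for bucket, avg in averages.items():
    print(f"{bucket}..{bucket + 5} degrees: {avg:.1f} trips on average")
```

To run it against a report copied from HDFS, pass an opened file instead of the inline sample, for example `open("202204_temperature_dependency_bike_weather.csv", newline="")`, after adjusting the column names to the real header.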

History

At first, I tried to download the Citibike dataset directly into HDFS, but I could not resolve a problem with the Hadoop settings: Hadoop rejected the requests with errors such as ConnectionError or Connection reset by peer. So I skipped this step and loaded the data from local storage into HDFS manually using the hdfs dfs command.

Analysis

Dependency between bike trip and temperature
