Skip to content

aya9aladdin/chicago-crashes-DEproject

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

42 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Chicago Crashes Data Engineering project

This is the final project of the Data Engineering Zoomcamp by DataTalks Club

Introduction

Chicago is one of the biggest states in the US that is full of life. Developing a data pipeline for the Crash data will help analytics to understand why accidents happen and how to prevent them, which will help save lives and make the city a safer place. It also guides how we plan our streets and neighborhoods.

Data Source

The traffic crashes dataset is taken from the Chicago data portal, which contains various open-access datasets related to the city. In my project, I used two data sets the Traffic Crashes - Crashes dataset, which shows information about each traffic crash on city streets within the City of Chicago and the Traffic Crashes - People dataset, which shows information about people involved in a crash and if any injuries were sustained.

Data API

The Socrata Open Data API (SODA) provides programmatic access to this dataset, including the ability to filter, query, and aggregate data.

Technologies

The technologies I used in developing the pipeline are:

Cloud: GCP
Data Lake: GCS buckets
Data Warehouse: BigQuery
Infrastructure as Code (IaC): Terraform
Workflow Orchestration: Prefect cloud
Flow containerization: Docker
Data Transformation: DBT cloud
Data Visualization: Looker Studio

Project flow

image
  1. The flow module first ingests the bulk data with two flows one for each dataset (crashes and people data), those two flows are scheduled only once by cron date schedule expressions.
  2. As the datasets are updated daily, another two flows are scheduled to execute daily for extracting newly updated data for each day
  3. After extracting data, data are stored in Google Data Storage (buckets) as raw Paquet files and transferred to Bigquery Data warehouse
  4. At this point transformation takes place through DBT. stg models are incremental models to synchronize the data flow pipeline daily updates, this model simply cleans the raw data found in the warehouse.
  5. Another model is the core model, which is a group of analytical models that generate a view from the stg model for selecting some interesting insights from the data for future visualization
image
  1. Finally, I fed the Google Looker Studio with these views and made a report from them in different charts Crashes Report

Some insights from the report

  1. Injury severity ratio for each crash prime contribution cause: this heatmap illustrates the percentage of severe injury from total injuries for each reported crash cause. From this report, we can see that bicycles advancing on the red light and the physical condition of the driver are the two most common crash types that result in severe injuries
image
  1. Ratio of each sex as a driver involved in the crash: from this pie chart, we can see that more than half of the crashes involve male drivers
image
  1. Sex ratio as a driver for each crash cause: In this chart, we can see that the females' highest ratio in crash types are distraction from inside the vehicle and passing stopped school buses. On the other side, the male ratio is the highest in the obstructive crosswalk and under alcohol/drug influence crash types
image
  1. Weather influence on crashes: in the below chart, for each weather condition, we can see the percentage of crashes caused by the weather from the total days having that weather condition. We can conclude that severe cross-wind gates and blowing snow are the two most common weather conditions that can contribute to traffic crashes
image
  1. Relation between time of the day and month and its effect on crash frequency: from this chart, we can't see a significant effect of the daytime in different seasons on crash frequency except for early-morning crashes their frequency slightly decreases during May, June, and July. As a general overview crashes most often happen in the evening and least likely at midnight
image

About

Final project for data engineering zoomcamp course

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published