
LauriESB/pyspark-list-etl

PySpark ETL System - AWS

Tools

This application uses another application I developed as its CRUD data source, available here.

Developed in Python, its main goal is to read data from a .CSV file belonging to an authenticated user and load it in batches into an existing AWS DynamoDB table using Boto3. It also retrieves data from that table and generates an Excel spreadsheet of tasks abandoned within the last six months, based on specific filtering criteria.

🎮 Technologies used

  • Python
  • PySpark
  • Pandas
  • Boto3
  • XlsxWriter
  • Dotenv
  • Black

🕹️ Workflow

Uploading spreadsheet data to AWS DynamoDB

  • Reads and extracts data from a .CSV file.
  • Transforms the data to match the original database schema.
  • Performs batch writes to AWS DynamoDB using Boto3.
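The upload steps above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the column names (`id`, `title`, `category`, `created_at`) and the table name `tasks` are assumptions, and AWS credentials are expected to come from the environment (e.g. via Dotenv).

```python
import csv
import io


def transform_rows(csv_text):
    """Reshape raw CSV rows to match the target table schema.

    Column names here are illustrative assumptions, not the
    repository's actual schema.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    items = []
    for row in reader:
        items.append(
            {
                "id": row["id"],
                "title": row["title"].strip(),
                "category": row["category"].lower(),
                "created_at": row["created_at"],
            }
        )
    return items


def batch_upload(items, table_name="tasks"):
    """Write items to DynamoDB in batches.

    Boto3's batch_writer transparently chunks the items into
    25-item BatchWriteItem calls and retries unprocessed items.
    """
    import boto3  # assumes AWS credentials are configured

    table = boto3.resource("dynamodb").Table(table_name)
    with table.batch_writer() as writer:
        for item in items:
            writer.put_item(Item=item)
```

The `batch_writer` context manager is what makes the batch step simple: it buffers `put_item` calls and flushes them in DynamoDB-sized batches, so the caller never handles the 25-item limit directly.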

Filtering abandoned tasks and creating the Excel spreadsheet

  • Reads and extracts data from the AWS DynamoDB table.
  • Transforms the data:
    • Creates a DataFrame covering the last six months based on the current system date.
    • Performs a full outer join with the original DataFrame to return only tasks added within the last six months.
    • Filters general to-do tasks that are older than 15 days.
    • Filters shopping tasks that are older than 30 days.
  • Outputs the filtered data into an Excel spreadsheet containing the abandoned tasks.
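The filtering criteria above can be sketched with pandas (the repository performs the join in PySpark; the column names, the "todo"/"shopping" category labels, and the ~183-day six-month cutoff are assumptions for illustration):

```python
from datetime import datetime

import pandas as pd


def abandoned_tasks(df, today=None):
    """Return tasks considered abandoned:

    - only tasks created within the last six months (~183 days),
    - general to-do tasks older than 15 days,
    - shopping tasks older than 30 days.

    Column names ('category', 'created_at') are illustrative.
    """
    today = today or datetime.now()
    created = pd.to_datetime(df["created_at"])
    age_days = (today - created).dt.days

    recent = age_days <= 183  # within the last six months
    todo_old = (df["category"] == "todo") & (age_days > 15)
    shopping_old = (df["category"] == "shopping") & (age_days > 30)
    return df[recent & (todo_old | shopping_old)]


def write_report(df, path="abandoned_tasks.xlsx"):
    # XlsxWriter backs the Excel output, as in the stack above
    df.to_excel(path, index=False, engine="xlsxwriter")
```

Keeping the age thresholds as per-category boolean masks makes it easy to add further task categories with their own abandonment windows later.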

About

ETL System with PySpark & AWS DynamoDB for Task Management
