This project showcases an ETL process for big data. Raw Amazon video game review data was collected from a public site, loaded into an AWS database, and queried with PySpark and SQL to find out whether Amazon Vine reviews influenced customer feedback.

Niraj-Khatri/Video_Game_Reviews

Pyspark-AWS Project

The goal of this project was to extract Amazon product review data, clean it, and load it into a Postgres database hosted on AWS RDS. I then analyzed the data to determine whether a certified Amazon Vine reviewer provided more helpful reviews than a non-Vine reviewer.

ETL

I extracted Amazon video game review data from the following site: https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Video_Games_v1_00.tsv.gz
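As a rough illustration of the extract step, the sketch below parses a gzip-compressed, tab-separated file into rows. It uses only the Python standard library rather than PySpark so that it runs without a Spark installation, and it reads a tiny in-memory sample instead of downloading the real file. The function name `read_reviews` is hypothetical, and the sample columns are an assumption about the file layout.

```python
import csv
import gzip
import io


def read_reviews(gz_bytes):
    """Parse gzip-compressed TSV bytes into a list of dicts keyed by the header row."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: free-text review fields may contain stray quote characters
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


# Made-up sample standing in for the downloaded file (assumed column names)
sample_tsv = (
    "review_id\tstar_rating\thelpful_votes\ttotal_votes\tvine\n"
    "R1\t5\t18\t20\tY\n"
    "R2\t3\t2\t25\tN\n"
)
rows = read_reviews(gzip.compress(sample_tsv.encode("utf-8")))
print(rows[0]["review_id"], rows[0]["vine"])
```

Every value comes back as a string at this stage, so numeric columns still need to be cast during cleaning.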


I cleaned the data and created four tables for later analysis: customers, products, reviews, and vines.
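One possible shape for this step is sketched below in plain Python. The column choices for each table, the rule of dropping rows with missing values, and the name `build_tables` are all assumptions made for illustration; the text above only names the four tables.

```python
from collections import Counter


def build_tables(rows):
    """Split raw review rows into customers, products, reviews, and vines."""
    # Assumed cleaning rule: drop any row with a missing or empty field
    rows = [r for r in rows if all(v not in (None, "") for v in r.values())]

    customers = Counter(r["customer_id"] for r in rows)  # customer_id -> review count
    products = {r["product_id"]: r["product_title"] for r in rows}  # de-duplicated by id
    reviews = [
        (r["review_id"], r["customer_id"], r["product_id"], r["review_date"])
        for r in rows
    ]
    vines = [
        (r["review_id"], int(r["star_rating"]), int(r["helpful_votes"]),
         int(r["total_votes"]), r["vine"])
        for r in rows
    ]
    return customers, products, reviews, vines


def make_row(review_id, customer_id, product_id, title, date, stars, helpful, total, vine):
    """Helper for building made-up sample rows."""
    return {
        "review_id": review_id, "customer_id": customer_id, "product_id": product_id,
        "product_title": title, "review_date": date, "star_rating": stars,
        "helpful_votes": helpful, "total_votes": total, "vine": vine,
    }


sample_rows = [
    make_row("R1", "C1", "P1", "Game A", "2015-08-31", "5", "18", "20", "Y"),
    make_row("R2", "C1", "P2", "Game B", "2015-08-30", "3", "2", "25", "N"),
    make_row("R3", "C2", "P1", "Game A", "2015-08-29", "4", "0", "1", "N"),
    make_row("R4", "C3", "P3", "", "2015-08-28", "1", "0", "0", "N"),  # dropped: empty title
]
customers, products, reviews, vines = build_tables(sample_rows)
print(len(reviews), dict(customers))
```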

I created an AWS RDS instance and used a SQL script to create the four tables in Postgres.

With PySpark, I loaded the data tables to Postgres.
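The create-then-load pattern can be sketched locally as follows. This is not the project's PySpark-to-Postgres code: it swaps in Python's built-in `sqlite3` with an in-memory database so the example runs without an RDS instance. The column layout of the vines table and the name `load_vines` are assumptions.

```python
import sqlite3

# Assumed layout for the vines table; the real SQL script may differ
CREATE_VINES = """
CREATE TABLE IF NOT EXISTS vines (
    review_id     TEXT PRIMARY KEY,
    star_rating   INTEGER,
    helpful_votes INTEGER,
    total_votes   INTEGER,
    vine          TEXT
)
"""


def load_vines(conn, vine_rows):
    """Create the vines table if needed, insert the rows, and return the row count."""
    conn.execute(CREATE_VINES)
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany("INSERT INTO vines VALUES (?, ?, ?, ?, ?)", vine_rows)
    return conn.execute("SELECT COUNT(*) FROM vines").fetchone()[0]


conn = sqlite3.connect(":memory:")  # local stand-in for the Postgres instance
loaded = load_vines(conn, [("R1", 5, 18, 20, "Y"), ("R2", 3, 2, 25, "N")])
print(loaded)
```

Creating the table before loading keeps the column types explicit instead of leaving them to be inferred at write time.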

Data Analysis

I wanted to analyze the Amazon video game data to determine if Amazon vine reviewers provided more helpful reviews.


First, using the vines table, I filtered out reviews with fewer than 20 total votes and reviews where less than 50% of the votes were helpful.
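The two thresholds can be expressed as a single predicate. The sketch below is plain Python over made-up rows rather than the PySpark query used in the project, and the names `is_helpful`, `min_votes`, and `min_ratio` are hypothetical.

```python
def is_helpful(row, min_votes=20, min_ratio=0.5):
    """Keep a review only if it has enough votes and enough of them are helpful."""
    total = row["total_votes"]
    # total > 0 guards against division by zero if min_votes is lowered to 0
    return total > 0 and total >= min_votes and row["helpful_votes"] / total >= min_ratio


# Made-up rows, not project data
vine_rows = [
    {"review_id": "R1", "helpful_votes": 18, "total_votes": 20},  # kept: 90% helpful
    {"review_id": "R2", "helpful_votes": 2, "total_votes": 25},   # dropped: 8% helpful
    {"review_id": "R3", "helpful_votes": 15, "total_votes": 19},  # dropped: under 20 votes
    {"review_id": "R4", "helpful_votes": 10, "total_votes": 20},  # kept: exactly 50%
]
filtered = [r for r in vine_rows if is_helpful(r)]
print([r["review_id"] for r in filtered])
```

Applying the vote-count check first means the ratio is only computed for reviews with a meaningful number of votes.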

Next, I calculated the number of vine reviews and non-vine reviews in the filtered data set.
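Counting the two groups is a simple group-by on the vine flag. A minimal sketch in plain Python, assuming the flag is stored as "Y" or "N" and using made-up rows (the name `count_by_vine` is hypothetical):

```python
from collections import Counter


def count_by_vine(rows):
    """Return the number of reviews per vine flag."""
    return Counter(r["vine"] for r in rows)


# Made-up filtered rows, not project data
filtered_rows = [
    {"review_id": "R1", "vine": "Y"},
    {"review_id": "R4", "vine": "N"},
    {"review_id": "R5", "vine": "N"},
]
vine_counts = count_by_vine(filtered_rows)
print(vine_counts["Y"], vine_counts["N"])
```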

Finally, I wanted to look at top-rated products (5 stars). I filtered the data set down to 5-star reviews only and calculated the percentage of 5-star reviews among Vine and non-Vine reviews.
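The percentage calculation can be sketched as follows: for each group, divide the number of 5-star reviews by the group's total and multiply by 100. The rows are invented for the example and do not reproduce the project's results; `five_star_share` is a hypothetical name.

```python
from collections import Counter


def five_star_share(rows):
    """Return the percentage of 5-star reviews for each vine flag."""
    totals, five_star = Counter(), Counter()
    for r in rows:
        totals[r["vine"]] += 1
        if r["star_rating"] == 5:
            five_star[r["vine"]] += 1
    return {flag: 100.0 * five_star[flag] / totals[flag] for flag in totals}


# Invented rows, not project data
helpful_rows = [
    {"vine": "Y", "star_rating": 5},
    {"vine": "Y", "star_rating": 4},
    {"vine": "N", "star_rating": 5},
    {"vine": "N", "star_rating": 2},
    {"vine": "N", "star_rating": 3},
    {"vine": "N", "star_rating": 1},
]
shares = five_star_share(helpful_rows)
print(shares)
```

Computing the share within each group, rather than over all reviews, keeps the comparison fair when one group is much larger than the other.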

Conclusion: Among the Vine reviews that were found helpful, about half gave the product 5 stars. This is 10% more than among non-Vine reviews. This may suggest that 5-star reviews are found more helpful overall, since they give the reader confidence to buy the product.