Muzznah/Amazon-Reviews-ETL

Analysis of Amazon Reviews.

Background

The purpose of this project was to determine whether reviews from Amazon's Vine program (Vine; V) are reliable. To do so, Amazon's reviews in the baby category were sorted into paid (V) and unpaid (!V/NV) reviews. The data was cleaned and transformed using PySpark (see the ETL script for details) and then loaded into a PostgreSQL database. Finally, the analysis was done through SQL queries and PySpark, in pgAdmin and Google Colaboratory (see SQL-Queries and Analysis-Challenge).

Analysis

The raw data comprised 1,752,932 reviews (row count), with the oldest recorded review dated 1999-07-13 and the latest dated 2015-08-31. The 'product_id', 'customer_id' and 'review_id' columns were checked for null and duplicate values. The only duplication found was in the product_id column; the duplicates were filtered out before loading the data into the product table in RDS.
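The null and duplicate checks described above can be sketched in plain Python. This is a minimal illustration on made-up rows, not the project's ETL script; the helper names `count_nulls`, `count_duplicates` and `unique_by` are hypothetical. In PySpark, the deduplication step would roughly correspond to `DataFrame.dropDuplicates(["product_id"])`.

```python
from collections import Counter

# Hypothetical sample rows; only the three key columns named above are shown.
rows = [
    {"review_id": "R1", "customer_id": "C1", "product_id": "P1"},
    {"review_id": "R2", "customer_id": "C2", "product_id": "P1"},
    {"review_id": "R3", "customer_id": "C3", "product_id": "P2"},
]

def count_nulls(rows, column):
    """Number of rows whose value in `column` is missing (None)."""
    return sum(1 for r in rows if r[column] is None)

def count_duplicates(rows, column):
    """Number of rows that repeat an earlier non-null value in `column`."""
    counts = Counter(r[column] for r in rows if r[column] is not None)
    return sum(n - 1 for n in counts.values())

def unique_by(rows, column):
    """Keep the first row for each distinct value in `column`."""
    seen = set()
    kept = []
    for r in rows:
        if r[column] not in seen:
            seen.add(r[column])
            kept.append(r)
    return kept

for col in ("review_id", "customer_id", "product_id"):
    print(col, "nulls:", count_nulls(rows, col),
          "duplicates:", count_duplicates(rows, col))

# One row per product, as needed for a product table.
products = unique_by(rows, "product_id")
```

Here only `product_id` reports a duplicate, mirroring the situation described above, and `products` holds one row per product.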

When the data was filtered for verified purchases only, the number of V reviews dropped from 12,100 to 19, and the number of !V reviews dropped from 1,740,832 to 1,392,109. However, the average star rating across V and !V reviews, for both verified purchases and otherwise, comes out to ~4 in every case. This suggests that there is little evidence of bias based on these metrics.
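As a quick arithmetic check on the counts quoted above, the shares can be worked out directly. The variable names and the `pct` helper are illustrative only; the numbers are the ones reported in this README.

```python
# Counts reported in the analysis above.
total_reviews = 1_752_932
vine_all, vine_verified = 12_100, 19
non_vine_all, non_vine_verified = 1_740_832, 1_392_109

# The two groups should account for every row in the raw data.
assert vine_all + non_vine_all == total_reviews

def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to 2 decimals."""
    return round(100 * part / whole, 2)

vine_share = pct(vine_all, total_reviews)                    # Vine share of all reviews
vine_verified_share = pct(vine_verified, vine_all)           # verified share within V
non_vine_verified_share = pct(non_vine_verified, non_vine_all)  # verified share within !V

print(vine_share, vine_verified_share, non_vine_verified_share)
```

Under these figures, Vine reviews make up about 0.69% of all reviews, about 0.16% of V reviews are tied to a verified purchase, and about 79.97% of !V reviews are.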

V Review Summary Table:

NV Review Summary Table:

Star Rating Table:

The above three tables highlight that:
1- The average star rating differed only slightly between V and !V reviews, with both coming out to approximately 4.
2- The average helpful-vote count differed significantly, with V reviews averaging ~3 compared to ~1 for !V reviews.
3- V reviewers tend to be more conservative when giving a negative rating: only ~5% gave a low rating of 2 stars, and none gave a rating below that.
4- The majority of V reviewers (~79%) gave a high rating of 4-5 stars; however, this matched the share of !V reviewers (~79%) who gave 4-5 stars.
5- Looking at the 5-star rating alone, only ~42% of V reviewers chose it, as opposed to ~63% of !V reviewers.
6- Vine reviews are ~0.7% of total reviews; the majority of reviews (~99%) are from regular customers.
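The per-group figures listed above (average star rating, average helpful votes, share of 5-star ratings) come from grouping reviews by Vine status. A minimal sketch of that aggregation follows, using made-up rows; the column names `vine`, `star_rating` and `helpful_votes` and the `summarize` helper are assumptions for illustration, and the project itself did this with SQL queries and PySpark.

```python
from collections import defaultdict

# Hypothetical sample rows; the values are not taken from the real dataset.
reviews = [
    {"vine": "Y", "star_rating": 5, "helpful_votes": 4},
    {"vine": "Y", "star_rating": 4, "helpful_votes": 2},
    {"vine": "N", "star_rating": 5, "helpful_votes": 1},
    {"vine": "N", "star_rating": 5, "helpful_votes": 0},
    {"vine": "N", "star_rating": 2, "helpful_votes": 2},
]

def summarize(reviews):
    """Group reviews by Vine flag and compute simple per-group statistics."""
    groups = defaultdict(list)
    for r in reviews:
        groups[r["vine"]].append(r)

    summary = {}
    for flag, rows in groups.items():
        n = len(rows)
        five_star = sum(1 for r in rows if r["star_rating"] == 5)
        summary[flag] = {
            "count": n,
            "avg_star_rating": sum(r["star_rating"] for r in rows) / n,
            "avg_helpful_votes": sum(r["helpful_votes"] for r in rows) / n,
            "pct_five_star": 100 * five_star / n,
        }
    return summary

summary = summarize(reviews)
for flag, stats in summary.items():
    print(flag, stats)
```

The same grouping, applied to the full review table, would yield the kind of V versus !V comparison summarized in the list above.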

Conclusion

  • It seems that most of the Vine reviews are written by people who got the product for free (only 19 of them link to a verified purchase), which may have led them to be a little conservative in choosing extreme ratings. However, the average star rating across V and !V reviews, for both verified purchases and otherwise, comes out to ~4 in every case. This suggests that there is little evidence of bias based on the above metrics.
  • Also, since Vine reviews receive a higher average number of helpful votes, they seem to offer more detail on a product's merits and demerits.

Resources

Data

Software

  • Amazon RDS, Google Colaboratory, Spark 3.0.0, Python 3.7.7, PostgreSQL 11.8, pgAdmin 4.14

About

Cleaning and Transforming big data for ETL. Tools used: PySpark, PostgreSQL database and Google Colaboratory
