scraper

Web scraper for Amazon product pages, designed to avoid CAPTCHAs.

Author: Julius Remigio

Required Libraries

lxml
scrapy

Install required libraries:

pip install -r requirements.txt

Middlewares used:

See scrapy documentation:

Proxies

The proxies in proxy.txt were collected from free public proxies that were available at the time of scraping. Update the file regularly with working proxies to increase the success rate.

List of public proxies: http://proxylist.hidemyass.com/
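As a rough illustration of how a proxy list like this can be used, the sketch below loads proxy.txt and assigns a random proxy to each request. It assumes the file holds one proxy per line (host:port or a full URL); the class name RandomProxyMiddleware and the PROXY_FILE setting are hypothetical and may not match the project's own middleware. It relies only on Scrapy's documented behavior of routing a request through the address in request.meta["proxy"].

```python
import random


def load_proxies(path="proxy.txt"):
    """Read one proxy per line, skipping blank lines and '#' comments."""
    with open(path, encoding="utf-8") as handle:
        lines = (line.strip() for line in handle)
        return [line for line in lines if line and not line.startswith("#")]


class RandomProxyMiddleware:
    """Hypothetical downloader middleware: picks a random proxy per request."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_FILE is an assumed custom setting name, not a built-in one.
        return cls(load_proxies(crawler.settings.get("PROXY_FILE", "proxy.txt")))

    def process_request(self, request, spider):
        if self.proxies:
            proxy = random.choice(self.proxies)
            # Add a scheme when the list only stores host:port entries.
            if "://" not in proxy:
                proxy = "http://" + proxy
            # Scrapy's built-in HttpProxyMiddleware honors this meta key.
            request.meta["proxy"] = proxy
        return None  # let Scrapy continue processing the request
```

Picking a fresh proxy per request spreads traffic across the list, so a single dead or blocked proxy costs only one retry rather than the whole session.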

Starting a scraping session:

Sessions are started using the Scrapy CLI. Custom parameters are passed with the -a option. Custom parameters:

file - CSV file with the column header 'asin' (list of Amazon products to scrape)
html - folder in which to store the HTML of scraped products

Example:

scrapy crawl product -a html=./../../html -a file=./../reviews_Women.csv.gz -o ./../reviews_Women.jl --logfile ./../reviews_Women.csv.log
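The sketch below shows one way the two parameters could be consumed: reading the 'asin' column from a plain or gzip-compressed CSV, and mapping each ASIN to a file inside the html folder. The function names and the one-file-per-ASIN naming scheme are assumptions for illustration; the spider's actual handling may differ.

```python
import csv
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, mode="rt", newline="", encoding="utf-8")
    return open(path, mode="r", newline="", encoding="utf-8")


def read_asins(path):
    """Yield each non-empty value from the 'asin' column of the CSV file."""
    with open_text(path) as handle:
        for row in csv.DictReader(handle):
            asin = (row.get("asin") or "").strip()
            if asin:
                yield asin


def html_path(html_dir, asin):
    """Assumed naming scheme: one <asin>.html file per product."""
    return Path(html_dir) / f"{asin}.html"
```

For example, `list(read_asins("./../reviews_Women.csv.gz"))` would return the ASINs to crawl, and `html_path("./../../html", asin)` gives the location where that product's page could be written.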

Settings.py

Used to change scraping behavior, such as retries and middleware configuration.
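A settings excerpt along these lines illustrates the kind of options involved. The setting names are standard Scrapy settings, but the values are illustrative rather than the project's actual configuration, and the amzn_products.middlewares.RandomProxyMiddleware path is a hypothetical custom middleware.

```python
# settings.py (excerpt) - illustrative values only

RETRY_ENABLED = True
RETRY_TIMES = 5  # retries per request, in addition to the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
DOWNLOAD_DELAY = 1.0  # seconds to wait between requests to the same site

DOWNLOADER_MIDDLEWARES = {
    # Hypothetical project middleware; an order below 750 means it runs
    # before the built-in HttpProxyMiddleware handles the request.
    "amzn_products.middlewares.RandomProxyMiddleware": 740,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
}
```

Raising RETRY_TIMES tends to help when many proxies in the list are unreliable, at the cost of longer sessions.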

Spiders

The spiders directory contains all spider classes. Currently there is only a products spider for scraping Amazon product pages.

Notebooks

Notebooks are used for transforming the data and preparing it for model consumption.
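As a minimal sketch of that kind of preparation step, the code below loads a JSON Lines (.jl) file, the format Scrapy writes when the -o output path ends in .jl, and drops records with missing fields. The field names in the usage note are assumptions; the notebooks' real columns and cleanup rules are not shown here.

```python
import json


def load_json_lines(path):
    """Return one dict per non-empty line of a JSON Lines (.jl) file."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def drop_incomplete(records, required):
    """Keep only records where every required field is present and non-empty."""
    return [
        record
        for record in records
        if all(record.get(key) not in (None, "") for key in required)
    ]
```

A call such as `drop_incomplete(load_json_lines("reviews_Women.jl"), ["asin", "title"])` would keep only products that have both fields, assuming the scraped items use those names.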

