- What is Scrapy?
- Create Scrapy Project
- Example Spider
- Extracting data using Scrapy Shell
- Storing Scraped data
- Following Links
- Advanced Scrapy
Scrapy is an open-source Python framework designed for web scraping and web crawling. It allows developers to efficiently extract structured data from websites, process it, and save it in formats like JSON, CSV, or databases. Scrapy provides tools to handle requests, follow links, and manage crawling rules, making it powerful for data mining, automated testing, and information gathering from the web.
With Scrapy you write Spiders to retrieve HTML pages from websites and scrape the data you want, clean and validate it, and store it in the data format you want.
Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you'd like to store your code and run:
scrapy startproject <project_name> <project directory name>
e.g. `scrapy startproject quotescraper`
Scrapy creates this folder structure:
quotescraper/                <-- run `scrapy crawl` from project root directory
├── scrapy.cfg
└── quotescraper/            <-- save files in a dedicated `output_files/` directory inside this
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/             <-- create your spiders here (.py files)
        └── __init__.py
- `quotescraper/` (outer folder): the main project folder; you'll work inside it.
- `scrapy.cfg`: tells Scrapy how to run your project. You don't touch this much. Scrapy uses it to know:
  - The name of your settings module
  - Which project to run
- `quotescraper/` (inner folder): the real Python package for your project.
- `items.py`: defines the structure of the data you want to scrape. Here you define the fields you want to scrape (e.g. quote, author, tag).
- `middlewares.py`: handles how requests and responses are processed (advanced stuff, like adding custom headers, handling retries, rotating user agents).
- `pipelines.py`: processes scraped data after it's collected.
- `settings.py`: configures your spider's behaviour, e.g. how fast to crawl, which pipelines to run, the user-agent, etc.
- `spiders/`: this is where your actual spider files go. You create files here. Each spider defines (a minimal sketch follows this list):
  - The website to crawl
  - How to crawl it
  - What data to extract
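As a quick orientation, here is a minimal spider sketch showing how those three pieces map onto code (the spider name and the yielded field below are placeholders, not part of the project built later):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"                               # how you refer to the spider: scrapy crawl example
    start_urls = ["https://quotes.toscrape.com/"]  # the website to crawl

    def parse(self, response):                     # how to crawl it and what data to extract
        yield {"title": response.css("title::text").get()}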
The `scrapy --help` command shows the important commands we can use:
scrapy --help
Scrapy 2.13.2 - no active project
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
- We will scrape the website https://quotes.toscrape.com/
- Create a Python file `quotes_spider.py` inside the `spiders` folder.
- If we use `scrapy genspider <spider_name> <domain>`, it will quickly generate a template spider file (a sketch of that template follows this list).
- Put the spider code shown below (after the template sketch) inside the file.
- Run this file using the `runspider` command: `scrapy runspider quotes_spider.py -o quotes.jsonl`, or `scrapy crawl quotes -o quotes.jsonl` (run `crawl` from the project root directory).
- When this finishes you will have a `quotes.jsonl` file in JSON Lines format, containing the text and author.
- To manage output files separately (save output files outside the `spiders` folder): `scrapy crawl quotes -o quotescraper/output_files/jsonl_files/quotes.jsonl`
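For reference, the template that `scrapy genspider quotes quotes.toscrape.com` produces looks roughly like this (the exact contents vary slightly between Scrapy versions):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        pass

You then replace its contents with the full example spider below: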
import scrapy
class QuotesSpider(scrapy.Spider):
name = "quotes"
allowed_domains = ["quotes.toscrape.com"]
start_urls = [
"https://quotes.toscrape.com/tag/humor/",
]
def parse(self, response):
for quote in response.css("div.quote"):
yield {
"author": quote.xpath("span/small/text()").get(),
"text": quote.css("span.text::text").get(),
}
next_page = response.css('li.next a::attr("href")').get()
if next_page is not None:
yield response.follow(next_page, self.parse)
- When you ran the command `scrapy runspider quotes_spider.py`, Scrapy looked for a Spider definition inside it and ran it through its crawler engine.
- `name = "quotes"` defines the spider name. This can be used to run the spider via the command line: `scrapy crawl quotes`. The name must be unique within a project.
- `allowed_domains` is an optional setting in your spider that restricts which domains the spider is allowed to crawl. Use it to avoid accidentally crawling unrelated or external websites.
- The crawl started by making requests to the URLs defined in the `start_urls` attribute and called the default callback method `parse`, passing the response object as an argument.
- `parse()` is Scrapy's default callback method, which is called for requests without an explicitly assigned callback.
- In the `parse` callback, we loop through the quote elements using a CSS selector and yield a Python dict with the extracted quote text and author.
- We then look for a link to the next page and schedule another request, using the same `parse` method as the callback.
- In `quote.css("span.text::text").get()` we have added `::text` to the CSS query; this means we want to select only the text elements directly inside `span.text`.
Folder structure after saving output in multiple formats:
quotescraper                 <-- run `scrapy crawl` from this location
.
|-- quotescraper
|   |-- __init__.py
|   |-- items.py
|   |-- middlewares.py
|   |-- output_files         <-- save files in this location
|   |   |-- csv_files
|   |   |   `-- allNewQuotes.csv
|   |   `-- jsonl_files
|   |       |-- allNewQuotes.jsonl
|   |       `-- quotes.jsonl
|   |-- pipelines.py
|   |-- settings.py
|   `-- spiders
|       |-- __init__.py
|       |-- quotes_html.py
|       `-- quotes_json.py
`-- scrapy.cfg
- We can extract data with Scrapy using the Scrapy Shell.
- In the terminal run: `scrapy shell "https://quotes.toscrape.com/tag/humor/"`
# In the bash terminal run: scrapy shell "https://quotes.toscrape.com/tag/humor/"
>>> response.css("title")
[<Selector query='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
>>> response.css("title").getall()
['<title>Quotes to Scrape</title>']
>>> response.css("title").get()
'<title>Quotes to Scrape</title>'
>>> response.css("title::text").getall()
['Quotes to Scrape']
>>> response.css("title::text").get()
'Quotes to Scrape'
>>> view(response)
True
>>> response.xpath("//title")
[<Selector query='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath("//title/text()").get()
'Quotes to Scrape'
# extracting quotes, authors and tags
>>> quote = response.css("div.quote")[0]
>>> text = quote.css("span.text::text").get()
>>> text
'“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Jane Austen'
>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['aliteracy', 'books', 'classic', 'humor']
# we can iterate over all the quote elements and put them together into a Python dictionary
>>> for quote in response.css("div.quote"):
... text = quote.css("span.text::text").get()
... author = quote.css("small.author::text").get()
... tags = quote.css("div.tags a.tag::text").getall()
... print(dict(text=text, author=author, tags=tags))
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'tags': ['humor', 'obvious', 'simile']}
....
- `response`: the HTML response object Scrapy gets from a URL.
- `.css("title")`: a CSS selector that finds the `<title>` tag in the HTML. The result is a SelectorList (a list of selector objects).
- `response.css("title").getall()` returns a list of strings, each being the full HTML string of the matched tag.
- `response.css("title").get()` returns just the first match as a string.
- `response.css("title::text").getall()` returns a list of the text contents inside the `title` tag.
- `response.css("title::text").get()` returns the first text content inside the `title` tag.
- `view(response)` opens the response HTML in your default web browser.
Note:
- Besides CSS, Scrapy selectors also support using XPath expressions:
response.xpath("//title")
- XPath expressions are very powerful, and are the foundation of Scrapy Selectors. In fact, CSS selectors are converted to XPath under-the-hood.
>>> response.css("div.quote")
>>> response.xpath("//div[@class='quote']")
Both return the same `<div class="quote">` blocks.
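For example, these two shell queries should return the same first quote text (a quick sanity check, assuming the humor-tag page loaded above):

>>> response.css("div.quote span.text::text").get()
'“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”'
>>> response.xpath("//div[@class='quote']/span[@class='text']/text()").get()
'“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”'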
- The simplest way to store the scraped data is with the command: `scrapy crawl quotes -o quotes.jsonl`
- `scrapy crawl quotes` runs the spider named `quotes`. This is defined in your spider class.
- `-o quotes.jsonl` tells Scrapy to save the scraped data into a file named `quotes.jsonl` in JSON Lines format.
- JSON Lines is like JSON, but each line is a separate JSON object.
- The `-O` command-line switch overwrites any existing file; use `-o` instead to append new content to an existing file. However, appending to a JSON file (quotes.json) makes the file contents invalid JSON. When appending to a file, consider using a different serialization format, such as JSON Lines (quotes.jsonl).
- With plain JSON, data is held in memory in an array and new data is appended to it, e.g.
  [ {"name": "Color TV", "price": "1200"}, {"name": "DVD player", "price": "200"} ]
  As a result, it is advised to use the JSON Lines format if you want to save data in a JSON-like form; each record sits on its own line (a small reading sketch follows this list):
  {"name": "Color TV", "price": "1200"}
  {"name": "DVD player", "price": "200"}
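To see why JSON Lines plays nicely with appending, here is a small sketch (plain Python, no Scrapy required, assuming the `quotes.jsonl` file produced above) that reads the file one record at a time:

import json

# Each line of a .jsonl file is an independent JSON object,
# so appending new lines never invalidates the file.
with open("quotes.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record["author"], "-", record["text"])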
- This means telling your spider to go to the next page (or any other link) and continue scraping from there.
- Let's say that, instead of just scraping the first two pages of https://quotes.toscrape.com, you want quotes from all the pages on the website.
- First thing to do is extract the link to the page we want to follow.
# In bash terminal type: scrapy shell https://quotes.toscrape.com
>>> response.css("li.next a").get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
>>> response.css("li.next a::attr(href)").get()
'/page/2/'
Note: `::attr(attribute_name)` can be used to extract any attribute of an HTML tag using CSS selectors in Scrapy.
import scrapy
class QuotesSpider(scrapy.Spider):
name = "quotes"
start_urls = [
"https://quotes.toscrape.com/page/1/",
]
def parse(self, response):
for quote in response.css("div.quote"):
yield {
"text": quote.css("span.text::text").get(),
"author": quote.css("span small::text").get(),
"tags": quote.css("div.tags a.tag::text").getall(),
}
next_page = response.css("li.next a::attr(href)").get()
if next_page is not None:
yield response.follow(next_page, callback=self.parse)
You can also pass a selector to response.follow
instead of a string; this selector should extract necessary attributes:
for href in response.css("ul.pager a::attr(href)"):
yield response.follow(href, callback=self.parse)
For <a>
elements there is a shortcut: response.follow
uses their href attribute automatically. So the code can be shortened further:
for a in response.css("ul.pager a"):
yield response.follow(a, callback=self.parse)
To create multiple requests from an iterable, you can use response.follow_all
instead:
yield from response.follow_all(css="ul.pager a", callback=self.parse)
Back to Table of Content ⬆️
- Scrapy Items are a predefined data structure that holds your data.
- Instead of yielding your scraped data in the form of a dictionary for example, you define an Item schema beforehand in your items.py file and use this schema when scraping data.
- This enables you to quickly and easily check what structured data you are collecting in your project.
- In Scrapy, `items.py` is where you define the data fields (structure) you want to extract from a website.
- Define the fields in `items.py`:
import scrapy
class QuotescraperItem(scrapy.Item):
# define the fields for your item here like:
author = scrapy.Field()
text = scrapy.Field()
tags = scrapy.Field()
- Use the item in your spider
import scrapy
from ..items import QuotescraperItem
class NewQuotesSpider(scrapy.Spider):
name = "new_quotes"
allowed_domains = ["quotes.toscrape.com"]
start_urls = [
"https://quotes.toscrape.com/",
]
def parse(self, response, **kwargs):
for quote in response.css("div.quote"):
item = QuotescraperItem()
item["author"] = quote.xpath("span/small/text()").get()
item["text"] = quote.css("span.text::text").get()
item["tags"] = quote.css("div.tags a.tag::text").getall()
yield item
next_page = response.css('li.next a::attr("href")').get()
if next_page is not None:
yield response.follow(next_page, self.parse)
- Run your spider and save output.
scrapy crawl new_quotes -o quotescraper/output_files/csv_files/allNewQuotes.csv
- Item Pipelines are the data processors of Scrapy, which all our scraped Items pass through and where we can clean, process, validate, and store our data.
- Using Scrapy Pipelines we can (a minimal validation sketch follows this list):
  - Clean our data (ex. remove currency signs from prices)
  - Format our data (ex. convert strings to ints)
  - Enrich our data (ex. convert relative links to absolute links)
  - Validate our data (ex. make sure the price scraped is a viable price)
  - Store our data in databases, queues, files or object storage buckets
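As a minimal validation sketch (not part of the project built below), a pipeline can discard incomplete items using Scrapy's built-in `DropItem` exception:

# pipelines.py (sketch): drop quotes that are missing the text field
from scrapy.exceptions import DropItem

class RequiredFieldsPipeline:
    def process_item(self, item, spider):
        if not item.get("text"):
            raise DropItem("Missing 'text' in item")
        return item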
- Task
  - Convert author names to uppercase.
  - Remove quotation marks from the text.
- Steps
  - Create a pipeline in `pipelines.py`:

class QuotescraperPipeline:
    def process_item(self, item, spider):
        item["author"] = item["author"].upper()
        item['text'] = item['text'].replace('“', '').replace('”', '').replace('"', '').strip()
        return item
- Spider file `quotes_advance.py`:

# spider file: quotes_advance.py
import scrapy
from ..items import QuotescraperItem

class AdvanceQuotesSpider(scrapy.Spider):
    name = "advance_quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = [
        "https://quotes.toscrape.com/tag/humor/",
    ]

    def parse(self, response, **kwargs):
        for quote in response.css("div.quote"):
            item = QuotescraperItem()
            item["author"] = quote.xpath("span/small/text()").get()
            item["text"] = quote.css("span.text::text").get()
            item["tags"] = quote.css("div.tags a.tag::text").getall()
            yield item
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
- Enable the pipeline in `settings.py`. The number (300) is the priority; lower = runs earlier.

ITEM_PIPELINES = {
    'quotescraper.pipelines.QuotescraperPipeline': 300,
}
Note: Each item scraped will now go through process_item() in your pipeline.
Save output:
scrapy crawl advance_quotes -o quotescraper/output_files/jsonl_files/advance_quotes.jsonl
Final result:
{"author": "JANE AUSTEN", "text": "The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.", "tags": ["aliteracy", "books", "classic", "humor"]}
By default, Scrapy pipelines apply to all spiders, but you can make a pipeline process only specific spiders and skip the others.
A simple way to do this is to modify our process_item() method:
# pipelines.py file
class QuotescraperPipeline:
def process_item(self, item, spider):
if spider.name != "advance_quotes":
return item # skip processing for other spiders
# process only for advance_quotes spider
item["author"] = item["author"].upper()
item["text"] = item["text"].replace("“", "").replace("”", "").replace('"', "")
return item
This approach works especially well in smaller or medium-sized Scrapy projects. For larger projects we can write multiple pipeline classes and check item types instead of `spider.name`.
Let's say we have two spiders, one for quotes and one for books: advance_quotes.py
and books_spider.py
Scrapy_Advance/
└── QuoteBookScrape/
    ├── scrapy.cfg
    └── QuoteBookScrape/
        ├── __init__.py
        ├── items.py
        ├── pipelines.py
        ├── settings.py
        └── spiders/
            ├── advance_quotes.py
            └── books_spider.py
Step 1: items.py
Define two different item types (classes)
import scrapy
class QuoteItem(scrapy.Item):
# define the fields for your item here like:
author = scrapy.Field()
text = scrapy.Field()
tags = scrapy.Field()
class BookItem(scrapy.Item):
# define the fields for your item here like:
title = scrapy.Field()
price = scrapy.Field()
Step 2: Create your spiders advance_quotes.py
and books_spider.py
# advance_quotes.py
import scrapy
from ..items import QuoteItem
class AdvanceQuotesSpider(scrapy.Spider):
name = "advance_quotes"
allowed_domains = ["quotes.toscrape.com"]
start_urls = ["https://quotes.toscrape.com/tag/inspirational/"]
def parse(self, response, **kwargs):
for quote in response.css("div.quote"):
item = QuoteItem()
item["author"] = quote.xpath("span/small/text()").get()
item["text"] = quote.css("span.text::text").get()
item["tags"] = quote.css("div.tags a.tag::text").getall()
yield item
next_page = response.css('li.next a::attr("href")').get()
if next_page is not None:
yield response.follow(next_page, self.parse)
# books_spider.py
import scrapy
from ..items import BookItem
class BooksSpider(scrapy.Spider):
name = "books_spider"
allowed_domains = ["books.toscrape.com"]
start_urls = ["https://books.toscrape.com/catalogue/category/books/fiction_10/index.html"]
def parse(self, response, **kwargs):
for book in response.css("article.product_pod"):
item = BookItem()
item["title"] = book.css("h3 a::attr(title)").get()
item["price"] = book.css("p.price_color::text").get()
yield item
next_page = response.css('li.next a::attr("href")').get()
if next_page is not None:
yield response.follow(next_page, self.parse)
Step 3: pipelines.py
Define two pipeline classes - one for each item type.
from .items import QuoteItem, BookItem
class AdvanceQuotesPipeline:
def process_item(self, item, spider):
if isinstance(item, QuoteItem):
item['text'] = item['text'].replace('“', '').replace('”', '').strip()
item['author'] = item['author'].upper()
return item
class BookStorePipeline:
def process_item(self, item, spider):
if isinstance(item, BookItem):
item['title'] = item['title'].strip()
item['price'] = item['price'].replace('£', '').strip()
return item
Step 4: settings.py
Enable both pipelines
ITEM_PIPELINES = {
"QuoteBookScrape.pipelines.AdvanceQuotesPipeline": 300,
"QuoteBookScrape.pipelines.BookStorePipeline": 400,
}
Scrapy provides a built-in, easy-to-use system called Feed Exporters that lets you save your scraped data in different formats like: json, jsonl, csv, xml, and even custom formats
- We were already using Scrapy's Feed system, just without realizing the name: `scrapy crawl advance_quotes -o output_files/quote.csv`. This is the command-line Feed Exporter argument.
- Telling Scrapy to save the data to a CSV via the command line is okay, but can be a little messy. The other option is setting it in your code (`settings.py`), which Scrapy makes very easy.
- We can configure it in our `settings.py` file by passing the FEEDS setting a dictionary with the path/name of the file and the file format:

FEEDS = {
    'output/files_using_feed/quotes.csv': {
        'format': 'csv',
        'overwrite': True,
    },
    'output/files_using_feed/quotes.jsonl': {
        'format': 'jsonlines',
        'overwrite': True,
        'indent': 2,
        'encoding': 'utf8',
    },
}

- The default overwriting behaviour of the FEEDS functionality depends on where the data is going to be stored. However, you can control it by adding an overwrite key to the FEEDS dictionary with either True or False.
- When saving locally, overwrite is set to False by default.
- We can add one or many formats; Scrapy will export to all of them.
Setting a static file path is okay for development or very small projects; in production, however, you likely won't want all your data saved into one big file.
To solve this, Scrapy allows you to create dynamic file paths/names using spider variables.
- `%(time)s` gets replaced by a timestamp when the feed is being created
- `%(name)s` gets replaced by the spider name
# add this code at the bottom of the file settings.py
FEEDS = {
'QuoteBookScrape/output_files/files_using_feed/%(name)s/%(name)s_%(time)s.csv': {
'format': 'csv',
},
}
- In bash run: `scrapy crawl <spider name>`
- E.g., using the project we created for learning pipelines: `scrapy crawl books_spider` and `scrapy crawl advance_quotes`
- Generated paths:
  QuoteBookScrape/QuoteBookScrape/output_files/files_using_feed/advance_quotes/advance_quotes_2025-06-22T19-21-50+00-00.csv
  QuoteBookScrape/QuoteBookScrape/output_files/files_using_feed/books_spider/books_spider_2025-06-22T19-41-19+00-00.csv
We can set FEEDS dynamically in each spider (custom_settings)
If we want full control from inside the spider file (e.g., different formats for each spider), we can override `custom_settings`.
# quotes_custom_settings.py
import scrapy
class AdvanceQuotesSpider(scrapy.Spider):
name = "custom_quotes"
start_urls = ["https://quotes.toscrape.com/tag/humor/"]
custom_settings = {
'FEEDS': {
'QuoteBookScrape/output_files/files_CustomSettings/quotes.jsonl': {
'format': 'jsonlines',
'overwrite': True,
}
}
}
def parse(self, response, **kwargs):
pass
QuoteBookScrape
.
|-- QuoteBookScrape
|   |-- __init__.py
|   |-- items.py
|   |-- middlewares.py
|   |-- output_files
|   |   |-- files_CustomSettings      <-- files generated from `quotes_custom_settings.py` file
|   |   |   `-- quotes.jsonl
|   |   |-- files_using_CLI           <-- generated using command line `scrapy crawl <spider name> -O QuoteBookScrape/output_files/....`
|   |   |   |-- advance_quotes.csv
|   |   |   `-- books.csv
|   |   `-- files_using_feed          <-- dynamic output files location set in `settings.py`
|   |       |-- advance_quotes
|   |       |   `-- advance_quotes_2025-06-22T19-21-50+00-00.csv
|   |       `-- books_spider
|   |           `-- books_spider_2025-06-22T19-41-19+00-00.csv
|   |-- pipelines.py
|   |-- settings.py
|   `-- spiders
|       |-- __init__.py
|       |-- advance_quotes.py
|       |-- books_spider.py
|       `-- quotes_custom_settings.py <-- FEEDS custom_settings (FEEDS for a specific spider)
`-- scrapy.cfg
CLI (command line) > Spider `custom_settings` > Project settings
- This means the CLI command `scrapy crawl myspider -O output.csv` will override both `custom_settings` and `settings.py`.
- If we run `scrapy crawl myspider` without specifying an output file on the command line and our spider has `custom_settings`, then it overrides `settings.py`.
- If neither is present, Scrapy uses `settings.py`.
All pages of the site have a Login button that redirects us to a /login page. At this login page, you can type in any combination of username and password, and when you press the Login button, you will get redirected to the home page, which now has a Logout button, meaning that you are logged in.
Logging in using Scrapy means simulating a form submission to authenticate yourself on a website so that you can scrape pages that are only accessible after logging in.
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser
class LoginSpider(scrapy.Spider):
name = 'login_quotes'
start_urls = ['https://quotes.toscrape.com/login']
def parse(self, response):
# Send a POST request with the login form
return FormRequest.from_response(
response,
formdata={'username': 'admin', 'password': 'admin'},
callback=self.after_login
)
def after_login(self, response):
# We're now logged in if credentials are correct
if "Logout" in response.text:
self.logger.info("Login successful!")
else:
self.logger.error("Login failed!")
# Proceed to scrape after login
return response.follow("/tag/humor/", callback=self.parse_quotes)
def parse_quotes(self, response):
open_in_browser(response) # open page in browser
# code to parse data.....
- Inspect the login form:
  - Find the form fields (look for the `<input name="...">` name attribute); a shell sketch follows this list.
  - Identify the required fields and the hidden `csrf_token`.
- To submit form data, use `FormRequest.from_response()` in the default callback method Scrapy calls for the URL(s) in `start_urls`.
- Scrapy makes it easy to submit the login form by extracting hidden fields automatically.
- `callback=self.after_login`: after submitting the form, Scrapy will call the `after_login` method with the server's response. This is where you check whether login succeeded or failed.
- If login succeeded (or even if it failed), you navigate to the page `/tag/humor/`. Scrapy will call the `parse_quotes` method on the new response.
- The `parse_quotes(self, response)` method handles the HTML from the quotes page.
- `open_in_browser(response)` opens the HTML page in your default browser so you can visually check what was loaded.
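A quick way to do that form inspection from the Scrapy shell (a sketch; the field names shown are what the quotes.toscrape.com login form exposes at the time of writing):

# In the terminal run: scrapy shell "https://quotes.toscrape.com/login"
>>> response.css("form input::attr(name)").getall()
['csrf_token', 'username', 'password']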
Websites like QuotesToScrape are intended to help you learn web scraping, so they don't block your scraping bot.
However, what you will quickly find out when you start scraping protected websites like Amazon or Google is that building and running your scrapers is the easy part.
The true difficulty of web scraping is being able to reliably retrieve HTML responses from the pages you want to scrape, because most websites want to limit or completely stop your ability to scrape their data.
However, the most important and easiest ways of bypassing a website's anti-bot protection systems are to fake your headers and user-agents, and to use rotating proxy pools.
User-Agent strings are part of the HTTP request headers and are used to identify the client (user or bot) making a request to a web server.
They typically include information like:
- The application (browser, script, bot)
- The operating system (Windows, macOS, Linux, Android, etc.)
- The browser engine (e.g., Chrome, Firefox, Safari)
- Sometimes even device type (e.g., mobile, tablet, desktop)
Example User agent sent when you visit a website with a Chrome browser:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36
When you use Scrapy with its default settings, the user-agent your spider sends is: `Scrapy/VERSION (+https://scrapy.org)`
This immediately tells the website you're a bot, and you might get blocked or redirected to CAPTCHA pages.
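One way to confirm what your requests actually send is to fetch a header-echo endpoint (a sketch; https://httpbin.org/headers simply returns the request headers it received):

# In the terminal run: scrapy shell "https://httpbin.org/headers"
>>> print(response.text)  # the "User-Agent" entry shows Scrapy's default unless you override it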
- This applies to all spiders by default.
# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36'
- use this if you want a different User-Agent for each spider
# myspider.py
class MySpider(scrapy.Spider):
name = 'myspider'
custom_settings = {
'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Firefox/119.0'
}
- Best for setting it on specific requests only.
# myspider.py
import scrapy
# add method inside spider class
def start_requests(self):
for url in self.start_urls:
yield scrapy.Request(
url=url,
callback=self.parse,
headers={
"User-Agent": "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"
}
)
Websites detect repeated requests with the same user-agent and may block or throttle your scraper. Different real users use different devices and browsers. Rotating user-agents helps you appear more like genuine traffic.
- Use Python's
random.choice()
to pick a random user-agent for each request
# myspider.py
import scrapy
import random
from scrapy import Request
USER_AGENTS = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
'Mozilla/5.0 (iPhone; CPU iPhone OS 14_4_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1',
'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363',
]
# add method inside spider class
def start_requests(self):
for url in self.start_urls:
yield Request(
url=url,
headers={'User-Agent': random.choice(USER_AGENTS)},
callback=self.parse
)
- Install: `pip install scrapy-user-agents`
- In `settings.py`:
# settings.py
DOWNLOADER_MIDDLEWARES = {
'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
- A User-Agent is one specific HTTP header that tells the server what browser/device is making the request.
- Just adding fake user-agents to your requests will help you scrape websites with simple anti-bot protection systems; however, for websites with proper anti-bot protection, just setting user-agents isn't enough.
- To convince these websites (Amazon, Google, etc.) that you are not a scraper, you must use fake browser headers to mimic the browser fingerprints of real users.
- Browser headers include all the headers typically sent by a real browser, not just the User-Agent.

You can add fake browser headers just as you would add fake user-agents, since user-agents are just one type of header.
# myspider.py
import scrapy
class QuoteSpider(scrapy.Spider):
name = 'quote_spider'
start_urls = ['https://quotes.toscrape.com']
def start_requests(self):
fake_browser_header = {
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"sec-ch-ua": '"Google Chrome";v="137", "Chromium";v="137", "Not/A)Brand";v="24"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "Windows",
"sec-fetch-site": "none",
"sec-fetch-mode": "navigate",
"sec-fetch-user": "?1",
"Sec-Fetch-Dest": "document",
"accept-encoding": "gzip, deflate, br, zstd",
"accept-language": "en-IN,en-GB;q=0.9,en-US;q=0.8,en;q=0.7,hi;q=0.6"
}
for url in self.start_urls:
yield scrapy.Request(
url=url,
headers=fake_browser_header,
callback=self.parse
)
def parse(self, response):
print(response.text) # Just print to verify
Note: Every browser (Chrome, Firefox, Edge, etc.) attaches slightly different headers in a different order, based on the operating system the browser is running on. So it is important to ensure the headers (and header order) we attach to our requests is correct.
- As we saw, most websites try to limit or completely block scrapers from accessing their data. Part of the solution is to use user-agents and browser headers to make your scraper appear more like a real browser.
- However, this won't work when scraping at scale, as your IP address will be static. This is where web scraping proxies come in. Web scraping proxies are IP addresses that you route your requests through instead of using your own or your server's IP address.
- To integrate a list of proxies with our spider, we can build our own proxy management layer, or we can simply install an existing Scrapy middleware that will manage our proxy list for us.
- There are several free Scrapy middlewares out there to choose from. For now, we will use `scrapy-rotating-proxies`.
Steps:
- Install the package: `pip install scrapy-rotating-proxies`
- In `settings.py`:
# settings.py
# enable proxy middleware in your downloader middlewares
DOWNLOADER_MIDDLEWARES = {
'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
# insert your list of proxies here
ROTATING_PROXY_LIST = [
    # 'IP:PORT' (the addresses below are just sample entries)
    '178.207.11.148:3129',
    '103.145.32.249:8080',
    '161.35.98.111:8080',
    # add more
]
- After this, all requests our spider makes will be proxied using one of the proxies from the `ROTATING_PROXY_LIST`.
- As an alternative to `ROTATING_PROXY_LIST`, you can specify a `ROTATING_PROXY_LIST_PATH` option with a path to a file that lists proxies, one per line:
  ROTATING_PROXY_LIST_PATH = '/my/path/proxies.txt'
- Note that if both the list and the path are present, the path takes precedence.