Scrapy Cheatsheet

Installation

pip install scrapy

Create a new Scrapy project

scrapy startproject project_name

Create a new Spider

cd project_name
scrapy genspider spider_name example.com

Running a Spider

scrapy crawl spider_name

XPath Selectors

# Example XPath selector (getall() returns every match as a list of strings)
response.xpath('//div[@class="example"]/p/text()').getall()

CSS Selectors

# Example CSS selector (getall() returns every match as a list of strings)
response.css('div.example p::text').getall()

Extracting Data

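# Extracting text: getall() returns a list, get() returns the first match or None
response.css('div.example p::text').getall()
response.css('div.example p::text').get()

# Extracting an attribute from every matching element
response.css('a::attr(href)').getall()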

Following Links

# Following links using XPath
next_page = response.xpath('//a[@class="next-page"]/@href').get()
if next_page is not None:  # there is no match on the last page
    # response.follow resolves relative URLs against the current page
    yield response.follow(next_page, callback=self.parse)

# Following links using CSS
next_page = response.css('a.next-page::attr(href)').get()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)
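
Pagination links are often relative (for example `page-2.html`), and a bare `scrapy.Request` rejects a URL that has no scheme. `response.urljoin()` and `response.follow()` resolve the href against the page URL first. The sketch below mimics that resolution with the standard library's `urljoin`; the example.com URLs and the `resolve` helper are made up purely for illustration.

```python
from urllib.parse import urljoin

# Hypothetical page URL, chosen only to illustrate URL resolution.
page_url = "https://example.com/catalog/page-1.html"


def resolve(href):
    """Resolve an href against the current page URL, as a browser would."""
    return urljoin(page_url, href)


relative = resolve("page-2.html")                 # relative to the current directory
root_relative = resolve("/catalog/page-3.html")   # relative to the site root
absolute = resolve("https://example.com/about")   # already absolute, left unchanged
```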

Item Pipeline

# Define items in items.py
import scrapy

class MyItem(scrapy.Item):
    field1 = scrapy.Field()
    field2 = scrapy.Field()

# Use in spider (import MyItem from your project's items module first)
def parse(self, response):
    item = MyItem()
    item['field1'] = response.css('...').get()  # first match, or None
    item['field2'] = response.css('...').get()
    yield item
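
Yielded items are then passed through any item pipelines enabled in settings.py. A minimal sketch of one follows, assuming a hypothetical `StripWhitespacePipeline` class placed in `project_name/pipelines.py`; neither name comes from the original cheatsheet.

```python
class StripWhitespacePipeline:
    """Hypothetical pipeline that trims whitespace from string fields."""

    def process_item(self, item, spider):
        # Works for dicts and scrapy.Item alike, since both behave as mappings.
        for key, value in list(item.items()):
            if isinstance(value, str):
                item[key] = value.strip()
        # Return the item so later pipelines and exporters receive it.
        return item


# Assumed entry in settings.py to enable it; lower numbers run first:
# ITEM_PIPELINES = {"project_name.pipelines.StripWhitespacePipeline": 300}

# Quick check outside Scrapy, using a plain dict in place of MyItem.
cleaned = StripWhitespacePipeline().process_item(
    {"field1": "  hello \n", "field2": None}, spider=None
)
```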

Middlewares

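# Example downloader middleware to set a custom User-Agent (middlewares.py)
# It only runs once enabled via DOWNLOADER_MIDDLEWARES in settings.py
class CustomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Returning None (implicitly) lets Scrapy continue processing the request
        request.headers['User-Agent'] = 'Custom User Agent'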
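
A middleware has no effect until it is registered. A sketch of the settings.py entry, assuming the class above lives in `project_name/middlewares.py` (the module path is an assumption, and 543 is simply the order value used in Scrapy's project template):

```python
# settings.py
# Keys are import paths to middleware classes; values set the order they run in.
DOWNLOADER_MIDDLEWARES = {
    "project_name.middlewares.CustomUserAgentMiddleware": 543,
}
```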

Setting User Agent

# Set user-agent in settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'

Exporting Data

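# Export data to CSV (the format is inferred from the file extension)
scrapy crawl spider_name -o output.csv

# Export data to JSON (-o appends to an existing file, which breaks JSON;
# recent Scrapy versions accept -O to overwrite instead)
scrapy crawl spider_name -o output.json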
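
The same exports can be configured in settings.py through the `FEEDS` setting instead of on the command line. A sketch that reuses the file names from above; the option values are illustrative, and `FEEDS` and its `overwrite` option are only available in reasonably recent Scrapy releases.

```python
# settings.py
# Each key is an output path; each value holds that feed's options.
FEEDS = {
    "output.json": {"format": "json", "encoding": "utf8", "overwrite": True},
    "output.csv": {"format": "csv"},
}
```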

Debugging

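# Run spider with verbose (DEBUG-level) logging
scrapy crawl spider_name -L DEBUG

# Run spider quietly (no log output) while exporting to JSON
scrapy crawl spider_name -o output.json --nolog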
