Scrapy Cheatsheet

Post author: techguidehub
Post category: CheatSheet
Post last modified: 26 November 2023

Installation

pip install scrapy

Create a new Scrapy project

scrapy startproject project_name

Create a new Spider

cd project_name
scrapy genspider spider_name example.com

Running a Spider

scrapy crawl spider_name

XPath Selectors

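# Example XPath selector: text of each <p> directly under <div class="example">
# .getall() returns a list of all matches (the newer name for .extract())
response.xpath('//div[@class="example"]/p/text()').getall()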

CSS Selectors

# Example CSS selector: text of each <p> anywhere inside div.example
response.css('div.example p::text').getall()

Extracting Data

# Extracting text (.getall() returns a list; .get() returns the first match or None)
response.css('div.example p::text').getall()

# Extracting attribute
response.css('a::attr(href)').getall()

Following Links

# Following links using XPath
next_page = response.xpath('//a[@class="next-page"]/@href').get()
if next_page is not None:  # the last page has no "next" link
    # response.follow resolves relative URLs against the current page
    yield response.follow(next_page, callback=self.parse)

# Following links using CSS
next_page = response.css('a.next-page::attr(href)').get()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)
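
An href taken from a page is often relative, while scrapy.Request needs an absolute URL. Scrapy's response.urljoin() resolves a link against the page URL in much the same way as the standard library's urllib.parse.urljoin. The sketch below calls the standard library function directly so it runs without a crawl; the URLs are made up.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page currently being parsed
page_url = "https://example.com/catalog/page-1.html"

# A link relative to the current directory
a = urljoin(page_url, "page-2.html")

# A root-relative link
b = urljoin(page_url, "/catalog/page-2.html")

# An already absolute link is returned unchanged
c = urljoin(page_url, "https://example.com/other")

print(a)
print(b)
print(c)
```

Inside a spider, response.follow(href, callback=self.parse) performs this join for you, so relative links can be passed to it directly.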

Item Pipeline

# Define items in items.py
import scrapy

class MyItem(scrapy.Item):
    field1 = scrapy.Field()
    field2 = scrapy.Field()

# Use in spider (import MyItem from your project's items module)
def parse(self, response):
    item = MyItem()
    # .get() returns the first match, or None if the selector matches nothing
    item['field1'] = response.css('...').get()
    item['field2'] = response.css('...').get()
    yield item
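
The snippet above defines and fills an item. A pipeline is the step that post-processes each item after the spider yields it. Here is a minimal sketch of one, assuming a hypothetical StripWhitespacePipeline class placed in the project's pipelines.py; the class name and the module path in the settings comment are illustrative.

```python
class StripWhitespacePipeline:
    """Hypothetical pipeline: trims whitespace from every string field."""

    def process_item(self, item, spider):
        # Scrapy calls this once for each item a spider yields
        for key, value in list(item.items()):
            if isinstance(value, str):
                item[key] = value.strip()
        # Returning the item hands it to the next enabled pipeline
        return item


# In settings.py (module path is an assumption; lower numbers run first):
# ITEM_PIPELINES = {"project_name.pipelines.StripWhitespacePipeline": 300}
```

A pipeline is a plain Python class, so it can be exercised with an ordinary dict before wiring it into a project.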

Middlewares

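# Example downloader middleware to set a custom User-Agent
# (it only takes effect once listed in DOWNLOADER_MIDDLEWARES in settings.py)
class CustomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Returning None (implicitly) lets Scrapy continue handling the request
        request.headers['User-Agent'] = 'Custom User Agent'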
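
A common variation is to pick a different User-Agent for each request. The following is a sketch under stated assumptions: the class name, the placeholder agent strings, the module path, and the priority value in the settings comment are all illustrative.

```python
import random


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware that picks a User-Agent per request."""

    # Placeholder strings; substitute real browser User-Agent values
    USER_AGENTS = ["ExampleAgent/1.0", "ExampleAgent/2.0"]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        # Returning None tells Scrapy to keep processing this request
        return None


# In settings.py (module path and priority are assumptions):
# DOWNLOADER_MIDDLEWARES = {
#     "project_name.middlewares.RandomUserAgentMiddleware": 543,
# }
```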

Setting User Agent

# Set user-agent in settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'

Exporting Data

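# Export data to CSV (-o appends if the file already exists)
scrapy crawl spider_name -o output.csv

# Export data to JSON (appending to an existing JSON file yields invalid JSON;
# use -O instead of -o to overwrite the file)
scrapy crawl spider_name -o output.json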
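
Exports can also be configured in settings.py so that a plain scrapy crawl writes the files without extra flags. This is a sketch of the FEEDS setting; the file names mirror the commands above, and support for the overwrite key depends on your Scrapy version, so treat it as an assumption to verify.

```python
# settings.py fragment: write every crawl's items to both files
FEEDS = {
    # overwrite=True replaces the file instead of appending to it
    "output.json": {"format": "json", "overwrite": True},
    "output.csv": {"format": "csv"},
}
```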

Debugging

# Run spider with debug-level logging (--nolog would hide all log output)
scrapy crawl spider_name -L DEBUG
