Scrapy: The Ultimate Web Scraping Framework
Introduction to Scrapy
Scrapy is a fast, high-level, versatile, and open-source web scraping and crawling framework. Originally designed for web scraping, Scrapy can also be used for data mining, automated testing, and even website monitoring. With built-in support for managing requests, handling output pipelines, and retrying failed requests, Scrapy allows developers to focus on extracting the data they need, efficiently and accurately.
If you’re venturing into web scraping, Scrapy is one of the most powerful tools to have in your arsenal, thanks to its Pythonic approach, modular design, and vibrant community. Its robust feature set handles making HTTP requests, processing responses, following links, and storing scraped data, letting you efficiently turn unstructured website data into well-structured datasets.
Scrapy APIs: 20+ Key Features You Should Know (With Code Snippets)
Here’s a breakdown of Scrapy’s core APIs, along with explanations and practical examples.
1. scrapy.Spider
The Spider class is the primary class for defining custom crawlers. You create your custom spiders by subclassing scrapy.Spider.
import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://example.com']

    def parse(self, response):
        self.log('Hello from Scrapy!')
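Assuming the snippet above is saved as, say, my_spider.py, it can typically be run standalone with scrapy runspider my_spider.py, or with scrapy crawl my_spider from inside a Scrapy project.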
2. start_urls
A spider’s start_urls is a list of URLs where the crawl begins. Scrapy automatically makes requests to these URLs when the spider starts.
class BlogSpider(scrapy.Spider):
    name = 'blog'
    start_urls = ['https://blog.example.com']

    def parse(self, response):
        titles = response.xpath('//h2/a/text()').getall()
        self.log(titles)
3. response
The response object represents the HTTP response for a given request, allowing access to status codes, headers, and body content.
def parse(self, response):
    self.log(response.text)  # log the HTML content
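Beyond the body, the same response object exposes other details of the HTTP exchange. The sketch below logs a few of them; status, headers, and url are standard Scrapy Response attributes, while the rest of the spider is assumed from the earlier examples.

def parse(self, response):
    self.log(response.status)                       # HTTP status code, e.g. 200
    self.log(response.headers.get('Content-Type'))  # a single header value (as bytes)
    self.log(response.url)                          # the URL this response came from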
4. Request
The Request class lets you construct new HTTP requests. It accepts parameters such as callback, which specifies the method used to process the response.
import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def start_requests(self):
        yield scrapy.Request('http://example.com', callback=self.parse)

    def parse(self, response):
        self.log('Response received.')
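Request also takes optional arguments. As a rough sketch, the example below sets custom headers and attaches data via meta, both standard Request parameters; the URL and header values are placeholders.

def start_requests(self):
    yield scrapy.Request(
        'http://example.com/api',                 # placeholder URL
        callback=self.parse,
        headers={'User-Agent': 'my-scrapy-bot'},  # example header value
        meta={'page_type': 'api'},                # arbitrary data passed along with the request
    )

def parse(self, response):
    self.log(response.meta['page_type'])          # read back the data attached to the request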
5. callback
The callback parameter specifies the method that processes the response of the request.
def start_requests(self):
    urls = ['http://example.com/page1', 'http://example.com/page2']
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse_page)

def parse_page(self, response):
    self.log('Processing page')
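Callbacks can also chain: one callback can yield further requests whose responses are handled by a different method. A minimal sketch, where the link selector and detail pages are made-up examples:

def parse_page(self, response):
    # follow each heading link and hand the detail page to a second callback
    for href in response.css('h2 a::attr(href)').getall():
        yield response.follow(href, callback=self.parse_detail)

def parse_detail(self, response):
    self.log(response.url)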
6. xpath
Scrapy provides the xpath selector for extracting data using XPath expressions.
def parse(self, response):
    titles = response.xpath('//h2/text()').getall()
    self.log(titles)
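XPath can also target attributes and be applied relative to a narrower selection. In this sketch the //div[@class="post"] container is a hypothetical example; adjust it to the page you are scraping.

def parse(self, response):
    # extract the href attribute of every link nested inside an <h2>
    links = response.xpath('//h2/a/@href').getall()
    self.log(links)

    # selectors can be chained: select a region first, then query within it
    for post in response.xpath('//div[@class="post"]'):
        self.log(post.xpath('.//h2/text()').get())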
7. get() / getall()
The get() method retrieves the first matched result, while getall() retrieves all results.
def parse(self, response):
    title = response.xpath('//title/text()').get()      # first match only
    headings = response.xpath('//h2/text()').getall()   # list of all matches
    self.log(title)
    self.log(headings)
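When nothing matches, get() returns None; it also accepts a default value. A small sketch (the subtitle selector is a made-up example):

def parse(self, response):
    # returns 'no subtitle' instead of None when the XPath matches nothing
    subtitle = response.xpath('//h3[@class="subtitle"]/text()').get(default='no subtitle')
    self.log(subtitle)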
8. css
The css selector extracts data using CSS selectors, which some users find easier to work with than XPath.
def parse(self, response):
    titles = response.css('h2::text').getall()
    self.log(titles)
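CSS selectors can also pull attribute values with ::attr(), a brief sketch:

def parse(self, response):
    # ::attr() extracts attribute values instead of text
    links = response.css('a::attr(href)').getall()
    images = response.css('img::attr(src)').getall()
    self.log(links)
    self.log(images)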
9. scrapy.Field
A Field defines a field of a Scrapy Item, the structured container for storing scraped data.
import scrapy

class BlogPost(scrapy.Item):
    title = scrapy.Field()
    date = scrapy.Field()
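Once an Item is defined, a spider populates and yields it much like a dictionary. The sketch below reuses the BlogPost item above; the article, h2, and time selectors are hypothetical and depend on the target blog's markup.

class BlogSpider(scrapy.Spider):
    name = 'blog_items'
    start_urls = ['https://blog.example.com']

    def parse(self, response):
        for post in response.css('article'):
            item = BlogPost()
            item['title'] = post.css('h2::text').get()
            item['date'] = post.css('time::text').get()
            yield item  # yielded items flow into Scrapy's item pipelines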