Make a Robust Crawler with Scrapy and Django

As a developer, you may find yourself wishing to gather, organize, and clean data. You need a scraper to extract data and a crawler to automatically search for pages to scrape.

Scrapy helps you complete both easy and complex data extractions. It has a built-in mechanism to create a robust crawler.

In this article, we’ll learn more about crawling and the Scrapy tool, then integrate Scrapy with Django to scrape and export product details from a retail website. To follow this tutorial, you should have basic Python and Django knowledge and have Django installed and operating.

Selectors

Scraping basically means making GET requests to web pages and parsing the HTML responses. Scrapy has its own mechanisms for parsing data, called selectors. They “select” certain parts of the HTML using either CSS or XPath expressions.

Important note: Before you try to scrape any website, go through its robots.txt file. You can access it via <domainname>/robots.txt. There, you will see a list of pages allowed and disallowed for scraping. You should not violate any terms of service of any website you scrape.
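If you want to check this programmatically, the Python standard library ships a robots.txt parser. Here is a small sketch using a placeholder domain:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.example.com/robots.txt')  # placeholder domain
rp.read()

# True if the given user agent may fetch the page
print(rp.can_fetch('*', 'https://www.example.com/products/1.html'))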

XPath Expressions

As a Scrapy developer, you need to know how to use XPath expressions. Using XPath, you can perform actions such as selecting the link that contains the text “Next Page”:

data = response.xpath("//a[contains(., 'Next Page')]").get()

In fact, Scrapy converts CSS selectors to XPath under the hood.

# sample of a CSS expression
data = response.css('.price::text').getall()
# sample of an XPath expression
data = response.xpath('//h1[@class="gl-heading"]/span/text()').get()

The // expression selects all elements that match the criteria, anywhere in the document. If you specify an attribute with @, only elements with that attribute are selected. / steps down the path, one level at a time, to the target element, so you spell out the full path of what you want. get() always returns a single result (the first one if there are several), while getall() returns a list with all results.

Note: You may have seen extract() and extract_first() used instead of getall() and get(); they are the same methods under different names. However, the official documentation recommends the newer names because they result in more concise and readable code.
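To see the difference, you can try the selectors on an HTML snippet directly, outside of a spider (a quick sketch using scrapy.Selector):

from scrapy import Selector

sel = Selector(text='<ul><li class="price">10</li><li class="price">20</li></ul>')

sel.css('.price::text').get()      # '10' -> only the first match
sel.css('.price::text').getall()   # ['10', '20'] -> every match
sel.xpath('//li[@class="price"]/text()').getall()  # same result via XPath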

Starting a Scrapy Project

After you install Scrapy, scrapy startproject <projectname> creates a new project.

Inside the project, type scrapy genspider <spiderName> <domainName> to set up the spider template.

To run the spider and save data as a JSON file, run scrapy crawl <spiderName> -o data.json.
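After these commands, the generated project looks roughly like this (with myproject as a placeholder name):

myproject/
    scrapy.cfg            # deploy configuration
    myproject/
        __init__.py
        items.py          # item definitions
        middlewares.py
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            myspider.py   # created by genspider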

Integrating with Django

The scrapy-djangoitem package is a convenient way to integrate Scrapy projects with Django models. Install it with pip install scrapy-djangoitem.

To use Django models outside of your Django app, you need to set the DJANGO_SETTINGS_MODULE environment variable and modify PYTHONPATH so the settings module can be imported.

You can simply add this to your Scrapy settings file:

import os
import sys

sys.path.append('<path-to-project>/djangoProjectName')
os.environ['DJANGO_SETTINGS_MODULE'] = 'djangoProjectName.settings'

# If you use Django outside of the manage.py context, you
# need to set it up explicitly
import django
django.setup()

After integration, you can start writing your first spiders.

Spiders

Spiders are classes defining the custom behaviour for crawling and parsing a particular page. Five different spiders are bundled with Scrapy and you can write your own spider classes as well.

scrapy.Spider

scrapy.Spider is the simplest, root spider class that every other spider inherits from.

class MySpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
    ]

    def parse(self, response):
        # xpath/css expressions here
        yield item

Each spider must have:

  • name — must be unique within the project.
  • allowed_domains — specifies which domains the spider is allowed to scrape.
  • start_urls — specifies which pages to scrape within those domains.
  • parse method — takes the HTTP response and parses the target elements specified with selectors.
  • yield — generates one dictionary (or item) of data per scraped element.

To set these properties dynamically, use the __init__ method. That way, you can use data coming from your Django views:

class MySpider(scrapy.Spider):
    name = 'example'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.url = kwargs.get('url')
        self.domain = kwargs.get('domain')
        self.start_urls = [self.url]
        self.allowed_domains = [self.domain]

    def parse(self, response):
        ...

You don’t need an additional method to generate your requests here, but how are they generated? scrapy.Spider provides a default start_requests() implementation: it sends requests to the URLs defined in start_urls and then calls the parse method for each response. However, you may need to override it in some circumstances. For instance, if the page requires a login, you must override it to send a POST request.
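Here is a rough sketch of such an override, using FormRequest to submit the login form (the URL, form field names, and credentials are placeholders):

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login_example'  # placeholder name

    def start_requests(self):
        # Submit the login form with a POST request before scraping
        yield scrapy.FormRequest(
            url='https://example.com/login',
            formdata={'username': 'user', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Continue crawling with the authenticated session
        yield scrapy.Request('https://example.com/account', callback=self.parse)

    def parse(self, response):
        ...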

The other spiders Scrapy provides are CrawlSpider, which offers a convenient mechanism for following links by defining a set of rules; XMLFeedSpider, to scrape XML pages; CSVFeedSpider, to scrape CSV files; and SitemapSpider, to scrape the URLs listed in a sitemap file.
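For example, a CrawlSpider can follow pagination links and parse every product page it finds along the way (a sketch with placeholder URL patterns):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ProductsCrawler(CrawlSpider):
    name = 'products_crawler'  # placeholder name
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    rules = (
        # Follow pagination links without parsing them
        Rule(LinkExtractor(allow=r'/page/\d+')),
        # Parse every product detail page found along the way
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
    )

    def parse_product(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
        }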

Note: Scrapy is asynchronous by default, so you can chain requests from within your callbacks. Requests are scheduled and processed concurrently; you don’t wait for one response before sending the next request.

def parse(self, response):
    # placeholder paginated URL pattern
    url = 'https://example.com/something.html?page={}'
    # all page requests are scheduled at once and processed asynchronously
    for page in range(self.total_pages):
        yield scrapy.Request(url.format(page), callback=self.parse_page)

Items

Spiders can return the extracted data as Python key-value pairs. Scrapy items are similar to Django models, except that they are much simpler. You can choose from different item types, such as dictionaries or Item objects, and handle all of them through a single interface with itemadapter.

# items.py
class BrandsItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    ..

You may use the scrapy-djangoitem extension, which defines Scrapy items from existing Django models.

import scrapy
from scrapy_djangoitem import DjangoItem
from products.models import Product

class BrandsItem(DjangoItem):
    django_model = Product
    stock = scrapy.Field()  # you can still add extra fields

Once you declare an item class, your spiders can store the scraped data directly in items.

# Inside spider class
...
def parse(self, response):
    item = BrandsItem()
    item['brand'] = 'ExampleBrand'
    item['name'] = response.xpath('//h1[@class="title"]/text()').get()
    ..
    yield item

Item Pipeline

An Item Pipeline class receives and processes every item the spider yields. It may validate, filter, drop, or save items to the database. To use a pipeline, you need to enable it in settings.py:

ITEM_PIPELINES = {'amazon.pipelines.AmazonPipeline': 300}

Each item pipeline has a process_item method that returns the (possibly modified) item or raises a DropItem exception.

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class BrandsPipeline:
    # parameters are the scraped item and the spider that scraped it
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter.get('price'):  # if the scraped data has a price
            item.save()  # save it to the database (DjangoItem)
            return item
        else:
            raise DropItem(f"Missing price in {item}")

Run Spiders with Django Views

Instead of running Scrapy the typical way, via scrapy crawl, you can connect your spiders to Django views, which automates the scraping process. This turns the project into a real-time, full-stack application with a standalone crawler.

The whole process works like this:

The client sends a request to the server with the URLs to scrape. The URLs might come from user input or from somewhere else, depending on your needs. The server takes the request and triggers Scrapy to crawl the target pages. The spiders use selectors to extract the data, and an item pipeline may process it before it is stored. Once the data is in storage, the Scrapy job is finished. As the last step, the server fetches that data from storage and sends it back to the client in the response.

The problem in this process comes near the end: there is no way for the Django app to know when the scraping job has finished. To keep a persistent connection with your server, you could poll it every second, asking, “Hey! Is there anything new?” However, polling is not an efficient way to build a real-time application. A better solution is WebSockets: the Django Channels library establishes a WebSocket connection with the browser.
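As a minimal sketch, assuming Django Channels is installed and this consumer is wired into your ASGI routing (the consumer name and message shape are placeholders), the server can push a notification to the browser once the crawl is done:

# consumers.py
import json
from channels.generic.websocket import AsyncWebsocketConsumer

class ScrapeConsumer(AsyncWebsocketConsumer):
    async def connect(self):
        # Accept the WebSocket connection from the browser
        await self.accept()

    async def scrape_finished(self, event):
        # Triggered through the channel layer (message type 'scrape.finished')
        # once the crawl is done; push the result to the client immediately
        await self.send(text_data=json.dumps({
            'status': 'finished',
            'items': event.get('items', []),
        }))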

Helper libraries can make it easier to build your real-time application with Scrapy. Scrapyd is a standalone service running on a server where you can deploy and control your spiders. The ScrapyRT library returns scraped data immediately as JSON instead of saving it in a database, so you can build your own API on top of it.
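For example, a Django view could schedule a spider through Scrapyd's HTTP JSON API. This is a rough sketch; it assumes Scrapyd is running on its default port 6800 and that a project named 'crawler' with a spider named 'products' has already been deployed:

# views.py
import requests
from django.http import JsonResponse

def start_crawl(request):
    url_to_scrape = request.GET.get('url')  # URL supplied by the client
    response = requests.post('http://localhost:6800/schedule.json', data={
        'project': 'crawler',   # placeholder Scrapyd project name
        'spider': 'products',   # placeholder spider name
        'url': url_to_scrape,   # passed to the spider as a keyword argument
    })
    # Scrapyd returns a job id you can use to check the status via listjobs.json
    return JsonResponse(response.json())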

Next Steps

This article is intended to be a practical guide for those who want to explore Scrapy's structure beyond connecting it with Django. There is a lot more you can do with Scrapy. With a couple of lines, you can design a web crawler that automatically navigates to your target website and extracts the data you need.

Many websites run entirely on JavaScript nowadays, so sometimes you need to open a modal or press a button before you can scrape the data. This can become a nightmare with other tools, but there is a scrapy-splash plugin for handling the JavaScript on your target website. Besides, Scrapy handles errors gracefully; it even has built-in support for resuming a crawl from the last page if it hits an error. Although you get all of this for free, you do need to allocate some time to learn Scrapy, as it is not as easy to pick up as other scraping tools.

For the next steps, if you intend to create your own standalone crawler, you may find adriancast's Scrapyd-Django-Template helpful. Check out how they implemented Scrapyd to deploy and run Scrapy spiders inside a Django app.
