Web crawling and scraping in Python

In this article we will cover the following:
- Basic crawling setup in Python
- Basic crawling with asyncio
- Scraper utility service
- Python scraping via the Scrapy framework
Web Crawler
A web crawler is an internet bot that systematically browses the World Wide Web for the purpose of extracting useful information.
Web Scraping
Extracting useful information from a webpage is termed web scraping.
Basic Crawler demo
We will be using the following tools:
- Python requests (https://pypi.org/project/requests/) module, used to build the crawler bot
- Parsel (https://parsel.readthedocs.io/en/latest/usage.html) library, used as the scraping tool
Task I
Scrape the Recurship (http://recurship.com) website and extract all links and images present on the page.
Demo Code
Github link: https://github.com/abdulmoizeng/crawlers-demo/blob/master/crawler-demo/spider.py
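The idea behind the demo, roughly, is a single request followed by Parsel selectors. Here is a minimal sketch of Task I (the actual code is in the GitHub link above; this is an illustrative approximation):

```python
# A minimal sketch of Task I: fetch the page and pull out links and images.
import requests
from parsel import Selector

response = requests.get("http://recurship.com")
selector = Selector(text=response.text)

# Extract all link targets and image sources from the page
links = selector.css("a::attr(href)").getall()
images = selector.css("img::attr(src)").getall()

print("Links found:", links)
print("Images found:", images)
```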
Task II
Scrape the Recurship site and extract its links, then navigate to each link one by one and extract image information.
Demo Code
Github link: https://github.com/abdulmoizeng/crawlers-demo/blob/master/crawler-demo/spider2.py
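A minimal sketch of Task II, assuming the same structure as the demo above (fetch the start page, then visit each link sequentially and collect images):

```python
# Sequential crawl: one request per link, one after another.
import requests
from parsel import Selector

start_response = requests.get("http://recurship.com")
links = Selector(text=start_response.text).css("a::attr(href)").getall()

images_by_link = {}
for link in links:
    if not link.startswith("http"):
        continue  # skip relative/anchor links for simplicity
    try:
        page = requests.get(link, timeout=10)
    except requests.RequestException:
        continue
    images_by_link[link] = Selector(text=page.text).css("img::attr(src)").getall()

print(images_by_link)
```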
Stats
Task II takes around 22 seconds to complete.
We have been using the Python "requests" and "parsel" packages. Here is a list of features these packages offer.
Requests package
The Python requests module offers the following features (a short sketch follows the list):
- HTTP method calls
- Working with response codes and headers
- Redirection and request history
- Session management
- Working with cookies
- Error and exception handling
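A short sketch of those features in use (the URL is just a placeholder):

```python
import requests

session = requests.Session()                      # session management
session.headers.update({"User-Agent": "my-bot"})  # custom headers
session.cookies.set("visited", "true")            # working with cookies

try:
    response = session.get("http://recurship.com", allow_redirects=True)
    response.raise_for_status()                   # errors and exceptions
    print(response.status_code, response.headers.get("Content-Type"))
    print(response.history)                       # redirection history
except requests.RequestException as exc:
    print("Request failed:", exc)
```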
Parsel package
The Python parsel package offers the following features (illustrated in the sketch below):
- Extracting text using CSS or XPath selectors
- Regular expression helper methods
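A quick illustration of both features on a tiny HTML snippet:

```python
from parsel import Selector

html = '<div class="post"><a href="/blog">Blog 2019</a></div>'
selector = Selector(text=html)

print(selector.css("a::text").get())        # CSS selector   -> "Blog 2019"
print(selector.xpath("//a/@href").get())    # XPath selector -> "/blog"
print(selector.css("a::text").re(r"\d+"))   # regex helper   -> ["2019"]
```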
Crawler service using requests and Parsel
Service code : https://github.com/abdulmoizeng/crawlers-demo/blob/master/library-example/spider.py
The service is a class called "RequestManager"; it offers the following functionality (a rough sketch follows the list):
- POST and GET calls with logging
- Saving the response of each HTTP request to a log file
- Setting headers and cookies
- Session management
- User-agent spoofing
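The actual class lives in the repository linked below; the following is only an illustrative sketch of what such a service might look like, not the repo's exact API:

```python
# Illustrative sketch only -- the real RequestManager is in the linked repo.
import logging
import requests

class RequestManager:
    def __init__(self, user_agent="Mozilla/5.0 (compatible; demo-bot)"):
        self.session = requests.Session()                 # session management
        self.session.headers["User-Agent"] = user_agent   # user-agent spoofing
        logging.basicConfig(filename="requests.log", level=logging.INFO)

    def get(self, url, **kwargs):
        response = self.session.get(url, **kwargs)
        logging.info("GET %s -> %s", url, response.status_code)   # logging
        return response

    def post(self, url, data=None, **kwargs):
        response = self.session.post(url, data=data, **kwargs)
        logging.info("POST %s -> %s", url, response.status_code)
        return response
```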
Consider the real-world example described in the README and experiment with the service accordingly:
https://github.com/abdulmoizeng/crawlers-demo/tree/master/library-example
Scraping with AsyncIO
Scenario
Scrape the Recurship site and extract its links, then navigate to each link asynchronously and extract image information.
Demo code
Github link : https://github.com/abdulmoizeng/crawlers-demo/blob/master/async-crawler-demo/spider.py
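A rough sketch of the asynchronous approach, here using asyncio together with aiohttp (an assumption on my part; the demo in the repo may differ in its details):

```python
import asyncio
import aiohttp
from parsel import Selector

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        home = await fetch(session, "http://recurship.com")
        links = [l for l in Selector(text=home).css("a::attr(href)").getall()
                 if l.startswith("http")]
        # Fetch all linked pages concurrently instead of one by one
        pages = await asyncio.gather(*(fetch(session, link) for link in links),
                                     return_exceptions=True)
        for link, page in zip(links, pages):
            if isinstance(page, Exception):
                continue
            images = Selector(text=page).css("img::attr(src)").getall()
            print(link, images)

asyncio.run(main())
```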
Stats
With asyncio, a similar task that previously took around 21 seconds completes in about 6 seconds.
AsyncIO stats assessment
It is essentially the same task we performed before, but now powered by asyncio. That is a big improvement, but can we achieve even better performance? Let's explore that in the next part.
Open Source Python Frameworks for spiders
Since Python has a very rich community, there are frameworks that take care of the optimizations and configuration for us; we just need to follow their patterns.
Popular Frameworks
The following are three popular spider frameworks for Python:
- Scrapy
- Pyspider
- MechanicalSoup
Let's try the scraping scenario we have been using with one of them. I am choosing Scrapy for the demo.
Scrapy
Scrapy is a scraping framework supported by an active community with which you can build your own scraping tool.
Features offered
- Scraping and parsing tools
- Easily exports collected data in a number of formats, such as JSON or CSV, and stores it on a backend of your choosing
- A number of built-in extensions for tasks like cookie handling, user-agent spoofing, restricting crawl depth, and more (see the settings sketch after this list)
- An API for easily building your own additions
- Scrapy also offers a cloud platform to host spiders, where spiders scale on demand, from thousands to billions of requests. It is like the Heroku of spiders.
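Several of those built-in behaviours can be toggled entirely through settings. A minimal illustration (these are standard Scrapy settings and would normally live in a project's settings.py):

```python
# Example Scrapy settings illustrating the built-in extensions mentioned above.
USER_AGENT = "Mozilla/5.0 (compatible; demo-bot)"  # user-agent spoofing
COOKIES_ENABLED = True                             # cookie handling
DEPTH_LIMIT = 2                                    # restrict crawl depth
FEED_FORMAT = "json"                               # export format
FEED_URI = "output.json"                           # export destination
```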
Scenario
Scrape the Recurship site and extract its links, then navigate to each link via Scrapy and extract image information.
Demo Code
Github link : https://github.com/abdulmoizeng/crawlers-demo/blob/master/scrappy-demo/spider.py
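A minimal sketch of such a Scrapy spider for the same scenario (the class and spider names here are illustrative; the actual demo code is in the GitHub link above):

```python
import scrapy

class RecurshipSpider(scrapy.Spider):
    name = "recurship"
    start_urls = ["http://recurship.com"]

    def parse(self, response):
        # Follow every link found on the start page
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_images)

    def parse_images(self, response):
        # Yield the images found on each linked page
        yield {
            "url": response.url,
            "images": response.css("img::attr(src)").getall(),
        }
```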
Commands
scrapy runspider spider.py -o output.json
Stats
The task completes, including JSON export, within about 1 second.
Conclusion
Pretty awesome! Scrapy shows amazing speed, and remember that we can also deploy it to Scrapy Cloud (Scrapinghub). If we are only performing simple crawling that runs occasionally, we should go with the basic tools; but if we plan to scale our spiders, or want them to be performance-optimized from the beginning, we should choose one of the available spider frameworks.