Web crawling and scraping in Python

In this article we will cover the following:
- Basic crawling setup in Python
- Basic crawling with asyncio
- Scraper utility service
- Python scraping via the Scrapy framework
Web Crawler
A web crawler is an internet bot that systematically browses the World Wide Web for the purpose of extracting useful information.
Web Scraping
Extracting useful information from a webpage is termed web scraping.
Basic Crawler demo
We will be using the following tools:
- Python requests (https://pypi.org/project/requests/) module, used to build the crawler bot
- Parsel (https://parsel.readthedocs.io/en/latest/usage.html) library, used as the scraping tool
Task I
Scrape the Recurship (http://recurship.com) website and extract all links and images present on the page.
Demo Code
Github link: https://github.com/abdulmoizeng/crawlers-demo/blob/master/crawler-demo/spider.py
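The idea behind the demo, roughly, is a single request followed by Parsel selectors. Here is a minimal sketch of Task I (the actual code is in the GitHub link above; this is an illustrative approximation):

```python
# A minimal sketch of Task I: fetch the page and pull out links and images.
import requests
from parsel import Selector

response = requests.get("http://recurship.com")
selector = Selector(text=response.text)

# Extract all link targets and image sources from the page
links = selector.css("a::attr(href)").getall()
images = selector.css("img::attr(src)").getall()

print("Links found:", links)
print("Images found:", images)
```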
Task II
Scrape the Recurship site and extract its links, then navigate to each link one by one and extract image information.
Demo Code
Github link: https://github.com/abdulmoizeng/crawlers-demo/blob/master/crawler-demo/spider2.py
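A minimal sketch of Task II, assuming the same structure as the demo above (fetch the start page, then visit each link sequentially and collect images):

```python
# Sequential crawl: one request per link, one after another.
import requests
from parsel import Selector

start_response = requests.get("http://recurship.com")
links = Selector(text=start_response.text).css("a::attr(href)").getall()

images_by_link = {}
for link in links:
    if not link.startswith("http"):
        continue  # skip relative/anchor links for simplicity
    try:
        page = requests.get(link, timeout=10)
    except requests.RequestException:
        continue
    images_by_link[link] = Selector(text=page.text).css("img::attr(src)").getall()

print(images_by_link)
```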
Stats
Task II takes around 22 seconds to complete.
We have been using the Python "requests" and "parsel" packages. Here is a list of features these packages offer.
Requests package
The Python requests module offers the following features (a short sketch follows the list):
- HTTP method calls
- Working with response codes and headers
- Redirection and request history
- Session management
- Working with cookies
- Error and exception handling
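A short sketch of those features in use (the URL is just a placeholder):

```python
import requests

session = requests.Session()                      # session management
session.headers.update({"User-Agent": "my-bot"})  # custom headers
session.cookies.set("visited", "true")            # working with cookies

try:
    response = session.get("http://recurship.com", allow_redirects=True)
    response.raise_for_status()                   # errors and exceptions
    print(response.status_code, response.headers.get("Content-Type"))
    print(response.history)                       # redirection history
except requests.RequestException as exc:
    print("Request failed:", exc)
```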
Parsel package
The Python parsel package offers the following features (illustrated in the sketch below):
- Extracting text using CSS or XPath selectors
- Regular expression helper methods
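A quick illustration of both features on a tiny HTML snippet:

```python
from parsel import Selector

html = '<div class="post"><a href="/blog">Blog 2019</a></div>'
selector = Selector(text=html)

print(selector.css("a::text").get())        # CSS selector   -> "Blog 2019"
print(selector.xpath("//a/@href").get())    # XPath selector -> "/blog"
print(selector.css("a::text").re(r"\d+"))   # regex helper   -> ["2019"]
```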
Crawler service using requests and Parsel
Service code : https://github.com/abdulmoizeng/crawlers-demo/blob/master/library-example/spider.py
The service is a class called "RequestManager"; it offers the following functionality (a rough sketch follows the list):
- POST and GET calls with logging
- Saving the response of each HTTP request to a log file
- Setting headers and cookies
- Session management
- User-agent spoofing
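The actual class lives in the repository linked below; the following is only an illustrative sketch of what such a service might look like, not the repo's exact API:

```python
# Illustrative sketch only -- the real RequestManager is in the linked repo.
import logging
import requests

class RequestManager:
    def __init__(self, user_agent="Mozilla/5.0 (compatible; demo-bot)"):
        self.session = requests.Session()                 # session management
        self.session.headers["User-Agent"] = user_agent   # user-agent spoofing
        logging.basicConfig(filename="requests.log", level=logging.INFO)

    def get(self, url, **kwargs):
        response = self.session.get(url, **kwargs)
        logging.info("GET %s -> %s", url, response.status_code)   # logging
        return response

    def post(self, url, data=None, **kwargs):
        response = self.session.post(url, data=data, **kwargs)
        logging.info("POST %s -> %s", url, response.status_code)
        return response
```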
Consider the real-world example described in the README and experiment with the service accordingly:
https://github.com/abdulmoizeng/crawlers-demo/tree/master/library-example
Scraping with AsyncIO
Scenario
Scrape the Recurship site and extract its links, then navigate to each link asynchronously and extract image information.
Demo code
Github link : https://github.com/abdulmoizeng/crawlers-demo/blob/master/async-crawler-demo/spider.py
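A rough sketch of the asynchronous approach, here using asyncio together with aiohttp (an assumption on my part; the demo in the repo may differ in its details):

```python
import asyncio
import aiohttp
from parsel import Selector

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        home = await fetch(session, "http://recurship.com")
        links = [l for l in Selector(text=home).css("a::attr(href)").getall()
                 if l.startswith("http")]
        # Fetch all linked pages concurrently instead of one by one
        pages = await asyncio.gather(*(fetch(session, link) for link in links),
                                     return_exceptions=True)
        for link, page in zip(links, pages):
            if isinstance(page, Exception):
                continue
            images = Selector(text=page).css("img::attr(src)").getall()
            print(link, images)

asyncio.run(main())
```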
Stats
With asyncio, a similar task that previously took around 21 seconds completes in about 6 seconds.
AsyncIO stats assessment
It is essentially the same task we performed before, but now powered by asyncio. That is a big improvement, but can we achieve even better performance? Let's explore that in the next part.
Open Source Python Frameworks for spiders
Since Python has a very rich community, there are frameworks that take care of the optimizations and configuration for us; we just need to follow their patterns.
Popular Frameworks
The following are three popular spider frameworks for Python:
- Scrapy
- Pyspider
- MechanicalSoup
Let's try the scraping scenario we have been using with one of them. I am choosing Scrapy for the demo.
Scrapy
Scrapy is a scraping framework supported by an active community with which you can build your own scraping tool.
Features offered
- Scraping and parsing tools
- Easily exports collected data in a number of formats, such as JSON or CSV, and stores it on a backend of your choosing
- A number of built-in extensions for tasks like cookie handling, user-agent spoofing, restricting crawl depth, and more (see the settings sketch after this list)
- An API for easily building your own additions
- Scrapy also offers a cloud platform to host spiders, where spiders scale on demand, from thousands to billions of requests. It is like the Heroku of spiders.
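Several of those built-in behaviours can be toggled entirely through settings. A minimal illustration (these are standard Scrapy settings and would normally live in a project's settings.py):

```python
# Example Scrapy settings illustrating the built-in extensions mentioned above.
USER_AGENT = "Mozilla/5.0 (compatible; demo-bot)"  # user-agent spoofing
COOKIES_ENABLED = True                             # cookie handling
DEPTH_LIMIT = 2                                    # restrict crawl depth
FEED_FORMAT = "json"                               # export format
FEED_URI = "output.json"                           # export destination
```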
Scenario
Scrape the Recurship site and extract its links, then navigate to each link via Scrapy and extract image information.
Demo Code
Github link : https://github.com/abdulmoizeng/crawlers-demo/blob/master/scrappy-demo/spider.py
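A minimal sketch of such a Scrapy spider for the same scenario (the class and spider names here are illustrative; the actual demo code is in the GitHub link above):

```python
import scrapy

class RecurshipSpider(scrapy.Spider):
    name = "recurship"
    start_urls = ["http://recurship.com"]

    def parse(self, response):
        # Follow every link found on the start page
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_images)

    def parse_images(self, response):
        # Yield the images found on each linked page
        yield {
            "url": response.url,
            "images": response.css("img::attr(src)").getall(),
        }
```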
Commands
scrapy runspider spider.py -o output.json
Stats
The task completes, including JSON export, within about 1 second.
Conclusion
Pretty awesome! Scrapy shows amazing speed, and remember that we can also deploy it to Scrapy Cloud (Scrapinghub). If we are only performing simple crawling that runs occasionally, we should go with the basic tools; but if we plan to scale our spiders, or want them to be performance-optimized from the beginning, we should choose one of the available spider frameworks.