Web crawling and scraping in Python

Muhammad Abdulmoiz · Published in codeburst · 3 min read · Jan 12, 2019
[Image: Processing the webpage]

In this article we will learn the following:

  • Basic crawling setup in Python
  • Basic crawling with asyncio
  • A scraper utility service
  • Scraping with the Scrapy framework

Web Crawler

A web crawler is an internet bot that systematically browses the World Wide Web for the purpose of extracting useful information.

Web Scraping

Extracting useful information from a webpage is termed web scraping.

Basic Crawler demo

We will be using the following tools: the Python “requests” and “parsel” packages.

Task I

Scrape the Recurship website (http://recurship.com) and extract all links and images present on the page.

Demo Code

Github link: https://github.com/abdulmoizeng/crawlers-demo/blob/master/crawler-demo/spider.py
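For reference, here is a minimal sketch of what such a spider might look like with requests and parsel (illustrative only; the spider.py in the repository may differ):

import requests
from parsel import Selector

# Fetch the page and wrap the HTML in a parsel Selector.
response = requests.get("http://recurship.com")
selector = Selector(text=response.text)

# Extract every hyperlink and image source on the page.
links = selector.xpath("//a/@href").getall()
images = selector.xpath("//img/@src").getall()

print("Links:", links)
print("Images:", images)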

Task II

Scrape the Recurship site and extract its links, then navigate to each link one by one and extract image information.

Demo Code

Github link: https://github.com/abdulmoizeng/crawlers-demo/blob/master/crawler-demo/spider2.py
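The idea, sketched with the same requests and parsel setup (the repository's spider2.py may differ), is to visit each extracted link in turn and collect its images:

import requests
from parsel import Selector

BASE = "http://recurship.com"

# Collect the links from the home page first.
home = Selector(text=requests.get(BASE).text)
links = home.xpath("//a/@href").getall()

images_by_page = {}
for link in links:
    if link.startswith("#"):
        continue  # skip in-page anchors
    # Naive relative-URL resolution, good enough for a sketch.
    url = link if link.startswith("http") else BASE + "/" + link.lstrip("/")
    try:
        page = Selector(text=requests.get(url, timeout=10).text)
        images_by_page[url] = page.xpath("//img/@src").getall()
    except requests.RequestException as exc:
        print(f"Failed to fetch {url}: {exc}")

print(images_by_page)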

Stats

Task II takes around 22 seconds to complete

We have been using the Python “requests” and “parsel” packages. Here is a list of the features these packages offer.

Requests package

The Python requests module offers the following features:

  • HTTP method calls
  • Working with response codes and headers
  • Redirection and request-history handling
  • Session management
  • Cookie handling
  • Error and exception handling
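A quick tour of those features in a minimal sketch (the URL is just a placeholder):

import requests

# HTTP method calls, response codes, and headers.
resp = requests.get("http://recurship.com")
print(resp.status_code, resp.headers.get("Content-Type"))

# Redirection and history: requests follows redirects by default.
print([r.url for r in resp.history])

# Sessions persist cookies and headers across requests.
with requests.Session() as session:
    session.cookies.set("demo", "1")  # working with cookies
    session.get("http://recurship.com")

# Errors and exceptions.
try:
    requests.get("http://recurship.com", timeout=5).raise_for_status()
except requests.RequestException as exc:
    print("Request failed:", exc)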

Parsel package

The Python parsel package offers the following features:

  • Extract text using CSS or XPath selectors
  • Regular expression helper methods
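Both features in a tiny, illustrative example:

from parsel import Selector

selector = Selector(text="<html><body><h1>Recurship demo 2019</h1></body></html>")

# Extract text using CSS or XPath selectors.
print(selector.css("h1::text").get())       # "Recurship demo 2019"
print(selector.xpath("//h1/text()").get())  # same result via XPath

# Regular expression helper methods.
print(selector.css("h1::text").re_first(r"\d+"))  # "2019"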

Crawler service using requests and parsel

Service code: https://github.com/abdulmoizeng/crawlers-demo/blob/master/library-example/spider.py

The service is implemented as a class, “RequestManager”, which offers the following functionality:

  • POST and GET calls with logging
  • Saving responses as log files of each HTTP request
  • Setting headers and cookies
  • Session management
  • Agent spoofing

Consider the real-world example described in the README and play with the service accordingly:

https://github.com/abdulmoizeng/crawlers-demo/tree/master/library-example
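To give a feel for the shape of such a service, here is a hypothetical sketch of a RequestManager-style class; the actual implementation in the repository may differ:

import logging
import requests

# Log each HTTP request to a file (the file name is an assumption).
logging.basicConfig(filename="requests.log", level=logging.INFO)

class RequestManager:
    """Wraps a requests.Session with logging and agent spoofing."""

    def __init__(self, user_agent="Mozilla/5.0 (demo-spider)"):
        self.session = requests.Session()                # session management
        self.session.headers["User-Agent"] = user_agent  # agent spoofing

    def get(self, url, **kwargs):
        response = self.session.get(url, **kwargs)
        logging.info("GET %s -> %s", url, response.status_code)
        return response

    def post(self, url, data=None, **kwargs):
        response = self.session.post(url, data=data, **kwargs)
        logging.info("POST %s -> %s", url, response.status_code)
        return response

manager = RequestManager()
print(manager.get("http://recurship.com").status_code)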

Scraping with AsyncIO

Scenario

Scrape the Recurship site and extract its links, then navigate each link asynchronously and extract image information.

Demo code

Github link : https://github.com/abdulmoizeng/crawlers-demo/blob/master/async-crawler-demo/spider.py
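A minimal sketch of the asynchronous version, assuming aiohttp as the async HTTP client (the repository's demo may use a different library). The key difference from Task II is that the links are navigated concurrently via asyncio.gather:

import asyncio
import aiohttp
from parsel import Selector

BASE = "http://recurship.com"

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def images_on(session, url):
    html = await fetch(session, url)
    return url, Selector(text=html).xpath("//img/@src").getall()

async def main():
    async with aiohttp.ClientSession() as session:
        home = await fetch(session, BASE)
        links = [href for href in Selector(text=home).xpath("//a/@href").getall()
                 if href.startswith("http")]
        # Navigate every link concurrently instead of one by one.
        results = await asyncio.gather(*(images_on(session, href) for href in links))
        for url, images in results:
            print(url, images)

asyncio.run(main())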

Stats

With asyncio, the similar task that previously took around 21 seconds now completes in about 6 seconds.

AsyncIO stats assessment

It is essentially the same task we performed before, but now asyncio gives it extra power.

That’s pretty awesome, but can we achieve even better performance? Let’s explore that in the next part.

Open Source Python Frameworks for spiders

Since Python has a very rich community, there are frameworks that take care of the optimizations and configurations for us.

We just need to follow their patterns.

Popular Frameworks

The following are three popular spider frameworks for Python:

  • Scrapy
  • Pyspider
  • MechanicalSoup

Let’s try the scraping scenario we have been using with one of these frameworks.

I am choosing Scrapy for this demo.

Scrapy

Scrapy is a scraping framework supported by an active community with which you can build your own scraping tool.

Features offered

  • Scraping and parsing tools
  • Easy export of the data it collects in a number of formats, like JSON or CSV, to a backend of your choosing
  • A number of built-in extensions for tasks like cookie handling, user-agent spoofing, restricting crawl depth, and more
  • An API for easily building your own additions
  • A cloud to host spiders, where spiders scale on demand and run at anything from thousands to billions of requests. It’s like the Heroku of spiders.

Scenario

Scrape the Recurship site and extract its links, then navigate each link via Scrapy and extract image information.

Demo Code

Github link : https://github.com/abdulmoizeng/crawlers-demo/blob/master/scrappy-demo/spider.py

Commands

scrapy runspider spider.py -o output.json
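For orientation, a hedged sketch of what such a spider.py could look like (the repository's version may be structured differently):

import scrapy

class RecurshipSpider(scrapy.Spider):
    name = "recurship"
    start_urls = ["http://recurship.com"]

    def parse(self, response):
        # Follow every link on the home page and scrape images from each.
        for href in response.xpath("//a/@href").getall():
            yield response.follow(href, callback=self.parse_images)

    def parse_images(self, response):
        yield {"page": response.url, "images": response.xpath("//img/@src").getall()}

Running it with the command above exports each {page, images} item into output.json.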

Stats

The task completed, with JSON export, within 1 second.

Conclusion

Pretty awesome! Scrapy shows amazing speed, and remember, we can also deploy our spiders to Scrapinghub’s cloud. If we are only performing a simple crawl that runs a few times, we should go with the basic tools; but if we are going to scale our spiders, or want them performance-optimized from the beginning, we should choose one of the available spider frameworks.
