codeburst

Web Scraping 101 with Python & Beautiful Soup

Ethan Jarrell
Published in codeburst
6 min read · May 5, 2018

Web scraping is a method of mining data from websites: software requests pages from the targeted site, often simulating the behavior of a human visitor, and extracts the information it finds there. Each year, more and more businesses adopt web scraping tools as part of their business intelligence and advertising initiatives.

A popular use of web scraping is to search for online deals on things like airline tickets and concerts. For example, a Python script could scrape a website when ticket sales go online, and use a bot to purchase the best tickets. A script can do this far more quickly and efficiently than a human, because it can send many requests in the time it takes a person to load a single page.

Obviously, although there can be enormous benefits to web scraping, it can also be used to cause harm, or adversely affect a business. After we talk about how to scrape the web, I’ll go over some of the reasons why you shouldn’t.

That being said, the actual code for web scraping is pretty simple.

Step 1: Find the URL you want to scrape.

One of my favorite things to scrape the web for is speeches by famous politicians. I scrape the text of a speech, and then analyze it for how often the speaker addresses certain topics or uses certain phrases. However, as with any site, some of these speeches are protected, and scraping can be prohibited. Before you start scraping a site, it’s a good idea to check the website’s rules first. The scraping rules are in the robots.txt file, which you can find by adding /robots.txt to the main domain of the site.
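As a rough illustration of that check, here is a minimal sketch using `urllib.robotparser` from Python's standard library. The robots.txt content, the `example.com` URLs, and the helper names `build_parser` and `can_scrape` are all made up for this example; a real site's rules will look different.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, standing in for what a site might
# serve at https://example.com/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /speeches/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def can_scrape(parser, url, user_agent="*"):
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(can_scrape(parser, "https://example.com/speeches/2018"))
print(can_scrape(parser, "https://example.com/private/notes"))
```

To check a live site instead of a sample string, you could call `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()`, which downloads and parses the file for you.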

Step 2: Identify the structure of the site’s HTML

Once you’ve found a site that you can scrape, you can use Chrome’s developer tools to inspect the site’s HTML structure. This is important because, more than likely, you’ll want to scrape data from certain HTML elements, or from elements with specific classes or IDs. With the inspect tool, you can quickly identify which elements you need to target.

Step 3: Install Beautiful Soup and Requests

There are other packages and frameworks, like Scrapy. But Beautiful Soup allows you to parse the HTML in a beautiful way, so that’s what I’m going to use. Along with Beautiful Soup, you’ll also need to install the Requests library, which will fetch the URL’s content.
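Both packages can typically be installed with `pip install beautifulsoup4 requests`. To give a feel for how the two fit together, here is a minimal sketch: Requests fetches the page, and Beautiful Soup pulls out the elements you identified in Step 2. The sample HTML, the `speech-body` class, the `title` ID, and the helper names `fetch_html` and `extract_paragraphs` are assumptions for illustration, not taken from any real site.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.text


def extract_paragraphs(html, css_class):
    """Return the text of each <p> inside the first <div> with css_class."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_=css_class)
    if container is None:
        return []
    return [p.get_text(strip=True) for p in container.find_all("p")]


# Hypothetical page markup, standing in for what fetch_html() would return.
SAMPLE_HTML = """
<html><body>
  <h1 id="title">Sample Speech</h1>
  <div class="speech-body">
    <p>My fellow citizens.</p>
    <p>Thank you.</p>
  </div>
</body></html>
"""

# Target an element by ID, then a group of elements by class.
title = BeautifulSoup(SAMPLE_HTML, "html.parser").find(id="title").get_text()
print(title)
print(extract_paragraphs(SAMPLE_HTML, "speech-body"))
```

On a real page you would pass the result of `fetch_html(url)` to `extract_paragraphs` in place of `SAMPLE_HTML`, using whatever tag, class, or ID the developer tools showed you.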
