Python For Beginners: How You Can Instantly Download Lots Of Cat Photos

Addison Chen · Published in codeburst · Jul 20, 2020

You probably clicked on this article because you happened to see the clowder of adorable cats in the photo. Congratulations, you have been click-baited! Just kidding. All jokes aside, in this article I will go over how to write a simple web scraping script in Python using a package called BeautifulSoup, which can extract specific information, such as pictures or text, from a website.

I will walk through a step-by-step process detailing how you can use Python to extract multiple images from a website, which saves you time because you will no longer have to download images one by one! The end goal is to have a good understanding of web scraping and how to use the main modules, Requests and BeautifulSoup. I will also discuss the additional modules you will need for the code. Do not worry; there will be another cute cat photo at the end of the article, so stay tuned!

Disclaimer

Please be advised that web scraping itself is not illegal in the United States. However, be sure not to break any laws or terms of service in ways that could negatively impact the targeted website. Please check whether web scraping is legal and allowed in your region before you attempt it. This article is mainly for learning purposes and for understanding what you can do with web scraping in Python.

Prerequisite On What You Need To Know

  • Do your research on miscellaneous operating system interfaces — os, system-specific parameters and functions — sys, high-level file operations — shutil, and thread-based parallelism — threading.
  • Do your research and understand the specific functions of the Requests module — make sure to download the package to your computer using pip install! This is important for the code.
  • Do your research and understand the functions of BeautifulSoup — the bs4 module — make sure to download the package to your computer using pip install! This is important for the code.
  • You need some basic knowledge of HTML, such as the source attribute — src, tags, and image links — JPEG.
  • Learn the fundamentals and syntax of Python functions — def.
  • Declare variables and learn to give convenient names to your variables.
  • Learn and practice using for loops and if statements — conditional statements in general.
  • Read up on the use of local and global variables.
  • You should learn how to use counters in your code.
  • Not required, but it is excellent practice to learn the basics of regular expression operations — re.
  • Not required, but it is good practice to learn how to use exception handling in your code.
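As a warm-up for the regular-expression bullet above, here is a minimal sketch; the pattern and file names are my own examples and are not part of the final script:

```python
import re

# A simple pattern that matches common image file extensions at the end of a name
pattern = re.compile(r"\.(?:png|jpg|jpeg|svg)$", re.IGNORECASE)

for name in ["cat.jpg", "kitten.PNG", "notes.txt"]:
    if pattern.search(name):
        print("%s looks like an image" % name)
```

Only the first two names print, since "notes.txt" does not end in an image extension.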

How To View The HTML Page Source On Any Website

You will need to know some basics of HTML when getting into web scraping in Python. I suggest that you take a look at the HTML structure of a website and see how the image links and text are embedded. To see the page source, go to any website, right-click, then click on “View page source.”

Right-click and look for “View page source.”

<img> Tag Example And What You Need To Know

Here is an example of an “img” tag as used in HTML. You should take a look at some image tag examples for your own knowledge. The overall goal is to have our code scan for the image tags in the HTML structure, then extract any images embedded in the HTML. In this case, we want to extract “sample.jpg” and download it to the respective folder.

<img src="sample.jpg" alt="This is an example" width="500" height="600">
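To see how BeautifulSoup reads an “img” tag like the one above, here is a small self-contained sketch. It reuses the example tag; “html.parser” is Python’s built-in parser, used here just to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup

html = '<img src="sample.jpg" alt="This is an example" width="500" height="600">'
soup = BeautifulSoup(html, features="html.parser")  # stdlib parser; no lxml needed

img = soup.find("img")   # Locate the first <img> tag in the document
print(img.get("src"))    # → sample.jpg
```

The get("src") call is exactly what the final script will use to pull image links out of a page.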

Overall Workflow

I will go over the workflow that demonstrates the logic and functions needed to build the code. You will see how everything works and why you will need specific functions to extract data from the HTML page source. A workflow is an excellent way to get a more precise visualization, and it will help with any roadblocks you may face.

The Workflow Before Coding

Run-Through Of The Code Structure

The first thing you should do is figure out which modules you will need for your code. In this case, we will be using the requests, sys, os, shutil, re, threading, and BeautifulSoup modules.

Note: the BeautifulSoup class lives in the “bs4” package, so we are just letting the interpreter know that BeautifulSoup will be called “bs” throughout the code. We also initialize the global COUNTER variable here, since the download function will increment it.

import requests, sys, os, shutil, re, threading
from bs4 import BeautifulSoup as bs

COUNTER = 0 # Global counter incremented on every download attempt

We are creating our first Python function, called “is_URL_Valid,” which grabs the website page source using the Requests module. The variable named “res” requests the web page called “URL” via the requests.get function. The if statement checks whether the web page status code equals “OK”; if so, the function returns a BeautifulSoup object. The first parameter, “res.text,” is the raw content of the response, and the second parameter, features="lxml", selects the fastest HTML parser available to BeautifulSoup.

def is_URL_Valid(URL):
    res = requests.get(URL)
    if res.status_code == requests.codes.ok: # FYI - HTTP status code 200 means "OK"
        return bs(res.text, features="lxml")

The second Python function will be called “downloadingImages,” where we will use the parameters “URL” and “imgName” to download the images. Since the code will run as many times as it needs to download multiple images, we increment a global variable by declaring “COUNTER” global in this function. The function first checks whether the image URL is accessible by looking at the HTTP status code. We then open a local file and copy the targeted file’s raw content into it using “shutil.copyfileobj.” Once the function has finished downloading an image, it closes the file with “file.close()” so the data is fully flushed to disk. The exception handler is there in case a request for an image fails.

def downloadingImages( URL, imgName ):

    global COUNTER
    COUNTER += 1
    try:
        res = requests.get( URL, stream=True )
        if res.status_code == requests.codes.ok:
            res.raw.decode_content = True # To handle the Content-Encoding HTTP header

            file = open( imgName, "wb" )
            shutil.copyfileobj(res.raw, file)
            file.close()
            print ("[*] Downloaded Image: %s" % imgName)

    except Exception as error:
        print ("[~] Error Occurred with %s : %s" % (imgName, error))
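As a side note on the close() step, a common alternative is a “with” block, which closes the file automatically even if the copy fails partway through. This is just an optional sketch, using an in-memory stream to stand in for “res.raw” so it can run without a network request:

```python
import io
import os
import shutil
import tempfile

def save_stream(raw, path):
    # "with" closes the file automatically, even if copyfileobj raises partway
    with open(path, "wb") as file:
        shutil.copyfileobj(raw, file)

# Demo with an in-memory stream standing in for res.raw
path = os.path.join(tempfile.gettempdir(), "demo_cat.bin")
save_stream(io.BytesIO(b"fake image bytes"), path)
print(open(path, "rb").read() == b"fake image bytes")  # → True
```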

Now we have to create the main function, which uses a variable called “html” to hold the parsed page after checking that the website is valid. We then collect every “img” tag from the parsed page into a variable named “tags.” Next, for each tag, we read the “src” attribute into a variable named “source.” We use a regular expression to validate that the link found in “src” contains an image link with a file extension such as PNG, JPG, JPEG, or SVG. Once the program finds an image link, a thread named “target” begins downloading the image.

def main():
    html = is_URL_Valid( "Insert a website link here" )
    tags = html.find_all( "img" ) # Collect every <img> tag from the parsed page
    for tag in tags:
        source = tag.get( "src" )
        if source:
            source = re.match( r"((?:http[s]?:/.*)?/(.*\.(?:png|jpg|jpeg|svg)))", source )
            if source:
                (URL, imgName) = source.groups(0)
                target = threading.Thread( target=downloadingImages, args=(URL, imgName.split("/")[-1]), daemon=True)
                target.start()

main() # Kick off the scrape
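If you want to sanity-check the regular expression on its own before wiring it into the main function, you can run it against a made-up image link (the URL below is purely an example):

```python
import re

# The same pattern used in main(), tested against an example image link
source = re.match( r"((?:http[s]?:/.*)?/(.*\.(?:png|jpg|jpeg|svg)))",
                   "https://example.com/images/cat.jpg" )
if source:
    (URL, imgName) = source.groups(0)
    print(URL)                     # the full matched link
    print(imgName.split("/")[-1])  # → cat.jpg
```

Group 1 captures the whole link and group 2 captures the trailing file path, which is why the code splits on “/” to get the bare file name to save.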

Final Results

html = is_URL_Valid( "Insert a website link here" )
# Copy and paste a website link above, then run the code!

After running the code, you should notice the output of what images were extracted and downloaded.

Now you can download and stare at cute cats for as long as you want!

Overall

The code is straightforward once you break down the workflow and figure out what you need to research further. The hardest part is figuring out how to write the regular expression for validating a website link with an image extension. Everything else is simple to understand as long as you do your research based on the prerequisites stated in this article. I believe reviewing this web scraping code is a great way to practice your Python coding, and it is very beginner-friendly!

Thank you for taking the time to check out my article. I hope this will help people who are interested in learning to code. I am primarily focused on becoming a better programmer and just coding for fun in my free time. I would love to hear from all of you! You can connect with me on LinkedIn.
