Scene: Escaping the Local Drive
Chaitanya is staring at the terminal. He’s successfully backed up the school’s files, validated input, and organized folders. But he is restless.
Chaitanya: “Ma’am, my scripts are fast, but they are trapped. They only know about the files physically sitting on my computer. I want them to look up current weather data, download new curriculum PDFs automatically, and pull the latest news for the school bulletin.”
Aditi Ma’am: “You want your code to talk to the internet. That requires a technique called Web Scraping. It is the art of writing a program that acts like a web browser, but moves a thousand times faster than a human clicking a mouse.”
Chaitanya: “Is that legal?”
Aditi Ma’am: “Usually, yes, as long as you respect the website’s terms of service and don’t overwhelm its servers by downloading 50 pages a second. But before we rip the data out of a website, let’s start with the simplest form of internet control: making Python open your actual browser.”
The webbrowser Module
Aditi Ma’am: “Python comes with a built-in module called webbrowser. It has exactly one job: opening a new tab to a specific URL.”
Python
import webbrowser
# Opens the URL in a new tab of your default browser (Chrome/Firefox/Edge)
webbrowser.open('https://inventwithpython.com/')
Chaitanya: (Runs it) “A new tab just popped open on my screen. That’s neat, but what is the practical use of that? I can just click a bookmark.”
Aditi Ma’am: “Imagine a script that opens all your daily work tabs automatically when you boot up your computer. Or better yet, imagine a script that takes text you copied to your clipboard and automatically searches a map for it.”
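A minimal sketch of such a start-of-day script might look like this (the URLs are placeholders, not real school pages):

```python
import webbrowser

# Hypothetical list of tabs to open every morning (placeholder URLs)
DAILY_TABS = [
    'https://mail.google.com',
    'https://calendar.google.com',
    'https://news.google.com',
]

def open_daily_tabs(urls):
    """Open each URL in a new tab of the default browser."""
    for url in urls:
        webbrowser.open(url)

# Call open_daily_tabs(DAILY_TABS) from a startup script to launch them all.
```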
Project: The Automatic Map Launcher (mapIt.py)
Aditi Ma’am: “The Geography teacher gave you a massive list of addresses in a text file. Every time you want to see where one is, you have to highlight it, press Ctrl-C, open a browser, go to Google Maps, click the search bar, and press Ctrl-V.”
Chaitanya: “That takes about 10 seconds per address.”
Aditi Ma’am: “Let’s reduce it to zero. We will write a script called mapIt.py. It reads your clipboard, builds the Google Maps URL automatically, and opens the tab. First, we need to understand how Google Maps formats its URLs.”
Chaitanya: “I just checked. If I search for ‘New Delhi’, the URL becomes https://www.google.com/maps/place/New+Delhi.”
Aditi Ma’am: “Exactly. So all we have to do is take the text from the clipboard, stick it onto the end of https://www.google.com/maps/place/, and open it.”
Python
import webbrowser, sys, pyperclip

# Check if command line arguments were passed
if len(sys.argv) > 1:
    # Get address from command line
    # sys.argv is a list of strings like ['mapIt.py', 'New', 'Delhi']
    address = ' '.join(sys.argv[1:])
else:
    # Get address from clipboard
    address = pyperclip.paste()

# Open the web browser
webbrowser.open('https://www.google.com/maps/place/' + address)
Chaitanya: “Wait, what is sys.argv?”
Aditi Ma’am: “Sometimes you don’t want to use the clipboard. You just want to type python mapIt.py New Delhi directly into your terminal. sys.argv is a list that captures exactly what you typed. Since index 0 is the script name (mapIt.py), we use sys.argv[1:] to grab the actual address.”
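One wrinkle worth knowing: URLs should not really contain raw spaces. The standard library’s urllib.parse.quote_plus() converts an address into the '+'-separated form Chaitanya saw in the Google Maps URL. A small sketch:

```python
from urllib.parse import quote_plus

address = 'New Delhi'
# quote_plus() replaces spaces with '+' and escapes unsafe characters
url = 'https://www.google.com/maps/place/' + quote_plus(address)
print(url)  # https://www.google.com/maps/place/New+Delhi
```

Plain concatenation usually works because browsers quietly repair the spaces, but quote_plus() also escapes characters like '&' that would otherwise break the URL.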
Chaitanya: (Copies an address from a text file and runs the script) “Whoa! It instantly opened Chrome and dropped a pin on the exact building! No typing required.”
Downloading Files with requests
Aditi Ma’am: “webbrowser is fun, but it relies on a human to look at the screen. True automation means Python downloads the web page and reads it directly. For that, we use a third-party module called requests.”
Chaitanya: “I’ll install it. pip install requests.”
Aditi Ma’am: “The internet works on a Request/Response system. You request a file from a server, and the server hands it back. We use requests.get() to download a file.”
Python
import requests
# Download a text copy of Romeo and Juliet
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
print(type(res))
Output: <class 'requests.models.Response'>
Chaitanya: “Okay, the Response object is stored in res. How do I see the actual play?”
Aditi Ma’am: “Check the text attribute of the Response object. Let’s print just the first 250 characters.”
Python
print(len(res.text))
print(res.text[:250])
Output:
Plaintext
178978
The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare
...
Chaitanya: “It downloaded a 178,000-character book in half a second! It’s just sitting there in a massive string variable.”
Checking for Errors (raise_for_status)
Aditi Ma’am: “The internet is messy. Sometimes websites are down. Sometimes you mistype the URL. If you try to download a page that doesn’t exist, requests.get() will not crash your program. It will just download a ‘404 Not Found’ error page.”
Chaitanya: “That’s bad. My script will think it downloaded a syllabus, but it actually downloaded an error page.”
Aditi Ma’am: “Exactly. To force Python to crash immediately if the download fails, you must always call res.raise_for_status() right after requests.get().”
Python
res = requests.get('https://automatetheboringstuff.com/page_that_does_not_exist')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))
Output: There was a problem: 404 Client Error: Not Found for url: ...
Chaitanya: “Good. So it triggers an exception I can catch, just like we learned in Chapter 11. I should always use raise_for_status().”
Saving Downloaded Files to the Hard Drive
Chaitanya: “Alright, I have Romeo and Juliet stored in the res.text variable. But when I close Python, it vanishes from RAM. How do I save it to my SchoolSystem folder permanently?”
Aditi Ma’am: “You use the standard open() and write() functions from Chapter 9, with one critical difference. You must open the file in Write Binary (‘wb’) mode, not just regular Write (‘w’) mode.”
Chaitanya: “Why Binary? It’s just text!”
Aditi Ma’am: “Because requests downloads raw data. Even if it is text, we want to maintain the exact Unicode encoding the server used. If you download an image or a PDF later, you absolutely must use binary mode, or the file will corrupt.”
Aditi Ma’am: “Also, you cannot just write playFile.write(res.text). A file opened in binary mode needs bytes, not a string, and loading a 10-gigabyte video into one variable would exhaust your RAM. You must write it in chunks using a for loop and the iter_content() method.”
Python
import requests
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
res.raise_for_status()
# 1. Open a new file in Write Binary ('wb') mode
playFile = open('RomeoAndJuliet.txt', 'wb')
# 2. Write the data in 100,000-byte chunks
for chunk in res.iter_content(100000):
    playFile.write(chunk)
# 3. Close the file
playFile.close()
Chaitanya: (Checks his folder) “The .txt file is there! It’s safely on my hard drive.”
Aditi Ma’am: “This is the exact loop you will use to download anything—PDFs, spreadsheets, images. iter_content(100000) ensures your script only uses 100 Kilobytes of RAM at a time, no matter how massive the file is.”
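The chunking pattern can be rehearsed without any network at all. The sketch below fakes iter_content() with a small generator (fake_iter_content is our stand-in, not part of requests) and uses a with block, which closes the file automatically:

```python
import os, tempfile

def fake_iter_content(data: bytes, chunk_size: int):
    """Stand-in for Response.iter_content(): yield data in fixed-size chunks."""
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]

data = b'x' * 250_000  # pretend this is a downloaded file
path = os.path.join(tempfile.gettempdir(), 'demo_download.bin')

# 'with' closes the file automatically, even if an error occurs mid-write
with open(path, 'wb') as f:
    for chunk in fake_iter_content(data, 100_000):
        f.write(chunk)

print(os.path.getsize(path))  # 250000
```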
Aditi Ma’am: “You can now download entire web pages to your hard drive. But right now, you are only downloading simple .txt files.”
Chaitanya: “What happens if I download a real webpage, like a Wikipedia article?”
Aditi Ma’am: “You get a massive, unreadable wall of HTML code. Millions of <tags>, <div> elements, and scripts.”
Chaitanya: “How do I extract just the specific headline or the table I want from all that garbage?”
Aditi Ma’am: “For that, you need a scalpel. In Part 2, I will teach you how to use the BeautifulSoup module to parse HTML and slice the exact data you want right off the webpage.”
PART 2
Scene: The HTML Nightmare
Chaitanya is eager to try his new requests skills on a real website. He writes a quick script to download the Wikipedia page for India and print the text to the screen.
Python
import requests
res = requests.get('https://en.wikipedia.org/wiki/India')
res.raise_for_status()
print(res.text[:500])
Chaitanya: (Hits Run) “Ah! What is this? My screen is filled with garbage!”
Output:
HTML
<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>India - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled";...
Aditi Ma’am: “That is not garbage, Chaitanya. That is HTML (Hypertext Markup Language). It is the skeleton of the internet. When you look at a website in Chrome, your browser reads all those hidden <tags> and paints a pretty picture for you. But Python sees the raw, unvarnished truth.”
Chaitanya: “But I just want the main text! I don’t want all these <meta> and <script> tags. How do I slice the actual information out of this mess?”
Aditi Ma’am: “You could try using Regular Expressions from Chapter 7…”
Chaitanya: “Write a regex to parse HTML? That sounds like a nightmare.”
Aditi Ma’am: “It is. Never use regex to parse HTML. HTML is too unpredictable. Instead, we use a third-party module called BeautifulSoup. It is a digital scalpel designed specifically to cut through HTML and extract exactly what you want.”
Installing and Boiling the Soup
Aditi Ma’am: “First, install it from your terminal: pip install beautifulsoup4.”
Aditi Ma’am: “BeautifulSoup works by taking that massive string of HTML text and turning it into a neat, organized tree of objects that Python can easily navigate. We call this ‘making the soup’.”
Python
import requests, bs4
# 1. Download the page
res = requests.get('https://nostarch.com')
res.raise_for_status()
# 2. Make the soup!
noStarchSoup = bs4.BeautifulSoup(res.text, 'html.parser')
print(type(noStarchSoup))
Output: <class 'bs4.BeautifulSoup'>
Chaitanya: “Okay, the HTML is loaded into the noStarchSoup variable. Now how do I use the scalpel?”
The select() Method and CSS Selectors
Aditi Ma’am: “The only method you really need to know is select(). You pass it a string, and it searches the entire HTML document for tags that match that string.”
Chaitanya: “What kind of string?”
Aditi Ma’am: “A CSS Selector. Web developers use CSS to style specific parts of a webpage. We can hijack those exact same selectors to find those parts. Here is your cheat sheet:”
| If you want to find… | Pass this to select() |
| --- | --- |
| Any tag named &lt;div&gt; | soup.select('div') |
| An element with id="author" | soup.select('#author') |
| All elements with class="notice" | soup.select('.notice') |
| All &lt;span&gt; tags inside a &lt;div&gt; | soup.select('div span') |
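Each row of the cheat sheet can be tried on a small hand-written HTML snippet (the HTML below is invented for illustration):

```python
import bs4

html = """
<div id="page">
  <h1 id="author">Aditi</h1>
  <p class="notice">First notice</p>
  <p class="notice">Second notice</p>
  <div><span>nested span</span></div>
</div>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')

print(len(soup.select('div')))       # 2  (outer and inner <div>)
print(len(soup.select('#author')))   # 1  (the id="author" element)
print(len(soup.select('.notice')))   # 2  (both class="notice" paragraphs)
print(len(soup.select('div span')))  # 1  (the <span> inside a <div>)
```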
Chaitanya: “Let me look at an example. Suppose the school website has an announcements page. The HTML looks like this:”
HTML
<div id="announcements">
    <p class="urgent">School is closed tomorrow!</p>
    <p>Don't forget your permission slips.</p>
</div>
Chaitanya: “If I only want the urgent announcement, I would look for the urgent class. So I’d use soup.select('.urgent')?”
Aditi Ma’am: “Exactly. Let’s write the Python code to test that.”
Python
import bs4
# Pretend we downloaded this HTML
html_doc = """
<div id="announcements">
    <p class="urgent">School is closed tomorrow!</p>
    <p>Don't forget your permission slips.</p>
</div>
"""
soup = bs4.BeautifulSoup(html_doc, 'html.parser')
# Find all elements with class="urgent"
urgent_elements = soup.select('.urgent')
print(type(urgent_elements))
print(len(urgent_elements))
Output:
Plaintext
<class 'bs4.element.ResultSet'>
1
Chaitanya: “It returned a ResultSet, which is basically a list. And the length is 1, so it found exactly one match!”
Extracting Text with getText()
Chaitanya: “But wait. If I print urgent_elements[0], I get <p class="urgent">School is closed tomorrow!</p>. It still has the HTML tags glued to it. How do I get just the clean, human-readable text?”
Aditi Ma’am: “You call the getText() method on the specific element.”
Python
clean_text = urgent_elements[0].getText()
print(clean_text)
Output: School is closed tomorrow!
Chaitanya: “That is brilliant. The tags just melted away.”
Extracting Attributes with get()
Aditi Ma’am: “Sometimes the text isn’t what you want. What if you want to find all the PDF links on a page so you can download them? A link in HTML looks like this:” <a href="https://school.edu/syllabus.pdf">Download Here</a>
Chaitanya: “If I use getText(), I’ll just get the words ‘Download Here’. That’s useless. I need the actual URL inside the href part.”
Aditi Ma’am: “Exactly. href is called an Attribute. To rip an attribute out of an HTML tag, you treat the element like a Python dictionary and use the get() method.”
Python
# Build a soup directly from the link element shown above
html_doc = '<a id="syllabus_link" href="https://school.edu/syllabus.pdf">Download Here</a>'
soup = bs4.BeautifulSoup(html_doc, 'html.parser')
link_elements = soup.select('#syllabus_link')
first_link = link_elements[0]
# Extract the URL from the href attribute
url = first_link.get('href')
print(url)
Output: https://school.edu/syllabus.pdf
Chaitanya: “I see! select() finds the tag, getText() gets the words inside it, and get() extracts the hidden URLs or data attached to it.”
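One more trick worth noting: every element also carries an attrs dictionary holding all of its attributes at once. A sketch using the same made-up syllabus link:

```python
import bs4

html = '<a id="syllabus_link" href="https://school.edu/syllabus.pdf">Download Here</a>'
soup = bs4.BeautifulSoup(html, 'html.parser')
link = soup.select('#syllabus_link')[0]

# .attrs exposes every attribute of the tag as a plain dictionary
print(link.attrs)         # {'id': 'syllabus_link', 'href': 'https://school.edu/syllabus.pdf'}
print(link.get('class'))  # None -- get() returns None for missing attributes
```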
Aditi Ma’am: “You have just mastered the three pillars of Web Scraping.”
Finding the Right CSS Selector (The Developer Tools Trick)
Chaitanya: “Ma’am, this is great if I already know the HTML of the website. But real websites are thousands of lines long. I can’t read all that to find the id or class of the one button I want to scrape.”
Aditi Ma’am: “You don’t have to. The engineers at Google already built a tool to do it for you. It’s built right into your Chrome browser.”
Chaitanya: “Where?”
Aditi Ma’am: “Go to any website. Right-click on the exact piece of text or image you want to scrape, and click ‘Inspect’ (or press F12). Your browser will split in half, revealing the HTML Matrix.”
Chaitanya: (Opens a weather website, right-clicks the temperature, and hits Inspect) “Whoa. It highlighted the exact line of HTML for that temperature!”
Aditi Ma’am: “Now, right-click that highlighted HTML line, go to Copy, and select Copy selector. Chrome will instantly generate the perfect CSS Selector string for you. You just paste it straight into your Python soup.select() code.”
Chaitanya: “That feels like cheating.”
Aditi Ma’am: “It’s not cheating; it’s efficiency. You let the browser do the heavy lifting of figuring out the complex paths like div#weather-widget > span.temp, and you let Python do the scraping.”
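A selector copied from Chrome can be pasted straight into select(). The sketch below uses invented HTML shaped like Aditi Ma’am’s weather-widget example:

```python
import bs4

# Invented HTML mimicking a weather widget
html = """
<div id="weather-widget">
  <span class="temp">31°C</span>
  <span class="humidity">40%</span>
</div>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')

# Paste the selector Chrome generated straight into select()
temp = soup.select('div#weather-widget > span.temp')[0]
print(temp.getText())  # 31°C
```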
Aditi Ma’am: “You now have all the individual pieces, Chaitanya. You know how to download pages with requests, parse them with BeautifulSoup, and save files to your hard drive.”
Chaitanya: “Are we going to build a project now?”
Aditi Ma’am: “Yes. In Part 3, we are going to combine everything into a single, unstoppable script. We are going to write a program that automatically goes to a website, finds every single image linked on the page, downloads them all, and saves them neatly into a folder on your computer.”
PART 3
Scene: The Automated Harvester
Chaitanya has his terminal open, ready to unleash his new skills.
Aditi Ma’am: “The Principal wants an offline archive of every single daily comic strip posted on a popular educational website. There are over 100 comics. If you right-click and ‘Save Image As’ for every single one, it will take you hours. We are going to write a script to do it in minutes.”
Chaitanya: “So the script needs to open the first page, find the comic image, download it, find the ‘Next’ button, click it, and repeat?”
Aditi Ma’am: “Exactly. But remember, requests doesn’t ‘click’ buttons. It just downloads HTML. To move to the next page, we have to use BeautifulSoup to find the URL hidden inside the ‘Next’ button, and then feed that new URL to requests.”
Step 1: Setting up the Loop and Folder
Aditi Ma’am: “First, we import our modules and create a safe folder to store the images. We use os.makedirs(exist_ok=True) so it doesn’t crash if we run the script twice.”
Python
import requests, os, bs4
url = 'https://xkcd.com' # Starting URL
os.makedirs('comics', exist_ok=True) # Store comics in ./comics
Step 2: Downloading the Page
Aditi Ma’am: “Now we start our while loop. We want it to keep running until the ‘Prev’ link points to just '#', which is how the site marks the very first comic — the end of the archive.”
Python
while not url.endswith('#'):
    print(f'Downloading page {url}...')
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
Chaitanya: “Okay, the HTML is in the soup. Now I need to find the image. I used Chrome’s ‘Inspect’ tool on the website, and I noticed the main comic image is always inside a <div> tag with the id comic. Inside that is an <img> tag.”
Aditi Ma’am: “Perfect. So your CSS selector is #comic img. Let’s extract it.”
Step 3: Finding and Extracting the Image URL
Python
    # Find the URL of the comic image
    comicElem = soup.select('#comic img')
    if comicElem == []:
        print('Could not find comic image.')
    else:
        # Extract the 'src' attribute, which holds the image link
        comicUrl = 'https:' + comicElem[0].get('src')
        # Download the actual image file
        print(f'Downloading image {comicUrl}...')
        res = requests.get(comicUrl)
        res.raise_for_status()
Chaitanya: “Wait, why did you add 'https:' + to the front of the URL?”
Aditi Ma’am: “Because some websites use relative URLs for their images, like //imgs.xkcd.com/comics/math.png. If you hand that to requests, it won’t know what protocol to use and will crash. You always have to make sure it’s a complete, absolute URL.”
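A more general fix than hard-coding 'https:' is the standard library’s urllib.parse.urljoin(), which resolves any relative URL against the page it came from (the example URLs below mirror the ones in the dialogue):

```python
from urllib.parse import urljoin

page_url = 'https://xkcd.com/2890/'

# Scheme-relative URL (starts with //): urljoin fills in the protocol
print(urljoin(page_url, '//imgs.xkcd.com/comics/math.png'))
# https://imgs.xkcd.com/comics/math.png

# Path-relative URL: resolved against the current page
print(urljoin(page_url, '/2889/'))
# https://xkcd.com/2889/
```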
Step 4: Saving the Image in Chunks
Chaitanya: “Now I have the raw image data sitting in the res variable. I need to save it to the hard drive using Write Binary ('wb') mode and chunks.”
Python
        # Save the image to the ./comics folder (still inside the else block)
        # We use os.path.basename to grab just the filename (e.g., 'math.png')
        imageFile = open(os.path.join('comics', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()
Step 5: Finding the ‘Prev’ Button to Loop
Aditi Ma’am: “The image is saved. Now, the bot needs to navigate to the previous comic to continue the archive. Inspect the ‘Prev’ button on the website.”
Chaitanya: “It looks like this: <a rel="prev" href="/2890/">...</a>. So I just need to select a[rel="prev"] and extract the href attribute!”
Python
    # Get the Prev button's url
    prevLink = soup.select('a[rel="prev"]')[0]
    # Update the main URL variable so the while loop continues to the next page!
    url = 'https://xkcd.com' + prevLink.get('href')

print('Done.')
Chaitanya: (Runs the complete script) “Look at it go! It’s downloading the pages, finding the image links, saving the PNG files, and jumping to the next URL automatically. My comics folder is filling up with images at a rate of two per second!”
Aditi Ma’am: “You have just written a web scraper. You automated a task that would have taken you all afternoon into a 30-line script.”
The Wall: JavaScript and Logins
Chaitanya: “This is incredible. I can use this to log into the school’s online portal and download my grade reports automatically!”
Aditi Ma’am: “Stop right there. Your script cannot do that.”
Chaitanya: “Why not? I can just find the login button and the username field.”
Aditi Ma’am: “requests and BeautifulSoup are fast, but they are dumb. They do not run JavaScript, they cannot fill out passwords, and they cannot click buttons. They just politely ask for a text file and read it. If a website requires you to log in, or if it loads its data using heavy JavaScript animations, BeautifulSoup will fail completely. It will just see a blank page.”
Chaitanya: “So I’m locked out of any modern, interactive website?”
Enter Selenium: The Ghost in the Machine
Aditi Ma’am: “No. When requests fails, we bring in the heavy artillery: Selenium.”
Aditi Ma’am: “Selenium is not a web scraper. It is a web driver. It physically opens a real Google Chrome browser on your screen, and uses Python to control the mouse and keyboard like a ghost.”
Chaitanya: “It actually opens the browser?”
Aditi Ma’am: “Yes. You can watch it type your password, click the ‘Submit’ button, scroll down the page, and wait for the JavaScript to load. Because it drives a real browser, most websites treat it just like a human visitor — though some sites do check for automated browsers.”
Chaitanya: “How do I use it?”
Aditi Ma’am: “You install it with pip install selenium. Then, you import the webdriver module. Here is a tiny taste of what it looks like:”
Python
from selenium import webdriver
from selenium.webdriver.common.by import By
# Open a real Chrome browser
browser = webdriver.Chrome()
# Go to a website
browser.get('https://inventwithpython.com')
# Find a link by its text and physically click it!
linkElem = browser.find_element(By.LINK_TEXT, 'Read Online for Free')
linkElem.click()
Chaitanya: “That is terrifying and amazing.”
Aditi Ma’am: “Selenium is powerful, but it is slow because it has to load all the images and scripts of a real browser.
- Rule of Thumb: Always try requests and BeautifulSoup first. They are a hundred times faster.
- Only use Selenium if you absolutely have to log in or interact with JavaScript.”
Summary Box (Chapter 12)
- webbrowser: Use webbrowser.open(url) to open a tab in your default browser.
- requests: Use requests.get(url) to download HTML or files. Always follow it with res.raise_for_status() to catch bad links.
- Saving Files: Open a file in 'wb' (Write Binary) mode, and use a for chunk in res.iter_content(100000): loop to save it without crashing your RAM.
- BeautifulSoup: Parse HTML using bs4.BeautifulSoup(res.text, 'html.parser').
- Extracting Data: Use soup.select('.className') to find tags, getText() to get the visible words, and get('href') to extract URLs or other attributes.
- selenium: When BeautifulSoup fails on login screens or dynamic JavaScript, use Selenium to control a real browser window.
Aditi Ma’am: “You have conquered the internet, Chaitanya. Your programs can now pull data from anywhere in the world. But right now, we are mostly dealing with raw text and images. Tomorrow, we dive into the language of the business world.”
Chaitanya: “What’s that?”
Aditi Ma’am: “Spreadsheets. In Chapter 13: Working with Excel Spreadsheets, you will learn how to read, write, and calculate thousands of rows of .xlsx files without ever opening Microsoft Excel.”
Are you ready to dive into Chapter 13: Working with Excel Spreadsheets, where Chaitanya learns to manipulate the business world’s favorite file format?