Chapter 7: Pattern Matching with Regular Expressions

Scene: The “Needle in a Haystack” Crisis

Chaitanya is sitting in the computer lab, his posture slumped. On his screen is a document titled Parent_Feedback_Raw.txt. It is a messy, 50-page collection of emails, notes, and scraped web forms.

Aditi Ma’am walks in, reviewing a clipboard. She notices Chaitanya’s distress.

Aditi Ma’am: You look like you’re trying to read the Matrix, Chaitanya. What’s the problem?

Chaitanya: It’s the Principal’s request. She wants a clean list of every parent’s phone number from this feedback file so we can send out the “School Closed” SMS alert. But the data is a mess!

Chaitanya: Look at this:

  • Some parents wrote: 415-555-1011
  • Others wrote: (415) 555-9999
  • Some just typed: 415 555 0000
  • And one person wrote: My number is 415.555.1234

Chaitanya: I’m trying to write a Python function to find them all, but there are too many variations.

The “Hard Way” (Without Regex)

Aditi Ma’am: Show me what you have written so far.

Chaitanya: It’s ugly. I wrote a function called isPhoneNumber() that checks every single character manually.

Python

def isPhoneNumber(text):
    if len(text) != 12:
        return False
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True

print('415-555-4242 is a phone number:')
print(isPhoneNumber('415-555-4242'))
print('Moshi moshi is a phone number:')
print(isPhoneNumber('Moshi moshi'))

Aditi Ma’am: And to find the numbers in the document?

Chaitanya: I have to loop through the entire string, chunk by chunk.

Python

message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'
for i in range(len(message)):
    chunk = message[i:i+12]
    if isPhoneNumber(chunk):
        print('Phone number found: ' + chunk)
print('Done')

Aditi Ma’am: Chaitanya, this code is fragile. It breaks if the phone number has an extension. It breaks if they use parentheses. It breaks if they use dots instead of dashes. You are trying to describe a “Phone Number” using rigid if statements. That’s like trying to draw a portrait using only a ruler.

Chaitanya: Is there a better way?

Aditi Ma’am: Yes. We use Regular Expressions (or Regex). Instead of telling the computer how to check every character, you describe what the pattern looks like.

Finding Patterns with Regex

Aditi Ma’am: Regex is a special text-processing language that Python (and almost all other modern languages) can understand. It allows you to define a “Pattern” for the data you want.

Aditi Ma’am: In your manual code, you were looking for:

  1. Three digits
  2. A hyphen
  3. Three digits
  4. A hyphen
  5. Four digits

Aditi Ma’am: In Regex, \d stands for “any digit” (0-9). So the pattern becomes: \d\d\d-\d\d\d-\d\d\d\d

Chaitanya: That is so much shorter.

Aditi Ma’am: And much more powerful. To use this in Python, we use the re module.

The Four Steps of Regex

Aditi Ma’am: Using regex always involves these four steps. Memorize them.

  1. Import: import re
  2. Compile: Create a Regex Object using re.compile().
  3. Search: Use the .search() method to look for the pattern in a string.
  4. Group: Use the .group() method to get the actual text that was found.

Chaitanya: Let’s try it on the school data.

Python

import re

# Step 2: Create the Regex Object
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

# Step 3: Search text
mo = phoneNumRegex.search('My number is 415-555-4242.')

# Step 4: Get result
print('Phone number found: ' + mo.group())

Output: Phone number found: 415-555-4242

Chaitanya: Wait, what is mo?

Aditi Ma’am: It stands for Match Object.

  • If search() finds the pattern, it returns a Match Object.
  • If it finds nothing, it returns None.

Aditi Ma’am: That’s why you will often see code like this to prevent crashes:

Python

mo = phoneNumRegex.search('My number is 415-555-4242.')
if mo:
    print('Found: ' + mo.group())
else:
    print('No match found.')
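
Aditi Ma’am: To see why that check matters, here is a minimal sketch of what happens without it. The sample sentence is made up for illustration; the point is that calling group() on None raises an AttributeError.

```python
import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

# This text contains no phone number, so search() returns None.
mo = phoneNumRegex.search('Please call the school office.')
print(mo)  # None

# Calling .group() on None raises AttributeError.
try:
    mo.group()
    crashed = False
except AttributeError:
    crashed = True
print('Would have crashed:', crashed)
```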

Chaitanya: Why did you put an r before the string? r'\d\d\d...'?

Aditi Ma’am: Good catch. Remember Raw Strings from the last chapter? In Regex, we use backslashes \ for everything (\d for digit, \w for word). Python also uses backslashes for things like newlines \n. To stop them from fighting, we always use raw strings r'...' for regex patterns. It tells Python, “Do not process the backslashes. Leave them for the Regex engine.”
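
Aditi Ma’am: A quick sketch makes the difference visible. The variable names below are only for illustration.

```python
import re

# In a normal string, '\n' is one character (a newline).
print(len('\n'))    # 1
# In a raw string, r'\n' is two characters: a backslash and the letter n.
print(len(r'\n'))   # 2

# Without a raw string, every backslash has to be doubled.
rawPattern = r'\d\d\d-\d\d\d\d'
escapedPattern = '\\d\\d\\d-\\d\\d\\d\\d'
print(rawPattern == escapedPattern)   # True

print(re.search(rawPattern, 'Call 555-4242').group())   # 555-4242
```

Both spellings give the regex engine the same text; the raw string is simply easier to read.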

Grouping with Parentheses

Chaitanya: Ma’am, sometimes I only need the Area Code. In the string 415-555-4242, I just want 415.

Aditi Ma’am: Then we use Parentheses () to create Groups. Regex isn’t just about matching; it’s about extracting.

New Pattern: (\d\d\d)-(\d\d\d-\d\d\d\d)

  • Group 1: The first (\d\d\d)
  • Group 2: The rest (\d\d\d-\d\d\d\d)

Chaitanya: How do I access them?

Aditi Ma’am: Using the group() method with a number.

Python

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')

print(mo.group(1))
print(mo.group(2))
print(mo.group(0))
print(mo.group())

Output:

415
555-4242
415-555-4242
415-555-4242

Aditi Ma’am: group(0) or just group() always gives you the entire match. group(1) gives you the first set of parentheses, and so on.

Chaitanya: What if I want all the pieces at once?

Aditi Ma’am: Use groups() (plural). It returns a tuple of all the groups, which you can unpack into separate variables.

Python

areaCode, mainNumber = mo.groups()
print('Area Code: ' + areaCode)
print('Main Number: ' + mainNumber)

Chaitanya: But wait! Parentheses have a special meaning in Regex now. What if I need to match a real parenthesis? Like (415) 555-4242?

Aditi Ma’am: You have to escape it with a backslash: \( and \).

Python

phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My phone number is (415) 555-4242.')
print(mo.group(1))

Output: (415)

Matching Multiple Options with the Pipe |

Aditi Ma’am: Sometimes, the pattern isn’t just one fixed thing. In the “Father’s Name” field, parents sometimes write “Mr. Sharma” or just “Sharma”. To keep the example simple, imagine you want to search for either “Batman” or “Tina Fey”.

Chaitanya: Do I make two regex objects?

Aditi Ma’am: No. Use the Pipe Character |. It means “OR”.

Python

heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey.')
print(mo1.group())

Output: Batman

Chaitanya: It only found Batman. What about Tina?

Aditi Ma’am: search() finds the first occurrence. Since Batman came first in the string, that’s what it returned.
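
Aditi Ma’am: What decides the result is the order in the text, not the order in the pattern. A small sketch, with the sentence flipped around for illustration:

```python
import re

heroRegex = re.compile(r'Batman|Tina Fey')

# 'Tina Fey' appears first in this text, so search() returns it,
# even though 'Batman' is listed first in the pattern.
mo2 = heroRegex.search('Tina Fey and Batman.')
print(mo2.group())   # Tina Fey
```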

Aditi Ma’am: You can also use the pipe inside parentheses. Say you want to match “Batman”, “Batmobile”, “Batcopter”, or “Batbat”.

Python

batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
print(mo.group())
print(mo.group(1))

Output:

Batmobile
mobile

Aditi Ma’am: See? mo.group() gave the whole word “Batmobile”. mo.group(1) gave just the part inside the parentheses: “mobile”. This is incredibly useful for parsing prefixes.

Optional Matching with the Question Mark ?

Chaitanya: Ma’am, back to the phone numbers. Some parents write 415-555-4242, but some just write 555-4242. My pattern \d\d\d-\d\d\d-\d\d\d\d expects an area code. It will fail on the short numbers.

Aditi Ma’am: We can make the area code Optional. We use the Question Mark ?.

  • The ? symbol says: “The group matches zero or one times.”

Chaitanya: So it’s like saying, “It might be there, it might not.”

Aditi Ma’am: Exactly. Let’s rewrite the phone regex.

Python

# The pattern says: (Area Code + hyphen) is optional.
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')

mo1 = phoneRegex.search('My number is 415-555-4242')
print(mo1.group())

mo2 = phoneRegex.search('My number is 555-4242')
print(mo2.group())

Output:

415-555-4242
555-4242

Aditi Ma’am: It worked for both! In the first case, the optional group appeared (1 match). In the second case, it didn’t (0 matches).

Chaitanya: What if I want to match the actual text “dinner?” versus “dinner”?

Aditi Ma’am: Again, because ? is a special character, you must escape it: \?.
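
Aditi Ma’am: Here is a minimal sketch of the difference. The variable names and sample sentences are made up for illustration.

```python
import re

# Unescaped: the ? makes the final 'r' optional.
optionalR = re.compile(r'dinner?')
# Escaped: \? matches a literal question mark.
literalQuestion = re.compile(r'dinner\?')

print(optionalR.search('What is for dinne').group())          # dinne
print(literalQuestion.search('What is for dinner?').group())  # dinner?
print(literalQuestion.search('Time for dinner.') is None)     # True
```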


Aditi Ma’am: That is enough for now, Chaitanya.

  • You know how to create a Regex Object.
  • You know how to Search and Group results.
  • You know how to make parts of the pattern Optional using ?.

Chaitanya: What about repeating things? Writing \d\d\d\d\d is annoying. Can I just say “5 digits”?

Aditi Ma’am: Yes. In Part 2, I will teach you the Star *, the Plus +, and Curly Brackets {}. That is where the real speed comes in.


PART 2

Scene: The “Repetition” Problem

Aditi Ma’am: Chaitanya, remember our phone number pattern? \d\d\d-\d\d\d-\d\d\d\d

Chaitanya: Yes. It works, but typing \d four times is annoying. What if I want to match a credit card number? That’s 16 digits! Do I have to type \d sixteen times?

Aditi Ma’am: No. Regex has special characters for Repetition. Think of them as multipliers in math.

The Star * (Zero or More)

Aditi Ma’am: The Star means “Match zero or more times.”

  • Imagine you are looking for “Batman”.
  • Sometimes people write “Batwoman”.
  • Sometimes people write “Batwowowoman”.

Python

batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
print(mo1.group())

mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())

mo3 = batRegex.search('The Adventures of Batwowowoman')
print(mo3.group())

Output:

Batman
Batwoman
Batwowowoman

Chaitanya: Wait. In “Batman”, there is no “wo”.

Aditi Ma’am: Exactly. “Zero times.” The group (wo) is completely optional, but if it is there, it can repeat forever.

The Plus + (One or More)

Aditi Ma’am: The Plus means “Match one or more times.”

  • It must appear at least once.
  • Bat(wo)+man requires at least one “wo”.

Python

batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batwoman')
print(mo1.group())

mo2 = batRegex.search('The Adventures of Batman')
print(mo2 is None)

Output:

Batwoman
True

Chaitanya: Ah! It failed on “Batman” because the “wo” was missing.

Aditi Ma’am: Correct. + is strict. * is lenient.

The Curly Brackets {} (Specific Counts)

Aditi Ma’am: Now, for your credit card problem. If you want exactly 16 digits, use {16}.

Regex: \d{16}

Chaitanya: That is so much better.

Aditi Ma’am: You can also give a range {min, max}.

  • Ha{3,5} matches “Haaa”, “Haaaa”, “Haaaaa”.
  • It will not match “Ha” (too short). “Haaaaaa” as a whole is too long, although search() would still find the first five a’s inside it and return “Haaaaa”.

Aditi Ma’am: You can also leave one side empty.

  • {3,} means “3 or more”.
  • {,5} means “0 to 5”.
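
Aditi Ma’am: You can check these counts yourself. This sketch uses fullmatch(), which we have not covered yet; it only succeeds when the entire string fits the pattern. The card number is made up.

```python
import re

# fullmatch() succeeds only when the entire string fits the pattern.
haRegex = re.compile(r'Ha{3,5}')
for word in ['Ha', 'Haaa', 'Haaaaa', 'Haaaaaa']:
    print(word, haRegex.fullmatch(word) is not None)

# {3,} is "3 or more"; {,5} is "0 to 5".
print(re.fullmatch(r'a{3,}', 'aaaaaaa') is not None)   # True
print(re.fullmatch(r'a{,5}', '') is not None)          # True
print(re.fullmatch(r'a{,5}', 'aaaaaa') is not None)    # False (six a's)

# Sixteen digits in a row.
cardRegex = re.compile(r'\d{16}')
print(cardRegex.search('Card: 1234567812345678').group())
```

Notice that in Ha{3,5} the braces apply only to the a. To repeat the whole “Ha”, wrap it in a group: (Ha){3,5}.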

Greedy vs. Non-Greedy Matching

Aditi Ma’am: Python’s Regex is Greedy by default. This means it always tries to match the longest possible string.

Chaitanya: Example?

Aditi Ma’am: Look at the string 'HaHaHaHaHa'. Pattern: (Ha){3,5}

Python

greedyHaRegex = re.compile(r'(Ha){3,5}')
mo1 = greedyHaRegex.search('HaHaHaHaHa')
print(mo1.group())

Output: HaHaHaHaHa

Chaitanya: It grabbed all 5.

Aditi Ma’am: Yes. Even though 3 “Ha”s would have been valid, it was greedy and took the maximum (5).

Aditi Ma’am: To make it Non-Greedy (lazy), add a Question Mark ? after the braces. Pattern: (Ha){3,5}?

Python

nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
print(mo2.group())

Output: HaHaHa (It took the minimum valid amount: 3).

Chaitanya: So ? has two meanings now?

  1. Optional Group: (\d\d\d)?
  2. Non-Greedy: {3,5}?

Aditi Ma’am: Yes. Context matters.

findall() Method

Chaitanya: search() is great, but it stops after the first match. I need all the phone numbers in the document.

Aditi Ma’am: Then use findall(). It returns a List of Strings, not a Match Object.

Python

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # No groups
text = 'Cell: 415-555-9999 Work: 212-555-0000'
print(phoneNumRegex.findall(text))

Output: ['415-555-9999', '212-555-0000']

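Aditi Ma’am: Warning: If your regex has Groups (), findall() behaves differently. With two or more groups, it returns a list of tuples, where each tuple contains the groups of one match. With exactly one group, it returns a list of strings holding just that group.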

Python

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)') # Has groups
print(phoneNumRegex.findall(text))

Output: [('415', '555', '9999'), ('212', '555', '0000')]

Chaitanya: That is actually useful! I can separate area codes instantly.
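
Aditi Ma’am: Here is a short sketch of pulling out only the area codes, in two ways. The variable names are just for illustration.

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'

# Three groups: each match becomes a tuple, so index 0 is the area code.
threeGroups = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
areaCodes = [groups[0] for groups in threeGroups.findall(text)]
print(areaCodes)   # ['415', '212']

# Exactly one group: findall() returns plain strings holding just that group.
oneGroup = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d')
print(oneGroup.findall(text))   # ['415', '212']
```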

Character Classes []

Aditi Ma’am: Up until now, \d (digit) has been our only tool. But what if we want to match words? Or vowels?

Chaitanya: Do I have to write (a|e|i|o|u)?

Aditi Ma’am: No. Use square brackets [] to define your own Character Class.

  • [aeiouAEIOU] matches any vowel.
  • [a-zA-Z0-9] matches any alphanumeric character.

Aditi Ma’am: Example: Let’s find all the vowels in “RoboCop eats baby food.”

Python

vowelRegex = re.compile(r'[aeiouAEIOU]')
print(vowelRegex.findall('RoboCop eats baby food.'))

Output: ['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o']

Negative Character Classes [^]

Aditi Ma’am: By adding a caret ^ inside the brackets, you create a Negative Class.

  • [^aeiouAEIOU] means “Match anything that is NOT a vowel.”

Python

consonantRegex = re.compile(r'[^aeiouAEIOU]')
print(consonantRegex.findall('RoboCop eats baby food.'))

Output: ['R', 'b', 'C', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.']

Chaitanya: It even matched the spaces and the period!

Aditi Ma’am: Yes. Because a space is “not a vowel.”

Common Character Class Shortcuts

Aditi Ma’am: Memorize these. They are the keys to regex.

Shorthand   Meaning          Matches
\d          Digit            0-9
\D          Not a Digit      Any char that is NOT 0-9
\w          Word             Letter, number, or underscore
\W          Not a Word       Symbols, spaces, punctuation
\s          Whitespace       Space, tab, newline
\S          Not Whitespace   Anything visible
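
Aditi Ma’am: The shortcuts become more useful when you combine them. A minimal sketch, using made-up sample strings:

```python
import re

# \d+ = one or more digits, \s = one whitespace character,
# \w+ = one or more word characters.
countRegex = re.compile(r'\d+\s\w+')
print(countRegex.findall('12 notebooks, 11 pencils, 10 erasers'))
# ['12 notebooks', '11 pencils', '10 erasers']

# \S+ = a run of non-whitespace characters; spaces and tabs separate the runs.
print(re.findall(r'\S+', 'Call  me\tnow'))   # ['Call', 'me', 'now']

# \D matches anything that is not a digit.
print(re.findall(r'\D', '4-1.5'))   # ['-', '.']
```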

Chaitanya: So if I want to match a classic Twitter username like @AlSweigart, I could use \w?

Aditi Ma’am: Yes. @ is a symbol, so it’s not \w. But the name is. Regex: @\w+
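
Aditi Ma’am: Try it on a sample sentence. The second handle below is invented for the example.

```python
import re

handleRegex = re.compile(r'@\w+')
print(handleRegex.findall('Thanks @AlSweigart and @python_dev!'))
# ['@AlSweigart', '@python_dev']

# Caution: the same pattern also fires inside email addresses.
print(handleRegex.findall('Mail parent@example.com'))   # ['@example']
```

The second call shows the limit of such a simple pattern: it cannot tell a username from the domain part of an email address.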


Aditi Ma’am: That’s it for Part 2. You now have the power to match multiple things (*, +, {}) and specific types of characters (\d, \w, []). In Part 3, we finish the chapter with the Wildcard Dot ., matching newlines, and building the final Phone & Email Extractor Project.


PART 3

Scene: The “Wildcard” Solution

Aditi Ma’am: Chaitanya, you’ve mastered specific patterns (\d, \w, [aeiou]). Now it’s time to learn how to match the unknown.

Chaitanya: What if I want to match any character? Like if I’m looking for a file extension, but I don’t know if it’s .txt, .py, or .doc?

Aditi Ma’am: You use the Dot . (Period). It is the Wildcard.

  • It matches any single character except a newline.

Python

atRegex = re.compile(r'.at')
print(atRegex.findall('The cat in the hat sat on the flat mat.'))

Output: ['cat', 'hat', 'sat', 'lat', 'mat']

Chaitanya: Wait, it matched “lat” in “flat”?

Aditi Ma’am: Yes, because . matches exactly one character. In “flat”, the dot takes the l, so the match is lat, not flat.

Matching Everything (.*)

Aditi Ma’am: Now, combine the Dot . with the Star *.

  • . = Any character.
  • * = Zero or more times.
  • .* = “Everything”.

Chaitanya: Why would I want to match everything?

Aditi Ma’am: To grab content between labels. Imagine scanning a form: First Name: Chaitanya Last Name: Sharma

Python

nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = nameRegex.search('First Name: Chaitanya Last Name: Sharma')
print(mo.group(1))
print(mo.group(2))

Output:

Chaitanya
Sharma

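Aditi Ma’am: The first (.*) captured everything up to “ Last Name:”, and the second captured the rest of the line. Simple web scrapers use the same trick to pull values out from between known labels.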
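
Aditi Ma’am: Remember that the star is greedy, so .* takes as much as it can. Adding ? makes it lazy, just like with the curly brackets. A small sketch with a made-up sentence:

```python
import re

text = '<To serve man> for dinner.>'

# Greedy: .* runs to the LAST closing bracket it can reach.
greedyRegex = re.compile(r'<.*>')
print(greedyRegex.search(text).group())   # <To serve man> for dinner.>

# Lazy: .*? stops at the FIRST closing bracket.
lazyRegex = re.compile(r'<.*?>')
print(lazyRegex.search(text).group())     # <To serve man>
```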

Matching Newlines (re.DOTALL)

Chaitanya: But you said . doesn’t match newlines. What if the address spans two lines?

Address: 123 Syntax Ave
New Delhi, India

Aditi Ma’am: By default, .* stops at the end of the line. To make the Dot match newlines too, you pass a special argument to compile(): re.DOTALL.

Python

noNewlineRegex = re.compile(r'.*')
print(noNewlineRegex.search('Serve the public trust.\nProtect the innocent.').group())
# Output: 'Serve the public trust.' (Stops at \n)

newlineRegex = re.compile(r'.*', re.DOTALL)
print(newlineRegex.search('Serve the public trust.\nProtect the innocent.').group())
# Output: 'Serve the public trust.\nProtect the innocent.'
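
Aditi Ma’am: The same idea applied to your two-line address. This is a sketch; the "Address: " label and the variable names are assumptions for the example.

```python
import re

form = 'Address: 123 Syntax Ave\nNew Delhi, India'

# Without re.DOTALL, the capture stops at the end of the first line.
oneLine = re.compile(r'Address: (.*)')
print(repr(oneLine.search(form).group(1)))
# '123 Syntax Ave'

# With re.DOTALL, the dot also matches the newline, so both lines are captured.
bothLines = re.compile(r'Address: (.*)', re.DOTALL)
print(repr(bothLines.search(form).group(1)))
# '123 Syntax Ave\nNew Delhi, India'
```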

Case-Insensitive Matching (re.IGNORECASE)

Aditi Ma’am: Users are lazy, Chaitanya. They will type “robocop”, “ROBOCOP”, or “RoBoCoP”. To match all of them without writing [rR][oO][bB]..., use re.IGNORECASE (or re.I).

Python

robocop = re.compile(r'robocop', re.I)
print(robocop.search('RoboCop is part man, part machine.').group())

Output: RoboCop

Substituting Strings (sub())

Aditi Ma’am: Regex isn’t just for finding; it’s for finding and replacing. The sub() method is like “Find and Replace” in Word.

Scenario: The Principal wants to censor the names of secret agents in a document. Pattern: Agent \w+ (Agent followed by a word).

Python

namesRegex = re.compile(r'Agent \w+')
print(namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.'))

Output: CENSORED gave the secret documents to CENSORED.

Chaitanya: That is powerful.

Aditi Ma’am: You can even use parts of the original text in the replacement! Suppose you want to replace each match with just the first letter of the agent’s name followed by asterisks: “A****”. In the replacement string, \1, \2, etc., refer to the text matched by Group 1, Group 2, and so on.

Python

agentNamesRegex = re.compile(r'Agent (\w)\w*')
print(agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.'))

Output: A**** told C**** that E**** knew B**** was a double agent.

Chaitanya: Whoa! \1 kept the first letter (Group 1) and **** replaced the rest.
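
Aditi Ma’am: The same trick can tidy up phone numbers. Here is a minimal sketch that rewrites the parenthesized style into dashes; the pattern covers only that one style and the sample text is made up.

```python
import re

# Group 1 = area code, Group 2 = first 3 digits, Group 3 = last 4 digits.
parenRegex = re.compile(r'\((\d{3})\)\s*(\d{3})-(\d{4})')

# \1, \2, \3 in the replacement insert the captured groups.
print(parenRegex.sub(r'\1-\2-\3', 'Office: (415) 555-9999'))
# Office: 415-555-9999
```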

Managing Complex Regex (re.VERBOSE)

Aditi Ma’am: Chaitanya, look at this regex for an email address: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+(\.[a-zA-Z]{2,4})

Chaitanya: My eyes hurt.

Aditi Ma’am: It is ugly. To fix this, Python lets you write regex over multiple lines with comments. You just pass re.VERBOSE to compile(). Python will ignore whitespace and comments inside the pattern.
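
Aditi Ma’am: One catch: because verbose mode ignores whitespace, a space you actually want to match must be written as \s (or escaped). A small sketch, with made-up names and sample text:

```python
import re

verboseRegex = re.compile(r'''
    \(\d{3}\)     # area code in parentheses
    \s            # the space after it, written as \s
    \d{3}-\d{4}   # main number
    ''', re.VERBOSE)
print(verboseRegex.search('Call (415) 555-4242 today').group())
# (415) 555-4242

# A bare space is silently dropped in verbose mode, so this pattern
# expects the digits to follow the closing parenthesis directly.
bareSpace = re.compile(r'\(\d{3}\) \d{3}-\d{4}', re.VERBOSE)
print(bareSpace.search('Call (415) 555-4242 today') is None)   # True
print(bareSpace.search('Call (415)555-4242 today').group())    # (415)555-4242
```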

The Grand Project: Phone & Email Extractor

Aditi Ma’am: Now, combine everything. The Goal:

  1. Get text from the Clipboard (pyperclip).
  2. Find all Phone Numbers.
  3. Find all Email Addresses.
  4. Format them into a nice list.
  5. Paste them back to the Clipboard.

Step 1: The Phone Regex (Verbose Mode)

Python

import pyperclip, re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code (optional)
    (\s|-|\.)?                        # separator (space, dash, or dot)
    (\d{3})                           # first 3 digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?   # extension (optional)
    )''', re.VERBOSE)

Aditi Ma’am: See how readable that is? We handled area codes (415), separators 415.555, and even extensions x1234.
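
Aditi Ma’am: Before wiring up the clipboard, it is worth checking the pattern against the formats from the feedback file. This is a sketch under a few assumptions: the dot after ext is escaped, the normalize() helper is hypothetical, and stripping the parentheses from the area code is my own addition so that every number comes out in the same shape.

```python
import re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code (optional)
    (\s|-|\.)?                        # separator
    (\d{3})                           # first 3 digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?   # extension (optional)
    )''', re.VERBOSE)

samples = [
    '415-555-1011',
    '(415) 555-9999',
    '415 555 0000',
    'My number is 415.555.1234',
    'Office: 415-555-4242 ext 22',
]

def normalize(groups):
    """Hypothetical helper: turn one findall() tuple into XXX-XXX-XXXX form."""
    areaCode = groups[1].strip('()')          # assumption: drop the parentheses
    number = '-'.join([areaCode, groups[3], groups[5]])
    if groups[8] != '':
        number += ' x' + groups[8]            # keep the extension digits
    return number

results = [normalize(groups)
           for sample in samples
           for groups in phoneRegex.findall(sample)]
print(results)
# ['415-555-1011', '415-555-9999', '415-555-0000', '415-555-1234', '415-555-4242 x22']
```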

Step 2: The Email Regex

Python

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # username (letters, numbers, dots, etc.)
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name (gmail, yahoo)
    (\.[a-zA-Z]{2,4})      # dot-something (.com, .edu)
    )''', re.VERBOSE)
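
Aditi Ma’am: Here is a quick check of the email pattern on a made-up sentence. Treat it as a sketch: the addresses and variable names are only for illustration.

```python
import re

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # username
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # dot-something
    )''', re.VERBOSE)

sample = 'Write to parent@example.com or info@school.edu, not to user@localhost.'

# findall() returns tuples here; index 0 is the outer group (the whole address).
found = [groups[0] for groups in emailRegex.findall(sample)]
print(found)   # ['parent@example.com', 'info@school.edu']
```

user@localhost is skipped because it has no dot-something ending. Also note that {2,4} limits the ending to four letters, so a longer ending would be cut short.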

Step 3: Finding matches in Clipboard Text

Python

# Get text off the clipboard
text = str(pyperclip.paste())

matches = [] # List to store results

# Find phones
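for groups in phoneRegex.findall(text):
    areaCode = groups[1].strip('()') # Drop parentheses: (415) becomes 415
    parts = [p for p in (areaCode, groups[3], groups[5]) if p] # Skip a missing area code
    phoneNum = '-'.join(parts) # Standardize format!
    if groups[8] != '':
        phoneNum += ' x' + groups[8] # Add extension if found
    matches.append(phoneNum)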

# Find emails
for groups in emailRegex.findall(text):
    matches.append(groups[0]) # Index 0 is the outer group: the whole email

Step 4: Putting it Back

Python

if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found.')

Aditi Ma’am: Run it. Copy that messy parent feedback file (Ctrl+A, Ctrl+C) and run the script.

Chaitanya: (Runs script)

Copied to clipboard:
415-555-1011
415-555-9999
415-555-1234
parent@example.com
info@school.edu

Chaitanya: It worked! It extracted 50 numbers and 20 emails in a split second. And it formatted them all to look the same (XXX-XXX-XXXX).

Aditi Ma’am: That is the power of Regex. You turned 50 pages of chaos into clean data.

Summary Box (Chapter 7)

  • ?: Match zero or one (Optional).
  • *: Match zero or more.
  • +: Match one or more.
  • {n}: Match exactly n times.
  • {n,}: Match n or more times.
  • \d, \w, \s: Digit, Word, Whitespace.
  • [abc]: Character Class (Match a, b, or c).
  • [^abc]: Negative Class (Match anything EXCEPT a, b, or c).
  • ^xyz: Starts with xyz.
  • xyz$: Ends with xyz.
  • .: Wildcard (any char except newline).
  • re.DOTALL: Dot matches newlines too.
  • re.IGNORECASE: Case-insensitive.
  • re.VERBOSE: Allows comments/whitespace in regex.
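
The ^ and $ anchors in this list were not demonstrated in the dialogue above. A minimal sketch, with made-up variable names and sample strings:

```python
import re

# ^ anchors the match to the start of the string.
beginsWithHello = re.compile(r'^Hello')
print(beginsWithHello.search('Hello, world!') is not None)    # True
print(beginsWithHello.search('He said Hello.') is not None)   # False

# $ anchors the match to the end of the string.
endsWithDigit = re.compile(r'\d$')
print(endsWithDigit.search('Your number is 42') is not None)  # True

# Together they require the whole string to fit the pattern.
allDigits = re.compile(r'^\d+$')
print(allDigits.search('1234567890') is not None)      # True
print(allDigits.search('12345xyz67890') is not None)   # False
```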

Aditi’s Pro-Tip: “Regex is write-only code. It is very hard to read later. Always use re.VERBOSE and add comments if your pattern is complex!”
