Regular Expressions (RegEx)

Intermediate ~30 min read

Regular expressions are a compact language for describing patterns in text. With Python's re module, you can search for patterns, validate input, extract data, and transform text. Whether you are validating email addresses or parsing log files, regex is an essential skill for text processing, data cleaning, and form validation. Once you master regex, you'll wonder how you ever lived without it!

Basic RegEx Functions

Python's re module provides several functions for pattern matching: search() finds the first match, match() checks only at the start, findall() returns all matches, split() divides text at matches, and sub() replaces matches. These are your bread-and-butter regex operations.
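The five functions above can be sketched side by side as follows. The sample text and variable names are made up for illustration; only the re calls themselves come from the standard library.

```python
import re

# Illustrative sample text (not from any real system)
text = "Order 66 shipped on 2024-03-15; order 67 ships on 2024-03-18."

# search(): first match anywhere in the string (Match object or None)
first = re.search(r"\d+", text)
print(first.group())           # 66

# match(): succeeds only if the pattern matches at the very start
print(re.match(r"\d+", text))  # None, because the text starts with "Order"

# findall(): every non-overlapping match, as a list of strings
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(dates)                   # ['2024-03-15', '2024-03-18']

# split(): break the text wherever the pattern matches
parts = re.split(r";\s*", text)
print(parts)

# sub(): replace every match with a replacement string
masked = re.sub(r"\d{4}-\d{2}-\d{2}", "<date>", text)
print(masked)
```

Note that search() and match() return a Match object (or None), while findall() and split() return plain lists and sub() returns a new string.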

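Raw Strings (r"..."): Always write regex patterns as raw strings by prefixing them with r, so Python passes backslashes through to the regex engine unchanged. Without the prefix, "\b" becomes a backspace character instead of a word boundary, and "\d" only works because Python does not recognize it as an escape sequence (newer Python versions warn about it). Get in the habit of using r"pattern" for all regex.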

Pattern Syntax and Metacharacters

Regex patterns use special characters (metacharacters) to match types of characters and define repetition. \d matches digits, \w matches word characters, + means "one or more", and * means "zero or more". These building blocks combine to create powerful patterns.

Essential Metacharacters:
\d digit, \D non-digit
\w word char (letters, digits, underscore; exactly [a-zA-Z0-9_] in ASCII mode), \W non-word
\s whitespace, \S non-whitespace
. any char (except newline)
^ start, $ end
\b word boundary
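As a rough illustration of how these metacharacters and quantifiers behave, the sketch below runs a few patterns against an invented sample string. The strings and variable names are assumptions chosen only for demonstration.

```python
import re

# Invented sample string for demonstration
sample = "Call 555-1234 or 555-9876 before 5pm, room_42."

# \d+ : one or more digits
digits = re.findall(r"\d+", sample)
print(digits)      # ['555', '1234', '555', '9876', '5', '42']

# \w+ : one or more word characters (letters, digits, underscore)
words = re.findall(r"\w+", sample)
print(words)

# \b and {n} : exactly 3 digits, a hyphen, exactly 4 digits, on word boundaries
phones = re.findall(r"\b\d{3}-\d{4}\b", sample)
print(phones)      # ['555-1234', '555-9876']

# ^ anchors at the start, $ at the end
first_word = re.findall(r"^\w+", sample)
last_token = re.findall(r"\S+$", sample)
print(first_word, last_token)  # ['Call'] ['room_42.']

# * (zero or more) versus + (one or more) versus ? (zero or one)
zero_or_more = re.findall(r"ab*", "a ab abbb")
one_or_more = re.findall(r"ab+", "a ab abbb")
optional_u = re.findall(r"colou?r", "color colour colr")
print(zero_or_more)  # ['a', 'ab', 'abbb']
print(one_or_more)   # ['ab', 'abbb']
print(optional_u)    # ['color', 'colour']
```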

Groups and Capturing

Parentheses create groups that capture matched text. You can extract parts of a match, use named groups for clarity, and reference captured groups in replacements. Groups are essential for extracting structured data from text - like pulling apart names, dates, or URLs.
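A minimal sketch of numbered groups, named groups, and group references in replacements follows. The sample entry and the group names (last, first) are hypothetical and chosen only for this example.

```python
import re

# Hypothetical record in "Last, First (YYYY-MM-DD)" form
entry = "Lovelace, Ada (1815-12-10)"

# Numbered groups: each pair of parentheses captures one part of the match
m = re.search(r"(\w+), (\w+) \((\d{4})-(\d{2})-(\d{2})\)", entry)
print(m.group(1), m.group(2))  # Lovelace Ada
print(m.groups())              # ('Lovelace', 'Ada', '1815', '12', '10')

# Named groups: (?P<name>...) lets you look parts up by name
named = re.search(r"(?P<last>\w+), (?P<first>\w+)", entry)
print(named.group("first"))    # Ada
print(named.groupdict())       # {'last': 'Lovelace', 'first': 'Ada'}

# Referencing groups in a replacement: \g<name> for named groups
swapped = re.sub(r"(?P<last>\w+), (?P<first>\w+)", r"\g<first> \g<last>", entry)
print(swapped)                 # Ada Lovelace (1815-12-10)

# ...and \1, \2, \3 for numbered groups (here: reorder the date parts)
us_date = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\2/\3/\1", entry)
print(us_date)                 # Lovelace, Ada (12/10/1815)
```

Named groups cost a few extra characters in the pattern but tend to keep the extraction code readable when a pattern has more than two or three groups.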

Greedy vs Non-Greedy: By default, quantifiers like * and + are greedy - they match as much as possible. Add ? to make them non-greedy (match minimum). For example, .* in <b>.*</b> would match too much if you have multiple tags. Use .*? for non-greedy matching.

Practical Validation Examples

Regex shines for validation tasks: checking email formats, extracting phone numbers, validating passwords, cleaning messy text, and parsing structured data. These real-world patterns demonstrate how regex solves common programming challenges.
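The tasks listed above might look something like the sketch below. Everything here is an assumption made for illustration: the helper names, the deliberately simplified email pattern (real email addresses allow far more than this), the NNN-NNN-NNNN phone format, and the password policy.

```python
import re

# Deliberately simplified email pattern; an assumption, not a full validator
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+")

def is_valid_email(address):
    """Return True if the whole string looks like name@domain.tld."""
    return EMAIL_RE.fullmatch(address) is not None

def extract_phones(text):
    """Find phone numbers in the assumed NNN-NNN-NNNN format."""
    return re.findall(r"\b\d{3}-\d{3}-\d{4}\b", text)

def is_strong_password(password):
    """Hypothetical policy: 8+ chars with a lowercase, an uppercase, and a digit."""
    return (
        len(password) >= 8
        and re.search(r"[a-z]", password) is not None
        and re.search(r"[A-Z]", password) is not None
        and re.search(r"\d", password) is not None
    )

def clean_text(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

print(is_valid_email("ada@example.com"))                     # True
print(is_valid_email("ada@example"))                         # False
print(extract_phones("Call 555-123-4567 or 555-987-6543."))  # ['555-123-4567', '555-987-6543']
print(is_strong_password("Secret123"))                       # True
print(clean_text("  too   many\t spaces\n here "))           # too many spaces here
```

Using fullmatch() for validation ensures the entire input conforms to the pattern, whereas search() would accept any string that merely contains a valid-looking fragment.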

Common Mistakes

1. Forgetting raw strings

import re

# Risky - "\d" is not a valid Python escape, so it happens to survive,
# but newer Python versions emit a warning for it
pattern = "\d+"

# Correct - use raw string
pattern = r"\d+"

# Even worse with \b (word boundary)
text = "a word in a sentence"
re.findall("\bword\b", text)  # [] - "\b" is a backspace character in Python!
re.findall(r"\bword\b", text)  # ['word'] - correct!

2. Confusing search() and match()

import re

text = "hello world"

# match() only checks the START of string
re.match(r"world", text)  # None - "world" not at start!

# search() finds anywhere in string
re.search(r"world", text)  # Match found!

# To match entire string, use anchors
re.match(r".*world$", text)  # Works
re.fullmatch(r"hello world", text)  # Better!

3. Greedy matching grabs too much

import re

html = "<b>bold</b> and <b>more</b>"

# Wrong - greedy .* runs from the first <b> to the last </b>
re.findall(r"<b>.*</b>", html)
# Returns: ['<b>bold</b> and <b>more</b>']

# Correct - non-greedy .*? stops at the nearest </b>
re.findall(r"<b>.*?</b>", html)
# Returns: ['<b>bold</b>', '<b>more</b>']

4. Not escaping special characters

import re

# Wrong - . matches ANY character
re.findall(r"3.14", "3.14 and 3x14")
# Returns: ['3.14', '3x14']

# Correct - escape the dot to match literal .
re.findall(r"3\.14", "3.14 and 3x14")
# Returns: ['3.14']

# Other chars to escape: . * + ? ^ $ [ ] { } | ( ) \
# Use re.escape() for user input
user_input = "price: $5.00"
pattern = re.escape(user_input)  # pattern is now: price:\ \$5\.00

5. Capturing when you don't need to

import re

# With one capturing group - findall() returns only the group's text
# (with several groups it returns a list of tuples)
re.findall(r"(cat|dog)s?", "cats and dogs")
# Returns: ['cat', 'dog']  # Just the group content!

# Without capturing (non-capturing group)
re.findall(r"(?:cat|dog)s?", "cats and dogs")
# Returns: ['cats', 'dogs']  # Full matches!

# Use (?:...) when you need grouping but not capturing

Exercise: Log Parser

Task: Parse log entries to extract timestamp, level, and message.

Requirements:

  • Parse log format: "[TIMESTAMP] LEVEL: message"
  • Extract the timestamp (in brackets)
  • Extract the log level (INFO, ERROR, WARNING)
  • Extract the message after the colon

Solution

import re

def parse_log(log_line):
    """Parse a log line and return components."""
    pattern = r"\[(.+?)\]\s+(INFO|ERROR|WARNING):\s+(.+)"
    match = re.match(pattern, log_line)
    if match:
        return {
            "timestamp": match.group(1),
            "level": match.group(2),
            "message": match.group(3)
        }
    return None

# Test logs
logs = [
    "[2024-03-15 10:30:45] INFO: Server started successfully",
    "[2024-03-15 10:31:02] ERROR: Connection refused",
    "[2024-03-15 10:32:15] WARNING: High memory usage detected"
]

for log in logs:
    result = parse_log(log)
    if result:
        print(f"Time: {result['timestamp']}")
        print(f"Level: {result['level']}")
        print(f"Message: {result['message']}")
        print()

Summary

  • Functions: search(), match(), findall(), split(), sub()
  • Always use: Raw strings r"pattern" for regex patterns
  • Character classes: \d (digit), \w (word), \s (space), . (any)
  • Quantifiers: * (0+), + (1+), ? (0 or 1), {n,m} (range)
  • Anchors: ^ (start), $ (end), \b (word boundary)
  • Groups: (...) capture, (?:...) non-capture, (?P<name>...) named
  • Non-greedy: Add ? after quantifier: .*?, +?
  • Escape: re.escape() for user input with special chars

What's Next?

Congratulations on completing the Modules & Packages module! You now know how to work with dates, math, JSON, and regular expressions. Next, let's learn about PIP - Python's package manager that lets you install thousands of third-party packages to extend Python's capabilities!