Regular Expressions (RegEx)
Regular expressions are a powerful language for pattern matching in text. With Python's
re module, you can search for patterns, validate input, extract data, and transform text.
Whether you are validating email addresses or parsing log files, regex is an essential skill for text processing,
data cleaning, and form validation. Once you master regex, you'll wonder how you ever lived without it!
Basic RegEx Functions
Python's re module provides several functions for pattern matching: search()
finds the first match, match() checks only at the start, findall() returns
all matches, split() divides text at matches, and sub() replaces matches.
These are your bread-and-butter regex operations.
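Here is a minimal sketch of the five functions side by side. The sample text and variable names are made up for illustration; only the re calls themselves come from the standard library.

```python
import re

# Sample text invented for this example
text = "Order 66 shipped on 2024-03-15, order 67 on 2024-03-16."

# search(): first match anywhere in the string (Match object or None)
first = re.search(r"\d+", text)
print(first.group())   # 66

# match(): succeeds only if the pattern matches at the very start
at_start = re.match(r"\d+", text)
print(at_start)        # None, because the text starts with "Order"

# findall(): every non-overlapping match, as a list of strings
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(dates)           # ['2024-03-15', '2024-03-16']

# split(): cut the string wherever the pattern matches
parts = re.split(r",\s*", "a, b,c,   d")
print(parts)           # ['a', 'b', 'c', 'd']

# sub(): replace every match with a replacement string
masked = re.sub(r"\d{4}-\d{2}-\d{2}", "<date>", text)
print(masked)          # Order 66 shipped on <date>, order 67 on <date>.
```

Note that search() and match() return a Match object (or None), while findall() and split() return plain lists and sub() returns a new string.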
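Tip: Write regex patterns as raw strings by prefixing the string with r. This prevents Python
from interpreting backslashes before the regex engine sees them. Without it, "\b" turns into a
backspace character and "\d" only works by accident (newer Python versions warn about it), but
r"\d" is always safe. Get in the habit of using r"pattern" for all regex.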
Pattern Syntax and Metacharacters
Regex patterns use special characters (metacharacters) to match types of characters and define
repetition. \d matches digits, \w matches word characters, +
means "one or more", and * means "zero or more". These building blocks combine to
create powerful patterns.
- \d digit, \D non-digit
- \w word char [a-zA-Z0-9_], \W non-word
- \s whitespace, \S non-whitespace
- . any char (except newline)
- ^ start, $ end
- \b word boundary
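To see these building blocks in action, here is a short sketch that runs a few of them against made-up sample strings. The strings and variable names are illustrative only.

```python
import re

# Sample text invented for this example
sample = "Call 555-1234 or 555-9876 before 9am, room B12."

# \d+ : one or more digits in a row
digits = re.findall(r"\d+", sample)
print(digits)        # ['555', '1234', '555', '9876', '9', '12']

# \d{3}-\d{4} : exactly three digits, a dash, exactly four digits
phones = re.findall(r"\d{3}-\d{4}", sample)
print(phones)        # ['555-1234', '555-9876']

# ab* : an "a" followed by zero or more "b" characters
zero_or_more = re.findall(r"ab*", "a ab abbb")
print(zero_or_more)  # ['a', 'ab', 'abbb']

# colou?r : the "u" is optional (zero or one)
spellings = re.findall(r"colou?r", "color colour colr")
print(spellings)     # ['color', 'colour']

# \bor\b : "or" only as a whole word, not inside "more" or "orbit"
whole_words = re.findall(r"\bor\b", "or more for orbit")
print(whole_words)   # ['or']

# ^ and $ anchor the pattern to the start and end of the string
starts_with_call = re.search(r"^Call", sample) is not None
ends_with_period = re.search(r"\.$", sample) is not None
print(starts_with_call, ends_with_period)  # True True
```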
Groups and Capturing
Parentheses create groups that capture matched text. You can extract parts of a match, use named groups for clarity, and reference captured groups in replacements. Groups are essential for extracting structured data from text - like pulling apart names, dates, or URLs.
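As a rough illustration of numbered groups, named groups, and group references in replacements, consider the sketch below. The sample sentence and the group names first and last are invented for this example.

```python
import re

# Sample sentence invented for this example
entry = "Ada Lovelace was born on 1815-12-10."

# Numbered groups: each pair of parentheses captures one piece of the match
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", entry)
full_date = m.group(0)          # group 0 is always the whole match
year, month, day = m.groups()   # the three captured pieces
print(full_date)                # 1815-12-10
print(year, month, day)         # 1815 12 10

# Named groups: (?P<name>...) lets you refer to a group by name
named = re.search(r"(?P<first>\w+) (?P<last>\w+)", entry)
name_parts = named.groupdict()
print(name_parts)               # {'first': 'Ada', 'last': 'Lovelace'}

# References in replacements: \1, \2, \3 refer to numbered groups
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", entry)
print(reordered)                # Ada Lovelace was born on 10/12/1815.

# \g<name> refers to a named group in the replacement
swapped = re.sub(r"(?P<first>\w+) (?P<last>\w+)", r"\g<last>, \g<first>", "Ada Lovelace")
print(swapped)                  # Lovelace, Ada
```

Named groups cost a few extra characters in the pattern, but they keep the extraction code readable when a pattern has more than two or three groups.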
By default, * and + are greedy: they match as much as possible. Add ? after the quantifier
to make it non-greedy (match as little as possible). For example, .* in <b>.*</b> would match
too much if the text has multiple tags, running from the first <b> to the last </b>.
Use .*? for non-greedy matching.
Practical Validation Examples
Regex shines for validation tasks: checking email formats, extracting phone numbers, validating passwords, cleaning messy text, and parsing structured data. These real-world patterns demonstrate how regex solves common programming challenges.
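The sketch below shows what a few of these tasks might look like. Everything here is an assumption made for illustration: the helper names are hypothetical, the email pattern is deliberately simple and far from a full standard-compliant check, the phone format is one common 3-3-4 digit layout, and the password policy is invented.

```python
import re

# Simplified email shape: local part, "@", domain with at least one dot.
# This is an illustrative pattern, not a complete email validator.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")

def is_valid_email(address):
    """Return True if the whole string looks like an email address."""
    return EMAIL_RE.fullmatch(address) is not None

def extract_phones(text):
    """Find numbers shaped like 555-123-4567 (assumed 3-3-4 digit format)."""
    return re.findall(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b", text)

def is_strong_password(password):
    """Assumed policy: 8+ chars with a lowercase letter, an uppercase letter, and a digit."""
    return (
        len(password) >= 8
        and re.search(r"[a-z]", password) is not None
        and re.search(r"[A-Z]", password) is not None
        and re.search(r"\d", password) is not None
    )

def clean_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

print(is_valid_email("user@example.com"))    # True
print(is_valid_email("not-an-email"))        # False
print(extract_phones("Call 555-123-4567 or 555.987.6543 today"))
# ['555-123-4567', '555.987.6543']
print(is_strong_password("Passw0rdX"))       # True
print(clean_whitespace("  too   many \n spaces "))  # too many spaces
```

Using fullmatch() for validation matters: search() would accept any string that merely contains something email-shaped.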
Common Mistakes
1. Forgetting raw strings
import re
text = "a word in a sentence"
# Risky - "\d" is not a valid string escape. It only works because Python
# keeps the backslash, and newer Python versions warn about it.
pattern = "\d+"
# Correct - use raw string
pattern = r"\d+"
# Even worse with \b (word boundary)
re.findall("\bword\b", text) # [] - "\b" is a backspace character in Python!
re.findall(r"\bword\b", text) # ['word'] - correct!
2. Confusing search() and match()
import re
text = "hello world"
# match() only checks the START of string
re.match(r"world", text) # None - "world" not at start!
# search() finds anywhere in string
re.search(r"world", text) # Match found!
# To match entire string, use anchors
re.match(r".*world$", text) # Works
re.fullmatch(r"hello world", text) # Better!
3. Greedy matching grabs too much
import re
html = "<b>bold</b> and <b>more</b>"
# Wrong - greedy .* runs from the first <b> to the last </b>
re.findall(r"<b>.*</b>", html)
# Returns: ['<b>bold</b> and <b>more</b>']
# Correct - non-greedy .*? matches minimum
re.findall(r"<b>.*?</b>", html)
# Returns: ['<b>bold</b>', '<b>more</b>']
4. Not escaping special characters
import re
# Wrong - . matches ANY character
re.findall(r"3.14", "3.14 and 3x14")
# Returns: ['3.14', '3x14']
# Correct - escape the dot to match literal .
re.findall(r"3\.14", "3.14 and 3x14")
# Returns: ['3.14']
# Other chars to escape: . * + ? ^ $ [ ] { } | ( ) \
# Use re.escape() for user input
user_input = "price: $5.00"
pattern = re.escape(user_input) # same as r"price:\ \$5\.00" (Python 3.7+)
5. Capturing when you don't need to
import re
# With a capturing group - findall() returns the group, not the full match
re.findall(r"(cat|dog)s?", "cats and dogs")
# Returns: ['cat', 'dog'] # Just the group content!
# Without capturing (non-capturing group)
re.findall(r"(?:cat|dog)s?", "cats and dogs")
# Returns: ['cats', 'dogs'] # Full matches!
# Use (?:...) when you need grouping but not capturing
Exercise: Log Parser
Task: Parse log entries to extract timestamp, level, and message.
Requirements:
- Parse log format: "[TIMESTAMP] LEVEL: message"
- Extract the timestamp (in brackets)
- Extract the log level (INFO, ERROR, WARNING)
- Extract the message after the colon
Solution
import re
def parse_log(log_line):
    """Parse a log line and return components."""
    pattern = r"\[(.+?)\]\s+(INFO|ERROR|WARNING):\s+(.+)"
    match = re.match(pattern, log_line)
    if match:
        return {
            "timestamp": match.group(1),
            "level": match.group(2),
            "message": match.group(3)
        }
    return None
# Test logs
logs = [
    "[2024-03-15 10:30:45] INFO: Server started successfully",
    "[2024-03-15 10:31:02] ERROR: Connection refused",
    "[2024-03-15 10:32:15] WARNING: High memory usage detected"
]
for log in logs:
    result = parse_log(log)
    if result:
        print(f"Time: {result['timestamp']}")
        print(f"Level: {result['level']}")
        print(f"Message: {result['message']}")
        print()
Summary
- Functions: search(), match(), findall(), split(), sub()
- Always use: Raw strings r"pattern" for regex patterns
- Character classes: \d (digit), \w (word), \s (space), . (any)
- Quantifiers: * (0+), + (1+), ? (0 or 1), {n,m} (range)
- Anchors: ^ (start), $ (end), \b (word boundary)
- Groups: (...) capture, (?:...) non-capture, (?P<name>...) named
- Non-greedy: Add ? after quantifier: .*?, +?
- Escape: re.escape() for user input with special chars
What's Next?
Congratulations on completing the Modules & Packages module! You now know how to work with dates, math, JSON, and regular expressions. Next, let's learn about PIP - Python's package manager that lets you install thousands of third-party packages to extend Python's capabilities!