Imagine you're searching through a vast library of texts, looking for specific patterns or structures rather than exact content. You might need to find all email addresses, phone numbers with various formats, or extract specific information from structured text. This is where regular expressions come in.
Regular expressions (or "regex" for short) are sequences of characters that define search patterns. Think of them as a specialized mini-language for pattern matching within text. They allow you to search for, validate, extract, and replace text that follows a pattern.
Python's re module implements regular expression operations, giving you access to this powerful pattern-matching tool. While the syntax might seem cryptic at first, mastering regular expressions will dramatically enhance your text processing capabilities.
In this lecture, we'll explore the re module and learn how to harness the power of regular expressions for various text processing tasks.
The re module provides functions and classes for working with regular expressions in Python. Let's import it and explore what it offers:
# Import the module
import re
The re module functions can be broadly categorized into several groups:
- search(), match(), and findall() for finding patterns in text
- sub() and subn() for replacing patterns
- split() for dividing strings based on pattern matches
- compile() for creating reusable pattern objects
Let's start by understanding the basic syntax of regular expressions before diving into these functions.
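Before turning to syntax, here is a minimal sketch of the replacing, splitting, and compiling groups from the list above. The sample text and the name date_re are illustrative assumptions, not part of the lecture's own examples.

```python
import re

text = "2024-01-15, 2024-02-20; 2024-03-05"

# Replacing: sub() returns the new string, subn() also returns the replacement count
slashed = re.sub(r'-', '/', text)
slashed_again, count = re.subn(r'-', '/', text)

# Splitting: divide on a comma or semicolon followed by optional whitespace
parts = re.split(r'[,;]\s*', text)

# Compiling: build the pattern object once, then call its methods
date_re = re.compile(r'\d{4}-\d{2}-\d{2}')
dates = date_re.findall(text)

print(slashed)  # 2024/01/15, 2024/02/20; 2024/03/05
print(count)    # 6
print(parts)    # ['2024-01-15', '2024-02-20', '2024-03-05']
print(dates)    # ['2024-01-15', '2024-02-20', '2024-03-05']
```

The compiled object exposes the same operations as the module-level functions (search, findall, sub, split, and so on), so either style can be used interchangeably.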
Regular expressions use a combination of literal characters and special metacharacters to define patterns. Here's an introduction to the most common elements:
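- Literal characters: Most characters match themselves (e.g., a matches the character "a").
- Metacharacters: Characters with special meaning: . ^ $ * + ? { } [ ] \ | ( )
- Escape character: The backslash (\) is used to escape metacharacters (e.g., \. matches a literal period)
- [abc]: Matches 'a', 'b', or 'c'
- [a-z]: Matches any lowercase ASCII letter
- [0-9]: Matches any ASCII digit
- [^abc]: Matches any character EXCEPT 'a', 'b', or 'c'
- \d: Matches any digit (equivalent to [0-9] in ASCII mode; for str patterns it also matches other Unicode digits)
- \D: Matches any non-digit (the complement of \d)
- \w: Matches any alphanumeric character or underscore (equivalent to [a-zA-Z0-9_] in ASCII mode; for str patterns it also matches Unicode letters and digits)
- \W: Matches any non-word character
- \s: Matches any whitespace character (space, tab, newline, etc.)
- \S: Matches any non-whitespace character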
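The elements listed above can be tried out directly with re.findall(). The following is a small sketch on an invented sample sentence; the variable names are illustrative only.

```python
import re

sample = "Order #42: 3 apples, 12 pears (ref A_7)."

digits = re.findall(r'\d+', sample)      # runs of digits
words = re.findall(r'\w+', sample)       # runs of letters, digits, and underscores
capitals = re.findall(r'[A-Z]', sample)  # character class with a range
periods = re.findall(r'\.', sample)      # escaped metacharacter: a literal period
anything = re.findall(r'.', "a.b")       # unescaped . matches any character except newline

print(digits)    # ['42', '3', '12', '7']
print(words)     # ['Order', '42', '3', 'apples', '12', 'pears', 'ref', 'A_7']
print(capitals)  # ['O', 'A']
print(periods)   # ['.']
print(anything)  # ['a', '.', 'b']
```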
# Match any 3-letter word
pattern = r'\b[a-zA-Z]{3}\b'
# Match a US phone number (e.g., 123-456-7890)
pattern = r'\d{3}-\d{3}-\d{4}'
# Match an email address (simple version)
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
# Match words that start with 'py' (e.g., 'python', 'pyenv')
pattern = r'\bpy[a-zA-Z0-9_]*\b'
Note that in Python, it's a good practice to use raw strings (prefixed with r) for regular expressions to avoid unintended backslash escaping.
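To see why the r prefix matters, consider \b: in an ordinary string literal it is the backspace character, while the regex engine needs to receive a backslash followed by b to mean "word boundary". A short sketch with an invented sample string:

```python
import re

# In a normal string, '\b' is a single backspace character
print(len('\b'))   # 1
# In a raw string, the backslash is kept, so the regex engine sees \b
print(len(r'\b'))  # 2

text = "cat concat cat."
without_raw = re.findall('\bcat\b', text)  # searches for backspace + cat + backspace
with_raw = re.findall(r'\bcat\b', text)    # searches for the whole word cat

print(without_raw)  # []
print(with_raw)     # ['cat', 'cat']
```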
The re module provides several functions for finding patterns in text. Let's explore the most commonly used ones.
# re.search(pattern, string) - Returns a Match object for the first match or None
import re
text = "Python is amazing and python is easy to learn."
pattern = r'python' # Case-sensitive search for 'python'
# Search for the pattern
match = re.search(pattern, text)
if match:
    print(f"Found '{match.group()}' at position {match.start()}-{match.end()}")
else:
    print("Pattern not found")
# Case-insensitive search with flags
match = re.search(pattern, text, re.IGNORECASE)
if match:
    print(f"Found '{match.group()}' at position {match.start()}-{match.end()}")
# re.match(pattern, string) - Matches only at the start of the string
text1 = "Python is a great language"
text2 = "I love Python programming"
# Try to match 'Python' at the beginning
match1 = re.match(r'Python', text1)
match2 = re.match(r'Python', text2)
print(f"Text 1 starts with 'Python': {match1 is not None}")
print(f"Text 2 starts with 'Python': {match2 is not None}")
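The difference between the anchoring behaviors is easiest to see side by side. The sketch below also includes re.fullmatch(), which is not covered above: it succeeds only when the entire string matches, which makes it a natural choice for validation. The sample string is an assumption for illustration.

```python
import re

text = "Python 3"

at_start = re.match(r'\d', text)                     # None: the string does not start with a digit
anywhere = re.search(r'\d', text)                    # finds '3' later in the string
prefix_only = re.match(r'Python', text)              # matches the prefix, ignores the rest
whole_fails = re.fullmatch(r'Python', text)          # None: ' 3' is left over
whole_ok = re.fullmatch(r'Python \d', text)          # the pattern covers the entire string

print(at_start)             # None
print(anywhere.group())     # 3
print(prefix_only.group())  # Python
print(whole_fails)          # None
print(whole_ok.group())     # Python 3
```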
# re.findall(pattern, string) - Returns a list of all matching strings
text = "The rain in Spain falls mainly in the plain."
pattern = r'\b\w*ain\b' # Words ending with 'ain'
matches = re.findall(pattern, text)
print(f"Words ending with 'ain': {matches}")
# Finding all email addresses in text
text = """
Contact us at support@example.com or sales@example.com.
For billing inquiries, email billing@example.com.
"""
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'  # no | inside the class: there it would match a literal pipe
emails = re.findall(email_pattern, text)
print(f"Found email addresses: {emails}")
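One detail of re.findall() that often surprises: its return value depends on how many capturing groups the pattern has. The following sketch uses a deliberately simplified address pattern (an assumption, looser than the one above) to show the three cases.

```python
import re

text = "support@example.com, sales@example.org"

# No capturing groups: a list of whole matches
whole = re.findall(r'[\w.]+@[\w.]+', text)

# One capturing group: a list of that group's text only
domains = re.findall(r'[\w.]+@([\w.]+)', text)

# Several capturing groups: a list of tuples, one element per group
pairs = re.findall(r'([\w.]+)@([\w.]+)', text)

# Non-capturing groups (?:...) group without changing the return shape
grouped = re.findall(r'(?:[\w.]+)@(?:[\w.]+)', text)

print(whole)    # ['support@example.com', 'sales@example.org']
print(domains)  # ['example.com', 'example.org']
print(pairs)    # [('support', 'example.com'), ('sales', 'example.org')]
print(grouped)  # ['support@example.com', 'sales@example.org']
```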
# re.finditer(pattern, string) - Returns an iterator over Match objects
text = "Python was created in 1991 by Guido van Rossum."
pattern = r'\d+' # Match sequences of digits
# Find all numbers
for match in re.finditer(pattern, text):
    print(f"Found number '{match.group()}' at position {match.start()}-{match.end()}")
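Because finditer() yields Match objects rather than strings, it combines well with named groups, which let each match be read as a small record. This is a sketch only; the extended sentence and the group names verb and year are assumptions made for the example.

```python
import re

text = "Python was created in 1991 by Guido van Rossum; Python 3 arrived in 2008."

# (?P<name>...) defines a named capturing group
event_re = re.compile(r'(?P<verb>created|arrived) in (?P<year>\d{4})')

# groupdict() returns the named groups of each match as a dictionary
events = [m.groupdict() for m in event_re.finditer(text)]

for event in events:
    print(f"{event['verb']}: {event['year']}")
# created: 1991
# arrived: 2008
```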
Here's an example of using advanced regex techniques to extract data from a complex structured text format:
import re
class DataExtractor:
    """
    A class for extracting structured data from complex text formats
    using advanced regular expression techniques.
    """

    def __init__(self):
        """Initialize with compiled regex patterns."""
        # Pattern for extracting key-value pairs with nested structures
        # This handles nested parentheses in values
        self.kvp_pattern = re.compile(
            r'(\w+)=\s*' +  # Key followed by equals sign
            r'(?:' +  # Start of value alternatives
            r'"((?:[^"\\]|\\.)*)"' +  # Quoted value with escape handling
            r'|' +  # OR
            r'\'((?:[^\'\\]|\\.)*)\'' +  # Single-quoted value
            r'|' +  # OR
            r'\(((?:[^()]|\([^()]*\))*)\)' +  # Parenthesized value (with one level of nesting)
            r'|' +  # OR
            r'([^,;()]+)' +  # Unquoted, non-special value
            r')'  # End of value alternatives
        )
        # Pattern for nested lists [item1, item2, [subitem1, subitem2], item3]
        self.list_pattern = re.compile(
            r'\[' +  # Opening bracket
            r'((?:' +  # Start of list content
            r'[^\[\]]*' +  # Non-bracket content
            r'|' +  # OR
            r'\[(?:[^\[\]]*)\]' +  # Nested list with non-bracket content
            r')*)' +  # End of list content
            r'\]'  # Closing bracket
        )
        # Pattern for list items (accounting for nesting)
        self.list_item_pattern = re.compile(
            r'(?:' +  # Start of item alternatives
            r'"((?:[^"\\]|\\.)*)"' +  # Quoted item
            r'|' +  # OR
            r'\'((?:[^\'\\]|\\.)*)\'' +  # Single-quoted item
            r'|' +  # OR
            r'\[((?:[^\[\]]|\[(?:[^\[\]]*)\])*)\]' +  # Nested list
            r'|' +  # OR
            r'([^,\[\]]+)' +  # Unquoted, non-special item
            r')'  # End of item alternatives
        )
    def extract_key_value_pairs(self, text):
        """
        Extract key-value pairs from structured text.

        Args:
            text: Text containing key-value pairs
        Returns:
            Dictionary of key-value pairs
        """
        result = {}
        # Find all key-value pairs
        for match in self.kvp_pattern.finditer(text):
            key = match.group(1)
            # Determine which value alternative matched
            if match.group(2) is not None:
                # Double-quoted value
                value = match.group(2)
            elif match.group(3) is not None:
                # Single-quoted value
                value = match.group(3)
            elif match.group(4) is not None:
                # Parenthesized value
                value = match.group(4)
            else:
                # Unquoted value
                value = match.group(5).strip()
            # Handle nested lists in values
            if value.startswith('[') and value.endswith(']'):
                value = self.parse_list(value)
            result[key] = value
        return result
    def parse_list(self, list_text):
        """
        Parse a text representation of a list.

        Args:
            list_text: Text of list with square brackets
        Returns:
            List of parsed items
        """
        # Remove outer brackets
        if list_text.startswith('[') and list_text.endswith(']'):
            list_text = list_text[1:-1].strip()
        items = []
        # Split on commas that aren't inside quotes, brackets, or parentheses
        depth = 0
        quote_char = None
        current_item = ""
        for char in list_text:
            if quote_char:
                # Inside quotes
                if char == quote_char and not (current_item and current_item[-1] == '\\'):
                    quote_char = None
                current_item += char
            elif char == '"' or char == "'":
                # Start of quote
                quote_char = char
                current_item += char
            elif char == '[' or char == '(':
                # Opening bracket or parenthesis
                depth += 1
                current_item += char
            elif char == ']' or char == ')':
                # Closing bracket or parenthesis
                depth -= 1
                current_item += char
            elif char == ',' and depth == 0:
                # Comma at top level
                items.append(current_item.strip())
                current_item = ""
            else:
                current_item += char
        # Add the last item
        if current_item.strip():
            items.append(current_item.strip())
        # Process each item
        processed_items = []
        for item in items:
            # Check for nested lists
            if item.startswith('[') and item.endswith(']'):
                processed_items.append(self.parse_list(item))
            # Check for quoted items
            elif (item.startswith('"') and item.endswith('"')) or (item.startswith("'") and item.endswith("'")):
                processed_items.append(item[1:-1])
            else:
                processed_items.append(item)
        return processed_items
    def extract_structured_data(self, text):
        """
        Extract structured data from text containing multiple formats.

        Args:
            text: Text to parse
        Returns:
            Dictionary of extracted data
        """
        data = {}
        # Extract key-value pairs
        kvp_data = self.extract_key_value_pairs(text)
        data.update(kvp_data)
        # Extract lists
        list_matches = self.list_pattern.findall(text)
        if list_matches:
            data['lists'] = [self.parse_list(f"[{match}]") for match in list_matches]
        return data
# Example usage
extractor = DataExtractor()
# Example of complex structured text
# (raw string, so the \" escapes in raw_data reach the parser as backslash + quote)
data_text = r"""
user_info=(name="John Smith", age=30, interests=["programming", "music", "hiking"])
settings=(theme="dark", font_size=12, notification=true)
permissions=["read", "write", ["create", "delete"], "execute"]
raw_data="This is some \"raw\" data with escaped quotes"
complex_value=(nested=(level=2, type="advanced"), format="special")
"""
# Extract data
extracted_data = extractor.extract_structured_data(data_text)
# Print the results
print("Extracted data:")
import json
print(json.dumps(extracted_data, indent=2))
# Access specific values
if 'user_info' in extracted_data:
    user_info = extracted_data['user_info']
    print(f"\nUser info: {user_info}")
    # Parsing nested structures manually if needed
    if 'interests=' in user_info:
        # Further extraction might be needed
        interests_match = re.search(r'interests=\[(.*?)\]', user_info)
        if interests_match:
            interests_text = interests_match.group(1)
            # Strip whitespace first so the quotes on both sides can be removed
            interests = [i.strip().strip('"') for i in interests_text.split(',')]
            print(f"User interests: {interests}")
This example demonstrates regular expression techniques for parsing structured text with nested elements, quoted strings, and lists. It combines capturing groups, non-capturing groups, and alternation, and it falls back to a character-by-character scan in parse_list() where a regex alone cannot track nesting depth.
Use re.compile() for patterns you'll use multiple times.
# Example of backtracking cost: greedy vs. lazy quantifiers
import re
import time
# Greedy pattern for nested tags: .* runs to the end of the string, then backs up
bad_pattern = re.compile(r'<([^>]*)>.*\1>')
# Lazy version: .*? grows one character at a time and stops at the first fit
better_pattern = re.compile(r'<([^>]*)>.*?\1>')
# Test string with deeply nested content (<div> is an illustrative tag choice)
test_string = '<div>' + '<div>' * 10 + 'content' + '</div>' * 10 + '</div>'
# Time the greedy pattern
start_time = time.perf_counter()
bad_match = bad_pattern.search(test_string)
bad_time = time.perf_counter() - start_time
print(f"Bad pattern time: {bad_time:.6f} seconds")
# Time the lazy pattern
start_time = time.perf_counter()
better_match = better_pattern.search(test_string)
better_time = time.perf_counter() - start_time
print(f"Better pattern time: {better_time:.6f} seconds")
# The two patterns match different spans, and on short inputs the timings are mostly noise
if better_time > 0:
    print(f"Time ratio (bad / better): {bad_time / better_time:.1f}x")
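Catastrophic backtracking in the strict sense usually comes from nested quantifiers, where the engine can split the same run of characters in exponentially many ways before giving up. The sketch below uses the classic textbook pattern (a+)+ rather than anything from the example above; the input sizes are small assumptions chosen so that it finishes quickly.

```python
import re
import time

nested = re.compile(r'^(a+)+$')  # nested quantifiers: many ways to split a run of a's
flat = re.compile(r'^a+$')       # accepts the same strings with a single quantifier

for n in (10, 15, 20):
    subject = 'a' * n + 'b'      # the trailing 'b' guarantees the match fails

    start = time.perf_counter()
    nested_result = nested.search(subject)
    nested_time = time.perf_counter() - start

    start = time.perf_counter()
    flat_result = flat.search(subject)
    flat_time = time.perf_counter() - start

    print(f"n={n}: nested {nested_time:.6f}s, flat {flat_time:.6f}s")
```

On a failing input, the nested pattern's running time roughly doubles with each extra character, while the flat pattern stays fast. Rewriting the pattern so that only one quantifier can consume a given character is usually the fix.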
- Forgetting to Escape Special Characters - . ^ $ * + ? { } [ ] \ | ( ) need to be escaped with \ to match literally
- Not Using Raw Strings - Omitting the r prefix can cause issues with backslashes
- Using .* Greedily - Can match more than intended; use .*? for non-greedy matching
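The greedy pitfall is easy to reproduce. In this sketch (the HTML fragment is an invented sample), the greedy pattern swallows everything between the first < and the last >, while the lazy version and a negated character class both stop at each tag.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r'<.*>', html)      # .* extends to the last > in the string
lazy = re.findall(r'<.*?>', html)       # .*? stops at the first > it can
negated = re.findall(r'<[^>]*>', html)  # "anything but >" avoids the issue entirely

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
print(negated)  # ['<b>', '</b>', '<i>', '</i>']
```

The negated character class often expresses the intent more directly than a lazy quantifier, since it states what the match may not contain.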
# Example of escaping special characters
text = "How much is $5.99?"
# Wrong pattern (missing escape for $ and .)
wrong_pattern = re.compile(r'$5.99')
if not wrong_pattern.search(text):
    print("Wrong pattern didn't match due to unescaped special characters")
# Correct pattern (with escapes)
correct_pattern = re.compile(r'\$5\.99')
if correct_pattern.search(text):
    print("Correct pattern matched with escaped special characters")
# Example of raw string importance
windows_path = "C:\\Users\\John\\Documents"
# In a normal string literal, '\Users' is a syntax error because \U starts a Unicode escape.
# Doubling the backslash fixes the literal, but the regex engine then receives \Users,
# reads \U as the start of a \UXXXXXXXX escape, and raises re.error
try:
    bad_pattern = re.compile('\\Users')  # The pattern text is \Users
    print("Matches without raw string:", bool(bad_pattern.search(windows_path)))
except re.error as e:
    print(f"Error without raw string: {e}")
# With a raw string, the regex engine receives \\Users: an escaped backslash followed by Users
good_pattern = re.compile(r'\\Users')
print("Matches with raw string:", bool(good_pattern.search(windows_path)))
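When the text to match comes from a variable rather than being typed into the pattern, re.escape() can do the escaping. A short sketch reusing the price example from above; the variable names are illustrative.

```python
import re

price = "$5.99"

# re.escape() backslash-escapes every character that is special in a regex
escaped = re.escape(price)
price_re = re.compile(escaped)

found = price_re.search("How much is $5.99?")
not_found = price_re.search("How much is $5x99?")  # the dot is literal now, so x does not match

print(escaped)        # \$5\.99
print(found.group())  # $5.99
print(not_found)      # None
```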
Document complex patterns with comments (using re.VERBOSE) or separate documentation.
# Example of a well-documented complex pattern using VERBOSE flag
email_pattern = re.compile(r"""
# Local part
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]
|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")
# @ symbol
@
# Domain
(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9])
(?::[0-9]+)?(?:/[\w-]*)?\])
""", re.VERBOSE | re.IGNORECASE)
# Test the pattern
test_emails = [
    "simple@example.com",
    "very.common@example.com",
    "disposable.style.email.with+symbol@example.com",
    "other.email-with-hyphen@example.com",
    "fully-qualified-domain@example.com",
    "user.name+tag+sorting@example.com",
    "x@example.com",
    "example-indeed@strange-example.com",
    "example@s.example",
    "invalid@example",
    "A@b@c@example.com",
    "a\"b(c)d,e:f;gi[j\\k]l@example.com"
]
for email in test_emails:
    # fullmatch() requires the whole string to match, not just a prefix
    print(f"{email}: {'Valid' if email_pattern.fullmatch(email) else 'Invalid'}")
While the re module is powerful, there are third-party libraries that offer additional features or better performance for specific use cases.
The regex module is a drop-in replacement for re that offers additional features like Unicode property support, recursive patterns, and more.
# Install with: pip install regex
import regex
# Example: Matching balanced parentheses (a recursive pattern)
text = "((a+b)*(c+d)) + (e*(f+g))"
# This pattern would be difficult with re, but regex supports recursion
pattern = regex.compile(r'\((?:[^()]++|(?R))*\)')
matches = pattern.findall(text)
print(f"Balanced parentheses expressions: {matches}")
# Unicode properties
pattern = regex.compile(r'\p{Greek}+') # Match Greek letters
matches = pattern.findall("This contains Greek: αβγδε and Latin: abcde")
print(f"Greek words: {matches}")
# Fuzzy matching
pattern = regex.compile(r'(?:fuzzy){e<=1}') # Allow up to 1 error
matches = pattern.findall("fizzy fussy fuzzi")
print(f"Fuzzy matches for 'fuzzy': {matches}")
The re2 module provides bindings to Google's RE2 regular expression library, which guarantees linear-time matching, avoiding the catastrophic backtracking issues that can occur with re.
# Install with: pip install re2
# Note: For this to work, you need the RE2 C++ library installed
try:
    import re2
    # Example usage (similar to re)
    pattern = re2.compile(r'\b\w+ing\b')
    matches = pattern.findall("Running jumping swimming walking")
    print(f"Words ending in 'ing': {matches}")
except ImportError:
    print("re2 module not installed or RE2 C++ library missing")
For more complex text processing, consider these alternatives:
# BeautifulSoup example for HTML parsing
try:
from bs4 import BeautifulSoup
html = """
"""
soup = BeautifulSoup(html, 'html.parser')
# Extract all paragraphs
paragraphs = soup.find_all('p')
print(f"Paragraphs: {[p.get_text() for p in paragraphs]}")
# Extract all links
links = soup.find_all('a')
print(f"Links: {[a['href'] for a in links]}")
except ImportError:
print("BeautifulSoup not installed")
Exercise 1: Create a Pattern Validator
Create a validator for common data patterns like phone numbers, postal codes, and IP addresses.
import re
class PatternValidator:
    """Validator for common data patterns using regular expressions."""

    def __init__(self):
        """Initialize with compiled regex patterns."""
        # Phone number pattern (US format) with various formats
        self.phone_pattern = re.compile(r'''
            (?:
                # (123) 456-7890
                \(\d{3}\)\s*\d{3}[-.\s]?\d{4} |
                # 123-456-7890
                \d{3}[-.\s]?\d{3}[-.\s]?\d{4} |
                # +1 123-456-7890
                \+\d{1,2}\s*\d{3}[-.\s]?\d{3}[-.\s]?\d{4}
            )
        ''', re.VERBOSE)
        # US Zip code pattern (12345 or 12345-6789)
        self.zipcode_pattern = re.compile(r'\b\d{5}(?:-\d{4})?\b')
        # IP address pattern (IPv4)
        self.ipv4_pattern = re.compile(r'''
            \b
            (?:
                # Ensure each octet is between 0-255
                (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
                (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
                (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
                (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
            )
            \b
        ''', re.VERBOSE)
        # Email pattern
        self.email_pattern = re.compile(r'''
            \b
            [a-zA-Z0-9._%+-]+
            @
            [a-zA-Z0-9.-]+
            \.[a-zA-Z]{2,}
            \b
        ''', re.VERBOSE)
        # URL pattern
        self.url_pattern = re.compile(r'''
            \b
            (?:
                https?://  # http:// or https://
                (?:
                    [a-zA-Z0-9]  # Domain parts
                    [a-zA-Z0-9-]*  # Domain parts (with hyphens)
                    [a-zA-Z0-9]
                    \.
                )+
                [a-zA-Z]{2,}  # TLD
                (?:/[a-zA-Z0-9._~:/?#[\]@! '()*+,;=%-]*)?  # Path
            )
            \b
        ''', re.VERBOSE)
        # Credit card pattern
        self.cc_pattern = re.compile(r'''
            \b
            (?:
                4[0-9]{12}(?:[0-9]{3})?  # Visa
                |
                5[1-5][0-9]{14}  # MasterCard
                |
                3[47][0-9]{13}  # American Express
                |
                3(?:0[0-5]|[68][0-9])[0-9]{11}  # Diners Club
                |
                6(?:011|5[0-9]{2})[0-9]{12}  # Discover
                |
                (?:2131|1800|35\d{3})\d{11}  # JCB
            )
            \b
        ''', re.VERBOSE)
    # The validators use fullmatch() so that trailing characters are rejected;
    # with match(), "12345-67890" would pass as a zip code via its "12345" prefix.
    def validate_phone(self, phone):
        """Validate a phone number."""
        return bool(self.phone_pattern.fullmatch(phone))

    def validate_zipcode(self, zipcode):
        """Validate a US zip code."""
        return bool(self.zipcode_pattern.fullmatch(zipcode))

    def validate_ipv4(self, ip):
        """Validate an IPv4 address."""
        return bool(self.ipv4_pattern.fullmatch(ip))

    def validate_email(self, email):
        """Validate an email address."""
        return bool(self.email_pattern.fullmatch(email))

    def validate_url(self, url):
        """Validate a URL."""
        return bool(self.url_pattern.fullmatch(url))

    def validate_credit_card(self, cc_number):
        """Validate a credit card number (format and checksum)."""
        # Remove spaces and dashes
        cc_number = re.sub(r'[\s-]', '', cc_number)
        # Check pattern
        if not self.cc_pattern.fullmatch(cc_number):
            return False

        # Luhn algorithm (checksum) - used for credit card validation
        def luhn_checksum(card_number):
            def digits_of(n):
                return [int(d) for d in str(n)]
            digits = digits_of(card_number)
            odd_digits = digits[-1::-2]
            even_digits = digits[-2::-2]
            checksum = sum(odd_digits)
            for d in even_digits:
                checksum += sum(digits_of(d*2))
            return checksum % 10 == 0

        # Perform Luhn check
        return luhn_checksum(cc_number)

    def find_all_patterns(self, text):
        """Find all supported patterns in the text."""
        results = {
            'phones': self.phone_pattern.findall(text),
            'zipcodes': self.zipcode_pattern.findall(text),
            'ips': self.ipv4_pattern.findall(text),
            'emails': self.email_pattern.findall(text),
            'urls': self.url_pattern.findall(text),
            'credit_cards': self.cc_pattern.findall(text)
        }
        return results
# Example usage
validator = PatternValidator()
# Test phone validation
print("Phone Validation:")
test_phones = [
    "(123) 456-7890",
    "123-456-7890",
    "123.456.7890",
    "+1 123-456-7890",
    "1234567890",
    "123-45-6789",  # SSN format, should fail
    "(123) 456-789"  # Missing digit
]
for phone in test_phones:
    print(f" {phone}: {'Valid' if validator.validate_phone(phone) else 'Invalid'}")
# Test zip code validation
print("\nZip Code Validation:")
test_zips = [
    "12345",
    "12345-6789",
    "123456",
    "1234",
    "12345-67890"
]
for zipcode in test_zips:
    print(f" {zipcode}: {'Valid' if validator.validate_zipcode(zipcode) else 'Invalid'}")
# Test IP validation
print("\nIP Address Validation:")
test_ips = [
    "192.168.1.1",
    "10.0.0.1",
    "255.255.255.255",
    "256.1.1.1",
    "192.168.1",
    "a.b.c.d"
]
for ip in test_ips:
    print(f" {ip}: {'Valid' if validator.validate_ipv4(ip) else 'Invalid'}")
# Test credit card validation
print("\nCredit Card Validation:")
test_cards = [
    "4111 1111 1111 1111",  # Visa
    "5500 0000 0000 0004",  # MasterCard
    "340000000000009",  # American Express
    "6011000000000004",  # Discover
    "1234567812345678",  # Invalid
    "4111111111111112"  # Invalid checksum
]
for card in test_cards:
    print(f" {card}: {'Valid' if validator.validate_credit_card(card) else 'Invalid'}")
# Find all patterns in a text
sample_text = """
Contact us at support@example.com or call (123) 456-7890.
Our office is located at 123 Main St, New York, NY 12345-6789.
For technical issues, connect to 192.168.1.1 or visit https://help.example.org.
For payment, we accept Visa (4111 1111 1111 1111) and MasterCard.
"""
patterns = validator.find_all_patterns(sample_text)
print("\nPatterns found in sample text:")
for pattern_type, matches in patterns.items():
    if matches:
        print(f" {pattern_type.capitalize()}: {matches}")
Exercise 2: Build a Text Template Engine
Create a simple template engine that replaces placeholders with values.
import re
class TemplateEngine:
    """
    A simple template engine that replaces placeholders in a template
    with actual values.

    Supports:
    - Simple placeholders: {{variable}}
    - Nested attributes: {{user.name}}
    - Default values: {{variable|default:value}}
    - Filters: {{variable|uppercase}}
    - Conditional blocks: {% if condition %} ... {% endif %}
    - Loop blocks: {% for item in items %} ... {% endfor %}
    """

    def __init__(self):
        """Initialize the template engine with compiled regex patterns."""
        # Variable pattern: {{variable}}, {{variable|filter}}, or {{variable|filter:argument}}
        # The optional (?::[^}]*)? part lets a filter carry an argument such as default:N/A
        self.var_pattern = re.compile(r'{{(\s*[\w.]+\s*(?:\|\s*\w+(?::[^}]*)?\s*)?)}}')
        # If block pattern: {% if condition %} ... {% endif %}
        self.if_pattern = re.compile(
            r'{%\s*if\s+([\w.]+)\s*%}(.*?)(?:{%\s*else\s*%}(.*?))?{%\s*endif\s*%}',
            re.DOTALL
        )
        # For loop pattern: {% for item in items %} ... {% endfor %}
        self.for_pattern = re.compile(
            r'{%\s*for\s+([\w]+)\s+in\s+([\w.]+)\s*%}(.*?){%\s*endfor\s*%}',
            re.DOTALL
        )
    def render(self, template, context):
        """
        Render a template with the given context.

        Args:
            template: Template string with placeholders
            context: Dictionary of values to replace placeholders
        Returns:
            Rendered template with placeholders replaced
        """
        # Process conditional blocks first
        template = self._process_conditionals(template, context)
        # Process loops
        template = self._process_loops(template, context)
        # Process variables
        template = self._process_variables(template, context)
        return template

    def _get_value_from_context(self, var_name, context):
        """
        Get a value from the context, supporting nested attributes.

        Args:
            var_name: Variable name, possibly with dots (e.g., 'user.name')
            context: Context dictionary
        Returns:
            Value from context or None if not found
        """
        parts = var_name.strip().split('.')
        value = context
        try:
            for part in parts:
                value = value[part]
            return value
        except (KeyError, TypeError):
            return None
    def _process_variables(self, template, context):
        """
        Replace all variable placeholders with their values.

        Args:
            template: Template string
            context: Context dictionary
        Returns:
            Template with variables replaced
        """
        def replace_var(match):
            var_expr = match.group(1).strip()
            # Check for filters
            if '|' in var_expr:
                var_name, filter_name = var_expr.split('|', 1)
                var_name = var_name.strip()
                filter_name = filter_name.strip()
                # Get the base value
                value = self._get_value_from_context(var_name, context)
                # Apply the filter
                if filter_name == 'uppercase':
                    return str(value).upper() if value is not None else ''
                elif filter_name == 'lowercase':
                    return str(value).lower() if value is not None else ''
                elif filter_name.startswith('default:'):
                    default_value = filter_name.split(':', 1)[1]
                    return str(value) if value is not None else default_value
                else:
                    # Unknown filter
                    return str(value) if value is not None else ''
            else:
                # No filter
                value = self._get_value_from_context(var_expr, context)
                return str(value) if value is not None else ''
        return self.var_pattern.sub(replace_var, template)
    def _process_conditionals(self, template, context):
        """
        Process if/else conditional blocks.

        Args:
            template: Template string
            context: Context dictionary
        Returns:
            Template with conditional blocks processed
        """
        def replace_if(match):
            condition_var = match.group(1).strip()
            if_body = match.group(2)
            else_body = match.group(3) if match.group(3) else ''
            # Evaluate the condition
            condition_value = self._get_value_from_context(condition_var, context)
            if condition_value:
                return if_body
            else:
                return else_body
        return self.if_pattern.sub(replace_if, template)
    def _process_loops(self, template, context):
        """
        Process for loop blocks.

        Args:
            template: Template string
            context: Context dictionary
        Returns:
            Template with loop blocks processed
        """
        def replace_for(match):
            item_var = match.group(1).strip()
            items_var = match.group(2).strip()
            loop_body = match.group(3)
            # Get the items to iterate over
            items = self._get_value_from_context(items_var, context)
            if not items:
                return ''
            # Render the loop body for each item
            result = []
            for item in items:
                # Create a new context with the loop variable
                loop_context = dict(context)
                loop_context[item_var] = item
                # Render the loop body with this context
                rendered_body = loop_body
                # Process nested loops and conditionals
                rendered_body = self._process_conditionals(rendered_body, loop_context)
                rendered_body = self._process_loops(rendered_body, loop_context)
                # Process variables
                rendered_body = self._process_variables(rendered_body, loop_context)
                result.append(rendered_body)
            return ''.join(result)
        return self.for_pattern.sub(replace_for, template)
# Example usage
template_engine = TemplateEngine()
# Simple template
template = """
Hello, {{name}}!
{% if is_admin %}
You have admin privileges.
{% else %}
You have regular user privileges.
{% endif %}
Your profile information:
- Email: {{email|lowercase}}
- Joined: {{join_date|default:N/A}}
{% if has_friends %}
Your friends:
{% for friend in friends %}
- {{friend.name}} ({{friend.email}})
{% endfor %}
{% else %}
You don't have any friends yet.
{% endif %}
"""
# Context for the template
context = {
    'name': 'John Smith',
    'email': 'JOHN@EXAMPLE.COM',
    'is_admin': True,
    'has_friends': True,
    'friends': [
        {'name': 'Alice', 'email': 'alice@example.com'},
        {'name': 'Bob', 'email': 'bob@example.com'},
        {'name': 'Charlie', 'email': 'charlie@example.com'}
    ]
}
# Render the template
rendered = template_engine.render(template, context)
print(rendered)
# Another example with different context
context2 = {
    'name': 'Jane Doe',
    'email': 'jane@example.com',
    'is_admin': False,
    'has_friends': False
}
rendered2 = template_engine.render(template, context2)
print("\nSecond rendering:")
print(rendered2)
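The core technique behind the engine above is passing a function, rather than a string, as the replacement argument of re.sub(): the function receives each Match object and returns the text to substitute. A stripped-down sketch follows; the helper name fill and its keep-unknown-placeholders behavior are assumptions for illustration, not part of the exercise solution.

```python
import re

def fill(template, values):
    """Replace {{name}} placeholders using a replacement function."""
    def lookup(match):
        name = match.group(1)
        # Leave unknown placeholders untouched instead of dropping them
        return str(values.get(name, match.group(0)))
    return re.sub(r'\{\{\s*(\w+)\s*\}\}', lookup, template)

result = fill("Hi {{ name }}, you are {{age}}. {{missing}}", {"name": "Ada", "age": 36})
print(result)  # Hi Ada, you are 36. {{missing}}
```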
Exercise 3: Create a Custom Log Parser and Analyzer
Build a log parser that extracts and analyzes information from different log formats.
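As a warm-up, here is a minimal sketch of parsing a single Common Log Format line with named groups, which avoids counting group positions by hand. The group names and the CLF_RE constant are assumptions for illustration; the sample line is the CLF example used in the solution's comments.

```python
import re
from datetime import datetime

# Named groups label each field of a Common Log Format line
CLF_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

line = '127.0.0.1 - - [02/Jan/2022:03:05:07 +0000] "GET /page.html HTTP/1.1" 200 1234'

match = CLF_RE.match(line)
fields = match.groupdict() if match else {}
if fields:
    fields['status'] = int(fields['status'])
    # %z parses the +0000 offset, giving a timezone-aware datetime
    fields['time'] = datetime.strptime(fields['time'], '%d/%b/%Y:%H:%M:%S %z')

print(fields['ip'], fields['method'], fields['path'], fields['status'])
# 127.0.0.1 GET /page.html 200
```

Unlike the solution's date helper, which drops the timezone for simplicity, this sketch keeps the offset by parsing it with %z.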
import re
from collections import defaultdict, Counter
from datetime import datetime
class LogAnalyzer:
    """
    A class for parsing and analyzing various log formats
    using regular expressions.
    """

    def __init__(self):
        """Initialize with regex patterns for different log formats."""
        # Common log format (CLF) pattern
        # Example: 127.0.0.1 - - [02/Jan/2022:03:05:07 +0000] "GET /page.html HTTP/1.1" 200 1234
        self.clf_pattern = re.compile(
            r'(\S+)\s+-\s+-\s+\[(.*?)\]\s+"(\S+)\s+(\S+)\s+([^"]*)"\s+(\d+)\s+(\d+|-)'
        )
        # Combined log format pattern (CLF + referer and user agent)
        self.combined_pattern = re.compile(
            r'(\S+)\s+-\s+-\s+\[(.*?)\]\s+"(\S+)\s+(\S+)\s+([^"]*)"\s+(\d+)\s+(\d+|-)\s+"([^"]*)"\s+"([^"]*)"'
        )
        # Error log pattern
        # Example: [Fri Jan 02 03:05:07 2022] [error] [client 127.0.0.1] File does not exist: /path/to/file
        self.error_pattern = re.compile(
            r'\[(.*?)\]\s+\[(\w+)\]\s+(?:\[client\s+(\S+)\]\s+)?(.+)'
        )
        # Custom application log pattern
        # Example: 2022-01-02 03:05:07 INFO [module] User logged in: user123
        self.app_pattern = re.compile(
            r'(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+(\w+)\s+\[([^\]]+)\]\s+(.*)'
        )
        # JSON log pattern
        self.json_pattern = re.compile(r'(\{.*\})')
    def parse_line(self, line):
        """
        Parse a single log line and determine its format.

        Args:
            line: A string containing a log entry
        Returns:
            A dictionary with parsed log information or None if format is unknown
        """
        # Try each pattern. The combined format is tried before CLF because
        # every combined line also matches the shorter CLF pattern as a prefix.
        for format_name, pattern, parser in [
            ('combined', self.combined_pattern, self._parse_combined),
            ('clf', self.clf_pattern, self._parse_clf),
            ('error', self.error_pattern, self._parse_error),
            ('application', self.app_pattern, self._parse_app),
            ('json', self.json_pattern, self._parse_json)
        ]:
            match = pattern.match(line)
            if match:
                parsed = parser(match)
                parsed['format'] = format_name
                return parsed
        # Unknown format
        return None
    def _parse_clf(self, match):
        """Parse Common Log Format (CLF) match."""
        # Take only the first seven groups so this also works for combined-format
        # matches, which carry nine groups
        ip, date_str, method, path, protocol, status, size = match.groups()[:7]
        # Parse timestamp
        timestamp = self._parse_clf_date(date_str)
        return {
            'ip': ip,
            'timestamp': timestamp,
            'datetime': date_str,
            'method': method,
            'path': path,
            'protocol': protocol,
            'status': int(status),
            'size': int(size) if size != '-' else 0
        }

    def _parse_combined(self, match):
        """Parse Combined Log Format match."""
        # Groups 8 and 9 are the fields the combined format adds to CLF
        referer, user_agent = match.group(8, 9)
        # Parse the CLF part first
        parsed = self._parse_clf(match)
        # Add the additional fields
        parsed.update({
            'referer': referer if referer != '-' else '',
            'user_agent': user_agent
        })
        return parsed
    def _parse_error(self, match):
        """Parse error log match."""
        date_str, level, ip, message = match.groups()
        # Parse timestamp
        try:
            timestamp = datetime.strptime(date_str, '%a %b %d %H:%M:%S %Y')
        except ValueError:
            timestamp = None
        return {
            'timestamp': timestamp,
            'datetime': date_str,
            'level': level,
            'ip': ip if ip else '',
            'message': message
        }

    def _parse_app(self, match):
        """Parse application log match."""
        date_str, level, module, message = match.groups()
        # Parse timestamp
        try:
            timestamp = datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S')
        except ValueError:
            timestamp = None
        return {
            'timestamp': timestamp,
            'datetime': date_str,
            'level': level,
            'module': module,
            'message': message
        }

    def _parse_json(self, match):
        """Parse JSON log match."""
        import json
        json_str = match.group(1)
        try:
            data = json.loads(json_str)
            # Add a timestamp if it exists in a known format
            if 'timestamp' in data and isinstance(data['timestamp'], str):
                try:
                    data['timestamp'] = datetime.fromisoformat(data['timestamp'].replace('Z', '+00:00'))
                except ValueError:
                    pass
            return data
        except json.JSONDecodeError:
            return {'raw': json_str}

    def _parse_clf_date(self, date_str):
        """Parse CLF date format."""
        # CLF date format: 02/Jan/2022:03:05:07 +0000
        try:
            # Remove timezone for simplicity
            date_part = date_str.split(' ')[0]
            return datetime.strptime(date_part, '%d/%b/%Y:%H:%M:%S')
        except ValueError:
            return None
    def parse_file(self, file_path):
        """
        Parse a log file.

        Args:
            file_path: Path to the log file
        Returns:
            List of parsed log entries
        """
        entries = []
        try:
            with open(file_path, 'r', encoding='utf-8', errors='replace') as f:
                for line_num, line in enumerate(f, 1):
                    line = line.strip()
                    if not line:
                        continue
                    entry = self.parse_line(line)
                    if entry:
                        entry['line_number'] = line_num
                        entry['raw'] = line
                        entries.append(entry)
                    else:
                        # Unknown format
                        entries.append({
                            'format': 'unknown',
                            'line_number': line_num,
                            'raw': line
                        })
        except Exception as e:
            print(f"Error parsing log file: {e}")
        return entries
    def analyze_logs(self, entries):
        """
        Analyze log entries to extract useful information.

        Args:
            entries: List of parsed log entries
        Returns:
            Dictionary with analysis results
        """
        results = {
            'counts': {
                'total': len(entries),
                'by_format': Counter(),
                'by_status': Counter(),
                'by_method': Counter(),
                'by_level': Counter(),
                'by_date': Counter(),
                'by_hour': Counter(),
                'by_ip': Counter()
            },
            'status_codes': {
                'success': 0,  # 2xx
                'redirect': 0,  # 3xx
                'client_error': 0,  # 4xx
                'server_error': 0  # 5xx
            },
            'paths': {
                'most_visited': Counter()
            },
            'errors': []
        }
        # Collect statistics
        for entry in entries:
            # Count by format
            results['counts']['by_format'][entry.get('format', 'unknown')] += 1
            # Web server specific stats
            if entry.get('format') in ('clf', 'combined'):
                # Count by status code
                status = entry.get('status')
                if status:
                    results['counts']['by_status'][status] += 1
                    # Categorize status codes
                    if 200 <= status < 300:
                        results['status_codes']['success'] += 1
                    elif 300 <= status < 400:
                        results['status_codes']['redirect'] += 1
                    elif 400 <= status < 500:
                        results['status_codes']['client_error'] += 1
                    elif 500 <= status < 600:
                        results['status_codes']['server_error'] += 1
                # Count by HTTP method
                method = entry.get('method')
                if method:
                    results['counts']['by_method'][method] += 1
                # Count most visited paths
                path = entry.get('path')
                if path:
                    results['paths']['most_visited'][path] += 1
                # Count by IP
                ip = entry.get('ip')
                if ip:
                    results['counts']['by_ip'][ip] += 1
            # Application log specific stats
            elif entry.get('format') in ('application', 'error'):
                # Count by log level
                level = entry.get('level')
                if level:
                    results['counts']['by_level'][level] += 1
                    # Collect errors
                    if level in ('ERROR', 'FATAL', 'error'):
                        results['errors'].append(entry)
            # Count by date and hour
            # (JSON entries may carry a timestamp that was not parsed into a datetime)
            timestamp = entry.get('timestamp')
            if isinstance(timestamp, datetime):
                date_str = timestamp.strftime('%Y-%m-%d')
                hour_str = timestamp.strftime('%H')
                results['counts']['by_date'][date_str] += 1
                results['counts']['by_hour'][hour_str] += 1
        # Calculate most common items
        results['most_common'] = {
            'ips': results['counts']['by_ip'].most_common(10),
            'paths': results['paths']['most_visited'].most_common(10),
            'status_codes': results['counts']['by_status'].most_common(),
            'methods': results['counts']['by_method'].most_common(),
            'levels': results['counts']['by_level'].most_common()
        }
        return results

    def generate_report(self, analysis):
        """
        Generate a human-readable report from analysis results.

        Args:
            analysis: Analysis results from analyze_logs

        Returns:
            String containing the report
        """
        report = []
        report.append("Log Analysis Report")
        report.append("=" * 80)
        # Basic stats
        report.append(f"Total entries: {analysis['counts']['total']}")
        # By format
        report.append("\nLog Formats:")
        for format_name, count in analysis['counts']['by_format'].most_common():
            report.append(f" {format_name}: {count}")
        # HTTP stats (if applicable)
        if analysis['counts']['by_status']:
            # Status code categories
            report.append("\nStatus Code Categories:")
            for category, count in analysis['status_codes'].items():
                if count > 0:
                    report.append(f" {category}: {count}")
            # Most common status codes
            report.append("\nMost Common Status Codes:")
            for status, count in analysis['most_common']['status_codes']:
                report.append(f" {status}: {count}")
            # Most common methods
            if analysis['most_common']['methods']:
                report.append("\nHTTP Methods:")
                for method, count in analysis['most_common']['methods']:
                    report.append(f" {method}: {count}")
            # Most visited paths
            report.append("\nMost Visited Paths:")
            for path, count in analysis['most_common']['paths'][:5]:  # Top 5
                report.append(f" {path}: {count}")
        # Application log stats (if applicable)
        if analysis['counts']['by_level']:
            report.append("\nLog Levels:")
            for level, count in analysis['most_common']['levels']:
                report.append(f" {level}: {count}")
            # Show recent errors
            if analysis['errors']:
                report.append("\nRecent Errors:")
                for error in analysis['errors'][-5:]:  # Show last 5 errors
                    timestamp = error.get('datetime', '')
                    message = error.get('message', '')
                    report.append(f" [{timestamp}] {message}")
        # Time distribution
        report.append("\nEntries by Hour:")
        for hour in sorted(analysis['counts']['by_hour'].keys()):
            count = analysis['counts']['by_hour'][hour]
            # Scale the bar so that one '#' stands for roughly 1% of all entries
            bar = "#" * (count // max(1, analysis['counts']['total'] // 100))
            report.append(f" {hour}:00 - {hour}:59: {count} {bar}")
        # IP statistics
        report.append("\nTop IPs:")
        for ip, count in analysis['most_common']['ips'][:5]:  # Top 5
            report.append(f" {ip}: {count}")
        return "\n".join(report)

# Example usage
analyzer = LogAnalyzer()
# Example log entries
log_entries = [
    '127.0.0.1 - - [02/Jan/2022:03:05:07 +0000] "GET /index.html HTTP/1.1" 200 1234',
    '127.0.0.1 - - [02/Jan/2022:03:05:08 +0000] "GET /css/style.css HTTP/1.1" 200 567',
    '192.168.1.1 - - [02/Jan/2022:03:05:10 +0000] "POST /api/login HTTP/1.1" 401 123',
    '127.0.0.1 - - [02/Jan/2022:03:05:15 +0000] "GET /nonexistent.html HTTP/1.1" 404 345',
    '127.0.0.1 - - [02/Jan/2022:03:05:20 +0000] "GET /index.html HTTP/1.1" 200 1234 "http://example.com" "Mozilla/5.0"',
    '[Sun Jan 02 03:05:25 2022] [error] [client 127.0.0.1] File does not exist: /var/www/html/favicon.ico',
    '2022-01-02 03:05:30 INFO [auth] User logged in: user123',
    '2022-01-02 03:05:35 ERROR [database] Connection failed: Timeout',
    '{"timestamp": "2022-01-02T03:05:40Z", "level": "info", "message": "API request received", "method": "GET", "endpoint": "/api/status"}'
]
# Parse each log entry
parsed_entries = []
for entry in log_entries:
    parsed = analyzer.parse_line(entry)
    if parsed:
        parsed['raw'] = entry
        parsed_entries.append(parsed)
    else:
        print(f"Failed to parse: {entry}")
# Analyze the logs
analysis_results = analyzer.analyze_logs(parsed_entries)
# Generate and print a report
report = analyzer.generate_report(analysis_results)
print(report)
# You can also parse a log file directly
# log_file = 'path/to/logfile.log'
# log_entries = analyzer.parse_file(log_file)
# analysis = analyzer.analyze_logs(log_entries)
# report = analyzer.generate_report(analysis)
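
The CLF date helper in the class above drops the timezone offset for simplicity. When the offset matters, for instance when merging logs from servers in different zones, the `%z` directive of `strptime` can parse it. Here is a minimal sketch, assuming a standalone helper named `parse_clf_date_with_tz` (a hypothetical name used only for illustration, not part of the LogAnalyzer class):

```python
from datetime import datetime, timezone


def parse_clf_date_with_tz(date_str):
    """Parse a CLF timestamp such as '02/Jan/2022:03:05:07 +0000', keeping the offset.

    Hypothetical helper for illustration; returns None on malformed input.
    """
    try:
        # %z parses a numeric UTC offset such as +0000 or -0500
        return datetime.strptime(date_str, '%d/%b/%Y:%H:%M:%S %z')
    except ValueError:
        return None


local = parse_clf_date_with_tz('02/Jan/2022:03:05:07 -0500')
if local is not None:
    # Normalize to UTC so entries from different servers are comparable
    print(local.astimezone(timezone.utc).isoformat())  # 2022-01-02T08:05:07+00:00
```

Note that `%b` matches month abbreviations according to the current locale, so this sketch assumes an English (or C) locale, just as the original helper does.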
Further Resources
Official Documentation
Books and Tutorials
- Regular-Expressions.info - Comprehensive regex tutorial
- Real Python: Regular Expressions in Python
- Mastering Regular Expressions by Jeffrey Friedl
Online Tools
- Regex101 - Interactive regex tester with explanation
- RegExr - Another excellent regex testing tool
- Debuggex - Visual regex debugger
Advanced Topics
- Third-party regex module
- Regular Expression Matching Can Be Simple And Fast - Article on regex implementation algorithms
- RexEgg - Advanced regex techniques and tricks