Python Standard Library: re for Regular Expressions

Introduction to Regular Expressions

Imagine you're searching through a vast library of texts, looking for specific patterns or structures rather than exact content. You might need to find all email addresses, phone numbers with various formats, or extract specific information from structured text. This is where regular expressions come in.

Regular expressions (or "regex" for short) are sequences of characters that define search patterns. Think of them as a specialized mini-language for pattern matching within text. They allow you to:

Search for patterns in text
Validate that strings match a certain pattern
Extract information from structured text
Replace text based on patterns
Split text into segments using pattern boundaries

Python's re module implements regular expression operations, giving you access to this powerful pattern-matching tool. While the syntax might seem cryptic at first, mastering regular expressions will dramatically enhance your text processing capabilities.

In this lecture, we'll explore the re module and learn how to harness the power of regular expressions for various text processing tasks.

The re Module: Pattern Matching Toolkit

The re module provides functions and classes for working with regular expressions in Python. Let's import it and explore what it offers:


# Import the module
import re

The re module functions can be broadly categorized into several groups:

Pattern matching functions: Functions like search(), match(), and findall() for finding patterns in text
Modification functions: Functions like sub() and subn() for replacing patterns
Splitting function: split() for dividing strings based on pattern matches
Compilation function: compile() for creating reusable pattern objects
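
As a quick preview of the four groups above, here is a minimal sketch that uses one function from each. The sample text and the date pattern are made up for illustration; the syntax used in the pattern is explained in the next section.

```python
import re

text = "2023-01-15, 2024-02-29; 2025-12-31"

# Compilation: compile() returns a reusable pattern object
date_re = re.compile(r"\d{4}-\d{2}-\d{2}")

# Pattern matching: collect every date-like substring
dates = date_re.findall(text)

# Modification: sub() returns the new string, subn() also returns the count
masked = date_re.sub("<date>", text)
masked_again, count = date_re.subn("<date>", text)

# Splitting: break the text on a comma or semicolon plus optional whitespace
parts = re.split(r"[,;]\s*", text)

print(dates)
print(masked)
print(count)
print(parts)
```

The module-level functions (re.findall, re.sub, re.split) and the methods on a compiled pattern take the same arguments apart from the pattern itself, so either style works.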

Let's start by understanding the basic syntax of regular expressions before diving into these functions.

Regular Expression Syntax

Regular expressions use a combination of literal characters and special metacharacters to define patterns. Here's an introduction to the most common elements:

Basic Characters

Literal characters: Most characters match themselves (e.g., a matches the character "a")
Metacharacters: Characters with special meanings: . ^ $ * + ? { } [ ] \ | ( )
Escape character: Backslash (\) is used to escape metacharacters (e.g., \. matches a literal period)

Character Classes

Period (.): Matches any character except newline (with the re.DOTALL flag it matches newline as well)
Character sets [...]: Match any one of the characters inside the brackets
- [abc]: Matches 'a', 'b', or 'c'
- [a-z]: Matches any lowercase letter
- [0-9]: Matches any digit
- [^abc]: Matches any character EXCEPT 'a', 'b', or 'c'
Predefined character classes:
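- \d: Matches any decimal digit (equivalent to [0-9] with the re.ASCII flag or in bytes patterns; in str patterns it also matches digits from other scripts)
- \D: Matches any non-digit (the complement of \d)
- \w: Matches any word character: letters, digits, and underscore (equivalent to [a-zA-Z0-9_] only with re.ASCII or in bytes patterns; in str patterns it also matches Unicode letters and digits)
- \W: Matches any non-word character (the complement of \w)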
- \s: Matches any whitespace character (space, tab, newline, etc.)
- \S: Matches any non-whitespace character
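
The character classes above can be tried out with a short sketch like the following. The sample strings are invented for illustration; the last two lines show the difference between the default Unicode-aware \w and the re.ASCII variant.

```python
import re

sample = "Room 42, Floor 7"

digits = re.findall(r"\d", sample)        # single digits
capitals = re.findall(r"[A-Z]", sample)   # one character from the set A-Z
punct = re.findall(r"[^\w\s]", sample)    # negated set: neither word character nor whitespace
chunks = re.findall(r"\S+", sample)       # runs of non-whitespace

# For str patterns, \w is Unicode-aware unless re.ASCII is given
word = "caf\u00e9"                        # "café" with a precomposed é
unicode_words = re.findall(r"\w+", word)
ascii_words = re.findall(r"\w+", word, re.ASCII)

print(digits, capitals, punct, chunks)
print(unicode_words, ascii_words)
```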

Anchors and Boundaries

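Start of string (^): Matches at the start of the string (and, with re.MULTILINE, at the start of each line)
End of string ($): Matches at the end of the string or just before a trailing newline (and, with re.MULTILINE, at the end of each line)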
Word boundary (\b): Matches the boundary between a word and a non-word character
Non-word boundary (\B): Matches any position that is not a word boundary
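
A small sketch of how these anchors behave, using a made-up two-line string. Note how ^ only matches at the very start unless re.MULTILINE is passed.

```python
import re

text = "cat concatenate cat\ncatalog"

whole_words = re.findall(r"\bcat\b", text)                # "cat" as a standalone word
embedded = re.findall(r"\Bcat\B", text)                   # "cat" strictly inside a longer word
at_start = re.findall(r"^cat", text)                      # only the very start of the string
at_line_starts = re.findall(r"^cat", text, re.MULTILINE)  # the start of every line
ends_with_log = re.search(r"log$", text) is not None      # "catalog" ends the string

print(whole_words, embedded)
print(at_start, at_line_starts, ends_with_log)
```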

Quantifiers

Zero or more (*): Matches 0 or more occurrences of the preceding element (a character, character class, or group)
One or more (+): Matches 1 or more occurrences of the preceding element
Zero or one (?): Matches 0 or 1 occurrence of the preceding element
Exactly n {n}: Matches exactly n occurrences
At least n {n,}: Matches n or more occurrences
Between n and m {n,m}: Matches between n and m occurrences
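
The quantifiers above can be compared side by side with a sketch like this one. The input strings are illustrative only.

```python
import re

text = "a ab abbb"
zero_or_more = re.findall(r"ab*", text)   # 'a' followed by any number of 'b'
one_or_more = re.findall(r"ab+", text)    # 'a' followed by at least one 'b'

optional_u = re.findall(r"colou?r", "color colour")

numbers = "7 42 2024 123456"
exactly_four = re.findall(r"\b\d{4}\b", numbers)
two_to_four = re.findall(r"\b\d{2,4}\b", numbers)
at_least_five = re.findall(r"\b\d{5,}\b", numbers)

# A quantifier applies to a whole group when one precedes it
repeated_group = re.fullmatch(r"(?:ab){2}", "abab") is not None

print(zero_or_more, one_or_more, optional_u)
print(exactly_four, two_to_four, at_least_five, repeated_group)
```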

Groups and Alternation

Grouping (...): Groups patterns together and captures matches
Non-capturing group (?:...): Groups patterns but doesn't capture matches
Alternation (|): Matches either the pattern before or after the | (like OR)
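
Here is a minimal sketch of capturing groups, non-capturing groups, and alternation working together. The sentences are invented; the point is how the group type changes what findall() returns.

```python
import re

# Capturing groups: each (...) is available through groups() or group(n)
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Released on 2024-03-15.")
year, month, day = m.groups()

people = "Dr. Lee met Ms. Park and Mr. Kim"

# Non-capturing group around the alternation: only the surname is captured
surnames = re.findall(r"(?:Mr|Ms|Dr)\. (\w+)", people)

# Capturing group around the alternation: findall returns (title, surname) tuples
pairs = re.findall(r"(Mr|Ms|Dr)\. (\w+)", people)

print(year, month, day)
print(surnames)
print(pairs)
```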

Examples of Simple Patterns


# Match any 3-letter word
pattern = r'\b[a-zA-Z]{3}\b'

# Match a US phone number (e.g., 123-456-7890)
pattern = r'\d{3}-\d{3}-\d{4}'

# Match an email address (simple version)
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Match words that start with 'py' (e.g., 'python', 'pyenv')
pattern = r'\bpy[a-zA-Z0-9_]*\b'

Note that in Python, it is good practice to write regular expressions as raw strings (prefixed with r) so that backslashes reach the regex engine unchanged instead of being interpreted as Python string escapes.
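
To see why the raw-string prefix matters, consider \b. In an ordinary Python string it is the backspace character, so the regex engine never sees a word-boundary escape. The following sketch (sample text is made up) compares the two spellings:

```python
import re

plain = "\bcat\b"   # Python converts each \b to a backspace character (0x08)
raw = r"\bcat\b"    # the backslashes are kept for the regex engine

plain_hits = re.findall(plain, "the cat sat")   # looks for backspace + "cat" + backspace
raw_hits = re.findall(raw, "the cat sat")       # looks for the whole word "cat"

print(len(plain), len(raw))
print(plain_hits, raw_hits)
```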

Basic Pattern Matching Functions

The re module provides several functions for finding patterns in text. Let's explore the most commonly used ones.

re.search(): Find First Match Anywhere


# re.search(pattern, string) - Returns a Match object for the first match or None
import re

text = "Python is amazing and python is easy to learn."
pattern = r'python'  # Case-sensitive search for 'python'

# Search for the pattern
match = re.search(pattern, text)

if match:
    print(f"Found '{match.group()}' at position {match.start()}-{match.end()}")
else:
    print("Pattern not found")

# Case-insensitive search with flags
match = re.search(pattern, text, re.IGNORECASE)
if match:
    print(f"Found '{match.group()}' at position {match.start()}-{match.end()}")

re.match(): Match Only at the Beginning


# re.match(pattern, string) - Matches only at the start of the string
text1 = "Python is a great language"
text2 = "I love Python programming"

# Try to match 'Python' at the beginning
match1 = re.match(r'Python', text1)
match2 = re.match(r'Python', text2)

print(f"Text 1 starts with 'Python': {match1 is not None}")
print(f"Text 2 starts with 'Python': {match2 is not None}")
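
A related function worth knowing is re.fullmatch(), which succeeds only if the entire string matches. The sketch below uses a simplified, hypothetical ZIP-code pattern to show how match(), search(), and fullmatch() differ on input with trailing characters:

```python
import re

# Simplified illustrative pattern: five digits with an optional -dddd suffix
zip_re = re.compile(r"\d{5}(?:-\d{4})?")

candidate = "12345-67890"   # one digit too many

prefix_ok = zip_re.match(candidate) is not None     # a prefix matches
anywhere_ok = zip_re.search(candidate) is not None  # a substring matches
whole_ok = zip_re.fullmatch(candidate) is not None  # the whole string must match

print(prefix_ok, anywhere_ok, whole_ok)
```

For validation tasks, fullmatch() is usually the safer choice because match() accepts any string that merely starts with a valid value.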

re.findall(): Find All Non-overlapping Matches


# re.findall(pattern, string) - Returns a list of all matching strings
text = "The rain in Spain falls mainly in the plain."
pattern = r'\b\w*ain\b'  # Words ending with 'ain'

matches = re.findall(pattern, text)
print(f"Words ending with 'ain': {matches}")

# Finding all email addresses in text
text = """
Contact us at support@example.com or sales@example.com.
For billing inquiries, email billing@example.com.
"""

email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
emails = re.findall(email_pattern, text)
print(f"Found email addresses: {emails}")

re.finditer(): Iterator Over Matches


# re.finditer(pattern, string) - Returns an iterator over Match objects
text = "Python was created in 1991 by Guido van Rossum."
pattern = r'\d+'  # Match sequences of digits

# Find all numbers
for match in re.finditer(pattern, text):
    print(f"Found number '{match.group()}' at position {match.start()}-{match.end()}")

Real-World Example: Advanced Data Extraction from Structured Text

Here's an example of using advanced regex techniques to extract data from a complex structured text format:


import re

class DataExtractor:
    """
    A class for extracting structured data from complex text formats
    using advanced regular expression techniques.
    """
    
    def __init__(self):
        """Initialize with compiled regex patterns."""
        # Pattern for extracting key-value pairs with nested structures
        # This handles nested parentheses in values
        self.kvp_pattern = re.compile(
            r'(\w+)=\s*' +           # Key followed by equals sign
            r'(?:' +                  # Start of value alternatives
            r'"((?:[^"\\]|\\.)*)"' +  # Quoted value with escape handling
            r'|' +                    # OR
            r'\'((?:[^\'\\]|\\.)*)\'' + # Single-quoted value
            r'|' +                    # OR
            r'\(((?:[^()]|\([^()]*\))*)\)' + # Parenthesized value (with one level of nesting)
            r'|' +                    # OR
            r'([^,;()]+)' +           # Unquoted, non-special value
            r')'                      # End of value alternatives
        )
        
        # Pattern for nested lists [item1, item2, [subitem1, subitem2], item3]
        self.list_pattern = re.compile(
            r'\[' +                  # Opening bracket
            r'((?:' +                # Start of list content
            r'[^\[\]]*' +            # Non-bracket content
            r'|' +                   # OR
            r'\[(?:[^\[\]]*)\]' +    # Nested list with non-bracket content
            r')*)' +                 # End of list content
            r'\]'                    # Closing bracket
        )
        
        # Pattern for list items (accounting for nesting)
        self.list_item_pattern = re.compile(
            r'(?:' +                 # Start of item alternatives
            r'"((?:[^"\\]|\\.)*)"' + # Quoted item
            r'|' +                   # OR
            r'\'((?:[^\'\\]|\\.)*)\'' + # Single-quoted item
            r'|' +                   # OR
            r'\[((?:[^\[\]]|\[(?:[^\[\]]*)\])*)\]' + # Nested list
            r'|' +                   # OR
            r'([^,\[\]]+)' +         # Unquoted, non-special item
            r')'                     # End of item alternatives
        )
    
    def extract_key_value_pairs(self, text):
        """
        Extract key-value pairs from structured text.
        
        Args:
            text: Text containing key-value pairs
            
        Returns:
            Dictionary of key-value pairs
        """
        result = {}
        
        # Find all key-value pairs
        for match in self.kvp_pattern.finditer(text):
            key = match.group(1)
            
            # Determine which value alternative matched
            if match.group(2) is not None:
                # Double-quoted value
                value = match.group(2)
            elif match.group(3) is not None:
                # Single-quoted value
                value = match.group(3)
            elif match.group(4) is not None:
                # Parenthesized value
                value = match.group(4)
            else:
                # Unquoted value
                value = match.group(5).strip()
            
            # Handle nested lists in values
            if value.startswith('[') and value.endswith(']'):
                value = self.parse_list(value)
            
            result[key] = value
        
        return result
    
    def parse_list(self, list_text):
        """
        Parse a text representation of a list.
        
        Args:
            list_text: Text of list with square brackets
            
        Returns:
            List of parsed items
        """
        # Remove outer brackets
        if list_text.startswith('[') and list_text.endswith(']'):
            list_text = list_text[1:-1].strip()
        
        items = []
        
        # Split on commas that aren't inside quotes, brackets, or parentheses
        depth = 0
        quote_char = None
        current_item = ""
        
        for char in list_text:
            if quote_char:
                # Inside quotes
                if char == quote_char and not (current_item and current_item[-1] == '\\'):
                    quote_char = None
                current_item += char
            elif char == '"' or char == "'":
                # Start of quote
                quote_char = char
                current_item += char
            elif char == '[' or char == '(':
                # Opening bracket or parenthesis
                depth += 1
                current_item += char
            elif char == ']' or char == ')':
                # Closing bracket or parenthesis
                depth -= 1
                current_item += char
            elif char == ',' and depth == 0:
                # Comma at top level
                items.append(current_item.strip())
                current_item = ""
            else:
                current_item += char
        
        # Add the last item
        if current_item.strip():
            items.append(current_item.strip())
        
        # Process each item
        processed_items = []
        for item in items:
            # Check for nested lists
            if item.startswith('[') and item.endswith(']'):
                processed_items.append(self.parse_list(item))
            # Check for quoted items
            elif (item.startswith('"') and item.endswith('"')) or (item.startswith("'") and item.endswith("'")):
                processed_items.append(item[1:-1])
            else:
                processed_items.append(item)
        
        return processed_items
    
    def extract_structured_data(self, text):
        """
        Extract structured data from text containing multiple formats.
        
        Args:
            text: Text to parse
            
        Returns:
            Dictionary of extracted data
        """
        data = {}
        
        # Extract key-value pairs
        kvp_data = self.extract_key_value_pairs(text)
        data.update(kvp_data)
        
        # Extract lists
        list_matches = self.list_pattern.findall(text)
        if list_matches:
            data['lists'] = [self.parse_list(f"[{match}]") for match in list_matches]
        
        return data

# Example usage
extractor = DataExtractor()

# Example of complex structured text
# (a raw string, so the \" escapes inside raw_data reach the parser intact)
data_text = r"""
user_info=(name="John Smith", age=30, interests=["programming", "music", "hiking"])
settings=(theme="dark", font_size=12, notification=true)
permissions=["read", "write", ["create", "delete"], "execute"]
raw_data="This is some \"raw\" data with escaped quotes"
complex_value=(nested=(level=2, type="advanced"), format="special")
"""

# Extract data
extracted_data = extractor.extract_structured_data(data_text)

# Print the results
print("Extracted data:")
import json
print(json.dumps(extracted_data, indent=2))

# Access specific values
if 'user_info' in extracted_data:
    user_info = extracted_data['user_info']
    print(f"\nUser info: {user_info}")
    
    # Parsing nested structures manually if needed
    if 'interests=' in user_info:
        # Further extraction might be needed
        interests_match = re.search(r'interests=\[(.*?)\]', user_info)
        if interests_match:
            interests_text = interests_match.group(1)
            interests = [i.strip().strip('"') for i in interests_text.split(',')]
            print(f"User interests: {interests}")

This example demonstrates more advanced regular expression techniques for parsing structured text with nested elements, quoted strings, and lists. It combines capturing groups, non-capturing groups, and alternation with a small hand-written scanner (parse_list) for splitting list items. Note that the regex patterns themselves handle only one level of nesting; arbitrarily deep nesting is beyond what these patterns can express and is better left to a dedicated parser.
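
The core trick in the patterns above is an alternation in which each branch has its own capturing group, so checking which group is not None tells you which branch matched. Here is a stripped-down sketch of that idea; value_re and the sample text are hypothetical and much simpler than the patterns in DataExtractor.

```python
import re

# Either a double-quoted value (backslash escapes allowed) or a bare token
value_re = re.compile(r'"((?:[^"\\]|\\.)*)"|([^\s,]+)')

text = r'"say \"hi\"", plain, "a,b"'

values = []
for m in value_re.finditer(text):
    if m.group(1) is not None:
        # First branch matched: group 1 holds the text between the quotes
        values.append(("quoted", m.group(1)))
    else:
        # Second branch matched: group 2 holds the bare token
        values.append(("bare", m.group(2)))

print(values)
```

The sub-pattern (?:[^"\\]|\\.)* reads as "any character that is not a quote or backslash, or a backslash followed by any character", which is what lets an escaped quote appear inside the value.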

Common Pitfalls and Best Practices

Performance Considerations

Catastrophic Backtracking - Certain regex patterns can cause exponential performance degradation on some inputs
Greedy vs. Lazy Quantifiers - Be mindful of which one you use, as it affects both matching behavior and performance
Anchoring - Use anchors (^ and $) when appropriate to limit the search space
Compile Patterns - Use re.compile() for patterns you'll use multiple times


# Example of potential catastrophic backtracking
import re
import time

# A problematic pattern: the nested quantifiers let the engine divide a run
# of 'a' characters between the inner and outer '+' in exponentially many ways
bad_pattern = re.compile(r'^(a+)+$')

# A better pattern that accepts the same strings without nested quantifiers
better_pattern = re.compile(r'^a+$')

# Test string that almost matches: the trailing '!' forces the bad pattern
# to try every split before failing (each extra 'a' roughly doubles the work)
test_string = 'a' * 20 + '!'

# Time the bad pattern
start_time = time.perf_counter()
bad_match = bad_pattern.search(test_string)
bad_time = time.perf_counter() - start_time
print(f"Bad pattern time: {bad_time:.6f} seconds")

# Time the better pattern
start_time = time.perf_counter()
better_match = better_pattern.search(test_string)
better_time = time.perf_counter() - start_time
print(f"Better pattern time: {better_time:.6f} seconds")
# Guard against a zero measurement on coarse timers
print(f"Improvement factor: {bad_time / max(better_time, 1e-9):.1f}x")

Common Mistakes

Forgetting to Escape Special Characters - Characters like . ^ $ * + ? { } [ ] \ | ( ) need to be escaped with \ to match literally
Not Using Raw Strings - Forgetting the r prefix can cause issues with backslashes
Using .* Greedily - Can match more than intended; use .*? for non-greedy matching
Overusing Regular Expressions - Sometimes simple string methods are more appropriate
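
The greedy-versus-lazy point is easiest to see on a small example. In this sketch (the HTML fragment is made up), the greedy .* runs to the last > in the string, while the lazy .*? and a negated character class both stop at the first one:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.*>", html)      # one match spanning the whole string
lazy = re.findall(r"<.*?>", html)       # each tag separately
negated = re.findall(r"<[^>]*>", html)  # same result, without relying on laziness

print(greedy)
print(lazy)
print(negated)
```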


# Example of escaping special characters
text = "How much is $5.99?"

# Wrong pattern (missing escape for $ and .)
wrong_pattern = re.compile(r'$5.99')
if not wrong_pattern.search(text):
    print("Wrong pattern didn't match due to unescaped special characters")

# Correct pattern (with escapes)
correct_pattern = re.compile(r'\$5\.99')
if correct_pattern.search(text):
    print("Correct pattern matched with escaped special characters")

# Example of raw string importance
windows_path = "C:\\Users\\John\\Documents"

# '\\Users' is the pattern text \Users. The regex engine reads \U as the start
# of a \UXXXXXXXX Unicode escape, finds no hex digits, and raises re.error
try:
    bad_pattern = re.compile('\\Users')  # Pattern text is \Users, not a literal backslash
    print("Matches without raw string:", bool(bad_pattern.search(windows_path)))
except re.error as e:
    print(f"Error without raw string: {e}")

# To match a literal backslash the regex itself needs \\; a raw string lets
# you write that as r'\\Users' instead of '\\\\Users'
good_pattern = re.compile(r'\\Users')
print("Matches with raw string:", bool(good_pattern.search(windows_path)))

Best Practices

Start Simple - Build and test regex patterns incrementally
Use Tools - Regex testing tools like regex101.com help visualize and debug patterns
Document Complex Patterns - Use comments (with re.VERBOSE) or separate documentation
Consider Alternatives - For complex parsing, consider specialized parsers instead of regex
Test Edge Cases - Test with empty strings, unexpected input, and boundary conditions


# Example of a well-documented complex pattern using VERBOSE flag
email_pattern = re.compile(r"""
    # Local part
    (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
    |"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]
      |\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")
      
    # @ symbol
    @
    
    # Domain
    (?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
    |\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
       (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9])
       (?::[0-9]+)?(?:/[\w-]*)?\])
""", re.VERBOSE | re.IGNORECASE)

# Test the pattern
test_emails = [
    "simple@example.com",
    "very.common@example.com",
    "disposable.style.email.with+symbol@example.com",
    "other.email-with-hyphen@example.com",
    "fully-qualified-domain@example.com",
    "user.name+tag+sorting@example.com",
    "x@example.com",
    "example-indeed@strange-example.com",
    "example@s.example",
    "invalid@example",
    "A@b@c@example.com",
    "a\"b(c)d,e:f;gi[j\\k]l@example.com"
]

for email in test_emails:
    # fullmatch() requires the whole string to match, so trailing characters are rejected
    print(f"{email}: {'Valid' if email_pattern.fullmatch(email) else 'Invalid'}")

Beyond the re Module: Third-Party Alternatives

While the re module is powerful, there are third-party libraries that offer additional features or better performance for specific use cases.

regex Module

The regex module is a drop-in replacement for re that offers additional features like Unicode property support, recursive patterns, and more.


# Install with: pip install regex
import regex

# Example: Matching balanced parentheses (a recursive pattern)
text = "((a+b)*(c+d)) + (e*(f+g))"

# This pattern would be difficult with re, but regex supports recursion
pattern = regex.compile(r'\((?:[^()]++|(?R))*\)')
matches = pattern.findall(text)
print(f"Balanced parentheses expressions: {matches}")

# Unicode properties
pattern = regex.compile(r'\p{Greek}+')  # Match Greek letters
matches = pattern.findall("This contains Greek: αβγδε and Latin: abcde")
print(f"Greek words: {matches}")

# Fuzzy matching
pattern = regex.compile(r'(?:fuzzy){e<=1}')  # Allow up to 1 error
matches = pattern.findall("fizzy fussy fuzzi")
print(f"Fuzzy matches for 'fuzzy': {matches}")

re2 Module

The re2 module provides bindings to Google's RE2 regular expression library, which guarantees linear-time matching, avoiding the catastrophic backtracking issues that can occur with re.


# Install with: pip install google-re2 (Google's bindings, imported as re2)
# Note: if no prebuilt package is available for your platform, you need the RE2 C++ library installed
try:
    import re2
    
    # Example usage (similar to re)
    pattern = re2.compile(r'\b\w+ing\b')
    matches = pattern.findall("Running jumping swimming walking")
    print(f"Words ending in 'ing': {matches}")
except ImportError:
    print("re2 module not installed or RE2 C++ library missing")

Specialized Parsing Libraries

For more complex text processing, consider these alternatives:

BeautifulSoup - For HTML/XML parsing
PyParsing - For creating parsers for complex language syntax
NLTK - For natural language processing


# BeautifulSoup example for HTML parsing
try:
    from bs4 import BeautifulSoup
    
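    # Sample markup (the link target is a placeholder URL)
    html = """
    <html>
      <body>
        <h1>Title</h1>
        <p>First paragraph</p>
        <p>Second paragraph with <a href="https://example.com">link</a>.</p>
      </body>
    </html>
    """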
    
    soup = BeautifulSoup(html, 'html.parser')
    
    # Extract all paragraphs
    paragraphs = soup.find_all('p')
    print(f"Paragraphs: {[p.get_text() for p in paragraphs]}")
    
    # Extract all links
    links = soup.find_all('a')
    print(f"Links: {[a['href'] for a in links]}")
except ImportError:
    print("BeautifulSoup not installed")

Practice Exercises

Exercise 1: Create a Pattern Validator

Create a validator for common data patterns like phone numbers, postal codes, and IP addresses.


import re

class PatternValidator:
    """Validator for common data patterns using regular expressions."""
    
    def __init__(self):
        """Initialize with compiled regex patterns."""
        # Phone number pattern (US format) with various formats
        self.phone_pattern = re.compile(r'''
            (?:
                # (123) 456-7890
                \(\d{3}\)\s*\d{3}[-.\s]?\d{4} |
                
                # 123-456-7890
                \d{3}[-.\s]?\d{3}[-.\s]?\d{4} |
                
                # +1 123-456-7890
                \+\d{1,2}\s*\d{3}[-.\s]?\d{3}[-.\s]?\d{4}
            )
        ''', re.VERBOSE)
        
        # US Zip code pattern (12345 or 12345-6789)
        self.zipcode_pattern = re.compile(r'\b\d{5}(?:-\d{4})?\b')
        
        # IP address pattern (IPv4)
        self.ipv4_pattern = re.compile(r'''
            \b
            (?:
                # Ensure each octet is between 0-255
                (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
                (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
                (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
                (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
            )
            \b
        ''', re.VERBOSE)
        
        # Email pattern
        self.email_pattern = re.compile(r'''
            \b
            [a-zA-Z0-9._%+-]+
            @
            [a-zA-Z0-9.-]+
            \.[a-zA-Z]{2,}
            \b
        ''', re.VERBOSE)
        
        # URL pattern
        self.url_pattern = re.compile(r'''
            \b
            (?:
                https?://                   # http:// or https://
                (?:
                    [a-zA-Z0-9]             # Domain parts
                    [a-zA-Z0-9-]*           # Domain parts (with hyphens)
                    [a-zA-Z0-9]
                    \.
                )+
                [a-zA-Z]{2,}                # TLD
                (?:/[a-zA-Z0-9._~:/?#[\]@!$&'()*+,;=%-]*)?    # Path
            )
            \b
        ''', re.VERBOSE)
        
        # Credit card pattern
        self.cc_pattern = re.compile(r'''
            \b
            (?:
                4[0-9]{12}(?:[0-9]{3})?                # Visa
                |
                5[1-5][0-9]{14}                        # MasterCard
                |
                3[47][0-9]{13}                         # American Express
                |
                3(?:0[0-5]|[68][0-9])[0-9]{11}         # Diners Club
                |
                6(?:011|5[0-9]{2})[0-9]{12}            # Discover
                |
                (?:2131|1800|35\d{3})\d{11}            # JCB
            )
            \b
        ''', re.VERBOSE)
    
    def validate_phone(self, phone):
        """Validate a phone number."""
        # fullmatch() requires the entire string to match, not just a prefix
        return bool(self.phone_pattern.fullmatch(phone))
    
    def validate_zipcode(self, zipcode):
        """Validate a US zip code."""
        return bool(self.zipcode_pattern.fullmatch(zipcode))
    
    def validate_ipv4(self, ip):
        """Validate an IPv4 address."""
        return bool(self.ipv4_pattern.fullmatch(ip))
    
    def validate_email(self, email):
        """Validate an email address."""
        return bool(self.email_pattern.fullmatch(email))
    
    def validate_url(self, url):
        """Validate a URL."""
        return bool(self.url_pattern.match(url))
    
    def validate_credit_card(self, cc_number):
        """Validate a credit card number (format only)."""
        # Remove spaces and dashes
        cc_number = re.sub(r'[\s-]', '', cc_number)
        
        # Check pattern
        if not self.cc_pattern.match(cc_number):
            return False
        
        # Luhn algorithm (checksum) - used for credit card validation
        def luhn_checksum(card_number):
            def digits_of(n):
                return [int(d) for d in str(n)]
            
            digits = digits_of(card_number)
            odd_digits = digits[-1::-2]
            even_digits = digits[-2::-2]
            checksum = sum(odd_digits)
            for d in even_digits:
                checksum += sum(digits_of(d*2))
            return checksum % 10 == 0
        
        # Perform Luhn check
        return luhn_checksum(cc_number)
    
    def find_all_patterns(self, text):
        """Find all supported patterns in the text."""
        results = {
            'phones': self.phone_pattern.findall(text),
            'zipcodes': self.zipcode_pattern.findall(text),
            'ips': self.ipv4_pattern.findall(text),
            'emails': self.email_pattern.findall(text),
            'urls': self.url_pattern.findall(text),
            'credit_cards': self.cc_pattern.findall(text)
        }
        return results

# Example usage
validator = PatternValidator()

# Test phone validation
print("Phone Validation:")
test_phones = [
    "(123) 456-7890",
    "123-456-7890",
    "123.456.7890",
    "+1 123-456-7890",
    "1234567890",
    "123-45-6789",  # SSN format, should fail
    "(123) 456-789"  # Missing digit
]

for phone in test_phones:
    print(f"  {phone}: {'Valid' if validator.validate_phone(phone) else 'Invalid'}")

# Test zip code validation
print("\nZip Code Validation:")
test_zips = [
    "12345",
    "12345-6789",
    "123456",
    "1234",
    "12345-67890"
]

for zipcode in test_zips:
    print(f"  {zipcode}: {'Valid' if validator.validate_zipcode(zipcode) else 'Invalid'}")

# Test IP validation
print("\nIP Address Validation:")
test_ips = [
    "192.168.1.1",
    "10.0.0.1",
    "255.255.255.255",
    "256.1.1.1",
    "192.168.1",
    "a.b.c.d"
]

for ip in test_ips:
    print(f"  {ip}: {'Valid' if validator.validate_ipv4(ip) else 'Invalid'}")

# Test credit card validation
print("\nCredit Card Validation:")
test_cards = [
    "4111 1111 1111 1111",  # Visa
    "5500 0000 0000 0004",  # MasterCard
    "340000000000009",      # American Express
    "6011000000000004",     # Discover
    "1234567812345678",     # Invalid
    "4111111111111112"      # Invalid checksum
]

for card in test_cards:
    print(f"  {card}: {'Valid' if validator.validate_credit_card(card) else 'Invalid'}")

# Find all patterns in a text
sample_text = """
Contact us at support@example.com or call (123) 456-7890.
Our office is located at 123 Main St, New York, NY 12345-6789.
For technical issues, connect to 192.168.1.1 or visit https://help.example.org.
For payment, we accept Visa (4111 1111 1111 1111) and MasterCard.
"""

patterns = validator.find_all_patterns(sample_text)
print("\nPatterns found in sample text:")
for pattern_type, matches in patterns.items():
    if matches:
        print(f"  {pattern_type.capitalize()}: {matches}")

Exercise 2: Build a Text Template Engine

Create a simple template engine that replaces placeholders with values.
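
Before the full solution, it may help to see the one technique it is built on: passing a function to sub(), so that each match is replaced by whatever the function returns. The sketch below is a deliberately tiny version; placeholder_re, fill, and the context values are illustrative names, not part of the solution that follows.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces
placeholder_re = re.compile(r"\{\{\s*(\w+)\s*\}\}")

context = {"name": "Ada", "lang": "Python"}

def fill(match):
    """Return the context value for the captured name, or '' if it is unknown."""
    return str(context.get(match.group(1), ""))

rendered = placeholder_re.sub(fill, "Hello {{ name }}, welcome to {{lang}}{{missing}}!")
print(rendered)
```

The callback receives a Match object for every placeholder, so it can apply any lookup or formatting logic before returning the replacement text.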


import re

class TemplateEngine:
    """
    A simple template engine that replaces placeholders in a template
    with actual values.
    
    Supports:
    - Simple placeholders: {{variable}}
    - Nested attributes: {{user.name}}
    - Default values: {{variable|default:fallback}}
    - Filters: {{variable|uppercase}}
    - Conditional blocks: {% if condition %} ... {% endif %}
    - Loop blocks: {% for item in items %} ... {% endfor %}
    """
    
    def __init__(self):
        """Initialize the template engine with compiled regex patterns."""
        # Variable pattern: {{variable}}, {{variable|filter}}, or {{variable|default:fallback}}
        self.var_pattern = re.compile(r'{{(\s*[\w.]+\s*(?:\|\s*\w+(?::[^{}|]*)?\s*)?)}}')
        
        # If block pattern: {% if condition %} ... {% endif %}
        self.if_pattern = re.compile(
            r'{%\s*if\s+([\w.]+)\s*%}(.*?)(?:{%\s*else\s*%}(.*?))?{%\s*endif\s*%}',
            re.DOTALL
        )
        
        # For loop pattern: {% for item in items %} ... {% endfor %}
        self.for_pattern = re.compile(
            r'{%\s*for\s+([\w]+)\s+in\s+([\w.]+)\s*%}(.*?){%\s*endfor\s*%}',
            re.DOTALL
        )
    
    def render(self, template, context):
        """
        Render a template with the given context.
        
        Args:
            template: Template string with placeholders
            context: Dictionary of values to replace placeholders
            
        Returns:
            Rendered template with placeholders replaced
        """
        # Process conditional blocks first
        template = self._process_conditionals(template, context)
        
        # Process loops
        template = self._process_loops(template, context)
        
        # Process variables
        template = self._process_variables(template, context)
        
        return template
    
    def _get_value_from_context(self, var_name, context):
        """
        Get a value from the context, supporting nested attributes.
        
        Args:
            var_name: Variable name, possibly with dots (e.g., 'user.name')
            context: Context dictionary
            
        Returns:
            Value from context or None if not found
        """
        parts = var_name.strip().split('.')
        value = context
        
        try:
            for part in parts:
                value = value[part]
            return value
        except (KeyError, TypeError):
            return None
    
    def _process_variables(self, template, context):
        """
        Replace all variable placeholders with their values.
        
        Args:
            template: Template string
            context: Context dictionary
            
        Returns:
            Template with variables replaced
        """
        def replace_var(match):
            var_expr = match.group(1).strip()
            
            # Check for filters
            if '|' in var_expr:
                var_name, filter_name = var_expr.split('|', 1)
                var_name = var_name.strip()
                filter_name = filter_name.strip()
                
                # Get the base value
                value = self._get_value_from_context(var_name, context)
                
                # Apply the filter
                if filter_name == 'uppercase':
                    return str(value).upper() if value is not None else ''
                elif filter_name == 'lowercase':
                    return str(value).lower() if value is not None else ''
                elif filter_name.startswith('default:'):
                    default_value = filter_name.split(':', 1)[1]
                    return str(value) if value is not None else default_value
                else:
                    # Unknown filter
                    return str(value) if value is not None else ''
            else:
                # No filter
                value = self._get_value_from_context(var_expr, context)
                return str(value) if value is not None else ''
        
        return self.var_pattern.sub(replace_var, template)
    
    def _process_conditionals(self, template, context):
        """
        Process if/else conditional blocks.
        
        Args:
            template: Template string
            context: Context dictionary
            
        Returns:
            Template with conditional blocks processed
        """
        def replace_if(match):
            condition_var = match.group(1).strip()
            if_body = match.group(2)
            else_body = match.group(3) if match.group(3) else ''
            
            # Evaluate the condition
            condition_value = self._get_value_from_context(condition_var, context)
            
            if condition_value:
                return if_body
            else:
                return else_body
        
        return self.if_pattern.sub(replace_if, template)
    
    def _process_loops(self, template, context):
        """
        Process for loop blocks.
        
        Args:
            template: Template string
            context: Context dictionary
            
        Returns:
            Template with loop blocks processed
        """
        def replace_for(match):
            item_var = match.group(1).strip()
            items_var = match.group(2).strip()
            loop_body = match.group(3)
            
            # Get the items to iterate over
            items = self._get_value_from_context(items_var, context)
            
            if not items:
                return ''
            
            # Render the loop body for each item
            result = []
            for item in items:
                # Create a new context with the loop variable
                loop_context = dict(context)
                loop_context[item_var] = item
                
                # Render the loop body with this context
                rendered_body = loop_body
                
                # Process nested loops and conditionals
                rendered_body = self._process_conditionals(rendered_body, loop_context)
                rendered_body = self._process_loops(rendered_body, loop_context)
                
                # Process variables
                rendered_body = self._process_variables(rendered_body, loop_context)
                
                result.append(rendered_body)
            
            return ''.join(result)
        
        return self.for_pattern.sub(replace_for, template)

# Example usage
template_engine = TemplateEngine()

# Simple template
template = """
Hello, {{name}}!

{% if is_admin %}
You have admin privileges.
{% else %}
You have regular user privileges.
{% endif %}

Your profile information:
- Email: {{email|lowercase}}
- Joined: {{join_date|default:N/A}}

{% if has_friends %}
Your friends:
{% for friend in friends %}
- {{friend.name}} ({{friend.email}})
{% endfor %}
{% else %}
You don't have any friends yet.
{% endif %}
"""

# Context for the template
context = {
    'name': 'John Smith',
    'email': 'JOHN@EXAMPLE.COM',
    'is_admin': True,
    'has_friends': True,
    'friends': [
        {'name': 'Alice', 'email': 'alice@example.com'},
        {'name': 'Bob', 'email': 'bob@example.com'},
        {'name': 'Charlie', 'email': 'charlie@example.com'}
    ]
}

# Render the template
rendered = template_engine.render(template, context)
print(rendered)

# Another example with different context
context2 = {
    'name': 'Jane Doe',
    'email': 'jane@example.com',
    'is_admin': False,
    'has_friends': False
}

rendered2 = template_engine.render(template, context2)
print("\nSecond rendering:")
print(rendered2)
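
The variable substitution in this engine relies on passing a function, rather than a string, as the replacement argument of sub(). The following is a minimal standalone sketch of that technique. It assumes a hypothetical placeholder syntax of `{{ name }}` or `{{ name|filter }}`; the `PLACEHOLDER` pattern and the `fill` helper are illustrative names, and the engine's own `var_pattern` may differ.

```python
import re

# Assumed placeholder syntax for this sketch: {{ name }} or {{ name|filter }}
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)(?:\|(\w+))?\s*\}\}")


def fill(template, values):
    """Replace each placeholder with its value, applying an optional filter."""
    def replace(match):
        name, filter_name = match.group(1), match.group(2)
        value = str(values.get(name, ""))
        if filter_name == "uppercase":
            return value.upper()
        if filter_name == "lowercase":
            return value.lower()
        return value

    # sub() calls replace() once per match and splices in its return value
    return PLACEHOLDER.sub(replace, template)


print(fill("Hello, {{ name|uppercase }}! Mail: {{email|lowercase}}",
           {"name": "Ada", "email": "ADA@EXAMPLE.COM"}))
```

Because the callback receives the match object, it can inspect groups and decide the replacement per occurrence, which a plain replacement string cannot do.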

Exercise 3: Create a Custom Log Parser and Analyzer

Build a log parser that extracts and analyzes information from different log formats.


import re
from collections import defaultdict, Counter
from datetime import datetime

class LogAnalyzer:
    """
    A class for parsing and analyzing various log formats
    using regular expressions.
    """
    
    def __init__(self):
        """Initialize with regex patterns for different log formats."""
        # Common log format (CLF) pattern
        # Example: 127.0.0.1 - - [02/Jan/2022:03:05:07 +0000] "GET /page.html HTTP/1.1" 200 1234
        self.clf_pattern = re.compile(
            r'(\S+)\s+-\s+-\s+\[(.*?)\]\s+"(\S+)\s+(\S+)\s+([^"]*)"\s+(\d+)\s+(\d+|-)'
        )
        
        # Combined log format pattern (CLF + referer and user agent)
        self.combined_pattern = re.compile(
            r'(\S+)\s+-\s+-\s+\[(.*?)\]\s+"(\S+)\s+(\S+)\s+([^"]*)"\s+(\d+)\s+(\d+|-)\s+"([^"]*)"\s+"([^"]*)"'
        )
        
        # Error log pattern
        # Example: [Sun Jan 02 03:05:07 2022] [error] [client 127.0.0.1] File does not exist: /path/to/file
        self.error_pattern = re.compile(
            r'\[(.*?)\]\s+\[(\w+)\]\s+(?:\[client\s+(\S+)\]\s+)?(.+)'
        )
        
        # Custom application log pattern
        # Example: 2022-01-02 03:05:07 INFO [module] User logged in: user123
        self.app_pattern = re.compile(
            r'(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+(\w+)\s+\[([^\]]+)\]\s+(.*)'
        )
        
        # JSON log pattern
        self.json_pattern = re.compile(r'(\{.*\})')
    
    def parse_line(self, line):
        """
        Parse a single log line and determine its format.
        
        Args:
            line: A string containing a log entry
            
        Returns:
            A dictionary with parsed log information or None if format is unknown
        """
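        # Try each pattern. The combined pattern must come before CLF:
        # match() does not require the whole line to be consumed, so the
        # CLF pattern would also match the start of a combined-format line.
        for format_name, pattern, parser in [
            ('combined', self.combined_pattern, self._parse_combined),
            ('clf', self.clf_pattern, self._parse_clf),
            ('error', self.error_pattern, self._parse_error),
            ('application', self.app_pattern, self._parse_app),
            ('json', self.json_pattern, self._parse_json)
        ]:
            match = pattern.match(line)
            if match:
                parsed = parser(match)
                parsed['format'] = format_name
                return parsed
        
        # Unknown format
        return None
    
    def _parse_clf(self, match):
        """Parse Common Log Format (CLF) match."""
        # Only the first seven groups are CLF fields; a combined-format
        # match carries two more (referer and user agent).
        ip, date_str, method, path, protocol, status, size = match.groups()[:7]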
        
        # Parse timestamp
        timestamp = self._parse_clf_date(date_str)
        
        return {
            'ip': ip,
            'timestamp': timestamp,
            'datetime': date_str,
            'method': method,
            'path': path,
            'protocol': protocol,
            'status': int(status),
            'size': int(size) if size != '-' else 0
        }
    
    def _parse_combined(self, match):
        """Parse Combined Log Format match."""
        ip, date_str, method, path, protocol, status, size, referer, user_agent = match.groups()
        
        # Parse the CLF part first
        parsed = self._parse_clf(match)
        
        # Add the additional fields
        parsed.update({
            'referer': referer if referer != '-' else '',
            'user_agent': user_agent
        })
        
        return parsed
    
    def _parse_error(self, match):
        """Parse error log match."""
        date_str, level, ip, message = match.groups()
        
        # Parse timestamp
        try:
            timestamp = datetime.strptime(date_str, '%a %b %d %H:%M:%S %Y')
        except ValueError:
            timestamp = None
        
        return {
            'timestamp': timestamp,
            'datetime': date_str,
            'level': level,
            'ip': ip if ip else '',
            'message': message
        }
    
    def _parse_app(self, match):
        """Parse application log match."""
        date_str, level, module, message = match.groups()
        
        # Parse timestamp
        try:
            timestamp = datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S')
        except ValueError:
            timestamp = None
        
        return {
            'timestamp': timestamp,
            'datetime': date_str,
            'level': level,
            'module': module,
            'message': message
        }
    
    def _parse_json(self, match):
        """Parse JSON log match."""
        import json
        
        json_str = match.group(1)
        try:
            data = json.loads(json_str)
            # Add a timestamp if it exists in a known format
            if 'timestamp' in data and isinstance(data['timestamp'], str):
                try:
                    data['timestamp'] = datetime.fromisoformat(data['timestamp'].replace('Z', '+00:00'))
                except ValueError:
                    pass
            return data
        except json.JSONDecodeError:
            return {'raw': json_str}
    
    def _parse_clf_date(self, date_str):
        """Parse CLF date format."""
        # CLF date format: 02/Jan/2022:03:05:07 +0000
        try:
            # Remove timezone for simplicity
            date_part = date_str.split(' ')[0]
            return datetime.strptime(date_part, '%d/%b/%Y:%H:%M:%S')
        except ValueError:
            return None
    
    def parse_file(self, file_path):
        """
        Parse a log file.
        
        Args:
            file_path: Path to the log file
            
        Returns:
            List of parsed log entries
        """
        entries = []
        
        try:
            with open(file_path, 'r', encoding='utf-8', errors='replace') as f:
                for line_num, line in enumerate(f, 1):
                    line = line.strip()
                    if not line:
                        continue
                    
                    entry = self.parse_line(line)
                    if entry:
                        entry['line_number'] = line_num
                        entry['raw'] = line
                        entries.append(entry)
                    else:
                        # Unknown format
                        entries.append({
                            'format': 'unknown',
                            'line_number': line_num,
                            'raw': line
                        })
        except OSError as e:
            print(f"Error parsing log file: {e}")
        
        return entries
    
    def analyze_logs(self, entries):
        """
        Analyze log entries to extract useful information.
        
        Args:
            entries: List of parsed log entries
            
        Returns:
            Dictionary with analysis results
        """
        results = {
            'counts': {
                'total': len(entries),
                'by_format': Counter(),
                'by_status': Counter(),
                'by_method': Counter(),
                'by_level': Counter(),
                'by_date': Counter(),
                'by_hour': Counter(),
                'by_ip': Counter()
            },
            'status_codes': {
                'success': 0,    # 2xx
                'redirect': 0,   # 3xx
                'client_error': 0, # 4xx
                'server_error': 0  # 5xx
            },
            'paths': {
                'most_visited': Counter()
            },
            'errors': []
        }
        
        # Collect statistics
        for entry in entries:
            # Count by format
            results['counts']['by_format'][entry.get('format', 'unknown')] += 1
            
            # Web server specific stats
            if entry.get('format') in ('clf', 'combined'):
                # Count by status code
                status = entry.get('status')
                if status:
                    results['counts']['by_status'][status] += 1
                    
                    # Categorize status codes
                    if 200 <= status < 300:
                        results['status_codes']['success'] += 1
                    elif 300 <= status < 400:
                        results['status_codes']['redirect'] += 1
                    elif 400 <= status < 500:
                        results['status_codes']['client_error'] += 1
                    elif 500 <= status < 600:
                        results['status_codes']['server_error'] += 1
                
                # Count by HTTP method
                method = entry.get('method')
                if method:
                    results['counts']['by_method'][method] += 1
                
                # Count most visited paths
                path = entry.get('path')
                if path:
                    results['paths']['most_visited'][path] += 1
                
                # Count by IP
                ip = entry.get('ip')
                if ip:
                    results['counts']['by_ip'][ip] += 1
            
            # Application log specific stats
            elif entry.get('format') in ('application', 'error'):
                # Count by log level
                level = entry.get('level')
                if level:
                    results['counts']['by_level'][level] += 1
                
                # Collect errors
                if level in ('ERROR', 'FATAL', 'error'):
                    results['errors'].append(entry)
            
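            # Count by date and hour. JSON entries may carry a timestamp
            # that could not be parsed, so only datetime objects are counted.
            timestamp = entry.get('timestamp')
            if isinstance(timestamp, datetime):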
                date_str = timestamp.strftime('%Y-%m-%d')
                hour_str = timestamp.strftime('%H')
                results['counts']['by_date'][date_str] += 1
                results['counts']['by_hour'][hour_str] += 1
        
        # Calculate most common items
        results['most_common'] = {
            'ips': results['counts']['by_ip'].most_common(10),
            'paths': results['paths']['most_visited'].most_common(10),
            'status_codes': results['counts']['by_status'].most_common(),
            'methods': results['counts']['by_method'].most_common(),
            'levels': results['counts']['by_level'].most_common()
        }
        
        return results
    
    def generate_report(self, analysis):
        """
        Generate a human-readable report from analysis results.
        
        Args:
            analysis: Analysis results from analyze_logs
            
        Returns:
            String containing the report
        """
        report = []
        report.append("Log Analysis Report")
        report.append("=" * 80)
        
        # Basic stats
        report.append(f"Total entries: {analysis['counts']['total']}")
        
        # By format
        report.append("\nLog Formats:")
        for format_name, count in analysis['counts']['by_format'].most_common():
            report.append(f"  {format_name}: {count}")
        
        # HTTP stats (if applicable)
        if analysis['counts']['by_status']:
            # Status code categories
            report.append("\nStatus Code Categories:")
            for category, count in analysis['status_codes'].items():
                if count > 0:
                    report.append(f"  {category}: {count}")
            
            # Most common status codes
            report.append("\nMost Common Status Codes:")
            for status, count in analysis['most_common']['status_codes']:
                report.append(f"  {status}: {count}")
            
            # Most common methods
            if analysis['most_common']['methods']:
                report.append("\nHTTP Methods:")
                for method, count in analysis['most_common']['methods']:
                    report.append(f"  {method}: {count}")
            
            # Most visited paths
            report.append("\nMost Visited Paths:")
            for path, count in analysis['most_common']['paths'][:5]:  # Top 5
                report.append(f"  {path}: {count}")
        
        # Application log stats (if applicable)
        if analysis['counts']['by_level']:
            report.append("\nLog Levels:")
            for level, count in analysis['most_common']['levels']:
                report.append(f"  {level}: {count}")
            
            # Show recent errors
            if analysis['errors']:
                report.append("\nRecent Errors:")
                for error in analysis['errors'][-5:]:  # Show last 5 errors
                    timestamp = error.get('datetime', '')
                    message = error.get('message', '')
                    report.append(f"  [{timestamp}] {message}")
        
        # Time distribution
        report.append("\nEntries by Hour:")
        for hour in sorted(analysis['counts']['by_hour'].keys()):
            count = analysis['counts']['by_hour'][hour]
            bar = "#" * (count // max(1, analysis['counts']['total'] // 100))
            report.append(f"  {hour}:00 - {hour}:59: {count} {bar}")
        
        # IP statistics
        report.append("\nTop IPs:")
        for ip, count in analysis['most_common']['ips'][:5]:  # Top 5
            report.append(f"  {ip}: {count}")
        
        return "\n".join(report)

# Example usage
analyzer = LogAnalyzer()

# Example log entries
log_entries = [
    '127.0.0.1 - - [02/Jan/2022:03:05:07 +0000] "GET /index.html HTTP/1.1" 200 1234',
    '127.0.0.1 - - [02/Jan/2022:03:05:08 +0000] "GET /css/style.css HTTP/1.1" 200 567',
    '192.168.1.1 - - [02/Jan/2022:03:05:10 +0000] "POST /api/login HTTP/1.1" 401 123',
    '127.0.0.1 - - [02/Jan/2022:03:05:15 +0000] "GET /nonexistent.html HTTP/1.1" 404 345',
    '127.0.0.1 - - [02/Jan/2022:03:05:20 +0000] "GET /index.html HTTP/1.1" 200 1234 "http://example.com" "Mozilla/5.0"',
    '[Sun Jan 02 03:05:25 2022] [error] [client 127.0.0.1] File does not exist: /var/www/html/favicon.ico',
    '2022-01-02 03:05:30 INFO [auth] User logged in: user123',
    '2022-01-02 03:05:35 ERROR [database] Connection failed: Timeout',
    '{"timestamp": "2022-01-02T03:05:40Z", "level": "info", "message": "API request received", "method": "GET", "endpoint": "/api/status"}'
]

# Parse each log entry
parsed_entries = []
for entry in log_entries:
    parsed = analyzer.parse_line(entry)
    if parsed:
        parsed['raw'] = entry
        parsed_entries.append(parsed)
    else:
        print(f"Failed to parse: {entry}")

# Analyze the logs
analysis_results = analyzer.analyze_logs(parsed_entries)

# Generate and print a report
report = analyzer.generate_report(analysis_results)
print(report)

# You can also parse a log file directly
# log_file = 'path/to/logfile.log'
# log_entries = analyzer.parse_file(log_file)
# analysis = analyzer.analyze_logs(log_entries)
# report = analyzer.generate_report(analysis)
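
One detail worth checking in any multi-format parser is pattern order. match() anchors only at the start of the string, so a shorter pattern can succeed on a line that a longer, more specific pattern was meant for. The sketch below uses simplified, hypothetical patterns (not the ones from the exercise) to show the effect and how fullmatch() avoids it.

```python
import re

# Simplified stand-ins for a short and a long log format (illustrative only)
short_pattern = re.compile(r'(\S+) "(\w+) (\S+)" (\d{3})')
long_pattern = re.compile(r'(\S+) "(\w+) (\S+)" (\d{3}) "([^"]*)"')

line = '127.0.0.1 "GET /index.html" 200 "Mozilla/5.0"'

# match() succeeds for both: the short pattern simply stops early
print(bool(short_pattern.match(line)))      # True
print(bool(long_pattern.match(line)))       # True

# fullmatch() requires the whole line to be consumed
print(bool(short_pattern.fullmatch(line)))  # False
print(bool(long_pattern.fullmatch(line)))   # True
```

Either trying the most specific pattern first or switching to fullmatch() keeps the format detection unambiguous.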

Further Resources

Official Documentation

Books and Tutorials

Regular-Expressions.info - Comprehensive regex tutorial
Real Python: Regular Expressions in Python
Mastering Regular Expressions by Jeffrey Friedl

Online Tools

Regex101 - Interactive regex tester with explanation
RegExr - Another excellent regex testing tool
Debuggex - Visual regex debugger

Advanced Topics

Third-party regex module
Regular Expression Matching Can Be Simple And Fast - Article on regex implementation algorithms
RexEgg - Advanced regex techniques and tricks

Real-World Example: Log Parser

Here's a practical example of using regular expressions to parse log file entries:
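
A minimal sketch of such a parser follows. It assumes a hypothetical log line format of the form `2023-05-15 14:30:45 [INFO] User 'alice' logged in from 192.168.1.10`; the format, the pattern names, and the `parse_log_line` function are illustrative assumptions rather than a fixed standard.

```python
import re

# Hypothetical log line format, assumed for illustration:
#   2023-05-15 14:30:45 [INFO] User 'alice' logged in from 192.168.1.10
LOG_LINE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"\[(?P<level>[A-Z]+)\]\s+"
    r"(?P<message>.*)"
)
USERNAME = re.compile(r"[Uu]ser '(?P<username>[^']+)'")
IP_ADDRESS = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def parse_log_line(line):
    """Return a dict of fields from one log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Username and IP address are optional details inside the message
    user = USERNAME.search(entry["message"])
    ip = IP_ADDRESS.search(entry["message"])
    entry["username"] = user.group("username") if user else None
    entry["ip"] = ip.group(0) if ip else None
    return entry


sample_lines = [
    "2023-05-15 14:30:45 [INFO] User 'alice' logged in from 192.168.1.10",
    "2023-05-15 14:31:02 [ERROR] Database connection failed",
    "not a log line",
]
for line in sample_lines:
    print(parse_log_line(line))
```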

This example demonstrates how regular expressions can be used to parse structured log entries. The parser extracts timestamps, log levels, messages, usernames, and IP addresses from each log entry. This is a common task in log analysis and monitoring systems.

Pattern	Description	Example
`.`	Matches any character except newline	`a.c` matches "abc", "a2c", "a-c", etc.
`^`	Matches start of string	`^hello` matches "hello world" but not "say hello"
`$`	Matches end of string	`world$` matches "hello world" but not "world class"
`\`	Escapes special characters	`\.` matches a literal period
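
The rows above can be tried directly in the interpreter. This is a short sketch; the sample strings are made up for illustration.

```python
import re

# . matches any single character except a newline
print(re.findall(r"a.c", "abc a2c a-c ac"))       # ['abc', 'a2c', 'a-c']

# ^ and $ anchor the pattern to the start and end of the string
print(bool(re.search(r"^hello", "hello world")))  # True
print(bool(re.search(r"^hello", "say hello")))    # False
print(bool(re.search(r"world$", "hello world")))  # True

# A backslash makes a metacharacter literal
print(re.findall(r"\.", "v1.2.3"))                # ['.', '.']
```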

Pattern	Description	Example
`*`	Matches 0 or more occurrences	`ab*c` matches "ac", "abc", "abbc", etc.
`+`	Matches 1 or more occurrences	`ab+c` matches "abc", "abbc", but not "ac"
`?`	Matches 0 or 1 occurrence	`colou?r` matches "color" or "colour"
`{n}`	Matches exactly n occurrences	`\d{3}` matches "123", "456", etc.
`{n,}`	Matches n or more occurrences	`\d{2,}` matches "12", "345", etc.
`{n,m}`	Matches between n and m occurrences	`\d{2,4}` matches "12", "123", "1234"
`*?`, `+?`, `??`	Non-greedy versions of *, +, ?	`a.*?b` matches "acb" in "acbdb" (greedy `a.*b` matches "acbdb")
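
A brief sketch of the quantifiers above, including the difference between greedy and non-greedy matching. The sample strings are illustrative.

```python
import re

text = "acbdb"
# Greedy: .* takes as much as possible, then backs off to the last "b"
print(re.search(r"a.*b", text).group())    # acbdb
# Non-greedy: .*? takes as little as possible, stopping at the first "b"
print(re.search(r"a.*?b", text).group())   # acb

# {n,m} matches between n and m repetitions, preferring the longest
print(re.findall(r"\d{2,4}", "1 12 123 12345"))        # ['12', '123', '1234']

# ? makes the preceding item optional
print(re.findall(r"colou?r", "color colour colouur"))  # ['color', 'colour']
```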

Pattern	Description	Example
`(xyz)`	Capturing group	`(abc)+` matches "abc", "abcabc", etc.
`(?:xyz)`	Non-capturing group	`(?:abc)+` same as above but doesn't capture
`x|y`	Alternation (x or y)	`cat|dog` matches "cat" or "dog"
`(?P<name>xyz)`	Named capturing group	`(?P<year>\d{4})` captures year as a named group
`\1`, `\2`, etc.	Backreference to a capturing group	`(abc)\1` matches "abcabc"
`(?P=name)`	Backreference to a named group	`(?P<char>a)(?P=char)` matches "aa"
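
The grouping constructs above can be sketched as follows. The sample strings are illustrative.

```python
import re

# Named groups give each captured piece a readable label
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "Released 2023-05")
print(m.group("year"), m.group("month"))  # 2023 05
print(m.groupdict())                      # {'year': '2023', 'month': '05'}

# A backreference matches the same text a group captured earlier,
# which makes it handy for finding doubled words
print(re.findall(r"\b(\w+) \1\b", "it was was the the end"))  # ['was', 'the']

# Non-capturing groups group without adding to findall() results
print(re.findall(r"(?:cat|dog)s?", "cats and dog"))           # ['cats', 'dog']
```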

Pattern	Description	Example
`(?=xyz)`	Positive lookahead	`a(?=b)` matches "a" only if followed by "b"
`(?!xyz)`	Negative lookahead	`a(?!b)` matches "a" only if not followed by "b"
`(?<=xyz)`	Positive lookbehind	`(?<=a)b` matches "b" only if preceded by "a"
`(?<!xyz)`	Negative lookbehind	`(?<!a)b` matches "b" only if not preceded by "a"
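
Lookaround assertions check the surrounding text without including it in the match. A small sketch with made-up sample strings:

```python
import re

# Positive lookahead: digits only when followed by "px"
print(re.findall(r"\d+(?=px)", "10px 20em 30px"))        # ['10', '30']

# Negative lookahead: digits not followed by another digit or "px"
# (the \d alternative stops the engine from settling for a shorter number)
print(re.findall(r"\d+(?!\d|px)", "10px 20em 30px"))     # ['20']

# Positive lookbehind: digits only when preceded by "$"
print(re.findall(r"(?<=\$)\d+", "$15 and 20 and $7"))    # ['15', '7']

# Negative lookbehind: whole numbers not preceded by "$"
print(re.findall(r"(?<!\$)\b\d+", "$15 and 20 and $7"))  # ['20']
```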

Flag	Description	Example
`re.I` or `re.IGNORECASE`	Case-insensitive matching	`re.search('a', 'A', re.I)` matches
`re.M` or `re.MULTILINE`	^ and $ match start/end of each line	`re.search('^a', 'b\na', re.M)` matches
`re.S` or `re.DOTALL`	Dot (.) matches newline too	`re.search('a.b', 'a\nb', re.S)` matches
`re.X` or `re.VERBOSE`	Allows formatted regex with comments	See verbose pattern examples
`re.A` or `re.ASCII`	\w, \W, \b, \B, \d, \D, \s, \S match ASCII only	Affects behavior with Unicode characters
`re.U` or `re.UNICODE`	\w, \W, \b, \B, \d, \D, \s, \S match based on Unicode	Default for str patterns in Python 3
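
A short sketch of how the most common flags change matching behavior. The sample strings and the `date` pattern name are illustrative.

```python
import re

# re.IGNORECASE: letter case is ignored
print(re.findall(r"python", "Python PYTHON python", re.IGNORECASE))
# ['Python', 'PYTHON', 'python']

# re.MULTILINE: ^ matches at the start of every line
text = "first line\nsecond line"
print(re.findall(r"^\w+", text))                   # ['first']
print(re.findall(r"^\w+", text, re.MULTILINE))     # ['first', 'second']

# re.DOTALL: . also matches a newline
print(bool(re.search(r"a.b", "a\nb")))             # False
print(bool(re.search(r"a.b", "a\nb", re.DOTALL)))  # True

# re.VERBOSE: whitespace and comments are allowed inside the pattern
date = re.compile(r"""
    (\d{4})  # year
    -
    (\d{2})  # month
""", re.VERBOSE)
print(date.search("2023-05").groups())             # ('2023', '05')
```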

Pattern	Description	Example
`[abc]`	Matches any of the characters inside brackets	`gr[ae]y` matches "gray" or "grey"
`[^abc]`	Matches any character NOT inside brackets	`[^0-9]` matches any non-digit
`[a-z]`	Matches any character in the range	`[a-z]` matches any lowercase letter
`\d`	Matches any digit (`[0-9]` under `re.ASCII`; any Unicode decimal digit by default for str patterns)	`\d{3}` matches "123", "456", etc.
`\D`	Matches any non-digit	`\D+` matches "abc", "xyz", etc.
`\w`	Matches any word character (alphanumeric + underscore)	`\w+` matches "abc123", "python_3", etc.
`\W`	Matches any non-word character	`\W+` matches " + = ", "!@#", etc.
`\s`	Matches any whitespace character	`\s+` matches spaces, tabs, newlines
`\S`	Matches any non-whitespace character	`\S+` matches "abc", "123", etc.
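
The character classes above can be sketched with a few one-liners. The sample strings are illustrative.

```python
import re

# A bracketed set matches exactly one of the listed characters
print(re.findall(r"gr[ae]y", "gray grey groy"))    # ['gray', 'grey']

# A leading ^ negates the set: here, anything that is not a digit or whitespace
print(re.findall(r"[^0-9\s]+", "abc 123 x9"))      # ['abc', 'x']

# Shorthand classes
print(re.findall(r"\d+", "Order 66, aisle 7"))     # ['66', '7']
print(re.findall(r"\w+", "python_3 rocks!"))       # ['python_3', 'rocks']
print(re.split(r"\s+", "split   on\twhitespace"))  # ['split', 'on', 'whitespace']
```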
