Python Strings and String Manipulation

Week 2: Python Fundamentals - Text Processing

Session Overview

Welcome to our deep dive into Python strings and string manipulation! Today, we'll explore how Python handles text data and the powerful tools it provides for manipulating strings. Understanding string operations is fundamental to many programming tasks, from simple text processing to complex data extraction and transformation.

String Fundamentals

Strings in Python are sequences of characters enclosed in quotes. They are immutable, meaning once created, they cannot be changed.

Creating Strings

Python offers multiple ways to create strings:

# Single quotes
single_quoted = 'Hello, World!'

# Double quotes
double_quoted = "Hello, World!"

# Triple quotes for multi-line strings
multi_line = """This is a string that
spans multiple lines,
which makes it more readable
in the code."""

# Triple single quotes work too
also_multi_line = '''Another
multi-line
string.'''

Single and double quotes are functionally identical. Choose based on the content of your string to avoid escape characters:

# When the string contains single quotes
message = "Don't worry about using escape characters here"

# When the string contains double quotes
quote = 'She said, "Python is amazing!"'

String Immutability

Python strings are immutable, meaning you cannot change individual characters directly:

name = "Python"
# This will cause an error:
# name[0] = "J"

# Instead, create a new string
new_name = "J" + name[1:]  # Results in "Jython"

Analogy: Strings as Necklaces of Beads

Think of a Python string like a necklace of letter beads:

  • Each character is like a bead on the necklace
  • You can examine each bead (read characters)
  • You can count the beads (get the length)
  • You can make a copy of part of the necklace (slicing)
  • You can join two necklaces (concatenation)
  • But you cannot replace a bead once the necklace is made (immutability)
  • To "change" a necklace, you must create a new one

This analogy helps explain why string operations always return new strings rather than modifying existing ones.
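
A minimal sketch of what the analogy means in practice: a string method hands back a new string and leaves the original untouched. The variable names are illustrative only.

```python
original = "necklace"

# upper() builds and returns a new string; it does not alter `original`
shouted = original.upper()

# "Changing" one character means assembling a new string from slices
patched = original[:4] + "T" + original[5:]

print(original)  # necklace
print(shouted)   # NECKLACE
print(patched)   # neckTace
```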

Accessing String Characters

Indexing

Python uses zero-based indexing to access individual characters in a string:

message = "Hello, Python!"

# Positive indexing (from the beginning)
first_char = message[0]  # 'H'
fifth_char = message[4]  # 'o'

# Negative indexing (from the end)
last_char = message[-1]   # '!'
second_last = message[-2]  # 'n'

Here's a visual representation of indexing:

 H  e  l  l  o  ,     P  y  t  h  o  n  !
 0  1  2  3  4  5  6  7  8  9  10 11 12 13  (Positive indices)
-14-13-12-11-10-9 -8 -7 -6 -5 -4 -3 -2 -1  (Negative indices)
            

Slicing

Slicing allows you to extract a substring by specifying a range of indices:

message = "Hello, Python!"

# Basic slicing [start:end] (end index is exclusive)
hello = message[0:5]    # "Hello"
python = message[7:13]  # "Python"

# Omitting start or end index
beginning = message[:5]  # "Hello" (starts from 0)
end = message[7:]       # "Python!" (goes to the end)

# Using negative indices in slicing
last_word = message[-7:-1]  # "Python"

# Step parameter [start:end:step]
every_other = message[0:14:2]  # "Hlo yhn"
reversed_string = message[::-1]  # "!nohtyP ,olleH"

Remember that slicing always returns a new string without modifying the original.
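
One detail worth a quick sketch: indexing past the end of a string raises an IndexError, while a slice that runs past the end is quietly clamped to the string's bounds. The names below are illustrative.

```python
message = "Hello, Python!"

# Slices that run past the end are clamped rather than raising an error
tail = message[7:100]    # "Python!"
empty = message[50:60]   # ""

# Indexing a single out-of-range position does raise
try:
    message[50]
    index_failed = False
except IndexError:
    index_failed = True

print(tail, repr(empty), index_failed)
```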

Basic String Operations

String Concatenation

You can join strings using the + operator:

first_name = "John"
last_name = "Doe"

# Concatenation with + operator
full_name = first_name + " " + last_name  # "John Doe"

# Multiple concatenations
greeting = "Hello, " + full_name + "!"    # "Hello, John Doe!"

String Repetition

You can repeat a string using the * operator:

# Repeating a string
separator = "-" * 20  # "--------------------"
padding = " " * 5     # "     " (5 spaces)

# Practical example
title = "MENU"
menu_header = separator + "\n" + padding + title + "\n" + separator

print(menu_header)
# Output:
# --------------------
#      MENU
# --------------------

String Length

Get the length of a string using the len() function:

message = "Hello, Python!"
length = len(message)  # 14

Checking Membership

You can check if a substring exists in a string using the 'in' operator:

message = "Hello, Python!"

contains_python = "Python" in message    # True
contains_java = "Java" in message        # False
not_contains_java = "Java" not in message  # True

String Methods

Python provides a rich set of built-in methods for string manipulation. Here are some of the most useful ones:

Case Conversion

message = "Hello, Python!"

# Case conversion
upper_case = message.upper()      # "HELLO, PYTHON!"
lower_case = message.lower()      # "hello, python!"
title_case = message.title()      # "Hello, Python!"
swapped_case = message.swapcase() # "hELLO, pYTHON!"
capitalized = "python is amazing".capitalize()  # "Python is amazing"

Stripping Whitespace

# Whitespace includes spaces, tabs, and newlines
text = "   Too much whitespace   \n"

# Remove whitespace from both ends
stripped = text.strip()  # "Too much whitespace"

# Remove from left/right side only
left_stripped = text.lstrip()  # "Too much whitespace   \n"
right_stripped = text.rstrip()  # "   Too much whitespace"

# Strip specific characters
custom_stripped = "###python###".strip('#')  # "python"

Searching and Replacing

text = "Python is a great programming language. Python is versatile."

# Find the first occurrence
first_position = text.find("Python")  # 0
second_position = text.find("Python", 1)  # 40

# Find with a specified range (start inclusive, end exclusive)
position_in_range = text.find("Python", 10, 50)  # 40

# Find all occurrences
all_positions = [i for i in range(len(text)) if text.startswith("Python", i)]
# [0, 40]

# Count occurrences
count = text.count("Python")  # 2

# Replace
replaced_once = text.replace("Python", "Ruby", 1)
# "Ruby is a great programming language. Python is versatile."

replaced_all = text.replace("Python", "Ruby")
# "Ruby is a great programming language. Ruby is versatile."
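
When the substring is absent, find() and index() behave differently, which matters for error handling. A short sketch, reusing the sample sentence from above:

```python
text = "Python is a great programming language. Python is versatile."

# find() signals "not found" by returning -1
missing = text.find("Ruby")

# index() raises ValueError instead of returning -1
try:
    text.index("Ruby")
    raised = False
except ValueError:
    raised = True

# rfind() searches from the right and returns the last occurrence
last_position = text.rfind("Python")

print(missing, raised, last_position)  # -1 True 40
```

Because -1 is also a valid index (the last character), it is usually safer to test the result of find() explicitly before using it to slice.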

Splitting and Joining

# Splitting a string into a list
sentence = "Python is amazing and powerful"
words = sentence.split()  # ["Python", "is", "amazing", "and", "powerful"]

# Splitting with a specific delimiter
csv_data = "apple,banana,cherry,date"
fruits = csv_data.split(',')  # ["apple", "banana", "cherry", "date"]

# Splitting with limit
limited_split = csv_data.split(',', 2)  # ["apple", "banana", "cherry,date"]

# Joining a list into a string
joined_words = " ".join(words)  # "Python is amazing and powerful"
joined_fruits = ", ".join(fruits)  # "apple, banana, cherry, date"

# Splitting on multiple delimiters with the re module
import re
text = "Hello! How are you? I'm fine, thank you."
sentences = re.split(r'[.!?]+', text)
# ['Hello', ' How are you', " I'm fine, thank you", '']

Checking String Properties

# Checking string properties
print("abc123".isalnum())  # True (only letters and numbers)
print("abc".isalpha())     # True (only letters)
print("123".isdigit())     # True (only digits)
print("   ".isspace())     # True (only whitespace)
print("Title Case".istitle())  # True (each word starts with uppercase)
print("UPPER".isupper())   # True (all uppercase)
print("lower".islower())   # True (all lowercase)

# Starting and ending
print("Python".startswith("Py"))  # True
print("Python".endswith("on"))    # True

Alignment and Padding

# Left, right, and center alignment
left_aligned = "Python".ljust(10)      # "Python    "
right_aligned = "Python".rjust(10)     # "    Python"
centered = "Python".center(10)         # "  Python  "

# With custom fill character
right_aligned_custom = "Python".rjust(10, '-')  # "----Python"
centered_custom = "Python".center(10, '*')      # "**Python**"

# Zero padding for numbers
formatted_number = "42".zfill(5)  # "00042"

String Formatting

Python offers several methods for formatting strings by inserting values:

F-Strings (Python 3.6+)

F-strings provide a concise and readable way to embed expressions in string literals:

name = "Alice"
age = 30

# Basic f-string
greeting = f"Hello, {name}! You are {age} years old."
# "Hello, Alice! You are 30 years old."

# Expressions in f-strings
greeting = f"Hello, {name.upper()}! In 5 years, you'll be {age + 5}."
# "Hello, ALICE! In 5 years, you'll be 35."

# Formatting specifiers
pi = 3.14159265359
formatted = f"Pi rounded to 2 decimal places: {pi:.2f}"
# "Pi rounded to 2 decimal places: 3.14"

# Padding and alignment
for i in range(1, 4):
    print(f"{i:2} - {i**2:3}")
# " 1 -   1"
# " 2 -   4"
# " 3 -   9"

# Dictionary values
person = {'name': 'Bob', 'age': 25}
formatted = f"His name is {person['name']} and he's {person['age']}."
# "His name is Bob and he's 25."
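
A few more format specifiers come up often in reports and tables. The sketch below uses made-up values; the specifiers themselves (`,`, `%`, `<`, `>`, `^`) are part of Python's format specification mini-language.

```python
price = 1234567.891
ratio = 0.256
label = "item"

with_commas = f"{price:,.2f}"   # thousands separator, 2 decimals
as_percent = f"{ratio:.1%}"     # multiply by 100 and append %
left = f"{label:<10}|"          # left-align in 10 columns
right = f"{label:>10}|"         # right-align in 10 columns
centered = f"{label:^10}|"      # center in 10 columns

print(with_commas)  # 1,234,567.89
print(as_percent)   # 25.6%
print(left)         # item      |
print(right)        #       item|
print(centered)     #    item   |
```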

str.format() Method

The format() method is another way to format strings:

name = "Alice"
age = 30

# Basic formatting
greeting = "Hello, {}! You are {} years old.".format(name, age)

# Positional arguments
greeting = "Hello, {0}! You are {1} years old. Goodbye, {0}!".format(name, age)

# Named arguments
greeting = "Hello, {name}! You are {age} years old.".format(name=name, age=age)

# Accessing dictionary items
person = {'name': 'Bob', 'age': 25}
formatted = "His name is {0[name]} and he's {0[age]}.".format(person)
# "His name is Bob and he's 25."

% Formatting (Legacy Style)

This older style is still found in legacy code:

name = "Alice"
age = 30

# Basic formatting
greeting = "Hello, %s! You are %d years old." % (name, age)

# Named placeholders
greeting = "Hello, %(name)s! You are %(age)d years old." % {'name': name, 'age': age}

# Formatting specifiers
pi = 3.14159
formatted = "Pi rounded to 2 decimal places: %.2f" % pi  # "Pi rounded to 2 decimal places: 3.14"

Analogy: String Formatting as Filling in a Template

Think of string formatting like filling in a template form:

  • F-strings are like having a digital form that can auto-calculate fields
  • The format() method is like a form with numbered blanks you can reference
  • The % operator is like an older paper form with limited field types

Just as you would choose a template that best fits your needs, you can choose the formatting method that works best for your specific situation, with f-strings generally being the most modern and convenient option.
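
To make the comparison concrete, here is the same line produced with all three styles. The values are invented for illustration.

```python
name = "Alice"
balance = 1234.5

with_fstring = f"{name} owes {balance:.2f}"
with_format = "{} owes {:.2f}".format(name, balance)
with_percent = "%s owes %.2f" % (name, balance)

print(with_fstring)                                 # Alice owes 1234.50
print(with_fstring == with_format == with_percent)  # True
```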

Advanced String Operations

Regular Expressions

For complex pattern matching and manipulation, Python's re module provides regular expression support:

import re

text = "Contact us: support@example.com or sales-team@company.co.uk"

# Finding all email addresses
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
emails = re.findall(email_pattern, text)
# ['support@example.com', 'sales-team@company.co.uk']

# Replacing with regex
censored = re.sub(r'[a-zA-Z0-9._%+-]+@', '***@', text)
# "Contact us: ***@example.com or ***@company.co.uk"

# Splitting with regex
parts = re.split(r'[ :]+', "apple : banana : cherry")
# ['apple', 'banana', 'cherry']

# Validating patterns
def is_valid_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))

print(is_valid_email("user@example.com"))  # True
print(is_valid_email("invalid-email"))     # False
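
A possible variation on the validator, offered as a sketch rather than a complete email check: compile the pattern once and use re.fullmatch(), which requires the whole string to match. One practical difference is that `$` also matches just before a trailing newline, so the re.match() version accepts "user@example.com\n" while fullmatch() does not. The constant name EMAIL_RE is an assumption made for this example.

```python
import re

# Compiled once, reused on every call
EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def is_valid_email(email):
    # fullmatch succeeds only if the entire string matches the pattern
    return EMAIL_RE.fullmatch(email) is not None

print(is_valid_email("user@example.com"))    # True
print(is_valid_email("invalid-email"))       # False
print(is_valid_email("user@example.com\n"))  # False
```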

String Translation

Translate characters using the str.translate() method:

# Create a translation table
translation_table = str.maketrans({
    'a': '@',
    'e': '3',
    'i': '1',
    'o': '0',
    's': '$'
})

# Apply translation
text = "Hello, this is a secret message"
leetspeak = text.translate(translation_table)
# "H3ll0, th1$ 1$ @ $3cr3t m3$$@g3"

# Remove characters with translate
remove_punctuation = str.maketrans('', '', '.,;:!?')
cleaned_text = "Hello, World! How are you?".translate(remove_punctuation)
# "Hello World How are you"

String Comparison

# Case-sensitive comparison
print("Python" == "python")  # False

# Case-insensitive comparison
print("Python".lower() == "python".lower())  # True

# Unicode normalization for comparison
import unicodedata
def normalized_equals(str1, str2):
    """Compare strings ignoring case and combining characters (accents)."""
    def strip_marks(s):
        # NFKD splits 'é' into 'e' plus a combining accent; drop the accent
        decomposed = unicodedata.normalize('NFKD', s.lower())
        return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return strip_marks(str1) == strip_marks(str2)

print(normalized_equals("Café", "cafe"))  # True
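
For case-insensitive comparison of non-English text, str.casefold() is generally a better fit than lower(). A small sketch with a German word; the variable names are illustrative.

```python
german = "Stra\u00dfe"   # "Straße"
ascii_form = "STRASSE"

# lower() keeps the sharp s, so the two strings still differ
lower_equal = german.lower() == ascii_form.lower()

# casefold() maps the sharp s to "ss", so they compare equal
casefold_equal = german.casefold() == ascii_form.casefold()

print(lower_equal)     # False
print(casefold_equal)  # True
```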

Working with Unicode

# Unicode characters
print("Unicode symbols: ♠ ♥ ♦ ♣ ★ ☺")

# Converting between characters and code points
character = 'A'
code_point = ord(character)  # 65
back_to_char = chr(code_point)  # 'A'

# Emoji
print("Emoji support: 🐍 👍 🚀")

# Getting the Unicode name
import unicodedata
snake_emoji = "🐍"
print(unicodedata.name(snake_emoji))  # "SNAKE"
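
Characters and bytes are not the same thing, which becomes visible as soon as text is written to a file or sent over a network. A brief sketch, assuming UTF-8 as the encoding:

```python
word = "caf\u00e9"   # "café" with a precomposed é

encoded = word.encode("utf-8")     # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

print(encoded)       # b'caf\xc3\xa9'
print(len(word))     # 4 characters
print(len(encoded))  # 5 bytes, because 'é' takes two bytes in UTF-8
print(decoded == word)  # True
```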

Practical String Manipulation Examples

Text Cleaning

def clean_text(text):
    """Remove extra whitespace, normalize case, and remove punctuation."""
    import re
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove punctuation (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)
    
    # Collapse runs of whitespace into a single space, then trim the ends
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

dirty_text = "  Hello,   World!  How's   it going? "
cleaned = clean_text(dirty_text)
# "hello world hows it going"

Word Counter

def count_words(text):
    """Count word frequency in text."""
    # Clean the text
    text = clean_text(text)
    
    # Split into words
    words = text.split()
    
    # Count frequencies
    word_count = {}
    for word in words:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    
    return word_count

sample_text = """Python is amazing. Python is versatile.
                Python has many libraries for different purposes."""
                
word_frequencies = count_words(sample_text)
# {'python': 3, 'is': 2, 'amazing': 1, 'versatile': 1, 'has': 1, 'many': 1, 
#  'libraries': 1, 'for': 1, 'different': 1, 'purposes': 1}
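
The counting loop can also be expressed with collections.Counter from the standard library. This is a sketch, not a drop-in replacement: the function name count_words_with_counter is made up for illustration, and it tokenizes with `\w+` instead of calling clean_text, so a contraction such as "how's" would be split into "how" and "s".

```python
import re
from collections import Counter

def count_words_with_counter(text):
    """Count word frequency using collections.Counter."""
    # Lowercase the text and pull out runs of word characters
    words = re.findall(r"\w+", text.lower())
    return Counter(words)

sample_text = "Python is amazing. Python is versatile."
frequencies = count_words_with_counter(sample_text)

print(frequencies["python"])       # 2
print(frequencies.most_common(2))  # [('python', 2), ('is', 2)]
```

Counter also returns 0 for words it has not seen, which removes the need for the if/else branch.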

Simple Template Engine

def render_template(template, context):
    """Replace placeholders in a template with values from context."""
    result = template
    for key, value in context.items():
        placeholder = '{{' + key + '}}'
        result = result.replace(placeholder, str(value))
    return result

# Template
email_template = """
Hello {{name}},

Thank you for your purchase of {{product}} on {{date}}.
Your order number is {{order_number}}.

Best regards,
{{company}} Support Team
"""

# Context
order_data = {
    'name': 'Alice Smith',
    'product': 'Python Masterclass',
    'date': '2025-04-15',
    'order_number': 'ORD-12345',
    'company': 'CodeLearners'
}

# Render
email_content = render_template(email_template, order_data)
print(email_content)

URL Parsing

def parse_url(url):
    """Extract components from a URL."""
    import re
    
    # Pattern for URL components
    pattern = r'^(https?://)?([^/]+)(/.*)?$'
    match = re.match(pattern, url)
    
    if not match:
        return None
    
    protocol = match.group(1) or ''
    domain = match.group(2)
    path = match.group(3) or ''
    
    # Extract query parameters if present
    query_params = {}
    if '?' in path:
        path_parts = path.split('?', 1)
        path = path_parts[0]
        query_string = path_parts[1]
        
        # Parse query string
        for param in query_string.split('&'):
            if '=' in param:
                key, value = param.split('=', 1)
                query_params[key] = value
    
    return {
        'protocol': protocol.rstrip(':/'),  # rstrip() removes a set of characters, not a suffix
        'domain': domain,
        'path': path,
        'query_params': query_params
    }

url = "https://example.com/products?category=books&sort=price"
components = parse_url(url)
print(components)
# {'protocol': 'https', 'domain': 'example.com', 'path': '/products', 
#  'query_params': {'category': 'books', 'sort': 'price'}}
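
The hand-written parser is useful for practicing string methods. For real-world URLs, the standard library's urllib.parse module also covers cases the pattern above does not attempt, such as ports, fragments, and percent-encoded values. A brief sketch using the same example URL:

```python
from urllib.parse import urlsplit, parse_qs

url = "https://example.com/products?category=books&sort=price"
parts = urlsplit(url)

print(parts.scheme)  # https
print(parts.netloc)  # example.com
print(parts.path)    # /products

# parse_qs maps each key to a list, since a key may repeat in a query string
params = parse_qs(parts.query)
print(params)        # {'category': ['books'], 'sort': ['price']}
```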

Performance Tips for String Operations

String Concatenation

When concatenating many strings, use join() instead of the + operator:

# Inefficient (creates a new string object each time)
result = ""
for i in range(1000):
    result += str(i)

# More efficient (builds the list in memory, then joins once)
parts = []
for i in range(1000):
    parts.append(str(i))
result = "".join(parts)

# More concise with a generator expression
result = "".join(str(i) for i in range(1000))

String Processing of Large Files

# Process large files line by line instead of loading the whole content
def count_lines(file_path):
    count = 0
    with open(file_path, 'r') as f:
        for line in f:  # Reads one line at a time
            count += 1
    return count
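
The same line-by-line pattern works for any per-line string processing. As a sketch, the hypothetical helper below finds the longest line without holding the whole file in memory; the temporary file exists only to make the example runnable.

```python
import os
import tempfile

def longest_line_length(file_path):
    """Return the length of the longest line, ignoring the trailing newline."""
    longest = 0
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:  # Only one line is held in memory at a time
            longest = max(longest, len(line.rstrip("\n")))
    return longest

# Demo on a small temporary file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("short\na much longer line\nmid\n")
    demo_path = tmp.name

result = longest_line_length(demo_path)
os.remove(demo_path)

print(result)  # 18
```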

Avoid Redundant Conversions

# Redundant operations
def inefficient(number):
    return int(str(number) + str(3))

# More efficient (same result for non-negative integers)
def efficient(number):
    return number * 10 + 3

Practice Exercises

Exercise 1: String Basics

  1. Create a string with your full name
  2. Extract your first and last name using slicing
  3. Convert your name to uppercase, lowercase, and title case
  4. Calculate the length of your full name (including spaces)
  5. Replace your first name with "Mr." or "Ms."

Exercise 2: String Formatting

  1. Create variables for a product name, price, and quantity
  2. Format a nice-looking receipt line using f-strings
  3. Format the same receipt with str.format()
  4. Create a table of products with aligned columns

Exercise 3: Advanced String Processing

Write a function that accepts a text string and:

  1. Counts the total number of characters, words, and sentences
  2. Finds the five most common words
  3. Computes the average word length
  4. Returns a dictionary with all these statistics

Exercise 4: Password Validator

Create a function that checks if a password meets the following criteria:

The function should return True if the password is valid and False otherwise.

Wrapping Up and Next Steps

Today we've explored Python's powerful string manipulation capabilities, from basic operations to advanced techniques. Strings are fundamental to nearly all programming tasks, and mastering these concepts will serve you well in your Python journey.

Key Takeaways

Where to Go from Here

  1. Practice string manipulation by working on text processing projects
  2. Explore the re module further for advanced pattern matching
  3. Learn about Unicode and internationalization for handling text in different languages
  4. Dive into natural language processing libraries like NLTK or spaCy that build on these fundamentals

Additional Resources

In our next session, we'll build on these string manipulation skills as we explore Python's data structures and how to effectively organize and process more complex information.