Python String Methods Every Data Scientist Uses

Why String Methods Matter in Data Science

Data is messy, and much of that mess is in text. Customer names with extra whitespace, dates in inconsistent formats, categorical labels with typos — string manipulation is a daily task for data scientists. In interviews, string questions test your Python fluency and your ability to clean real-world data.

Essential String Methods

strip(), lstrip(), rstrip()

Remove whitespace (or specified characters) from the edges of a string:

name = "  Alice Johnson  "
name.strip()    # "Alice Johnson"
name.lstrip()   # "Alice Johnson  "
name.rstrip()   # "  Alice Johnson"

# Remove specific characters
price = "$99.99$"
price.strip("$")  # "99.99"

Data cleaning use case: Column names often have leading/trailing spaces after reading from CSV or Excel files:

df.columns = [col.strip() for col in df.columns]
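One caveat worth checking before relying on strip() with an argument: the argument is treated as a set of characters, not as a prefix or suffix. The sketch below shows the difference, assuming Python 3.9+ for removeprefix() and removesuffix(); the filename and column label are made up for illustration.

```python
# Hypothetical values, used only to illustrate the behavior.
filename = "results.csv"

# rstrip(".csv") removes any trailing ".", "c", "s", or "v" characters,
# so it also eats the final "s" of "results".
char_stripped = filename.rstrip(".csv")            # "result"

# removesuffix() removes the exact substring, and only once (Python 3.9+).
suffix_removed = filename.removesuffix(".csv")     # "results"

# removeprefix() is the matching tool for the left edge.
prefix_removed = "raw_price".removeprefix("raw_")  # "price"
```

If the code has to run on Python versions older than 3.9, slicing after an endswith() check gives the same result.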

split() and join()

split() breaks a string into a list; join() does the reverse:

# Split
text = "Python,SQL,Pandas,NumPy"
skills = text.split(",")  # ['Python', 'SQL', 'Pandas', 'NumPy']

# Split with maxsplit
log = "2025-03-15 ERROR Database connection failed"
parts = log.split(" ", 2)  # ['2025-03-15', 'ERROR', 'Database connection failed']

# Join
", ".join(skills)  # "Python, SQL, Pandas, NumPy"
"-".join(["2025", "03", "15"])  # "2025-03-15"

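A detail that often comes up with messy data: split() with no argument behaves differently from split(" "). A short sketch, using a made-up name with irregular spacing:

```python
# Hypothetical input with leading, trailing, and repeated spaces.
messy = "  Alice   Johnson  "

# An explicit separator splits on every single space,
# so runs of spaces produce empty strings.
by_space = messy.split(" ")
# ['', '', 'Alice', '', '', 'Johnson', '', '']

# No argument: split on any run of whitespace and drop empty strings.
by_whitespace = messy.split()
# ['Alice', 'Johnson']

# split() plus join() is a compact way to normalize internal whitespace.
normalized = " ".join(messy.split())
# "Alice Johnson"
```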
replace()

Substitute substrings:

text = "Hello World"
text.replace("World", "Python")  # "Hello Python"

# Remove characters
phone = "(555) 123-4567"
phone.replace("(", "").replace(")", "").replace("-", "").replace(" ", "")
# "5551234567"
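Chained replace() calls work, but they get long as the list of characters grows. One possible alternative is str.translate() with a deletion table, sketched here for the same phone number:

```python
phone = "(555) 123-4567"

# The third argument to str.maketrans lists characters to delete.
drop_punctuation = str.maketrans("", "", "()- ")

digits = phone.translate(drop_punctuation)  # "5551234567"
```

The table can be built once and reused across many strings, which keeps the cleaning rule in a single place.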

lower(), upper(), title(), capitalize()

name = "john DOE"
name.lower()       # "john doe"
name.upper()       # "JOHN DOE"
name.title()       # "John Doe"
name.capitalize()  # "John doe"

Interview pattern: Case-insensitive comparison:

user_input = "Yes"

# Fragile: misses "Yes", "YES", etc.
if user_input == "yes":
    print("confirmed")

# Better: normalize case before comparing
if user_input.lower() == "yes":
    print("confirmed")
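lower() is usually enough for ASCII input. For text that may contain non-English characters, casefold() is the more aggressive option. A minimal sketch, where is_yes is a hypothetical helper name and stripping whitespace first is an assumption about what the input looks like:

```python
def is_yes(user_input):
    """Hypothetical helper: True when the input is some spelling of 'yes'."""
    return user_input.strip().casefold() == "yes"

# lower() leaves the German sharp s alone; casefold() maps it to "ss".
lower_match = "Straße".lower() == "STRASSE".lower()            # False
casefold_match = "Straße".casefold() == "STRASSE".casefold()   # True
```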

startswith() and endswith()

filename = "report_2025.csv"
filename.endswith(".csv")      # True
filename.startswith("report")  # True

# Multiple options (pass a tuple)
filename.endswith((".csv", ".tsv", ".xlsx"))  # True

find() and index()

Both locate a substring, but they differ in error handling:

text = "Hello World"
text.find("World")   # 6  (returns index)
text.find("Python")  # -1 (returns -1 if not found)

text.index("World")  # 6  (returns index)
text.index("Python") # ValueError! (raises exception if not found)

Interview tip: Use find() when the substring might not exist; use index() when absence is an error.
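The tip above can be sketched as two small helpers. The names get_domain and require_domain are hypothetical, and the addresses are placeholders:

```python
text = "Hello World"

# Pitfall: find() returns 0 for a match at the start, and 0 is falsy,
# so compare against -1 (or use `in`) rather than testing truthiness.
position = text.find("Hello")   # 0
found = position != -1          # True
contains = "Hello" in text      # True

def get_domain(email):
    """Text after the first '@', or None when there is no '@'."""
    at = email.find("@")
    if at == -1:
        return None
    return email[at + 1:]

def require_domain(email):
    """Same lookup with index(): a missing '@' raises ValueError."""
    return email[email.index("@") + 1:]
```

When only a yes/no answer is needed, the `in` operator is the most readable choice.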

isdigit(), isalpha(), isalnum()

"12345".isdigit()   # True
"hello".isalpha()   # True
"abc123".isalnum()  # True
"hello!".isalpha()  # False

Useful for quick validation of non-negative integer strings:

# Keep only strings made entirely of digits
values = ["42", "N/A", "100", "unknown", "7"]
numeric = [v for v in values if v.isdigit()]
# ['42', '100', '7']
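isdigit() rejects signs and decimal points, so "-3.5" or "1e3" would be filtered out along with "N/A". When those values should count as numeric, one common approach is to attempt the conversion and catch the failure. A sketch, with to_number as a hypothetical helper name:

```python
def to_number(value):
    """Return value as a float, or None when it cannot be parsed."""
    try:
        # Note: float() also accepts strings such as "nan" and "inf".
        return float(value)
    except ValueError:
        return None

values = ["42", "-3.5", "N/A", "1e3", "unknown"]
parsed = [to_number(v) for v in values]
numbers = [n for n in parsed if n is not None]
# [42.0, -3.5, 1000.0]
```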

F-Strings (Formatted String Literals)

F-strings are the modern way to format strings in Python (3.6+):

name = "Alice"
score = 92.567

# Basic insertion
f"Hello, {name}!"  # "Hello, Alice!"

# Expressions
f"Score: {score:.1f}%"  # "Score: 92.6%"
f"{'Pass' if score >= 60 else 'Fail'}"  # "Pass"

# Formatting numbers
revenue = 1234567.89
f"${revenue:,.2f}"  # "$1,234,567.89"
f"{0.1543:.1%}"     # "15.4%"

Alignment and Padding

for name, score in [("Alice", 95), ("Bob", 87), ("Charlie", 92)]:
    print(f"{name:<10} {score:>5}")
# Alice         95
# Bob           87
# Charlie       92
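Hard-coded widths break when a longer value shows up. Format specs can be nested, so the width can be computed from the data. A small sketch using the same rows:

```python
rows = [("Alice", 95), ("Bob", 87), ("Charlie", 92)]

# Width of the longest name decides the first column's width.
width = max(len(name) for name, _ in rows)

# {width} inside the format spec is filled in before formatting the name.
lines = [f"{name:<{width}} {score:>3}" for name, score in rows]

for line in lines:
    print(line)
```

Every line comes out the same length, so the score column lines up regardless of the longest name.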

Regular Expressions (re module)

For complex patterns, use the re module:

import re

# Find all email addresses (the example.com addresses are placeholders)
text = "Contact us at support@example.com or sales@example.com"
emails = re.findall(r'[\w.]+@[\w.]+', text)
# ['support@example.com', 'sales@example.com']

# Extract numbers
data = "Revenue: $1,234,567 in Q1 2025"
numbers = re.findall(r'[\d,]+', data)
# ['1,234,567', '1', '2025']

# Replace patterns
text = "Call 555-1234 or 555-5678"
cleaned = re.sub(r'\d{3}-\d{4}', '[REDACTED]', text)
# "Call [REDACTED] or [REDACTED]"

Common Regex Patterns for Data Science

# Date extraction
dates = re.findall(r'\d{4}-\d{2}-\d{2}', log_text)

# URL extraction
urls = re.findall(r'https?://[\S]+', web_text)

# Clean HTML tags
clean = re.sub(r'<[^>]+>', '', html_text)

# Extract specific groups
match = re.search(r'(\d+)\s*(kg|lbs)', "Weight: 75 kg")
if match:
    value, unit = match.groups()  # ('75', 'kg')
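To try these patterns out, here is a self-contained run on made-up inputs. The sample strings are assumptions for illustration. It also shows one thing to watch for: \S+ keeps any punctuation that follows a URL.

```python
import re

# Hypothetical sample inputs.
log_text = "2025-03-15 job started; 2025-03-16 job finished"
web_text = "Docs at https://example.com/docs and http://example.org."
html_text = "<p>Hello <b>World</b></p>"

dates = re.findall(r"\d{4}-\d{2}-\d{2}", log_text)
# ['2025-03-15', '2025-03-16']

urls = re.findall(r"https?://\S+", web_text)
# ['https://example.com/docs', 'http://example.org.']  (note the final ".")

# Trim sentence punctuation that \S+ swallowed.
trimmed = [u.rstrip(".,;") for u in urls]
# ['https://example.com/docs', 'http://example.org']

clean = re.sub(r"<[^>]+>", "", html_text)
# "Hello World"
```

The tag-stripping pattern is fine for quick cleanup of simple markup; for arbitrary HTML a real parser is the safer choice.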

Real Data Cleaning Examples

Standardizing Phone Numbers

def clean_phone(phone):
    # Remove all non-digit characters and format consistently
    digits = re.sub(r'\D', '', phone)
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    return digits

phones = ["(555) 123-4567", "555.123.4567", "555-123-4567", "5551234567"]
[clean_phone(p) for p in phones]
# All become: "(555) 123-4567"

Parsing Log Files

log = "2025-03-15 14:30:22 ERROR [auth] Login failed for user_id=12345"

# Extract timestamp
timestamp = log[:19]

# Extract log level
level = log.split()[2]

# Extract user_id (re.search returns None when nothing matches)
match = re.search(r'user_id=(\d+)', log)
user_id = match.group(1) if match else None
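Slicing and split() depend on every line having exactly this layout. A sketch of a more defensive variant is a single compiled pattern with named groups. LOG_PATTERN and parse_log_line are hypothetical names, and the pattern assumes the "timestamp LEVEL [component] message" layout shown above.

```python
import re

# Assumed layout: "YYYY-MM-DD HH:MM:SS LEVEL [component] message"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"\[(?P<component>[^\]]+)\] "
    r"(?P<message>.*)"
)

def parse_log_line(line):
    """Return a dict of fields, or None when the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # user_id is optional, so a missing one becomes None instead of an error.
    user = re.search(r"user_id=(\d+)", record["message"])
    record["user_id"] = user.group(1) if user else None
    return record

record = parse_log_line(
    "2025-03-15 14:30:22 ERROR [auth] Login failed for user_id=12345"
)
```

Returning None for unparseable lines lets the caller decide whether to skip them or count them.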

Cleaning Column Names

def clean_column_name(col):
    # Convert messy column names to snake_case
    col = col.strip().lower()
    col = re.sub(r'[^a-z0-9]+', '_', col)
    col = col.strip('_')
    return col

messy_cols = ["First Name", "Last-Name", "E Mail Address", "Phone #"]
clean_cols = [clean_column_name(c) for c in messy_cols]
# ['first_name', 'last_name', 'e_mail_address', 'phone']

Working with Pandas String Methods

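import pandas as pd

# Small sample frame (made-up values) so the snippet runs on its own
df = pd.DataFrame({
    'name': ['  alice johnson ', 'BOB SMITH'],
    'email': ['alice@gmail.com', 'bob@example.com'],
    'address': ['12 Main St, Springfield 12345', '9 Oak Ave, Portland 97201'],
})

# Vectorized string operations
df['name_clean'] = df['name'].str.strip().str.title()
df['domain'] = df['email'].str.split('@').str[1]
df['has_gmail'] = df['email'].str.contains('gmail', case=False)

# Extract with regex (expand=False returns a Series instead of a DataFrame)
df['zip_code'] = df['address'].str.extract(r'(\d{5})$', expand=False)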

Common Interview Questions

Question 1: Reverse Words

Reverse the words in a string (not the characters).

def reverse_words(s):
    return " ".join(s.split()[::-1])

reverse_words("Hello World Python")  # "Python World Hello"

Question 2: Valid Palindrome

Check if a string is a palindrome, considering only alphanumeric characters.

def is_palindrome(s):
    cleaned = "".join(c.lower() for c in s if c.isalnum())
    return cleaned == cleaned[::-1]

is_palindrome("A man, a plan, a canal: Panama")  # True
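A common follow-up asks for a version that avoids building the cleaned copy. One way to do that is a two-pointer scan, sketched below; is_palindrome_two_pointer is a hypothetical name, and this is an alternative to the solution above rather than a replacement for it.

```python
def is_palindrome_two_pointer(s):
    """Compare alphanumeric characters from both ends, skipping the rest."""
    left, right = 0, len(s) - 1
    while left < right:
        if not s[left].isalnum():
            left += 1            # skip punctuation/space on the left
        elif not s[right].isalnum():
            right -= 1           # skip punctuation/space on the right
        elif s[left].lower() != s[right].lower():
            return False         # mismatch: not a palindrome
        else:
            left += 1
            right -= 1
    return True
```

Both versions run in linear time; this one uses constant extra space, at the cost of being longer to read.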

Question 3: Most Common Character

Find the most frequent character in a string.

from collections import Counter

def most_common(s):
    counts = Counter(s.replace(" ", "").lower())
    return counts.most_common(1)[0]

most_common("Hello World")  # ('l', 3)

Practice String Problems

Ready to practice? Work through our Python string method problems with real interview questions and detailed solutions.
