Python String Methods Every Data Scientist Uses
Why String Methods Matter in Data Science
Data is messy, and much of that mess is in text. Customer names with extra whitespace, dates in inconsistent formats, categorical labels with typos — string manipulation is a daily task for data scientists. In interviews, string questions test your Python fluency and your ability to clean real-world data.
Essential String Methods
strip(), lstrip(), rstrip()
Remove whitespace (or specified characters) from the edges of a string:
name = " Alice Johnson "
name.strip() # "Alice Johnson"
name.lstrip() # "Alice Johnson "
name.rstrip() # " Alice Johnson"
# Remove specific characters
price = "$99.99$"
price.strip("$") # "99.99"
Data cleaning use case: Column names often have leading/trailing spaces after reading from CSV or Excel files:
df.columns = [col.strip() for col in df.columns]
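One caveat worth knowing: the argument to strip(), lstrip(), and rstrip() is treated as a set of characters, not as a prefix or suffix. The sketch below uses a made-up filename to show the difference, and assumes Python 3.9+ for removesuffix():

```python
# strip()/rstrip() treat the argument as a set of characters, not a suffix.
filename = "data_cvs.csv"  # hypothetical filename, chosen to expose the pitfall

# Keeps removing '.', 'c', 's', 'v' from the right until another character appears
stripped = filename.rstrip(".csv")

# Removes the exact suffix once (Python 3.9+)
suffix_removed = filename.removesuffix(".csv")

print(stripped)        # data_
print(suffix_removed)  # data_cvs
```

When the goal is to drop a known prefix or suffix, removeprefix() and removesuffix() are usually the safer choice.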
split() and join()
split() breaks a string into a list; join() does the reverse:
# Split
text = "Python,SQL,Pandas,NumPy"
skills = text.split(",") # ['Python', 'SQL', 'Pandas', 'NumPy']
# Split with maxsplit
log = "2025-03-15 ERROR Database connection failed"
parts = log.split(" ", 2) # ['2025-03-15', 'ERROR', 'Database connection failed']
# Join
", ".join(skills) # "Python, SQL, Pandas, NumPy"
"-".join(["2025", "03", "15"]) # "2025-03-15"
replace()
Substitute substrings:
text = "Hello World"
text.replace("World", "Python") # "Hello Python"
# Remove characters
phone = "(555) 123-4567"
phone.replace("(", "").replace(")", "").replace("-", "").replace(" ", "")
# "5551234567"
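Chained replace() calls get unwieldy as the list of unwanted characters grows. Two common alternatives, sketched here on the same phone number, are str.translate() with a deletion table and a regex substitution:

```python
import re

phone = "(555) 123-4567"

# Option 1: str.translate with a deletion table
# (the third argument to str.maketrans lists characters to drop)
drop_punct = str.maketrans("", "", "()- ")
digits_translate = phone.translate(drop_punct)

# Option 2: a regex that removes every non-digit character
digits_regex = re.sub(r"\D", "", phone)

print(digits_translate)  # 5551234567
print(digits_regex)      # 5551234567
```

The translate() version removes only the characters you list; the regex version removes anything that is not a digit, which is more forgiving of unexpected separators.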
lower(), upper(), title(), capitalize()
name = "john DOE"
name.lower() # "john doe"
name.upper() # "JOHN DOE"
name.title() # "John Doe"
name.capitalize() # "John doe"
Interview pattern: Case-insensitive comparison:
# Fragile: misses "Yes", "YES", etc.
if user_input == "yes":
    ...
# Robust: normalize case before comparing
if user_input.lower() == "yes":
    ...
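For text that may contain non-English characters, lower() is not always enough. A small sketch, using the German word "Straße" as an illustrative input, shows how casefold() handles a case that lower() misses:

```python
a = "Straße"
b = "STRASSE"

# lower() leaves the "ß" untouched, so the strings still differ
lower_equal = a.lower() == b.lower()

# casefold() applies more aggressive Unicode folding ("ß" becomes "ss")
casefold_equal = a.casefold() == b.casefold()

print(lower_equal)     # False
print(casefold_equal)  # True
```

For plain ASCII input the two methods behave the same, so lower() remains fine for simple cases like the "yes" check above.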
startswith() and endswith()
filename = "report_2025.csv"
filename.endswith(".csv") # True
filename.startswith("report") # True
# Multiple options (pass a tuple)
filename.endswith((".csv", ".tsv", ".xlsx")) # True
find() and index()
Both locate a substring, but they differ in error handling:
text = "Hello World"
text.find("World") # 6 (returns index)
text.find("Python") # -1 (returns -1 if not found)
text.index("World") # 6 (returns index)
text.index("Python") # ValueError! (raises exception if not found)
Interview tip: Use find() when the substring might not exist; use index() when absence is an error.
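A common mistake is writing `if text.find(x):`, which is truthy for -1 (not found) and falsy for 0 (found at the start). The sketch below compares against -1 explicitly; text_after is a hypothetical helper name, not a built-in:

```python
def text_after(text, marker):
    """Return the part of text after the first marker, or None if absent."""
    pos = text.find(marker)
    if pos == -1:  # find() signals "not found" with -1 instead of raising
        return None
    return text[pos + len(marker):]

print(text_after("user_id=12345", "="))      # 12345
print(text_after("no delimiter here", "="))  # None

# For a plain membership test, the `in` operator is clearer than find()
print("World" in "Hello World")  # True
```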
isdigit(), isalpha(), isalnum()
"12345".isdigit() # True
"hello".isalpha() # True
"abc123".isalnum() # True
"hello!".isalpha() # False
Useful for input validation:
# Filter to only numeric strings
values = ["42", "N/A", "100", "unknown", "7"]
numeric = [v for v in values if v.isdigit()]
# ['42', '100', '7']
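Note that isdigit() rejects negative numbers and decimals, because "-" and "." are not digits. If those should count as numeric, one option is to attempt the conversion and catch the failure. The helper name is_number below is an assumption made for illustration:

```python
def is_number(value):
    """Return True if value parses as a float (hypothetical helper)."""
    try:
        float(value)
    except ValueError:
        return False
    return True

values = ["42", "-3.5", "N/A", "1e3", "unknown", "7"]

digits_only = [v for v in values if v.isdigit()]
parseable = [v for v in values if is_number(v)]

print(digits_only)  # ['42', '7']
print(parseable)    # ['42', '-3.5', '1e3', '7']
```

Be aware that float() also accepts strings such as "nan" and "inf", so add an extra check if those can appear in your data.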
F-Strings (Formatted String Literals)
F-strings are the modern way to format strings in Python (3.6+):
name = "Alice"
score = 92.567
# Basic insertion
f"Hello, {name}!" # "Hello, Alice!"
# Expressions
f"Score: {score:.1f}%" # "Score: 92.6%"
f"{'Pass' if score >= 60 else 'Fail'}" # "Pass"
# Formatting numbers
revenue = 1234567.89
f"${revenue:,.2f}" # "$1,234,567.89"
f"{0.1543:.1%}" # "15.4%"
Alignment and Padding
for name, score in [("Alice", 95), ("Bob", 87), ("Charlie", 92)]:
    print(f"{name:<10} {score:>5}")
# Alice         95
# Bob           87
# Charlie       92
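Hard-coded widths break when a longer value shows up. As a sketch, the width can be computed from the data and passed into the format spec with nested braces:

```python
rows = [("Alice", 95), ("Bob", 87), ("Charlie", 92)]

# Width of the longest name, so every row lines up
width = max(len(name) for name, _ in rows)

# Nested braces let the width come from a variable
lines = [f"{name:<{width}} {score:>3}" for name, score in rows]
print("\n".join(lines))
```

Every line comes out the same length, with names left-aligned and scores right-aligned.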
Regular Expressions (re module)
For complex patterns, use the re module:
import re
# Find all email addresses
text = "Contact us at support@example.com or sales@example.com"
emails = re.findall(r'[\w.]+@[\w.]+', text)
# ['support@example.com', 'sales@example.com']
# Extract numbers
data = "Revenue: $1,234,567 in Q1 2025"
numbers = re.findall(r'[\d,]+', data)
# ['1,234,567', '1', '2025']
# Replace patterns
text = "Call 555-1234 or 555-5678"
cleaned = re.sub(r'\d{3}-\d{4}', '[REDACTED]', text)
# "Call [REDACTED] or [REDACTED]"
Common Regex Patterns for Data Science
# Date extraction
dates = re.findall(r'\d{4}-\d{2}-\d{2}', log_text)
# URL extraction
urls = re.findall(r'https?://[\S]+', web_text)
# Clean HTML tags
clean = re.sub(r'<[^>]+>', '', html_text)
# Extract specific groups
match = re.search(r'(\d+)\s*(kg|lbs)', "Weight: 75 kg")
if match:
    value, unit = match.groups() # ('75', 'kg')
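When a pattern is reused or has several groups, compiling it once and naming the groups tends to make the code easier to read. This is a sketch only; the parse_weight helper and the decimal-number support are assumptions added for illustration:

```python
import re

# Named groups document what each part of the pattern captures
WEIGHT = re.compile(r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>kg|lbs)")

def parse_weight(text):
    """Return (value, unit) from text, or None if no weight is found."""
    match = WEIGHT.search(text)
    if match is None:
        return None
    return float(match.group("value")), match.group("unit")

print(parse_weight("Weight: 75 kg"))     # (75.0, 'kg')
print(parse_weight("Weight: 165.5lbs"))  # (165.5, 'lbs')
print(parse_weight("Weight: unknown"))   # None
```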
Real Data Cleaning Examples
Standardizing Phone Numbers
def clean_phone(phone):
    # Remove all non-digit characters and format consistently
    digits = re.sub(r'\D', '', phone)
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    return digits
phones = ["(555) 123-4567", "555.123.4567", "555-123-4567", "5551234567"]
[clean_phone(p) for p in phones]
# All become: "(555) 123-4567"
Parsing Log Files
log = "2025-03-15 14:30:22 ERROR [auth] Login failed for user_id=12345"
# Extract timestamp
timestamp = log[:19]
# Extract log level
level = log.split()[2]
# Extract user_id
user_id = re.search(r'user_id=(\d+)', log).group(1)
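Slicing and split() work when every line has exactly this shape, but re.search(...).group(1) raises AttributeError on a line without a user_id. A more defensive sketch parses the whole line with one pattern and returns None on a mismatch. The log layout is assumed from the single example line above, and parse_log_line is a hypothetical helper:

```python
import re

# Assumed layout: "<date> <time> <LEVEL> [<component>] <message>"
LOG_LINE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"\[(?P<component>\w+)\] "
    r"(?P<message>.*)"
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # user_id is optional, so guard against a missing match
    user = re.search(r"user_id=(\d+)", record["message"])
    record["user_id"] = user.group(1) if user else None
    return record

log = "2025-03-15 14:30:22 ERROR [auth] Login failed for user_id=12345"
record = parse_log_line(log)
print(record["level"], record["component"], record["user_id"])  # ERROR auth 12345
```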
Cleaning Column Names
def clean_column_name(col):
    # Convert messy column names to snake_case
    col = col.strip().lower()
    col = re.sub(r'[^a-z0-9]+', '_', col)
    col = col.strip('_')
    return col
messy_cols = ["First Name", "Last-Name", "E Mail Address", "Phone #"]
clean_cols = [clean_column_name(c) for c in messy_cols]
# ['first_name', 'last_name', 'e_mail_address', 'phone']
Working with Pandas String Methods
import pandas as pd
# Vectorized string operations
df['name_clean'] = df['name'].str.strip().str.title()
df['domain'] = df['email'].str.split('@').str[1]
df['has_gmail'] = df['email'].str.contains('gmail', case=False)
# Extract with regex
df['zip_code'] = df['address'].str.extract(r'(\d{5})$')
Common Interview Questions
Question 1: Reverse Words
Reverse the words in a string (not the characters).
def reverse_words(s):
    return " ".join(s.split()[::-1])
reverse_words("Hello World Python") # "Python World Hello"
Question 2: Valid Palindrome
Check if a string is a palindrome, considering only alphanumeric characters.
def is_palindrome(s):
    cleaned = "".join(c.lower() for c in s if c.isalnum())
    return cleaned == cleaned[::-1]
is_palindrome("A man, a plan, a canal: Panama") # True
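Interviewers often follow up by asking for a version that avoids building a cleaned copy of the string. One possible answer, sketched here under the same "alphanumeric only, case-insensitive" rules, uses two pointers that move toward each other:

```python
def is_palindrome_two_pointers(s):
    """Check for a palindrome in place, skipping non-alphanumeric characters."""
    left, right = 0, len(s) - 1
    while left < right:
        if not s[left].isalnum():
            left += 1   # skip punctuation/space on the left
        elif not s[right].isalnum():
            right -= 1  # skip punctuation/space on the right
        elif s[left].lower() != s[right].lower():
            return False
        else:
            left += 1
            right -= 1
    return True

print(is_palindrome_two_pointers("A man, a plan, a canal: Panama"))  # True
print(is_palindrome_two_pointers("race a car"))                      # False
```

This runs in O(n) time with O(1) extra space, compared with the O(n) extra space of the slicing approach.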
Question 3: Most Common Character
Find the most frequent character in a string.
from collections import Counter
def most_common(s):
    counts = Counter(s.replace(" ", "").lower())
    return counts.most_common(1)[0]
most_common("Hello World") # ('l', 3)
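A likely follow-up is how ties are handled. Counter.most_common() returns tied elements in the order they were first encountered, so the answer depends on the input order. If a deterministic tie-break is wanted, one sketch (the function name is hypothetical) is to pick the highest count and break ties alphabetically:

```python
from collections import Counter

def most_common_sorted(s):
    """Most frequent character; ties are broken alphabetically."""
    counts = Counter(s.replace(" ", "").lower())
    # Sort key: highest count first, then smallest character
    return min(counts.items(), key=lambda item: (-item[1], item[0]))

print(most_common_sorted("bbaa"))         # ('a', 2)
print(most_common_sorted("Hello World"))  # ('l', 3)
```

Like the version above, this raises an error on an empty string, so check for that case if it can occur.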
Practice String Problems
Ready to practice? Work through our Python string method problems with real interview questions and detailed solutions.