Pandas GroupBy: The Complete Tutorial
GroupBy Is Essential for Data Science
The pandas groupby() operation is the Python equivalent of SQL's GROUP BY — and it's tested just as frequently in interviews. Whether you're at a whiteboard or in a Jupyter notebook, you need to be fluent with split-apply-combine operations.
The Split-Apply-Combine Pattern
Every groupby operation follows three steps:
- Split — divide the DataFrame into groups based on column values
- Apply — perform a computation on each group
- Combine — merge the results back together
import pandas as pd
# Sample data
df = pd.DataFrame({
'department': ['Sales', 'Sales', 'Engineering', 'Engineering', 'Sales'],
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'salary': [70000, 65000, 95000, 92000, 68000]
})
# Split-Apply-Combine
result = df.groupby('department')['salary'].mean()
# department
# Engineering 93500.0
# Sales 67666.666667
Basic Aggregation
Single Column, Single Aggregation
df.groupby('department')['salary'].sum()
df.groupby('department')['salary'].mean()
df.groupby('department')['salary'].count()
Multiple Aggregations with .agg()
summary = (df.groupby('department')['salary']
.agg(['mean', 'min', 'max', 'count']))
Named Aggregations (The Modern Way)
This is the cleanest syntax and what interviewers prefer to see:
summary = (df.groupby('department')
.agg(
avg_salary=('salary', 'mean'),
headcount=('name', 'count'),
max_salary=('salary', 'max'),
min_salary=('salary', 'min')
)
.reset_index())
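As a quick runnable check of the named-aggregation syntax, here is a trimmed-down version using the sample df from earlier (only two of the four aggregations, to keep it short):

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'Engineering', 'Engineering', 'Sales'],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'salary': [70000, 65000, 95000, 92000, 68000],
})

# Named aggregation: each keyword becomes a column in the result
summary = (df.groupby('department')
             .agg(avg_salary=('salary', 'mean'),
                  headcount=('name', 'count'))
             .reset_index())
```

After reset_index(), summary is a flat DataFrame with columns department, avg_salary, and headcount — no MultiIndex to untangle.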
Custom Aggregation Functions
# Salary range per department
summary = (df.groupby('department')
.agg(
salary_range=('salary', lambda x: x.max() - x.min()),
median_salary=('salary', 'median')
))
Grouping by Multiple Columns
df.groupby(['department', 'level'])['salary'].mean()
# With reset_index for a flat DataFrame
result = (df.groupby(['region', 'product_category'])
.agg(total_revenue=('revenue', 'sum'),
order_count=('order_id', 'count'))
.reset_index()
.sort_values('total_revenue', ascending=False))
Transform: Group Stats Without Collapsing
transform() returns a Series the same length as the input — perfect for adding group-level statistics to each row:
# Add department average salary to each row
df['dept_avg'] = df.groupby('department')['salary'].transform('mean')
# Calculate how much each person deviates from their department average
df['salary_vs_dept'] = df['salary'] - df.groupby('department')['salary'].transform('mean')
# Normalize within groups (z-score)
df['salary_zscore'] = df.groupby('department')['salary'].transform(
lambda x: (x - x.mean()) / x.std()
)
Interview tip: transform is how you do window functions in pandas. If an interviewer asks for "the equivalent of OVER (PARTITION BY ...)", use transform.
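A minimal sketch of that correspondence, using a small made-up DataFrame — the commented SQL is the query the pandas line replicates:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'Engineering'],
    'salary': [70000, 65000, 95000],
})

# SQL equivalent:
#   SELECT *, AVG(salary) OVER (PARTITION BY department) AS dept_avg
#   FROM employees
df['dept_avg'] = df.groupby('department')['salary'].transform('mean')
```

Like a window function, transform leaves the row count unchanged and broadcasts the group statistic back onto every row.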
Filter: Keep or Remove Entire Groups
filter() keeps or removes entire groups based on a group-level condition:
# Keep only departments with more than 3 employees
large_depts = df.groupby('department').filter(lambda x: len(x) > 3)
# Keep only departments where average salary > 80000
high_paying = df.groupby('department').filter(lambda x: x['salary'].mean() > 80000)
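Note that with the sample df from earlier, `len(x) > 3` would keep nothing (Sales has 3 rows, Engineering 2). A runnable sketch with the threshold adjusted to `>= 3` so one group survives:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'Engineering', 'Engineering', 'Sales'],
    'salary': [70000, 65000, 95000, 92000, 68000],
})

# Keep departments with at least 3 employees -- only Sales qualifies here
large = df.groupby('department').filter(lambda x: len(x) >= 3)
```

Unlike aggregation, filter returns the original rows (same columns, original index) for the groups that pass the test.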
Real-World Interview Patterns
Pattern 1: Rank Within Groups
# Top 3 earners per department
df['rank'] = df.groupby('department')['salary'].rank(ascending=False, method='dense')
top_3 = df[df['rank'] <= 3].sort_values(['department', 'rank'])
Pattern 2: Cumulative Sum Within Groups
# Running total of sales per salesperson
df = df.sort_values('date')
df['cumulative_sales'] = df.groupby('salesperson')['amount'].cumsum()
Pattern 3: Percentage of Group Total
# Each person's share of department salary budget
df['dept_total'] = df.groupby('department')['salary'].transform('sum')
df['salary_pct'] = (df['salary'] / df['dept_total'] * 100).round(1)
Pattern 4: First/Last Per Group
# Most recent order per customer
latest = (df.sort_values('order_date')
.groupby('customer_id')
.last()
.reset_index())
# Or using idxmax
idx = df.groupby('customer_id')['order_date'].idxmax()
latest = df.loc[idx]
Pattern 5: Time-Based Grouping
# Monthly revenue
df['month'] = df['date'].dt.to_period('M')
monthly = df.groupby('month')['revenue'].sum()
# Weekly active users
df['week'] = df['login_date'].dt.isocalendar().week
wau = df.groupby('week')['user_id'].nunique()
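An alternative for the monthly case is pd.Grouper, which buckets by frequency without creating a helper column (the 'MS' alias labels each bucket with the month start; data here is made up):

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-05', '2024-01-20', '2024-02-03']),
    'revenue': [100, 200, 50],
})

# Group by calendar month directly on the datetime column
monthly = df.groupby(pd.Grouper(key='date', freq='MS'))['revenue'].sum()
```

This keeps a proper DatetimeIndex on the result, which is handy for plotting or further resampling.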
Common Mistakes
- Forgetting reset_index() — groupby results have the group columns as the index
- Using apply when agg works — apply is slower and less readable for standard aggregations
- Not handling NaN — groupby excludes NaN keys by default (use dropna=False to include them)
- Modifying the original DataFrame — use .copy() if you need to preserve the original
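The NaN-key pitfall is easy to demonstrate — rows whose group key is missing vanish from the result unless you ask for them:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', None, 'Sales'],
    'salary': [70000, 65000, 68000],
})

# Default behavior: the row with a NaN group key is silently dropped
by_default = df.groupby('department')['salary'].sum()

# dropna=False keeps the NaN key as its own group
with_nan = df.groupby('department', dropna=False)['salary'].sum()
```

If totals from a groupby don't match the column's overall sum, a dropped NaN group is one of the first things to check.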
GroupBy vs SQL Comparison
| SQL | Pandas |
|---|---|
| GROUP BY col | df.groupby('col') |
| COUNT(*) | .size() or ('col', 'count') |
| COUNT(DISTINCT col) | .nunique() |
| HAVING COUNT(*) > 5 | .filter(lambda x: len(x) > 5) |
| Window function | .transform() |
Practice Pandas Problems
Sharpen your pandas skills with our pandas interview problems — real questions from top data science interviews with detailed solutions.