
Pandas GroupBy: The Complete Tutorial

GroupBy Is Essential for Data Science

The pandas groupby() operation is the Python equivalent of SQL's GROUP BY — and it's tested just as frequently in interviews. Whether you're at a whiteboard or in a Jupyter notebook, you need to be fluent with split-apply-combine operations.

The Split-Apply-Combine Pattern

Every groupby operation follows three steps:

  1. Split — divide the DataFrame into groups based on column values
  2. Apply — perform a computation on each group
  3. Combine — merge the results back together

Here is the pattern end to end:

import pandas as pd

# Sample data
df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'Engineering', 'Engineering', 'Sales'],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'salary': [70000, 65000, 95000, 92000, 68000]
})

# Split-Apply-Combine
result = df.groupby('department')['salary'].mean()
# department
# Engineering    93500.000000
# Sales          67666.666667
# Name: salary, dtype: float64

Basic Aggregation

Single Column, Single Aggregation

df.groupby('department')['salary'].sum()
df.groupby('department')['salary'].mean()
df.groupby('department')['salary'].count()
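One subtlety worth knowing before an interview: .count() counts only non-null values in the selected column, while .size() counts every row in the group. A minimal sketch (with a deliberately missing salary):

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'Engineering'],
    'salary': [70000, None, 95000],
})

# count() skips NaN in the aggregated column
counts = df.groupby('department')['salary'].count()

# size() counts all rows in each group, NaN or not
sizes = df.groupby('department').size()
```

Here `counts` reports one Sales salary (the NaN is skipped) while `sizes` reports two Sales rows.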

Multiple Aggregations with .agg()

summary = (df.groupby('department')['salary']
             .agg(['mean', 'min', 'max', 'count']))

Named Aggregations (The Modern Way)

This is the cleanest syntax and what interviewers prefer to see:

summary = (df.groupby('department')
             .agg(
                 avg_salary=('salary', 'mean'),
                 headcount=('name', 'count'),
                 max_salary=('salary', 'max'),
                 min_salary=('salary', 'min')
             )
             .reset_index())

Custom Aggregation Functions

# Salary range per department
summary = (df.groupby('department')
             .agg(
                 salary_range=('salary', lambda x: x.max() - x.min()),
                 median_salary=('salary', 'median')
             ))

Grouping by Multiple Columns

# Assuming the DataFrame also has a 'level' column
df.groupby(['department', 'level'])['salary'].mean()

# With reset_index for a flat DataFrame
result = (df.groupby(['region', 'product_category'])
            .agg(total_revenue=('revenue', 'sum'),
                 order_count=('order_id', 'count'))
            .reset_index()
            .sort_values('total_revenue', ascending=False))

Transform: Group Stats Without Collapsing

transform() returns a Series the same length as the input — perfect for adding group-level statistics to each row:

# Add department average salary to each row
df['dept_avg'] = df.groupby('department')['salary'].transform('mean')

# Calculate how much each person deviates from their department average
df['salary_vs_dept'] = df['salary'] - df.groupby('department')['salary'].transform('mean')

# Normalize within groups (z-score)
df['salary_zscore'] = df.groupby('department')['salary'].transform(
    lambda x: (x - x.mean()) / x.std()
)

Interview tip: transform is how you do window functions in pandas. If an interviewer asks for "the equivalent of OVER (PARTITION BY ...)", use transform.
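For instance, SQL's AVG(salary) OVER (PARTITION BY department) maps directly onto transform('mean') — every row keeps its own identity but gains a group-level statistic:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'Engineering'],
    'salary': [70000, 65000, 95000],
})

# SQL: SELECT salary, AVG(salary) OVER (PARTITION BY department) ...
df['dept_avg'] = df.groupby('department')['salary'].transform('mean')
```

The result has the same number of rows as the input, exactly like a window function and unlike a GROUP BY aggregate.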

Filter: Keep or Remove Entire Groups

filter() keeps or removes entire groups based on a group-level condition:

# Keep only departments with more than 3 employees
large_depts = df.groupby('department').filter(lambda x: len(x) > 3)

# Keep only departments where average salary > 80000
high_paying = df.groupby('department').filter(lambda x: x['salary'].mean() > 80000)
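Note that filter() returns the original rows, not aggregates. On large data, the same result can often be had faster with a transform-based boolean mask — a quick sketch comparing the two:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'Engineering'],
    'salary': [70000, 65000, 95000],
})

# filter(): runs the lambda once per group, keeps whole groups
high_paying = df.groupby('department').filter(lambda x: x['salary'].mean() > 80000)

# transform + boolean mask: same rows, usually faster on many groups
mask = df.groupby('department')['salary'].transform('mean') > 80000
high_paying2 = df[mask]
```

Both keep only the Engineering rows here; the mask version avoids a Python-level call per group.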

Real-World Interview Patterns

Pattern 1: Rank Within Groups

# Top 3 earners per department
df['rank'] = df.groupby('department')['salary'].rank(ascending=False, method='dense')
top_3 = df[df['rank'] <= 3].sort_values(['department', 'rank'])

Pattern 2: Cumulative Sum Within Groups

# Running total of sales per salesperson
df = df.sort_values('date')
df['cumulative_sales'] = df.groupby('salesperson')['amount'].cumsum()
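With a toy sales table (the salesperson/date/amount columns are illustrative, not from the sample df above), the running total behaves like this:

```python
import pandas as pd

sales = pd.DataFrame({
    'salesperson': ['Ann', 'Ann', 'Bob'],
    'date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-01']),
    'amount': [10, 20, 5],
})

# Sort first so the cumulative sum follows chronological order
sales = sales.sort_values('date')
sales['cumulative_sales'] = sales.groupby('salesperson')['amount'].cumsum()
```

Ann's second sale shows a running total of 30 (10 + 20), while Bob's single sale stays at 5 — each group accumulates independently.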

Pattern 3: Percentage of Group Total

# Each person's share of department salary budget
df['dept_total'] = df.groupby('department')['salary'].transform('sum')
df['salary_pct'] = (df['salary'] / df['dept_total'] * 100).round(1)

Pattern 4: First/Last Per Group

# Most recent order per customer
latest = (df.sort_values('order_date')
            .groupby('customer_id')
            .last()
            .reset_index())

# Or using idxmax
idx = df.groupby('customer_id')['order_date'].idxmax()
latest = df.loc[idx]
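A caveat worth knowing: GroupBy.last() takes the last non-null value of each column independently, so with missing data it can stitch together values from different rows. The idxmax approach always keeps each group's winning row intact. A sketch with hypothetical order columns:

```python
import pandas as pd

orders = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'order_date': pd.to_datetime(['2024-01-05', '2024-02-01', '2024-01-20']),
    'amount': [50, 75, 30],
})

# idxmax returns the row label of each group's latest order,
# then .loc pulls those entire rows back out
idx = orders.groupby('customer_id')['order_date'].idxmax()
latest = orders.loc[idx]
```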

Pattern 5: Time-Based Grouping

# Monthly revenue
df['month'] = df['date'].dt.to_period('M')
monthly = df.groupby('month')['revenue'].sum()

# Weekly active users
df['week'] = df['login_date'].dt.isocalendar().week
wau = df.groupby('week')['user_id'].nunique()
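An alternative to creating a period column by hand is pd.Grouper, which groups by time buckets directly. A minimal sketch with assumed date/revenue columns:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-05', '2024-01-20', '2024-02-03']),
    'revenue': [100, 200, 300],
})

# 'MS' buckets rows by calendar month (labelled at month start)
monthly = df.groupby(pd.Grouper(key='date', freq='MS'))['revenue'].sum()
```

This keeps a proper DatetimeIndex on the result, which is handy for plotting or further resampling.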

Common Mistakes

  1. Forgetting reset_index() — groupby results have the group columns as the index
  2. Using apply when agg works — apply is slower and less readable for standard aggregations
  3. Not handling NaN — groupby excludes NaN keys by default (use dropna=False to include them)
  4. Modifying the original DataFrame — use .copy() if you need to preserve the original
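Mistake 3 is easy to demonstrate — a group key of NaN silently disappears unless you ask for it:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', None, 'Engineering'],
    'salary': [70000, 65000, 95000],
})

# Default: the row with a missing department vanishes from the result
default = df.groupby('department')['salary'].sum()

# dropna=False keeps NaN as its own group (pandas >= 1.1)
with_nan = df.groupby('department', dropna=False)['salary'].sum()
```

If row counts matter (revenue totals, headcounts), the silent drop is a real bug, not a cosmetic one.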

GroupBy vs SQL Comparison

SQL                      Pandas
-----------------------  -----------------------------------
GROUP BY col             df.groupby('col')
COUNT(*)                 .size() or ('col', 'count')
COUNT(DISTINCT col)      .nunique()
HAVING COUNT(*) > 5      .filter(lambda x: len(x) > 5)
Window function          .transform()

Practice Pandas Problems

Sharpen your pandas skills with our pandas interview problems — real questions from top data science interviews with detailed solutions.
