
Pandas GroupBy: The Complete Tutorial

GroupBy Is Essential for Data Science

The pandas groupby() operation is the Python equivalent of SQL's GROUP BY — and it's tested just as frequently in interviews. Whether you're at a whiteboard or in a Jupyter notebook, you need to be fluent with split-apply-combine operations.

The Split-Apply-Combine Pattern

Every groupby operation follows three steps:

  1. Split — divide the DataFrame into groups based on column values
  2. Apply — perform a computation on each group
  3. Combine — merge the results back together

Here is the pattern end to end:

import pandas as pd

# Sample data
df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'Engineering', 'Engineering', 'Sales'],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'salary': [70000, 65000, 95000, 92000, 68000]
})

# Split-Apply-Combine
result = df.groupby('department')['salary'].mean()
# department
# Engineering    93500.000000
# Sales          67666.666667
# Name: salary, dtype: float64

Basic Aggregation

Single Column, Single Aggregation

df.groupby('department')['salary'].sum()
df.groupby('department')['salary'].mean()
df.groupby('department')['salary'].count()
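One subtlety worth knowing before an interview: .count() counts only non-null values in the selected column, while .size() counts every row in the group. A minimal sketch (with a deliberately missing salary):

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'Engineering'],
    'salary': [70000, None, 95000],
})

# count() skips NaN in the aggregated column
counts = df.groupby('department')['salary'].count()

# size() counts all rows in each group, NaN or not
sizes = df.groupby('department').size()
```

Here `counts` reports one Sales salary (the NaN is skipped) while `sizes` reports two Sales rows.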

Multiple Aggregations with .agg()

summary = (df.groupby('department')['salary']
             .agg(['mean', 'min', 'max', 'count']))

Named Aggregations (The Modern Way)

This is the cleanest syntax and what interviewers prefer to see:

summary = (df.groupby('department')
             .agg(
                 avg_salary=('salary', 'mean'),
                 headcount=('name', 'count'),
                 max_salary=('salary', 'max'),
                 min_salary=('salary', 'min')
             )
             .reset_index())

Custom Aggregation Functions

# Salary range per department
summary = (df.groupby('department')
             .agg(
                 salary_range=('salary', lambda x: x.max() - x.min()),
                 median_salary=('salary', 'median')
             ))

Grouping by Multiple Columns

# Assuming the DataFrame also has a 'level' column
df.groupby(['department', 'level'])['salary'].mean()

# With reset_index for a flat DataFrame
result = (df.groupby(['region', 'product_category'])
            .agg(total_revenue=('revenue', 'sum'),
                 order_count=('order_id', 'count'))
            .reset_index()
            .sort_values('total_revenue', ascending=False))

Transform: Group Stats Without Collapsing

transform() returns a Series the same length as the input — perfect for adding group-level statistics to each row:

# Add department average salary to each row
df['dept_avg'] = df.groupby('department')['salary'].transform('mean')

# Calculate how much each person deviates from their department average
df['salary_vs_dept'] = df['salary'] - df.groupby('department')['salary'].transform('mean')

# Normalize within groups (z-score)
df['salary_zscore'] = df.groupby('department')['salary'].transform(
    lambda x: (x - x.mean()) / x.std()
)

Interview tip: transform is how you do window functions in pandas. If an interviewer asks for "the equivalent of OVER (PARTITION BY ...)", use transform.
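For instance, SQL's AVG(salary) OVER (PARTITION BY department) maps directly onto transform('mean') — every row keeps its own identity but gains a group-level statistic:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'Engineering'],
    'salary': [70000, 65000, 95000],
})

# SQL: SELECT salary, AVG(salary) OVER (PARTITION BY department) ...
df['dept_avg'] = df.groupby('department')['salary'].transform('mean')
```

The result has the same number of rows as the input, exactly like a window function and unlike a GROUP BY aggregate.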

Filter: Keep or Remove Entire Groups

filter() keeps or removes entire groups based on a group-level condition:

# Keep only departments with more than 3 employees
large_depts = df.groupby('department').filter(lambda x: len(x) > 3)

# Keep only departments where average salary > 80000
high_paying = df.groupby('department').filter(lambda x: x['salary'].mean() > 80000)
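Note that filter() returns the original rows, not aggregates. On large data, the same result can often be had faster with a transform-based boolean mask — a quick sketch comparing the two:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'Engineering'],
    'salary': [70000, 65000, 95000],
})

# filter(): runs the lambda once per group, keeps whole groups
high_paying = df.groupby('department').filter(lambda x: x['salary'].mean() > 80000)

# transform + boolean mask: same rows, usually faster on many groups
mask = df.groupby('department')['salary'].transform('mean') > 80000
high_paying2 = df[mask]
```

Both keep only the Engineering rows here; the mask version avoids a Python-level call per group.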

Real-World Interview Patterns

Pattern 1: Rank Within Groups

# Top 3 earners per department
df['rank'] = df.groupby('department')['salary'].rank(ascending=False, method='dense')
top_3 = df[df['rank'] <= 3].sort_values(['department', 'rank'])

Pattern 2: Cumulative Sum Within Groups

# Running total of sales per salesperson
df = df.sort_values('date')
df['cumulative_sales'] = df.groupby('salesperson')['amount'].cumsum()
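With a toy sales table (the salesperson/date/amount columns are illustrative, not from the sample df above), the running total behaves like this:

```python
import pandas as pd

sales = pd.DataFrame({
    'salesperson': ['Ann', 'Ann', 'Bob'],
    'date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-01']),
    'amount': [10, 20, 5],
})

# Sort first so the cumulative sum follows chronological order
sales = sales.sort_values('date')
sales['cumulative_sales'] = sales.groupby('salesperson')['amount'].cumsum()
```

Ann's second sale shows a running total of 30 (10 + 20), while Bob's single sale stays at 5 — each group accumulates independently.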

Pattern 3: Percentage of Group Total

# Each person's share of department salary budget
df['dept_total'] = df.groupby('department')['salary'].transform('sum')
df['salary_pct'] = (df['salary'] / df['dept_total'] * 100).round(1)

Pattern 4: First/Last Per Group

# Most recent order per customer
latest = (df.sort_values('order_date')
            .groupby('customer_id')
            .last()
            .reset_index())

# Or using idxmax
idx = df.groupby('customer_id')['order_date'].idxmax()
latest = df.loc[idx]
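A caveat worth knowing: GroupBy.last() takes the last non-null value of each column independently, so with missing data it can stitch together values from different rows. The idxmax approach always keeps each group's winning row intact. A sketch with hypothetical order columns:

```python
import pandas as pd

orders = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'order_date': pd.to_datetime(['2024-01-05', '2024-02-01', '2024-01-20']),
    'amount': [50, 75, 30],
})

# idxmax returns the row label of each group's latest order,
# then .loc pulls those entire rows back out
idx = orders.groupby('customer_id')['order_date'].idxmax()
latest = orders.loc[idx]
```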

Pattern 5: Time-Based Grouping

# Monthly revenue
df['month'] = df['date'].dt.to_period('M')
monthly = df.groupby('month')['revenue'].sum()

# Weekly active users
df['week'] = df['login_date'].dt.isocalendar().week
wau = df.groupby('week')['user_id'].nunique()
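An alternative to creating a period column by hand is pd.Grouper, which groups by time buckets directly. A minimal sketch with assumed date/revenue columns:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-05', '2024-01-20', '2024-02-03']),
    'revenue': [100, 200, 300],
})

# 'MS' buckets rows by calendar month (labelled at month start)
monthly = df.groupby(pd.Grouper(key='date', freq='MS'))['revenue'].sum()
```

This keeps a proper DatetimeIndex on the result, which is handy for plotting or further resampling.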

Common Mistakes

  1. Forgetting reset_index() — groupby results have the group columns as the index
  2. Using apply when agg works — apply is slower and less readable for standard aggregations
  3. Not handling NaN — groupby excludes NaN keys by default (use dropna=False to include them)
  4. Modifying the original DataFrame — use .copy() if you need to preserve the original
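Mistake 3 is easy to demonstrate — a group key of NaN silently disappears unless you ask for it:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', None, 'Engineering'],
    'salary': [70000, 65000, 95000],
})

# Default: the row with a missing department vanishes from the result
default = df.groupby('department')['salary'].sum()

# dropna=False keeps NaN as its own group (pandas >= 1.1)
with_nan = df.groupby('department', dropna=False)['salary'].sum()
```

If row counts matter (revenue totals, headcounts), the silent drop is a real bug, not a cosmetic one.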

GroupBy vs SQL Comparison

SQL                      Pandas
-----------------------  -----------------------------------
GROUP BY col             df.groupby('col')
COUNT(*)                 .size() or ('col', 'count')
COUNT(DISTINCT col)      .nunique()
HAVING COUNT(*) > 5      .filter(lambda x: len(x) > 5)
Window function          .transform()

Practice Pandas Problems

Sharpen your pandas skills with our pandas interview problems — real questions from top data science interviews with detailed solutions.
