Machine Learning Interview Questions: A Practical Guide
How ML Interviews Work
Machine learning interviews for data science roles focus on intuition and practical understanding — not mathematical proofs. You need to explain concepts clearly, know the trade-offs between approaches, and demonstrate that you can make sound modeling decisions.
The Bias-Variance Tradeoff
This is the single most important concept in ML interviews.
Bias — error from overly simplistic assumptions. A linear model fitting a curved relationship has high bias (underfitting).
Variance — error from sensitivity to training data fluctuations. A deep decision tree that memorizes training data has high variance (overfitting).
The tradeoff: As model complexity increases, bias decreases but variance increases. The sweet spot minimizes total error.
Interview question: Your model has 95% training accuracy but 60% test accuracy. What's happening and how do you fix it?
Answer: High variance (overfitting). Solutions:
- More training data
- Regularization (L1/L2)
- Reduce model complexity (fewer features, shallower trees)
- Cross-validation for hyperparameter tuning
- Ensemble methods (bagging reduces variance)
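The train/test gap above can be reproduced in miniature. This is a rough numpy sketch on synthetic sine data (the noise level and polynomial degrees are illustrative assumptions): a high-degree polynomial shows the overfitting signature of a near-zero training error paired with a much larger test error.

```python
import numpy as np

# Fit polynomials of two degrees to noisy samples of a sine curve and
# compare train vs. test error. Degree 15 has enough capacity to chase
# the noise in the training set -- the high-variance regime.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
x_test = np.linspace(0.02, 0.98, 20)
truth = lambda x: np.sin(2 * np.pi * x)
y_train = truth(x_train) + rng.normal(0, 0.3, x_train.size)
y_test = truth(x_test) + rng.normal(0, 0.3, x_test.size)

def errors(degree):
    # Train on the training points only, then score both sets with MSE.
    coeffs = np.polyfit(x_train, y_train, degree)
    mse = lambda x, y: np.mean((np.polyval(coeffs, x) - y) ** 2)
    return mse(x_train, y_train), mse(x_test, y_test)

train_lo, test_lo = errors(1)    # high bias: a line underfits the sine
train_hi, test_hi = errors(15)   # high variance: train error collapses
```

The fix from the answer above amounts to moving back down the complexity axis (or adding data/regularization) until the two errors converge.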
Model Selection
When to Use What
| Algorithm | Best For | Pros | Cons |
|---|---|---|---|
| Logistic Regression | Binary classification, interpretability | Fast, interpretable, works well with linear boundaries | Can't capture non-linear relationships |
| Decision Trees | Non-linear relationships, feature importance | Interpretable, handles mixed data types | Overfits easily, unstable |
| Random Forest | General purpose classification/regression | Robust, handles non-linearity, feature importance | Less interpretable, slower than single trees |
| Gradient Boosting (XGBoost) | Competitions, tabular data | Often highest accuracy, handles missing values | Can overfit, many hyperparameters |
| Linear Regression | Continuous target, interpretability | Simple, fast, well-understood | Assumes linear relationship |
| K-Nearest Neighbors | Small datasets, non-parametric problems | Simple, no training phase | Slow at prediction time, curse of dimensionality |
Interview question: When would you choose logistic regression over a random forest?
Choose logistic regression when:
- You need interpretable coefficients (regulated industries)
- The relationship is approximately linear
- You have limited data (simpler models generalize better)
- Speed matters (both training and inference)
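The interpretability argument is easy to see in code. A minimal scikit-learn sketch on synthetic data (the feature names are purely hypothetical): a fitted logistic regression exposes one coefficient per feature, each readable as a change in log-odds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy binary-classification problem with 4 features.
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
model = LogisticRegression().fit(X, y)

# Each coefficient is the change in the log-odds of the positive class
# per unit increase of that feature, holding the others fixed.
feature_names = ["age", "income", "tenure", "usage"]  # illustrative only
for name, coef in zip(feature_names, model.coef_[0]):
    print(f"{name}: {coef:+.3f}")
```

A random forest on the same data would likely score as well or better, but it cannot produce this kind of one-number-per-feature explanation.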
Feature Engineering
Feature engineering often matters more than model selection. Key techniques:
Encoding Categorical Variables
```python
import pandas as pd

# Toy frame for illustration (column names are made up)
df = pd.DataFrame({"color": ["red", "blue"], "size": ["S", "L"],
                   "city": ["NY", "NY"], "target": [1, 0]})

# One-hot encoding (for nominal categories)
df = pd.get_dummies(df, columns=["color"])
# Label encoding (for ordinal categories)
df["size_encoded"] = df["size"].map({"S": 1, "M": 2, "L": 3, "XL": 4})
# Target encoding (for high-cardinality categories):
# replace each category with the mean of the target for that category
df["city_encoded"] = df.groupby("city")["target"].transform("mean")
```
Handling Missing Data
- Drop rows/columns — only if very few missing values
- Mean/median imputation — simple but can distort distributions
- Mode imputation — for categorical features
- Indicator variable — add a binary "is_missing" feature
- Model-based imputation — use other features to predict missing values
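The strategies above can be sketched in a few lines of pandas (toy data, illustrative column names). Note the order: the indicator column must be created before imputation overwrites the NaNs.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 35],
                   "city": ["NY", "SF", None, "NY"]})

# Indicator variable: record where the value was missing, before filling it.
df["age_missing"] = df["age"].isna().astype(int)
# Median imputation for a numeric feature (more robust than the mean).
df["age"] = df["age"].fillna(df["age"].median())
# Mode imputation for a categorical feature.
df["city"] = df["city"].fillna(df["city"].mode()[0])
```

Model-based imputation follows the same pattern but replaces the `median()`/`mode()` call with predictions from a model trained on the other columns.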
Feature Scaling
- StandardScaler — zero mean, unit variance. Use for linear models, SVMs, KNN
- MinMaxScaler — scales to [0, 1]. Use when you need bounded values
- Tree-based models don't need scaling — they split on thresholds, not distances
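Both scalers reduce to one-line formulas, shown here in plain numpy as a sketch of what `StandardScaler` and `MinMaxScaler` compute per feature (in practice you would fit the scaler on training data only and reuse it on test data).

```python
import numpy as np

# Two features on wildly different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Standardization: zero mean, unit variance per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: map each feature to [0, 1].
X_mm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```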
Evaluation Metrics
Classification Metrics
Accuracy — percentage of correct predictions. Misleading for imbalanced classes.
Precision — of all positive predictions, how many were actually positive? High precision matters when false positives are costly (e.g. a spam filter flagging legitimate email).
Recall — of all actual positives, how many did we catch? High recall matters when false negatives are costly (e.g. missing a disease in screening).
F1 Score — harmonic mean of precision and recall. Use when you need to balance both.
AUC-ROC — measures discrimination ability across all thresholds. Good for comparing models.
Interview question: You're building a fraud detection model. Which metric do you optimize?
Answer: Recall (or F1) — missing fraud (false negative) is more costly than flagging legitimate transactions (false positive). But you'd also monitor precision to avoid blocking too many good transactions.
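A small worked example ties the definitions together. With 4 actual positives, the classifier below makes 3 true positives, 1 false positive, and 1 false negative, which gives precision, recall, and F1 of 0.75 each (the labels are hand-picked for round numbers).

```python
# 10 predictions: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)                           # 3 / 4 = 0.75
recall = tp / (tp + fn)                              # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)   # 0.75
```

In the fraud setting, the single false negative here is a missed fraud case; lowering the decision threshold would trade some precision to recover it.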
Regression Metrics
- MSE / RMSE — penalizes large errors heavily
- MAE — more robust to outliers
- R² — proportion of variance explained (at most 1; can go negative when the model fits worse than always predicting the mean)
- MAPE — percentage error, useful for business interpretation
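All four metrics are short numpy formulas. A worked example on hand-picked numbers (so the values are easy to verify):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 10.0])

mse = np.mean((y_true - y_pred) ** 2)       # 0.375; squaring punishes the 1.0 error most
rmse = np.sqrt(mse)                         # back in the units of the target
mae = np.mean(np.abs(y_true - y_pred))      # 0.5; each error counts linearly

# R^2: 1 minus residual variance over total variance around the mean.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                    # 0.925

mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # percentage error
```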
Cross-Validation
Interview question: Why use cross-validation instead of a single train/test split?
A single split gives one estimate of model performance, which could be lucky or unlucky. K-fold cross-validation:
1. Splits the data into K folds
2. Trains K models, each holding out a different fold as the test set
3. Averages the K performance scores
This gives a more reliable performance estimate, plus a sense of how much that estimate varies across folds.
When to use what:
- K-fold (K=5 or 10) — standard approach for most problems
- Stratified K-fold — preserves class distribution in each fold (use for imbalanced data)
- Time-series split — respects temporal ordering (never train on future data)
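The splitters above are available in scikit-learn. A quick sketch on 10 samples showing the two key properties: K-fold puts every sample in exactly one test fold, and a time-series split never tests on data earlier than its training window.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)

# K-fold: 5 splits; collect every test index across the folds.
folds = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X))
test_idx = np.sort(np.concatenate([test for _, test in folds]))

# Time-series split: train indices always precede test indices.
ts_folds = list(TimeSeriesSplit(n_splits=3).split(X))
```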
Regularization
Regularization prevents overfitting by penalizing model complexity:
- L1 (Lasso) — drives some coefficients to exactly zero (feature selection)
- L2 (Ridge) — shrinks all coefficients toward zero (prevents any single feature from dominating)
- ElasticNet — combination of L1 and L2
- Dropout — randomly drops neurons during training (neural networks)
- Early stopping — stop training when validation loss starts increasing
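The L1-versus-L2 distinction shows up clearly in fitted coefficients. A sketch on synthetic data where only 2 of 10 features matter (the alpha values are illustrative): Lasso zeroes out the irrelevant coefficients, Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Target depends only on the first two features; the other eight are noise.
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, 100)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

n_zero_lasso = int(np.sum(lasso.coef_ == 0))   # irrelevant features dropped
n_zero_ridge = int(np.sum(ridge.coef_ == 0))   # shrunk, but not exactly zero
```

This is why L1 doubles as a feature-selection step, while L2 is the default choice when you want every feature to keep some (small) influence.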
Common Interview Scenarios
Imbalanced Classes
You have 99% negative and 1% positive examples. How do you handle this?
- Don't use accuracy — a model predicting all negatives gets 99%
- Resampling: oversample minority (SMOTE) or undersample majority
- Class weights: penalize misclassification of minority class more heavily
- Use appropriate metrics: precision, recall, F1, AUC
- Threshold tuning: adjust decision threshold based on business needs
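Threshold tuning is the cheapest of these levers. A sketch with hand-picked illustrative scores: lowering the decision threshold from 0.5 to 0.3 recovers a missed positive, raising recall at the cost of an extra false alarm.

```python
import numpy as np

# Predicted probabilities and true labels (illustrative values).
scores = np.array([0.95, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05])
y_true = np.array([1,    1,   0,   1,   0,    0,   0,   0])

def recall_at(threshold):
    # Flag everything at or above the threshold as positive.
    pred = scores >= threshold
    return (pred & (y_true == 1)).sum() / (y_true == 1).sum()

r_default = recall_at(0.5)   # misses the positive scored 0.4
r_lowered = recall_at(0.3)   # catches all three positives
```

In a fraud setting this is exactly the precision/recall trade-off from the metrics section, controlled by a single number chosen from business constraints.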
Feature Importance
How do you determine which features matter most?
- Coefficient magnitude (linear models) — after scaling features
- Tree-based importance — Gini importance or permutation importance
- SHAP values — model-agnostic, theoretically grounded
- Correlation analysis — simple but doesn't capture non-linear relationships
- Recursive feature elimination — systematically remove least important features
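Two of the techniques above can be compared directly in scikit-learn. A sketch on synthetic data where only the first of four features carries signal: both the built-in Gini importance and permutation importance should rank it first.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)   # label determined entirely by feature 0

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

gini_imp = rf.feature_importances_   # impurity-based (Gini) importance, sums to 1
perm = permutation_importance(rf, X, y, n_repeats=5, random_state=0)
perm_imp = perm.importances_mean     # accuracy drop when each feature is shuffled
```

Permutation importance is the safer default in interviews: impurity-based importance is known to inflate high-cardinality features, which is worth mentioning as a caveat.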
Practice ML Problems
Explore our machine learning interview problems for hands-on practice with real questions from top companies.