Machine Learning Interview Questions: A Practical Guide
How ML Interviews Work
Machine learning interviews for data science roles focus on intuition and practical understanding — not mathematical proofs. You need to explain concepts clearly, know the trade-offs between approaches, and demonstrate that you can make sound modeling decisions.
The Bias-Variance Tradeoff
This is the single most important concept in ML interviews.
Bias — error from overly simplistic assumptions. A linear model fitting a curved relationship has high bias (underfitting).
Variance — error from sensitivity to training data fluctuations. A deep decision tree that memorizes training data has high variance (overfitting).
The tradeoff: As model complexity increases, bias decreases but variance increases. The sweet spot minimizes total error.
Interview question: Your model has 95% training accuracy but 60% test accuracy. What's happening and how do you fix it?
Answer: High variance (overfitting). Solutions:
- More training data
- Regularization (L1/L2)
- Reduce model complexity (fewer features, shallower trees)
- Cross-validation for hyperparameter tuning
- Ensemble methods (bagging reduces variance)
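The train/test gap above can be reproduced in miniature. This is a rough numpy sketch on synthetic sine data (the noise level and polynomial degrees are illustrative assumptions): a high-degree polynomial shows the overfitting signature of a near-zero training error paired with a much larger test error.

```python
import numpy as np

# Fit polynomials of two degrees to noisy samples of a sine curve and
# compare train vs. test error. Degree 15 has enough capacity to chase
# the noise in the training set -- the high-variance regime.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
x_test = np.linspace(0.02, 0.98, 20)
truth = lambda x: np.sin(2 * np.pi * x)
y_train = truth(x_train) + rng.normal(0, 0.3, x_train.size)
y_test = truth(x_test) + rng.normal(0, 0.3, x_test.size)

def errors(degree):
    # Train on the training points only, then score both sets with MSE.
    coeffs = np.polyfit(x_train, y_train, degree)
    mse = lambda x, y: np.mean((np.polyval(coeffs, x) - y) ** 2)
    return mse(x_train, y_train), mse(x_test, y_test)

train_lo, test_lo = errors(1)    # high bias: a line underfits the sine
train_hi, test_hi = errors(15)   # high variance: train error collapses
```

The fix from the answer above amounts to moving back down the complexity axis (or adding data/regularization) until the two errors converge.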
Model Selection
When to Use What
| Algorithm | Best For | Pros | Cons |
|---|---|---|---|
| Logistic Regression | Binary classification, interpretability | Fast, interpretable, works well with linear boundaries | Can't capture non-linear relationships |
| Decision Trees | Non-linear relationships, feature importance | Interpretable, handles mixed data types | Overfits easily, unstable |
| Random Forest | General purpose classification/regression | Robust, handles non-linearity, feature importance | Less interpretable, slower than single trees |
| Gradient Boosting (XGBoost) | Competitions, tabular data | Often highest accuracy, handles missing values | Can overfit, many hyperparameters |
| Linear Regression | Continuous target, interpretability | Simple, fast, well-understood | Assumes linear relationship |
| K-Nearest Neighbors | Small datasets, non-parametric problems | Simple, no training phase | Slow at prediction time, curse of dimensionality |
Interview question: When would you choose logistic regression over a random forest?
Choose logistic regression when:
- You need interpretable coefficients (regulated industries)
- The relationship is approximately linear
- You have limited data (simpler models generalize better)
- Speed matters (both training and inference)
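The interpretability argument is easy to see in code. A minimal scikit-learn sketch on synthetic data (the feature names are purely hypothetical): a fitted logistic regression exposes one coefficient per feature, each readable as a change in log-odds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy binary-classification problem with 4 features.
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
model = LogisticRegression().fit(X, y)

# Each coefficient is the change in the log-odds of the positive class
# per unit increase of that feature, holding the others fixed.
feature_names = ["age", "income", "tenure", "usage"]  # illustrative only
for name, coef in zip(feature_names, model.coef_[0]):
    print(f"{name}: {coef:+.3f}")
```

A random forest on the same data would likely score as well or better, but it cannot produce this kind of one-number-per-feature explanation.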
Feature Engineering
Feature engineering often matters more than model selection. Key techniques:
Encoding Categorical Variables
```python
import pandas as pd

# Toy frame for illustration (column names are made up)
df = pd.DataFrame({"color": ["red", "blue"], "size": ["S", "L"],
                   "city": ["NY", "NY"], "target": [1, 0]})

# One-hot encoding (for nominal categories)
df = pd.get_dummies(df, columns=["color"])
# Label encoding (for ordinal categories)
df["size_encoded"] = df["size"].map({"S": 1, "M": 2, "L": 3, "XL": 4})
# Target encoding (for high-cardinality categories):
# replace each category with the mean of the target for that category
df["city_encoded"] = df.groupby("city")["target"].transform("mean")
```
Handling Missing Data
- Drop rows/columns — only if very few missing values
- Mean/median imputation — simple but can distort distributions
- Mode imputation — for categorical features
- Indicator variable — add a binary "is_missing" feature
- Model-based imputation — use other features to predict missing values
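The strategies above can be sketched in a few lines of pandas (toy data, illustrative column names). Note the order: the indicator column must be created before imputation overwrites the NaNs.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 35],
                   "city": ["NY", "SF", None, "NY"]})

# Indicator variable: record where the value was missing, before filling it.
df["age_missing"] = df["age"].isna().astype(int)
# Median imputation for a numeric feature (more robust than the mean).
df["age"] = df["age"].fillna(df["age"].median())
# Mode imputation for a categorical feature.
df["city"] = df["city"].fillna(df["city"].mode()[0])
```

Model-based imputation follows the same pattern but replaces the `median()`/`mode()` call with predictions from a model trained on the other columns.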
Feature Scaling
- StandardScaler — zero mean, unit variance. Use for linear models, SVMs, KNN
- MinMaxScaler — scales to [0, 1]. Use when you need bounded values
- Tree-based models don't need scaling — they split on thresholds, not distances
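Both scalers reduce to one-line formulas, shown here in plain numpy as a sketch of what `StandardScaler` and `MinMaxScaler` compute per feature (in practice you would fit the scaler on training data only and reuse it on test data).

```python
import numpy as np

# Two features on wildly different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Standardization: zero mean, unit variance per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: map each feature to [0, 1].
X_mm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```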
Evaluation Metrics
Classification Metrics
Accuracy — percentage of correct predictions. Misleading for imbalanced classes.
Precision — of all positive predictions, how many were actually positive? High precision matters when false positives are costly (e.g. a spam filter flagging legitimate email).
Recall — of all actual positives, how many did we catch? High recall matters when false negatives are costly (e.g. missing a disease in screening).
F1 Score — harmonic mean of precision and recall. Use when you need to balance both.
AUC-ROC — measures discrimination ability across all thresholds. Good for comparing models.
Interview question: You're building a fraud detection model. Which metric do you optimize?
Answer: Recall (or F1) — missing fraud (false negative) is more costly than flagging legitimate transactions (false positive). But you'd also monitor precision to avoid blocking too many good transactions.
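A small worked example ties the definitions together. With 4 actual positives, the classifier below makes 3 true positives, 1 false positive, and 1 false negative, which gives precision, recall, and F1 of 0.75 each (the labels are hand-picked for round numbers).

```python
# 10 predictions: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)                           # 3 / 4 = 0.75
recall = tp / (tp + fn)                              # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)   # 0.75
```

In the fraud setting, the single false negative here is a missed fraud case; lowering the decision threshold would trade some precision to recover it.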
Regression Metrics
- MSE / RMSE — penalizes large errors heavily
- MAE — more robust to outliers
- R² — proportion of variance explained (at most 1; can go negative when the model fits worse than always predicting the mean)
- MAPE — percentage error, useful for business interpretation
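All four metrics are short numpy formulas. A worked example on hand-picked numbers (so the values are easy to verify):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 10.0])

mse = np.mean((y_true - y_pred) ** 2)       # 0.375; squaring punishes the 1.0 error most
rmse = np.sqrt(mse)                         # back in the units of the target
mae = np.mean(np.abs(y_true - y_pred))      # 0.5; each error counts linearly

# R^2: 1 minus residual variance over total variance around the mean.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                    # 0.925

mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # percentage error
```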
Cross-Validation
Interview question: Why use cross-validation instead of a single train/test split?
A single split gives one estimate of model performance, which could be lucky or unlucky. K-fold cross-validation:
1. Splits the data into K folds
2. Trains K models, each holding out a different fold as the test set
3. Averages the K performance scores
This gives a more reliable performance estimate, plus a sense of how much that estimate varies across folds.
When to use what:
- K-fold (K=5 or 10) — standard approach for most problems
- Stratified K-fold — preserves class distribution in each fold (use for imbalanced data)
- Time-series split — respects temporal ordering (never train on future data)
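The splitters above are available in scikit-learn. A quick sketch on 10 samples showing the two key properties: K-fold puts every sample in exactly one test fold, and a time-series split never tests on data earlier than its training window.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)

# K-fold: 5 splits; collect every test index across the folds.
folds = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X))
test_idx = np.sort(np.concatenate([test for _, test in folds]))

# Time-series split: train indices always precede test indices.
ts_folds = list(TimeSeriesSplit(n_splits=3).split(X))
```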
Regularization
Regularization prevents overfitting by penalizing model complexity:
- L1 (Lasso) — drives some coefficients to exactly zero (feature selection)
- L2 (Ridge) — shrinks all coefficients toward zero (prevents any single feature from dominating)
- ElasticNet — combination of L1 and L2
- Dropout — randomly drops neurons during training (neural networks)
- Early stopping — stop training when validation loss starts increasing
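The L1-versus-L2 distinction shows up clearly in fitted coefficients. A sketch on synthetic data where only 2 of 10 features matter (the alpha values are illustrative): Lasso zeroes out the irrelevant coefficients, Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Target depends only on the first two features; the other eight are noise.
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, 100)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

n_zero_lasso = int(np.sum(lasso.coef_ == 0))   # irrelevant features dropped
n_zero_ridge = int(np.sum(ridge.coef_ == 0))   # shrunk, but not exactly zero
```

This is why L1 doubles as a feature-selection step, while L2 is the default choice when you want every feature to keep some (small) influence.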
Common Interview Scenarios
Imbalanced Classes
You have 99% negative and 1% positive examples. How do you handle this?
- Don't use accuracy — a model predicting all negatives gets 99%
- Resampling: oversample minority (SMOTE) or undersample majority
- Class weights: penalize misclassification of minority class more heavily
- Use appropriate metrics: precision, recall, F1, AUC
- Threshold tuning: adjust decision threshold based on business needs
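Threshold tuning is the cheapest of these levers. A sketch with hand-picked illustrative scores: lowering the decision threshold from 0.5 to 0.3 recovers a missed positive, raising recall at the cost of an extra false alarm.

```python
import numpy as np

# Predicted probabilities and true labels (illustrative values).
scores = np.array([0.95, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05])
y_true = np.array([1,    1,   0,   1,   0,    0,   0,   0])

def recall_at(threshold):
    # Flag everything at or above the threshold as positive.
    pred = scores >= threshold
    return (pred & (y_true == 1)).sum() / (y_true == 1).sum()

r_default = recall_at(0.5)   # misses the positive scored 0.4
r_lowered = recall_at(0.3)   # catches all three positives
```

In a fraud setting this is exactly the precision/recall trade-off from the metrics section, controlled by a single number chosen from business constraints.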
Feature Importance
How do you determine which features matter most?
- Coefficient magnitude (linear models) — after scaling features
- Tree-based importance — Gini importance or permutation importance
- SHAP values — model-agnostic, theoretically grounded
- Correlation analysis — simple but doesn't capture non-linear relationships
- Recursive feature elimination — systematically remove least important features
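Two of the techniques above can be compared directly in scikit-learn. A sketch on synthetic data where only the first of four features carries signal: both the built-in Gini importance and permutation importance should rank it first.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)   # label determined entirely by feature 0

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

gini_imp = rf.feature_importances_   # impurity-based (Gini) importance, sums to 1
perm = permutation_importance(rf, X, y, n_repeats=5, random_state=0)
perm_imp = perm.importances_mean     # accuracy drop when each feature is shuffled
```

Permutation importance is the safer default in interviews: impurity-based importance is known to inflate high-cardinality features, which is worth mentioning as a caveat.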
Practice ML Problems
Explore our machine learning interview problems for hands-on practice with real questions from top companies.