Ensemble Learning Methods
40+ essential concepts in ensemble learning methods
What You'll Learn
Master ensemble learning with 40+ flashcards covering bagging, boosting, stacking, random forests, gradient boosting, AdaBoost, XGBoost, and more.
Key Topics
- Bagging and Bootstrap Aggregating for variance reduction
- Boosting: AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost
- Stacking and meta-learner design
- Random Forest: feature subsampling and OOB error
- Bias-variance tradeoff in ensemble methods
- Hyperparameter tuning: learning rate, subsampling, regularization
- Feature importance: MDI and permutation importance
- Diversity and the Condorcet Jury Theorem
Looking for more machine learning resources? Visit the Explore page to browse related decks or use the Create Your Own Deck flow to customize this set.
How to study this deck
Start with a quick skim of the questions, then launch study mode to flip cards until you can answer each prompt without hesitation. Revisit tricky cards using shuffle or reverse order, and schedule a follow-up review within 48 hours to reinforce retention.
Preview: Ensemble Learning Methods
Question
What is ensemble learning?
Answer
A machine learning paradigm that combines multiple models (weak learners) to produce a stronger, more accurate model than any individual model alone.
Question
What is a weak learner?
Answer
A model that performs only slightly better than random guessing (e.g., a decision stump). Ensemble methods combine many weak learners to form a strong learner.
Question
What is bagging (Bootstrap Aggregating)?
Answer
An ensemble technique that trains multiple models on different random subsets of the training data (sampled with replacement) and averages their predictions to reduce variance.
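A minimal scikit-learn sketch of the idea on this card, using a synthetic dataset from make_classification purely for illustration:

```python
# Bagging: many trees, each fit on a bootstrap sample, predictions averaged.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        max_samples=1.0, bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
print("bagging accuracy:", bag.score(X_test, y_test))
```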
Question
What is boosting?
Answer
An ensemble technique that trains models sequentially, where each new model focuses on correcting the errors of the previous ones, primarily reducing bias.
Question
What is stacking (stacked generalization)?
Answer
An ensemble method that trains a meta-learner (blender) on the predictions of multiple base models. The meta-learner learns how to best combine the base model outputs.
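A short stacking sketch with scikit-learn's StackingClassifier; the base models and synthetic data are arbitrary choices for illustration:

```python
# Stacking: two base models, logistic-regression meta-learner (blender).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("svc", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),  # the meta-learner
    cv=5,  # meta-learner trains on out-of-fold base-model predictions
)
stack.fit(X, y)
```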
Question
What is a Random Forest?
Answer
An ensemble of decision trees trained with bagging + random feature subsets at each split. Reduces variance compared to a single decision tree and is robust to overfitting.
Question
How does Random Forest select features at each split?
Answer
At each node split, a random subset of features (typically √p for classification or p/3 for regression, where p = total features) is considered, adding diversity among trees.
Question
What is out-of-bag (OOB) error?
Answer
An internal validation estimate in bagging/Random Forest. Each tree is evaluated on the ~37% of training samples not included in its bootstrap sample, providing a free estimate of generalization error, similar to cross-validation.
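A scikit-learn sketch tying the last few cards together (feature subsampling at each split plus the free OOB estimate), again on synthetic data:

```python
# Random Forest with sqrt(p) feature subsampling and the OOB error estimate.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=300,
    max_features="sqrt",   # consider ~sqrt(p) features at each split
    oob_score=True,        # score each tree on its out-of-bag samples
    random_state=0,
)
rf.fit(X, y)
print("OOB accuracy estimate:", rf.oob_score_)
```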
Question
What is AdaBoost (Adaptive Boosting)?
Answer
A boosting algorithm that assigns higher weights to misclassified samples after each iteration. Subsequent weak learners focus more on hard examples. Final prediction is a weighted vote.
Question
How does AdaBoost update sample weights?
Answer
After each weak learner is trained, misclassified samples have their weights increased and correctly classified samples have their weights decreased, so the next learner focuses on mistakes.
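A hand-rolled illustration of one round of the classic binary AdaBoost weight update (not a library implementation); the toy labels and predictions are made up for the example:

```python
# One AdaBoost round: compute the learner's weight and re-weight the samples.
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])        # labels in {-1, +1}
y_pred = np.array([1, -1, -1, 1, 1])        # this round's stump predictions
w = np.full(len(y_true), 1 / len(y_true))   # start from uniform weights

miss = y_pred != y_true
err = np.sum(w[miss])                        # weighted error of the weak learner
alpha = 0.5 * np.log((1 - err) / err)        # learner's weight in the final vote

w = w * np.exp(alpha * np.where(miss, 1, -1))  # up-weight mistakes, down-weight hits
w /= w.sum()                                 # renormalize to a distribution
print(alpha, w)
```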
Question
What is Gradient Boosting?
Answer
A boosting method that fits new models to the residual errors (negative gradients of the loss function) of the ensemble at each step, iteratively reducing the loss.
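A minimal hand-rolled sketch of gradient boosting for squared-error loss (where the negative gradient is simply the residual); the data and hyperparameters are illustrative:

```python
# Gradient boosting by hand: each new tree fits the current residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

learning_rate, n_trees = 0.1, 100
pred = np.full_like(y, y.mean())               # initial prediction: the mean
for _ in range(n_trees):
    residuals = y - pred                        # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    pred += learning_rate * tree.predict(X)     # shrink each tree's contribution

print("training MSE:", np.mean((y - pred) ** 2))
```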
Question
What is the key difference between AdaBoost and Gradient Boosting?
Answer
AdaBoost re-weights samples; Gradient Boosting fits new models directly to the residuals (pseudo-residuals / negative gradients). Gradient Boosting is more general and supports arbitrary loss functions.
Question
What is XGBoost?
Answer
An optimized, regularized gradient boosting library. It adds L1/L2 regularization, handles missing values natively, parallelizes split finding during tree construction, and uses second-order gradient information.
Question
What regularization does XGBoost add over standard Gradient Boosting?
Answer
XGBoost adds L1 (alpha) and L2 (lambda) regularization terms on leaf weights in the objective function, plus a complexity penalty on the number of leaves (gamma), reducing overfitting.
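A sketch of where these knobs live in the XGBoost Python API (assumes the xgboost package is installed; the values shown are arbitrary starting points, not recommendations):

```python
# XGBoost with explicit regularization terms.
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=4,
    reg_alpha=0.1,      # L1 penalty on leaf weights (alpha)
    reg_lambda=1.0,     # L2 penalty on leaf weights (lambda)
    gamma=0.5,          # minimum loss reduction required to make a split
    subsample=0.8,      # row subsampling per tree
    colsample_bytree=0.8,  # column subsampling per tree
)
# model.fit(X_train, y_train) on your own data
```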
Question
What is LightGBM?
Answer
A gradient boosting framework by Microsoft that uses histogram-based splitting and leaf-wise (best-first) tree growth instead of level-wise, making it faster and more memory-efficient than XGBoost on large datasets.
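A minimal LightGBM sketch (assumes the lightgbm package is installed); num_leaves is the key control on leaf-wise growth, and the values are illustrative:

```python
# LightGBM: leaf-wise growth capped by num_leaves.
from lightgbm import LGBMClassifier

model = LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=31,          # caps leaf-wise growth; main overfitting control
    min_child_samples=20,   # minimum samples per leaf
)
# model.fit(X_train, y_train) on your own data
```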
Question
What is CatBoost?
Answer
A gradient boosting library by Yandex that handles categorical features natively using ordered target statistics, reducing target leakage and often requiring less preprocessing.
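A minimal CatBoost sketch (assumes the catboost package is installed); the column names passed to cat_features are hypothetical placeholders:

```python
# CatBoost: categorical columns declared up front, no manual encoding needed.
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    verbose=0,
)
# model.fit(X_train, y_train, cat_features=["city", "product"]) on your own data,
# where cat_features names (or indexes) the categorical columns.
```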
Question
What is voting (majority voting) in ensemble learning?
Answer
An ensemble technique that combines predictions by majority vote (classification) or averaging (regression). Hard voting uses predicted classes; soft voting averages predicted probabilities.
Question
What is the difference between hard voting and soft voting?
Answer
Hard voting counts the most frequent predicted class. Soft voting averages class probabilities from each model and picks the class with the highest average probability — often more accurate when models are well-calibrated.
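A scikit-learn sketch contrasting the two modes on synthetic data:

```python
# Hard vs. soft voting with VotingClassifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
estimators = [("lr", LogisticRegression(max_iter=1000)),
              ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
              ("nb", GaussianNB())]

hard = VotingClassifier(estimators, voting="hard").fit(X, y)  # majority class vote
soft = VotingClassifier(estimators, voting="soft").fit(X, y)  # average probabilities
```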
Question
How does bagging reduce variance?
Answer
By averaging predictions of models trained on different bootstrap samples, random errors cancel out. The variance of the average of n independent models is σ²/n, so more trees → lower variance; because real trees are correlated, decorrelating them (e.g., via feature subsampling) matters as well.
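A quick numerical check of the σ²/n claim, treating each "model" as an independent noisy predictor:

```python
# Averaging n independent noisy predictors shrinks variance by roughly 1/n.
import numpy as np

rng = np.random.default_rng(0)
sigma2, n_models, n_trials = 1.0, 25, 100_000

single = rng.normal(0, np.sqrt(sigma2), size=n_trials)
ensemble = rng.normal(0, np.sqrt(sigma2), size=(n_trials, n_models)).mean(axis=1)

print(single.var())    # ~ 1.0
print(ensemble.var())  # ~ 1.0 / 25 = 0.04
```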
Question
Why does boosting reduce bias?
Answer
Each new model in boosting corrects the systematic errors (bias) of the current ensemble. By iteratively focusing on hard examples, the ensemble can learn complex decision boundaries.
Question
What is the bias-variance tradeoff in ensemble methods?
Answer
Bagging primarily reduces variance with little change to bias. Boosting primarily reduces bias but can increase variance (risk of overfitting). Stacking can reduce both.
Question
What is the learning rate (shrinkage) in boosting?
Answer
A hyperparameter (η) that scales the contribution of each new tree, slowing learning to improve generalization. Lower learning rate requires more trees but typically yields better performance.
Question
What is the shrinkage-trees tradeoff in gradient boosting?
Answer
Lower learning rate (shrinkage) + more trees often outperforms high learning rate + few trees, as shrinkage acts as a regularizer. It is typically combined with early stopping to pick the number of trees and avoid overfitting.
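A small experiment sketch contrasting the two regimes with scikit-learn's gradient boosting; on synthetic data the low-rate configuration usually scores at least as well, though the outcome depends on the dataset:

```python
# Shrinkage/trees tradeoff: high lr + few trees vs. low lr + many trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

fast = GradientBoostingClassifier(learning_rate=0.5, n_estimators=50, random_state=0)
slow = GradientBoostingClassifier(learning_rate=0.05, n_estimators=500,
                                  subsample=0.8, random_state=0)

print("high lr / few trees :", cross_val_score(fast, X, y, cv=5).mean())
print("low lr / many trees :", cross_val_score(slow, X, y, cv=5).mean())
```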
Question
What is early stopping in boosting?
Answer
Training is halted when validation set performance stops improving, preventing overfitting. Requires a held-out validation set or cross-validation to monitor performance.
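One way to express this in scikit-learn's gradient boosting, which carves out an internal validation split to monitor; the exact settings are illustrative:

```python
# Early stopping via an internal validation split.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=1000,        # upper bound; training may stop much earlier
    learning_rate=0.05,
    validation_fraction=0.1,  # held-out fraction used to monitor the loss
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gb.fit(X, y)
print("trees actually fitted:", gb.n_estimators_)
```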
Question
What is feature importance in Random Forest?
Answer
Measured by mean decrease in impurity (MDI) or mean decrease in accuracy (MDA/permutation importance). MDI can be biased toward high-cardinality features; permutation importance is more reliable.
Question
What is permutation importance?
Answer
A model-agnostic feature importance method that measures the drop in model performance when a feature's values are randomly shuffled, breaking its relationship with the target.
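A scikit-learn sketch of permutation importance computed on a held-out set, with synthetic data for illustration:

```python
# Permutation importance: accuracy drop when each feature is shuffled.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
result = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=0)
print(result.importances_mean)   # average score drop per shuffled feature
```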
Question
What is a decision stump?
Answer
A decision tree with depth 1 (a single split). Commonly used as the weak learner in AdaBoost due to its simplicity and fast training.
Question
What is the base estimator in a boosting algorithm?
Answer
The type of weak learner used in each boosting round. Shallow decision trees (stumps or depth 3-5) are most common because they are fast, have low variance, and are easy to combine.
Question
What is subsampling (stochastic gradient boosting)?
Answer
Using a random fraction of training samples (without replacement) to fit each tree in gradient boosting. Adds randomness like bagging, reducing overfitting and often improving generalization.
Question
What is column subsampling in XGBoost/LightGBM?
Answer
Randomly selecting a fraction of features for each tree or each split (similar to Random Forest). Reduces correlation among trees and speeds up training.
Question
What is the objective function in XGBoost?
Answer
Obj = Σᵢ loss(ŷᵢ, yᵢ) + Σₖ Ω(fₖ), where Ω(f) = γT + ½λ‖w‖² penalizes the number of leaves (T, weighted by γ) and the leaf weights w (weighted by λ). XGBoost approximates the loss with a second-order Taylor expansion for efficient optimization.
Question
What is GBDT (Gradient Boosted Decision Trees)?
Answer
The general class of gradient boosting methods that use decision trees as base learners. Includes implementations like XGBoost, LightGBM, CatBoost, and scikit-learn's GradientBoostingClassifier.
Question
What is the difference between level-wise and leaf-wise tree growth?
Answer
Level-wise (XGBoost default): grows trees level by level, balanced. Leaf-wise (LightGBM default): always splits the leaf with the highest loss reduction, creating deeper, asymmetric trees — faster but more prone to overfitting on small data.
Question
What is DART (Dropouts meet Multiple Additive Regression Trees)?
Answer
A boosting variant that randomly drops trees from the ensemble during training (like dropout in neural networks) to prevent over-specialization of later trees and reduce overfitting.
Question
What is a blender / meta-learner in stacking?
Answer
The second-level model in stacking that takes the out-of-fold predictions of base models as input features and learns the optimal way to combine them.
Question
Why use out-of-fold predictions in stacking?
Answer
To prevent the meta-learner from overfitting to training data. Base models are trained on k-1 folds and predict on the held-out fold, so the meta-learner trains on predictions the base models never saw during fitting.
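A sketch of building out-of-fold meta-features by hand with cross_val_predict (StackingClassifier does this internally); the base models and data are illustrative:

```python
# Out-of-fold meta-features: the meta-learner never sees in-fold predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
base_models = [RandomForestClassifier(n_estimators=100, random_state=0),
               SVC(probability=True, random_state=0)]

# Each column holds one base model's out-of-fold probability for class 1.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])
meta_learner = LogisticRegression().fit(meta_features, y)
```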
Question
What is diversity in ensemble learning and why does it matter?
Answer
Diversity refers to the degree to which ensemble members make different errors. Higher diversity leads to better error cancellation and improved ensemble performance. Achieved via different algorithms, features, hyperparameters, or data subsets.
Question
What is the Condorcet Jury Theorem as it relates to ensembles?
Answer
If each classifier has >50% accuracy and errors are independent, combining them by majority vote increases accuracy toward 100% as the ensemble size grows. Underscores the value of diversity.
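A quick calculation of majority-vote accuracy under the theorem's independence assumption, for classifiers that are each 60% accurate:

```python
# Majority-vote accuracy of n independent classifiers, each correct with prob p.
from scipy.stats import binom

p = 0.6
for n in (1, 11, 101, 1001):
    # For odd n, the majority is correct when more than (n - 1) / 2 votes are right.
    print(n, binom.sf((n - 1) // 2, n, p))
```

The printed probabilities climb toward 1 as n grows, which is the theorem's point.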
Question
What is the main risk of boosting compared to bagging?
Answer
Boosting is more prone to overfitting, especially with noisy data, because it increases the weight of misclassified (possibly noisy) samples. Bagging is more robust to noise.
Question
When should you prefer Random Forest over Gradient Boosting?
Answer
Random Forest is preferred when: training speed is critical, data is noisy, tuning time is limited, or interpretability matters. Gradient Boosting typically achieves higher accuracy but requires more hyperparameter tuning.