Ensemble Learning Methods

40+ essential concepts in ensemble learning methods

What You'll Learn

Master ensemble learning with 40+ flashcards covering bagging, boosting, stacking, random forests, gradient boosting, AdaBoost, XGBoost, and more.

Key Topics

  • Bagging and Bootstrap Aggregating for variance reduction
  • Boosting: AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost
  • Stacking and meta-learner design
  • Random Forest: feature subsampling and OOB error
  • Bias-variance tradeoff in ensemble methods
  • Hyperparameter tuning: learning rate, subsampling, regularization
  • Feature importance: MDI and permutation importance
  • Diversity and the Condorcet Jury Theorem

Looking for more machine learning resources? Visit the Explore page to browse related decks or use the Create Your Own Deck flow to customize this set.

How to study this deck

Start with a quick skim of the questions, then launch study mode to flip cards until you can answer each prompt without hesitation. Revisit tricky cards using shuffle or reverse order, and schedule a follow-up review within 48 hours to reinforce retention.

Preview: Ensemble Learning Methods

Question

What is ensemble learning?

Answer

A machine learning paradigm that combines multiple models (weak learners) to produce a stronger, more accurate model than any individual model alone.

Question

What is a weak learner?

Answer

A model that performs only slightly better than random guessing (e.g., a decision stump). Ensemble methods combine many weak learners to form a strong learner.

Question

What is bagging (Bootstrap Aggregating)?

Answer

An ensemble technique that trains multiple models on different random subsets of the training data (sampled with replacement) and averages their predictions to reduce variance.
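
A minimal scikit-learn sketch (synthetic data and illustrative hyperparameters, not a prescription):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each tree is fit on a bootstrap sample of X; predictions are combined by vote.
bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,   # number of bootstrap samples / models
    bootstrap=True,    # sample with replacement
    random_state=0,
).fit(X, y)
acc = bag.score(X, y)
```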

Question

What is boosting?

Answer

An ensemble technique that trains models sequentially, where each new model focuses on correcting the errors of the previous ones, primarily reducing bias.

Question

What is stacking (stacked generalization)?

Answer

An ensemble method that trains a meta-learner (blender) on the predictions of multiple base models. The meta-learner learns how to best combine the base model outputs.
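
A rough sketch with scikit-learn (base models and meta-learner chosen for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svc", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # the meta-learner / blender
    cv=5,  # base models produce out-of-fold predictions for the blender
).fit(X, y)
stack_acc = stack.score(X, y)
```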

Question

What is a Random Forest?

Answer

An ensemble of decision trees trained with bagging + random feature subsets at each split. Reduces variance compared to a single decision tree and is robust to overfitting.

Question

How does Random Forest select features at each split?

Answer

At each node split, a random subset of features (typically √p for classification or p/3 for regression, where p = total features) is considered, adding diversity among trees.
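
In scikit-learn this is the `max_features` parameter (sketch with synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=16, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # consider ~sqrt(p) candidate features per split
    random_state=0,
).fit(X, y)
rf_acc = rf.score(X, y)
```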

Question

What is out-of-bag (OOB) error?

Answer

An internal validation estimate in bagging/Random Forest. Each tree is evaluated on the ~37% of training samples not included in its bootstrap sample, providing a validation estimate without a separate hold-out set.
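
Hedged sketch: scikit-learn exposes this via `oob_score=True` (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Each sample is scored only by the trees whose bootstrap sample excluded it.
rf = RandomForestClassifier(
    n_estimators=200, oob_score=True, random_state=0
).fit(X, y)
oob = rf.oob_score_  # accuracy estimate, no separate validation split needed
```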

Question

What is AdaBoost (Adaptive Boosting)?

Answer

A boosting algorithm that assigns higher weights to misclassified samples after each iteration. Subsequent weak learners focus more on hard examples. Final prediction is a weighted vote.

Question

How does AdaBoost update sample weights?

Answer

After each weak learner is trained, misclassified samples have their weights increased and correctly classified samples have their weights decreased, so the next learner focuses on mistakes.
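
A toy NumPy sketch of one update step in the classic (two-class, ±1) formulation; the arrays are made-up illustrative values:

```python
import numpy as np

w = np.full(5, 1 / 5)  # uniform initial sample weights
miss = np.array([True, False, False, True, False])  # weak learner's mistakes

err = w[miss].sum()                     # weighted error rate of the learner
alpha = 0.5 * np.log((1 - err) / err)   # learner's vote weight
w = w * np.exp(alpha * np.where(miss, 1, -1))  # up-weight misses, down-weight hits
w = w / w.sum()                         # renormalise to a distribution
```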

Question

What is Gradient Boosting?

Answer

A boosting method that fits new models to the residual errors (negative gradients of the loss function) of the ensemble at each step, iteratively reducing the loss.
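
A hand-rolled sketch for squared loss, where the negative gradient is simply the residual y − F(x) (synthetic 1-D data, illustrative settings):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

F = np.full_like(y, y.mean())  # start from a constant prediction
lr = 0.1                       # learning rate (shrinkage)
for _ in range(100):
    resid = y - F              # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, resid)
    F += lr * tree.predict(X)  # shrunken additive update

mse = np.mean((y - F) ** 2)    # should approach the noise floor
```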

Question

What is the key difference between AdaBoost and Gradient Boosting?

Answer

AdaBoost re-weights samples; Gradient Boosting fits new models directly to the residuals (pseudo-residuals / negative gradients). Gradient Boosting is more general and supports arbitrary loss functions.

Question

What is XGBoost?

Answer

An optimized, regularized gradient boosting library. It adds L1/L2 regularization, handles missing values natively, supports parallel tree construction, and uses second-order gradient information.

Question

What regularization does XGBoost add over standard Gradient Boosting?

Answer

XGBoost adds L1 (alpha) and L2 (lambda) regularization terms on leaf weights in the objective function, plus a complexity penalty on the number of leaves (gamma), reducing overfitting.

Question

What is LightGBM?

Answer

A gradient boosting framework by Microsoft that uses histogram-based splitting and leaf-wise (best-first) tree growth instead of level-wise, making it faster and more memory-efficient than XGBoost on large datasets.

Question

What is CatBoost?

Answer

A gradient boosting library by Yandex that handles categorical features natively using ordered target statistics, reducing target leakage and often requiring less preprocessing.

Question

What is voting (majority voting) in ensemble learning?

Answer

An ensemble technique that combines predictions by majority vote (classification) or averaging (regression). Hard voting uses predicted classes; soft voting averages predicted probabilities.

Question

What is the difference between hard voting and soft voting?

Answer

Hard voting counts the most frequent predicted class. Soft voting averages class probabilities from each model and picks the class with the highest average probability — often more accurate when models are well-calibrated.
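
Both modes side by side in scikit-learn (base models chosen for illustration; soft voting needs `predict_proba`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, random_state=0)
models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("nb", GaussianNB()),
]

hard = VotingClassifier(models, voting="hard").fit(X, y)  # majority of labels
soft = VotingClassifier(models, voting="soft").fit(X, y)  # mean of probabilities
hard_acc, soft_acc = hard.score(X, y), soft.score(X, y)
```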

Question

How does bagging reduce variance?

Answer

By averaging predictions of models trained on different bootstrap samples, random errors cancel out. The variance of the average of n independent models is σ²/n; in practice the models are correlated, so the reduction is smaller but still substantial, and more trees means lower variance.

Question

Why does boosting reduce bias?

Answer

Each new model in boosting corrects the systematic errors (bias) of the current ensemble. By iteratively focusing on hard examples, the ensemble can learn complex decision boundaries.

Question

What is the bias-variance tradeoff in ensemble methods?

Answer

Bagging primarily reduces variance with little change to bias. Boosting primarily reduces bias but can increase variance (risk of overfitting). Stacking can reduce both.

Question

What is the learning rate (shrinkage) in boosting?

Answer

A hyperparameter (η) that scales the contribution of each new tree, slowing learning to improve generalization. Lower learning rate requires more trees but typically yields better performance.

Question

What is the shrinkage-trees tradeoff in gradient boosting?

Answer

Lower learning rate (shrinkage) + more trees often outperforms high learning rate + few trees, as shrinkage acts as a regularizer. Requires early stopping to avoid overfitting.

Question

What is early stopping in boosting?

Answer

Training is halted when validation set performance stops improving, preventing overfitting. Requires a held-out validation set or cross-validation to monitor performance.
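
Sketch using scikit-learn's built-in monitoring (parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=1000,        # upper bound on boosting rounds
    validation_fraction=0.2,  # internal held-out set for monitoring
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
).fit(X, y)
n_trees = gb.n_estimators_    # trees actually fitted before stopping
```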

Question

What is feature importance in Random Forest?

Answer

Measured by mean decrease in impurity (MDI) or mean decrease in accuracy (MDA/permutation importance). MDI can be biased toward high-cardinality features; permutation importance is more reliable.

Question

What is permutation importance?

Answer

A model-agnostic feature importance method that measures the drop in model performance when a feature's values are randomly shuffled, breaking its relationship with the target.
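
A short sketch with scikit-learn's implementation (synthetic data; only some features are informative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Shuffle each feature n_repeats times and record the drop in accuracy.
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
importances = result.importances_mean  # mean accuracy drop per feature
```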

Question

What is a decision stump?

Answer

A decision tree with depth 1 (a single split). Commonly used as the weak learner in AdaBoost due to its simplicity and fast training.

Question

What is the base estimator in a boosting algorithm?

Answer

The type of weak learner used in each boosting round. Shallow decision trees (stumps or depth 3-5) are most common because they are fast, have low variance, and are easy to combine.

Question

What is subsampling (stochastic gradient boosting)?

Answer

Using a random fraction of training samples (without replacement) to fit each tree in gradient boosting. Adds randomness like bagging, reducing overfitting and often improving generalization.

Question

What is column subsampling in XGBoost/LightGBM?

Answer

Randomly selecting a fraction of features for each tree or each split (similar to Random Forest). Reduces correlation among trees and speeds up training.

Question

What is the objective function in XGBoost?

Answer

Loss(predictions, labels) + Ω(tree complexity), where Ω includes L2 regularization on leaf weights (λ) and a penalty per leaf (γ). Uses a second-order Taylor expansion to approximate the loss for efficient optimization.
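
Written out in the notation of the XGBoost documentation (T = number of leaves, w = leaf weights):

```latex
\text{Obj} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} + \alpha \sum_{j=1}^{T} \lvert w_j \rvert
```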

Question

What is GBDT (Gradient Boosted Decision Trees)?

Answer

The general class of gradient boosting methods that use decision trees as base learners. Includes implementations like XGBoost, LightGBM, CatBoost, and scikit-learn's GradientBoostingClassifier.

Question

What is the difference between level-wise and leaf-wise tree growth?

Answer

Level-wise (XGBoost default): grows trees level by level, balanced. Leaf-wise (LightGBM default): always splits the leaf with the highest loss reduction, creating deeper, asymmetric trees — faster but more prone to overfitting on small data.

Question

What is DART (Dropouts meet Multiple Additive Regression Trees)?

Answer

A boosting variant that randomly drops trees from the ensemble during training (like dropout in neural networks) to prevent over-specialization of later trees and reduce overfitting.

Question

What is a blender / meta-learner in stacking?

Answer

The second-level model in stacking that takes the out-of-fold predictions of base models as input features and learns the optimal way to combine them.

Question

Why use out-of-fold predictions in stacking?

Answer

To prevent the meta-learner from overfitting to training data. Base models are trained on k-1 folds and predict on the held-out fold, so the meta-learner trains on predictions the base models never saw during fitting.
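
A hedged sketch of generating out-of-fold features by hand with `cross_val_predict` (model choices are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=400, random_state=0)

# Each row's probabilities come from a model fit on the other 4 folds,
# so the base model never saw that row during fitting.
oof = cross_val_predict(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=5, method="predict_proba",
)
meta = LogisticRegression().fit(oof, y)  # meta-learner on OOF predictions
```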

Question

What is diversity in ensemble learning and why does it matter?

Answer

Diversity refers to the degree to which ensemble members make different errors. Higher diversity leads to better error cancellation and improved ensemble performance. Achieved via different algorithms, features, hyperparameters, or data subsets.

Question

What is the Condorcet Jury Theorem as it relates to ensembles?

Answer

If each classifier has >50% accuracy and errors are independent, combining them by majority vote increases accuracy toward 100% as the ensemble size grows. Underscores the value of diverse, independently erring learners.
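
A quick Monte-Carlo check of the intuition: 51 independent voters, each right 60% of the time, combined by majority vote:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_voters, trials = 0.6, 51, 20000

correct = rng.random((trials, n_voters)) < p  # True = voter answered correctly
majority_acc = (correct.sum(axis=1) > n_voters // 2).mean()
# majority_acc lands well above the individual 0.6 accuracy
```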

Question

What is the main risk of boosting compared to bagging?

Answer

Boosting is more prone to overfitting, especially with noisy data, because it increases the weight of misclassified (possibly noisy) samples. Bagging is more robust to noise.

Question

When should you prefer Random Forest over Gradient Boosting?

Answer

Random Forest is preferred when: training speed is critical, data is noisy, tuning time is limited, or interpretability matters. Gradient Boosting typically achieves higher accuracy but requires more hyperparameter tuning.