What is the primary purpose of using ensemble methods in machine learning?
- To combine multiple weak models to form a strong model
- To focus on a single algorithm
- To reduce computational complexity
- To use only the best model
Ensemble methods combine the predictions of multiple weak models into a single, more robust and accurate model. By leveraging the strengths of several models, they typically achieve better generalization and performance than any individual model.
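For illustration, here is a minimal scikit-learn sketch (the synthetic dataset and the choice of base estimators are assumptions for this example): three simple models combined by majority vote typically outperform each model on its own.

```python
# Sketch: combining several weak/simple learners with a voting ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # illustrative data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(max_depth=2)),  # deliberately weak learner
    ("nb", GaussianNB()),
])
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))  # usually beats each base model alone
```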
While performing Cross-Validation, you notice a significant discrepancy between training and validation performance in each fold. What might be the reason, and how would you address it?
- All of the above
- Data leakage; ensure proper separation between training and validation
- Overfitting; reduce model complexity
- Underfitting; increase model complexity
A significant gap between training and validation performance in each fold can stem from overfitting, underfitting, or data leakage. Addressing it requires identifying the underlying cause and acting accordingly: reduce model complexity for overfitting, increase it for underfitting, or enforce strict separation between training and validation data to prevent leakage.
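As a hedged sketch of how you might diagnose this in practice (the model and data are illustrative assumptions), `cross_validate` can report per-fold training and validation scores side by side, and wrapping preprocessing in a `Pipeline` keeps it inside each fold to prevent leakage:

```python
# Sketch: comparing train vs. validation scores per fold; the Pipeline fits
# the scaler inside each fold, avoiding train/validation leakage.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)  # illustrative data
model = make_pipeline(StandardScaler(), SVC(C=100))        # large C can overfit

scores = cross_validate(model, X, y, cv=5, return_train_score=True)
for train, val in zip(scores["train_score"], scores["test_score"]):
    print(f"train={train:.2f}  val={val:.2f}")  # a large gap suggests overfitting
```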
Overfitting is a condition where a model learns the _________ of the training data, leading to poor generalization.
- features
- noise
- patterns
- variance
Overfitting occurs when a model learns the noise in the training data rather than the underlying pattern, so it performs well on the training set but generalizes poorly to unseen data.
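A small sketch of overfitting to noise (the synthetic data is an assumption for illustration): a very high-degree polynomial can fit 20 noisy points almost perfectly while generalizing poorly.

```python
# Sketch: a degree-15 polynomial memorizes the noise in a tiny training set.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (20, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, 20)  # signal + noise

overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
overfit.fit(X, y)
print(overfit.score(X, y))  # near-perfect training fit, poor on unseen data
```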
The ___________ clustering algorithm groups together data points that are densely packed, separating them from sparse regions.
- DBSCAN
- Gaussian Mixture Model
- Hierarchical
- K-Means
The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm groups together densely packed data points and separates them from sparse areas, classifying outliers as noise.
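A minimal DBSCAN sketch (the two-moons data and the `eps`/`min_samples` values are illustrative assumptions):

```python
# Sketch: DBSCAN finds dense clusters and labels sparse points as noise (-1).
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids, plus -1 for points classified as noise
```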
What does the term "multicollinearity" mean in the context of regression?
- High correlation between predictor variables
- Multiple regression models
- Multiple target variables
- Multiplying the coefficients
Multicollinearity refers to a situation where predictor variables in a regression model are highly correlated with each other, which can make it challenging to interpret the individual effects of predictors.
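One common diagnostic is the variance inflation factor (VIF); the sketch below uses synthetic, deliberately collinear predictors as an assumption for illustration.

```python
# Sketch: VIF values far above ~10 flag severe multicollinearity.
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)   # nearly a copy of x1
X = np.column_stack([np.ones(100), x1, x2])  # include an intercept column

for i in (1, 2):
    print(variance_inflation_factor(X, i))   # both will be very large here
```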
You are tasked with reducing the dimensionality of a dataset with multiple classes, and the within-class variance is very high. How would LDA help in this scenario?
- LDA would be ineffective due to high within-class variance
- LDA would increase the dimensionality
- LDA would only focus on between-class variance
- LDA would reduce dimensionality while preserving class separation
Despite high within-class variance, LDA would "reduce dimensionality while preserving class separation": it projects the data onto directions that maximize between-class variance relative to within-class variance, keeping the classes as separable as possible.
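A minimal sketch with scikit-learn (the Iris dataset is an assumption chosen for brevity): LDA projects the data down to at most `n_classes - 1` dimensions while keeping the classes apart.

```python
# Sketch: supervised dimensionality reduction with LDA.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 4 features, 3 classes
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
print(X.shape, "->", X_lda.shape)  # (150, 4) -> (150, 2), separation preserved
```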
Explain the importance of feature selection and engineering in building a Machine Learning model.
- Enhances clustering; Reduces training time
- Enhances prediction; Increases complexity
- Improves model performance; Reduces complexity
- Improves training speed; Affects accuracy negatively
Feature selection and engineering are vital for improving model performance and reducing complexity. They help in choosing the most relevant features and transforming them for optimal model learning, thus potentially increasing accuracy and efficiency.
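As a small illustration (the synthetic data is an assumption), univariate feature selection keeps only the most informative columns, reducing complexity without hand-tuning:

```python
# Sketch: keep the 5 features most associated with the target.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)
X_sel = SelectKBest(f_classif, k=5).fit_transform(X, y)
print(X.shape, "->", X_sel.shape)  # (200, 20) -> (200, 5)
```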
What is the Odds Ratio in the context of Logistic Regression?
- A clustering metric
- A data preprocessing technique
- A measurement of how changes in one variable affect the odds of a particular outcome
- A type of loss function
The Odds Ratio quantifies how a one-unit change in a predictor variable multiplies the odds of a particular outcome. In Logistic Regression it is obtained by exponentiating a coefficient and is commonly used to interpret the predictors.
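A hedged sketch (synthetic data assumed): exponentiating the fitted logistic regression coefficients yields the odds ratio for a one-unit increase in each predictor.

```python
# Sketch: odds ratios from logistic regression coefficients.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
model = LogisticRegression().fit(X, y)
print(np.exp(model.coef_))  # odds ratios: >1 raises the odds, <1 lowers them
```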
How does the ROC Curve illustrate the performance of a binary classification model?
- Plots accuracy vs. error rate, shows overall performance
- Plots precision vs. recall, shows trade-off between sensitivity and specificity
- Plots true positive rate vs. false positive rate, shows trade-off between sensitivity and specificity
The ROC Curve plots the true positive rate against the false positive rate for different threshold values. This illustrates the trade-off between sensitivity (true positive rate) and specificity (true negative rate), helping to choose the threshold that best balances these two aspects.
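A minimal sketch of computing the curve's ingredients (the data and model are illustrative assumptions):

```python
# Sketch: TPR/FPR pairs across thresholds, summarized by the area under the curve.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

probs = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)  # one point per threshold
print(roc_auc_score(y_test, probs))              # AUC summarizes the trade-off
```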
In what scenarios would you prefer LDA over PCA?
- When class labels are irrelevant
- When class separation is the priority
- When data is nonlinear
- When maximizing total variance is the priority
You would prefer LDA over PCA "when class separation is the priority." While PCA focuses on capturing the maximum variance, LDA aims to find the directions that maximize the separation between different classes.
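The difference shows up directly in the APIs; in this sketch (the Wine dataset is an assumption for illustration), PCA never sees the labels while LDA requires them:

```python
# Sketch: PCA is unsupervised (total variance); LDA is supervised (class separation).
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_wine(return_X_y=True)
X_pca = PCA(n_components=2).fit_transform(X)                            # ignores y
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # uses y
print(X_pca.shape, X_lda.shape)
```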
What is the primary goal of Linear Discriminant Analysis (LDA) in machine learning?
- Clustering data
- Maximizing between-class variance and minimizing within-class variance
- Maximizing within-class variance
- Minimizing between-class variance
LDA aims to "maximize between-class variance and minimize within-class variance," allowing for optimal separation between different classes in the dataset. This results in better class discrimination and improved classification performance.
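For intuition, the two-class case reduces to maximizing the Fisher criterion, the ratio of between-class to within-class scatter along the projection direction w:

```latex
J(w) = \frac{w^{\top} S_B \, w}{w^{\top} S_W \, w}
```

Here S_B is the between-class scatter matrix and S_W the within-class scatter matrix; the w that maximizes J gives the most discriminative projection.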
You have fitted a Simple Linear Regression model and discovered heteroscedasticity in the residuals. What impact could this have, and how might you correct it?
- Always Leads to Overfitting, No Correction Possible
- Biased Estimates, Increase Sample Size
- Inefficiency in Estimates, Transform the Dependent Variable
- No Impact, No Correction Required
Heteroscedasticity leaves the coefficient estimates unbiased but inefficient and makes the usual standard errors unreliable. Transforming the dependent variable or using weighted least squares can correct the issue.
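A hedged sketch of detection and correction (the synthetic data and the inverse-variance weights are assumptions for illustration): a Breusch-Pagan test flags the problem, and weighted least squares restores efficient estimates.

```python
# Sketch: detect heteroscedasticity (Breusch-Pagan), correct it with WLS.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
y = 2 * x + rng.normal(scale=x)               # error variance grows with x
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()
print(het_breuschpagan(ols.resid, X)[1])      # small p-value flags the problem

wls = sm.WLS(y, X, weights=1.0 / x**2).fit()  # weights = inverse error variance
print(wls.params)                             # more efficient coefficient estimates
```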