What is the Elbow Method in the context of K-Means clustering?
- A centroid initialization technique
- A clustering visualization tool
- A method to determine the number of clusters
- A way to calculate distance between points
The Elbow Method in K-Means clustering is used to find the optimal number of clusters by plotting the within-cluster sum of squares (inertia) as a function of the number of clusters and locating the "elbow" point, beyond which adding more clusters yields only marginal improvement.
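As an illustration, here is a minimal sketch of the Elbow Method using scikit-learn's KMeans on synthetic data; the dataset and the range of candidate k values are assumptions made for the example, and the inertia values are printed rather than plotted for brevity:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with an assumed "true" number of clusters (4)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Inertia (within-cluster sum of squares) for each candidate k
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# Plotting k vs. inertia would reveal an "elbow" near the true count
for k, inertia in zip(range(1, 10), inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```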
What are the advantages and limitations of using Ridge regression over ordinary linear regression?
- Increases bias, Reduces variance, Reduces multicollinearity, Can cause overfitting
- Increases bias, Reduces variance, Tackles multicollinearity, Can cause underfitting
- Reduces overfitting, Increases variance, Lower bias, Lower variance
- Reduces overfitting, Tackles multicollinearity, Lower bias, Lower variance
Ridge regression reduces overfitting by penalizing large coefficients through L2 regularization, which lowers variance. It tackles multicollinearity but increases bias, potentially leading to underfitting. Ordinary linear regression lacks these regularization properties.
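A small sketch of this trade-off, assuming a synthetic dataset with two nearly collinear features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)

# OLS coefficients become unstable under multicollinearity;
# Ridge's L2 penalty keeps them small and stable (at the cost of bias).
print("OLS:  ", LinearRegression().fit(X, y).coef_)
print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_)
```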
In a situation where the features in your dataset are at very different scales, which regularization technique would you choose and why?
- L1 Regularization because of complexity
- L1 Regularization because of sparsity
- L2 Regularization because of scalability
- L2 Regularization because of sensitivity to noise
L2 Regularization (Ridge) would be chosen when features are at very different scales because it shrinks coefficients smoothly without eliminating any of them, preserving information from every feature. In practice the features are usually standardized first, since the penalty is sensitive to scale; Ridge then prevents overfitting while keeping all features in the model.
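A minimal sketch of this setup, assuming scikit-learn and a synthetic dataset in which one feature's scale is deliberately exaggerated:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

# Synthetic data; one feature put on a wildly different scale
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X[:, 0] *= 1000.0

# Standardizing before Ridge lets the L2 penalty treat all
# coefficients comparably; Ridge then shrinks (not eliminates) them.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(model.named_steps["ridge"].coef_)
```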
Explain the mathematical difference between MSE and RMSE and their interpretation.
- MSE is the square of RMSE; RMSE is less interpretable
- MSE is the square root of RMSE; RMSE emphasizes larger errors more
- RMSE is the square of MSE; MSE provides values in the original unit
- RMSE is the square root of MSE; MSE is in squared units
The Mean Squared Error (MSE) measures the average of the squared differences between the predicted values and the actual values, resulting in squared units. The Root Mean Squared Error (RMSE) is the square root of MSE, thus providing a value in the same unit as the original data. RMSE is often considered more interpretable for this reason.
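With MSE = (1/n) Σ (y_i − ŷ_i)² and RMSE = √MSE, a quick numeric illustration with made-up values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])

# MSE averages squared errors, so its units are squared;
# RMSE = sqrt(MSE) restores the original units.
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
print(f"MSE = {mse:.3f} (squared units), RMSE = {rmse:.3f} (original units)")
```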
If a point in DBSCAN has fewer than MinPts within its Epsilon neighborhood, it's considered a _________ point.
- border point
- cluster
- core point
- noise point
If a point in DBSCAN has fewer than MinPts within its Epsilon neighborhood, it is not a core point; assuming it also lies outside the Epsilon neighborhood of every core point (which would instead make it a border point), it is labeled a noise point. Noise points belong to no cluster and sit in isolated or low-density regions.
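A minimal sketch with scikit-learn's DBSCAN, assuming synthetic data with one deliberately planted outlier (the eps and min_samples values are illustrative choices):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two dense clusters plus a far-away isolated point
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
X = np.vstack([X, [[5.0, 5.0]]])  # an obvious outlier

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# scikit-learn marks noise points with the label -1
print("noise points:", np.sum(labels == -1))
```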
How does the bagging technique reduce the variance in a model?
- By averaging the predictions of multiple models trained on different subsets of data
- By focusing on the mean prediction
- By increasing complexity
- By reducing the number of features
Bagging reduces variance by averaging the predictions of multiple models, each trained on a different bootstrap sample of the data (drawn with replacement). This averaging smooths out individual models' fluctuations, yielding a more stable and robust ensemble.
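A minimal sketch using scikit-learn's BaggingClassifier around a decision tree, on assumed synthetic data:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# A single deep tree is high-variance; bagging averages many trees,
# each trained on a bootstrap sample (drawn with replacement).
tree = DecisionTreeClassifier(random_state=0)
bag = BaggingClassifier(tree, n_estimators=50, random_state=0)

print("single tree: ", cross_val_score(tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bag, X, y, cv=5).mean())
```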
Why is clustering considered an unsupervised learning method?
- Because it groups data without the need for labeled responses
- Because it predicts continuous outcomes
- Because it requires labeled data
- Because it uses decision trees
Clustering is considered unsupervised because it finds patterns and groups data without using labeled responses or guidance.
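A minimal sketch making the point explicit, assuming scikit-learn and synthetic data: the model receives only the feature matrix, never a target vector:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# The labels returned by make_blobs are discarded;
# the clustering algorithm sees only X.
X, _ = make_blobs(n_samples=150, centers=3, random_state=1)

# fit_predict takes no y: the groups are discovered from X alone
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
print(labels[:10])
```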
Imagine you're working on a binary classification problem, and the model is performing well in terms of accuracy but poorly in terms of recall. What might be the issue and how would you address it?
- Issue with data imbalance; Use resampling techniques
- Issue with precision; Improve accuracy
- Threshold is too high; Lower the threshold
- Threshold is too low; Increase the threshold
The issue might be that the classification threshold is set too high, causing actual positives to be classified as negatives (false negatives), which reduces recall. Lowering the threshold may improve recall without sacrificing too much precision.
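A minimal sketch of threshold tuning, assuming a logistic regression on synthetic imbalanced data (the 0.3 threshold is an arbitrary illustrative choice, and the model is evaluated on its training data purely for brevity):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, precision_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

# Lowering the cutoff trades precision for recall
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    print(f"t={threshold}: recall={recall_score(y, pred):.2f}, "
          f"precision={precision_score(y, pred, zero_division=0):.2f}")
```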
You have a highly imbalanced dataset with rare positive cases. Which performance metric would be the most informative, and why?
- AUC, as it provides a comprehensive evaluation of the model
- Accuracy, as it gives overall performance
- F1-Score, as it balances Precision and Recall
- Precision, as it focuses on false positives
In a highly imbalanced dataset, the F1-Score is often most informative because it balances Precision and Recall. Accuracy can be misleading: a model that always predicts the majority class still scores highly. While AUC and Precision are useful, F1-Score gives a better overall sense of how well the model handles both classes.
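A toy illustration of why accuracy misleads here, assuming a degenerate model that always predicts the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 1% positive class; the "model" predicts negative for everything
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros(1000, dtype=int)

# Accuracy looks excellent while the model catches zero positives;
# F1 (with zero_division=0) exposes the failure.
print("accuracy:", accuracy_score(y_true, y_pred))        # 0.99
print("F1:", f1_score(y_true, y_pred, zero_division=0))   # 0.0
```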
You are dealing with a dataset having many irrelevant features. How would you apply Lasso regression to deal with this scenario?
- By increasing the degree of the polynomial
- By using L1 regularization
- By using L2 regularization
- By using both L1 and L2 regularization
Lasso regression applies L1 regularization, which can shrink the coefficients of irrelevant features to exactly zero. This effectively performs feature selection, removing the irrelevant features from the model and simplifying it.
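A minimal sketch with scikit-learn's Lasso, assuming synthetic data where only 3 of 10 features are informative (alpha=1.0 is an illustrative choice):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression

# 10 features, only 3 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# L1 regularization drives the coefficients of the irrelevant
# features to exactly zero, performing implicit feature selection.
coef = Lasso(alpha=1.0).fit(X, y).coef_
print("nonzero coefficients:", np.sum(coef != 0), "out of", coef.size)
```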
How does dimensionality reduction help in reducing the risk of overfitting?
- All of the above
- By reducing noise
- By removing irrelevant features
- By simplifying the model
Dimensionality reduction helps in reducing the risk of overfitting by removing irrelevant features (reducing complexity), reducing noise (avoiding fitting to noise), and simplifying the model (making it more generalized).
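A minimal sketch using PCA (one common dimensionality-reduction technique) on scikit-learn's digits dataset, keeping enough components to retain an assumed 95% of the variance:

```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)

# Keep the components explaining 95% of the variance; the
# discarded directions carry mostly noise and redundant detail.
X_reduced = PCA(n_components=0.95).fit_transform(X)
print(f"{X.shape[1]} features reduced to {X_reduced.shape[1]}")
```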
What is the effect of increasing the regularization parameter in Ridge and Lasso regression?
- Decrease in bias and increase in variance
- Increase in bias and decrease in variance
- Increase in both bias and variance
- No change in bias and variance
Increasing the regularization parameter leads to greater regularization strength, resulting in an increase in bias and a decrease in variance, thus constraining the model complexity.
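A minimal sketch of the effect, assuming scikit-learn's Ridge on synthetic data: as alpha grows, the coefficients are pulled toward zero:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Larger alpha means a simpler model (higher bias) that is less
# sensitive to the particular training sample (lower variance).
for alpha in (0.01, 1.0, 100.0):
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:>6}: mean |coef| = {np.abs(coef).mean():.2f}")
```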