You have a dataset with clusters of varying densities. How would you configure the Epsilon and MinPts in DBSCAN to handle this?
- Increase Epsilon; Decrease MinPts
- Increase both Epsilon and MinPts
- Reduce both Epsilon and MinPts
- Use a different clustering algorithm
DBSCAN's Epsilon and MinPts are global parameters that apply to all clusters. If clusters have varying densities, tuning these parameters to fit one density might not suit others, leading to misclustering. In such a scenario, a different clustering algorithm that can handle varying densities, such as OPTICS or HDBSCAN, might be more appropriate.
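As a quick illustration (a minimal sketch using scikit-learn and synthetic data; the parameter values are illustrative), a single global eps tuned for a dense cluster tends to label most points of a much sparser cluster as noise:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# One tight cluster around (0, 0) and a much sparser one around (10, 10)
dense = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
sparse = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
X = np.vstack([dense, sparse])

# A single global eps tuned for the dense cluster...
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
n_noise = np.sum(labels == -1)
print(n_noise)  # most sparse-cluster points are typically labelled noise (-1)
```

Because eps and min_samples are global, no single setting recovers both clusters here: an eps large enough for the sparse cluster would start merging structure in the dense one.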
What is the main difference between Ridge and Lasso regularization?
- Both use L1 penalty
- Both use L2 penalty
- Ridge uses L1 penalty, Lasso uses L2 penalty
- Ridge uses L2 penalty, Lasso uses L1 penalty
Ridge regularization uses an L2 penalty, which shrinks coefficients but keeps them non-zero, while Lasso uses an L1 penalty, leading to some coefficients being exactly zero.
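The practical consequence is easy to see on synthetic data (a sketch assuming scikit-learn is available; the alpha values are illustrative, not tuned):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.sum(ridge.coef_ == 0))  # Ridge shrinks but keeps coefficients non-zero
print(np.sum(lasso.coef_ == 0))  # Lasso sets several irrelevant ones exactly to zero
```

This is why Lasso is often used for feature selection: the L1 penalty's exact zeros tell you which features the model discarded.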
In PCA, the Eigenvectors are also known as the ________ of the data.
- components
- directions
- eigendata
- principal directions
In PCA, the Eigenvectors, also known as the "principal directions," define the directions in which the data varies the most. They form the axes of the new feature space and capture the essential structure of the data.
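For instance (a minimal sketch with scikit-learn on synthetic 2-D data), the rows of a fitted PCA's `components_` attribute are exactly these eigenvectors; for data stretched along the 45-degree diagonal, the first principal direction comes out close to [1/√2, 1/√2] (up to sign):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Data stretched mostly along the 45-degree direction
t = rng.normal(size=300)
X = np.column_stack([t, t]) + rng.normal(scale=0.1, size=(300, 2))

pca = PCA(n_components=2).fit(X)
# Rows of components_ are the eigenvectors (principal directions), unit-length
first_direction = pca.components_[0]
print(first_direction)
```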
What is the intercept in Simple Linear Regression, and how is it interpreted?
- Maximum Value of Y
- Minimum Value of X
- Start of the Line on X-axis
- Value of Y when X is Zero
The intercept in Simple Linear Regression is the value of the dependent variable (Y) when the independent variable (X) is zero. It represents where the regression line crosses the Y-axis.
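A short sketch (assuming scikit-learn; the data is synthetic with a known true intercept of 5) shows that the fitted intercept is literally the prediction at X = 0:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 5.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)  # true intercept is 5

model = LinearRegression().fit(X, y)
print(model.intercept_)            # approximately 5: the value of Y when X is 0
print(model.predict([[0.0]])[0])   # identical to the intercept by construction
```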
Why might one prefer to use MAE over MSE in evaluating a regression model?
- MAE considers the direction of errors
- MAE gives more weight to larger errors
- MAE is less sensitive to outliers
- MAE is more computationally expensive
One might prefer to use Mean Absolute Error (MAE) over Mean Squared Error (MSE) because MAE is less sensitive to outliers. While MSE squares the differences and thus gives more weight to larger errors, MAE takes the absolute value of the differences, weighting every error in proportion to its magnitude rather than its square. This makes MAE more robust when there are outliers or when one doesn't want to overly penalize larger deviations from the true values.
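A small deterministic example (a sketch using scikit-learn's metric functions on made-up predictions) makes the difference concrete: injecting one large error inflates MSE far more than MAE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 2.1, 2.9, 4.1, 5.1])   # small, uniform errors
y_out = y_pred.copy()
y_out[-1] = 15.0                                # one large outlying error

# How much does each metric grow when the single outlier is introduced?
mae_ratio = mean_absolute_error(y_true, y_out) / mean_absolute_error(y_true, y_pred)
mse_ratio = mean_squared_error(y_true, y_out) / mean_squared_error(y_true, y_pred)
print(f"MAE grew {mae_ratio:.0f}x, MSE grew {mse_ratio:.0f}x")
```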
What challenges might arise when using Hierarchical Clustering on very large datasets?
- Computationally intensive and requires high memory
- Less accurate and requires more hyperparameters
- Less sensitive to distance metrics and more prone to noise
- Prone to overfitting and less interpretable
Hierarchical Clustering can be computationally intensive and memory-hungry, especially when dealing with very large datasets. The standard algorithms must compute and store a pairwise distance matrix whose size grows as O(n^2), where n is the number of data points. This leads to challenges in computational efficiency and memory usage, making the method less suitable for large-scale applications.
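A back-of-envelope calculation (a sketch; it only counts the condensed pairwise-distance matrix in float64, ignoring the algorithm's other overheads) shows how quickly this blows up:

```python
# Memory needed just for the condensed pairwise-distance matrix (float64)
def distance_matrix_gib(n: int) -> float:
    entries = n * (n - 1) // 2    # O(n^2) pairwise distances
    return entries * 8 / 2**30    # 8 bytes per float64, converted to GiB

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} points -> {distance_matrix_gib(n):.3f} GiB")
```

At 100,000 points the distances alone need roughly 37 GiB, before any linkage bookkeeping, which is why hierarchical methods are rarely applied directly at that scale.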
Imagine a scenario where you want to assess the stability of a statistical estimator. How would Bootstrapping help in this context?
- By fixing the bias in the estimator
- By increasing the size of the dataset
- By repeating the sampling process with replacement and calculating the variance
- By repeating the sampling process without replacement
Bootstrapping assesses the stability of a statistical estimator by repeating the sampling process with replacement and calculating variance, standard error, or other statistics. By creating numerous "bootstrap samples," it allows insights into the estimator's distribution, thereby providing a measure of its stability and reliability.
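A minimal sketch in NumPy (synthetic data; the estimator here is the sample mean, and the number of bootstrap replicates is illustrative) shows the idea: resample with replacement, recompute the estimator each time, and take the spread of those replicates as the standard error.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=200)  # the one sample we observed

# Resample WITH replacement and recompute the estimator (here, the mean)
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])

se = boot_means.std(ddof=1)  # bootstrap standard error of the mean
print(se)  # should sit near the theoretical 10 / sqrt(200) ~ 0.71
```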
Why might pruning be necessary in the construction of a Decision Tree?
- Determine Leaf Nodes
- Increase Complexity
- Increase Size
- Reduce Overfitting
Pruning is necessary to remove unnecessary branches, simplifying the model and reducing the risk of overfitting the training data.
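In scikit-learn, one way to do this is cost-complexity pruning via `ccp_alpha` (a sketch on synthetic data; the alpha value is illustrative, and in practice it would be chosen by cross-validation):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unrestricted tree grows until it fits the training data perfectly
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Cost-complexity pruning removes branches that buy too little impurity reduction
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X_tr, y_tr)

print(full.get_n_leaves(), pruned.get_n_leaves())    # pruned tree is smaller
print(full.score(X_te, y_te), pruned.score(X_te, y_te))
```

The full tree memorizes the training set (training accuracy 1.0), while the pruned tree is simpler and usually generalizes at least as well.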
What is Bootstrapping, and how does it differ from Cross-Validation?
- A method for resampling data with replacement
- A technique for training ensemble models
- A technique to reduce bias
- A type of Cross-Validation
Bootstrapping is a method for resampling data with replacement, used to estimate statistics about a population from a sample. It differs from Cross-Validation, where data is split without replacement to validate the model. Bootstrapping is more about estimating the properties of an estimator, while Cross-Validation assesses the model's performance.
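The with/without-replacement distinction can be made concrete in a few lines (a sketch using NumPy and scikit-learn's KFold on a toy index array):

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)
rng = np.random.default_rng(0)

# Bootstrap: sample WITH replacement -> duplicates likely, some points left out
boot = rng.choice(data, size=data.size, replace=True)
print(sorted(boot))

# Cross-validation: split WITHOUT replacement -> across the 5 folds,
# every point lands in the test set exactly once
kf = KFold(n_splits=5)
test_indices = np.concatenate([test for _, test in kf.split(data)])
print(sorted(test_indices))
```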
What are the main challenges in training a Machine Learning model with imbalanced datasets?
- Computational complexity
- Dimensionality reduction
- Lack of suitable algorithms
- Overfitting to the majority class
Training on imbalanced datasets can lead to models that are biased towards the majority class, since they have seen more examples of it. This can make the model perform poorly on the minority class even while overall accuracy looks deceptively high.
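The accuracy trap is easy to demonstrate (a sketch with scikit-learn on a synthetic 95/5 split, using a baseline that always predicts the majority class):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Roughly 95% majority class (0), 5% minority class (1)
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# A "classifier" that learns nothing and always predicts the majority class
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)

print(accuracy_score(y, pred))   # high, despite being useless
print(recall_score(y, pred))     # 0.0 — every minority example is missed
```

Metrics such as recall, precision, or F1 on the minority class (or resampling/class weighting during training) are therefore preferred over plain accuracy in this setting.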