What are the key differences between Hierarchical Clustering and K-Means Clustering?
- Algorithm Complexity
- Cluster Number & Structure
- Data Type
- Learning Type
Hierarchical Clustering builds a tree-like structure (a dendrogram) and does not require a predefined number of clusters, whereas K-Means requires the number of clusters, k, in advance and produces flat, non-hierarchical clusters.
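As a minimal sketch of that difference (using scikit-learn and SciPy, an assumption since the question names no library, and made-up 2-D data): K-Means needs `n_clusters` up front, while hierarchical clustering first builds the full merge tree and only afterwards cuts it into flat clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # toy 2-D data

# K-Means: the number of clusters must be fixed in advance.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical clustering: build the full merge tree first...
Z = linkage(X, method="ward")
# ...then decide how many flat clusters to extract afterwards.
hier_labels = fcluster(Z, t=3, criterion="maxclust")
```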
You are using Bootstrapping to estimate the confidence interval for a model parameter. Explain how the process works.
- By calculating the mean and standard deviation without resampling
- By randomly selecting without replacement from the dataset
- By resampling with replacement and calculating empirical quantiles of the distribution
- By splitting the data into training and validation sets
To estimate a confidence interval for a model parameter with bootstrapping, you repeatedly resample with replacement from the original data, compute the parameter on each resampled dataset, and then take empirical quantiles of the resulting distribution of estimates (for example, the 2.5th and 97.5th percentiles for a 95% interval). This allows the estimation of confidence intervals even when the underlying distribution is unknown.
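A concrete sketch of the percentile bootstrap with NumPy (assuming, for illustration, that the parameter of interest is a sample mean and the data are made up):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=200)   # skewed data, unknown distribution

n_boot = 5000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Resample with replacement, same size as the original sample.
    sample = rng.choice(data, size=data.size, replace=True)
    boot_means[i] = sample.mean()

# Empirical 2.5% and 97.5% quantiles give a 95% confidence interval.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: [{lower:.3f}, {upper:.3f}]")
```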
A business stakeholder wants to use a very high-degree Polynomial Regression for forecasting, arguing that it fits the historical data perfectly. How would you explain the risks of this approach and suggest a more robust method?
- Encourage the high-degree approach
- Explain the risk of overfitting and suggest using cross-validation or regularization
- Focus only on training data
- Ignore the stakeholder's suggestion
A high-degree polynomial is prone to overfitting: it fits noise in the historical data and may not generalize to future observations. Explaining this risk and suggesting more robust practices, such as cross-validation to check out-of-sample performance or regularization to constrain the coefficients, helps build a more reliable forecasting model.
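One way to make the argument concrete is a sketch with scikit-learn (a noisy one-dimensional series stands in for the stakeholder's historical data, an assumption): the high-degree fit scores well on training data but poorly under cross-validation, while a regularized model is more stable.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 3, size=40)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=40)

for degree in (2, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    train_r2 = model.fit(X, y).score(X, y)          # looks great for degree 15
    cv_r2 = cross_val_score(model, X, y, cv=5).mean()  # tells the real story
    print(f"degree={degree:2d}  train R^2={train_r2:.3f}  CV R^2={cv_r2:.3f}")

# Regularization (Ridge on the polynomial features) is one robust alternative.
regularized = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1.0))
print("regularized CV R^2:", cross_val_score(regularized, X, y, cv=5).mean())
```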
When interpreting a dendrogram in Hierarchical Clustering, the height of the _________ represents the distance at which clusters are merged.
- Branches
- Leaves
- Lines
- Nodes
In a dendrogram, the height of the branches represents the distance at which clusters are merged. The higher the branch, the greater the distance, indicating that the clusters being merged are less similar. This information can guide the selection of the number of clusters and provides insights into the underlying structure of the data.
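A short sketch with SciPy and Matplotlib (illustrative library choices, toy data) that builds and plots a dendrogram; the y-axis height of each merge is the linkage distance between the clusters being joined.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
# Two loose groups of points so the tall branches are easy to spot.
X = np.vstack([rng.normal(0, 0.5, (15, 2)), rng.normal(4, 0.5, (15, 2))])

Z = linkage(X, method="ward")
dendrogram(Z)                      # branch height = merge distance
plt.ylabel("merge distance")
plt.show()
```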
What type of problems is Logistic Regression mainly used to solve?
- Binary classification problems
- Clustering problems
- Regression problems
- Unsupervised learning problems
Logistic Regression is mainly used to solve binary classification problems, where the goal is to classify instances into one of two classes.
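For instance, a minimal sketch using scikit-learn's built-in breast-cancer dataset (chosen only because it is a readily available binary task):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)      # two classes: malignant / benign
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("class probabilities for first test sample:", clf.predict_proba(X_test[:1]))
```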
In classification, the ________ metric is often used to evaluate the balance between precision and recall.
- Accuracy
- F1 Score
- Mean Squared Error
- R-squared
The F1 Score is the harmonic mean of precision and recall, providing a single metric that balances the trade-off between these two important metrics.
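In symbols, F1 = 2 · precision · recall / (precision + recall). A quick check with scikit-learn (the library and the tiny label vectors are illustrative assumptions):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print("precision:", p, "recall:", r)
print("harmonic mean:", 2 * p * r / (p + r))
print("f1_score     :", f1_score(y_true, y_pred))   # same value
```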
What are the main types of Machine Learning?
- Reinforcement, Unsupervised
- Supervised, Semi-supervised
- Supervised, Unsupervised
- Supervised, Unsupervised, Reinforcement
The main types of Machine Learning are Supervised Learning (learning with labeled data), Unsupervised Learning (learning without labeled data), and Reinforcement Learning (learning by interacting with an environment). These types facilitate different learning processes and are applied in various domains.
You've applied PCA but the variance explained by the first few components is very low. What could be the underlying issue and how might you remedy it?
- The data has no variance, so PCA is not applicable
- The data is not centered, so you should center it before applying PCA
- The data is too complex for PCA, so you should switch algorithms
- The eigenvalues have been miscalculated and you should recalculate them
If the variance explained by the first few components is very low, a common cause is that the data was not centered before the decomposition. PCA assumes each feature has zero mean; without centering, the leading component can point toward the data mean rather than the direction of greatest spread, distorting the explained-variance picture. Subtracting the mean of each feature is therefore a necessary preprocessing step for PCA.
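A sketch with NumPy, doing PCA by hand via the SVD (an assumption about how PCA was applied here; note that scikit-learn's `PCA` centers the data automatically). The data are made up so that the true spread lies along the first feature while the cloud sits far from the origin:

```python
import numpy as np

rng = np.random.default_rng(0)
# Most spread is along the first feature, but the cloud is far from the origin.
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 1.0]) + np.array([0.0, 50.0, 50.0])

def top_direction(M):
    # First right-singular vector = leading principal axis of M.
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[0]

print("uncentered leading axis:", np.round(top_direction(X), 2))   # ~ direction of the mean
print("centered leading axis  :", np.round(top_direction(X - X.mean(axis=0)), 2))  # ~ real spread
```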
In the context of model evaluation, Bootstrapping can be used to assess the _________ of a statistical estimator or a machine learning model.
- bias
- robustness
- stability
- variance
In the context of model evaluation, Bootstrapping can be used to assess the stability of a statistical estimator or a machine learning model. By repeatedly resampling with replacement and observing the changes in estimates, one can gain insights into the stability and reliability of the model or estimator.
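A brief sketch of that idea (NumPy and scikit-learn, with a made-up regression task): bootstrap the training rows, refit the model each time, and look at how much a coefficient varies across resamples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 1))
y = 3.0 * X.ravel() + rng.normal(scale=1.0, size=150)

coefs = []
for _ in range(1000):
    idx = rng.integers(0, len(X), size=len(X))   # resample rows with replacement
    coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_[0])

coefs = np.array(coefs)
# A small spread across resamples suggests a stable estimator.
print(f"coefficient: mean={coefs.mean():.3f}, std={coefs.std():.3f}")
```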
Imagine you're using DBSCAN for spatial data clustering, but the clusters are not forming as expected. What steps would you take to analyze and fix the situation?
- All of the above
- Analyze feature scaling; Adjust Epsilon and MinPts
- Apply a linear transformation to the data
- Increase the dimensionality of the data
Clustering spatial data requires a careful analysis of the scale of the features, as well as appropriate tuning of Epsilon and MinPts. Feature scaling ensures that distances are comparable across dimensions. Adjusting Epsilon and MinPts tailors the algorithm to the specific density and size characteristics of the clusters in the spatial data.
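A sketch of both steps with scikit-learn (toy coordinates stand in for the spatial data, and the parameter values are illustrative): standardize the features so distances are comparable, then sweep `eps` (and, if needed, `min_samples`).

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X[:, 1] *= 100.0                      # one badly scaled feature ruins the distances

X_scaled = StandardScaler().fit_transform(X)

for eps in (0.1, 0.3, 0.5):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_scaled)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)   # -1 marks noise
    print(f"eps={eps}: {n_clusters} clusters, {np.sum(labels == -1)} noise points")
```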
To prevent overfitting in Polynomial Regression, you might use techniques like _______, ________, or ________ regularization.
- Lasso, Accuracy, Elastic Net
- Lasso, Ridge, Elastic Net
- Lasso, Ridge, Stability
- Ridge, Stability, Elastic Net
Lasso, Ridge, and Elastic Net regularization techniques can be used to prevent overfitting in Polynomial Regression by adding constraints to the coefficients.
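A sketch fitting all three on the same polynomial features (scikit-learn; the degree, penalties, and toy data are illustrative assumptions):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = np.linspace(-1, 1, 100).reshape(-1, 1)
y = np.sin(3 * X).ravel() + rng.normal(scale=0.1, size=100)

for name, model in [
    ("Ridge", Ridge(alpha=1.0)),                               # L2 penalty shrinks coefficients
    ("Lasso", Lasso(alpha=0.01, max_iter=50000)),              # L1 penalty zeroes some of them
    ("ElasticNet", ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=50000)),  # mix of both
]:
    pipe = make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                         StandardScaler(), model)
    print(name, "train R^2:", round(pipe.fit(X, y).score(X, y), 3))
```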
Which regularization technique adds L1 penalty, causing some coefficients to be exactly zero?
- Elastic Net
- Lasso
- Ridge
Lasso regularization adds an L1 penalty, causing some of the coefficients to become exactly zero, effectively removing those features from the model.
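To see the zeroing in action, here is a sketch on a made-up dataset where only two of ten features actually drive the target (scikit-learn; the alpha value is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only features 0 and 3 actually drive the target.
y = 4.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.round(lasso.coef_, 3))   # most coefficients come out exactly 0.0
```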