Choosing too small a value for K in KNN can lead to a __________ model, while choosing too large a value can lead to a __________ model.

  • fast, slow
  • noisy, smooth
  • slow, fast
  • smooth, noisy
A small K makes the model sensitive to individual noisy points, producing a jagged, noisy fit; a large K averages over many neighbors, producing a smoother (and potentially over-smoothed) model.
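This effect can be seen in a minimal pure-Python KNN sketch on illustrative 1-D data, where one mislabeled "noise" point flips the prediction for K=1 but is averaged away for K=5:

```python
from collections import Counter

# Toy data: class 0 clusters near x=0, class 1 near x=10,
# plus one mislabeled noise point at x=4.9 with label 1.
train = [(0.0, 0), (0.5, 0), (1.0, 0), (1.5, 0), (4.9, 1),
         (9.0, 1), (9.5, 1), (10.0, 1), (10.5, 1)]

def knn_predict(x, k):
    # Take the k training points closest to x and majority-vote their labels.
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

query = 4.0  # closer to the noisy point than to either cluster core
print(knn_predict(query, k=1))  # follows the single noisy neighbor -> 1
print(knn_predict(query, k=5))  # averaged over more neighbors -> 0
```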

How would you approach the problem of data leakage during the preprocessing and modeling phase of a Machine Learning project?

  • Ignore the problem as it has no impact
  • Mix the test and training data for preprocessing
  • Split the data before any preprocessing and carefully handle information from the validation/test sets
  • Use the same preprocessing techniques on all data regardless of splitting
To prevent data leakage, it's crucial to split the data before any preprocessing, ensuring that information from the validation or test sets doesn't influence the training process. This helps maintain the integrity of the evaluation.
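The leakage-safe order of operations can be sketched in a few lines of pure Python with toy numbers: split first, fit the scaling statistics on the training split only, then apply them to the test split.

```python
from statistics import mean, stdev

data = [3.0, 5.0, 7.0, 9.0, 100.0]    # last value is a test-set outlier
train, test = data[:4], data[4:]       # split FIRST, before any preprocessing

mu, sigma = mean(train), stdev(train)  # fit the scaler on train only

def scale(xs):
    # Apply train-derived statistics; test information never leaks in.
    return [(x - mu) / sigma for x in xs]

train_scaled = scale(train)
test_scaled = scale(test)              # the outlier never influenced mu/sigma
```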

In a multiclass classification problem with imbalanced classes, how would you ensure that your model is not biased towards the majority class?

  • Implement resampling techniques and consider using balanced algorithms
  • Increase the number of features
  • Use only majority class for training
  • Use the same algorithm for all classes
Implementing resampling techniques to balance the classes and considering algorithms that handle class imbalance can ensure that the model doesn't become biased towards the majority class.
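One of the simplest resampling techniques is random oversampling; a pure-Python sketch on illustrative labels duplicates minority-class samples until every class matches the majority count:

```python
import random
from collections import Counter

random.seed(0)
samples = [("a", 0)] * 9 + [("b", 1)] * 3 + [("c", 2)] * 1  # imbalanced

counts = Counter(label for _, label in samples)
target = max(counts.values())          # majority-class size
balanced = list(samples)
for label, n in counts.items():
    pool = [s for s in samples if s[1] == label]
    # Draw (with replacement) enough extra copies to reach the target.
    balanced += random.choices(pool, k=target - n)

print(Counter(label for _, label in balanced))  # every class now has 9
```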

_________ is a metric that considers both the ability of the classifier to correctly identify positive cases and the ability to correctly identify negative cases.

  • AUC
  • F1-Score
  • Precision
AUC (Area Under the Curve) considers both the ability of the classifier to identify positive cases (sensitivity) and the ability to identify negative cases (specificity) at various thresholds, providing a comprehensive view.
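AUC also equals the probability that a randomly chosen positive example is ranked above a randomly chosen negative one (the Mann-Whitney view), which gives a compact pure-Python sketch on toy scores:

```python
def auc(labels, scores):
    # Fraction of (positive, negative) pairs ranked correctly;
    # ties count as half a win.
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 0.75: one of the four pos/neg pairs is misranked
```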

Imagine you have a dataset where the relationship between the variables is cubic. What type of regression would be appropriate, and why?

  • Linear Regression
  • Logistic Regression
  • Polynomial Regression of degree 3
  • Ridge Regression
Since the relationship between the variables is cubic, a Polynomial Regression of degree 3 fits it directly. Linear regression would underfit the curvature, logistic regression targets classification rather than a continuous outcome, and ridge regression on plain linear features would likewise miss the cubic shape.
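As a minimal pure-Python sketch (toy data generated from a known cubic), fitting degree-3 polynomial features through four points recovers the generating coefficients exactly, which no straight line could do:

```python
def fit_cubic(xs, ys):
    # Build augmented rows [1, x, x^2, x^3 | y] and solve the 4x4 system
    # by Gaussian elimination with partial pivoting.
    A = [[x ** k for k in range(4)] + [y] for x, y in zip(xs, ys)]
    for i in range(4):
        pivot = max(range(i, 4), key=lambda r: abs(A[r][i]))
        A[i], A[pivot] = A[pivot], A[i]
        for r in range(i + 1, 4):
            f = A[r][i] / A[i][i]
            A[r] = [a - f * b for a, b in zip(A[r], A[i])]
    coef = [0.0] * 4
    for i in range(3, -1, -1):  # back substitution
        coef[i] = (A[i][4] - sum(A[i][j] * coef[j]
                                 for j in range(i + 1, 4))) / A[i][i]
    return coef  # [c0, c1, c2, c3]

xs = [0.0, 1.0, 2.0, 3.0]
ys = [x ** 3 - 2 * x for x in xs]  # true relationship: y = x^3 - 2x
coef = fit_cubic(xs, ys)           # recovers (0, -2, 0, 1) up to rounding
```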

How do pruning techniques affect a Decision Tree?

  • Decrease Accuracy
  • Increase Complexity
  • Increase Size
  • Reduce Overfitting
Pruning techniques remove branches from the tree to simplify the model and reduce overfitting.
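A toy reduced-error-pruning sketch (pure Python, hand-built tree and illustrative validation data) shows the mechanism: collapse a subtree into a leaf whenever validation accuracy does not drop, which removes branches that only memorized noise.

```python
from collections import Counter

def predict(tree, x):
    if not isinstance(tree, dict):
        return tree  # leaf: a class label
    side = "left" if x[tree["feature"]] < tree["threshold"] else "right"
    return predict(tree[side], x)

def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)

def prune(tree, data):
    # Bottom-up: prune children first, then try replacing this node
    # with its majority-class leaf.
    if not isinstance(tree, dict) or not data:
        return tree
    left = [(x, y) for x, y in data if x[tree["feature"]] < tree["threshold"]]
    right = [(x, y) for x, y in data if x[tree["feature"]] >= tree["threshold"]]
    tree["left"], tree["right"] = prune(tree["left"], left), prune(tree["right"], right)
    leaf = Counter(y for _, y in data).most_common(1)[0][0]
    if accuracy(leaf, data) >= accuracy(tree, data):
        return leaf  # simpler subtree, no loss on validation data
    return tree

# A tree whose deep left split memorized a noisy training point.
tree = {"feature": 0, "threshold": 5.0,
        "left": {"feature": 0, "threshold": 2.0, "left": 1, "right": 0},
        "right": 1}
val = [([1.0], 0), ([3.0], 0), ([4.0], 0), ([7.0], 1), ([8.0], 1)]
pruned = prune(tree, val)  # collapses the overfit left subtree to leaf 0
```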

What role does the distance metric play in the K-Nearest Neighbors (KNN) algorithm?

  • Assigns classes
  • Defines decision boundaries
  • Determines clustering
  • Measures similarity between points
The distance metric in KNN is used to measure the similarity between points and determine the nearest neighbors.
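Because the metric defines what "nearest" means, different metrics can disagree about which neighbor is closest, as a two-point pure-Python sketch shows:

```python
def euclidean(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

query = (0, 0)
p, q = (3, 3), (0, 4.5)
# Euclidean: p is ~4.24 away, q is 4.5 away -> p is nearer.
# Manhattan: p is 6 away,    q is 4.5 away -> q is nearer.
# The same K can therefore select different neighbors under different metrics.
```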

In a case where both overfitting and underfitting are concerns depending on the chosen algorithm, how would you systematically approach model selection and tuning?

  • Increase model complexity
  • Reduce model complexity
  • Use L1 regularization
  • Use grid search with cross-validation
A systematic approach uses techniques like grid search with cross-validation to explore different hyperparameters and model complexities, ensuring that the selected model neither overfits nor underfits and generalizes well to unseen data.
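The procedure can be sketched in pure Python on toy 1-D data: score each candidate hyperparameter (here K for a tiny KNN) by its average held-out accuracy across folds, then keep the best scorer.

```python
from collections import Counter

# Toy 1-D data: class 0 near the low end, class 1 near the high end.
data = [(x / 2, 0) for x in range(10)] + [(5 + x / 2, 1) for x in range(10)]

def knn_predict(train, x, k):
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(l for _, l in nearest).most_common(1)[0][0]

def cv_score(k, folds=5):
    # Average validation accuracy over `folds` interleaved splits.
    scores = []
    for i in range(folds):
        val = data[i::folds]
        train = [p for j, p in enumerate(data) if j % folds != i]
        scores.append(sum(knn_predict(train, x, k) == y
                          for x, y in val) / len(val))
    return sum(scores) / folds

grid = [1, 3, 5, 7]                 # candidate hyperparameter values
best_k = max(grid, key=cv_score)    # pick the best cross-validated K
```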

Can you differentiate between Logistic Regression and K-Nearest Neighbors (KNN) in terms of use case and functionality?

  • LR is for classification, KNN for classification; LR uses probability, KNN uses distance
  • LR is for classification, KNN for regression; LR uses distance, KNN uses probability
  • LR is for classification, KNN for regression; LR uses probability, KNN uses distance
  • LR is for regression, KNN for classification; LR uses distance, KNN uses probability
Logistic Regression is used for classification and models the probability of a binary outcome. KNN is also used for classification but works by considering the 'K' nearest data points. The fundamental difference lies in the approach: LR uses a logistic function, while KNN uses distance metrics.
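The functional contrast can be made concrete with a one-line sketch of the logistic side (toy weights, purely illustrative): logistic regression maps a weighted sum through a sigmoid to a class probability, whereas KNN (as in the earlier sketch) votes among the nearest points by distance.

```python
import math

def logistic_prob(x, w, b):
    # Sigmoid of the linear score w*x + b: the modeled P(class = 1).
    return 1 / (1 + math.exp(-(w * x + b)))

print(logistic_prob(0.0, 2.0, 0.0))  # 0.5 exactly at the decision boundary
print(logistic_prob(5.0, 2.0, 0.0))  # close to 1 far on the positive side
```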

Your K-Means clustering algorithm is converging to a local minimum. What role might centroid initialization play in this, and how could you address it?

  • Increase the number of clusters
  • Initialize centroids based on labels
  • Poor initialization; Try multiple random initializations
  • Use a fixed number of centroids
Converging to a local minimum in K-Means is often due to poor initialization. Running the algorithm multiple times with different random initializations can help avoid local minima and lead to a more globally optimal solution.
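The restart strategy (analogous to scikit-learn's `n_init` behavior) can be sketched for 1-D data in pure Python: run Lloyd's algorithm from several random initializations and keep the run with the lowest inertia.

```python
import random

points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]  # toy data: two clear clusters

def kmeans(points, k, seed, iters=20):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:             # assignment step: nearest center
            clusters[min(range(k), key=lambda i: (p - centers[i]) ** 2)].append(p)
        centers = [sum(c) / len(c) if c else centers[i]   # update step: means
                   for i, c in enumerate(clusters)]
    inertia = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, inertia

# Multiple restarts: keep the initialization yielding the smallest inertia.
best_centers, best_inertia = min((kmeans(points, 2, seed) for seed in range(10)),
                                 key=lambda r: r[1])
```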