What role does the distance metric play in the K-Nearest Neighbors (KNN) algorithm?
- Assigns classes
- Defines decision boundaries
- Determines clustering
- Measures similarity between points
The distance metric in KNN is used to measure the similarity between points and determine the nearest neighbors.
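As a minimal sketch (assuming scikit-learn and a hypothetical toy dataset), changing the `metric` parameter can change which neighbor is nearest and therefore the predicted label:

```python
# Minimal sketch: the same query gets a different nearest neighbor, and hence
# a different label, depending on the distance metric.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[3.0, 0.0], [2.0, 2.0]])  # toy training points
y = np.array([0, 1])

for metric in ("euclidean", "manhattan"):
    knn = KNeighborsClassifier(n_neighbors=1, metric=metric).fit(X, y)
    # Euclidean: (2, 2) is nearer to the origin; Manhattan: (3, 0) is nearer.
    print(metric, knn.predict([[0.0, 0.0]]))
```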
How do pruning techniques affect a Decision Tree?
- Decrease Accuracy
- Increase Complexity
- Increase Size
- Reduce Overfitting
Pruning techniques remove branches from the tree to simplify the model and reduce overfitting.
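As a minimal sketch (assuming scikit-learn and synthetic data), cost-complexity pruning via `ccp_alpha` illustrates this: a larger penalty prunes more branches, producing a smaller, simpler tree.

```python
# Minimal sketch: larger ccp_alpha prunes more aggressively, shrinking the tree.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

for alpha in (0.0, 0.02):
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
    print(f"ccp_alpha={alpha}: {tree.tree_.node_count} nodes")
```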
Imagine you have a dataset where the relationship between the variables is cubic. What type of regression would be appropriate, and why?
- Linear Regression
- Logistic Regression
- Polynomial Regression of degree 3
- Ridge Regression
Since the relationship between the variables is cubic, Polynomial Regression of degree 3 is the best fit: its terms up to x³ can model the cubic relationship directly. Linear Regression would underfit the curvature, Logistic Regression is for classification rather than regression, and Ridge Regression only regularizes a linear fit.
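A minimal sketch (scikit-learn, synthetic cubic data): expanding the input with terms up to degree 3 lets an ordinary linear model fit the cubic curve.

```python
# Minimal sketch: PolynomialFeatures(degree=3) + LinearRegression fits y = 2x^3 - x.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 2 * X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=1.0, size=100)

model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(X, y)
print(model.predict([[2.0]]))  # should be close to 2*8 - 2 = 14
```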
_________ is a metric that considers both the ability of the classifier to correctly identify positive cases and the ability to correctly identify negative cases.
- AUC
- F1-Score
- Precision
AUC (Area Under the Curve) considers both the ability of the classifier to identify positive cases (sensitivity) and the ability to identify negative cases (specificity) at various thresholds, providing a comprehensive view.
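A minimal sketch (scikit-learn, hand-made labels and scores): `roc_auc_score` computes the area under the ROC curve from true labels and predicted positive-class scores.

```python
# Minimal sketch: AUC summarizes the true-positive/false-positive trade-off
# across all classification thresholds.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities of the positive class
print(roc_auc_score(y_true, y_score))  # 0.75
```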
In a multiclass classification problem with imbalanced classes, how would you ensure that your model is not biased towards the majority class?
- Implement resampling techniques and consider using balanced algorithms
- Increase the number of features
- Use only majority class for training
- Use the same algorithm for all classes
Implementing resampling techniques to balance the classes and considering algorithms that handle class imbalance can ensure that the model doesn't become biased towards the majority class.
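One balancing option, as a minimal sketch (scikit-learn, synthetic imbalanced data): class weights inversely proportional to class frequency. Resampling approaches such as SMOTE (from the separate imbalanced-learn package) are a common alternative.

```python
# Minimal sketch: class_weight="balanced" reweights the loss so minority
# classes are not drowned out by the majority class.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=4,
                           weights=[0.8, 0.15, 0.05], random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```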
How would you approach the problem of data leakage during the preprocessing and modeling phase of a Machine Learning project?
- Ignore the problem as it has no impact
- Mix the test and training data for preprocessing
- Split the data before any preprocessing and carefully handle information from the validation/test sets
- Use the same preprocessing techniques on all data regardless of splitting
To prevent data leakage, it's crucial to split the data before any preprocessing, ensuring that information from the validation or test sets doesn't influence the training process. This helps maintain the integrity of the evaluation.
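A minimal sketch (scikit-learn, synthetic data) of the split-first discipline: the scaler learns its statistics from the training split only and is merely applied to the test split.

```python
# Minimal sketch: fit preprocessing on the training data only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)  # mean/std come from the training set only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)     # transform, never re-fit, on the test set
```

Wrapping the preprocessing and the model in a scikit-learn `Pipeline` and cross-validating the whole pipeline enforces the same guarantee automatically.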
Your K-Means clustering algorithm is converging to a local minimum. What role might centroid initialization play in this, and how could you address it?
- Increase the number of clusters
- Initialize centroids based on labels
- Poor initialization; Try multiple random initializations
- Use a fixed number of centroids
Converging to a local minimum in K-Means is often due to poor initialization. Running the algorithm multiple times with different random initializations can help avoid local minima and lead to a more globally optimal solution.
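A minimal sketch (scikit-learn, synthetic blobs): `n_init` restarts K-Means from several random initializations and keeps the run with the lowest inertia, while k-means++ seeding spreads the initial centroids apart.

```python
# Minimal sketch: multiple restarts + k-means++ seeding reduce the risk of
# converging to a poor local minimum.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)  # lowest within-cluster sum of squares across the 10 runs
```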
Can you differentiate between Logistic Regression and K-Nearest Neighbors (KNN) in terms of use case and functionality?
- LR is for classification, KNN for classification; LR uses probability, KNN uses distance
- LR is for classification, KNN for regression; LR uses distance, KNN uses probability
- LR is for classification, KNN for regression; LR uses probability, KNN uses distance
- LR is for regression, KNN for classification; LR uses distance, KNN uses probability
Logistic Regression is used for classification and models the probability of a binary outcome via the logistic (sigmoid) function. KNN is also used for classification but assigns a label by voting among the 'K' nearest data points under a distance metric. The fundamental difference is the mechanism: LR fits a probabilistic model, while KNN relies on distance-based neighbor votes.
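A minimal sketch (scikit-learn, toy 1-D data) contrasting the two mechanisms:

```python
# Minimal sketch: LR outputs a probability from a fitted sigmoid curve;
# KNN outputs vote fractions from the labels of the nearest points.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

lr = LogisticRegression().fit(X, y)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

print(lr.predict_proba([[1.5]]))   # smooth probability from the logistic curve
print(knn.predict_proba([[1.5]]))  # neighbor vote fractions under a distance metric
```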
Explain how weighting the contributions of the neighbors can improve the KNN algorithm's performance.
- Allows more influence from nearer neighbors
- Improves sensitivity to outliers
- Increases bias
- Reduces complexity
Weighting the contributions of the neighbors allows nearer neighbors to have more influence on the prediction, often leading to improved performance in KNN.
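A minimal sketch (scikit-learn, toy 1-D data): with `weights="distance"`, each neighbor's vote is weighted by the inverse of its distance, so one very close neighbor can outvote two farther ones.

```python
# Minimal sketch: uniform voting picks class 0 (two far neighbors), while
# distance weighting picks class 1 (one very close neighbor).
from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [10]]
y = [0, 0, 1, 1]

uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X, y)
weighted = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)

print(uniform.predict([[2.5]]), weighted.predict([[2.5]]))
```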
You notice that your KNN model is highly sensitive to outliers. What might be causing this, and how could the choice of K and distance metric help in alleviating this issue?
- Choose a larger K and an appropriate distance metric to mitigate sensitivity
- Choose a small K and ignore outliers
- Focus only on the majority class
- Outliers have no effect
Choosing a larger K and an appropriate distance metric can mitigate sensitivity to outliers: a larger K averages the vote over more neighbors, so a single aberrant point has less influence, and a robust metric such as Manhattan distance is less dominated by one extreme coordinate difference than Euclidean distance.
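A minimal sketch (scikit-learn, toy data with one mislabeled outlier): with K=1 the outlier dictates the prediction, while K=5 outvotes it.

```python
# Minimal sketch: a larger K dilutes the influence of a single outlier.
from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [1, 0], [0, 1], [1, 1], [0.9, 0.9]]
y = [0, 0, 0, 0, 1]  # the last point is an outlier

for k in (1, 5):
    knn = KNeighborsClassifier(n_neighbors=k, metric="manhattan").fit(X, y)
    print(k, knn.predict([[0.8, 0.8]]))  # K=1 copies the outlier; K=5 outvotes it
```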
You have built a Logistic Regression model, but the link test indicates that the Logit link function may not be appropriate. What could be the issue?
- Incorrect loss function
- Multicollinearity
- Non-linearity between predictors and log-odds
- Overfitting
If the Logit link function is not appropriate, it might indicate that there is a non-linear relationship between the predictors and the log-odds of the response, violating the assumptions of Logistic Regression.
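One possible remedy, as a minimal sketch (scikit-learn, synthetic data): if the log-odds are non-linear in a predictor, adding transformed terms such as polynomial features can restore the linearity the logit link assumes.

```python
# Minimal sketch: quadratic features let the linear-in-parameters logit model
# capture curvature that a plain Logistic Regression would miss.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = (X[:, 0] ** 2 > 2).astype(int)  # class depends non-linearly on x

model = make_pipeline(PolynomialFeatures(degree=2),
                      LogisticRegression(max_iter=1000)).fit(X, y)
print(model.score(X, y))  # near-perfect fit once the quadratic term is included
```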
What does the Mean Absolute Error (MAE) metric represent in regression analysis?
- Average of absolute errors
- Average of squared errors
- Sum of absolute errors
- Sum of squared errors
The Mean Absolute Error (MAE) represents the average of the absolute errors between the predicted values and the actual values. Unlike MSE, MAE does not square the errors, so it doesn't give extra weight to larger errors, making it more robust to outliers. It provides an understanding of how much the predictions deviate from the actual values on average.
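A minimal sketch computing MAE by hand and with scikit-learn on a hypothetical set of predictions; the two agree.

```python
# Minimal sketch: MAE is the mean of |y_true - y_pred|.
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

print(np.mean(np.abs(y_true - y_pred)))    # 0.5
print(mean_absolute_error(y_true, y_pred)) # 0.5
```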