You have a dataset with a high degree of multicollinearity. What steps would you take to address this before building a Multiple Linear Regression model?

  • Apply feature selection or dimensionality reduction techniques
  • Ignore it
  • Increase the size of the dataset
  • Remove all correlated variables
Multicollinearity can be addressed by applying feature selection techniques like LASSO or using dimensionality reduction methods like Principal Component Analysis (PCA). These techniques help in removing or combining correlated variables, reducing multicollinearity and improving the model's stability.
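To make the two approaches concrete, here is a minimal sketch on synthetic data, assuming scikit-learn and NumPy are available. Two columns are built to be nearly collinear; LASSO's L1 penalty tends to zero one of them out, while PCA combines them into fewer uncorrelated components.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + 2 * x3 + rng.normal(scale=0.1, size=n)

# LASSO: the L1 penalty tends to drop one of the redundant columns
lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)

# PCA: projects the correlated columns onto uncorrelated components
pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)
print(X_reduced.shape)  # (200, 2)
```

The data and parameter choices here are illustrative only; in practice the LASSO penalty strength and the number of PCA components are tuned (e.g. via cross-validation).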

Dimensionality reduction is often used to overcome the ___________ problem, where having too many features relative to the number of observations can lead to overfitting.

  • curse of dimensionality
  • multicollinearity
  • overfitting
  • scaling
The curse of dimensionality arises when the number of features is large relative to the number of observations: data points become sparse in the high-dimensional feature space, so models can fit noise rather than signal and are prone to overfitting. Dimensionality reduction techniques help by simplifying the feature space, reducing this risk.

Can you explain the concept of feature importance in Random Forest?

  • Feature importance focuses on eliminating features
  • Feature importance is irrelevant in Random Forest
  • Feature importance quantifies the contribution of each feature to the model's predictions
  • Feature importance ranks the features by their correlation with the target
Feature importance in Random Forest quantifies the contribution of each feature to the model's predictions. It's based on the average impurity decrease computed from all decision trees in the forest. This helps in understanding the relative importance of different features in the model.
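As a small sketch of this (assuming scikit-learn; the data is synthetic, with only the first two of five features made informative), the fitted forest exposes mean-decrease-in-impurity importances, one non-negative value per feature, normalized to sum to 1:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 5 features, only the first 2 are informative
X, y = make_classification(n_samples=300, n_features=5,
                           n_informative=2, n_redundant=0,
                           shuffle=False, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances averaged over all trees in the forest
importances = rf.feature_importances_
print(importances)
```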

What is the Confusion Matrix, and what information does it provide about a classification model?

  • A matrix representing classification errors
  • A matrix representing feature importance
  • A matrix representing model's coefficients
  • A matrix representing model's hyperparameters
The Confusion Matrix is a table that describes the performance of a classification model by categorizing predictions into True Positives, False Positives, True Negatives, and False Negatives. It gives detailed insight into where the model is making mistakes.
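The four counts can be tallied directly from predictions; the toy labels below are made up for illustration. The 2×2 layout shown matches scikit-learn's convention (rows = actual class, columns = predicted class):

```python
# Hypothetical binary labels (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

# Rows = actual, columns = predicted
print([[tn, fp], [fn, tp]])  # [[3, 1], [1, 3]]
```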

A set of input variables and corresponding target values used to evaluate a model's performance is referred to as a _________ set.

  • evaluation
  • testing
  • training
  • validation
A "testing" set consists of input variables and corresponding target values held out to assess a trained model's performance on unseen data. Unlike the validation set, which guides model tuning during development, the testing set is reserved for the final evaluation.
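A common way to carve out such a set is a random split, sketched here with scikit-learn's `train_test_split` on a toy dataset (the sizes and seed are arbitrary):

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]
y = [0, 1] * 5

# Hold out 30% of the data as the testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
print(len(X_train), len(X_test))  # 7 3
```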

The assumption that the relationship between the independent and dependent variable is linear in Simple Linear Regression is called the assumption of _________.

  • Homoscedasticity
  • Independence
  • Linearity
  • Normality
The assumption of linearity ensures that the relationship between the independent and dependent variable is linear, which is fundamental to Simple Linear Regression.

What type of learning algorithm utilizes labeled data to make predictions?

  • Reinforcement Learning
  • Semi-supervised Learning
  • Supervised Learning
  • Unsupervised Learning
Supervised Learning uses labeled data, where the output is known, to train the algorithm and make predictions.

The slope coefficient in Simple Linear Regression gives the _________ change in the dependent variable for a one-unit change in the independent variable.

  • Absolute
  • Constant
  • Incremental
  • Marginal
The slope coefficient gives the marginal change: the amount by which the predicted value of the dependent variable changes for each one-unit increase in the independent variable, holding everything else fixed.
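The least-squares slope and intercept can be computed by hand; the toy data below is constructed so that y = 2x + 1 exactly, making the marginal interpretation easy to check:

```python
# Toy data generated from y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares estimates: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
     sum((x - mean_x) ** 2 for x in xs)
b0 = mean_y - b1 * mean_x
print(b1, b0)  # 2.0 1.0

# A one-unit change in x shifts the prediction by exactly b1
pred_at_3 = b0 + b1 * 3
pred_at_4 = b0 + b1 * 4
print(pred_at_4 - pred_at_3)  # 2.0
```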

What is Accuracy in the context of classification metrics?

  • False Positives / Total predictions
  • Total correct predictions / Total predictions
  • True Negatives / (True Negatives + False Positives)
  • True Positives / (True Positives + False Negatives)
Accuracy is the ratio of correct predictions to the total number of predictions. It gives an overall measure of how well the model is performing, but may not be suitable for imbalanced datasets where one class dominates.
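Both the definition and the imbalanced-data caveat are easy to see on made-up labels. In the second block, a model that always predicts the majority class still scores 90% accuracy while never detecting the minority class:

```python
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy = correct predictions / total predictions
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)
print(accuracy)  # 0.75

# Caveat: on an imbalanced set, always predicting the majority
# class looks "accurate" despite being useless for the minority class
y_true_imb = [0] * 9 + [1]
y_pred_imb = [0] * 10
acc_imb = sum(t == p for t, p in zip(y_true_imb, y_pred_imb)) / 10
print(acc_imb)  # 0.9
```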

You are working on a dataset with an imbalanced class distribution. How would you apply Cross-Validation to ensure that each fold maintains the same class distribution?

  • Applying Cross-Validation without folding
  • Using Leave-One-Out Cross-Validation
  • Using k-fold Cross-Validation with random sampling
  • Using stratified k-fold Cross-Validation
Stratified k-fold Cross-Validation preserves the class distribution in every fold, so each fold contains the same proportion of each class as the full dataset. This makes it the right choice for imbalanced data, since every fold remains a representative sample of the overall class proportions.
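A quick sketch with scikit-learn's `StratifiedKFold` on a synthetic 80/20 imbalanced dataset shows the guarantee in action: every test fold reproduces the 80/20 class ratio.

```python
from collections import Counter
from sklearn.model_selection import StratifiedKFold

# Imbalanced labels: 80% class 0, 20% class 1
y = [0] * 80 + [1] * 20
X = [[i] for i in range(100)]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = []
for _, test_idx in skf.split(X, y):
    fold_counts.append(Counter(y[i] for i in test_idx))
print(fold_counts)  # each fold: 16 of class 0, 4 of class 1
```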