You have a dataset with many correlated features, and you decide to use PCA. How would you determine which eigenvectors to keep?
- By choosing the eigenvectors with the highest eigenvalues
- By randomly selecting eigenvectors
- By selecting the eigenvectors with negative eigenvalues
- By using all eigenvectors without exception
You would keep the eigenvectors corresponding to the highest eigenvalues, as they explain the most variance in the data. The lower the eigenvalue, the less significant the corresponding eigenvector.
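A minimal sketch of this selection rule, using synthetic correlated data and scikit-learn's PCA (whose `explained_variance_` attribute holds the covariance eigenvalues, largest first):

```python
# Illustrative sketch: keep the components whose eigenvalues (explained
# variance) cover most of the data's variance. Data here is synthetic.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Three strongly correlated features built from one underlying signal
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)])

pca = PCA().fit(X)
print(pca.explained_variance_)  # eigenvalues, sorted descending

# Keep the smallest number of components explaining >= 95% of variance
k = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.95)) + 1
print(k)  # the first component dominates here, so k == 1
```

The 95% threshold is a common convention, not a rule; a scree plot of the eigenvalues supports the same decision visually.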
While estimating the coefficients in Simple Linear Regression, you find that one of the assumptions is not met. How would this affect the reliability of the predictions?
- Increase Accuracy
- Make Predictions More Reliable
- Make Predictions Unreliable
- No Effect
If an assumption of Simple Linear Regression is violated (e.g., linearity, independence, homoscedasticity, or normality of the errors), the reliability of the predictions is compromised: coefficient estimates may become biased or inefficient, and standard errors and confidence intervals may no longer be valid.
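A small sketch of one such violation, on synthetic data: fitting a straight line to a truly quadratic relationship breaks the linearity assumption, and the residuals reveal it by showing systematic structure instead of random scatter.

```python
# Sketch: when linearity fails, residuals are patterned (U-shaped here)
# and predictions are systematically biased. Data is synthetic.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = x ** 2 + rng.normal(scale=2.0, size=x.size)  # truly quadratic

# Ordinary least squares fit of a straight line
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Residuals are positive at both ends and negative in the middle:
# a classic sign that the model is mis-specified
print(residuals[:10].mean() > 0,
      residuals[45:55].mean() < 0,
      residuals[-10:].mean() > 0)
```

Plotting residuals against fitted values is the usual first diagnostic for this kind of violation.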
What are the main challenges in training a Machine Learning model with imbalanced datasets?
- Computational complexity
- Dimensionality reduction
- Lack of suitable algorithms
- Overfitting to the majority class
Training on imbalanced datasets can lead to models that are biased towards the majority class, since they have seen more examples of it. This can make the model perform poorly on the minority class.
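The majority-class trap can be shown in a few lines on made-up labels: a "model" that always predicts the majority class scores high accuracy while completely missing the minority class.

```python
# Sketch: accuracy is misleading on imbalanced data. Labels are synthetic.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 95 + [1] * 5)  # 95% majority class
y_pred = np.zeros_like(y_true)         # always predict the majority

acc = accuracy_score(y_true, y_pred)   # looks great
rec = recall_score(y_true, y_pred)     # minority class entirely missed
print(acc, rec)
```

This is why metrics like recall, F1, or AUC, along with techniques such as class weighting or resampling, matter on imbalanced data.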
What is Bootstrapping, and how does it differ from Cross-Validation?
- A method for resampling data with replacement
- A technique for training ensemble models
- A technique to reduce bias
- A type of Cross-Validation
Bootstrapping is a method for resampling data with replacement, used to estimate statistics about a population from a sample. It differs from Cross-Validation, where data is split without replacement to validate the model. Bootstrapping is more about estimating the properties of an estimator, while Cross-Validation assesses the model's performance.
You built a regression model and it's yielding a very low R-Squared value. What could be the reason and how would you improve it?
- Data noise; Apply data cleaning
- Incorrect model; Change the model
- Poorly fitted; Improve the model fit
- Too many features; Reduce features
A low R-Squared value might indicate that the model doesn't fit the data well. This could be due to an incorrect choice of model, underfitting, or other issues. Improving the model fit by selecting an appropriate algorithm, feature engineering, or hyperparameter tuning can address this problem.
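One way to see this concretely, on synthetic data: an underfit straight-line model yields a near-zero R-Squared, and adding the missing quadratic term (a simple feature-engineering fix) raises it sharply.

```python
# Sketch: low R^2 from underfitting, improved by a better-specified model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=(200, 1))
y = (x ** 2).ravel() + rng.normal(scale=0.5, size=200)

linear = LinearRegression().fit(x, y)
r2_linear = r2_score(y, linear.predict(x))      # poor fit: near zero

X_quad = np.hstack([x, x ** 2])                  # add the missing term
quad = LinearRegression().fit(X_quad, y)
r2_quad = r2_score(y, quad.predict(X_quad))      # much better fit

print(round(r2_linear, 2), round(r2_quad, 2))
```

The same diagnosis-then-fix loop applies with noisier real data, where cleaning and hyperparameter tuning enter as well.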
Which field utilizes Machine Learning to recommend products or media to consumers based on their past behavior?
- Autonomous Driving
- Education
- Healthcare
- Recommender Systems
Recommender Systems use machine learning algorithms to suggest products, media, or content to users based on their past interactions and behavior, creating personalized experiences.
The ___________ test in Logistic Regression can be used to assess if the Logit link function is the correct specification for the model.
- AIC
- Hosmer-Lemeshow
- Likelihood-ratio
- Link
The Link test can be used to assess whether the Logit link function is correctly specified: the model is refit on the linear predictor and its square, and a statistically significant squared term suggests the link is mis-specified.
Machine Learning is commonly used in ____________ to create personalized recommendations.
- Drug Development
- Recommender Systems
- Traffic Management
- Weather Prediction
Machine Learning is extensively used in Recommender Systems to create personalized recommendations, analyzing user behavior and preferences.
In reinforcement learning, the agent learns to take actions that maximize the cumulative __________.
- accuracy
- errors
- loss
- rewards
In reinforcement learning, the agent tries to maximize cumulative rewards through its actions.
Describe a scenario where Hierarchical Clustering would be more beneficial than K-Means Clustering, and explain the considerations in choosing the linkage method.
- When a fixed number of clusters is required
- When clusters are uniformly distributed
- When clusters have varying sizes and non-spherical shapes
- When computational efficiency is the priority
Hierarchical Clustering is more beneficial than K-Means when clusters have varying sizes and non-spherical shapes. Unlike K-Means, Hierarchical Clustering does not assume spherical clusters and can handle complex structures. The choice of linkage method will depend on the specific characteristics of the clusters, with considerations like distance metric and desired cluster shape guiding the selection.
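A small sketch using SciPy's hierarchical clustering on synthetic clusters of unequal size; the linkage method is an explicit parameter, and cutting the dendrogram recovers the clusters without fixing k up front the way K-Means does.

```python
# Sketch: agglomerative clustering on clusters of unequal size.
# 'ward' favors compact clusters; 'single' would instead chain through
# nearby points -- the linkage choice should match the cluster shape.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
big = rng.normal(loc=0.0, scale=0.5, size=(80, 2))    # large cluster
small = rng.normal(loc=5.0, scale=0.5, size=(10, 2))  # small cluster
X = np.vstack([big, small])

Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram
print(len(set(labels)))  # both clusters recovered despite unequal sizes
```

Swapping `method="ward"` for `"single"`, `"complete"`, or `"average"` (and changing the distance metric) is how those linkage considerations play out in code.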
A dataset contains both categorical and numerical features. Which ensemble method might be suitable, and what preprocessing might be required?
- Random Forest with no preprocessing
- Random Forest with normalization
- Random Forest with one-hot encoding
- Random Forest with scaling
Random Forest is an ensemble method suitable for handling both categorical and numerical features. For categorical features, one-hot encoding might be required to convert them into a numerical format that the algorithm can process.
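A minimal sketch of that pipeline with scikit-learn; the column names and toy data are made up for illustration:

```python
# Sketch: one-hot encode the categorical column, pass the numerical
# column through, then fit a Random Forest. Data is synthetic.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "green"] * 10,
    "size_cm": [1.0, 2.5, 1.2, 3.1, 2.4, 3.0] * 10,
    "label": [0, 1, 0, 1, 1, 1] * 10,
})

pre = ColumnTransformer(
    [("onehot", OneHotEncoder(), ["color"])],
    remainder="passthrough",  # tree models don't need the numeric scaled
)
model = Pipeline([("pre", pre),
                  ("rf", RandomForestClassifier(random_state=0))])
model.fit(df[["color", "size_cm"]], df["label"])
preds = model.predict(df[["color", "size_cm"]].head(2))
print(preds.tolist())
```

Note that no normalization or scaling is needed: tree-based ensembles split on thresholds, so they are insensitive to monotonic rescaling of numerical features.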
You are using KNN for a regression problem. What are the special considerations in selecting K and the distance metric, and how would you evaluate the model's performance?
- Choose K and metric considering data characteristics, evaluate using regression metrics
- Choose fixed K and Manhattan metric, evaluate using recall
- Choose large K and any metric, evaluate using accuracy
- Choose small K and Euclidean metric, evaluate using precision
Selecting K and the distance metric based on the data's characteristics (e.g., smaller K to capture local structure, and feature scaling before computing distances), and evaluating the model with regression metrics such as RMSE or MAE, is the right approach for KNN in regression.
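A sketch of that workflow on synthetic data, using cross-validated grid search over K and the metric, then RMSE on a held-out set:

```python
# Sketch: choose K and the distance metric by cross-validation,
# evaluate KNN regression with RMSE. Data is synthetic.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Search K and the distance metric on the training folds only
grid = GridSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": [1, 3, 5, 9, 15],
     "metric": ["euclidean", "manhattan"]},
    scoring="neg_root_mean_squared_error",
    cv=5,
).fit(X_tr, y_tr)

rmse = mean_squared_error(y_te, grid.predict(X_te)) ** 0.5
print(grid.best_params_, round(rmse, 2))
```

With a single feature, scaling is moot here; with mixed-scale features, standardizing before the distance computation would be an essential extra step.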