The IQR method defines an outlier as any value below Q1 - _______ or above Q3 + _______.
- 1.5*IQR
- 2*IQR
- 2.5*IQR
- 3*IQR
In the IQR method, an outlier is any value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, where IQR = Q3 - Q1.
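A minimal sketch of the rule, using only the Python standard library. The function name `iqr_outliers`, the sample data, and the choice of the "inclusive" quartile convention are assumptions made for illustration; other quartile conventions give slightly different fences.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Assumption: "inclusive" quartiles (the data is treated as the full population).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample: Q1 = 12, Q3 = 16, so the fences are 6 and 22.
data = [10, 12, 12, 13, 14, 15, 16, 18, 95]
print(iqr_outliers(data))  # [95]
```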
You are analyzing a dataset where some missing values have been replaced using mean imputation. What effect might this have on the variance of the data?
- It could cause overfitting
- It could create multicollinearity
- It could decrease the variance
- It could increase the variance
When missing values are replaced using mean imputation, the variance of the data tends to decrease. Every imputed value sits exactly at the mean of the observed values, so it adds an observation without adding any spread. The overall variability is therefore underestimated, which can bias standard errors and other estimates that depend on it.
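The effect can be checked on a toy example. The numbers below are made up for illustration, and `n_missing` is a hypothetical count of missing entries.

```python
import statistics

observed = [2.0, 4.0, 6.0, 8.0]   # values that were actually recorded
n_missing = 2                      # hypothetical number of missing entries

mean = statistics.fmean(observed)
imputed = observed + [mean] * n_missing   # fill every gap with the observed mean

var_observed = statistics.variance(observed)  # sample variance of observed values only
var_imputed = statistics.variance(imputed)    # sample variance after mean imputation

print(var_observed, var_imputed)
```

The mean is unchanged by the imputation, but the sample variance of the filled-in data is smaller than that of the observed values alone.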
Can feature selection improve the computational efficiency of a machine learning model?
- Depends on the dataset
- Depends on the model
- No
- Yes
Yes, feature selection can improve the computational efficiency of a machine learning model by reducing the number of features it needs to process.
When applying multiple imputation, increasing the number of imputations can help reduce the ____________.
- Mean
- Mode
- Sampling error
- Standard deviation
When applying multiple imputation, increasing the number of imputations can help reduce the sampling error. More imputations allow for a better representation of the uncertainty due to missingness, resulting in more accurate standard errors and confidence intervals.
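One way to see this is through the usual pooling rule for multiple imputation (Rubin's rules). The symbols are not from the question itself and are introduced here as assumptions: m is the number of imputations, \bar{U} the average within-imputation variance, and B the between-imputation variance of the estimate.

```latex
T = \bar{U} + \left(1 + \frac{1}{m}\right) B
```

The factor 1/m reflects the extra error from using only a finite number of imputations, and it shrinks as m grows.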
EDA generally precedes ________ in the data analysis process.
- Confirmatory Data Analysis
- Data cleaning
- Data collection
- Predictive Modeling
EDA generally precedes Confirmatory Data Analysis (CDA) in the data analysis process. While EDA is all about exploring data to find patterns and relationships, CDA is about confirming or falsifying existing hypotheses.
Suppose you have a dataset with 7 variables, and you want to quickly examine the relationships among all variables. Which type of plot would you choose and why?
- Correlation Matrix
- Histogram
- Pairplot
- Scatter Plot
In this scenario, a pairplot would be the best choice because it shows all pairwise relationships between the variables in a single view. It is an excellent tool for quickly visualizing and understanding the relationships among multiple variables at once.
What is the general threshold value of VIF above which multicollinearity is generally assumed to be high?
- 10
- 15
- 2
- 5
While the threshold can vary based on the context, a common rule of thumb is that if VIF is greater than 10, multicollinearity is high, indicating that the predictors are highly correlated. This could pose problems in a regression analysis and might need to be addressed.
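For predictor j, VIF is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The sketch below covers only the special case of two predictors, where R_j^2 equals the squared Pearson correlation between them. The function names and data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two of them."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)

x1 = [1, 2, 3, 4, 5, 6]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]   # nearly a copy of x1
x3 = [3, 1, 4, 1, 5, 9]               # only moderately related to x1

print(vif_two_predictors(x1, x2))  # far above the rule-of-thumb threshold of 10
print(vif_two_predictors(x1, x3))  # well below 10
```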
A correlation matrix is a type of _____ matrix, which measures the linear relationships between variables.
- diagonal
- identity
- scalar
- square
A correlation matrix is a type of square matrix that measures the linear relationships between variables. It provides a compact and comprehensive view of how different variables in a dataset are correlated.
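A small standard-library sketch makes the shape concrete: with k variables the matrix is k x k, symmetric, and has ones on the diagonal. The helper names and the three sample columns are assumptions for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Correlate every column with every column, giving a k x k matrix."""
    return [[pearson(a, b) for b in columns] for a in columns]

# Three hypothetical variables measured on five observations.
columns = [
    [1, 2, 3, 4, 5],
    [2, 4, 5, 4, 5],
    [9, 7, 6, 3, 1],
]
matrix = correlation_matrix(columns)
for row in matrix:
    print([round(v, 2) for v in row])
```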
Imagine you are using Lasso Regression in a highly multicollinear dataset. What effect might this choice of model have and why?
- It might ignore all correlated variables.
- It might lead to high bias.
- It might lead to overfitting.
- It might randomly select one variable from a group of correlated variables.
Lasso regression is a regularization method that can shrink some coefficients exactly to zero, effectively performing feature selection. In the presence of highly correlated variables, Lasso tends to keep one variable from each correlated group, chosen more or less arbitrarily, and shrink the others to zero.
What is the impact of positive skewness on data interpretation?
- It suggests that data is evenly distributed.
- It suggests that most values are clustered around the left tail.
- It suggests that most values are clustered around the right tail.
- It suggests the presence of numerous outliers in the left tail.
Positive skewness indicates that most of the data values are clustered toward the left (lower) end of the distribution, with a long tail extending toward larger values. The few large values in that tail pull the mean upward, so the mean is typically larger than the median.
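A quick check on a made-up right-skewed sample, using the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second). The function name and the data are assumptions for illustration.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical sample: most values are small, two large values form the right tail.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20, 35]

print(skewness(sample))                                    # positive
print(statistics.fmean(sample), statistics.median(sample))  # mean exceeds median
```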