EDA generally precedes ________ in the data analysis process.
- Confirmatory Data Analysis
- Data cleaning
- Data collection
- Predictive Modeling
EDA generally precedes Confirmatory Data Analysis (CDA) in the data analysis process. While EDA is all about exploring data to find patterns and relationships, CDA is about confirming or falsifying existing hypotheses.
Suppose you have a dataset with 7 variables, and you want to quickly examine the relationships among all variables. Which type of plot would you choose and why?
- Correlation Matrix
- Histogram
- Pairplot
- Scatter Plot
In this scenario, a pairplot would be the best choice because it draws a scatter plot for every pair of variables in a single grid, usually with each variable's own distribution on the diagonal. A correlation matrix also covers every pair, but it reduces each relationship to a single number and can hide non-linear patterns that a pairplot makes visible.
What is the general threshold value of VIF above which multicollinearity is generally assumed to be high?
- 10
- 15
- 2
- 5
While the threshold varies with context, a common rule of thumb is that a VIF greater than 10 signals high multicollinearity, meaning the predictor is largely explained by the other predictors. This inflates the variance of the coefficient estimates in a regression analysis and might need to be addressed.
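As a rough illustration of where the number comes from, VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below assumes the simplest case of only two predictors, so the auxiliary regression is a simple linear one; the helper names and the toy data are made up for the example.

```python
def r_squared(target, predictor):
    """R^2 of a simple linear regression of target on predictor (with intercept)."""
    n = len(target)
    mean_x = sum(predictor) / n
    mean_y = sum(target) / n
    sxx = sum((x - mean_x) ** 2 for x in predictor)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, target))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(predictor, target))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    return 1 - ss_res / ss_tot


def vif(target, predictor):
    """VIF of `target` when `predictor` is the only other predictor (assumption)."""
    return 1 / (1 - r_squared(target, predictor))


# Hypothetical toy data
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]  # nearly a copy of x1
x3 = [1, -1, 0, 0, -1, 1]            # uncorrelated with x1

vif_correlated = vif(x1, x2)
vif_uncorrelated = vif(x1, x3)

print("VIF of x1 given x2:", round(vif_correlated, 1))
print("VIF of x1 given x3:", round(vif_uncorrelated, 1))
```

The nearly duplicated pair lands far above the rule-of-thumb cutoff of 10, while the uncorrelated pair sits at the minimum possible value of 1. With more than two predictors, the auxiliary regression would be a multiple regression instead.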
A correlation matrix is a type of _____ matrix, which measures the linear relationships between variables.
- diagonal
- identity
- scalar
- square
A correlation matrix is a type of square matrix that measures the linear relationships between variables. It provides a compact and comprehensive view of how different variables in a dataset are correlated.
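To make the shape concrete, here is a minimal sketch that builds a Pearson correlation matrix by hand for three made-up variables. The variable names and values are assumptions for illustration only. With k variables the result is k × k, symmetric, and has ones on the diagonal because every variable is perfectly correlated with itself.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical toy dataset with three variables
data = {
    "height": [150, 160, 170, 180, 190],
    "weight": [52, 58, 69, 77, 88],
    "sleep": [8, 6, 7, 9, 5],
}

names = list(data)
# One row and one column per variable, so the matrix is always square.
corr = [[pearson(data[r], data[c]) for c in names] for r in names]

for name, row in zip(names, corr):
    print(name.ljust(8), [round(value, 2) for value in row])
```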
Imagine you are using Lasso Regression in a highly multicollinear dataset. What effect might this choice of model have and why?
- It might ignore all correlated variables.
- It might lead to high bias.
- It might lead to overfitting.
- It might randomly select one variable from a group of correlated variables.
Lasso regression is a regularization method that can shrink some coefficients exactly to zero, effectively performing feature selection. In the presence of highly correlated variables, Lasso tends to keep one variable from the group, chosen more or less arbitrarily, and shrink the others to zero.
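The selection behaviour can be seen in a small hand-rolled coordinate-descent Lasso. This is a sketch for illustration, not the implementation used by any particular library. It assumes no intercept, the objective (1/2n)·||y - Xw||² + alpha·||w||₁, and an extreme case where two features are exact copies of each other. All names and data are hypothetical.

```python
def soft_threshold(value, alpha):
    """Shrink value toward zero by alpha; anything within [-alpha, alpha] becomes 0."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def lasso_coordinate_descent(columns, y, alpha, order, n_sweeps=50):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*||w||_1, updating features in `order`."""
    n = len(y)
    weights = [0.0] * len(columns)
    for _ in range(n_sweeps):
        for j in order:
            # Partial residual: y minus the contribution of every other feature.
            partial = [
                y[i] - sum(weights[k] * columns[k][i]
                           for k in range(len(columns)) if k != j)
                for i in range(n)
            ]
            rho = sum(columns[j][i] * partial[i] for i in range(n)) / n
            scale = sum(v * v for v in columns[j]) / n
            weights[j] = soft_threshold(rho, alpha) / scale
    return weights


x = [1.0, 2.0, 3.0, 4.0]
columns = [x, list(x)]       # two perfectly correlated (identical) features
y = [2.0 * v for v in x]

w_first = lasso_coordinate_descent(columns, y, alpha=1.5, order=[0, 1])
w_second = lasso_coordinate_descent(columns, y, alpha=1.5, order=[1, 0])

print("update order 0,1:", [round(w, 3) for w in w_first])
print("update order 1,0:", [round(w, 3) for w in w_second])
```

In this setup, whichever copy is updated first absorbs the whole coefficient and the other stays at zero, so the "winner" is decided by something incidental (here, the update order) rather than by the data.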
What is the impact of positive skewness on data interpretation?
- It suggests that data is evenly distributed.
- It suggests that most values are clustered around the left tail.
- It suggests that most values are clustered around the right tail.
- It suggests the presence of numerous outliers in the left tail.
Positive skewness indicates that most of the data values are concentrated at the lower (left) end of the distribution, with a long tail extending towards higher values. As a result, the mean is typically larger than the median.
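A quick way to see the effect on the mean and median is a small right-skewed sample. The values below are invented for illustration: most sit between 1 and 4, with a few stretching out to the right.

```python
from statistics import mean, median

# Hypothetical right-skewed sample: bulk of values on the left, long right tail
values = [1, 1, 2, 2, 2, 3, 3, 4, 6, 9, 15]

data_mean = mean(values)
data_median = median(values)

print("mean:  ", round(data_mean, 2))
print("median:", data_median)
print("mean > median:", data_mean > data_median)
```

The few large values pull the mean upward, while the median stays with the bulk of the data.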
You are analyzing a dataset where some missing values have been replaced using mean imputation. What effect might this have on the variance of the data?
- It could cause overfitting
- It could create multicollinearity
- It could decrease the variance
- It could increase the variance
When missing values are replaced using mean imputation, it could decrease the variance of the data. This is because every imputed value sits exactly at the mean of the observed values, so it adds observations without adding any spread. The overall variability of the data is therefore underestimated, which can lead to biased estimates such as standard errors that are too small.
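The effect is easy to check on a toy column. The sketch below assumes missing entries are marked with `None` and uses the sample variance from Python's `statistics` module. The data is made up.

```python
from statistics import mean, variance

# Hypothetical column with two missing entries
raw = [2.0, 4.0, None, 6.0, 8.0, None, 10.0]

observed = [v for v in raw if v is not None]
fill_value = mean(observed)

# Mean imputation: every gap gets the mean of the observed values.
imputed = [fill_value if v is None else v for v in raw]

var_observed = variance(observed)
var_imputed = variance(imputed)

print("variance of observed values:", var_observed)
print("variance after imputation:  ", round(var_imputed, 3))
```

The sum of squared deviations is unchanged, because the filled-in values contribute zero, but it is now divided by a larger sample size, so the variance drops.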
Can feature selection improve the computational efficiency of a machine learning model?
- Depends on the dataset
- Depends on the model
- No
- Yes
Yes, feature selection can improve the computational efficiency of a machine learning model by reducing the number of features it needs to process, which lowers training time, prediction time, and memory use.
In a Normal Distribution, approximately 95% of the data falls within _____ standard deviations of the mean.
- 1
- 2
- 3
- 4
In a Normal Distribution, approximately 95% of the data falls within 2 standard deviations of the mean. The exact multiplier for 95% is about 1.96; a full 2 standard deviations cover roughly 95.4%.
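The figure can be verified from the normal CDF: for a normal variable, the probability of falling within k standard deviations of the mean is erf(k / √2). The helper name below is made up for the example.

```python
from math import erf, sqrt


def share_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```

This reproduces the familiar 68-95-99.7 rule for 1, 2, and 3 standard deviations.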
How can extreme outliers impact the interpretation of the skewness of a dataset?
- Can either increase or decrease the skewness
- Decrease the skewness
- Does not affect the skewness
- Increase the skewness
The skewness of a distribution is a measure of the extent and direction of asymmetry. Extreme outliers can either increase or decrease skewness depending on which tail they lie in. Outliers far above the mean stretch the right tail and push skewness in the positive direction. Outliers far below the mean stretch the left tail and push it in the negative direction.
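A small experiment shows both directions. The sketch below uses the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second), which is one common definition among several. The sample and the outlier values are invented.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using population (divide-by-n) moments."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical sample that is symmetric around 6
base = [4, 5, 5, 6, 6, 6, 7, 7, 8]
with_high_outlier = base + [30]   # one extreme value in the right tail
with_low_outlier = base + [-18]   # one extreme value in the left tail

skew_base = skewness(base)
skew_high = skewness(with_high_outlier)
skew_low = skewness(with_low_outlier)

print("symmetric sample: ", round(skew_base, 3))
print("with high outlier:", round(skew_high, 3))
print("with low outlier: ", round(skew_low, 3))
```

A single extreme point is enough to flip a symmetric sample to clearly positive or clearly negative skewness, which is why it is worth inspecting outliers before interpreting the statistic.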