EDA generally precedes ________ in the data analysis process.

Confirmatory Data Analysis
Data cleaning
Data collection
Predictive Modeling

Exploratory Data Analysis (EDA) generally precedes Confirmatory Data Analysis (CDA) in the data analysis process. EDA explores the data to find patterns and relationships, while CDA tests whether existing hypotheses hold up or are falsified.

Suppose you have a dataset with 7 variables, and you want to quickly examine the relationships among all variables. Which type of plot would you choose and why?

Correlation Matrix
Histogram
Pairplot
Scatter Plot

In this scenario, a pairplot would be the best choice because it shows all pairwise relationships between the variables in a single view. It is an excellent tool for quickly visualizing and understanding the relationships among multiple variables at once.
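
To get a feel for how much a pairplot packs into one view, the short sketch below counts the panels for 7 variables. The column names are hypothetical placeholders, and the count assumes the usual layout of one panel per ordered pair plus a diagonal.

```python
from itertools import combinations

# Hypothetical column names standing in for the 7 variables.
columns = [f"var_{i}" for i in range(1, 8)]

# Each unordered pair of distinct variables corresponds to one relationship
# (typically drawn twice in the grid, mirrored across the diagonal).
pairs = list(combinations(columns, 2))

grid_cells = len(columns) ** 2   # full 7 x 7 grid, including the diagonal
unique_pairs = len(pairs)        # distinct variable pairs

print(grid_cells, unique_pairs)
```

Inspecting each of these pairs with a separate scatter plot would mean drawing them one at a time, which is why a single pairwise grid is the quicker option here.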

What is the commonly used threshold value of the Variance Inflation Factor (VIF) above which multicollinearity is assumed to be high?

10
15
2
5

While the threshold can vary based on the context, a common rule of thumb is that if VIF is greater than 10, multicollinearity is high, indicating that the predictors are highly correlated. This could pose problems in a regression analysis and might need to be addressed.
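
As a rough illustration of the rule of thumb, the sketch below computes VIF for the simplest case of exactly two predictors, where VIF reduces to 1 / (1 - r^2) with r the Pearson correlation between them. The data values and function names are made up for illustration; with more predictors, each VIF would come from regressing one predictor on all the others.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)

# Illustrative data (assumed, not from any real dataset).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2_moderate = [2.0, 1.0, 4.0, 3.0, 5.0]   # moderately correlated with x1
x2_strong = [1.1, 1.9, 3.2, 3.9, 5.1]     # almost a copy of x1

vif_moderate = vif_two_predictors(x1, x2_moderate)
vif_strong = vif_two_predictors(x1, x2_strong)

print(round(vif_moderate, 2), vif_strong > 10)
```

In this toy data only the near-duplicate predictor crosses the threshold of 10.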

A correlation matrix is a type of _____ matrix, which measures the linear relationships between variables.

diagonal
identity
scalar
square

A correlation matrix is a type of square matrix that measures the linear relationships between variables. It provides a compact and comprehensive view of how different variables in a dataset are correlated.
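
The square shape follows from correlating every variable with every other variable. A minimal sketch, using made-up data and a hand-written Pearson helper rather than any particular library:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Illustrative variables (assumed values).
data = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 1.0, 4.0, 3.0, 5.0],
    "c": [5.0, 4.0, 3.0, 2.0, 1.0],
}
names = list(data)

# One row and one column per variable, so the result is always n x n.
corr = [[pearson(data[r], data[c]) for c in names] for r in names]

for name, row in zip(names, corr):
    print(name, [round(v, 2) for v in row])
```

Besides being square, the matrix is symmetric and has ones on its diagonal, since each variable is perfectly correlated with itself.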

Imagine you are using Lasso Regression in a highly multicollinear dataset. What effect might this choice of model have and why?

It might ignore all correlated variables.
It might lead to high bias.
It might lead to overfitting.
It might randomly select one variable from a group of correlated variables.

Lasso regression is a regularization method that can shrink some coefficients exactly to zero, effectively performing feature selection. In the presence of highly correlated variables, Lasso tends to keep one variable from a group of correlated variables and shrink the others to zero. Which one it keeps is largely arbitrary, so the selection can look random.
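
The arbitrary nature of that choice can be seen in a small sketch. It assumes the objective (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1, solved by cyclic coordinate descent with soft-thresholding, and uses two identical predictors as an extreme case of multicollinearity. The function names, the `order` parameter, and the data are illustrative choices, not a reference implementation.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, order, n_iter=100):
    """Cyclic coordinate descent for (1/(2n))*||y - Xw||^2 + lam*||w||_1."""
    n, p = len(y), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in order:
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(w[k] * X[i][k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n)
    return w

# Two identical columns: the most extreme form of correlated predictors.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
X = [[v, v] for v in x]
y = [2.0 * v for v in x]

w_first = lasso_coordinate_descent(X, y, lam=0.1, order=(0, 1))
w_second = lasso_coordinate_descent(X, y, lam=0.1, order=(1, 0))

print([round(v, 3) for v in w_first])
print([round(v, 3) for v in w_second])
```

Here the only thing that changes between the two runs is the order in which coefficients are updated, yet a different predictor ends up with the nonzero coefficient each time.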

What is the impact of positive skewness on data interpretation?

It suggests that data is evenly distributed.
It suggests that most values are clustered around the left tail.
It suggests that most values are clustered around the right tail.
It suggests the presence of numerous outliers in the left tail.

Positive skewness indicates that most of the data values are clustered on the left (lower) side of the distribution, with a long tail extending towards larger values. Because the few large values pull the mean upward, the mean is typically larger than the median.
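
A minimal sketch of this effect, using a made-up sample and the moment-based skewness coefficient m3 / m2^1.5 (one of several common definitions):

```python
import statistics

def skewness(values):
    """Moment-based skewness: third central moment over variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Illustrative sample: most values are small, one value sits far to the right.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]

skew = skewness(right_skewed)
mean_value = statistics.mean(right_skewed)
median_value = statistics.median(right_skewed)

print(skew > 0, mean_value, median_value)
```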

You are analyzing a dataset where some missing values have been replaced using mean imputation. What effect might this have on the variance of the data?

It could cause overfitting
It could create multicollinearity
It could decrease the variance
It could increase the variance

When missing values are replaced using mean imputation, the variance of the data can decrease. Every imputed value equals the mean of the observed values, so it adds observations without adding any spread around that mean. As a result, the overall variability of the data is underestimated, which can lead to biased estimates.
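
The shrinkage is easy to check numerically. The sketch below uses a made-up set of four observed values and assumes two further values were missing and filled with the observed mean:

```python
import statistics

# Illustrative observed values; two more entries are assumed to be missing.
observed = [2.0, 4.0, 6.0, 8.0]
n_missing = 2

# Mean imputation: every missing entry gets the mean of the observed values.
fill_value = statistics.mean(observed)
imputed = observed + [fill_value] * n_missing

var_observed = statistics.variance(observed)  # sample variance, observed only
var_imputed = statistics.variance(imputed)    # sample variance after imputation

print(var_observed, var_imputed)
```

The sum of squared deviations is unchanged by the imputed values, but it is now divided by a larger sample size, so the variance estimate drops.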

Can feature selection improve the computational efficiency of a machine learning model?

Depends on the dataset
Depends on the model
No
Yes

Yes, feature selection can improve the computational efficiency of a machine learning model by reducing the number of features it needs to process.

In a Normal Distribution, approximately 95% of the data falls within _____ standard deviations of the mean.

1
2
3
4

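In a Normal Distribution, approximately 95% of the data (more precisely, about 95.45%) falls within 2 standard deviations of the mean. By comparison, about 68% falls within 1 standard deviation and about 99.7% within 3.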
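
These coverage figures can be checked with the standard library's `statistics.NormalDist`. The helper name `coverage` is just an illustrative choice:

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(k):
    """Probability mass within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

coverages = {k: coverage(k) for k in (1, 2, 3)}

for k, p in coverages.items():
    print(k, round(p, 4))
```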

How can extreme outliers impact the interpretation of the skewness of a dataset?

Can either increase or decrease the skewness
Decrease the skewness
Does not affect the skewness
Increase the skewness

The skewness of a distribution measures the extent and direction of its asymmetry. Extreme outliers can either increase or decrease skewness depending on which tail they lie in. Outliers far above the mean pull skewness in the positive direction, while outliers far below the mean pull it in the negative direction.
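
A small sketch of both directions, starting from a symmetric made-up sample and adding a single extreme value on either side. It uses the moment-based skewness coefficient m3 / m2^1.5, which is one common definition among several:

```python
def skewness(values):
    """Moment-based skewness: third central moment over variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Symmetric illustrative sample, so its skewness is zero.
base = [1, 2, 3, 4, 5]

skew_base = skewness(base)
skew_high = skewness(base + [50])    # one extreme value in the right tail
skew_low = skewness(base + [-50])    # one extreme value in the left tail

print(skew_base, skew_high > 0, skew_low < 0)
```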
