Which type of missing data could potentially introduce the most bias into an analysis if not properly addressed?

All can introduce equal bias
MAR
MCAR
NMAR

NMAR (not missing at random) data can introduce the most bias if not properly addressed, because the probability that a value is missing depends on the unobserved value itself. The observed data alone therefore cannot correct for the missingness, which makes NMAR the hardest case to handle.
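
The contrast with MCAR can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "income" values, the 30% missing rate, and the cutoff of 55 are invented for illustration and do not come from any real dataset.

```python
import random
import statistics

random.seed(0)

# Hypothetical "income" values; distribution and cutoffs are illustrative assumptions.
incomes = [random.gauss(50, 10) for _ in range(10_000)]

# MCAR: every value has the same 30% chance of being missing.
mcar_observed = [x for x in incomes if random.random() > 0.3]

# NMAR: the chance of being missing depends on the value itself
# (in this sketch, higher incomes are more likely to go unreported).
def missing_probability(value):
    return 0.6 if value > 55 else 0.1

nmar_observed = [x for x in incomes if random.random() > missing_probability(x)]

true_mean = statistics.mean(incomes)
mcar_bias = statistics.mean(mcar_observed) - true_mean
nmar_bias = statistics.mean(nmar_observed) - true_mean

print(f"MCAR bias in the mean: {mcar_bias:+.2f}")
print(f"NMAR bias in the mean: {nmar_bias:+.2f}")
```

Under these assumptions the MCAR mean stays close to the true mean, while the NMAR mean is pulled downward because large values are preferentially lost.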

What is the impact of positive skewness on data interpretation?

It suggests that data is evenly distributed.
It suggests that most values are clustered at the lower (left) end of the distribution.
It suggests that most values are clustered at the upper (right) end of the distribution.
It suggests the presence of numerous outliers in the left tail.

Positive skewness indicates that most values are clustered at the lower (left) end of the distribution, with a long tail extending toward larger values. The few large values in that tail pull the mean upward, so the mean is typically larger than the median.
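
A quick numerical check of the mean and median relationship, using an exponential sample as an assumed example of a right-skewed distribution. The helper `sample_skewness` is a hypothetical name for a simple moment-based estimate.

```python
import random
import statistics

random.seed(1)

# Exponential data is a standard example of a positively skewed distribution.
sample = [random.expovariate(1.0) for _ in range(5_000)]

mean_value = statistics.mean(sample)
median_value = statistics.median(sample)

def sample_skewness(values):
    """Moment-based skewness: the average cubed z-score (population formula)."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)

skew = sample_skewness(sample)

print(f"mean={mean_value:.3f} median={median_value:.3f} skewness={skew:.3f}")
```

For this sample the skewness is positive and the mean sits above the median, matching the explanation above.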

You are analyzing a dataset where some missing values have been replaced using mean imputation. What effect might this have on the variance of the data?

It could cause overfitting
It could create multicollinearity
It could decrease the variance
It could increase the variance

Mean imputation decreases the variance of the data. Every imputed value equals the mean of the observed values, so it increases the sample size without adding any spread. The variability of the data is therefore underestimated, which can bias downstream quantities such as standard errors and correlations.
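
The shrinkage can be seen directly in a small sketch. The data and the 30% missing rate below are assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(2)

# Hypothetical complete data, then roughly 30% knocked out completely at random.
values = [random.gauss(100, 15) for _ in range(1_000)]
data = [v if random.random() > 0.3 else None for v in values]

observed = [v for v in data if v is not None]
fill_value = statistics.mean(observed)

# Mean imputation: every missing entry is replaced by the observed mean.
imputed = [fill_value if v is None else v for v in data]

var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)

print(f"variance of observed values:  {var_observed:.1f}")
print(f"variance after mean imputing: {var_imputed:.1f}")
```

The imputed points contribute zero squared deviation while still counting toward the sample size, so the sample variance shrinks by the factor (n_observed - 1) / (n_total - 1).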

Can feature selection improve the computational efficiency of a machine learning model?

Depends on the dataset
Depends on the model
No
Yes

Yes. Feature selection reduces the number of features the model has to process, which typically lowers training time, prediction time, and memory use.

When applying multiple imputation, increasing the number of imputations can help reduce the ____________.

Mean
Mode
Sampling error
Standard deviation

When applying multiple imputation, increasing the number of imputations helps reduce the sampling error, here meaning the extra simulation variability that comes from using only a finite number of imputed datasets. More imputations represent the uncertainty due to missingness more faithfully, which gives more stable standard errors and confidence intervals.
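
One way to see the role of the number of imputations m is through Rubin's rules, the standard pooling formulas for multiple imputation, where the total variance is W + (1 + 1/m) B (W is the average within-imputation variance, B the between-imputation variance). The sketch below uses made-up estimates and variances; `pool_rubin` is a hypothetical helper name.

```python
import statistics

def pool_rubin(estimates, variances):
    """Pool per-imputation estimates with Rubin's rules.

    estimates: point estimate from each imputed dataset
    variances: squared standard error from each imputed dataset
    """
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)       # W
    between = statistics.variance(estimates)  # B (sample variance across imputations)
    total = within + (1 + 1 / m) * between    # T = W + (1 + 1/m) B
    return pooled, total

# Hypothetical results from m = 3 imputed datasets.
pooled_estimate, total_variance = pool_rubin([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])

def simulation_penalty(between, m):
    """The B/m term: extra variance caused by using only m imputations."""
    return between / m

print(pooled_estimate, round(total_variance, 4))
print(simulation_penalty(1.0, 5), simulation_penalty(1.0, 50))
```

For a fixed between-imputation variance, the B/m term shrinks as m grows, which is the sense in which more imputations reduce this source of error.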

EDA generally precedes ________ in the data analysis process.

Confirmatory Data Analysis
Data cleaning
Data collection
Predictive Modeling

EDA generally precedes Confirmatory Data Analysis (CDA) in the data analysis process. EDA explores the data to find patterns and relationships and to suggest hypotheses; CDA then tests those hypotheses, confirming or falsifying them.

Suppose you have a dataset with 7 variables, and you want to quickly examine the relationships among all variables. Which type of plot would you choose and why?

Correlation Matrix
Histogram
Pairplot
Scatter Plot

In this scenario, a pairplot would be the best choice because it shows all pairwise relationships between the variables in a single view. It is an excellent tool for quickly visualizing and understanding the relationships among multiple variables at once.
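
To get a feel for how much a pair plot packs into one view, the sketch below enumerates the variable pairs for 7 variables. The variable names are hypothetical placeholders, and no plotting library is used here.

```python
from itertools import combinations

# Hypothetical names for the 7 variables in the question.
variables = ["age", "income", "height", "weight", "score", "tenure", "spend"]

# Every unordered pair of distinct variables corresponds to one scatter panel
# (a full grid shows each pair twice, mirrored across the diagonal).
pairs = list(combinations(variables, 2))

unique_pairs = len(pairs)           # distinct pairwise relationships
grid_panels = len(variables) ** 2   # panels in the full 7 x 7 grid

print(unique_pairs, grid_panels)
```

Drawing these scatter plots one at a time would be tedious, which is why a single pair plot grid is the quicker choice.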

What is the commonly used threshold value of VIF above which multicollinearity is considered high?

10
15
2
5

While the threshold varies with context, a common rule of thumb is that a VIF greater than 10 signals high multicollinearity, meaning that the predictor is largely explained by the other predictors. This inflates the variance of the coefficient estimates in a regression and may need to be addressed. A stricter cutoff of 5 is also sometimes used.
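
The VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The sketch below covers only the special case of two predictors, where R_j^2 is simply the squared Pearson correlation. The data and the helper names `pearson` and `vif_two_predictors` are assumptions for illustration.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def vif_two_predictors(a, b):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson(a, b)
    return 1.0 / (1.0 - r * r)

random.seed(3)
x1 = [random.gauss(0, 1) for _ in range(500)]
x2 = [v + random.gauss(0, 0.2) for v in x1]   # nearly a copy of x1
x3 = [random.gauss(0, 1) for _ in range(500)]  # unrelated to x1

vif_correlated = vif_two_predictors(x1, x2)
vif_independent = vif_two_predictors(x1, x3)

print(f"VIF with a near-duplicate predictor: {vif_correlated:.1f}")
print(f"VIF with an unrelated predictor:     {vif_independent:.2f}")
```

With more than two predictors, R_j^2 has to come from a full regression of each predictor on all the others rather than from a single correlation.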

A correlation matrix is a type of _____ matrix, which measures the linear relationships between variables.

diagonal
identity
scalar
square

A correlation matrix is a square matrix that measures the linear relationships between variables: with p variables it has p rows and p columns. It is also symmetric, with ones on the diagonal, and gives a compact view of how the variables in a dataset are correlated.
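
These structural properties can be checked on a small hand-built example. The three variables below are invented for illustration, and `correlation_matrix` is a hypothetical helper written from scratch rather than a library call.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_matrix(columns):
    """One row and one column per variable, so the result is p x p."""
    return [[pearson(a, b) for b in columns] for a in columns]

random.seed(5)
height = [random.gauss(170, 10) for _ in range(300)]
weight = [0.9 * h - 80 + random.gauss(0, 8) for h in height]  # depends on height
noise = [random.gauss(0, 1) for _ in range(300)]              # unrelated

matrix = correlation_matrix([height, weight, noise])

for row in matrix:
    print([round(value, 2) for value in row])
```

Three variables give a 3 x 3 matrix; the entry for (height, weight) equals the entry for (weight, height), and each variable correlates perfectly with itself.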

Imagine you are using Lasso Regression in a highly multicollinear dataset. What effect might this choice of model have and why?

It might ignore all correlated variables.
It might lead to high bias.
It might lead to overfitting.
It might randomly select one variable from a group of correlated variables.

Lasso regression is a regularization method that can shrink some coefficients exactly to zero, effectively performing feature selection. In the presence of highly correlated variables, Lasso tends to keep one variable from the group, chosen more or less arbitrarily, and shrink the others to zero.
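
The behaviour is easiest to see in the extreme case of two identical predictors. The following is a minimal coordinate-descent sketch written from scratch, not the implementation of any particular library; the data, the penalty `lam=0.1`, and the function names are assumptions for illustration.

```python
import random

def soft_threshold(rho, lam):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_coordinate_descent(columns, y, lam, sweeps=50):
    """Minimise (1/(2n)) * ||y - Xb||^2 + lam * ||b||_1 by cyclic updates."""
    n = len(y)
    coefs = [0.0] * len(columns)
    residual = list(y)  # y - Xb, with b starting at zero
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            z = sum(v * v for v in col) / n
            # Correlation of column j with the residual that excludes its own term.
            rho = sum(v * (r + v * coefs[j]) for v, r in zip(col, residual)) / n
            new_coef = soft_threshold(rho, lam) / z
            delta = new_coef - coefs[j]
            if delta != 0.0:
                residual = [r - delta * v for r, v in zip(residual, col)]
                coefs[j] = new_coef
    return coefs

random.seed(4)
x = [random.gauss(0, 1) for _ in range(200)]
y = [3.0 * v + random.gauss(0, 0.5) for v in x]

# Two perfectly correlated predictors: the second column is a copy of the first.
coefs = lasso_coordinate_descent([x, list(x)], y, lam=0.1)
print([round(c, 3) for c in coefs])
```

In this sketch the column visited first absorbs the whole effect and its copy stays at (numerically) zero. Which member of a correlated group survives depends on details such as update order, so the choice carries no meaning about which variable matters more. With predictors that are highly correlated but not identical, the outcome also depends on the data and the penalty strength.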
