Which type of missing data could potentially introduce the most bias into an analysis if not properly addressed?

  • All can introduce equal bias
  • MAR
  • MCAR
  • NMAR
NMAR (Not Missing At Random) could potentially introduce the most bias if not properly addressed because the probability of a value being missing depends on the missing value itself, so the observed data alone cannot tell you what is missing or why. This makes NMAR the most challenging type of missingness to handle.
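As a rough illustration, the sketch below simulates MCAR and NMAR missingness on a synthetic income column (the variable name, distribution, and missingness rates are invented for the example) and shows how NMAR distorts a naive complete-case mean:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.lognormal(mean=10, sigma=0.5, size=1_000)})

# MCAR: every value has the same 20% chance of being missing.
df["income_mcar"] = df["income"].mask(rng.random(len(df)) < 0.2)

# NMAR: high earners are more likely to withhold their income, so the
# probability of missingness depends on the (unobserved) value itself.
p_missing = df["income"].rank(pct=True) * 0.4
df["income_nmar"] = df["income"].mask(rng.random(len(df)) < p_missing)

# NMAR systematically drops large values, so the complete-case mean
# is biased downward; MCAR only adds noise.
print(df[["income", "income_mcar", "income_nmar"]].mean())
```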

What kind of effect can an outlier have on a linear regression model?

  • Decreases the model's accuracy
  • Has no effect on the model
  • Increases the model's accuracy
  • Increases the model's precision
Outliers can significantly distort the parameter estimates of a linear regression model: because ordinary least squares minimizes squared residuals, a single extreme point can pull the fitted line toward itself and produce a misleading representation of the data, thereby decreasing the model's accuracy.
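A minimal sketch of this effect on synthetic data, using NumPy's least-squares `polyfit`, shows how one extreme point shifts the estimated slope and intercept:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)

# Fit on clean data vs. the same data with a single gross outlier appended.
clean_fit = np.polyfit(x, y, deg=1)
outlier_fit = np.polyfit(np.append(x, 10.0), np.append(y, 100.0), deg=1)

print("clean   slope, intercept:", clean_fit)
print("outlier slope, intercept:", outlier_fit)
# Because OLS minimizes squared residuals, the one extreme point noticeably
# pulls the fitted line, degrading accuracy on the bulk of the data.
```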

What is the role of domain knowledge in feature selection?

  • All of the above
  • To define new features based on the existing ones
  • To guide the selection of features that are relevant to the task
  • To interpret the importance of different features
Domain knowledge can guide the selection of features that are relevant to the task, interpret the importance of different features, and even define new features based on the existing ones.

___________ kurtosis signifies a flatter distribution with thinner tails.

  • Any of these
  • Leptokurtic
  • Mesokurtic
  • Platykurtic
A platykurtic distribution has excess kurtosis less than 0 (i.e., kurtosis less than 3), which indicates a flatter distribution with thinner tails compared to a normal distribution.
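As a quick check on synthetic samples, `scipy.stats.kurtosis` reports excess kurtosis by default (so a normal sample comes out near 0 and a platykurtic sample, such as a uniform draw, comes out negative):

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(2)
normal_sample = rng.normal(size=100_000)
uniform_sample = rng.uniform(-1, 1, size=100_000)  # flatter, thin-tailed

# scipy's default (fisher=True) reports *excess* kurtosis: normal ~ 0.
print("normal  excess kurtosis:", kurtosis(normal_sample))   # ~0   (mesokurtic)
print("uniform excess kurtosis:", kurtosis(uniform_sample))  # ~-1.2 (platykurtic)
```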

How does EDA facilitate the identification of important variables in a dataset?

  • By exploring relationships between variables and their relation to the outcome variable
  • By fitting a predictive model to the data
  • By performing a cost-benefit analysis of each variable
  • By running a feature importance algorithm on the dataset
EDA facilitates the identification of important variables by exploring relationships among the variables and, in particular, each variable's relationship with the outcome. Visualizations and summary statistics can highlight which variables are strongly associated with the outcome variable, and these variables are often the most useful for predictive modeling.
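One lightweight way to start this exploration (a sketch on synthetic data with made-up feature names) is to rank features by their absolute correlation with the outcome:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 500
df = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
df["outcome"] = 3 * df["feature_a"] - 2 * df["feature_b"] + rng.normal(size=n)

# Rank features by the strength of their linear association with the outcome.
correlations = df.corr()["outcome"].drop("outcome").abs().sort_values(ascending=False)
print(correlations)
```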

You have built a model for credit risk assessment with 100 features. Upon evaluation, you find that only 20 features have significant predictive power. How would you proceed?

  • Increase the number of features
  • Keep all the features
  • Retrain the model using only the 20 significant features
  • Use all the features for model training
If only 20 features have significant predictive power, it might be beneficial to retrain the model using only these features. Reducing the number of features can make the model simpler, easier to interpret, and faster to train. It can also reduce the risk of overfitting.
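A hedged sketch of this workflow, using a synthetic stand-in for the credit-risk data and scikit-learn's univariate `SelectKBest` inside a pipeline (the real dataset and the criterion for "significant predictive power" may differ), might look like:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the credit-risk data: 100 features, 20 informative.
X, y = make_classification(n_samples=2_000, n_features=100,
                           n_informative=20, random_state=0)

full_model = LogisticRegression(max_iter=1_000)
reduced_model = make_pipeline(SelectKBest(f_classif, k=20),
                              LogisticRegression(max_iter=1_000))

# The selection step sits inside the pipeline so it is refit on each CV fold,
# avoiding information leakage from the held-out data.
print("all 100 features:", cross_val_score(full_model, X, y, cv=5).mean())
print("top 20 features :", cross_val_score(reduced_model, X, y, cv=5).mean())
```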

The IQR method defines an outlier as any value below Q1 - _______ or above Q3 + _______.

  • 1.5*IQR
  • 2*IQR
  • 2.5*IQR
  • 3*IQR
In the IQR method, an outlier is any value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, where IQR = Q3 - Q1.
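A small worked example with NumPy (the sample values are invented) computes the bounds and flags the outlier:

```python
import numpy as np

values = np.array([3, 5, 7, 8, 9, 10, 12, 13, 14, 15, 48])  # 48 looks suspect

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print("bounds:", lower, upper)     # -1.5, 22.5
print("outliers:", outliers)       # [48]
```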

The primary goal of EDA is to ________.

  • Build a final model
  • Predict the future outcomes
  • Prove a preconceived notion
  • Understand the underlying structure of the data
The primary goal of EDA is to understand the underlying structure of the data. This understanding includes gaining insights about the distribution, variability, and relationships among variables in the data, and it helps guide the choice of appropriate models or further data transformations.

Imagine you are analyzing a dataset with continuous variables with outliers. The focus is on understanding the linear relationship between these variables. What type of correlation coefficient should you consider?

  • Covariance
  • Kendall's Tau
  • Pearson's correlation coefficient
  • Spearman's correlation coefficient
Because the data are continuous and the focus is on the linear relationship, Pearson's correlation coefficient should be used. However, be cautious about the outliers: Pearson's correlation is sensitive to them, so it is worth checking whether a few extreme points are driving the result (a rank-based measure such as Spearman's can serve as a robustness check).
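The following sketch on synthetic data compares how Pearson's and Spearman's coefficients react to a single injected outlier:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(4)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)

# Inject one extreme, contrarian pair and see how each coefficient reacts.
x_out = np.append(x, 10)
y_out = np.append(y, -10)

print("Pearson  (clean / with outlier):", pearsonr(x, y)[0], pearsonr(x_out, y_out)[0])
print("Spearman (clean / with outlier):", spearmanr(x, y)[0], spearmanr(x_out, y_out)[0])
# Pearson drops sharply with the outlier; the rank-based Spearman barely moves.
```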

Imagine you are using Lasso Regression in a highly multicollinear dataset. What effect might this choice of model have and why?

  • It might ignore all correlated variables.
  • It might lead to high bias.
  • It might lead to overfitting.
  • It might randomly select one variable from a group of correlated variables.
Lasso regression is a regularization method that can shrink some coefficients exactly to zero, effectively performing feature selection. In the presence of highly correlated variables, Lasso tends to select one variable from the group more or less at random and shrink the coefficients of the others to zero.
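A minimal sketch with scikit-learn's Lasso on synthetic, nearly duplicated predictors illustrates this behavior (which column survives, and the exact coefficients, will vary with the data and the alpha penalty):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n = 500
base = rng.normal(size=n)

# Three nearly identical (highly correlated) predictors plus one independent one.
X = np.column_stack([
    base + rng.normal(scale=0.01, size=n),
    base + rng.normal(scale=0.01, size=n),
    base + rng.normal(scale=0.01, size=n),
    rng.normal(size=n),
])
y = 2 * base + 0.5 * X[:, 3] + rng.normal(scale=0.1, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
# Typically only one of the three correlated columns keeps a non-zero
# coefficient; which one survives is essentially arbitrary.
print(lasso.coef_)
```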