The IQR method defines an outlier as any value below Q1 - _______ or above Q3 + _______.
- 1.5*IQR
- 2*IQR
- 2.5*IQR
- 3*IQR
In the IQR method, an outlier is any value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
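This rule can be sketched in a few lines of NumPy; the data below is a made-up illustrative sample:

```python
import numpy as np

def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Toy sample: the extreme value 50 falls above Q3 + 1.5*IQR.
data = [2, 3, 4, 5, 5, 6, 7, 8, 50]
print(iqr_outliers(data))  # [50]
```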
You have built a model for credit risk assessment with 100 features. Upon evaluation, you find that only 20 features have significant predictive power. How would you proceed?
- Increase the number of features
- Keep all the features
- Retrain the model using only the 20 significant features
- Use all the features for model training
If only 20 features have significant predictive power, it might be beneficial to retrain the model using only these features. Reducing the number of features can make the model simpler, easier to interpret, and faster to train. It can also reduce the risk of overfitting.
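A minimal sketch of the retraining step, assuming the per-feature importance scores are already available as an array; all names and data here are synthetic placeholders:

```python
import numpy as np

# Hypothetical setup: X holds 500 samples with 100 features, and
# `importance` is a per-feature score from some earlier evaluation.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))
importance = np.abs(rng.normal(size=100))

# Keep the 20 highest-scoring columns and retrain on the reduced matrix.
top20 = np.argsort(importance)[-20:]
X_reduced = X[:, top20]
print(X_reduced.shape)  # (500, 20)
```

The model would then be refit on `X_reduced` instead of the full matrix.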
How does EDA facilitate the identification of important variables in a dataset?
- By exploring relationships between variables and their relation to the outcome variable
- By fitting a predictive model to the data
- By performing a cost-benefit analysis of each variable
- By running a feature importance algorithm on the dataset
EDA facilitates the identification of important variables by exploring relationships between variables and their relation to the outcome variable. Visualizations and summary statistics can highlight which variables have strong relationships with the outcome variable, and these variables are often important for predictive modeling.
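A toy illustration of how a simple summary statistic, Pearson correlation with the outcome, can surface a strong candidate variable; the data is simulated so that `x1` drives `y` and `x2` is pure noise:

```python
import numpy as np

# Simulated data: x1 is strongly related to the outcome, x2 is not.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3 * x1 + rng.normal(scale=0.5, size=n)

# Correlation of each candidate variable with the outcome.
r1 = np.corrcoef(x1, y)[0, 1]
r2 = np.corrcoef(x2, y)[0, 1]
print(round(r1, 2), round(r2, 2))  # r1 near 1, r2 near 0
```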
___________ kurtosis signifies a flatter distribution with thinner tails.
- Any of these
- Leptokurtic
- Mesokurtic
- Platykurtic
A platykurtic distribution has excess kurtosis less than 0 (i.e., raw kurtosis below 3, the value for a normal distribution), which indicates a flatter distribution with thinner tails compared to a normal distribution.
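The difference can be checked numerically with a hand-rolled excess-kurtosis estimate (the sample fourth standardized moment minus 3) on simulated data:

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis: fourth standardized moment minus 3."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z**4) - 3.0

rng = np.random.default_rng(2)
flat = rng.uniform(size=10_000)    # platykurtic: flatter, thin tails
peaked = rng.laplace(size=10_000)  # leptokurtic: peaked, heavy tails
print(round(excess_kurtosis(flat), 2))    # negative (≈ -1.2 for uniform)
print(round(excess_kurtosis(peaked), 2))  # positive (≈ 3 for Laplace)
```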
What is the role of domain knowledge in feature selection?
- All of the above
- To define new features based on the existing ones
- To guide the selection of features that are relevant to the task
- To interpret the importance of different features
Domain knowledge can guide the selection of features that are relevant to the task, interpret the importance of different features, and even define new features based on the existing ones.
What kind of effect can an outlier have on a linear regression model?
- Decreases the model's accuracy
- Has no effect on the model
- Increases the model's accuracy
- Increases the model's precision
Outliers can significantly affect the estimates of the parameters in a linear regression model and can lead to a misleading representation of the data, hence decreasing the model's accuracy.
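A small demonstration of the effect: fitting the same least-squares line with and without a single injected outlier, using toy data and `np.polyfit`:

```python
import numpy as np

# Clean linear data y = 2x: the fitted slope is exactly 2.
x = np.arange(10, dtype=float)
y = 2 * x
slope_clean = np.polyfit(x, y, 1)[0]

# Inject one extreme outlier; the single point drags the fitted line.
y_out = y.copy()
y_out[-1] = 100.0
slope_out = np.polyfit(x, y_out, 1)[0]
print(round(slope_clean, 2), round(slope_out, 2))  # 2.0 vs. a much larger slope
```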
Which type of missing data could potentially introduce the most bias into an analysis if not properly addressed?
- All can introduce equal bias
- MAR
- MCAR
- NMAR
NMAR (not missing at random) could potentially introduce the most bias into an analysis if not properly addressed, because the missingness is related to the value of the missing data itself. This type of missingness is also the most challenging to handle.
What is the impact of positive skewness on data interpretation?
- It suggests that data is evenly distributed.
- It suggests that most values are clustered around the left tail.
- It suggests that most values are clustered around the right tail.
- It suggests the presence of numerous outliers in the left tail.
Positive (right) skewness indicates that most of the data values are concentrated on the left side of the distribution, with a long tail extending toward higher values. As a result, the mean is typically larger than the median.
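This mean-versus-median effect is easy to verify on a simulated right-skewed sample (an exponential distribution here):

```python
import numpy as np

# Right-skewed sample: bulk of values are small, with a long right tail.
rng = np.random.default_rng(3)
sample = rng.exponential(scale=1.0, size=10_000)

# The tail pulls the mean above the median (≈ 1.0 vs. ≈ 0.69 here).
print(sample.mean() > np.median(sample))  # True
```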
Imagine you are using Lasso Regression in a highly multicollinear dataset. What effect might this choice of model have and why?
- It might ignore all correlated variables.
- It might lead to high bias.
- It might lead to overfitting.
- It might randomly select one variable from a group of correlated variables.
Lasso regression is a regularization method that can shrink some coefficients exactly to zero, effectively performing feature selection. In the presence of highly correlated variables, Lasso tends to arbitrarily select one variable from a group of correlated variables and shrink the coefficients of the others to zero.
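This behavior can be illustrated with a bare-bones coordinate-descent Lasso (a minimal sketch with no intercept or convergence check, not a production solver) on two exactly duplicated columns; which duplicate "survives" depends only on the update order:

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=200):
    """Coordinate-descent Lasso (minimal sketch: no intercept, fixed iterations)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            # Residual with feature j's current contribution removed.
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r
            # Soft-thresholding: this step performs the variable selection.
            beta[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / (X[:, j] @ X[:, j])
    return beta

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = x1.copy()                       # exact duplicate: extreme multicollinearity
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.1, size=200)

beta = lasso_cd(X, y, alpha=50.0)
print(np.round(beta, 2))  # one coefficient carries the signal, the other is ~0
```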
A correlation matrix is a type of _____ matrix, which measures the linear relationships between variables.
- diagonal
- identity
- scalar
- square
A correlation matrix is a type of square matrix that measures the linear relationships between variables. It provides a compact and comprehensive view of how different variables in a dataset are correlated.
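The key properties of a correlation matrix (square, symmetric, ones on the diagonal) are easy to confirm with `np.corrcoef` on random data:

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(size=(100, 4))        # 100 observations, 4 variables
corr = np.corrcoef(data, rowvar=False)  # 4x4 correlation matrix

print(corr.shape)                       # (4, 4): square
print(np.allclose(corr, corr.T))        # True: symmetric
print(np.allclose(np.diag(corr), 1.0))  # True: each variable with itself
```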
What is the threshold value of VIF above which multicollinearity is generally assumed to be high?
- 10
- 15
- 2
- 5
While the threshold can vary based on the context, a common rule of thumb is that if VIF is greater than 10, multicollinearity is high, indicating that the predictors are highly correlated. This could pose problems in a regression analysis and might need to be addressed.
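VIF can be computed directly from its definition, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on the remaining predictors; the data below is a toy example with one nearly collinear column:

```python
import numpy as np

def vif(X, j):
    """VIF for column j: 1 / (1 - R^2) from regressing X[:, j] on the rest."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([others, np.ones(len(target))])  # add intercept
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    r2 = 1 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(6)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                  # independent of the others
x3 = x1 + rng.normal(scale=0.1, size=n)  # nearly collinear with x1
X = np.column_stack([x1, x2, x3])

print(round(vif(X, 1), 1))  # near 1: x2 is unrelated to the others
print(round(vif(X, 2), 1))  # far above 10: x3 is almost a copy of x1
```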
Suppose you have a dataset with 7 variables, and you want to quickly examine the relationships among all variables. Which type of plot would you choose and why?
- Correlation Matrix
- Histogram
- Pairplot
- Scatter Plot
In this scenario, a pairplot would be the best choice because it shows all pairwise relationships between the variables in a single view. It is an excellent tool for quickly visualizing and understanding the relationships among multiple variables at once.
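A pairplot-style grid can be built by hand with Matplotlib (scatter plots off the diagonal, histograms on it); in practice seaborn's `pairplot` does this in a single call. The data and variable names here are placeholders:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(size=(100, 3))
names = ["a", "b", "c"]  # illustrative variable names
k = data.shape[1]

# One panel per variable pair; histograms of each variable on the diagonal.
fig, axes = plt.subplots(k, k, figsize=(6, 6))
for i in range(k):
    for j in range(k):
        ax = axes[i, j]
        if i == j:
            ax.hist(data[:, i], bins=15)
        else:
            ax.scatter(data[:, j], data[:, i], s=5)
        if i == k - 1:
            ax.set_xlabel(names[j])
        if j == 0:
            ax.set_ylabel(names[i])
fig.savefig("pairplot.png")
```

With seaborn, the equivalent is `seaborn.pairplot(df)` on a DataFrame of the same columns.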