Consider a situation where a Spearman's correlation coefficient between two variables was found to be significantly different from zero, but the Pearson's correlation was not. How would you explain this?

  • Data has outliers
  • Data is normally distributed
  • Relationship between variables is linear
  • Relationship between variables is non-linear
If Spearman's correlation is significantly different from zero but Pearson's is not, the relationship between the variables is likely monotonic but not linear. Pearson's coefficient measures the strength of a linear association, whereas Spearman's is computed on ranks and therefore captures any monotonic relationship, linear or not.
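
The difference can be illustrated with a minimal sketch on synthetic data, assuming an exponential relationship as the example of a monotonic, non-linear pattern. The helper names (pearson, ranks, spearman) are illustrative and not taken from any library, and the sketch compares only the coefficients, not their significance tests.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Synthetic data (an assumption for illustration): y grows exponentially with x.
x = list(range(1, 11))
y = [math.exp(v) for v in x]

r = pearson(x, y)
rho = spearman(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Because y always increases with x, the ranks agree perfectly and Spearman's coefficient is 1, while Pearson's coefficient is noticeably lower because the points do not lie on a straight line.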

Which type of missing data relies on information that is not included in the dataset?

  • MAR
  • MCAR
  • NMAR
  • nan
Data that are NMAR (Not Missing At Random) have missingness that depends on information not included in the dataset, such as the unobserved values themselves.

Which type of missing data could potentially introduce the most bias into an analysis if not properly addressed?

  • All can introduce equal bias
  • MAR
  • MCAR
  • NMAR
NMAR could potentially introduce the most bias into an analysis if not properly addressed, because the missingness is related to the value of the missing data itself. This also makes it the most challenging type to handle, since the missingness cannot be fully explained using the observed data alone.
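
A minimal sketch of this effect on synthetic data is shown below. The values, the 20% missingness rate, and the cutoff are all assumptions chosen for illustration; the sketch compares the mean of the observed values under MCAR and under one simple NMAR mechanism.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Synthetic "true" values; in real data the missing ones would be unknown.
values = list(range(1, 10001))
true_mean = mean(values)

# MCAR: every value has the same 20% chance of being dropped.
mcar_observed = [v for v in values if rng.random() >= 0.2]

# NMAR (one possible mechanism): the largest 20% of values go missing,
# so missingness depends on the unobserved value itself.
cutoff = 8000
nmar_observed = [v for v in values if v <= cutoff]

mcar_bias = mean(mcar_observed) - true_mean
nmar_bias = mean(nmar_observed) - true_mean
print(f"MCAR bias in the mean: {mcar_bias:.1f}")
print(f"NMAR bias in the mean: {nmar_bias:.1f}")
```

Under MCAR the observed mean stays close to the true mean, whereas under this NMAR mechanism it is pulled down substantially, and nothing in the observed values alone reveals how far.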

What kind of effect can an outlier have on a linear regression model?

  • Decreases the model's accuracy
  • Has no effect on the model
  • Increases the model's accuracy
  • Increases the model's precision
Outliers can significantly affect the estimates of the parameters in a linear regression model and can lead to a misleading representation of the data, hence decreasing the model's accuracy.
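
As a rough illustration, the sketch below fits a least-squares line to synthetic points that lie exactly on y = 2x + 1, then refits after adding a single outlier. The data and the fit_line helper are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Clean synthetic data lying exactly on y = 2x + 1.
x_clean = list(range(10))
y_clean = [2 * x + 1 for x in x_clean]

# Same data plus one outlier at x = 10 (the line would predict 21 there).
x_out = x_clean + [10]
y_out = y_clean + [100]

clean_slope, clean_intercept = fit_line(x_clean, y_clean)
out_slope, out_intercept = fit_line(x_out, y_out)
print(f"without outlier: slope = {clean_slope:.2f}, intercept = {clean_intercept:.2f}")
print(f"with outlier:    slope = {out_slope:.2f}, intercept = {out_intercept:.2f}")
```

One extreme point at the edge of the x range is enough to move both the slope and the intercept far from the values that describe the other ten points.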

What is the role of domain knowledge in feature selection?

  • All of the above
  • To define new features based on the existing ones
  • To guide the selection of features that are relevant to the task
  • To interpret the importance of different features
Domain knowledge can guide the selection of features that are relevant to the task, interpret the importance of different features, and even define new features based on the existing ones.

___________ kurtosis signifies a flatter distribution with thinner tails.

  • Any of these
  • Leptokurtic
  • Mesokurtic
  • Platykurtic
A platykurtic distribution has excess kurtosis less than 0 (that is, kurtosis less than 3, the value for a normal distribution), which indicates a flatter distribution with thinner tails compared to a normal distribution.
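
The sign of the excess kurtosis can be checked with a short sketch. It assumes the population (biased) moment formula, and the two samples are synthetic examples chosen for illustration; the excess_kurtosis helper is not a library function.

```python
def excess_kurtosis(values):
    """Fourth central moment over squared variance, minus 3 (normal = 0)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0

# Evenly spread values: flat shape with no heavy tails (platykurtic).
flat = list(range(1, 101))

# Mostly identical values plus two extreme ones: heavy tails (leptokurtic).
heavy_tailed = [0] * 98 + [-10, 10]

print(f"flat sample:         {excess_kurtosis(flat):.2f}")
print(f"heavy-tailed sample: {excess_kurtosis(heavy_tailed):.2f}")
```

The flat sample gives a negative value and the heavy-tailed sample a positive one; a mesokurtic distribution such as the normal would sit near zero.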

How does EDA facilitate the identification of important variables in a dataset?

  • By exploring relationships between variables and their relation to the outcome variable
  • By fitting a predictive model to the data
  • By performing a cost-benefit analysis of each variable
  • By running a feature importance algorithm on the dataset
EDA facilitates the identification of important variables by exploring relationships between variables and their relation to the outcome variable. Visualizations and summary statistics can highlight which variables have strong relationships with the outcome variable, and these variables are often important for predictive modeling.

You have built a model for credit risk assessment with 100 features. Upon evaluation, you find that only 20 features have significant predictive power. How would you proceed?

  • Increase the number of features
  • Keep all the features
  • Retrain the model using only the 20 significant features
  • Use all the features for model training
If only 20 features have significant predictive power, it might be beneficial to retrain the model using only these features. Reducing the number of features can make the model simpler, easier to interpret, and faster to train. It can also reduce the risk of overfitting.

The IQR method defines an outlier as any value below Q1 - _______ or above Q3 + _______.

  • 1.5*IQR
  • 2*IQR
  • 2.5*IQR
  • 3*IQR
In the IQR method, an outlier is any value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
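
The rule can be sketched as follows, using Python's statistics.quantiles with the "inclusive" method to obtain the quartiles. The sample data are made up for illustration, and other quartile conventions can give slightly different fences.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Synthetic sample with one value far from the rest.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
outliers = iqr_outliers(data)
print(outliers)
```

With this sample, Q1 is 12 and Q3 is 13.75, so the fences are 9.375 and 16.375 and only the value 102 is flagged.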

The primary goal of EDA is to ________.

  • Build a final model
  • Predict the future outcomes
  • Prove a preconceived notion
  • Understand the underlying structure of the data
The primary goal of EDA is to understand the underlying structure of the data. This understanding includes gaining insights about the distribution, variability, and relationships among variables in the data, and it helps guide the choice of appropriate models or further data transformations.