_____ imputation is a basic method of handling missing data by replacing missing values with the most frequent category (for categorical variables).

  • Listwise
  • Mean
  • Median
  • Mode
'Mode' imputation is a basic method of handling missing data by replacing missing values with the most frequent category (for categorical variables). It is easy to implement but might introduce bias by overrepresenting the most frequent category.
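
A minimal sketch of mode imputation using only the Python standard library. The helper name `impute_mode` and the sample data are hypothetical, chosen for illustration, and `None` is assumed to mark a missing value.

```python
from collections import Counter

def impute_mode(values, missing=None):
    """Replace missing entries with the most frequent observed value."""
    observed = [v for v in values if v is not missing]
    if not observed:
        return list(values)  # nothing to impute from
    # most_common(1) returns [(value, count)] for the top category
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if v is missing else v for v in values]

# Hypothetical categorical column with two missing entries
colors = ["red", "blue", None, "red", None, "green"]
imputed = impute_mode(colors)
print(imputed)
```

Note how "red" goes from 2 of 4 observed values to 4 of 6 values after imputation, which is the overrepresentation risk mentioned above.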

Which measure of central tendency will be most affected in a scenario where the dataset has extreme values?

  • Mean
  • Median
  • Mode
  • nan
The "Mean" or average will be most affected in a scenario where the dataset has extreme values. Since the mean is calculated by taking into account all values in the dataset, outliers or extreme values can cause significant shifts in the mean, making it less representative of the dataset's central tendency.
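
The effect can be checked with a small sketch using the standard `statistics` module. The salary figures are made up for illustration.

```python
import statistics

# Hypothetical salaries (in thousands); then the same data plus one extreme value
salaries = [40, 42, 42, 45, 48]
with_outlier = salaries + [1000]

mean_before = statistics.mean(salaries)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(salaries)
median_after = statistics.median(with_outlier)
mode_before = statistics.mode(salaries)
mode_after = statistics.mode(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
print("mode:  ", mode_before, "->", mode_after)
```

In this example the mean moves by more than 100, the median moves by 1.5, and the mode does not move at all.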

When performing a pairwise analysis, _____ deletion discards only the specific pairs of data where one is missing.

  • Listwise
  • Pairwise
  • Random
  • Systematic
When performing a pairwise analysis, 'pairwise' deletion discards only the specific pairs of data where one is missing. It allows the retention of more data compared to listwise deletion, but it can lead to biased results if the data is not missing completely at random.
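
The difference in retained data can be sketched as follows. The rows, column names, and helper functions are hypothetical, and `None` is assumed to mark a missing value.

```python
from itertools import combinations

# Hypothetical dataset: each row is missing at most one value
rows = [
    {"age": 25, "income": 50, "score": 7},
    {"age": 32, "income": None, "score": 6},
    {"age": None, "income": 61, "score": 8},
    {"age": 41, "income": 72, "score": None},
    {"age": 38, "income": 66, "score": 9},
]

def listwise(rows):
    """Keep only rows with no missing value in any column."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, a, b):
    """Keep (a, b) pairs where both of these two columns are present."""
    return [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]

complete_rows = listwise(rows)
pair_counts = {
    (a, b): len(pairwise(rows, a, b))
    for a, b in combinations(["age", "income", "score"], 2)
}

print("listwise rows kept:", len(complete_rows))
print("pairwise rows kept per pair:", pair_counts)
```

Each pair of columns keeps more rows than listwise deletion does, but different pairs are computed on different subsets of rows, which is one reason results can be inconsistent when the data are not missing completely at random.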

In regression analysis, if the Variance Inflation Factor (VIF) for a predictor is 1, this means that _________.

  • The predictor is not at all correlated with other predictors
  • The predictor is not at all correlated with the response
  • The predictor is perfectly correlated with other predictors
  • The predictor is perfectly correlated with the response
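In regression analysis, a Variance Inflation Factor (VIF) of 1 indicates that the given predictor is not linearly correlated with the other predictors, so the variance of its coefficient is not inflated by multicollinearity. VIF is computed as 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors, so VIF = 1 corresponds to R² = 0.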
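
A rough sketch of the calculation, using VIF = 1 / (1 - R²) where R² comes from regressing one predictor on the others. To stay within the standard library it assumes only two predictors, so the auxiliary regression is a simple one-variable least-squares fit. The function names and data are hypothetical.

```python
def r_squared(target, predictor):
    """R^2 from a simple least-squares regression of target on predictor."""
    n = len(target)
    mean_t = sum(target) / n
    mean_p = sum(predictor) / n
    sxy = sum((p - mean_p) * (t - mean_t) for p, t in zip(predictor, target))
    sxx = sum((p - mean_p) ** 2 for p in predictor)
    slope = sxy / sxx
    intercept = mean_t - slope * mean_p
    ss_res = sum((t - (intercept + slope * p)) ** 2 for p, t in zip(predictor, target))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    return 1 - ss_res / ss_tot

def vif(target, predictor):
    """VIF of `target` when `predictor` is the only other predictor."""
    return 1 / (1 - r_squared(target, predictor))

x1 = [-2, -1, 0, 1, 2]
x2 = [2, -1, -2, -1, 2]            # constructed to be uncorrelated with x1
x3 = [-2.1, -0.9, 0.1, 1.0, 1.9]   # nearly a copy of x1

vif_uncorrelated = vif(x1, x2)
vif_collinear = vif(x1, x3)
print(vif_uncorrelated, vif_collinear)
```

With the uncorrelated pair the VIF is 1; with the nearly duplicated predictor it is far above the commonly quoted rule-of-thumb cutoffs of 5 or 10.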

Why might PCA be considered a method of feature selection?

  • It can handle correlated features
  • It can improve model performance
  • It can reduce the dimensionality of the data
  • It transforms the data into a new space
Principal Component Analysis (PCA) is sometimes described as a method of feature selection because it reduces the dimensionality of the data: it transforms the original features into a new set of uncorrelated features and keeps only the leading ones. These new features, called principal components, are linear combinations of the original features, ordered by how much variance they capture. Strictly speaking this is feature extraction, since every component can mix all of the original features rather than selecting a subset of them.
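
As an illustration, the two-feature case can be worked out in closed form from the 2x2 covariance matrix, without any external library. This is a sketch under that assumption, not a general PCA implementation; the function name `pca_2d` and the data are hypothetical.

```python
import math

def pca_2d(xs, ys):
    """PCA for two features via the eigenvalues of the 2x2 covariance matrix."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Sample covariance matrix [[a, b], [b, c]]
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix
    mid = (a + c) / 2
    half_gap = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = mid + half_gap, mid - half_gap
    # Eigenvector for the largest eigenvalue (assumes b != 0)
    vx, vy = b, lam1 - a
    norm = math.hypot(vx, vy)
    return (lam1, lam2), (vx / norm, vy / norm)

# Two strongly correlated hypothetical features
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

(lam1, lam2), direction = pca_2d(xs, ys)
explained = lam1 / (lam1 + lam2)
print("first component direction:", direction)
print("share of variance explained:", explained)
```

The first component captures almost all of the variance, so one dimension can replace two, yet its direction has nonzero weight on both original features.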

You've created a pairplot of your dataset, and one scatter plot in the grid shows a clear linear pattern. What could this potentially indicate?

  • The two variables are highly uncorrelated
  • The two variables are unrelated
  • The two variables have a strong linear relationship
  • The two variables have no relationship
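If a scatter plot in a pairplot shows a clear linear pattern, this could indicate that the two variables have a strong linear relationship: as one variable changes, the other tends to change by a proportional amount, either in the same direction (positive) or in the opposite direction (negative). A strong relationship does not by itself show that one variable causes the other.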
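
One way to put a number on such a pattern is the Pearson correlation coefficient. The sketch below computes it by hand; the function name and the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical pair of variables that would look close to a straight line
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 64, 70, 73]

r_positive = pearson_r(hours, score)
r_negative = pearson_r(hours, [-s for s in score])  # same pattern, downward slope
print(r_positive, r_negative)
```

Values close to +1 or -1 correspond to the clear upward or downward linear patterns seen in the plot.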

A team of researchers has already formulated their hypotheses and now they want to test these against their collected data. What type of data analysis would be appropriate?

  • All are equally suitable
  • CDA
  • EDA
  • Predictive Modeling
Confirmatory Data Analysis (CDA) would be the most appropriate, as it involves testing pre-formulated hypotheses against the collected data to either confirm or refute them.

How does multicollinearity affect feature selection?

  • It affects the accuracy of the model
  • It causes unstable parameter estimates
  • It makes the model less interpretable
  • It results in high variance of the model
Multicollinearity, meaning high correlation among predictor variables, affects feature selection by making the parameter estimates unstable. Coefficients can change sharply, or even flip sign, with small changes in the data, and their standard errors are inflated, so it becomes hard to judge which features actually matter.

Modified Z-score is a more robust estimator in the presence of _______.

  • normally distributed data
  • outliers
  • skewed data
  • uniformly distributed data
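The modified Z-score is more robust in the presence of outliers. It replaces the mean and standard deviation with the median and the median absolute deviation (MAD), which extreme values barely move, so an outlier is less likely to hide itself by inflating the very statistics used to detect it.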
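
A short sketch of the calculation, using the common form M = 0.6745 * (x - median) / MAD with a cutoff of 3.5. The function name and the data are hypothetical.

```python
import statistics

def modified_z_scores(values):
    """Modified Z-scores based on the median and median absolute deviation."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; modified Z-score is undefined here")
    return [0.6745 * (v - med) / mad for v in values]

data = [10, 11, 12, 11, 10, 12, 11, 95]   # one obvious extreme value

mod_z = modified_z_scores(data)
outliers = [v for v, m in zip(data, mod_z) if abs(m) > 3.5]

# Ordinary Z-score of the extreme value, for comparison
mean = statistics.mean(data)
sd = statistics.pstdev(data)
z_of_95 = (95 - mean) / sd

print("flagged by modified Z-score:", outliers)
print("ordinary Z-score of 95:", z_of_95)
```

In this example the ordinary Z-score of 95 stays below the usual cutoff of 3 because the extreme value inflates both the mean and the standard deviation, while the modified Z-score flags it clearly.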

What type of data is Spearman's correlation most suitable for?

  • Categorical data
  • Continuous, normally distributed data
  • Nominal data
  • Ordinal data
Spearman's correlation is most suitable for ordinal data. It assesses how well the relationship between two variables can be described by a monotonic function. Because it is based on ranks, it can be used with ordinal data, where the order of the values is meaningful but the size of the differences between them is not.
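
A minimal sketch of the rank-based calculation, using the shortcut formula rho = 1 - 6 * sum(d²) / (n * (n² - 1)). This formula assumes there are no tied values; the function names and data are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; ties are not supported."""
    if len(set(values)) != len(values):
        raise ValueError("this sketch assumes no tied values")
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman's rank correlation via the no-ties shortcut formula."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Two hypothetical judges ranking the same five items
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 4, 3, 5]
rho_judges = spearman_rho(judge_a, judge_b)

# A monotonic but non-linear relationship still gets a perfect score
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
rho_monotonic = spearman_rho(x, y)

print(rho_judges, rho_monotonic)
```

Only the ordering of the values enters the calculation, which is why the cubic relationship scores the same as a straight line would.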