In a scenario where your dataset has a Gaussian distribution, which scaling method is typically recommended and why?

  • All scaling methods work equally well with Gaussian distributed data
  • Min-Max scaling because it scales all values between 0 and 1
  • Robust scaling because it is not affected by outliers
  • Z-score standardization because it creates a normal distribution
Z-score standardization is typically recommended for a dataset with a Gaussian distribution. It does not create a normal distribution, because the shape of the data is unchanged; it rescales the data to a mean of 0 and a standard deviation of 1, so Gaussian data ends up matching the standard normal distribution.
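
As a minimal sketch of the rescaling described above, using only the standard library: the helper name `standardize` and the sample values are illustrative assumptions, and the population standard deviation is used here (a library scaler may make a different choice).

```python
import statistics

def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed choice)
    return [(x - mean) / std for x in values]

heights_cm = [160.0, 165.0, 170.0, 175.0, 180.0]  # hypothetical sample
z = standardize(heights_cm)

# The rescaled values are centred on 0 with unit spread;
# the shape of the distribution itself is not altered.
print(round(statistics.fmean(z), 6), round(statistics.pstdev(z), 6))
```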

How can mishandling missing data in a feature affect the feature's importance in a machine learning model?

  • Decreases the feature's importance.
  • Depends on the feature's initial importance.
  • Has no effect on the feature's importance.
  • Increases the feature's importance.
Mishandling missing data can distort the feature's distribution and skew its statistical properties. This weakens the relationship the model can learn between the feature and the target, which might decrease the feature's importance.

You're using a model that is sensitive to multicollinearity. How can feature selection help improve your model's performance?

  • By adding more features
  • By removing highly correlated features
  • By transforming the features
  • By using all features
If you're using a model that is sensitive to multicollinearity, feature selection can improve its performance by removing highly correlated features. Multicollinearity makes some models unstable (in linear regression, for example, the coefficient estimates become unreliable), and keeping only one feature from each highly correlated group alleviates the problem.
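
One simple way to act on this is a greedy correlation filter, sketched below. The function names, the 0.9 threshold, and the toy features are all assumptions made for illustration; this is not the only way to select features, and the result depends on the order in which features are visited.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is <= threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical data: height_in is just height_cm in different units.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [34, 21, 45, 29, 38],
}
selected = drop_correlated(features)
print(selected)
```

Here the redundant `height_in` column is dropped because it is almost perfectly correlated with `height_cm`, while `age` is kept.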

How can incorrect handling of missing data impact the bias-variance trade-off in a machine learning model?

  • Does not affect the bias-variance trade-off.
  • Increases bias and reduces variance.
  • Increases both bias and variance.
  • Increases variance and reduces bias.
Improper handling of missing data, such as naive imputation, can lead to an increase in bias and a decrease in variance. The imputed values may be systematically off, leading the model to learn incorrect patterns (higher bias), while filling every gap with a single value such as the mean makes the data artificially uniform (lower variance).
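
The sketch below illustrates one part of this mechanism on made-up numbers: filling gaps with the observed mean compresses the feature's spread. Note that this shows the variance of the feature, which is only a rough proxy for the model-variance term in the trade-off; the values and variable names are assumptions.

```python
import statistics

# Hypothetical feature with two missing entries (None).
observed = [2.0, None, 6.0, None, 10.0, 12.0]

present = [v for v in observed if v is not None]
fill = statistics.fmean(present)  # naive mean imputation
imputed = [fill if v is None else v for v in observed]

var_present = statistics.pvariance(present)
var_imputed = statistics.pvariance(imputed)

# Every imputed point sits exactly at the mean, so it adds no spread
# and the variance of the filled-in feature shrinks.
print(var_present, var_imputed)
```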

How does the IQR method categorize a data point as an outlier?

  • By comparing it to the mean
  • By comparing it to the median
  • By comparing it to the standard deviation
  • By seeing if it falls below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR
The IQR method categorizes a data point as an outlier if it falls below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1.
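
A minimal sketch of this rule with the standard library follows. The function name `iqr_outliers` and the sample data are assumptions, and the quartiles are computed with the "inclusive" method of `statistics.quantiles`; other quartile conventions can give slightly different fences.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]  # hypothetical sample
outliers = iqr_outliers(data)
print(outliers)
```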

You're working with a data set that does not follow a normal distribution. Which method, Z-score or IQR, should be used for detecting outliers?

  • Both are suitable
  • IQR
  • Neither is suitable
  • Z-score
In this case, the IQR method is the better choice: unlike the Z-score method, which assumes the data is normally distributed, it does not assume any specific distribution.

Which measure of central tendency can be used for both quantitative and qualitative data?

  • Mean
  • Median
  • Mode
The mode is the measure of central tendency that can be used for both quantitative and qualitative data. It is the value that appears most frequently in a data set, and it is the only measure of central tendency that can be used with nominal data.
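
For illustration, the standard library's `statistics.mode` works on categories and numbers alike. The sample lists below are made up.

```python
import statistics

colors = ["red", "blue", "red", "green", "red", "blue"]  # nominal (qualitative) data
ratings = [3, 5, 3, 4, 3]                                # quantitative data

most_common_color = statistics.mode(colors)
most_common_rating = statistics.mode(ratings)

print(most_common_color, most_common_rating)
```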

Which method for dealing with missing data might introduce bias if the data is not missing completely at random?

  • Listwise Deletion
  • Mean/Median/Mode Imputation
  • Pairwise Deletion
  • Regression Imputation
Mean/median/mode imputation might introduce bias if the data is not missing completely at random. If the missing values follow a systematic pattern, replacing them with the mean, median, or mode can underestimate variability and bias the results.
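
The following sketch shows the effect on invented numbers. It assumes a hypothetical survey in which incomes above 80 (in thousands) go unreported, so the missingness depends on the value itself.

```python
import statistics

# Hypothetical true incomes, in thousands.
incomes = [30, 35, 40, 45, 50, 90, 120]

# Not missing completely at random: high earners do not report.
observed = [x if x <= 80 else None for x in incomes]

present = [x for x in observed if x is not None]
fill = statistics.fmean(present)  # mean of the reported values only
imputed = [fill if x is None else x for x in observed]

true_mean = statistics.fmean(incomes)
imputed_mean = statistics.fmean(imputed)

# The imputed mean stays at the level of the low earners,
# so it underestimates the true average.
print(round(true_mean, 2), round(imputed_mean, 2))
```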

You find that the Z-score and modified Z-score methods give different sets of outliers for the same dataset. How will you reconcile this?

  • Assume the Z-score method is correct
  • Assume the modified Z-score method is correct
  • Consider the intersection of both methods
  • Further inspect the data and the assumptions of each method
When two methods give different sets of outliers, it's best to further inspect the data and the assumptions of each method before drawing conclusions.
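
A small sketch of how such a disagreement can arise: the Z-score uses the mean and standard deviation, which the outliers themselves inflate, while the modified Z-score uses the median and the median absolute deviation (MAD). The thresholds of 3.0 and 3.5 and the 0.6745 constant are the commonly cited conventions, the function names and data are assumptions, and the code assumes the MAD is nonzero.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag points whose |z| exceeds the threshold (mean/std based)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [x for x in values if abs(x - mean) / std > threshold]

def modified_zscore_outliers(values, threshold=3.5):
    """Flag points using the median and MAD (assumes MAD != 0)."""
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    return [x for x in values if abs(0.6745 * (x - med) / mad) > threshold]

data = [10, 11, 12, 12, 13, 13, 14, 40, 45]  # hypothetical sample

by_z = zscore_outliers(data)
by_modified_z = modified_zscore_outliers(data)
print(by_z, by_modified_z)
```

In this sample the two large values inflate the standard deviation enough that the plain Z-score flags nothing, while the modified Z-score flags both, which is exactly the kind of result that calls for inspecting the data rather than trusting either method blindly.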

To create multiple plots in one figure in Matplotlib, you would use the ___________ function.

  • heatmap
  • pairplot
  • subplot
  • violinplot
The 'subplot' function in Matplotlib is used to create multiple plots in a single figure. It allows you to arrange plots in a grid structure.
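
A minimal sketch of a 2 × 2 grid built with `plt.subplot(nrows, ncols, index)` is shown below. The curves and titles are made up, and the non-interactive "Agg" backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless environment, no window needed
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
curves = {  # hypothetical data for the four panels
    "linear": [v for v in x],
    "squared": [v ** 2 for v in x],
    "cubed": [v ** 3 for v in x],
    "constant": [1 for _ in x],
}

fig = plt.figure(figsize=(8, 6))
for i, (title, ys) in enumerate(curves.items(), start=1):
    ax = plt.subplot(2, 2, i)  # 2 rows, 2 columns, i-th cell (1-based)
    ax.plot(x, ys)
    ax.set_title(title)
fig.tight_layout()
```

Calling `fig.savefig(...)` or `plt.show()` afterwards would write or display the figure.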