Why is listwise deletion not recommended when the data missingness is systematic or 'not at random'?

  • It can cause overfitting
  • It can introduce bias
  • It can introduce random noise
  • It can lead to underfitting
Listwise deletion is not recommended when the data missingness is systematic or 'not at random' because it can introduce bias. If the chance that a value is missing depends on the value itself or on other unobserved factors, the rows that survive deletion no longer represent the full population, so results computed from them can be biased or misleading.
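
A minimal sketch of this effect, using only the Python standard library. The income figures, the 60,000 threshold, and the missingness probabilities are all made-up assumptions chosen for illustration; the helper name `observe` is hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 incomes drawn from a normal distribution.
incomes = [random.gauss(50_000, 15_000) for _ in range(10_000)]

def observe(value):
    """Return None (missing) with a probability that depends on the value itself."""
    # Assumption: high incomes are far more likely to go unreported.
    p_missing = 0.6 if value > 60_000 else 0.05
    return None if random.random() < p_missing else value

observed = [observe(v) for v in incomes]

# Listwise deletion: keep only the rows without a missing value.
complete_cases = [v for v in observed if v is not None]

true_mean = statistics.mean(incomes)
deleted_mean = statistics.mean(complete_cases)

print(f"rows kept: {len(complete_cases)} of {len(incomes)}")
print(f"true mean: {true_mean:,.0f}")
print(f"mean after listwise deletion: {deleted_mean:,.0f}")
```

Because the dropped rows are mostly high incomes, the mean of the complete cases falls below the true mean, and collecting more data with the same missingness pattern would not fix it.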

Suppose you have a regression model that is underfitting. On investigation, you discover that rows with missing data were dropped instead of imputed. What might be the reason for underfitting in this context?

  • The model didn't have enough data to learn from.
  • The model was over-regularized.
  • The model's complexity was too low.
  • The model's hyperparameters were not optimized.
Dropping missing data can significantly reduce the size of the training set. If much of the data is discarded, the model may not have enough data to learn the underlying patterns, leading to underfitting.

How would the mean change if an additional number far away from the current mean were added to the dataset?

  • It would always decrease
  • It would always increase
  • It would increase or decrease depending on the value
  • It would not change
Adding a number far from the current mean would either increase or decrease the mean, depending on the value. If the added number is greater than the current mean, the mean will increase; if it is less, the mean will decrease. This illustrates how sensitive the mean is to outliers or extreme values.
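
A quick sketch with a small made-up dataset shows the direction of the shift. The median is included only as a contrast; it is not part of the question.

```python
import statistics

data = [10, 12, 11, 13, 12]  # hypothetical values

base_mean = statistics.mean(data)
mean_high = statistics.mean(data + [100])   # extreme value above the mean
mean_low = statistics.mean(data + [-100])   # extreme value below the mean

# The median barely reacts to the same extreme value.
base_median = statistics.median(data)
median_high = statistics.median(data + [100])

print(f"mean: {base_mean}, with +100: {mean_high:.2f}, with -100: {mean_low:.2f}")
print(f"median: {base_median}, with +100: {median_high}")
```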

You've received feedback that your box plots are not providing a clear visual of the distribution of your dataset. What alternative plot could you use and why?

  • Bar graph
  • Line graph
  • Scatter plot
  • Violin plot
If box plots are not providing a clear visualization, a violin plot is a good alternative. Violin plots are similar to box plots but also show the probability density of the data at different values, which gives a more detailed picture of the distribution (for example, whether it has more than one peak).

Incorrectly filling missing values in a feature can disproportionately increase the feature's ________, affecting model interpretability.

  • importance
  • precision
  • recall
  • weight
If missing values in a feature are filled incorrectly, it can disproportionately increase the feature's importance, potentially causing other important features to be overlooked and making the model difficult to interpret.

Imagine you have a dataset where only 5% of the rows contain missing values. What potential problems could arise if you choose to use listwise deletion?

  • It could cause all of the above problems
  • It may distort the original data distribution
  • It may lead to a significant reduction in sample size
  • It might introduce selection bias
Even though only 5% of the rows contain missing values, listwise deletion could still cause all of these problems. The loss of sample size matters most when the dataset is small to begin with, and if the missing rows are not a random subset, removing them can distort the original data distribution and introduce selection bias. These problems may affect the statistical power and the representativeness of the analysis.

How can outlier handling techniques potentially impact the accuracy of a predictive model?

  • They can decrease the accuracy by removing important information
  • They can either increase or decrease the accuracy depending on the dataset and model
  • They can increase the accuracy by reducing noise
Outlier handling techniques can either increase or decrease the accuracy of a predictive model depending on the dataset and model. Properly handled outliers can improve model accuracy, but incorrectly handled outliers or the removal of important information can decrease model accuracy.
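
As one concrete example of an outlier handling technique, the sketch below applies the common 1.5 × IQR fence. This rule is an assumption for illustration (the question does not name a specific technique), and the values are made up. Whether dropping the flagged point helps or hurts a model depends on whether it is noise or genuine information.

```python
import statistics

values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]  # hypothetical feature values

# Quartiles via the standard library; n=4 returns the three cut points.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]
kept = [v for v in values if lower_fence <= v <= upper_fence]

print(f"flagged as outliers: {outliers}")
print(f"mean before: {statistics.mean(values):.2f}, after: {statistics.mean(kept):.2f}")
```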

Replacing missing data with a constant value can introduce ________ in a machine learning model.

  • bias
  • noise
  • precision
  • variance
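Filling missing values with a constant ignores the variability of the data: every imputed row receives the same value regardless of what the true value would have been. This shifts the feature's distribution toward the constant and shrinks its variance, which introduces bias and can mislead the model during training.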
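
As a rough illustration of what a constant fill does to a feature, the sketch below uses a small made-up list of values and compares summary statistics before and after filling with 0 and with the observed mean. All numbers are hypothetical.

```python
import statistics

observed = [4.0, 5.0, None, 6.0, None, 5.0, 7.0, None, 6.0, 5.0]
present = [v for v in observed if v is not None]

mean_present = statistics.mean(present)
spread_present = statistics.pstdev(present)

# Fill with an arbitrary constant (0): the mean is dragged toward the constant.
filled_zero = [0.0 if v is None else v for v in observed]
mean_zero = statistics.mean(filled_zero)

# Fill with the observed mean: the mean is preserved but the spread shrinks.
filled_mean = [mean_present if v is None else v for v in observed]
spread_mean_fill = statistics.pstdev(filled_mean)

print(f"mean of observed values: {mean_present:.2f}, after 0-fill: {mean_zero:.2f}")
print(f"std of observed values: {spread_present:.2f}, after mean-fill: {spread_mean_fill:.2f}")
```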

How can a logarithmic transformation of the axes affect the identification of outliers in a scatter plot?

  • It can convert outliers to normal data points
  • It can hide outliers
  • It can highlight outliers
  • It does not affect outlier identification
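A logarithmic transformation of the axes changes which points look extreme. It stretches the scale at small values and compresses it at large values, so in heavily skewed data it spreads out the bulk of the points and can highlight unusual values that were squashed together on a linear scale, such as unusually small ones. The same compression pulls very large values closer to the rest, so extreme high values can become less prominent.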
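
To see what a log scale does to the spacing between points, the sketch below measures how isolated the smallest and largest points are, as a fraction of the axis range, on a linear and on a log10 scale. The values and the helper `edge_gaps` are hypothetical.

```python
import math

values = [0.001, 8, 10, 12, 15, 10_000]  # one very small and one very large point

def edge_gaps(xs):
    """Return (low_gap, high_gap): distance from each extreme point to its
    nearest neighbour, as a fraction of the full axis range."""
    xs = sorted(xs)
    span = xs[-1] - xs[0]
    return (xs[1] - xs[0]) / span, (xs[-1] - xs[-2]) / span

lin_low, lin_high = edge_gaps(values)
log_low, log_high = edge_gaps([math.log10(v) for v in values])

print(f"linear axis: low gap {lin_low:.4f}, high gap {lin_high:.4f}")
print(f"log axis:    low gap {log_low:.4f}, high gap {log_high:.4f}")
```

On the linear axis the smallest point is indistinguishable from the bulk, while on the log axis it is clearly separated; the largest point moves in the opposite direction.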

In EDA, "data wrangling" involves ________.

  • Building predictive models
  • Cleaning and transforming raw data
  • Performing statistical tests
  • Visualizing data
In EDA, "data wrangling" involves cleaning and transforming raw data into a more suitable format for analysis. This could include handling missing values, dealing with outliers, encoding categorical variables, and other data preprocessing steps.
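
A small sketch of these steps on made-up records, using plain Python so it runs without extra libraries (in practice a library such as pandas would usually be used). The field names and the helper `to_number` are hypothetical.

```python
raw_rows = [
    {"age": " 34", "city": "Paris ", "income": "52000"},
    {"age": "", "city": "berlin", "income": "61000"},
    {"age": "29", "city": "Paris", "income": "n/a"},
]

def to_number(text):
    """Convert a raw string to float, or None when it is blank or not numeric."""
    try:
        return float(text.strip())
    except ValueError:
        return None

# Cleaning: trim whitespace, normalise capitalisation, parse numbers,
# and mark unparseable entries as missing (None).
cleaned = [
    {
        "age": to_number(row["age"]),
        "city": row["city"].strip().title(),
        "income": to_number(row["income"]),
    }
    for row in raw_rows
]

# Transforming: one-hot encode the categorical "city" column.
cities = sorted({row["city"] for row in cleaned})
encoded = [
    {**row, **{f"city_{c}": int(row["city"] == c) for c in cities}}
    for row in cleaned
]

for row in encoded:
    print(row)
```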

Which Python visualization library would be most suited to creating a complex, layered, "small multiple" style plot?

  • Bokeh
  • Matplotlib
  • Plotly
  • Seaborn
Seaborn is particularly well-suited for creating complex, layered "small multiple" style plots. The 'FacetGrid' class in Seaborn makes this type of plot easy to create.

How could the handling of missing data influence the interpretability of a machine learning model?

  • Depends on the model used.
  • Does not impact model interpretability.
  • Makes the model less interpretable.
  • Makes the model more interpretable.
If missing data are handled incorrectly, the model may learn distorted patterns and make inaccurate predictions, which makes its decisions harder to explain and hence reduces its interpretability.