Why is listwise deletion not recommended when the data missingness is systematic or 'not at random'?

It can cause overfitting
It can introduce bias
It can introduce random noise
It can lead to underfitting

Listwise deletion is not recommended when the data missingness is systematic or 'not at random' because it can introduce bias. If whether a value is missing depends on unobserved factors, or on the missing value itself, dropping incomplete rows excludes certain types of observations disproportionately. The remaining sample no longer represents the population, so the results can be biased or misleading.
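To make the bias concrete, here is a minimal sketch in Python. The data and the missingness rule are hypothetical: the integers 1 through 100 stand in for some measurement, and every value above 80 is assumed to be unrecorded, so missingness depends on the value itself.

```python
from statistics import mean

# Hypothetical measurements: the integers 1..100.
true_values = list(range(1, 101))

# Assumed "not at random" rule: values above 80 were never recorded.
recorded = [v if v <= 80 else None for v in true_values]

# Listwise deletion keeps only the rows without a missing value.
complete_cases = [v for v in recorded if v is not None]

true_mean = mean(true_values)        # mean of the full data
deleted_mean = mean(complete_cases)  # mean after listwise deletion

print(f"true mean: {true_mean}, mean after deletion: {deleted_mean}")
```

Because only high values are lost, the mean of the remaining rows sits below the true mean. No amount of extra data collected under the same rule would correct this.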


Suppose you have a regression model that is underfitting. On investigation, you discover that rows with missing data were dropped instead of imputed. What might be the reason for underfitting in this context?

The model didn't have enough data to learn from.
The model was over-regularized.
The model's complexity was too low.
The model's hyperparameters were not optimized.

Dropping missing data can significantly reduce the size of the training set. If much of the data is discarded, the model may not have enough data to learn the underlying patterns, leading to underfitting.
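A rough calculation shows how quickly dropped rows add up. The sketch below assumes, purely for illustration, that every column is missing independently at the same rate; real datasets rarely behave this neatly, and the function name is hypothetical.

```python
def complete_row_fraction(missing_rate: float, n_columns: int) -> float:
    """Expected share of rows with no missing value in any column,
    assuming each column is missing independently at the same rate."""
    return (1.0 - missing_rate) ** n_columns


for n_columns in (1, 5, 10, 20):
    kept = complete_row_fraction(0.05, n_columns)
    print(f"{n_columns:2d} columns -> about {kept:.0%} of rows kept")
```

Under these assumptions, a 5% missing rate per column across 10 columns leaves roughly 60% of the rows, so the model trains on far less data than the raw row count suggests.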


How would the mean change if an additional number far away from the current mean were added to the dataset?

It would always decrease
It would always increase
It would increase or decrease depending on the value
It would not change

Adding a number far from the current mean would either increase or decrease the mean, depending on the value. If the added number is greater than the current mean, the mean will increase; if it is less, the mean will decrease. This illustrates how sensitive the mean is to outliers or extreme values.
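A quick check with made-up numbers shows the effect in both directions. The median is included only for comparison.

```python
from statistics import mean, median

data = [10, 11, 12, 13, 14]   # hypothetical values: mean 12, median 12

with_high = data + [100]      # add a value far above the mean
with_low = data + [-100]      # add a value far below the mean

print("mean:  ", mean(data), mean(with_high), mean(with_low))
print("median:", median(data), median(with_high), median(with_low))
```

The mean moves by more than ten units in the direction of the new value, while the median shifts by only half a unit.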


You've received feedback that your box plots are not providing a clear visual of the distribution of your dataset. What alternative plot could you use and why?

Bar graph
Line graph
Scatter plot
Violin plot

If box plots are not providing a clear visualization, violin plots are a good alternative. A violin plot resembles a box plot but also shows the estimated probability density of the data at different values, so features that a box plot hides, such as multiple peaks, become visible. This gives a more detailed picture of the distribution.
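The difference is easiest to see on data with two peaks. The sketch below uses Matplotlib's `boxplot` and `violinplot` on a hypothetical bimodal sample; the sample, the seed, and the output file name are all assumptions made for illustration.

```python
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

random.seed(0)
# Hypothetical bimodal sample: two clusters centred at 2 and 8.
sample = ([random.gauss(2, 0.5) for _ in range(200)]
          + [random.gauss(8, 0.5) for _ in range(200)])

fig, (ax_box, ax_violin) = plt.subplots(1, 2, sharey=True, figsize=(6, 3))
ax_box.boxplot(sample)
ax_box.set_title("Box plot")
violin_parts = ax_violin.violinplot(sample, showmedians=True)
ax_violin.set_title("Violin plot")
fig.savefig("box_vs_violin.png")
```

The box plot draws one wide box spanning both clusters, while the violin narrows in the middle and bulges at each cluster, which reveals the two peaks.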


Incorrectly filling missing values in a feature can disproportionately increase the feature's ________, affecting model interpretability.

importance
precision
recall
weight

If missing values in a feature are filled incorrectly, it can disproportionately increase the feature's importance, potentially causing other important features to be overlooked and making the model difficult to interpret.


Imagine you have a dataset where only 5% of the rows contain missing values. What potential problems could arise if you choose to use listwise deletion?

It could cause all of the problems listed here
It may distort the original data distribution
It may lead to a significant reduction in sample size
It might introduce selection bias

Even though only 5% of the rows contain missing values, listwise deletion could still lead to a significant reduction in sample size, distort the original data distribution, and introduce selection bias. These problems may reduce the statistical power and the representativeness of the analysis.


How can outlier handling techniques potentially impact the accuracy of a predictive model?

They can decrease the accuracy by removing important information
They can either increase or decrease the accuracy depending on the dataset and model
They can increase the accuracy by reducing noise

Outlier handling techniques can either increase or decrease the accuracy of a predictive model depending on the dataset and model. Properly handled outliers can improve model accuracy, but incorrectly handled outliers or the removal of important information can decrease model accuracy.
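As one illustration, the sketch below applies the interquartile range (IQR) rule, a common outlier filter that the question itself does not name. The values are hypothetical, and the conventional 1.5 multiplier is an assumption.

```python
from statistics import mean, quantiles

# Hypothetical values; 95 lies far from the rest.
values = [10, 12, 11, 13, 12, 11, 95]

q1, _, q3 = quantiles(values, n=4)   # quartiles, default "exclusive" method
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

kept = [v for v in values if lower <= v <= upper]
flagged = [v for v in values if not lower <= v <= upper]

print(f"flagged: {flagged}")
print(f"mean before: {mean(values):.2f}, mean after: {mean(kept):.2f}")
```

Whether dropping the flagged value helps depends on what it is. If 95 is a recording error, removing it reduces noise. If it is a genuine but rare case, the model loses exactly the information it needs to predict such cases.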


Replacing missing data with a constant value can introduce ________ in a machine learning model.

bias
noise
precision
variance

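Filling missing values with a constant ignores the variability of the data: every imputed row receives the same value, whatever the true value would have been. These artificial values act as noise that can mislead the model during training, and they can also distort the feature's mean and variance.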
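A small sketch of how a constant fill changes a feature's summary statistics. The observed values and the number of missing rows are hypothetical, and two fill constants are compared: zero and the observed mean.

```python
from statistics import mean, pvariance

observed = [2, 4, 6, 8]   # hypothetical recorded values
n_missing = 2             # two further rows have no value

filled_with_zero = observed + [0] * n_missing
filled_with_mean = observed + [mean(observed)] * n_missing

for label, column in (("observed only", observed),
                      ("filled with 0", filled_with_zero),
                      ("filled with mean", filled_with_mean)):
    print(f"{label:17s} mean={mean(column):.2f} variance={pvariance(column):.2f}")
```

Filling with zero drags the mean down, while filling with the mean keeps the mean but shrinks the variance. In both cases the filled rows carry values that were never observed.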


How can a logarithmic transformation of the axes affect the identification of outliers in a scatter plot?

It can convert outliers to normal data points
It can hide outliers
It can highlight outliers
It does not affect outlier identification

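A logarithmic transformation of the axes can highlight outliers in a scatter plot because it changes the spacing of the scale: small values are spread apart and large values are pulled together. When most points are crowded into a narrow band on a linear scale, the log scale spreads them out, so a point that departs from the overall pattern, particularly one far below the bulk of the data, becomes easier to see. Very large values move closer to the rest of the data, so they can look less extreme than on a linear axis.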
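The effect depends on which side of the data the outlier sits. The sketch below measures, for two hypothetical datasets, how much of the axis range separates the outlier from its nearest neighbour on a linear scale and on a log10 scale. The `relative_gap` helper is an illustrative measure, not a standard statistic.

```python
import math


def relative_gap(values, outlier):
    """Distance from the outlier to its nearest neighbour,
    as a share of the full axis range."""
    others = [v for v in values if v != outlier]
    nearest = min(others, key=lambda v: abs(v - outlier))
    return abs(outlier - nearest) / (max(values) - min(values))


low_case = [0.001, 100, 200, 300]   # hypothetical: one value far below the bulk
high_case = [1, 2, 3, 4, 1000]      # hypothetical: one value far above the bulk

gaps = {}
for values, outlier in ((low_case, 0.001), (high_case, 1000)):
    logged = [math.log10(v) for v in values]
    gaps[outlier] = (relative_gap(values, outlier),
                     relative_gap(logged, math.log10(outlier)))
    linear_gap, log_gap = gaps[outlier]
    print(f"outlier {outlier}: linear gap {linear_gap:.2f}, log gap {log_gap:.2f}")
```

On the log scale the small outlier is separated from the rest by a much larger share of the axis, while the large outlier is separated by a smaller share than on the linear scale.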


In EDA, "data wrangling" involves ________.

Building predictive models
Cleaning and transforming raw data
Performing statistical tests
Visualizing data

In EDA, "data wrangling" involves cleaning and transforming raw data into a more suitable format for analysis. This could include handling missing values, dealing with outliers, encoding categorical variables, and other data preprocessing steps.
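The steps named above can be sketched in a few lines of plain Python. The records, the column names, and the choice of a median fill are all hypothetical; in practice a library such as pandas would usually do this work.

```python
from statistics import median

# Hypothetical raw records: inconsistent text, a missing value, numbers stored as strings.
raw_rows = [
    {"city": " Paris ", "temp_c": "21.5"},
    {"city": "paris",   "temp_c": ""},
    {"city": "OSLO",    "temp_c": "12.0"},
]

# 1. Clean: trim whitespace, normalise case, convert numbers, mark missing as None.
rows = [
    {"city": r["city"].strip().lower(),
     "temp_c": float(r["temp_c"]) if r["temp_c"] else None}
    for r in raw_rows
]

# 2. Handle missing values: fill with the median of the observed temperatures.
fill = median(r["temp_c"] for r in rows if r["temp_c"] is not None)
for r in rows:
    if r["temp_c"] is None:
        r["temp_c"] = fill

# 3. Encode the categorical column as integer codes.
codes = {city: i for i, city in enumerate(sorted({r["city"] for r in rows}))}
for r in rows:
    r["city_code"] = codes[r["city"]]

print(rows)
```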


Which Python visualization library would be most suited to creating a complex, layered, "small multiple" style plot?

Bokeh
Matplotlib
Plotly
Seaborn

Seaborn is particularly well-suited for creating complex, layered "small multiple" style plots. The 'FacetGrid' class in Seaborn makes this type of plot easy to create.


How could the handling of missing data influence the interpretability of a machine learning model?

Depends on the model used.
Does not impact model interpretability.
Makes the model less interpretable.
Makes the model more interpretable.

If missing data are handled incorrectly, the model may learn distorted patterns and make inaccurate predictions. Its decisions then become harder to explain, which reduces its interpretability.
