Improper handling of missing data can affect the ________ of a model, thereby impacting its ability to generalize on unseen data.

bias-variance tradeoff
overfitting
regularization
underfitting

Improper handling of missing data can adversely affect the bias-variance tradeoff of a model. This can lead to issues such as overfitting or underfitting, which impact the model's ability to generalize to unseen data.
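
As one illustration, here is a minimal sketch that assumes mean imputation as the naive handling strategy. Filling gaps with the observed mean leaves the mean unchanged but shrinks the feature's variance, which is one way careless handling can distort what a model sees. The values are hypothetical.

```python
import statistics

# Hypothetical feature values; None marks a missing entry.
values = [2.0, None, 4.0, None, 6.0, 8.0, None, 10.0]

observed = [v for v in values if v is not None]
mean_observed = statistics.mean(observed)

# Naive handling (an assumption for this sketch): fill every gap with the mean.
imputed = [mean_observed if v is None else v for v in values]

var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)

print(f"variance of observed values: {var_observed:.2f}")
print(f"variance after mean imputation: {var_imputed:.2f}")
```

The imputed column looks less variable than the real data, so anything estimated from it inherits that understated spread.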


In which situations would you prefer a heatmap over a scatter plot?

When dealing with a single variable
When the data is non-numeric
When visualizing a time series
When visualizing the correlation between multiple variables

You would prefer a heatmap over a scatter plot when visualizing the correlation between multiple variables. A heatmap shows every pairwise correlation in a single color-coded grid, whereas a scatter plot shows only one pair of variables at a time, so the heatmap is particularly effective when the dataset contains many variables.
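
The grid behind a correlation heatmap is simply a matrix of pairwise correlation coefficients. The sketch below computes that matrix with the Pearson coefficient; the dataset and variable names are hypothetical, and a plotting library would then map each cell to a color.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical dataset with three numeric variables.
data = {
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 58, 69, 77, 90],
    "commute": [30, 10, 25, 5, 20],
}

names = list(data)
# One row and one column per variable: the values a heatmap would color.
corr_matrix = [[pearson(data[a], data[b]) for b in names] for a in names]

for name, row in zip(names, corr_matrix):
    print(f"{name:>8}", " ".join(f"{r:6.2f}" for r in row))
```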


Which of the following types of analysis makes the fewest assumptions about data: EDA, CDA, or Predictive Modeling?

CDA
EDA
Predictive Modeling
They all make the same number of assumptions.

EDA (exploratory data analysis) makes the fewest assumptions about the data. While CDA (confirmatory data analysis) and Predictive Modeling typically require some assumptions about the data's distribution or the relationships between variables, EDA is a more open-ended exploration of the data's structure and patterns.


You have a scatter plot with a strong positive correlation, but a few points are far from the correlation line. What might these points represent?

Correlated data points
False positives
Normal data points
Outliers

In a scatter plot, points that are far away from the correlation line often represent outliers.
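
A rough numerical version of "far from the line" is a large residual from a fitted line. The sketch below fits an ordinary least-squares line and flags points whose residual exceeds two standard deviations of all residuals. The data and the two-standard-deviation cutoff are assumptions chosen for illustration, not a universal rule.

```python
import statistics

# Hypothetical data following roughly y = 2x, with one point far off the line.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 20.0, 14.1, 16.0, 18.2, 19.9]

# Ordinary least-squares fit of y = intercept + slope * x.
mean_x, mean_y = statistics.mean(x), statistics.mean(y)
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual: vertical distance of each point from the fitted line.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Assumed cutoff for this sketch: two standard deviations of the residuals.
threshold = 2 * statistics.pstdev(residuals)
outliers = [(xi, yi) for xi, yi, r in zip(x, y, residuals) if abs(r) > threshold]

print(outliers)
```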


You create a histogram of a dataset and notice that the frequency count is very high on the far left of the distribution but drops significantly after that. What can be inferred from this?

Data has a negative skewness
Data has a positive skewness
Data is evenly distributed
Data is normally distributed

If the frequency count in a histogram is very high on the far left but drops significantly after that, the long tail extends to the right, which indicates that the data has a positive skewness.
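
Skewness can also be checked numerically rather than read off a histogram. The sketch below assumes the moment-based (Fisher-Pearson) coefficient: a long right tail gives a positive value and a long left tail a negative one. Both samples are hypothetical.

```python
def skewness(values):
    """Moment-based (Fisher-Pearson) skewness: g1 = m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical sample: most values are small, one is large (long right tail).
right_tailed = [1, 1, 1, 2, 2, 2, 3, 3, 4, 12]
# Mirror image: most values are large, one is small (long left tail).
left_tailed = [13 - v for v in right_tailed]

print(f"right-tailed sample: {skewness(right_tailed):+.2f}")
print(f"left-tailed sample:  {skewness(left_tailed):+.2f}")
```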


A data analyst needs to demonstrate the occurrence of outliers in a dataset using a plot. Which plot type would you recommend and why?

Bar graph
Box plot
Line graph
Scatter plot

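A box plot is ideal for demonstrating outliers in a dataset. The box spans the interquartile range (IQR), and the 'whiskers' cover the bulk of the remaining data; under the common convention they extend to the most extreme values within 1.5 × IQR of the quartiles. Any data point that falls outside the whiskers is drawn individually as an outlier.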
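
The rule a box plot typically uses to mark outliers can be sketched directly. The example assumes the common 1.5 × IQR convention and Python's default quartile method; the data is hypothetical, and other quartile conventions may give slightly different fences.

```python
import statistics

# Hypothetical measurements with one unusually large value.
data = [12, 14, 14, 15, 16, 17, 18, 19, 20, 45]

# Quartiles split the sorted data into four equal-probability parts.
q1, _median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Fences under the assumed 1.5 * IQR convention.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything beyond the fences would be drawn as an individual point.
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print(f"fences: [{lower_fence}, {upper_fence}], outliers: {outliers}")
```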


The process of combining highly correlated variables into one is called _________.

Data Aggregation
Principal Component Analysis (PCA)
Standardization
Variance Inflation

When dealing with multicollinearity, one approach is to combine the correlated variables into one using a technique such as Principal Component Analysis (PCA). PCA creates new uncorrelated variables (principal components), and the first few typically capture most of the variance of the original variables.
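
For two variables, the idea can be sketched without any library by using the closed-form eigen-decomposition of the 2x2 covariance matrix. The data is hypothetical, and the sketch assumes both variables are already on comparable scales; in practice a library implementation would normally be used.

```python
import math

# Hypothetical pair of highly correlated variables.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
cx = [v - mean_x for v in x]  # centered x
cy = [v - mean_y for v in y]  # centered y

# Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
a = sum(v * v for v in cx) / (n - 1)
c = sum(v * v for v in cy) / (n - 1)
b = sum(u * v for u, v in zip(cx, cy)) / (n - 1)

# Closed-form eigenvalues of a symmetric 2x2 matrix.
half_trace = (a + c) / 2
radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lambda1, lambda2 = half_trace + radius, half_trace - radius

# Unit eigenvector for the largest eigenvalue (valid because b != 0 here).
norm = math.hypot(b, lambda1 - a)
w = (b / norm, (lambda1 - a) / norm)

# First principal component: one combined variable in place of x and y.
pc1 = [w[0] * u + w[1] * v for u, v in zip(cx, cy)]
explained = lambda1 / (lambda1 + lambda2)

print(f"share of variance captured by the first component: {explained:.3f}")
```

Here a single combined variable retains nearly all of the variance of the original pair, which is why dropping the second component loses little.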


The ______ of a scatter plot may indicate the presence of outliers in the dataset.

correlation
scatter
slope
trend line

In a scatter plot, the scattering or spread of data points can help identify outliers. Points that are distant from the main concentration of data can indicate potential outliers.


In what scenario would you choose standardization over Min-Max scaling?

When the algorithm requires features to be on the same scale and the data is normally distributed
When the maximum and minimum values are unknown
When there are no outliers in the data
When you need to normalize the distribution

You would choose standardization over Min-Max scaling when the algorithm requires features to be on the same scale and the data is normally distributed. Standardization rescales each feature to zero mean and unit variance without bounding it to a fixed range such as [0, 1], which suits algorithms that do not require input features to be within a certain range.
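
The difference between the two scalers is easiest to see side by side. This is a minimal sketch with hypothetical feature values and hand-written helper functions, not a particular library's API.

```python
import statistics

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max_scale(values):
    """Min-Max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical feature values.
feature = [10.0, 12.0, 14.0, 16.0, 18.0]

z_scores = standardize(feature)  # centered on 0, not confined to [0, 1]
scaled = min_max_scale(feature)  # always inside [0, 1]

print("standardized:", [round(z, 3) for z in z_scores])
print("min-max:     ", scaled)
```

The standardized values include negatives and have no fixed upper limit, while the Min-Max values are confined to the interval from 0 to 1.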


What is the key visual feature of a scatter plot that may indicate the presence of outliers?

Color coding
Legends
Points far away from the general grouping
Trend line

Points that are far away from the general grouping in a scatter plot may indicate the presence of outliers.
