When would it be appropriate to use 'transformation' as an outlier handling method?
- When the outliers are a result of data duplication
- When the outliers are errors in data collection
- When the outliers are extreme but legitimate data points
- When the outliers do not significantly impact the data analysis
Transformation is appropriate as an outlier handling method when the outliers are extreme but legitimate data points that carry valuable information, because transforming the variable compresses their scale without discarding them.
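As a rough illustration (the variable name and values below are made up), a log transform keeps an extreme but legitimate point in the data while shrinking its influence:

```python
# Sketch: a log transform compresses an extreme but legitimate value
# (assumes a strictly positive, right-skewed feature; values are illustrative).
import numpy as np

incomes = np.array([30_000, 45_000, 52_000, 61_000, 2_500_000])  # one legitimate extreme value
log_incomes = np.log1p(incomes)  # log(1 + x) keeps the point but reduces its leverage

print(log_incomes.round(2))
```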
Suppose you are comparing the dispersion of two different data sets. One has a higher range, but a lower IQR than the other. What might this tell you about each data set?
- The one with the higher range has more outliers
- The one with the higher range has more variability
- The one with the lower IQR has more variability
- The one with the lower IQR is more skewed
If one dataset has a higher range but a lower IQR than the other, it suggests that the one with the higher range has more outliers. The range is sensitive to extreme values, while the IQR focuses on the middle 50% of the data and is largely unaffected by outliers.
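A small sketch with made-up numbers shows how this can happen:

```python
# Sketch comparing range and IQR on two illustrative datasets.
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 100])       # tight middle, one extreme value
b = np.array([1, 10, 20, 30, 40, 50, 60, 70])  # spread-out middle, no extreme value

for name, x in [("a", a), ("b", b)]:
    data_range = x.max() - x.min()
    q1, q3 = np.percentile(x, [25, 75])
    print(f"{name}: range={data_range}, IQR={q3 - q1}")
# Dataset a has the higher range (driven by the outlier) but the lower IQR.
```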
What is a primary assumption when using regression imputation?
- All data is normally distributed
- Missing data is missing completely at random (MCAR)
- Missing values are negligible
- The relationship between variables is linear
A primary assumption when using regression imputation is that the relationship between variables is linear. Regression imputation fits a regression model to predict the missing values, and standard linear regression assumes a linear relationship between the predictor and response variables.
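A minimal sketch of regression imputation (the column names and values are illustrative): fit a linear model on the complete rows, then predict the missing entries.

```python
# Sketch: regression imputation of a numeric column using a linear model.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "hours_studied": [2, 4, 6, 8, 10, 12],
    "exam_score":    [55, 60, np.nan, 75, np.nan, 90],
})

known = df.dropna(subset=["exam_score"])      # rows with the target present
missing = df[df["exam_score"].isna()]          # rows to impute

model = LinearRegression().fit(known[["hours_studied"]], known["exam_score"])
df.loc[df["exam_score"].isna(), "exam_score"] = model.predict(missing[["hours_studied"]])

print(df)
```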
You are working on a dataset and found that the model performance is poor. On further inspection, you found some data points that are far from the rest. What could be a possible reason for the poor performance of your model?
- Outliers
- Overfitting
- Underfitting
The poor performance of the model might be due to outliers in the dataset. Outliers can have a significant impact on machine learning models because many algorithms (least-squares regression, for example) try to minimize error on every point, so a few extreme values can pull the fitted model away from the bulk of the data.
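One common way to spot such points is the 1.5 × IQR rule; the values below are made up for illustration:

```python
# Sketch: flag points far from the rest using the 1.5 * IQR rule.
import numpy as np

values = np.array([10, 12, 11, 13, 12, 14, 95, 11, 13, 12])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print("outliers:", outliers)
```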
As a data scientist, you've realized that your dataset contains missing values. How would you handle this situation as part of your EDA process?
- Always replace missing values with the mean or median
- Choose an appropriate imputation method depending on the nature of the data and the type of missingness
- Ignore the missing values and proceed with analysis
- Remove all instances with missing values
Handling missing values is an important part of the EDA process. The method used to handle them depends on the nature of the data and the type of missingness (MCAR, MAR, or NMAR). Simple mean/median/mode imputation is generally only safe when data are MCAR; for MAR (and especially NMAR) data, more advanced methods such as regression imputation, multiple imputation, or model-based approaches are preferable.
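As a sketch of two of these options (the DataFrame is illustrative), scikit-learn offers both a simple and a model-based imputer:

```python
# Sketch: simple mean imputation vs. a model-based iterative imputer.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to use IterativeImputer)
from sklearn.impute import SimpleImputer, IterativeImputer

df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0], "y": [2.1, 4.2, 6.1, np.nan]})

mean_filled = SimpleImputer(strategy="mean").fit_transform(df)
model_filled = IterativeImputer(random_state=0).fit_transform(df)

print(mean_filled)
print(model_filled)
```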
You've created a histogram of your data and you notice a few bars standing alone far from the main distribution. What might this suggest?
- Data is evenly distributed
- Normal distribution
- Outliers
- Skewness
In a histogram, bars that stand alone far from the main distribution often suggest the presence of outliers.
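A quick sketch with synthetic data of the kind of histogram described, where a couple of isolated bars sit far to the right of the main bulk:

```python
# Sketch: histogram of mostly normal data plus two far-away points.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 5, 500), [120, 125]])  # two isolated extreme values

plt.hist(data, bins=40)
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()
```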
You have a dataset where the relationships between variables are not linear. Which correlation method is better to use and why?
- Covariance
- Kendall's Tau
- Pearson's correlation coefficient
- Spearman's correlation coefficient
For non-linear relationships between variables, Spearman's correlation coefficient would be a better choice. This is because Spearman's correlation measures the monotonic relationship between two variables and does not require the relationship to be linear.
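A short sketch (synthetic data, y = exp(x)) showing Spearman capturing a monotonic but non-linear relationship that Pearson understates:

```python
# Sketch: Pearson vs. Spearman on a monotonic, strongly non-linear relationship.
import numpy as np
from scipy import stats

x = np.linspace(0, 10, 50)
y = np.exp(x)  # monotonic but far from linear

r_pearson, _ = stats.pearsonr(x, y)
r_spearman, _ = stats.spearmanr(x, y)
print(f"Pearson:  {r_pearson:.3f}")   # well below 1
print(f"Spearman: {r_spearman:.3f}")  # exactly 1, since the relationship is perfectly monotonic
```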
Which of the following is a type of data distribution?
- Age Bracket Distribution
- Binomial Distribution
- Household Distribution
- Sales Distribution
The Binomial Distribution is a type of probability distribution that describes the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success.
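For example, with n = 10 trials and success probability p = 0.3 (parameters chosen for illustration), SciPy gives the probabilities directly:

```python
# Sketch: binomial probabilities for n = 10 trials with p = 0.3.
from scipy import stats

n, p = 10, 0.3
print(stats.binom.pmf(3, n, p))  # P(exactly 3 successes)
print(stats.binom.cdf(3, n, p))  # P(at most 3 successes)
```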
How does Robust scaling minimize the effect of outliers?
- By ignoring them during the scaling process
- By removing the outliers
- By scaling based on the median and interquartile range instead of mean and variance
- By transforming the outliers
Robust scaling minimizes the effect of outliers by scaling based on the median and the interquartile range, instead of the mean and variance used by standardization. The interquartile range is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile). Because the median and interquartile range are much less affected by extreme values than the mean and variance, this method is robust to outliers.
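A small sketch (values made up) comparing scikit-learn's StandardScaler and RobustScaler on data with one extreme value:

```python
# Sketch: standardization vs. robust scaling in the presence of an outlier.
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

print(StandardScaler().fit_transform(X).ravel().round(2))  # non-outliers squashed together
print(RobustScaler().fit_transform(X).ravel().round(2))    # centered on median, scaled by IQR
```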
Which measure of dispersion is defined as the difference between the largest and smallest values in a data set?
- Interquartile Range (IQR)
- Range
- Standard Deviation
- Variance
The "Range" is the measure of dispersion that is defined as the difference between the largest and smallest values in a data set.