While analyzing a dataset using a box plot, you notice that there are several data points plotted as circles. What might these circles represent?

Data within the interquartile range
Data within the whiskers
Median values
Outliers

In a box plot, data points plotted individually as circles usually represent outliers: values that fall beyond the ends of the whiskers.
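As a rough illustration of how a point ends up being drawn as a circle, the sketch below assumes the common 1.5 × IQR convention (Tukey's rule) for placing the whisker limits. The function name `tukey_outliers`, the multiplier `k`, and the sample data are hypothetical, and plotting tools may use a different rule.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return values lying more than k * IQR beyond the quartiles (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample: nine clustered values and one extreme value.
data = [12, 13, 14, 15, 15, 16, 17, 18, 19, 45]
outliers = tukey_outliers(data)
print(outliers)  # [45]
```

Here the quartiles are 14.25 and 17.75, so the limits are 9.0 and 23.0, and only 45 falls outside them.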

What is the difference between skewness and kurtosis?

Skewness measures asymmetry, kurtosis measures variability.
Skewness measures center, kurtosis measures spread.
Skewness measures spread, kurtosis measures center.
Skewness measures symmetry, kurtosis measures tailedness.

Skewness measures the asymmetry of a data distribution around its mean, whereas kurtosis measures its "tailedness", that is, how heavy the tails of the distribution are. In short, skewness is about symmetry and kurtosis is about the tails.
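One way to make the two ideas concrete is to compute them as standardized moments. The sketch below assumes the population (biased) definitions: skewness as the third standardized moment and excess kurtosis as the fourth standardized moment minus 3. The function names and sample lists are hypothetical, and statistical libraries may apply bias corrections that give slightly different numbers.

```python
import statistics

def skewness(values):
    """Third standardized moment (population form, assumed definition)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)

def excess_kurtosis(values):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((v - mu) / sigma) ** 4 for v in values) / len(values) - 3

symmetric = [1, 2, 3, 4, 5]          # mirror image around the mean
right_tailed = [1, 1, 2, 2, 3, 20]   # one large value stretches the right tail

print(skewness(symmetric))       # 0 for a symmetric sample
print(skewness(right_tailed))    # positive: the long tail is on the right
print(excess_kurtosis(symmetric))
```

A positive skewness points to a longer right tail and a negative one to a longer left tail, while a larger kurtosis indicates heavier tails.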

Imagine you are examining a correlation matrix and find that two variables have a correlation coefficient close to -1. What does this imply about the relationship between these two variables?

Their relationship is random
They are unrelated
They have a strong negative relationship
They have a weak positive relationship

A correlation coefficient close to -1 implies a strong negative relationship between the two variables: as one variable increases, the other tends to decrease, and vice versa.
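To see what a coefficient near -1 looks like numerically, here is a minimal sketch of the Pearson correlation coefficient. The helper `pearson_r` and the two data series are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: scores fall steadily as the first variable rises.
hours_of_tv = [1, 2, 3, 4, 5]
test_score = [92, 85, 80, 71, 65]

r = pearson_r(hours_of_tv, test_score)
print(round(r, 3))  # -0.997
```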

In the context of handling missing data, what does 'imputation' mean?

Adding artificial data
Deleting data points
Filling in missing data with substituted values
Transforming data

In the context of handling missing data, 'imputation' refers to the process of filling in missing data with substituted values. These values can be determined in a variety of ways, such as measures of central tendency (mean, median, mode), predictive models, or other techniques.
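The central-tendency strategies mentioned above can be sketched in a few lines. This example assumes missing entries are represented as `None`; the function `impute` and the `ages` list are hypothetical.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.fmean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Hypothetical column with two missing entries.
ages = [25, None, 31, 40, None, 28]

print(impute(ages, "mean"))    # [25, 31.0, 31, 40, 31.0, 28]
print(impute(ages, "median"))  # [25, 29.5, 31, 40, 29.5, 28]
```

Filling every gap with the same value is simple but shrinks the apparent spread of the data, which is one reason predictive approaches are sometimes preferred.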

Even after concluding, it's crucial to '______' effectively in the EDA process, as this step is where your findings are shared and potentially acted upon.

communicate
conclude
question
wrangle

The missing word is 'communicate'. Even after concluding, this step is where your findings are shared and potentially acted upon. Communication is not only about presenting the findings but also about making sure they are understood.

In the context of outlier detection, a data point whose Z-score has an absolute value greater than _______ is typically considered an outlier.

1.5
2
2.5
3

A data point with a Z-score above 3 or below -3 is usually considered an outlier. However, this threshold can vary depending on the context.
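A minimal sketch of this rule follows. It assumes the population standard deviation and the conventional threshold of 3; the function name `zscore_outliers` and the sample are hypothetical.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    return [v for v in values if abs((v - mu) / sigma) > threshold]

# Hypothetical sample: twenty values near 50 plus one extreme value.
readings = [48, 49, 50, 51, 52] * 4 + [95]

print(zscore_outliers(readings))  # [95]
```

Passing a different `threshold` (for example 2.5) is how the context-dependent cut-off mentioned above would be applied.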

How does Spearman's correlation handle ties compared to Kendall's Tau?

It doesn't handle ties
It handles ties better than Kendall's Tau
It handles ties worse than Kendall's Tau
The method of handling ties is the same

Spearman's correlation coefficient handles ties worse than Kendall's Tau. Both are rank correlation coefficients, but Kendall's Tau (in its tau-b form, which adjusts for ties explicitly) is generally considered better at handling them. Spearman's correlation handles ties by assigning each tied value the mean of the ranks the tied group would have received if the values were not tied.
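The mean-rank treatment of ties described here can be sketched as follows. The function `average_ranks` and the `scores` list are hypothetical, and this shows only the ranking step, not the full Spearman computation.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend `end` over the run of positions holding the same value.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

# Three values tie for ranks 2, 3, and 4, so each receives rank 3.
scores = [10, 20, 20, 30, 20]
print(average_ranks(scores))  # [1.0, 3.0, 3.0, 5.0, 3.0]
```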

Which of the following graphs can help identify outliers in a univariate dataset?

Bar Chart
Box Plot
Line Graph
Pie Chart

A box plot can help identify outliers in a univariate dataset because it draws values that fall beyond the whiskers as individual points.

You are given a dataset reporting the salaries of a company's employees. The CEO's salary is significantly higher than everyone else's. Which measure of central tendency would best represent the typical salary?

Mean
Median
Mode
None would be representative

The median would be a more representative measure of the typical salary. The CEO's salary is an outlier that pulls the mean upward, whereas the median, the middle value of the sorted data, is barely affected by it.
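A quick numerical check makes the difference visible. The salary figures below are invented for illustration.

```python
import statistics

# Hypothetical salaries: six employees and one CEO.
salaries = [42_000, 45_000, 48_000, 50_000, 52_000, 55_000, 2_000_000]

mean_salary = statistics.fmean(salaries)
median_salary = statistics.median(salaries)

print(round(mean_salary))  # 327429, higher than every non-CEO salary
print(median_salary)       # 50000, the middle value of the sorted list
```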

A teacher is analyzing test scores and finds that the distribution is bimodal, with one peak at 70 and another at 90. Which measure of central tendency might not be the best choice in this situation, and why?

Mean, because it doesn't reflect the peaks
Median, because it doesn't reflect the bimodality
Mode, because there are two peaks
None, because all are suitable

The mean might not be the best choice in this situation because it does not reflect the two peaks. It gives a single central value, likely somewhere between 70 and 90 where few scores actually fall, so it does not represent either of the two distinct groups in a bimodal distribution.
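The effect can be sketched with a small made-up set of scores clustered around 70 and 90. The data are hypothetical; `statistics.multimode` is used to report both peaks.

```python
import statistics

# Hypothetical bimodal test scores: one cluster near 70, another near 90.
scores = [68, 69, 70, 70, 70, 71, 72, 88, 89, 90, 90, 90, 91, 92]

mean_score = statistics.fmean(scores)
modes = statistics.multimode(scores)
near_mean = [s for s in scores if abs(s - mean_score) <= 5]

print(mean_score)  # 80.0
print(modes)       # [70, 90]
print(near_mean)   # [] : no student scored within 5 points of the mean
```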

You have a data set with a large number of outliers. Which measure of dispersion should you use to best describe the data set, and why?

Interquartile range (IQR) because it is robust to outliers
Range because it covers all values
Standard deviation because it gives the average spread
Variance because it squares the differences

When dealing with a large number of outliers in a data set, the interquartile range (IQR) is the most suitable measure of dispersion. It measures the spread between the 25th and 75th percentiles, so extreme values in the tails do not affect it.
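The robustness claim can be checked by replacing one value with an extreme one and comparing how the IQR and the standard deviation respond. The helper `iqr` and both lists are hypothetical, and the quartiles use the "inclusive" interpolation method from Python's `statistics` module, which is only one of several quartile conventions.

```python
import statistics

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

clean = [10, 11, 12, 13, 14, 15, 16, 17, 18]
with_outlier = clean[:-1] + [500]  # largest value replaced by an extreme one

print(iqr(clean), iqr(with_outlier))  # 4.0 4.0
print(statistics.stdev(clean))        # small spread
print(statistics.stdev(with_outlier)) # inflated by the single extreme value
```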

_____ data can only take certain values with gaps between them.

Continuous
Discrete
Nominal
Ordinal

Discrete data can only take certain values (usually integers) and there are gaps between the values.
