Which measure of central tendency divides a data set into two equal halves?

Mean
Median
Mode

The "Median" is the measure of central tendency that divides a data set into two equal halves. It is the middle score for a set of ordered data such that 50% of the scores are above it, and 50% are below it.
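As a small illustration of the explanation above, the sketch below uses Python's standard `statistics` module on a made-up list of scores (the numbers are hypothetical, chosen only for the example) and counts how many values fall on each side of the median.

```python
import statistics

# Hypothetical scores, invented for this illustration.
scores = [12, 7, 3, 25, 9, 18, 5]

# statistics.median sorts the data internally and returns the middle value.
median = statistics.median(scores)

# Count the values strictly below and strictly above the median.
below = [s for s in scores if s < median]
above = [s for s in scores if s > median]

print("median:", median)
print("values below:", len(below), "values above:", len(above))
```

With an odd number of distinct values, the same number of observations lies on each side of the median. With an even count, `statistics.median` averages the two middle values.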

__________ missing data occurs when the probability of an observation being missing depends on both observed and unobserved data.

All missing data
MAR
MCAR
NMAR

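NMAR (Not Missing at Random, also written MNAR) missing data occurs when the probability of a value being missing depends on unobserved data, such as the missing value itself, in addition to any dependence on the observed data.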

How does the assumption of MAR differ from MCAR in terms of data missingness?

MAR assumes the missingness is only related to the observed data
MAR assumes the missingness is related to the unobserved data
MAR assumes the missingness is unrelated to any variable
There's no difference between MAR and MCAR

In MCAR (Missing Completely at Random), the missingness is completely random and doesn't depend on any variable. In MAR (Missing at Random), the probability of a value being missing can depend on the observed data, but not on the unobserved (missing) values themselves.
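The difference can be seen in a small simulation. The sketch below is only an illustration under assumed numbers: the `ages` variable, the age cutoff of 40, and the missingness probabilities are all hypothetical. It marks a second variable as missing under each mechanism and compares the missing rate between two groups defined by the observed variable.

```python
import random

random.seed(0)

# Hypothetical fully observed variable: the age of each respondent.
ages = [random.randint(20, 59) for _ in range(10_000)]

def missing_rate_by_group(ages, is_missing):
    """Return the missing rate for ages under 40 and for ages 40 and over."""
    young = [m for a, m in zip(ages, is_missing) if a < 40]
    old = [m for a, m in zip(ages, is_missing) if a >= 40]
    return sum(young) / len(young), sum(old) / len(old)

# MCAR: every record has the same chance of being missing.
mcar_mask = [random.random() < 0.3 for _ in ages]

# MAR: the chance of being missing depends only on the observed age.
mar_mask = [random.random() < (0.6 if a >= 40 else 0.1) for a in ages]

mcar_young, mcar_old = missing_rate_by_group(ages, mcar_mask)
mar_young, mar_old = missing_rate_by_group(ages, mar_mask)

print("MCAR missing rate (young, old):", round(mcar_young, 3), round(mcar_old, 3))
print("MAR missing rate (young, old):", round(mar_young, 3), round(mar_old, 3))
```

Under MCAR the two groups show roughly the same missing rate, while under MAR the rate differs by group even though it never depends on the missing values themselves.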

What is the effect of 'binning' on the overall variance of the dataset?

It can either increase or decrease the variance
It decreases the variance
It does not affect the variance
It increases the variance

Binning reduces the variance of a dataset by replacing every value in a bin, outliers included, with a summary statistic such as the bin mean or median. The variation within each bin is discarded, which reduces the overall spread of the data.
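A minimal sketch of smoothing by bin means is shown below, assuming equal-width bins and a made-up list of values (both the data and the choice of three bins are hypothetical). It compares the population variance before and after binning.

```python
import statistics

# Hypothetical data, invented for this illustration.
values = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
n_bins = 3

lo, hi = min(values), max(values)
width = (hi - lo) / n_bins

def bin_index(v):
    """Map a value to an equal-width bin; the maximum falls in the last bin."""
    return min(int((v - lo) / width), n_bins - 1)

# Group the values by bin.
bins = {}
for v in values:
    bins.setdefault(bin_index(v), []).append(v)

# Replace each value with the mean of its bin.
bin_means = {i: statistics.mean(members) for i, members in bins.items()}
smoothed = [bin_means[bin_index(v)] for v in values]

original_var = statistics.pvariance(values)
smoothed_var = statistics.pvariance(smoothed)

print("variance before binning:", round(original_var, 2))
print("variance after binning:", round(smoothed_var, 2))
```

Replacing values with bin means keeps the overall mean unchanged and removes the within-bin variance, so the variance after smoothing cannot exceed the original variance.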

Describe the impact of skewness and kurtosis on parametric testing.

They can improve the accuracy of parametric testing.
They can invalidate the results of parametric testing.
They can reduce the variance in parametric testing.
They do not impact parametric testing.

Skewness and kurtosis can invalidate the results of parametric testing. Many parametric tests assume that the data follows a normal distribution. If the data is highly skewed or has high kurtosis, these assumptions are violated, and the test results may not be valid.
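One way to check for skew before running a parametric test is to compute the sample skewness. The sketch below uses the moment-based definition (third central moment divided by the second central moment raised to the power 1.5). The `skewness` helper and both data sets are hypothetical, written only for this illustration.

```python
def skewness(data):
    """Moment-based sample skewness: m3 / m2 ** 1.5."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

# Hypothetical samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 20]

skew_symmetric = skewness(symmetric)
skew_right = skewness(right_skewed)

print("symmetric sample skewness:", round(skew_symmetric, 3))
print("right-skewed sample skewness:", round(skew_right, 3))
```

A skewness near zero is consistent with a symmetric distribution, while a clearly positive or negative value suggests the normality assumption deserves a closer look.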

If a distribution is leptokurtic, what does it signify about the data?

The data has a high variance.
The data is heavily tailed with potential outliers.
The data is less outlier-prone.
The data is normally distributed.

A leptokurtic distribution has heavier tails and a sharper peak than a normal distribution, which means extreme values (potential outliers) occur more often. Data from such a distribution tends to show more frequent large jumps away from the mean.
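Kurtosis can be measured as excess kurtosis, the fourth central moment divided by the squared second central moment, minus 3 (the value for a normal distribution). The sketch below applies that formula to two made-up samples. The `excess_kurtosis` helper and the data are hypothetical.

```python
def excess_kurtosis(data):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m4 / m2 ** 2 - 3

# Hypothetical heavy-tailed sample: most values at the center, two extremes.
heavy_tailed = [0] * 18 + [-10, 10]

# Hypothetical flat sample: values spread evenly from 1 to 20.
flat = list(range(1, 21))

kurt_heavy = excess_kurtosis(heavy_tailed)
kurt_flat = excess_kurtosis(flat)

print("heavy-tailed excess kurtosis:", round(kurt_heavy, 3))
print("flat excess kurtosis:", round(kurt_flat, 3))
```

Positive excess kurtosis indicates a leptokurtic shape, and negative excess kurtosis indicates a platykurtic (flatter, lighter-tailed) shape.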

A potential drawback of the Z-score method for outlier detection is that it assumes the data is _______ distributed.

exponentially
logistically
normally
uniformly

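The Z-score method assumes that the data is approximately normally distributed, which is not the case for every dataset. On skewed or heavy-tailed data, the mean and standard deviation are themselves pulled by extreme values, so a fixed cutoff can miss or mislabel outliers.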
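For reference, a minimal sketch of Z-score outlier detection is shown below. The data and the cutoff of 3 are assumptions made for the example (3 is a common convention, not a fixed rule), and the population standard deviation is used.

```python
import statistics

# Hypothetical measurements with one extreme value.
values = [10, 12, 11, 9, 10, 11, 10, 9, 12, 10,
          11, 10, 9, 11, 10, 12, 10, 9, 11, 100]

mean = statistics.mean(values)
std = statistics.pstdev(values)  # population standard deviation

threshold = 3  # assumed cutoff for this illustration

# A value is flagged when it lies more than `threshold` standard deviations from the mean.
z_scores = [(v - mean) / std for v in values]
outliers = [v for v, z in zip(values, z_scores) if abs(z) > threshold]

print("flagged outliers:", outliers)
```

Note that the extreme value itself inflates both the mean and the standard deviation, which is one reason the method can behave poorly when the data is far from normal.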

Can the IQR method be applied to multimodal data sets for outlier detection? Explain.

No, it can only be applied to normally distributed data
No, it only works with unimodal distributions
Yes, but it may not be effective
Yes, it works well with any distribution

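The IQR method can be applied to multimodal datasets for outlier detection, but it may not be effective. Its fences are built from quartiles of the pooled data, so with several modes the interquartile range can span multiple clusters, and values that are unusual relative to their own cluster may not be flagged.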
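The sketch below applies the usual 1.5 × IQR fences to a made-up bimodal sample (the data is hypothetical, and `statistics.quantiles` with its default method is only one of several ways to compute quartiles). It shows a value sitting between the two clusters escaping detection.

```python
import statistics

# Hypothetical bimodal data: one cluster near 10, one near 50,
# a stray value of 30 between them, and an extreme value of 120.
data = [9, 10, 10, 11, 11, 12, 30, 49, 50, 50, 51, 52, 120]

# Quartiles of the pooled data.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Conventional fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("flagged outliers:", outliers)
```

Because the quartiles land in different clusters, the IQR is wide and the value 30, which belongs to neither cluster, stays inside the fences.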

In _____ scaling, we scale the data using the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).

Min-Max
Robust
Standard
Z-score

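In Robust scaling, we center the data on the median and divide by the interquartile range, the distance between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile). Because the median and quartiles are barely affected by extreme values, this approach minimizes the impact of outliers.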
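A minimal sketch of robust scaling, written as (x - median) / IQR, is shown below. The data is hypothetical, and the quartiles come from `statistics.quantiles` with its default method. Other libraries may interpolate quartiles slightly differently, so exact values can differ.

```python
import statistics

# Hypothetical data with one extreme value.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

median = statistics.median(values)
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Center on the median and scale by the interquartile range.
scaled = [(v - median) / iqr for v in values]

print("median:", median, "IQR:", iqr)
print("scaled values:", [round(s, 2) for s in scaled])
```

The nine ordinary values end up within one unit of zero, while the extreme value remains clearly separated instead of compressing the rest of the data, as it would under Min-Max scaling.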

What is the primary goal of Exploratory Data Analysis (EDA)?

To confirm a pre-existing hypothesis
To create an aesthetic representation of the data
To make precise predictions about future events
To understand the underlying structure of the data

The primary goal of EDA is to understand the underlying structure of the data, including distribution, variability, and relationships among variables. EDA allows analysts to make informed decisions about further data processing steps and analysis.