How does the IQR method categorize a data point as an outlier?

By comparing it to the mean
By comparing it to the median
By comparing it to the standard deviation
By seeing if it falls below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR

The IQR method categorizes a data point as an outlier if it falls below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1 is the interquartile range.
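The rule can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the helper name iqr_outliers, the sample values, and the choice of the "inclusive" quartile method are all assumptions made for the example, and other quartile conventions can give slightly different fences.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample: nine similar readings and one extreme value.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
outliers = iqr_outliers(data)
print(outliers)
```

For this sample the quartiles are Q1 = 12 and Q3 = 13.75, so the fences sit at 9.375 and 16.375 and only the value 102 is flagged.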

You're working with a data set that does not follow a normal distribution. Which method, Z-score or IQR, should be used for detecting outliers?

Both are suitable
IQR
Neither is suitable
Z-score

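In this case, the IQR method is the better choice. It relies only on quartiles, so it makes no assumption about the shape of the distribution. Z-score thresholds such as |z| > 3 are calibrated for approximately normal data, and the mean and standard deviation they depend on are themselves distorted by skew and extreme values.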
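To illustrate why the Z-score can struggle outside the normal case, here is a small sketch. The sample values, the threshold of 3, and the helper name z_score_outliers are assumptions chosen for the example. The single extreme value inflates both the mean and the standard deviation, so its own Z-score stays below the threshold.

```python
import statistics

def z_score_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [v for v in values if abs((v - mean) / sd) > threshold]

# Hypothetical sample: nine similar readings and one extreme value.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]

z_flagged = z_score_outliers(data)
z_of_extreme = (102 - statistics.mean(data)) / statistics.stdev(data)
print(z_flagged, round(z_of_extreme, 2))
```

Here the value 102 has a Z-score of roughly 2.84, so nothing is flagged at a threshold of 3, even though the point is far from every other observation.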


You are visualizing a heatmap and notice a row with colors drastically different than the rest. What might this indicate about the corresponding variable?

The variable has a unique distribution
The variable has many missing values
The variable is an outlier
The variable is unrelated to the others

If a row in a heatmap has colors that are drastically different from the rest, it might indicate that the corresponding variable is unrelated to the other variables in the dataset, or that its relationships with them are very different.


How does standard deviation differ in a sample versus a population?

The denominator in the calculation of the sample standard deviation is (n-1)
The standard deviation of a sample is always larger
The standard deviation of a sample is always smaller
They are calculated in the same way

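The standard deviation of a sample differs from that of a population in the way it is calculated. For a sample, the sum of squared deviations is divided by (n-1) instead of n. This adjustment, known as Bessel's correction, compensates for the bias introduced by measuring deviations from the sample mean rather than the true population mean.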
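A short sketch of the two calculations using Python's statistics module, where pstdev divides by n and stdev divides by (n-1). The data values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import statistics

# Hypothetical values; mean is 5 and the squared deviations sum to 32.
values = [2, 4, 4, 4, 5, 5, 7, 9]

population_sd = statistics.pstdev(values)  # sqrt(32 / 8)
sample_sd = statistics.stdev(values)       # sqrt(32 / 7)

print(population_sd, round(sample_sd, 3))
```

Applied to the same numbers, the (n-1) version is always the larger of the two, and the gap shrinks as n grows.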


What does a correlation coefficient close to 0 indicate about the relationship between two variables?

A perfect negative linear relationship
A perfect positive linear relationship
A very strong linear relationship
No linear relationship

A correlation coefficient close to 0 indicates that there is little or no linear relationship between the two variables: changes in one variable are not consistently associated with proportional changes in the other. It does not necessarily mean that there is no relationship at all, as the variables may still be related in a non-linear way.
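The caveat about non-linear relationships can be seen in a small sketch. The helper pearson_r and the data are assumptions made for illustration: y is completely determined by x (y = x squared), yet the Pearson correlation is zero because the relationship is symmetric rather than linear.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]        # perfectly dependent on x, but not linearly

r_nonlinear = pearson_r(xs, ys)
r_linear = pearson_r(xs, [2 * x + 1 for x in xs])
print(r_nonlinear, r_linear)
```

The second call is included for contrast: an exact linear relationship gives a coefficient of 1 (up to floating-point rounding).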


What step comes after 'wrangling' in the EDA process?

Communicating
Concluding
Exploring
Questioning

Once the data has been 'wrangled' (that is, cleaned and transformed), the next step in the EDA process is 'exploring'. This stage involves examining the data through statistical analysis and visual methods.


Which type of analysis is most commonly used for hypothesis testing?

CDA
Data Visualization
EDA
Predictive Modeling

CDA (Confirmatory Data Analysis) is most commonly used for hypothesis testing. While EDA is used to formulate hypotheses, CDA uses statistical techniques to confirm or reject these hypotheses.


How does negative kurtosis affect the tails of a data distribution?

It has no effect on the tails of the distribution.
It makes the distribution perfectly symmetrical.
It makes the tails of the distribution heavier.
It makes the tails of the distribution lighter.

Negative excess kurtosis describes a platykurtic distribution, one whose tails are lighter than those of a normal distribution, so extreme values are less likely. Such distributions are often described as flatter than the normal curve, although kurtosis is primarily a measure of tail weight rather than of peak shape.
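As a rough sketch, excess kurtosis can be computed from the second and fourth central moments. The function below uses the simple population formula m4 / m2^2 - 3 (libraries may apply sample-size corrections), and both datasets are hypothetical: evenly spread values with no tails, and values clustered at zero with two extreme points.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

flat = list(range(1, 101))            # evenly spread, no extreme values
heavy_tailed = [0] * 98 + [-10, 10]   # mostly identical, two extreme values

flat_kurtosis = excess_kurtosis(flat)
heavy_kurtosis = excess_kurtosis(heavy_tailed)
print(round(flat_kurtosis, 2), round(heavy_kurtosis, 2))
```

The evenly spread data gives a value near -1.2 (light tails), while the data with two extreme points gives a large positive value (heavy tails).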


What type of plot is often used for visualizing the relationship between two continuous variables?

Bar plot
Box plot
Histogram
Scatter plot

Scatter plots are ideal for visualizing the relationship between two continuous variables. Each point in the scatter plot corresponds to the values of two variables.


What is the process of removing an entire row when any single data point within it is missing called?

Listwise Deletion
Mean Imputation
Pairwise Deletion
Regression Imputation

The process of removing an entire row when any single data point within it is missing is called 'Listwise Deletion'. Also known as 'Complete Case Analysis', this technique is straightforward and fast, but it can potentially discard valuable data and introduce bias if the missingness is not completely at random.
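A minimal sketch of listwise deletion in plain Python, assuming missing values are represented as None. The records, field names, and the helper listwise_delete are hypothetical and exist only for this example.

```python
def listwise_delete(records):
    """Keep only the records in which every field has a value."""
    return [r for r in records if all(v is not None for v in r.values())]

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "city": "Pune"},
    {"age": None, "income": 61000, "city": "Delhi"},
    {"age": 29, "income": None, "city": "Mumbai"},
    {"age": 41, "income": 48000, "city": "Chennai"},
]

complete = listwise_delete(rows)
print(len(rows), "rows before,", len(complete), "rows after")
```

Half of the rows are dropped here even though each incomplete row is missing only one field, which shows how quickly this method can discard data. In pandas, DataFrame.dropna() with its default arguments performs the same row-wise removal.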
