When the distribution is skewed to the right, it is referred to as _________ skewness.

  • Any of these
  • Negative
  • Positive
  • Zero
Positive skewness describes a distribution whose right tail is longer or fatter than the left. In such distributions, most of the values, including the median and the mode, tend to fall below the mean.
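As a quick sketch (using a synthetic exponential sample, which is right-skewed by construction), the mean landing above the median is a telltale sign of positive skew:

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed sample: the exponential distribution has a long
# right tail, so it is positively skewed by construction.
rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=10_000)

print(skew(data) > 0)                   # True: positive skewness
print(np.mean(data) > np.median(data))  # True: the tail pulls the mean up
```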

A high ________ suggests that data points are generally far from the mean, indicating a wide spread in the data set.

  • Mean
  • Median
  • Standard Deviation
  • Variance
A "High Standard Deviation" suggests that data points are generally far from the mean, indicating a wide spread in the dataset. It measures the absolute variability of a distribution; the higher the spread, the higher the standard deviation.

You've identified several outliers using the modified Z-score method in your dataset. What could be the possible reasons for their existence?

  • All of these
  • The data may have been corrupted
  • The dataset may contain measurement errors
  • The dataset may have a complex, multi-modal distribution
Any of these can produce outliers: corrupted data, measurement errors, or a genuinely complex, multi-modal distribution in which extreme-looking points are legitimate.
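For reference, a small sketch of the modified Z-score itself (the Iglewicz-Hoaglin formulation, with the conventional 3.5 cutoff) applied to a toy sample with one planted outlier:

```python
import numpy as np

def modified_z_scores(x):
    # Robust to outliers: uses the median and the median absolute
    # deviation (MAD) instead of the mean and standard deviation.
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return 0.6745 * (x - med) / mad

data = np.array([10, 12, 11, 13, 12, 11, 300], dtype=float)
scores = modified_z_scores(data)
outliers = np.abs(scores) > 3.5  # conventional cutoff
print(data[outliers])            # [300.]
```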

_____ plots can give a high-level view of a single continuous variable but may hide details about the distribution.

  • Bar
  • Box
  • Histogram
  • Scatter
Histograms can provide a high-level view of a single continuous variable by showing the frequency of data points in different bins. However, due to the binning process, some details about the distribution might be hidden.
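A small sketch of the binning effect on synthetic bimodal data: with too few bins the two modes merge into one lump, while finer bins reveal the empty gap between them.

```python
import numpy as np

# Synthetic bimodal sample: two clusters centred at 0 and 10.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 500), rng.normal(10, 1, 500)])

coarse, _ = np.histogram(data, bins=2)   # too coarse: bimodality invisible
fine, _ = np.histogram(data, bins=50)    # fine enough to expose the gap
print(coarse)       # two well-filled bins, no hint of structure
print(fine.min())   # 0: empty bins between the modes reveal bimodality
```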

When the correlation coefficient is close to 1, it implies a strong ________ relationship between the two variables.

  • Negative
  • Neutral
  • Positive
  • Zero
When the correlation coefficient is close to 1, it implies a strong positive relationship between the two variables. This means as one variable increases, the other also increases.
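A hypothetical sketch: when y is essentially a rising function of x, the Pearson coefficient lands near 1.

```python
import numpy as np

# Synthetic variables with a strong positive linear relationship.
rng = np.random.default_rng(2)
x = rng.normal(size=1_000)
y = 3 * x + rng.normal(scale=0.1, size=1_000)  # y rises with x, small noise

r = np.corrcoef(x, y)[0, 1]
print(r > 0.99)  # True: near-perfect positive correlation
```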

How does the curse of dimensionality relate to feature selection?

  • It can cause overfitting
  • It can make visualizing data difficult
  • It increases computational complexity
  • It reduces the effectiveness of distance-based methods
The curse of dimensionality refers to the various problems that arise when dealing with high-dimensional data. In the context of feature selection, high dimensionality can reduce the effectiveness of distance-based methods, as distances in high-dimensional space become less meaningful.
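A quick numerical sketch (on synthetic uniform data) of this distance-concentration effect: the relative gap between the nearest and farthest point from a query collapses as dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(3)

def relative_contrast(dim, n=500):
    # (max distance - min distance) / min distance for a random query
    # against n uniform random points in [0, 1]^dim.
    points = rng.uniform(size=(n, dim))
    query = rng.uniform(size=dim)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

low = relative_contrast(2)      # large: near and far are easy to tell apart
high = relative_contrast(1000)  # small: all points look roughly equidistant
print(low > high)  # True
```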

In what scenarios would it be more appropriate to use Kendall's Tau over Spearman's correlation coefficient?

  • Datasets with many tied ranks
  • Datasets with normally distributed data
  • Datasets without outliers
  • Large datasets with ordinal data
Kendall's Tau is preferable to Spearman's correlation coefficient for datasets with many tied ranks. The commonly used tau-b variant explicitly corrects for ties, whereas ties can distort Spearman's rank-based calculation.
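A small sketch with made-up ordinal ratings (two hypothetical judges scoring the same ten items on a 1-5 scale, producing many ties):

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical 1-5 ordinal ratings with many tied ranks.
judge_a = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
judge_b = [1, 1, 2, 2, 3, 3, 3, 4, 4, 5]

tau, _ = kendalltau(judge_a, judge_b)  # tau-b, which corrects for ties
rho, _ = spearmanr(judge_a, judge_b)
print(tau, rho)  # both strongly positive
```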

Can the steps of the EDA process be re-ordered or are they strictly sequential?

  • Some steps can be reordered, but not all.
  • The order of steps depends on the data set size.
  • They are strictly sequential and cannot be reordered.
  • They can be reordered based on the analysis needs.
The EDA process is generally sequential, starting from questioning and ending in communication. However, depending on the nature and needs of the analysis, some steps might be revisited. For instance, new questions might emerge during the explore phase, necessitating going back to the questioning phase. Or, additional data wrangling might be needed after exploring the data.

In the context of EDA, you find that certain features in your dataset are highly correlated. How would you interpret this finding and how might it affect your analysis?

  • The presence of multicollinearity may require you to consider it in your model selection or feature engineering steps
  • You should combine the correlated features into one
  • You should remove all correlated features
  • You should use only correlated features in your analysis
High correlation between features indicates multicollinearity. This can be problematic in certain types of models (like linear regression) as it can destabilize the model and make the effects of predictor variables hard to separate. Depending on the severity of multicollinearity, you may need to consider it during model selection or feature engineering steps, such as removing highly correlated variables, combining them, or using regularization techniques.
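As an illustrative sketch (synthetic data, hypothetical 0.9 threshold), a correlation matrix makes candidate multicollinear pairs easy to flag for review:

```python
import numpy as np
import pandas as pd

# Synthetic features: x2 is nearly a linear function of x1 (multicollinear),
# x3 is independent.
rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + rng.normal(scale=0.05, size=200),
    "x3": rng.normal(size=200),
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is listed once, then flag
# pairs above a hypothetical 0.9 threshold.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
flagged = upper.stack()[upper.stack() > 0.9]
print(flagged)  # only the (x1, x2) pair is flagged
```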

You are analyzing customer purchasing behavior and the data exhibits high skewness. What could be the potential challenges and how can you address them?

  • Data normality assumptions may be violated, address this by transformation techniques.
  • No challenges would be encountered.
  • Skewness would make the data easier to analyze.
  • The mean would become more reliable, no action is needed.
High skewness may cause a violation of data normality assumptions often required for many statistical tests and machine learning models. One common way to address this is through data transformation techniques like log, square root, or inverse transformations to make the distribution more symmetrical.
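A short sketch on synthetic purchase amounts (lognormal, so right-skewed by construction) showing a log transform pulling skewness toward zero:

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed purchase amounts.
rng = np.random.default_rng(5)
amounts = rng.lognormal(mean=3.0, sigma=1.0, size=5_000)

raw_skew = skew(amounts)          # strongly positive
log_skew = skew(np.log(amounts))  # near zero: log(amounts) is ~normal
print(raw_skew > 1, abs(log_skew) < 0.5)  # True True
```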

In a study on job satisfaction, employees with lower satisfaction scores are less likely to complete surveys. How would you categorize this missing data?

  • MAR
  • MCAR
  • NMAR
  • Not missing data
This would be NMAR (Not Missing at Random) because the missingness depends on the unobserved data itself (i.e., the job satisfaction score). If employees with lower job satisfaction are less likely to complete the survey, the missingness is related to the missing satisfaction scores.

You have a dataset with many tied ranks. Which correlation coefficient would you prefer to use, and why?

  • Covariance
  • Kendall's Tau
  • Pearson's correlation coefficient
  • Spearman's correlation coefficient
For a dataset with many tied ranks, Kendall's Tau is the better choice because it handles tied ranks more gracefully than Spearman's correlation coefficient.