What type of plot is ideal for visualizing relationships among more than two variables?

  • Bar plot
  • Box plot
  • Pairplot
  • Scatter plot
A pairplot is ideal for visualizing relationships among more than two variables. It creates a grid of axes in which each variable in the data is shared on the y-axis across a single row and on the x-axis across a single column, so every pairwise relationship gets its own panel.
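A minimal sketch using seaborn's pairplot, assuming the bundled iris dataset is available; the dataset and the "species" column are only illustrative choices:

```python
import seaborn as sns

# Load a sample dataset bundled with seaborn (illustrative choice).
iris = sns.load_dataset("iris")

# pairplot builds the grid described above: each variable's distribution on the
# diagonal, pairwise scatter plots off the diagonal, optionally colored by a category.
grid = sns.pairplot(iris, hue="species")
grid.savefig("iris_pairplot.png")
```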

How does the uncertainty level differ in EDA, CDA, and Predictive Modeling?

  • Uncertainty is equally distributed among all three.
  • Uncertainty is highest in CDA, lower in Predictive Modeling, and lowest in EDA.
  • Uncertainty is highest in EDA, lower in CDA, and lowest in Predictive Modeling.
  • Uncertainty is highest in Predictive Modeling, lower in CDA, and lowest in EDA.
In EDA, where the primary aim is to explore patterns and relationships in the data, uncertainty is highest. It is lower in CDA, which seeks to confirm the hypotheses generated during EDA, and lowest in Predictive Modeling, which builds on the outcomes of EDA and CDA to make predictions about future data.

Can the steps of the EDA process be re-ordered or are they strictly sequential?

  • Some steps can be reordered, but not all.
  • The order of steps depends on the data set size.
  • They are strictly sequential and cannot be reordered.
  • They can be reordered based on the analysis needs.
The EDA process is generally sequential, starting from questioning and ending in communication. However, depending on the nature and needs of the analysis, some steps might be revisited. For instance, new questions might emerge during the explore phase, necessitating going back to the questioning phase. Or, additional data wrangling might be needed after exploring the data.

In what scenarios would it be more appropriate to use Kendall's Tau over Spearman's correlation coefficient?

  • Datasets with many tied ranks
  • Datasets with normally distributed data
  • Datasets without outliers
  • Large datasets with ordinal data
Kendall's Tau may be more appropriate than Spearman's correlation coefficient when a dataset contains many tied ranks: the commonly used tau-b variant explicitly corrects for ties in either variable, so it is generally the safer choice for heavily tied ordinal data.
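A small sketch comparing the two coefficients on ordinal data with many ties, using SciPy; the sample ratings are made up for illustration:

```python
from scipy import stats

# Two ordinal ratings on a 1-5 scale with many tied ranks.
x = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
y = [1, 1, 2, 2, 3, 3, 3, 4, 4, 5]

tau, tau_p = stats.kendalltau(x, y)   # tau-b, which corrects for ties
rho, rho_p = stats.spearmanr(x, y)

print(f"Kendall's tau-b: {tau:.3f} (p = {tau_p:.3f})")
print(f"Spearman's rho:  {rho:.3f} (p = {rho_p:.3f})")
```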

In a study on job satisfaction, employees with lower satisfaction scores are less likely to complete surveys. How would you categorize this missing data?

  • MAR
  • MCAR
  • NMAR
  • Not missing data
This would be NMAR (Not Missing at Random) because the missingness depends on the unobserved data itself (i.e., the job satisfaction score). If employees with lower job satisfaction are less likely to complete the survey, the missingness is related to the missing satisfaction scores.
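A rough simulation of this NMAR mechanism, assuming hypothetical satisfaction scores on a 1-10 scale; the dropout probabilities are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical satisfaction scores on a 1-10 scale.
satisfaction = rng.integers(1, 11, size=1_000)

# NMAR mechanism: the lower the (unobserved) score, the more likely the
# survey is never returned, so missingness depends on the value itself.
p_missing = np.clip(0.8 - 0.07 * satisfaction, 0.05, 0.95)
observed = np.where(rng.random(1_000) < p_missing, np.nan, satisfaction)

df = pd.DataFrame({"true_score": satisfaction, "observed_score": observed})
# The observed mean is biased upward because low scores go missing more often.
print(df["true_score"].mean(), df["observed_score"].mean())
```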

You are analyzing customer purchasing behavior and the data exhibits high skewness. What could be the potential challenges and how can you address them?

  • Data normality assumptions may be violated, address this by transformation techniques.
  • No challenges would be encountered.
  • Skewness would make the data easier to analyze.
  • The mean would become more reliable, no action is needed.
High skewness can violate the normality assumptions required by many statistical tests and machine learning models, and it makes the mean a less representative summary. A common way to address this is a data transformation, such as a log, square root, or inverse transformation, to make the distribution more symmetrical.
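A minimal sketch of the transformation idea, using a synthetic log-normal sample as a stand-in for skewed purchase amounts:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed "purchase amounts" (log-normal stand-in).
amounts = rng.lognormal(mean=3.0, sigma=1.0, size=5_000)
print("skewness before:", round(float(stats.skew(amounts)), 2))

# log1p handles zeros safely and pulls in the long right tail.
log_amounts = np.log1p(amounts)
print("skewness after: ", round(float(stats.skew(log_amounts)), 2))
```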

In the context of EDA, you find that certain features in your dataset are highly correlated. How would you interpret this finding and how might it affect your analysis?

  • The presence of multicollinearity may require you to consider it in your model selection or feature engineering steps
  • You should combine the correlated features into one
  • You should remove all correlated features
  • You should use only correlated features in your analysis
High correlation between features indicates multicollinearity. This can be problematic in certain types of models (like linear regression) as it can destabilize the model and make the effects of predictor variables hard to separate. Depending on the severity of multicollinearity, you may need to consider it during model selection or feature engineering steps, such as removing highly correlated variables, combining them, or using regularization techniques.
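One common way to surface such pairs is to scan the absolute correlation matrix; a sketch with synthetic features, where x2 is deliberately a near copy of x1:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic features: x2 is nearly a copy of x1, x3 is independent.
x1 = rng.normal(size=500)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.05, size=500),
    "x3": rng.normal(size=500),
})

# Keep only the upper triangle so each pair is checked once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_review = [col for col in upper.columns if (upper[col] > 0.9).any()]

print(corr.round(2))
print("candidates to drop, combine, or regularize:", to_review)
```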

In what circumstances can the IQR method lead to incorrect detection of outliers?

  • When data has a high standard deviation
  • When data is heavily skewed or bimodal
  • When data is normally distributed
  • When data is uniformly distributed
The IQR method might lead to incorrect detection of outliers in heavily skewed or bimodal distributions. With strong skew, the symmetric 1.5×IQR fences flag many legitimate values in the long tail as outliers; with bimodal data, the quartiles can straddle the gap between modes, so the fences are poorly calibrated for either cluster.
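A short sketch of the standard 1.5×IQR rule applied to a heavily skewed sample, showing how many ordinary tail points get flagged; the log-normal data are synthetic:

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

rng = np.random.default_rng(0)

# Heavily right-skewed sample: the upper fence flags many ordinary tail values.
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1_000)
print("points flagged in skewed data:", len(iqr_outliers(skewed)))

# Roughly symmetric sample for comparison.
normal = rng.normal(size=1_000)
print("points flagged in normal data:", len(iqr_outliers(normal)))
```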

A potential drawback of using regression imputation is that it can underestimate the ___________.

  • Mean
  • Median
  • Mode
  • Variance
One potential drawback of regression imputation is that it can underestimate the variance. Because it uses the relationship with other variables to estimate missing values, the imputed values fall exactly on the fitted regression line, with none of the residual scatter of real observations, so the filled-in data show less variability than they should.
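A small demonstration of that effect with a single predictor, fitting the regression on complete cases and imputing the rest; all numbers here are simulated:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated variables; 30% of y is missing completely at random.
x = rng.normal(size=1_000)
y = 2.0 * x + rng.normal(scale=1.0, size=1_000)
missing = rng.random(1_000) < 0.3

# Fit on the observed cases, then impute the missing y from x.
slope, intercept = np.polyfit(x[~missing], y[~missing], deg=1)
y_imputed = y.copy()
y_imputed[missing] = slope * x[missing] + intercept

# Imputed values sit exactly on the regression line, so the spread shrinks.
print("variance of the true y:   ", round(float(y.var()), 3))
print("variance after imputation:", round(float(y_imputed.var()), 3))
```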

To ensure that the audience doesn't misinterpret a data visualization, it's important to avoid __________.

  • Bias and misleading scales
  • Using interactive elements
  • Using more than one type of graph
  • Using too many colors
To avoid misinterpretation of a data visualization, it's essential to avoid bias and misleading scales. These could skew the representation of the data and thus lead to inaccurate conclusions.
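A quick matplotlib illustration of the misleading-scales point: the same four values plotted with a truncated y-axis and with a zero baseline; the numbers are made up:

```python
import matplotlib.pyplot as plt

labels = ["Q1", "Q2", "Q3", "Q4"]
values = [102, 104, 103, 105]

fig, (ax_trunc, ax_zero) = plt.subplots(1, 2, figsize=(8, 3))

# Truncating the y-axis exaggerates small differences between the bars.
ax_trunc.bar(labels, values)
ax_trunc.set_ylim(100, 106)
ax_trunc.set_title("Misleading: truncated scale")

# A zero baseline keeps the visual comparison proportional.
ax_zero.bar(labels, values)
ax_zero.set_ylim(0, 110)
ax_zero.set_title("Clearer: zero baseline")

fig.tight_layout()
fig.savefig("scale_comparison.png")
```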