How does the application of Predictive Modeling differ from EDA and CDA in data-driven decision making?

  • Predictive Modeling does not play a role in data-driven decision making.
  • Predictive Modeling is used after EDA and CDA to make future predictions based on the data.
  • Predictive Modeling is used before EDA and CDA to anticipate the outcomes.
  • Predictive Modeling, EDA, and CDA all serve the same purpose.
Predictive Modeling is typically performed after EDA and CDA: once the data has been explored and hypotheses have been tested, a model is built to make future predictions based on the data. These predictions can then inform decision-making, particularly in data-driven organizations.

Can the steps of the EDA process be re-ordered or are they strictly sequential?

  • Some steps can be reordered, but not all.
  • The order of steps depends on the data set size.
  • They are strictly sequential and cannot be reordered.
  • They can be reordered based on the analysis needs.
The EDA process is generally sequential, starting with questioning and ending with communication. However, depending on the nature and needs of the analysis, some steps may be revisited. For instance, new questions can emerge during the explore phase and send the analysis back to the questioning phase, or the data may need additional wrangling after it has been explored.

In what scenarios would it be more appropriate to use Kendall's Tau over Spearman's correlation coefficient?

  • Datasets with many tied ranks
  • Datasets with normally distributed data
  • Datasets without outliers
  • Large datasets with ordinal data
Kendall's Tau is generally the more appropriate choice over Spearman's correlation coefficient when a dataset contains many tied ranks. It is computed from concordant and discordant pairs of observations, and its tau-b variant explicitly corrects for ties, so it handles them better than Spearman's coefficient.
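
To make the tie handling concrete, here is a minimal sketch of Kendall's tau-b in plain Python. The function name `kendall_tau_b` and the sample ratings are hypothetical, chosen only for illustration; the quadratic pair loop is meant for small inputs, not as a production implementation.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b: pair-based rank correlation with a tie correction."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: not counted anywhere
        if dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1  # both variables move in the same direction
        else:
            discordant += 1  # the variables move in opposite directions
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical ordinal ratings with several tied ranks.
rater_a = [1, 1, 2, 2, 3, 3]
rater_b = [1, 2, 2, 3, 3, 3]
print(kendall_tau_b(rater_a, rater_b))
```

Pairs tied in only one variable still enlarge the denominator, so heavily tied data cannot reach a coefficient of 1 unless the ties line up in both variables.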

How does the curse of dimensionality relate to feature selection?

  • It can cause overfitting
  • It can make visualizing data difficult
  • It increases computational complexity
  • It reduces the effectiveness of distance-based methods
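The curse of dimensionality refers to the various problems that arise when dealing with high-dimensional data. In the context of feature selection, high dimensionality reduces the effectiveness of distance-based methods such as k-nearest neighbors: as the number of features grows, the distances between points become increasingly similar and therefore less meaningful. Selecting fewer, more relevant features helps keep those distances informative.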
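
The loss of meaning in distances can be observed with a small simulation. The sketch below assumes points drawn uniformly from the unit hypercube and uses a hypothetical helper, `relative_contrast`, that compares the farthest and nearest distances from a random query point.

```python
import math
import random


def relative_contrast(n_dims, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


# The contrast between near and far neighbors shrinks as dimensions are added.
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(relative_contrast(n_dims), 3))
```

A small contrast means the nearest and farthest points are almost equally far away, which is why nearest-neighbor style methods struggle before feature selection reduces the dimensionality.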

When the correlation coefficient is close to 1, it implies a strong ________ relationship between the two variables.

  • Negative
  • Neutral
  • Positive
  • Zero
When the correlation coefficient is close to +1, it implies a strong positive linear relationship between the two variables: as one variable increases, the other tends to increase as well.
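
As an illustration, the following sketch computes Pearson's correlation coefficient from its definition. The function name `pearson_r` and the study-hours data are made up for this example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: scores that rise or fall with hours studied.
hours = [1, 2, 3, 4, 5]
score_up = [52, 55, 61, 64, 70]
score_down = [70, 64, 61, 55, 52]

print(pearson_r(hours, score_up))    # close to +1: strong positive relationship
print(pearson_r(hours, score_down))  # close to -1: strong negative relationship
```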

_____ plots can give a high-level view of a single continuous variable but may hide details about the distribution.

  • Bar
  • Box
  • Histogram
  • Scatter
Histograms can provide a high-level view of a single continuous variable by showing the frequency of data points in different bins. However, because values are grouped into bins, details within each bin are lost, and the apparent shape of the distribution can change with the choice of bin width.
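
The effect of binning can be shown with a small sketch. The helper `histogram_counts` and the sample values are hypothetical; the helper assumes every value lies within the given range.

```python
def histogram_counts(values, n_bins, low, high):
    """Count values in n_bins equal-width bins covering [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # A value equal to `high` is placed in the last bin.
        idx = min(int((v - low) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical data with two clusters and an empty gap between them.
values = [1, 2, 2, 3, 3, 3, 4, 4, 16, 17, 17, 18, 18, 18, 19, 19]

coarse = histogram_counts(values, 2, 0, 20)   # two wide bins
fine = histogram_counts(values, 10, 0, 20)    # ten narrow bins
print(coarse)
print(fine)
```

With two bins the data looks evenly spread, while ten bins reveal the two clusters and the empty region between them.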

You've identified several outliers using the modified Z-score method in your dataset. What could be the possible reasons for their existence?

  • All of these
  • The data may have been corrupted
  • The dataset may contain measurement errors
  • The dataset may have a complex, multi-modal distribution
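Any of these reasons could lead to the existence of outliers in a dataset. Corrupted data and measurement errors introduce spurious extreme values, while a complex, multi-modal distribution can make legitimate values look extreme relative to the median.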
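
For reference, a minimal sketch of the modified Z-score method follows. It assumes the commonly used constant 0.6745 and cutoff 3.5; the function name `modified_z_scores` and the readings are hypothetical.

```python
from statistics import median


def modified_z_scores(values):
    """Modified Z-score: 0.6745 * (x - median) / MAD for each value."""
    med = median(values)
    mad = median([abs(v - med) for v in values])  # median absolute deviation
    if mad == 0:
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]


# Hypothetical sensor readings with one suspicious value.
readings = [10, 11, 12, 11, 10, 12, 11, 50]
scores = modified_z_scores(readings)

# Assumed cutoff of 3.5, a common convention rather than a fixed rule.
outliers = [v for v, z in zip(readings, scores) if abs(z) > 3.5]
print(outliers)
```

The method only flags the value; deciding whether it reflects corruption, a measurement error, or a second mode in the data still requires looking at how the value was produced.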

A high ________ suggests that data points are generally far from the mean, indicating a wide spread in the data set.

  • Mean
  • Median
  • Standard Deviation
  • Variance
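A high standard deviation suggests that data points are generally far from the mean, indicating a wide spread in the dataset. It measures the absolute variability of a distribution; the higher the spread, the higher the standard deviation. Variance also grows with spread, but it is expressed in squared units, so the standard deviation is the more direct measure of how far values lie from the mean.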
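
A short sketch with two made-up datasets shows how the standard deviation separates a narrow spread from a wide one even when the means are identical. The helper name `spread_summary` is hypothetical.

```python
from statistics import mean, pstdev


def spread_summary(values):
    """Return the mean and population standard deviation of the values."""
    return mean(values), pstdev(values)


# Hypothetical datasets with the same mean but different spread.
narrow = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

print(spread_summary(narrow))
print(spread_summary(wide))
```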

When the distribution is skewed to the right, it is referred to as _________ skewness.

  • Any of these
  • Negative
  • Positive
  • Zero
Positive skewness refers to a distribution where the right tail is longer or fatter than the left tail. In such distributions, the long tail typically pulls the mean upward, so the majority of the values (including the median and the mode) tend to be less than the mean.
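
The sign of the skewness can be checked numerically. The sketch below uses the moment-based coefficient, the third central moment divided by the second central moment raised to the power 1.5; the function name `skewness` and the data are hypothetical.

```python
from statistics import mean, median


def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 using central moments."""
    n = len(values)
    mu = mean(values)
    m2 = sum((v - mu) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mu) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical data with a long right tail, and its mirror image.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 20]
left_skewed = [-v for v in right_skewed]

print(skewness(right_skewed), mean(right_skewed), median(right_skewed))
print(skewness(left_skewed), mean(left_skewed), median(left_skewed))
```

In the right-skewed sample the single large value lifts the mean above the median, matching the pattern described above.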

The final step of the EDA process, '______,' is about presenting your conclusions in an understandable way to your audience.

  • communicating
  • concluding
  • questioning
  • wrangling
The final step of the EDA process, 'communicating,' is about presenting your conclusions in an understandable way to your audience. Insights drawn from the data are only useful if the audience can understand them, so this step deserves as much care as the analysis itself.