A high ________ suggests that data points are generally far from the mean, indicating a wide spread in the dataset.

  • Mean
  • Median
  • Standard Deviation
  • Variance
A high standard deviation suggests that data points are generally far from the mean, indicating a wide spread in the dataset. Standard deviation measures the absolute variability of a distribution in the same units as the data; the wider the spread, the higher the standard deviation.
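
As a minimal sketch of this idea, the snippet below compares two made-up samples (`tight` and `wide` are hypothetical names and values) that share the same mean but differ in spread. It uses the population standard deviation from Python's `statistics` module.

```python
import statistics

# Two hypothetical samples with the same mean (50) but very different spread.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

# Population standard deviation: square root of the mean squared deviation.
tight_sd = statistics.pstdev(tight)
wide_sd = statistics.pstdev(wide)

print("tight: mean =", statistics.mean(tight), "sd =", round(tight_sd, 2))
print("wide:  mean =", statistics.mean(wide), "sd =", round(wide_sd, 2))
```

The means are identical, so only the standard deviation reveals that the second sample is far more spread out.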

When the distribution is skewed to the right, it is referred to as _________ skewness.

  • Any of these
  • Negative
  • Positive
  • Zero
Positive skewness refers to a distribution whose right tail is longer or fatter than its left tail. In such distributions, most values sit below the mean, and the median and the mode are typically less than the mean, because the long right tail pulls the mean upward.
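
The relationship can be checked with a small sketch. The `skewness` helper below is a hypothetical function that computes the population moment coefficient of skewness (third central moment divided by the 1.5 power of the second), and `right_skewed` is an invented sample with one large value in the right tail.

```python
import statistics

def skewness(values):
    """Population moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical sample: most values are small, one value stretches the right tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]

print("skewness:", round(skewness(right_skewed), 2))
print("mode:", statistics.mode(right_skewed))
print("median:", statistics.median(right_skewed))
print("mean:", statistics.fmean(right_skewed))
```

In this sample the skewness is positive, and the mode and median both fall below the mean.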

The final step of the EDA process, '______,' is about presenting your conclusions in an understandable way to your audience.

  • communicating
  • concluding
  • questioning
  • wrangling
The final step of the EDA process, 'communicating,' is about presenting your conclusions in an understandable way to your audience. The insights drawn from the data have value only if the audience can follow them, so clear communication is crucial.

A machine learning model is overfitting on a training dataset. How could feature selection be used to address this issue?

  • By increasing the model complexity
  • By increasing the number of features
  • By reducing the number of features
  • By transforming the features
Feature selection can be used to address overfitting by reducing the number of features. Overfitting occurs when a model learns the noise in the training data, leading to poor performance on unseen data. By reducing the number of features, the complexity of the model can be reduced, which in turn can help to mitigate overfitting.

In what circumstances can the IQR method lead to incorrect detection of outliers?

  • When data has a high standard deviation
  • When data is heavily skewed or bimodal
  • When data is normally distributed
  • When data is uniformly distributed
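The IQR method can misidentify outliers in heavily skewed or bimodal distributions because its fences (conventionally Q1 - 1.5 × IQR and Q3 + 1.5 × IQR) implicitly assume a roughly symmetric, unimodal distribution. In a skewed distribution, legitimate values in the long tail are flagged as outliers. In a bimodal distribution, the quartiles can straddle both modes, which makes the IQR so wide that genuine outliers go undetected.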
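
The skewed case can be illustrated with a short simulation. This is a sketch under stated assumptions: `iqr_outliers` is a hypothetical helper, the fence multiplier `k=1.5` is the conventional choice, and the data are synthetic log-normal draws that contain no injected outliers at all.

```python
import random
import statistics

def iqr_outliers(values, k=1.5):
    """Return (low, high) lists of values outside the IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    low = [x for x in values if x < lower]
    high = [x for x in values if x > upper]
    return low, high

# Synthetic right-skewed data: every point comes from the same distribution.
rng = random.Random(0)
skewed = [rng.lognormvariate(0, 1) for _ in range(1000)]

low, high = iqr_outliers(skewed)
print("flagged below lower fence:", len(low))
print("flagged above upper fence:", len(high))
```

Every flagged point here is an ordinary draw from the long right tail, which is the kind of false detection the explanation describes.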

A potential drawback of using regression imputation is that it can underestimate the ___________.

  • Mean
  • Median
  • Mode
  • Variance
One potential drawback of regression imputation is that it can underestimate the variance. Every imputed value falls exactly on the fitted regression line, so the imputed values lack the residual scatter that real observations have, and the completed variable shows less variability than the true data.
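
A small simulation can make the effect visible. The sketch below is illustrative only: the data-generating process (`y = 2x + noise`), the choice to hide every other `y` value, and all variable names are assumptions, and the regression is a plain least-squares fit written out by hand.

```python
import random
import statistics

rng = random.Random(42)

# Synthetic data: y depends linearly on x, plus Gaussian noise.
x = [rng.uniform(0, 10) for _ in range(2000)]
y = [2.0 * xi + rng.gauss(0, 5) for xi in x]

# Pretend every other y value is missing.
missing = [i % 2 == 1 for i in range(len(x))]
obs_x = [xi for xi, m in zip(x, missing) if not m]
obs_y = [yi for yi, m in zip(y, missing) if not m]

# Ordinary least squares fit on the observed pairs only.
mx, my = statistics.fmean(obs_x), statistics.fmean(obs_y)
sxy = sum((a - mx) * (b - my) for a, b in zip(obs_x, obs_y))
sxx = sum((a - mx) ** 2 for a in obs_x)
slope = sxy / sxx
intercept = my - slope * mx

# Regression imputation: replace each missing y with its fitted value.
imputed = [slope * xi + intercept if m else yi
           for xi, yi, m in zip(x, y, missing)]

true_var = statistics.pvariance(y)
imputed_var = statistics.pvariance(imputed)
print("variance of the true y:   ", round(true_var, 2))
print("variance after imputation:", round(imputed_var, 2))
```

The imputed series has a smaller variance than the true series because the imputed half carries no noise term. Stochastic regression imputation, which adds a random residual to each fitted value, is one common way to counter this.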

To ensure that the audience doesn't misinterpret a data visualization, it's important to avoid __________.

  • Bias and misleading scales
  • Using interactive elements
  • Using more than one type of graph
  • Using too many colors
To avoid misinterpretation of a data visualization, it's essential to avoid bias and misleading scales. These could skew the representation of the data and thus lead to inaccurate conclusions.

How does feature selection contribute to model accuracy?

  • All of the above
  • By improving interpretability of the model
  • By reducing overfitting
  • By reducing the complexity of the model
Feature selection contributes to model accuracy primarily by reducing overfitting. Overfitting occurs when a model fits the training data too closely, noise included, and therefore performs poorly on unseen data. Removing irrelevant or redundant features leaves the model less noise to fit.

You have a dataset with many tied ranks. Which correlation coefficient would you prefer to use, and why?

  • Covariance
  • Kendall's Tau
  • Pearson's correlation coefficient
  • Spearman's correlation coefficient
For a dataset with many tied ranks, Kendall's Tau is the better choice. Its tau-b variant explicitly corrects for ties, so it handles tied ranks better than Spearman's correlation coefficient.
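
To show how ties enter the calculation, here is a minimal pure-Python sketch of the tau-b variant. The function name `kendall_tau_b` and the sample data are hypothetical, and the O(n²) pair loop is meant for illustration rather than large datasets.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(xs, ys):
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty))."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue          # tied in both variables: counted nowhere
        elif dx == 0:
            ties_x += 1       # tied only in x
        elif dy == 0:
            ties_y += 1       # tied only in y
        elif dx * dy > 0:
            concordant += 1   # pair ordered the same way in x and y
        else:
            discordant += 1   # pair ordered in opposite ways
    untied = concordant + discordant
    # Note: the denominator is zero if either variable is constant.
    return (concordant - discordant) / sqrt((untied + ties_x) * (untied + ties_y))

# Hypothetical ranks with several ties in both variables.
xs = [1, 2, 2, 3, 3, 3]
ys = [1, 2, 3, 3, 4, 4]
tau = kendall_tau_b(xs, ys)
print("tau-b:", round(tau, 3))
```

Pairs tied in only one variable enlarge the denominator without adding to the numerator, which is how tau-b accounts for ties instead of ignoring them.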

In a study on job satisfaction, employees with lower satisfaction scores are less likely to complete surveys. How would you categorize this missing data?

  • MAR
  • MCAR
  • NMAR
  • Not missing data
This would be NMAR (Not Missing at Random) because the missingness depends on the unobserved data itself (i.e., the job satisfaction score). If employees with lower job satisfaction are less likely to complete the survey, the missingness is related to the missing satisfaction scores.
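
The consequence of this kind of missingness can be sketched with a simulation. Everything here is assumed for illustration: scores are uniform integers from 1 to 10, and the response probability is taken to be `score / 10`, so that less satisfied employees respond less often.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical true satisfaction scores for 5000 employees (1 = low, 10 = high).
true_scores = [rng.randint(1, 10) for _ in range(5000)]

# Assumed response model: the chance of responding grows with the score itself,
# so the missingness depends on the very value that goes unobserved.
observed = [s for s in true_scores if rng.random() < s / 10]

true_mean = statistics.fmean(true_scores)
observed_mean = statistics.fmean(observed)
print("employees:", len(true_scores), "responses:", len(observed))
print("true mean score:    ", round(true_mean, 2))
print("observed mean score:", round(observed_mean, 2))
```

Because low scores are under-represented among the responses, the observed mean overstates satisfaction, and no other observed variable is available here to correct for it.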