In a Normal Distribution, approximately 95% of the data falls within _____ standard deviations of the mean.

  • 1
  • 2
  • 3
  • 4
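In a Normal Distribution, approximately 95% of the data falls within 2 standard deviations of the mean. More precisely, 2 standard deviations cover about 95.45% of the data, and exactly 95% corresponds to roughly 1.96 standard deviations.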
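
The coverage figure can be checked numerically. The sketch below uses the standard identity P(|Z| <= k) = erf(k / sqrt(2)) for a normal variable; the helper name `normal_coverage` is made up for this illustration.

```python
import math


def normal_coverage(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


# Prints roughly 0.6827, 0.9545 and 0.9973 for k = 1, 2, 3.
for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {normal_coverage(k):.4f}")
```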

How can extreme outliers impact the interpretation of the skewness of a dataset?

  • Can either increase or decrease the skewness
  • Decrease the skewness
  • Does not affect the skewness
  • Increase the skewness
The skewness of a distribution is a measure of the extent and direction of asymmetry. Extreme outliers can either increase or decrease skewness depending on which tail they lie in. Outliers far above the mean stretch the right tail and push skewness in the positive direction. Outliers far below the mean stretch the left tail and push skewness in the negative direction.
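
A minimal sketch of this effect, assuming the population skewness formula (the mean of cubed z-scores). The sample values and the `skewness` helper are illustrative, not taken from any particular dataset.

```python
import statistics


def skewness(values):
    """Population skewness: the average of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


base = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # symmetric around 5, so skewness is about 0
high_outlier = base + [100]          # one extreme value in the right tail
low_outlier = base + [-100]          # one extreme value in the left tail

print("symmetric:   ", round(skewness(base), 3))
print("high outlier:", round(skewness(high_outlier), 3))  # positive
print("low outlier: ", round(skewness(low_outlier), 3))   # negative
```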

How do outliers affect the performance of machine learning models?

  • Decrease model accuracy
  • Increase model accuracy
  • Increase model precision
  • Increase model recall
Outliers can significantly affect the performance of machine learning models, often leading to decreased accuracy. This is because they can pull the fitted model toward the anomalies rather than the underlying pattern in the data, which hurts predictions on typical observations.
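
As a rough illustration, the sketch below fits an ordinary least-squares line to points that follow y = 2x exactly, then refits after adding one corrupted point. Linear regression is only one example of an outlier-sensitive model; the data and the `fit_line` helper are made up.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]                 # follows y = 2x exactly

clean_slope, clean_intercept = fit_line(xs, ys)

# Add one corrupted observation: the pattern would give y = 12 at x = 6.
dirty_slope, dirty_intercept = fit_line(xs + [6], ys + [60])

print("slope without outlier:", clean_slope)
print("slope with outlier:   ", dirty_slope)  # pulled far away from 2
```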

How do outliers affect the standard deviation of a dataset?

  • Can either increase or decrease the standard deviation
  • Decrease the standard deviation
  • Does not affect the standard deviation
  • Increase the standard deviation
Outliers can significantly increase the standard deviation, as the standard deviation is sensitive to extreme values. This is because it is computed from the squared differences from the mean, so values far from the mean contribute disproportionately.
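
The squaring effect can be made concrete with a small made-up sample. The sketch compares the standard deviation with and without one extreme value, and measures how much of the total squared deviation that single value accounts for.

```python
import statistics

data = [10, 12, 11, 13, 12, 11, 12]
with_outlier = data + [50]            # one extreme value appended

sd_before = statistics.stdev(data)
sd_after = statistics.stdev(with_outlier)

# Share of the total squared deviation that comes from the outlier alone.
mean = statistics.fmean(with_outlier)
squared_devs = [(v - mean) ** 2 for v in with_outlier]
outlier_share = squared_devs[-1] / sum(squared_devs)

print("standard deviation without outlier:", round(sd_before, 2))
print("standard deviation with outlier:   ", round(sd_after, 2))
print("outlier share of squared deviation:", round(outlier_share, 2))
```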

_____ are used to indicate different values in a heatmap.

  • Colors
  • Lines
  • Shapes
  • Sizes
Colors are used to indicate different values in a heatmap. The color scale represents the magnitude of the variable, with different color gradients representing different value ranges.

You are given a dataset with a high number of features. The computational resources are limited. What feature selection method might you consider?

  • Backward elimination
  • Filter methods
  • Forward selection
  • Wrapper methods
Given limited computational resources, filter methods might be a good choice. These methods are less computationally expensive than wrapper methods (which include forward selection and backward elimination) because they do not train a machine learning model for each candidate feature subset. Instead, they score each feature with a statistical measure, such as its correlation with the target, and keep only the features above a chosen threshold or the top-ranked features.
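
One possible filter method, sketched under assumptions: each feature is scored by its absolute Pearson correlation with the target, and the top k features are kept. The feature names, the data, and the helpers `pearson` and `select_top_k` are hypothetical, and correlation is only one of several scores a filter method could use.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_top_k(features, target, k):
    """Rank features by |correlation with target| and keep the k best names."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


target = [1, 2, 3, 4, 5, 6]
features = {
    "strong": [2, 4, 6, 8, 10, 12],    # perfectly linear in the target
    "moderate": [1, 3, 2, 5, 4, 6],    # roughly follows the target
    "weak": [5, 1, 4, 2, 6, 3],        # almost unrelated to the target
}

selected = select_top_k(features, target, k=2)
print(selected)  # ['strong', 'moderate']
```

No model is trained at any point, which is why this kind of scoring stays cheap even with many features.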

What type of data can only take on discrete values?

  • Categorical data
  • Continuous data
  • Discrete data
  • Ordinal data
Discrete data can only take on distinct, separate values, typically counts, with no meaningful values in between. For example, the number of students in a class is discrete data: a class can have 25 or 26 students, but not 25.5.

You're working with a dataset where two features, 'age' and 'years of experience', have a high correlation. Which problem does this situation exemplify?

  • Data leakage
  • Multicollinearity
  • Overfitting
  • Underfitting
This situation exemplifies multicollinearity, a condition where two or more predictors in a multiple regression model are highly correlated. This high correlation means that 'age' and 'years of experience' provide similar information in predicting the dependent variable.
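
A quick way to see the overlap is to measure it. The sketch below computes the Pearson correlation between two made-up columns and the variance inflation factor (VIF), which for a pair of predictors reduces to 1 / (1 - r^2). The data is invented, and the common rule of thumb that a VIF above about 10 signals a problem is a convention rather than a strict cutoff.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


age = [25, 30, 35, 40, 45, 50]
years_of_experience = [2, 7, 11, 17, 21, 27]   # hypothetical values

r = pearson(age, years_of_experience)
vif = 1 / (1 - r ** 2)   # with two predictors, R^2 of one on the other is r^2

print("correlation:", round(r, 4))
print("VIF:", round(vif, 1))
```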

What kind of bias might be introduced into a model if missing data is not appropriately addressed?

  • All above.
  • Confirmation bias.
  • Observation bias.
  • Sampling bias.
Inappropriate handling of missing data can lead to sampling bias. For example, if rows with missing values are simply dropped and the values are not missing at random, the model is trained on a non-representative subset of the data, so its predictions can be biased.
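
A small made-up example of how this can happen: suppose high earners tend not to report their income. Dropping the rows with missing values then leaves a sample that understates the true average. The numbers and the reporting rule below are assumptions chosen only to show the mechanism.

```python
import statistics

# True incomes (in thousands); in practice these would not all be observed.
true_incomes = [30, 35, 40, 45, 50, 120, 150]

# Assumed missingness mechanism: incomes above 100 are not reported.
reported = [income if income <= 100 else None for income in true_incomes]

# Listwise deletion: keep only the rows that have a value.
observed = [income for income in reported if income is not None]

true_mean = statistics.fmean(true_incomes)
observed_mean = statistics.fmean(observed)

print("mean of all incomes:       ", round(true_mean, 2))
print("mean after dropping missing:", round(observed_mean, 2))
```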

What are the key components to focus on during the 'communicate' step in EDA?

  • Cleaning and transforming data
  • Ensuring the insights are effectively conveyed to relevant stakeholders
  • Only sharing the raw data
  • Reordering the EDA steps
During the communication phase of the EDA process, the key focus is to ensure that the insights, findings, or conclusions drawn from the analysis are effectively conveyed to the relevant stakeholders. This might involve presenting the insights in a simple and understandable manner, making use of visualizations, and tailoring the communication to the audience's needs and context.