In a Normal Distribution, approximately 95% of the data falls within _____ standard deviations of the mean.

  • 1
  • 2
  • 3
  • 4
In a Normal Distribution, approximately 95% of the data falls within 2 standard deviations of the mean. This is part of the empirical (68-95-99.7) rule: roughly 68% of values lie within 1 standard deviation, 95% within 2, and 99.7% within 3.
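A quick way to verify the 95% figure, as a minimal sketch using scipy.stats:

```python
from scipy.stats import norm

# Probability mass within ±2 standard deviations for a standard normal.
coverage = norm.cdf(2) - norm.cdf(-2)
print(f"P(-2 < Z < 2) = {coverage:.4f}")  # ~0.9545, i.e. approximately 95%
```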

How can extreme outliers impact the interpretation of the skewness of a dataset?

  • Can either increase or decrease the skewness
  • Decrease the skewness
  • Does not affect the skewness
  • Increase the skewness
The skewness of a distribution measures the extent and direction of its asymmetry. Extreme outliers can either increase or decrease skewness depending on which tail they fall in: outliers far above the mean stretch the right tail and increase skewness, while outliers far below the mean stretch the left tail and decrease it.
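A minimal sketch with scipy.stats.skew, using made-up data, shows the effect in both directions:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=5, size=1000)  # roughly symmetric sample

print(f"No outlier:   {skew(data):.3f}")                  # near 0
print(f"High outlier: {skew(np.append(data, 500)):.3f}")  # pushed positive
print(f"Low outlier:  {skew(np.append(data, -400)):.3f}") # pushed negative
```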

How do outliers affect the performance of machine learning models?

  • Decrease model accuracy
  • Increase model accuracy
  • Increase model precision
  • Increase model recall
Outliers can significantly affect the performance of machine learning models, often leading to decreased accuracy. This is because they can cause the model to learn from these anomalies rather than from the underlying pattern in the data.
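For instance, a single extreme target value can visibly distort an ordinary least-squares fit (a toy sketch with NumPy, using synthetic data):

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 2 * x + 1  # clean linear relationship

slope, intercept = np.polyfit(x, y, 1)
print(f"Clean fit:        slope={slope:.2f}, intercept={intercept:.2f}")

# Add one extreme outlier and refit: the line is pulled toward it.
x_out = np.append(x, 9.0)
y_out = np.append(y, 200.0)
slope, intercept = np.polyfit(x_out, y_out, 1)
print(f"Fit with outlier: slope={slope:.2f}, intercept={intercept:.2f}")
```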

How do outliers affect the standard deviation of a dataset?

  • Can either increase or decrease the standard deviation
  • Decrease the standard deviation
  • Does not affect the standard deviation
  • Increase the standard deviation
Outliers can significantly increase the standard deviation, as the standard deviation is sensitive to extreme values. This is because its calculation squares each value's deviation from the mean, so points far from the mean contribute disproportionately to the result.
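A small numeric example (hypothetical values) makes the sensitivity obvious:

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11])
with_outlier = np.append(values, 100)

print(f"Std without outlier: {values.std(ddof=1):.2f}")        # ~1.05
print(f"Std with outlier:    {with_outlier.std(ddof=1):.2f}")  # ~33.5
```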

_____ are used to indicate different values in a heatmap.

  • Colors
  • Lines
  • Shapes
  • Sizes
Colors are used to indicate different values in a heatmap. The color scale represents the magnitude of the variable, with different color gradients representing different value ranges.
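A minimal matplotlib sketch of this idea, using random data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
data = rng.random((6, 6))  # hypothetical 6x6 matrix of values

plt.imshow(data, cmap="viridis")  # the color gradient encodes magnitude
plt.colorbar(label="Value")       # legend mapping colors back to values
plt.title("Heatmap: colors indicate values")
plt.show()
```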

You are given a dataset with a high number of features. The computational resources are limited. What feature selection method might you consider?

  • Backward elimination
  • Filter methods
  • Forward selection
  • Wrapper methods
Given limited computational resources, filter methods might be a good choice. These methods are less computationally expensive than wrapper methods as they do not involve the use of any specific machine learning algorithm. Instead, they rank features based on statistical measures and remove irrelevant features based on a certain threshold or number of top features to keep.
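As one concrete example, scikit-learn's SelectKBest is a filter method that scores features statistically without training a model (a sketch on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical wide dataset: 100 features, only 10 informative.
X, y = make_classification(n_samples=200, n_features=100,
                           n_informative=10, random_state=0)

# Rank features by an ANOVA F-statistic and keep the top 10.
# No model is fit, so the cost stays low even with many features.
selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)  # (200, 100) -> (200, 10)
```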

What type of data can only take on discrete values?

  • Categorical data
  • Continuous data
  • Discrete data
  • Ordinal data
Discrete data can only take on distinct, separate values, typically produced by counting; there are no meaningful values between them. For example, the number of students in a class is discrete data: a class can have 24 or 25 students, but not 24.5.

You're working with a dataset where two features, 'age' and 'years of experience', have a high correlation. Which problem does this situation exemplify?

  • Data leakage
  • Multicollinearity
  • Overfitting
  • Underfitting
This situation exemplifies multicollinearity, a condition where two or more predictors in a multiple regression model are highly correlated. This high correlation means that 'age' and 'years of experience' provide similar information in predicting the dependent variable.
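One simple way to detect this is to check the pairwise correlation (a sketch with simulated 'age' and 'years of experience' values):

```python
import numpy as np

rng = np.random.default_rng(1)
age = rng.uniform(22, 65, size=500)
# Experience tracks age closely, with a little noise.
experience = age - 22 + rng.normal(0, 2, size=500)

r = np.corrcoef(age, experience)[0, 1]
print(f"Pearson correlation: {r:.3f}")  # close to 1.0 signals multicollinearity
```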

How is the shape of a Normal Distribution usually described?

  • Bell-shaped
  • Skewed to the left
  • Skewed to the right
  • Uniformly flat
A Normal Distribution is described as bell-shaped. It is symmetric around the mean, and most of the data falls close to the mean with fewer values further away.
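Plotting a histogram of normal samples makes the shape visible (a minimal matplotlib sketch):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
samples = rng.normal(loc=0, scale=1, size=10_000)

# The histogram traces the symmetric, bell-shaped curve.
plt.hist(samples, bins=50, density=True)
plt.title("Normal distribution: bell-shaped, symmetric about the mean")
plt.show()
```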

Suppose your machine learning model shows a significant shift in performance when transitioning from the training set to the test set. How could mishandling missing data contribute to this issue?

  • It may have caused an imbalance in the data distribution between the sets.
  • It may have caused overfitting.
  • It may have led to the model learning irrelevant patterns.
  • It may have led to underfitting.
If missing data is handled inconsistently between the training and test sets (for example, imputed with statistics computed separately on each set), the two sets can end up with different data distributions, causing the model's performance to shift between them.
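A common safeguard is to fit any imputation on the training set only and reuse the same fitted statistics on the test set. A minimal sketch with scikit-learn's SimpleImputer (hypothetical values):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
X_test = np.array([[np.nan], [10.0]])

# Fit on the training data only, then apply the SAME statistics to the
# test data so both sets are handled consistently.
imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)
X_test_filled = imputer.transform(X_test)  # uses the training mean

print(X_train_filled.ravel())  # [1.    2.    2.333 4.   ]
print(X_test_filled.ravel())   # [ 2.333 10.   ]
```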