What type of data can only take on discrete values?

  • Categorical data
  • Continuous data
  • Discrete data
  • Ordinal data
Discrete data can take only distinct, separate values, typically counts, with no meaningful values in between. For example, the number of students in a class is discrete data: a class can have 25 or 26 students, but not 25.5.

You're working with a dataset where two features, 'age' and 'years of experience', have a high correlation. Which problem does this situation exemplify?

  • Data leakage
  • Multicollinearity
  • Overfitting
  • Underfitting
This situation exemplifies multicollinearity, a condition where two or more predictors in a multiple regression model are highly correlated. This high correlation means that 'age' and 'years of experience' provide similar information in predicting the dependent variable.
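
One way to quantify this is with the Pearson correlation and the variance inflation factor (VIF). The sketch below is a minimal illustration on made-up numbers: the `age` and `years_of_experience` values are hypothetical, and the shortcut VIF = 1 / (1 - r^2) holds only when there are exactly two predictors.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sample in which experience roughly tracks age.
age = [25, 30, 35, 40, 45, 50, 55]
years_of_experience = [2, 7, 12, 19, 22, 29, 32]

r = pearson_r(age, years_of_experience)

# With exactly two predictors, the VIF of each one is 1 / (1 - r^2).
vif = 1.0 / (1.0 - r ** 2)

print(f"r = {r:.3f}, VIF = {vif:.1f}")
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity, although the cutoff is a convention rather than a strict rule.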

In a Normal Distribution, approximately 95% of the data falls within _____ standard deviations of the mean.

  • 1
  • 2
  • 3
  • 4
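In a Normal Distribution, approximately 95% of the data falls within 2 standard deviations of the mean. More precisely, about 95.4% falls within 2 standard deviations, and exactly 95% falls within about 1.96 standard deviations.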
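
The figure can be checked from the normal cumulative distribution function. As a small sketch, the helper below (the name `normal_mass_within` is just an illustrative choice) uses the identity P(|Z| <= k) = erf(k / sqrt(2)) for a standard normal variable Z.

```python
import math

def normal_mass_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {normal_mass_within(k):.4f}")
```

This reproduces the familiar 68-95-99.7 rule.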

How can extreme outliers impact the interpretation of the skewness of a dataset?

  • Can either increase or decrease the skewness
  • Decrease the skewness
  • Does not affect the skewness
  • Increase the skewness
The skewness of a distribution is a measure of the extent and direction of its asymmetry. Extreme outliers can either increase or decrease skewness depending on which tail they lie in. Outliers far above the mean (right tail) push skewness in the positive direction, increasing it. Outliers far below the mean (left tail) push it in the negative direction, decreasing it.
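
The effect of the tail can be seen with a small sketch. It uses the population skewness (third central moment divided by the cubed standard deviation); the sample values are made up for illustration.

```python
def skewness(values):
    """Population skewness: third central moment over the standard deviation cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

base = [8, 9, 10, 11, 12]        # symmetric around 10
high_outlier = base + [40]       # extreme value in the right tail
low_outlier = base + [-20]       # extreme value in the left tail

print("symmetric:   ", skewness(base))
print("high outlier:", skewness(high_outlier))
print("low outlier: ", skewness(low_outlier))
```

The symmetric sample has zero skewness, the high outlier makes it positive, and the low outlier makes it negative.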

How do outliers affect the performance of machine learning models?

  • Decrease model accuracy
  • Increase model accuracy
  • Increase model precision
  • Increase model recall
Outliers can significantly affect the performance of machine learning models, often leading to decreased accuracy. This is because the model may fit these anomalies rather than the underlying pattern in the data.
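
As a rough illustration, the sketch below fits an ordinary least-squares line to made-up data twice, once clean and once with a single corrupted target value. Linear regression is only one example of a model that is sensitive to outliers; the data and the `fit_line` helper are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

x = [1, 2, 3, 4, 5, 6]
y_clean = [2, 4, 6, 8, 10, 12]       # exactly y = 2x
y_outlier = [2, 4, 6, 8, 10, 60]     # last target value corrupted

slope_clean, intercept_clean = fit_line(x, y_clean)
slope_outlier, intercept_outlier = fit_line(x, y_outlier)

print("clean fit:  ", slope_clean, intercept_clean)
print("outlier fit:", slope_outlier, intercept_outlier)
```

A single extreme point pulls the fitted slope far away from 2, so predictions for the remaining points get worse.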

How do outliers affect the standard deviation of a dataset?

  • Can either increase or decrease the standard deviation
  • Decrease the standard deviation
  • Does not affect the standard deviation
  • Increase the standard deviation
Outliers can significantly increase the standard deviation, as the standard deviation is sensitive to extreme values. This is because it is computed from the squared differences from the mean, so values far from the mean contribute disproportionately.
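
A quick check with the standard library shows the size of the effect. The numbers below are made up, and `pstdev` computes the population standard deviation.

```python
import statistics

data = [10, 12, 11, 13, 12]
data_with_outlier = data + [50]   # one extreme value added

sd_without = statistics.pstdev(data)
sd_with = statistics.pstdev(data_with_outlier)

print(f"without outlier: {sd_without:.2f}")
print(f"with outlier:    {sd_with:.2f}")
```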

_____ are used to indicate different values in a heatmap.

  • Colors
  • Lines
  • Shapes
  • Sizes
Colors are used to indicate different values in a heatmap. The color scale represents the magnitude of the variable, with different color gradients representing different value ranges.

You are given a dataset with a high number of features. The computational resources are limited. What feature selection method might you consider?

  • Backward elimination
  • Filter methods
  • Forward selection
  • Wrapper methods
Given limited computational resources, filter methods might be a good choice. These methods are less computationally expensive than wrapper methods (which include forward selection and backward elimination) because they do not train a machine learning model for each candidate feature subset. Instead, they rank features by a statistical measure and discard those that fall below a threshold or outside a chosen number of top features.
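
A minimal sketch of one filter method: score each feature by its absolute Pearson correlation with the target and keep the top k. The feature names, data, and the `select_top_k` helper are hypothetical; other scores (variance, chi-squared, mutual information) follow the same pattern.

```python
import math

def abs_correlation(x, y):
    """Absolute Pearson correlation between a feature column and the target."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return abs(cov) / math.sqrt(var_x * var_y)

def select_top_k(features, target, k):
    """Rank features by score and return the k best names plus all scores."""
    scores = {name: abs_correlation(col, target) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Hypothetical dataset: three candidate features, one target.
features = {
    "f_signal": [1, 2, 3, 4, 5, 6],
    "f_weak": [2, 1, 4, 3, 6, 5],
    "f_noise": [3, 1, 2, 3, 1, 2],
}
target = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8]

selected, scores = select_top_k(features, target, k=2)
print("selected:", selected)
```

No model is trained at any point, which is why the cost grows only linearly with the number of features.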

What are the potential risks associated with incorrectly assuming that data are MCAR when they are actually MAR?

  • Bias in parameter estimates
  • Both underestimation of standard errors and bias in parameter estimates
  • No potential risks
  • Underestimation of standard errors
If data are incorrectly assumed to be MCAR when they are actually MAR, it can lead to both underestimation of standard errors and bias in parameter estimates, leading to inaccurate analyses and conclusions.
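
The bias part of this can be illustrated with a small simulation. This is a sketch on synthetic data under assumed missingness probabilities: `x` is always observed, `y` is sometimes missing, and the analyst simply averages the observed `y` values, which is only justified under MCAR. It does not demonstrate the effect on standard errors.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

n = 20000
x = [random.uniform(0, 10) for _ in range(n)]   # always observed
y = [xi + random.gauss(0, 1) for xi in x]       # may go missing

true_mean = sum(y) / n

# MCAR: every y value has the same 45% chance of being missing.
mcar_observed = [yi for yi in y if random.random() >= 0.45]

# MAR: the chance that y is missing depends only on the observed x.
def missing_probability(xi):
    return 0.8 if xi > 5 else 0.1

mar_observed = [
    yi for xi, yi in zip(x, y)
    if random.random() >= missing_probability(xi)
]

mcar_mean = sum(mcar_observed) / len(mcar_observed)
mar_mean = sum(mar_observed) / len(mar_observed)

print(f"true mean of y:          {true_mean:.2f}")
print(f"complete-case mean MCAR: {mcar_mean:.2f}")
print(f"complete-case mean MAR:  {mar_mean:.2f}")
```

Under MCAR the complete-case mean stays close to the true mean. Under MAR it is pulled toward the low values, because high values of `y` are more likely to be missing.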

You notice that the data from some weather sensors is missing because the sensors malfunctioned when the temperature dropped below a certain level. What type of missing data does this represent?

  • MAR
  • MCAR
  • NMAR
  • Not missing data
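This would be MAR (Missing at Random) because the missingness is related to an observed variable (the temperature). The missingness is not completely random, but it depends only on observed values, not on the unobserved values themselves. This assumes the temperature is still recorded from another source; if the temperature readings themselves were the missing values, the mechanism would be NMAR instead.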