In what way does improper handling of missing data affect the generalization capability of a model?

  • Depends on the amount of missing data.
  • Hampers generalization.
  • Improves generalization.
  • No effect on generalization.
Improper handling of missing data can lead to the model learning incorrect or misleading patterns from the data. This can hamper the model's ability to generalize well to unseen data.

In a box plot, outliers are typically represented as ______.

  • boxes
  • dots
  • lines
  • whiskers
In a box plot, outliers are typically represented as dots or points that fall outside the whiskers of the box.
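As a minimal sketch (assuming matplotlib is available; the sample values are made up for illustration), the code below draws a box plot in which two extreme values appear as individual dots beyond the whiskers:

```python
import matplotlib.pyplot as plt

# Small sample with two extreme values that will fall outside the whiskers
values = [10, 12, 11, 13, 12, 11, 14, 13, 12, 40, 45]

fig, ax = plt.subplots()
# Outliers ("fliers") are rendered as individual dot markers beyond the whiskers
ax.boxplot(values, flierprops={"marker": "o", "markerfacecolor": "red"})
ax.set_title("Outliers shown as dots beyond the whiskers")
plt.show()
```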

What type of data typically requires more complex statistical methods for analysis?

  • Categorical data
  • Continuous data
  • Discrete data
  • Ordinal data
Continuous data usually requires more complex statistical methods because it can take on any value within a range. Analyzing it often involves techniques such as regression, hypothesis testing, and more advanced graphical summaries.

You have a dataset with missing values and you've chosen to use multiple imputation. However, the results after applying multiple imputation are not as expected. What factors might be causing this?

  • Both too few and too many imputations
  • The model used for imputation is perfect
  • Too few imputations
  • Too many imputations
If too few imputations are used in multiple imputation, the results may be unreliable: standard errors tend to be underestimated, which leads to incorrect statistical inference. Increasing the number of imputations generally produces more accurate and stable results.
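A minimal sketch of the idea, assuming scikit-learn's IterativeImputer with `sample_posterior=True` as the imputation engine (the data and the choice of 20 imputations are illustrative only): each pass produces a different completed dataset, and the spread between passes is exactly what too few imputations fails to capture.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [np.nan, 8.0], [5.0, 10.0]])

n_imputations = 20  # with only 2-3 imputations, the uncertainty estimate is unreliable
completed = []
for seed in range(n_imputations):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed.append(imputer.fit_transform(X))

# Pool across imputations: the between-imputation spread reflects imputation uncertainty
stacked = np.stack(completed)
print("pooled estimates:\n", stacked.mean(axis=0))
print("between-imputation std:\n", stacked.std(axis=0))
```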

In ____________, different models are used to estimate the missing values based on observed data.

  • Mean Imputation
  • Mode-based Imputation
  • Model-based Imputation
  • Multiple Imputation
This process is called model-based imputation. Different statistical models are used to estimate the missing values based on the observed (non-missing) data.
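For example, a simple model-based imputation might fit a regression on the complete rows and predict the missing values from the other columns. This is only a sketch assuming pandas and scikit-learn; the column names and figures are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "income": [30000, 42000, np.nan, 78000, np.nan],
})

observed = df[df["income"].notna()]
missing = df[df["income"].isna()]

# Fit a model on the observed rows, then predict (impute) the missing values
model = LinearRegression().fit(observed[["age"]], observed["income"])
df.loc[df["income"].isna(), "income"] = model.predict(missing[["age"]])
print(df)
```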

Imagine you are working with a data set that includes survey responses on a 1-5 scale (1=Very Unsatisfied, 5=Very Satisfied). How would you classify this data type?

  • Continuous data
  • Interval data
  • Nominal data
  • Ordinal data
This type of data is ordinal because the ratings exist on an arbitrary scale where the rank order (1-5) is significant, but the precise numerical differences between the scale values are not.
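In pandas, for instance, such responses can be encoded as an ordered categorical, which preserves the ranking without pretending the numeric gaps are meaningful (a sketch with made-up responses):

```python
import pandas as pd

responses = pd.Series([3, 5, 1, 4, 4, 2])

# Ordered categorical: the 1-5 ranking matters, but the gaps between levels
# are not assumed to be equal, so arithmetic like a mean is not meaningful
satisfaction = pd.Categorical(responses, categories=[1, 2, 3, 4, 5], ordered=True)
print(satisfaction.min(), satisfaction.max())            # order-based operations are valid
print(pd.Series(satisfaction).value_counts(sort=False))  # frequencies per level
```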

What is the importance of understanding data distributions in Exploratory Data Analysis?

  • All of the above
  • It helps in identifying the right statistical tests to apply
  • It helps in spotting outliers and anomalies
  • It helps in understanding the underlying structure of the data
Understanding data distributions is fundamental in Exploratory Data Analysis. It aids in understanding the structure of data, identifying outliers, formulating hypotheses, and selecting appropriate statistical tests.
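A brief sketch of what this looks like in practice (assuming pandas and matplotlib; the simulated income data is illustrative): summary statistics and a histogram quickly reveal skew, spread, and candidate outliers, which in turn guide the choice of tests.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.lognormal(mean=10, sigma=0.5, size=1000)})

# Summary statistics and skewness describe the shape of the distribution
print(df["income"].describe())
print("skewness:", df["income"].skew())

# A histogram makes the right-skew (and potential outliers) visible at a glance
df["income"].plot.hist(bins=30, title="Income distribution")
plt.show()
```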

What is the importance of the 'explore' step in the EDA process?

  • To analyze and investigate the data
  • To clean and transform data
  • To communicate the results
  • To pose initial questions
The 'explore' step in the EDA process is crucial as it involves the analysis and investigation of the cleaned and transformed data, using statistical techniques and visualization methods. This stage helps uncover patterns, trends, relationships, and anomalies in the data, and aids in forming or refining hypotheses.

What is the main purpose of data normalization in machine learning?

  • To ensure numerical stability and bring features to a comparable scale
  • To increase the accuracy of the model
  • To increase the size of the dataset
  • To reduce the computation time
Data normalization is a technique often applied as part of data preprocessing in machine learning to bring features to a comparable scale. This is done to ensure numerical stability during the computation and to prevent certain features from dominating others due to their scale.
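As a minimal sketch (assuming scikit-learn; the age/income numbers are invented), min-max normalization rescales each feature to [0, 1] so that the large-magnitude feature no longer dominates:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales: age (tens) vs. income (tens of thousands)
X = np.array([[25, 30000], [32, 42000], [47, 61000], [51, 78000]], dtype=float)

# Each feature is rescaled to [0, 1], so no feature dominates purely by magnitude
scaled = MinMaxScaler().fit_transform(X)
print(scaled)
```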

You have applied mean imputation to a dataset where values are missing not at random. What kind of bias might you have unintentionally introduced, and why?

  • Confirmation bias
  • Overfitting bias
  • Selection bias
  • Underfitting bias
If you apply mean imputation to a dataset where values are missing not at random, you may unintentionally introduce selection bias. Mean imputation ignores the reasons behind the missingness, so it can underestimate the variability in the data and systematically skew the imputed values.
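A tiny numeric sketch of the effect (illustrative values only, assuming pandas): when the higher incomes are the ones missing, filling with the observed mean both shifts the estimate downward and shrinks the variance.

```python
import numpy as np
import pandas as pd

# Suppose the highest incomes are the ones that went unreported (missing not at random)
income = pd.Series([20, 25, 30, 35, np.nan, np.nan, np.nan], dtype=float)

imputed = income.fillna(income.mean())
print("mean used for imputation:", income.mean())   # computed only from the lower values
print("std before imputation:", round(income.std(), 2),
      "std after imputation:", round(imputed.std(), 2))  # variability is underestimated
```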

Under what circumstances might 'removal' of outliers lead to biased results?

  • When outliers are a result of data duplication
  • When outliers are due to data collection errors
  • When outliers are extreme but legitimate data points
  • When outliers do not significantly impact the analysis
Removing outliers can lead to biased results when the outliers are extreme but legitimate data points, as they could represent important aspects of the phenomenon being studied.

You are given a dataset with a significant amount of outliers. Which scaling method would be most suitable and why?

  • None, outliers should always be removed
  • Min-Max scaling because it scales all values between 0 and 1
  • Robust scaling because it is not affected by outliers
  • Z-score standardization because it reduces skewness
Robust scaling would be the most suitable as it uses the median and the interquartile range, which are not sensitive to outliers, for scaling the data. Other methods like Min-Max and Z-score are affected by the presence of outliers.
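A short comparison sketch (assuming scikit-learn; the single extreme value is artificial) shows why: with one outlier, min-max scaling squashes the ordinary points toward zero, while robust scaling, based on the median and IQR, keeps their spread usable.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# One extreme outlier (1000) in an otherwise small-valued feature
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# Robust scaling centers on the median and divides by the IQR, so it is
# largely unaffected by the outlier
print("robust: ", RobustScaler().fit_transform(X).ravel())

# Min-max uses the full range, so the outlier compresses the ordinary points
print("min-max:", MinMaxScaler().fit_transform(X).ravel())
```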