Using the ________ method for handling outliers, extreme values are grouped together and treated as a single entity.

  • Binning
  • Imputation
  • Removal
  • Transformation
The binning method groups extreme values (outliers) into a single bin and treats them as one entity, typically by replacing each value in the bin with a summary statistic of that bin such as the mean, median, or mode.
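
As a minimal sketch of this idea, the snippet below collects values outside the 1.5 × IQR fences into a low bin and a high bin and replaces each with its bin's median. The fence rule, the function name `bin_outliers`, and the sample data are assumptions for illustration; the question itself does not prescribe how outliers are detected.

```python
from statistics import median, quantiles

def bin_outliers(values):
    """Replace values outside the 1.5*IQR fences with the median of their bin."""
    # Quartiles of the data; the fence rule is an illustrative assumption.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Group the extreme values on each side into one bin.
    low_bin = [v for v in values if v < low_fence]
    high_bin = [v for v in values if v > high_fence]
    low_rep = median(low_bin) if low_bin else None
    high_rep = median(high_bin) if high_bin else None

    # Every member of a bin is replaced by that bin's summary statistic.
    result = []
    for v in values:
        if v < low_fence:
            result.append(low_rep)
        elif v > high_fence:
            result.append(high_rep)
        else:
            result.append(v)
    return result

# Hypothetical sample: ten ordinary readings and two extreme ones.
data = [10, 12, 11, 13, 12, 11, 12, 10, 13, 11, 95, 105]
binned = bin_outliers(data)
print(binned)
```

Here the two extreme readings end up sharing one replacement value, while the ordinary readings are left untouched.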

How does the number of imputations affect the accuracy of multiple imputation?

  • More imputations, less accuracy
  • More imputations, more accuracy
  • Number of imputations doesn't affect accuracy
  • Only one imputation is needed for full accuracy
The number of imputations directly affects the accuracy of multiple imputation. More imputations reduce the simulation error in the pooled estimates, so accuracy improves, but with diminishing returns beyond a certain point. Although the exact number depends on the proportion and nature of the missing data, 20 to 100 imputations are often recommended in the literature.
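
The diminishing returns can be illustrated with Rubin's relative-efficiency formula, 1 / (1 + γ/m), where γ is the fraction of missing information and m is the number of imputations. The sketch below is only an illustration: the value γ = 0.3 and the function name are assumptions, not figures from the question.

```python
def relative_efficiency(fraction_missing_info, m):
    """Efficiency of m imputations relative to infinitely many (Rubin's formula)."""
    return 1.0 / (1.0 + fraction_missing_info / m)

# Hypothetical fraction of missing information.
gamma = 0.3
efficiencies = {m: relative_efficiency(gamma, m) for m in (1, 5, 20, 100)}

for m, eff in efficiencies.items():
    print(f"m={m:>3}: relative efficiency = {eff:.4f}")
```

Efficiency rises with every added imputation but approaches 1 ever more slowly, which is why recommendations level off rather than asking for as many imputations as possible.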

In data analysis, EDA stands for _______.

  • Empirical Data Assessment
  • Exploratory Data Analysis
  • Exponential Data Analysis
  • Expressive Data Assimilation
In data analysis, EDA stands for Exploratory Data Analysis. It is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

Can multiple imputation be applied when data are missing completely at random (MCAR)?

  • No
  • Only if data is numerical
  • Only in rare cases
  • Yes
Yes, multiple imputation can be applied when data are missing completely at random (MCAR). It is a flexible method that works under MCAR and MAR (missing at random), and it can be extended to NMAR (not missing at random) data, although that case requires explicitly modelling the missingness mechanism.

You're in the 'explore' phase of the EDA process and you notice a potential error back in the 'wrangle' phase. How should you proceed?

  • Conclude the analysis with the current data.
  • Go back to the wrangling phase to correct the error.
  • Ignore the error and continue with the exploration.
  • Inform the stakeholders about the error.
If you notice a potential error in the 'wrangle' phase while you are in the 'explore' phase, you should go back to the 'wrangle' phase to correct the error. Ensuring the accuracy and quality of the data during the 'wrangle' phase is crucial for the validity of the insights drawn in subsequent phases.

What is the impact on training time if missing data is incorrectly handled in a large dataset?

  • Decreases dramatically.
  • Depends on the specific dataset.
  • Increases dramatically.
  • Remains largely the same.
If missing data is not handled correctly, particularly in a large dataset, the training time can increase significantly. This is because the model might struggle to learn from the distorted data, requiring more time to try to fit the data.

The _______ method of feature selection involves removing features one by one until the removal of further features decreases model accuracy.

  • Backward elimination
  • Forward selection
  • Recursive feature elimination
  • Stepwise selection
The backward elimination method of feature selection involves removing features one by one until the removal of further features decreases model accuracy. This process starts with a model trained on all features and iteratively removes the least important feature until the overall model performance declines.
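
The loop described above can be sketched as follows. This is a minimal illustration, not a reference implementation: `score` stands for any caller-supplied evaluation function (in practice often cross-validated accuracy), and `toy_score` and the feature names are hypothetical stand-ins so the example runs without a real model.

```python
def backward_elimination(features, score):
    """Drop one feature at a time for as long as the score does not get worse."""
    selected = list(features)
    best = score(selected)
    while len(selected) > 1:
        # Score every subset that leaves out exactly one remaining feature.
        candidates = [
            (score([f for f in selected if f != drop]), drop)
            for drop in selected
        ]
        cand_score, drop = max(candidates)
        if cand_score < best:
            break  # every possible removal now hurts, so stop here
        selected.remove(drop)
        best = cand_score
    return selected

# Hypothetical scoring: two informative features, noise features cost a little.
USEFUL = {"age": 0.30, "income": 0.25}

def toy_score(subset):
    gain = sum(USEFUL.get(f, 0.0) for f in subset)
    penalty = 0.01 * sum(1 for f in subset if f not in USEFUL)
    return 0.5 + gain - penalty

kept = backward_elimination(["age", "income", "noise_a", "noise_b"], toy_score)
print(kept)
```

With this toy score the two noise features are removed first, and the loop stops once dropping either informative feature would lower the score.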

High degrees of Multicollinearity can inflate the _________ of the estimated regression coefficients.

  • Bias
  • Distribution
  • Efficiency
  • Variance
High degrees of multicollinearity can inflate the variance of the estimated regression coefficients. This means that the coefficients become highly sensitive to minor changes in the data or the model specification, which can make them unreliable and difficult to interpret.
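
One common way to quantify this inflation is the variance inflation factor (VIF). For a model with exactly two predictors it reduces to 1 / (1 − r²), where r is the Pearson correlation between them. The sketch below uses that special case; the helper names and the sample values are assumptions for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def vif_two_predictors(x1, x2):
    """VIF for a regression with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical predictors: one nearly collinear with x1, one only weakly related.
x1 = [1, 2, 3, 4, 5, 6]
x2_collinear = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1]
x2_weak = [2, 5, 1, 6, 3, 4]

vif_high = vif_two_predictors(x1, x2_collinear)
vif_low = vif_two_predictors(x1, x2_weak)
print(f"nearly collinear: VIF = {vif_high:.1f}")
print(f"weakly related:   VIF = {vif_low:.2f}")
```

A VIF near 1 means the coefficient variance is barely affected, while a large VIF signals that the variance is inflated many times over by the overlap between predictors.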

In the context of data visualization, what is a pairplot primarily used for?

  • Comparing multiple variables at once
  • Showing the correlation between two variables
  • Visualizing the distribution of a single variable
  • Visualizing the relationship between two variables
A pairplot is primarily used for comparing multiple variables at once. It creates a grid of scatter plots, one for each pair of variables, which helps in understanding the relationships between all variables.

Which category of missing data implies that the probability of missingness is related to the observed data?

  • MAR
  • MCAR
  • NMAR
  • nan
MAR, which stands for Missing At Random, implies that the probability of missingness is related to the observed data but not to the missing values themselves.
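
The difference from MCAR can be seen in a small simulation. In the sketch below, an income value goes missing either with a fixed probability (MCAR) or with a probability that depends on the observed age (MAR). The variable names, the age cutoff of 60, and all probabilities are assumptions chosen for illustration.

```python
import random

def simulate(n=10_000, seed=0):
    """Return ages plus MCAR and MAR missingness flags for a second variable."""
    rng = random.Random(seed)
    ages = [rng.randint(20, 80) for _ in range(n)]
    # MCAR: every income value has the same 30% chance of being missing.
    mcar_missing = [rng.random() < 0.30 for _ in ages]
    # MAR: the chance depends on the observed age, not on income itself.
    mar_missing = [rng.random() < (0.60 if age >= 60 else 0.10) for age in ages]
    return ages, mcar_missing, mar_missing

def missing_rate(ages, missing, older):
    """Share of missing values among people aged 60+ (older=True) or under 60."""
    flags = [m for a, m in zip(ages, missing) if (a >= 60) == older]
    return sum(flags) / len(flags)

ages, mcar, mar = simulate()
for label, flags in (("MCAR", mcar), ("MAR", mar)):
    young = missing_rate(ages, flags, older=False)
    old = missing_rate(ages, flags, older=True)
    print(f"{label}: under 60 = {young:.2f}, 60 and over = {old:.2f}")
```

Under MCAR the two age groups show roughly the same missing rate, whereas under MAR the rate differs sharply by age, an observed variable.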