In data analysis, EDA stands for _______.

  • Empirical Data Assessment
  • Exploratory Data Analysis
  • Exponential Data Analysis
  • Expressive Data Assimilation
In data analysis, EDA stands for Exploratory Data Analysis. It is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

How does the number of imputations affect the accuracy of multiple imputation?

  • More imputations, less accuracy
  • More imputations, more accuracy
  • Number of imputations doesn't affect accuracy
  • Only one imputation is needed for full accuracy
The number of imputations affects the accuracy of multiple imputation: more imputations yield more stable and more accurate estimates, with diminishing returns beyond a point. The right number depends on the proportion and nature of the missing data; the literature often recommends 20 to 100 imputations.
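
The diminishing returns can be illustrated with Rubin's relative-efficiency formula, 1 / (1 + gamma / m), where gamma is the fraction of missing information and m is the number of imputations. The sketch below is only an illustration: the function name and the value gamma = 0.3 are assumptions chosen for the example, not values taken from the question.

```python
def relative_efficiency(gamma: float, m: int) -> float:
    """Efficiency of m imputations relative to infinitely many.

    gamma is the fraction of missing information (assumed known here).
    """
    return 1.0 / (1.0 + gamma / m)


# gamma = 0.3 is an illustrative assumption.
for m in (1, 5, 20, 100):
    print(m, round(relative_efficiency(0.3, m), 4))
```

The efficiency rises quickly for the first few imputations and then flattens, which is why adding imputations helps only up to a point.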

Using the ________ method for handling outliers, extreme values are grouped together and treated as a single entity.

  • Binning
  • Imputation
  • Removal
  • Transformation
The binning method groups extreme values (outliers) together and treats them as a single entity by replacing them with a summary statistic of the group, such as its mean, median, or mode.
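
A minimal sketch of this idea, assuming fixed cut-offs chosen by the analyst and the median as the summary statistic. The function name `bin_outliers`, the thresholds, and the sample data are all hypothetical.

```python
from statistics import median


def bin_outliers(values, lower, upper):
    """Replace values outside [lower, upper] with the median of their group."""
    low_group = [v for v in values if v < lower]
    high_group = [v for v in values if v > upper]
    low_rep = median(low_group) if low_group else None
    high_rep = median(high_group) if high_group else None

    result = []
    for v in values:
        if v < lower:
            result.append(low_rep)    # all low outliers share one value
        elif v > upper:
            result.append(high_rep)   # all high outliers share one value
        else:
            result.append(v)          # inliers are left untouched
    return result


data = [12, 14, 15, 13, 95, 110, 16, -40]
binned = bin_outliers(data, lower=0, upper=50)
print(binned)  # → [12, 14, 15, 13, 102.5, 102.5, 16, -40]
```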

What is the relationship between the Z-score of a data point and its distance from the mean?

  • The Z-score is independent of the distance from the mean
  • The higher the Z-score, the closer the data point is to the mean
  • The higher the Z-score, the further the data point is from the mean
  • The lower the Z-score, the further the data point is from the mean
The higher the Z-score (in absolute value), the further the data point is from the mean, measured in standard deviations. A Z-score of 0 indicates that the data point is equal to the mean.
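
As a small illustration, the Z-score can be computed as (x - mean) / standard deviation. The sketch below assumes the population standard deviation and uses made-up sample data.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return the Z-score of every value, using the population std. dev."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# For this sample the mean is 5 and the population standard deviation is 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]
for value, z in zip(data, z_scores(data)):
    print(value, z)
```

The value 9 gets a Z-score of 2.0 and the value 5 gets 0.0: the larger the magnitude of the Z-score, the larger the distance from the mean.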

The process of replacing each missing data point with a set of plausible values creating multiple complete data sets is known as ____________.

  • Mean Imputation
  • Mode Imputation
  • Multiple Imputation
  • Regression Imputation
This process is called multiple imputation. It generates several plausible imputed datasets, analyzes each one separately, and combines the results to produce the final analysis.
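
The sketch below shows the overall shape of the procedure on a single variable: create m completed datasets, estimate the mean in each, and pool the estimates with Rubin's rules. It is a deliberate simplification: missing values are filled by random draws from the observed values, which ignores parameter uncertainty and is not what a full implementation would do. The function names and sample data are hypothetical.

```python
import random
from statistics import mean, variance


def multiple_impute(values, m=20, seed=0):
    """Create m completed copies; None marks a missing value.

    Simplifying assumption: each missing value is replaced by a random
    draw from the observed values.
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    return [
        [v if v is not None else rng.choice(observed) for v in values]
        for _ in range(m)
    ]


def pool_means(datasets):
    """Pool the per-dataset means with Rubin's rules."""
    m = len(datasets)
    estimates = [mean(d) for d in datasets]
    within = mean(variance(d) / len(d) for d in datasets)   # avg. sampling variance
    between = variance(estimates)                           # variance across imputations
    total_variance = within + (1 + 1 / m) * between
    return mean(estimates), total_variance


data = [1.0, None, 3.0, 4.0, None, 6.0]
completed = multiple_impute(data, m=10)
pooled_mean, pooled_variance = pool_means(completed)
print(pooled_mean, pooled_variance)
```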

You are dealing with a dataset where outliers significantly affect the mean of the distribution but not the median. What approach would you suggest to handle these outliers?

  • Binning
  • Removal
  • Transformation
In this case, a transformation such as a log or square root transformation might be suitable. These transformations pull in high values, thereby reducing their impact on the mean.
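
A quick way to see the effect, using made-up data with one extreme value. The sketch compares the gap between mean and median, relative to the median, before and after a log transformation; `math.log1p` (the log of 1 + x) is used here as one possible choice.

```python
import math
from statistics import mean, median

# Illustrative data: five typical values and one extreme value.
data = [10, 12, 11, 13, 12, 500]
logged = [math.log1p(v) for v in data]

# Relative gap between mean and median, before and after the transform.
gap_raw = (mean(data) - median(data)) / median(data)
gap_log = (mean(logged) - median(logged)) / median(logged)

print("raw:", mean(data), median(data), gap_raw)
print("log:", mean(logged), median(logged), gap_log)
```

On the raw scale the outlier drags the mean far above the median; after the transformation the two are much closer.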

High degrees of Multicollinearity can inflate the _________ of the estimated regression coefficients.

  • Bias
  • Distribution
  • Efficiency
  • Variance
High degrees of multicollinearity can inflate the variance of the estimated regression coefficients. This means that the coefficients become highly sensitive to minor changes in the model or the data, which can make them unreliable and difficult to interpret.
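
The size of this inflation is commonly measured with the variance inflation factor, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others. With only two predictors, R² is the squared Pearson correlation between them, which is the case sketched below. The function names and data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


x1 = [1, 2, 3, 4, 5]
x2_collinear = [2.1, 3.9, 6.2, 7.8, 10.1]   # almost exactly 2 * x1
x2_unrelated = [2, 1, 3, 1, 2]              # zero correlation with x1

print(vif_two_predictors(x1, x2_collinear))  # large: strong inflation
print(vif_two_predictors(x1, x2_unrelated))  # close to 1: no inflation
```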

The _______ method of feature selection involves removing features one by one until the removal of further features decreases model accuracy.

  • Backward elimination
  • Forward selection
  • Recursive feature elimination
  • Stepwise selection
The backward elimination method of feature selection involves removing features one by one until the removal of further features decreases model accuracy. This process starts with a model trained on all features and iteratively removes the least important feature until the overall model performance declines.
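
The loop can be sketched as follows. The scoring function is an assumption: in practice it would train and validate a model on the given feature subset, whereas here a toy function stands in for it. All names (`backward_elimination`, `toy_score`, the feature names) are hypothetical.

```python
def backward_elimination(features, score_fn):
    """Drop one feature at a time until any further removal lowers the score."""
    selected = list(features)
    best_score = score_fn(selected)
    while len(selected) > 1:
        # Score every subset obtained by dropping exactly one feature.
        candidates = [
            (score_fn([f for f in selected if f != drop]), drop)
            for drop in selected
        ]
        candidate_score, drop = max(candidates)
        if candidate_score < best_score:
            break  # every possible removal hurts the score: stop
        selected.remove(drop)
        best_score = candidate_score
    return selected, best_score


def toy_score(subset):
    """Stand-in for model accuracy: rewards useful features, penalizes noise."""
    useful = {"age", "income"}
    hits = sum(1 for f in subset if f in useful)
    noise = sum(1 for f in subset if f not in useful)
    return hits - 0.1 * noise


selected, score = backward_elimination(
    ["age", "income", "noise_a", "noise_b"], toy_score
)
print(selected, score)  # → ['age', 'income'] 2.0
```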

What is the impact on training time if missing data is incorrectly handled in a large dataset?

  • Decreases dramatically.
  • Depends on the specific dataset.
  • Increases dramatically.
  • Remains largely the same.
If missing data is not handled correctly, particularly in a large dataset, the training time can increase significantly. This is because the model might struggle to learn from the distorted data, requiring more time to try to fit the data.

In a model-based imputation, the choice of the model has a direct impact on the ____________ of the imputation process.

  • Accuracy
  • All of the above
  • Complexity
  • Time
The choice of the model in a model-based imputation method directly affects the accuracy of the imputation process. If the chosen model does not accurately reflect the true data generation process, the imputed values may be biased, leading to incorrect conclusions.

In a leptokurtic distribution, the kurtosis value is ___________ than 0.

  • Any of these
  • Equal
  • Greater
  • Less
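A leptokurtic distribution has kurtosis greater than 0 when kurtosis is measured as excess kurtosis (that is, raw kurtosis greater than 3, the value for a normal distribution). This indicates a sharper peak and fatter tails compared to a normal distribution.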
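
For illustration, excess kurtosis can be computed from the second and fourth central moments as m4 / m2² - 3. The sketch below uses the simple population (biased) estimator and made-up samples.

```python
from statistics import mean


def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3."""
    mu = mean(values)
    n = len(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


heavy_tailed = [0] * 18 + [-10, 10]   # most mass at the center, two extremes
flat = [-2, -1, 0, 1, 2]              # evenly spread values

print(excess_kurtosis(heavy_tailed))  # positive: leptokurtic
print(excess_kurtosis(flat))          # negative: platykurtic
```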

In a positively skewed distribution, which is greater: mean or median?

  • Both are equal
  • Mean
  • Median
In a positively skewed distribution, the mean is generally greater than the median. Positive skewness means that the distribution has a long right tail, so extreme values in the positive direction pull the mean upwards more than the median.
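
A small check on made-up data with a long right tail. Skewness is computed here as the moment coefficient m3 / m2^1.5, which is one common definition; the function name is hypothetical.

```python
from statistics import mean, median


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5."""
    mu = mean(values)
    n = len(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


data = [1, 2, 2, 3, 3, 3, 4, 20]   # one large value forms the right tail

print(skewness(data))   # positive
print(mean(data))       # → 4.75
print(median(data))     # → 3.0
```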