In data analysis, EDA stands for _______.

Empirical Data Assessment
Exploratory Data Analysis
Exponential Data Analysis
Expressive Data Assimilation

In data analysis, EDA stands for Exploratory Data Analysis. It is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

How does the number of imputations affect the accuracy of multiple imputation?

More imputations, less accuracy
More imputations, more accuracy
Number of imputations doesn't affect accuracy
Only one imputation is needed for full accuracy

The number of imputations affects the accuracy of multiple imputation: more imputations reduce the simulation (Monte Carlo) error in the pooled estimates, with diminishing returns beyond a certain point. The right number depends on the proportion and nature of the missing data; the literature often recommends 20 to 100 imputations.

Using the ________ method for handling outliers, extreme values are grouped together and treated as a single entity.

Binning
Imputation
Removal
Transformation

The binning method groups extreme values (outliers) together and treats them as a single entity, replacing them with a summary statistic of the bin such as its mean, median, or mode.
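As a minimal sketch of this idea, the snippet below groups values outside a chosen range into a low bin and a high bin and replaces each outlier with the median of its bin. The function name `bin_outliers`, the cutoffs, and the sample data are assumptions made for illustration, not part of the question.

```python
import statistics

def bin_outliers(values, lower, upper):
    """Replace values outside [lower, upper] with the median of their bin.

    `lower` and `upper` are hypothetical cutoffs chosen by the analyst.
    """
    low_bin = [x for x in values if x < lower]
    high_bin = [x for x in values if x > upper]
    low_rep = statistics.median(low_bin) if low_bin else None
    high_rep = statistics.median(high_bin) if high_bin else None

    result = []
    for x in values:
        if x < lower:
            result.append(low_rep)    # all low outliers share one value
        elif x > upper:
            result.append(high_rep)   # all high outliers share one value
        else:
            result.append(x)          # inliers are left untouched
    return result

data = [10, 12, 11, 13, 12, 95, 105, 11]  # illustrative values
smoothed = bin_outliers(data, lower=0, upper=50)
```

Here the two extreme values 95 and 105 land in the same bin and are both replaced by that bin's median.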

What is the relationship between the Z-score of a data point and its distance from the mean?

The Z-score is independent of the distance from the mean
The higher the Z-score, the closer the data point is to the mean
The higher the Z-score, the further the data point is from the mean
The lower the Z-score, the further the data point is from the mean

The higher the Z-score (in absolute value), the further the data point is from the mean, measured in standard deviations. A Z-score of 0 indicates that the data point is equal to the mean.
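A short sketch of the calculation, assuming the population standard deviation is used and with sample data invented for illustration:

```python
import statistics

def z_scores(values):
    """Return the Z-score of each value: (x - mean) / standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (an assumption)
    return [(x - mean) / std for x in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values: mean 5, std 2
scores = z_scores(data)
```

In this example the value 5 sits at the mean and gets a Z-score of 0, while 9 is two standard deviations above the mean and 2 is one and a half standard deviations below it.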

The process of replacing each missing data point with a set of plausible values creating multiple complete data sets is known as ____________.

Mean Imputation
Mode Imputation
Multiple Imputation
Regression Imputation

This process is called multiple imputation. It generates several different plausible imputed datasets and the results from these are combined to produce the final analysis.
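The sketch below illustrates the idea for estimating a mean. It is deliberately simplified: missing values are drawn from a normal distribution fitted to the observed values, whereas a full procedure would also account for uncertainty in those fitted parameters. The function name, the data, and the choice of `m` are assumptions. The pooling step follows Rubin's rules: the pooled estimate is the average across imputed datasets, and the total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```python
import random
import statistics

def multiple_imputation_mean(observed, n_missing, m=20, seed=0):
    """Pool an estimate of the mean over m imputed datasets (simplified sketch)."""
    rng = random.Random(seed)
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    n = len(observed) + n_missing

    estimates, variances = [], []
    for _ in range(m):
        # Fill each missing slot with a plausible value, giving one complete dataset.
        imputed = [rng.gauss(mu, sigma) for _ in range(n_missing)]
        completed = observed + imputed
        estimates.append(statistics.mean(completed))
        variances.append(statistics.variance(completed) / n)  # variance of the mean

    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)
    between = statistics.variance(estimates)
    total_var = within + (1 + 1 / m) * between  # Rubin's rules
    return pooled, total_var

observed = [4.1, 5.0, 5.6, 4.8, 5.3, 4.9, 5.2, 4.7]  # illustrative values
pooled, total_var = multiple_imputation_mean(observed, n_missing=3, m=20)
```

Raising `m` makes the pooled estimate and its variance less dependent on the particular random draws.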

You are dealing with a dataset where outliers significantly affect the mean of the distribution but not the median. What approach would you suggest to handle these outliers?

Binning
Removal
Transformation

In this case, a transformation such as a log or square root transformation might be suitable. These transformations pull in high values, thereby reducing their impact on the mean.
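To see the effect, the sketch below compares how far the mean sits from the median before and after a log transformation. The data are invented for illustration, and the log requires strictly positive values.

```python
import math
import statistics

def relative_gap(values):
    """Distance between mean and median, expressed as a fraction of the median."""
    median = statistics.median(values)
    return (statistics.mean(values) - median) / median

data = [12, 15, 14, 13, 16, 15, 400]    # illustrative values; 400 is the outlier
logged = [math.log(x) for x in data]    # log transform (positive values only)

raw_gap = relative_gap(data)
log_gap = relative_gap(logged)
```

On the raw scale the single outlier drags the mean to several times the median; after the log transformation the mean and median are much closer together.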

High degrees of multicollinearity can inflate the _________ of the estimated regression coefficients.

Bias
Distribution
Efficiency
Variance

High degrees of multicollinearity can inflate the variance of the estimated regression coefficients. This means that the coefficients become highly sensitive to minor changes in the model or the data, which can make them unreliable and difficult to interpret.
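One common way to quantify this inflation is the variance inflation factor (VIF). The sketch below covers only the two-predictor case, where the VIF reduces to 1 / (1 - r^2) with r the Pearson correlation between the predictors. The helper names and the data are assumptions for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

x1 = [1, 2, 3, 4, 5, 6]
x2_collinear = [1.1, 2.0, 2.9, 4.2, 5.1, 5.9]  # nearly a copy of x1
x2_unrelated = [3, 1, 4, 1, 5, 2]              # weakly related to x1

vif_high = vif_two_predictors(x1, x2_collinear)
vif_low = vif_two_predictors(x1, x2_unrelated)
```

A VIF close to 1 means little inflation, while the nearly collinear pair yields a VIF in the hundreds.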

The _______ method of feature selection involves removing features one by one until the removal of further features decreases model accuracy.

Backward elimination
Forward selection
Recursive feature elimination
Stepwise selection

The backward elimination method of feature selection involves removing features one by one until the removal of further features decreases model accuracy. This process starts with a model trained on all features and iteratively removes the least important feature until the overall model performance declines.
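The loop can be sketched as follows. The scoring function here is a hypothetical stand-in for training a model on a feature subset and measuring its validation accuracy; the feature names and scores are invented for illustration.

```python
def backward_elimination(features, score_fn):
    """Drop one feature at a time until any further removal lowers the score."""
    selected = list(features)
    best_score = score_fn(selected)
    while len(selected) > 1:
        # Score each subset obtained by removing a single feature.
        candidates = [
            (score_fn([f for f in selected if f != drop]), drop)
            for drop in selected
        ]
        cand_score, drop = max(candidates)
        if cand_score < best_score:
            break  # every possible removal hurts accuracy, so stop
        selected.remove(drop)
        best_score = cand_score
    return selected, best_score

# Hypothetical scorer: useful features add accuracy, every feature costs a little.
USEFUL = {"age": 0.30, "income": 0.25}

def toy_score(subset):
    return 0.4 + sum(USEFUL.get(f, 0.0) for f in subset) - 0.01 * len(subset)

selected, score = backward_elimination(
    ["age", "income", "noise_a", "noise_b"], toy_score
)
```

With this toy scorer the two noise features are removed first, and the loop stops once dropping either remaining feature would reduce the score.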

What is the impact on training time if missing data is incorrectly handled in a large dataset?

Decreases dramatically.
Depends on the specific dataset.
Increases dramatically.
Remains largely the same.

If missing data is handled incorrectly, particularly in a large dataset, training time can increase significantly. The model may struggle to learn from the distorted data and need more time to fit it.

In a model-based imputation, the choice of the model has a direct impact on the ____________ of the imputation process.

Accuracy
All of the above
Complexity
Time

The choice of the model in a model-based imputation method directly affects the accuracy of the imputation process. If the chosen model does not accurately reflect the true data generation process, the imputed values may be biased, leading to incorrect conclusions.

In a leptokurtic distribution, the excess kurtosis value is ___________ than 0.

Any of these
Equal
Greater
Less

A leptokurtic distribution has excess kurtosis greater than 0 (that is, kurtosis greater than 3, the value for a normal distribution), indicating a sharper peak and fatter tails compared to a normal distribution.
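As a rough illustration, the sketch below computes population excess kurtosis (the fourth central moment divided by the squared variance, minus 3) for two invented datasets: one with a few extreme values and one spread evenly.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

heavy_tailed = [0] * 18 + [-10, 10]  # mostly at the center, two extreme values
flat = list(range(1, 10))            # evenly spread values 1..9

k_heavy = excess_kurtosis(heavy_tailed)
k_flat = excess_kurtosis(flat)
```

The heavy-tailed sample gives a positive value (leptokurtic), while the evenly spread sample gives a negative one (platykurtic).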

In a positively skewed distribution, which is greater: mean or median?

Both are equal
Mean
Median

In a positively skewed distribution, the mean is generally greater than the median. Positive skewness means that the distribution has a long right tail, so extreme values in the positive direction pull the mean upwards more than the median.
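A small check on invented data with a long right tail, using the moment-based skewness (third central moment divided by the variance raised to the power 1.5):

```python
import statistics

def skewness(values):
    """Population skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

right_skewed = [20, 22, 25, 27, 30, 32, 35, 120]  # illustrative values

mean_value = statistics.mean(right_skewed)
median_value = statistics.median(right_skewed)
skew_value = skewness(right_skewed)
```

The single large value lifts the mean well above the median, and the skewness comes out positive.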
