The process of replacing each missing data point with a set of plausible values, thereby creating multiple complete data sets, is known as ____________.
- Mean Imputation
- Mode Imputation
- Multiple Imputation
- Regression Imputation
This process is called multiple imputation. It generates several plausible completed datasets, analyzes each one separately, and pools the results into a single final estimate.
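The idea can be sketched with the standard library alone. This is a simplified illustration, not a full implementation: it assumes a single numeric variable, fills each gap with a random draw from a normal distribution fitted to the observed values, and pools only the point estimate. The function name `multiply_impute` and the sample data are hypothetical.

```python
import random
import statistics

def multiply_impute(values, m=5, seed=0):
    """Return m completed copies of `values`, filling each None with a random draw.

    Assumption: draws come from a normal distribution fitted to the observed
    values. A full procedure would also reflect uncertainty in that fit.
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    completed = []
    for _ in range(m):
        completed.append(
            [v if v is not None else rng.gauss(mu, sigma) for v in values]
        )
    return completed

data = [4.1, None, 5.0, 4.7, None, 5.3, 4.4]
datasets = multiply_impute(data, m=5)

# Analyze each completed dataset separately, then pool the estimates.
estimates = [statistics.mean(d) for d in datasets]
pooled_mean = statistics.mean(estimates)
print(pooled_mean)
```

Because each copy receives different draws, the spread of `estimates` reflects the uncertainty caused by the missing values, which a single imputation would hide.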
What is the relationship between the Z-score of a data point and its distance from the mean?
- The Z-score is independent of the distance from the mean
- The higher the Z-score, the closer the data point is to the mean
- The higher the Z-score, the further the data point is from the mean
- The lower the Z-score, the further the data point is from the mean
The higher the Z-score in absolute value, the further the data point is from the mean, measured in standard deviations. A Z-score of 0 indicates that the data point equals the mean, and a negative Z-score places the data point below the mean.
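As a small worked example, the Z-score of a value is its deviation from the mean divided by the standard deviation. The sketch below assumes the population standard deviation and uses made-up data; the helper name `z_scores` is hypothetical.

```python
import statistics

def z_scores(values):
    """Return (value - mean) / population standard deviation for each value."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, population standard deviation 2
scores = z_scores(data)

for value, z in zip(data, scores):
    # abs(z) is the distance from the mean in standard deviations
    print(value, z, abs(z))
```

The values 5 sit exactly at the mean and get a Z-score of 0, while 9 and 2 are the furthest from the mean and have the largest absolute Z-scores.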
Using the ________ method for handling outliers, extreme values are grouped together and treated as a single entity.
- Binning
- Imputation
- Removal
- Transformation
The binning method involves grouping extreme values (outliers) together and treating them as a single entity, typically by replacing them with a summary statistic of the group such as its mean, median, or mode.
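A minimal sketch of this approach, assuming the cut-off points are chosen by the analyst and that each extreme group is replaced by its own median. The function name `bin_extremes`, the thresholds, and the data are hypothetical.

```python
import statistics

def bin_extremes(values, lower, upper):
    """Replace values outside [lower, upper] with the median of their group."""
    low_bin = [v for v in values if v < lower]
    high_bin = [v for v in values if v > upper]
    low_rep = statistics.median(low_bin) if low_bin else None
    high_rep = statistics.median(high_bin) if high_bin else None

    result = []
    for v in values:
        if v < lower:
            result.append(low_rep)     # whole low group becomes one value
        elif v > upper:
            result.append(high_rep)    # whole high group becomes one value
        else:
            result.append(v)           # regular values are left untouched
    return result

ages = [23, 25, 31, 29, 27, 95, 110, 102]
binned = bin_extremes(ages, lower=0, upper=90)
print(binned)
```

Here the three values above 90 form a single bin and are all replaced by that bin's median, so they no longer act as separate extreme points.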
How does the number of imputations affect the accuracy of multiple imputation?
- More imputations, less accuracy
- More imputations, more accuracy
- Number of imputations doesn't affect accuracy
- Only one imputation is needed for full accuracy
The number of imputations directly affects the accuracy of multiple imputation. More imputations reduce the simulation error of the pooled estimates and so make them more accurate, but with diminishing returns beyond a certain point. Although the exact number depends on the proportion and nature of the missing data, 20 to 100 imputations are often recommended in the literature.
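One way to see the diminishing returns is Rubin's relative-efficiency approximation, 1 / (1 + gamma / m), where m is the number of imputations and gamma is the fraction of missing information. The sketch below assumes a gamma of 0.3 purely for illustration.

```python
def relative_efficiency(m, gamma):
    """Efficiency of m imputations relative to infinitely many (Rubin's approximation)."""
    return 1 / (1 + gamma / m)

gamma = 0.3  # hypothetical fraction of missing information

for m in (1, 5, 20, 100):
    print(m, round(relative_efficiency(m, gamma), 3))
```

Efficiency rises quickly over the first few imputations and then flattens out as it approaches 1. Larger values of m are mainly recommended to stabilize standard errors and p-values, which this simple formula does not capture.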
In data analysis, EDA stands for _______.
- Empirical Data Assessment
- Exploratory Data Analysis
- Exponential Data Analysis
- Expressive Data Assimilation
In data analysis, EDA stands for Exploratory Data Analysis. It is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
Can multiple imputation be applied when data are missing completely at random (MCAR)?
- No
- Only if data is numerical
- Only in rare cases
- Yes
Yes, multiple imputation can be applied when data are missing completely at random (MCAR). It is a flexible method that gives valid results under MCAR and MAR (missing at random). It can also be used when data are NMAR (not missing at random), but only if the imputation model explicitly accounts for the missingness mechanism; otherwise the estimates may be biased.
You're in the 'explore' phase of the EDA process and you notice a potential error back in the 'wrangle' phase. How should you proceed?
- Conclude the analysis with the current data.
- Go back to the wrangling phase to correct the error.
- Ignore the error and continue with the exploration.
- Inform the stakeholders about the error.
If you notice a potential error in the 'wrangle' phase while you are in the 'explore' phase, you should go back to the 'wrangle' phase to correct the error. Ensuring the accuracy and quality of the data during the 'wrangle' phase is crucial for the validity of the insights drawn in subsequent phases.
How can outliers influence the mean of a dataset?
- Can either increase or decrease the mean
- Decrease the mean
- Does not affect the mean
- Increase the mean
Outliers can have a large impact on the mean because every value contributes to it. Depending on whether the outlier is much higher or much lower than the other values, it can significantly increase or decrease the mean, pulling it away from the typical values in the dataset.
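A quick check with made-up numbers shows the effect in both directions. The median is included only for comparison, as a statistic that is far less sensitive to a single extreme value.

```python
import statistics

base = [10, 12, 11, 13, 12]
with_high = base + [100]     # one unusually large value
with_low = base + [-100]     # one unusually small value

mean_base = statistics.mean(base)
mean_high = statistics.mean(with_high)
mean_low = statistics.mean(with_low)

print(mean_base, mean_high, mean_low)
print(statistics.median(base), statistics.median(with_high), statistics.median(with_low))
```

The high outlier raises the mean well above every regular value, and the low outlier drags it below all of them, while the median barely moves.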
In a positively skewed distribution, which is greater: mean or median?
- Both are equal
- Mean
- Median
In a positively skewed distribution, the mean is generally greater than the median. Positive skewness means that the distribution has a long right tail, so extreme values in the positive direction pull the mean upwards more than the median.
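The relationship can be checked on a small right-skewed sample. The data are made up, and the helper `skewness` is a hypothetical name for the third standardized moment computed with population formulas.

```python
import statistics

def skewness(values):
    """Third standardized moment (population version)."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    n = len(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / n

data = [1, 2, 2, 3, 3, 3, 4, 20]   # long right tail caused by the value 20

mean_value = statistics.mean(data)
median_value = statistics.median(data)

print(skewness(data) > 0)          # positive skew
print(mean_value, median_value)    # the mean is pulled above the median
```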
In a leptokurtic distribution, the excess kurtosis value is ___________ than 0.
- Any of these
- Equal
- Greater
- Less
A leptokurtic distribution has excess kurtosis greater than 0 (that is, kurtosis greater than 3, the value for a normal distribution), indicating fatter tails and often a sharper peak compared to a normal distribution.
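Excess kurtosis is the fourth standardized moment minus 3, so that a normal distribution scores 0. The sketch below uses population moments and two made-up samples; the function name `excess_kurtosis` is hypothetical.

```python
import statistics

def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (population version)."""
    mu = statistics.mean(values)
    n = len(values)
    m2 = sum((v - mu) ** 2 for v in values) / n   # second central moment
    m4 = sum((v - mu) ** 4 for v in values) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

heavy_tailed = [-10, -1, 0, 0, 0, 0, 1, 10]   # most mass at the center, rare extremes
flat = [1, 2, 3, 4, 5]                        # evenly spread values

print(excess_kurtosis(heavy_tailed))   # positive: leptokurtic
print(excess_kurtosis(flat))           # negative: platykurtic
```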