What could be the implications of a high degree of skewness for statistical inference?
- A high degree of skewness implies a high degree of kurtosis.
- A high degree of skewness may bias the statistical inference.
- A high degree of skewness may reduce the standard deviation.
- Skewness does not impact statistical inference.
A high degree of skewness can bias statistical inference because the extreme values in the long tail pull the mean away from the bulk of the data. Many statistical techniques assume an approximately normal distribution, so strong skewness can violate their assumptions and lead to incorrect conclusions.
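As a rough illustration of how a long tail moves the mean, the sketch below computes a moment-based skewness coefficient for a small right-skewed sample. The `incomes` values and the `sample_skewness` helper are made up for this example and are not taken from any particular library.

```python
import statistics

def sample_skewness(values):
    """Moment-based skewness: mean cubed deviation divided by the cubed population SD."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)

# Hypothetical right-skewed sample: most values are small, one is very large.
incomes = [20, 22, 23, 25, 26, 28, 30, 32, 35, 240]

skew = sample_skewness(incomes)
mean_income = statistics.fmean(incomes)
median_income = statistics.median(incomes)

print(f"skewness = {skew:.2f}")
print(f"mean = {mean_income:.1f}, median = {median_income:.1f}")
```

The skewness is positive and the mean sits well above the median, so a procedure built around the mean would describe a "typical" value that almost no observation has.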
Which plot can be considered a combination of a box plot and a rotated kernel density plot?
- Histogram
- Line plot
- Scatter plot
- Violin plot
A violin plot can be considered a combination of a box plot and a rotated kernel density plot. It therefore shows summary statistics and the overall shape of the distribution in a single figure.
What does the acronym MCAR stand for in the context of missing data?
- Missing Coefficient At Random
- Missing Completely And Regularly
- Missing Completely At Random
- Missing Conditionally At Random
MCAR stands for Missing Completely At Random. This occurs when the probability of missing data on a variable is unrelated to any other measured variable and is also unrelated to the variable itself.
Under what conditions might a model-based method be preferred over other imputation methods?
- When a known and well-fitting model can be assumed for the data
- When the amount of missing data is negligible
- When the data is missing completely at random
- When the data is not missing at random
A model-based method might be preferred over other imputation methods when a known and well-fitting model can be assumed for the data. Model-based imputation is a principled way of handling missing data under the assumption that the data follow a specific statistical model, such as linear regression or logistic regression.
When the outlier is a result of a data entry error, the best approach would often be ________.
- Binning
- Imputation
- Removal
- Transformation
When an outlier is the result of a data entry error, it carries no meaningful information about the quantity being measured, so removing it is usually the most appropriate approach.
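A minimal sketch of such a removal step is shown below, assuming the errors can be recognised with a simple plausibility check. The `ages` list and the limits `MIN_AGE` and `MAX_AGE` are hypothetical and chosen only for illustration.

```python
# Hypothetical survey ages; 250 and -3 are typing mistakes, not real observations.
ages = [34, 29, 41, 250, 38, -3, 52, 47]

# Assumed plausibility limits for a human age, chosen for this illustration.
MIN_AGE, MAX_AGE = 0, 120

# Keep values inside the plausible range and record what was dropped.
cleaned = [a for a in ages if MIN_AGE <= a <= MAX_AGE]
removed = [a for a in ages if not (MIN_AGE <= a <= MAX_AGE)]

print("kept:", cleaned)
print("removed:", removed)
```

Keeping the removed values in a separate list makes the cleaning step auditable, which is useful when it is unclear whether an extreme value is an error or a genuine observation.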
You are working with a dataset in which participants declined to answer sensitive questions due to personal discomfort. How would you classify this type of missing data?
- MAR
- MCAR
- NMAR
- Not missing data
This is an example of NMAR (Not Missing at Random) because the probability of missingness depends on the unobserved data itself (i.e., the sensitive information that participants chose not to provide).
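The difference between MCAR and NMAR can be sketched with a small simulation. Everything below is an assumption made for illustration: the simulated scores, the 30% skip rate, and the hypothetical `skip_probability` rule in which respondents with high values are more likely to withhold an answer.

```python
import random
import statistics

random.seed(0)

# Hypothetical sensitive variable, where high values are uncomfortable to report.
true_values = [random.gauss(50, 10) for _ in range(10_000)]

# MCAR: every respondent skips the question with the same 30% probability.
mcar_observed = [v for v in true_values if random.random() > 0.30]

# NMAR: the chance of skipping depends on the unreported value itself.
def skip_probability(value):
    return 0.60 if value > 55 else 0.10

nmar_observed = [v for v in true_values if random.random() > skip_probability(v)]

true_mean = statistics.fmean(true_values)
mcar_mean = statistics.fmean(mcar_observed)
nmar_mean = statistics.fmean(nmar_observed)

print(f"true mean          = {true_mean:.2f}")
print(f"observed under MCAR = {mcar_mean:.2f}")
print(f"observed under NMAR = {nmar_mean:.2f}")
```

Under MCAR the observed mean stays close to the true mean, while under this NMAR rule the observed mean is pulled downward because high values are under-represented among the answers.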
Suppose you're working on a dataset with missing values distributed randomly throughout. What issues might you encounter when using pairwise deletion?
- All of the above
- It can reduce power
- It could inflate correlations
- It might cause inconsistency in results
When missing values are distributed randomly throughout the dataset, pairwise deletion can lead to inconsistencies in results, inflate correlations, and reduce power. This is because each analysis is computed on a different subset of cases, so sample sizes vary from one statistic to the next and the resulting estimates may not be consistent with one another.
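The "different subsets" point can be made concrete with a small sketch. The records below are hypothetical, `None` marks a missing value, and `complete_pairs` is an illustrative helper rather than a library function.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"a": 1.0, "b": 2.0, "c": None},
    {"a": 2.0, "b": None, "c": 1.0},
    {"a": 3.0, "b": 4.0, "c": 3.0},
    {"a": None, "b": 5.0, "c": 2.0},
    {"a": 5.0, "b": 7.0, "c": None},
    {"a": 6.0, "b": 6.0, "c": 5.0},
]

def complete_pairs(rows, x, y):
    """Pairwise deletion: keep rows where both x and y are present."""
    return [(r[x], r[y]) for r in rows if r[x] is not None and r[y] is not None]

# Number of usable cases for each pair of variables.
pair_counts = {
    (x, y): len(complete_pairs(rows, x, y))
    for x, y in [("a", "b"), ("a", "c"), ("b", "c")]
}

# Listwise deletion, for comparison: keep only rows with no missing value at all.
listwise_n = sum(all(v is not None for v in r.values()) for r in rows)

print(pair_counts)
print("listwise n =", listwise_n)
```

Each pair of variables ends up with its own set of cases, so statistics computed for different pairs do not describe the same respondents, which is the source of the inconsistency described above.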
What could be potential drawbacks of using regression imputation?
- Can lead to an underestimation of errors
- Can lead to biased results if relationships between variables are non-linear
- Does not handle missing values
- No drawbacks
A potential drawback of regression imputation is that it can lead to an underestimation of errors or variances. This happens because it fills in missing values using a deterministic function (the fitted regression), so every imputed value falls exactly on the regression line and the inherent uncertainty associated with the missing values is not accounted for.
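The variance shrinkage can be seen in a short simulation. This is only a sketch under stated assumptions: the data are simulated from a linear model with noise, 40% of `y` is made missing completely at random, and the regression is fitted by hand with ordinary least squares on the complete cases.

```python
import random
import statistics

random.seed(1)

# Hypothetical data: y depends linearly on x plus noise.
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
y_true = [2 * xi + random.gauss(0, 2) for xi in x]

# Make 40% of y missing completely at random.
missing = [random.random() < 0.40 for _ in range(n)]

# Fit ordinary least squares on the complete cases only.
x_obs = [xi for xi, m in zip(x, missing) if not m]
y_obs = [yi for yi, m in zip(y_true, missing) if not m]
x_bar, y_bar = statistics.fmean(x_obs), statistics.fmean(y_obs)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x_obs, y_obs)) / sum(
    (xi - x_bar) ** 2 for xi in x_obs
)
intercept = y_bar - slope * x_bar

# Deterministic regression imputation: each missing y is placed exactly on the fitted line.
y_imputed = [
    (intercept + slope * xi) if m else yi
    for xi, yi, m in zip(x, y_true, missing)
]

var_true = statistics.variance(y_true)
var_imputed = statistics.variance(y_imputed)

print(f"variance of the full data      = {var_true:.2f}")
print(f"variance after imputation      = {var_imputed:.2f}")
```

The imputed dataset has a smaller variance than the full data because the imputed points carry no residual noise. Stochastic regression imputation or multiple imputation are common ways to restore that uncertainty.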
Which measure of central tendency is most affected by outliers in the data set?
- All of them
- Mean
- Median
- Mode
The mean (the arithmetic average) is the measure of central tendency most affected by outliers in a data set. It is computed from every value in the data set, so extreme values (outliers) can shift it substantially, whereas the median and mode are largely unaffected.
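A small example makes the contrast visible. The numbers are hypothetical; the functions come from Python's standard `statistics` module.

```python
import statistics

# Hypothetical sample, then the same sample with one extreme value added.
values = [10, 11, 12, 12, 13]
with_outlier = values + [1000]

summary = {}
for label, data in [("original", values), ("with outlier", with_outlier)]:
    summary[label] = (
        statistics.mean(data),
        statistics.median(data),
        statistics.mode(data),
    )
    print(label, summary[label])
```

Adding the single value 1000 moves the mean from 11.6 to roughly 176.3, while the median and the mode both stay at 12.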
The data missingness mechanism that could lead to the most bias if not addressed properly is __________.
- All missing data
- MAR
- MCAR
- NMAR
The NMAR (Not Missing at Random) mechanism could lead to the most bias if not addressed properly, because the missingness is related to the unobserved data itself.