A Variance Inflation Factor (VIF) greater than 5 indicates a high degree of _______ among the predictors.
- correlation
- distribution
- multicollinearity
- variance
A VIF greater than 5 is often taken as an indication of high multicollinearity among the predictors in a regression model. This could lead to imprecise and unreliable estimates of the regression coefficients.
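The VIF of a predictor is 1/(1 − R²), where R² comes from regressing that predictor on all the others. A minimal sketch with numpy and synthetic data (all variable names here are illustrative):

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress X[:, j] on the other columns, return 1/(1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add an intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)                  # unrelated predictor
X = np.column_stack([x1, x2, x3])
# vif(X, 0) is far above 5 (x1 is almost determined by x2);
# vif(X, 2) is close to 1 (x3 carries no redundant information)
```

In practice one would typically use a packaged implementation (e.g. statsmodels offers one), but the calculation itself is just this auxiliary regression.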
How does the 'elbow method' help in determining the optimal number of clusters in K-means clustering?
- By calculating the average distance between all pairs of clusters
- By comparing the silhouette scores for different numbers of clusters
- By creating a dendrogram of clusters
- By finding the point in the plot of within-cluster sum of squares where the decrease rate sharply shifts
The elbow method involves plotting the within-cluster sum of squares (WCSS) as a function of the number of clusters and picking the 'elbow' of the curve as the number of clusters to use. This elbow is the point beyond which adding another cluster no longer reduces the WCSS significantly, suggesting that further clusters add little explanatory value.
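The idea can be sketched with a tiny Lloyd's-algorithm k-means on synthetic data (a simplified toy implementation, not a production one): WCSS falls steeply up to the true cluster count, then flattens.

```python
import numpy as np

def kmeans_wcss(X, k, n_iter=50, restarts=5):
    """Plain Lloyd's algorithm; best (lowest) WCSS over a few random restarts."""
    best = np.inf
    for seed in range(restarts):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
        for _ in range(n_iter):
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = X[labels == j].mean(axis=0)
        # final assignment and within-cluster sum of squares
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        best = min(best, ((X - centers[labels]) ** 2).sum())
    return best

rng = np.random.default_rng(42)
blobs = [rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in [(0, 0), (5, 5), (0, 5)]]
X = np.vstack(blobs)                       # three well-separated clusters
wcss = [kmeans_wcss(X, k) for k in range(1, 7)]
# The drop from k=2 to k=3 is large; from k=3 onward the curve flattens: the elbow
```

Plotting `wcss` against k makes the elbow at k = 3 visible by eye.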
The bin width (and thus number of categories or ranges) in a histogram can dramatically affect the ________, skewness, and appearance of the histogram.
- Interpretation
- Mean
- Median
- Mode
The bin width and the number of bins in a histogram can dramatically affect the interpretation, skewness, and overall appearance of the histogram. This is because the choice of bin size can influence the level of detail visible in the histogram, potentially either obscuring or highlighting certain patterns in the data.
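A quick numerical illustration with a synthetic bimodal sample: very wide bins can hide the two modes entirely, while narrower bins reveal the gap between them.

```python
import numpy as np

rng = np.random.default_rng(0)
# Bimodal sample: two normal modes centered at -2 and +2
data = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])

coarse, _ = np.histogram(data, bins=2, range=(-4, 4))   # looks roughly uniform
fine, _ = np.histogram(data, bins=30, range=(-4, 4))    # two clear peaks, empty middle
# coarse -> two nearly equal counts (the bimodality is invisible)
# fine   -> the bins around 0 are almost empty, exposing the two modes
```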
In PCA, if two variables are similar, they will have _______ loadings on the same component.
- high
- low
- opposite
- random
In PCA, if two variables are similar or highly correlated, they will have high loadings on the same component. This is because PCA identifies the directions (Principal Components) in which the data varies the most, and similar variables will contribute to this variance in the same way.
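This can be checked numerically: give two variables a shared underlying driver, standardize, and read the first component's loadings off the SVD. A minimal sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
a = rng.normal(size=n)
x1 = a + rng.normal(scale=0.2, size=n)  # x1 and x2 share the same driver
x2 = a + rng.normal(scale=0.2, size=n)
x3 = rng.normal(size=n)                 # unrelated variable
X = np.column_stack([x1, x2, x3])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each column

# Principal directions via SVD; rows of Vt are the component loadings
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
pc1 = Vt[0] * np.sign(Vt[0, 0])  # fix the arbitrary sign for readability
# x1 and x2 load heavily (and nearly equally) on PC1; x3 barely contributes
```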
What is the impact of heteroscedasticity on a multiple linear regression model?
- It affects the linearity of the model
- It affects the normality of the residuals
- It causes multicollinearity
- It invalidates the statistical inferences that could be made from the model
Heteroscedasticity, or non-constant variance of the error term, can invalidate statistical inferences that could be made from the model because it violates one of the assumptions of multiple linear regression. This could lead to inefficient estimation of the regression coefficients and incorrect standard errors, which in turn affects confidence intervals and hypothesis tests.
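A Monte Carlo sketch makes the problem concrete: with error spread growing in x, the classical (constant-variance) standard error of the slope understates the slope's actual sampling variability. All numbers here are illustrative synthetic choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 2000
x = np.linspace(0.1, 3, n)
A = np.column_stack([np.ones(n), x])      # design matrix with intercept
slopes, naive_ses = [], []
for _ in range(reps):
    eps = rng.normal(scale=x ** 2)        # error spread grows with x
    y = 2 * x + eps
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    s2 = resid @ resid / (n - 2)          # classical constant-variance estimate
    cov = s2 * np.linalg.inv(A.T @ A)
    slopes.append(beta[1])
    naive_ses.append(np.sqrt(cov[1, 1]))

true_sd = np.std(slopes)    # actual sampling variability of the slope
avg_naive = np.mean(naive_ses)  # what the classical formula reports
# Here true_sd exceeds avg_naive, so confidence intervals and p-values
# built from the classical standard error are too optimistic
```

Robust (heteroscedasticity-consistent) standard errors or weighted least squares are the usual remedies.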
What is the impact of data transformation on the decision to use non-parametric tests?
- A suitable data transformation may make it possible to use a parametric test
- Data transformation always leads to non-parametric tests
- Data transformation always makes data normally distributed
- Data transformation does not affect the choice between parametric and non-parametric tests
A suitable data transformation may make it possible to use a parametric test instead of a non-parametric test. Transformations can help to stabilize variances, normalize the data, or linearize relationships between variables, allowing for the use of parametric tests that might have more statistical power.
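For example, a log transformation turns a right-skewed lognormal sample into a (by construction) normal one, as a simple skewness calculation shows:

```python
import numpy as np

def skewness(a):
    """Sample skewness: third standardized moment."""
    a = np.asarray(a, dtype=float)
    return ((a - a.mean()) ** 3).mean() / a.std() ** 3

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0, sigma=1, size=2000)  # heavily right-skewed
transformed = np.log(sample)                        # exactly normal by construction
# skewness(sample) is large and positive; skewness(transformed) is near 0,
# so a t-test on the transformed data becomes defensible
```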
What is the z-value associated with a 95% confidence interval in a standard normal distribution?
- 1.64
- 1.96
- 2
- 2.33
The z-value associated with a 95% confidence interval in a standard normal distribution is approximately 1.96, since a two-sided 95% interval leaves 2.5% of the probability in each tail. This means that we are 95% confident that the true population parameter lies within 1.96 standard errors of the sample mean.
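The value can be recovered from the inverse normal CDF in the Python standard library; the sample numbers in the interval construction below are purely illustrative:

```python
from statistics import NormalDist
import math

z = NormalDist().inv_cdf(0.975)  # two-sided 95% leaves 2.5% in each tail
print(round(z, 2))               # 1.96

# Building a 95% CI for a mean from an (illustrative) sample sd and size
mean, sd, n = 50.0, 8.0, 64
half_width = z * sd / math.sqrt(n)
ci = (mean - half_width, mean + half_width)  # roughly (48.04, 51.96)
```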
How is the interquartile range different from the range in handling outliers?
- Both exclude outliers
- Both include outliers
- The interquartile range does not include outliers, the range does
- The interquartile range includes outliers, the range does not
The interquartile range, which is the difference between the upper quartile (Q3) and the lower quartile (Q1), represents the middle 50% of the data and is therefore resistant to outliers. The range, on the other hand, is the difference between the maximum and minimum data values, so even a single extreme outlier inflates it directly.
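A small numeric example (illustrative values) makes the contrast concrete: appending one extreme point blows up the range while barely moving the IQR.

```python
import numpy as np

data = np.array([3, 5, 7, 8, 9, 11, 12, 13, 15], dtype=float)
with_outlier = np.append(data, 100.0)

def iqr(a):
    q1, q3 = np.percentile(a, [25, 75])
    return q3 - q1

r_before, r_after = np.ptp(data), np.ptp(with_outlier)  # 12 -> 97
i_before, i_after = iqr(data), iqr(with_outlier)
# The range jumps from 12 to 97; the IQR barely changes,
# because the outlier sits outside the middle 50% of the data
```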
How can 'outliers' impact the result of K-means clustering?
- Outliers can distort the shape and size of the clusters
- Outliers can lead to fewer clusters
- Outliers can lead to more clusters
- Outliers don't impact K-means clustering
Outliers can have a significant impact on the result of K-means clustering. They can distort the shape and size of the clusters, as they may pull the centroid towards them, creating less accurate and meaningful clusters.
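Since a k-means centroid is simply the mean of its assigned points, the centroid-dragging effect can be shown directly with a mean computation on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=0.5, size=(50, 2))  # tight cluster at the origin
centroid = cluster.mean(axis=0)                          # sits near (0, 0)

polluted = np.vstack([cluster, [[50.0, 50.0]]])          # add one extreme outlier
shifted = polluted.mean(axis=0)                          # dragged toward the outlier
# A k-means centroid is exactly this mean, so a single extreme point can
# move it well away from the bulk of the cluster
```

This is why outliers are often removed first, or why median-based variants such as k-medoids are preferred on contaminated data.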
A positive Pearson's Correlation Coefficient indicates a ________ relationship between two variables.
- inverse
- linear
- perfect
- positive
A positive Pearson's Correlation Coefficient indicates a positive relationship between two variables: as one variable increases, the other tends to increase as well. The magnitude of the coefficient, from 0 up to 1, indicates the strength of that linear relationship.
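A quick check with numpy on synthetic data (the slope and noise scale below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)  # y tends to rise with x

r = np.corrcoef(x, y)[0, 1]  # Pearson's correlation coefficient
# r is clearly positive, reflecting the upward-sloping relationship
```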