Principal Component Analysis (PCA) is a technique that reduces dimensionality by creating new uncorrelated variables called _______. These new variables retain most of the variability in the original dataset.
- Eigenvalues
- Eigenvectors
- Factors
- Principal components
Principal Component Analysis (PCA) is a technique that reduces dimensionality by creating new uncorrelated variables called principal components. These new variables retain most of the variability in the original dataset. PCA works by projecting the original data onto a new space, represented by the principal components, which are orthogonal to each other and thus uncorrelated.
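The projection described above can be sketched with NumPy alone. This is a minimal illustration, not a reference implementation: the synthetic two-feature dataset and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): two strongly correlated features.
x = rng.normal(size=500)
X = np.column_stack([x, 2.0 * x + rng.normal(scale=0.5, size=500)])

# Center the data, then take the SVD; the rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project the centered data onto the principal directions to get the component scores.
scores = Xc @ Vt.T

# Share of the total variance captured by each component.
explained_variance_ratio = S**2 / np.sum(S**2)

# Covariance matrix of the scores; its off-diagonal entries are numerically zero,
# which is what "uncorrelated components" means in practice.
score_cov = np.cov(scores, rowvar=False)
```

Because the two input features are almost linearly related, the first component should carry nearly all of the variance, which is why dropping the second one loses little information.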
How does the missing data mechanism affect the effectiveness of multiple imputation?
- Affects only if data is missing at random
- Affects only if data is not missing at random
- Doesn't affect
- Significantly affects
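The missing data mechanism significantly affects the effectiveness of multiple imputation. If data are missing completely at random (MCAR), most methods, including complete-case analysis, give unbiased estimates, although they may lose efficiency. Multiple imputation is also generally valid when data are missing at random (MAR), provided the imputation model includes the variables that predict missingness. If data are not missing at random (NMAR), the results can be biased even with multiple imputation. The effectiveness also depends on how accurately the imputation model reflects the data-generating process.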
The _________ function in Matplotlib is used to create a figure and a set of subplots.
- heatmap
- pairplot
- subplot
- subplots
The 'subplots' function in Matplotlib is used to create a figure and a set of subplots. This function provides a convenient way to create both a figure and one or more subplots with a single call.
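As a small sketch of that single call, the snippet below creates one figure with two side-by-side axes. The plotted curves and the figure size are arbitrary choices for illustration, and the "Agg" backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# One call returns the Figure and, for a 1x2 grid, a 1-D array holding two Axes.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

axes[0].plot(x, np.sin(x))
axes[0].set_title("sin(x)")
axes[1].plot(x, np.cos(x))
axes[1].set_title("cos(x)")

fig.tight_layout()
```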
________ is one potential cause of outliers in a dataset.
- Measurement error
- Overfitting
- Underfitting
Measurement error is one potential cause of outliers in a dataset. This can occur due to inaccuracies in data collection, recording, or entry.
Consider a scatter plot displaying a tight, downward sloping distribution of points. What can be inferred about the relationship between the two plotted variables?
- There is a random relationship
- There is a strong negative relationship
- There is a strong positive relationship
- There is no relationship
A tight, downward sloping distribution of points in a scatter plot implies a strong negative relationship between the two plotted variables. As one variable increases, the other variable decreases.
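One way to quantify such a pattern is the Pearson correlation coefficient. The sketch below uses synthetic data (the slope and noise level are assumptions chosen to produce a tight, downward-sloping cloud) and computes the coefficient with NumPy.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, illustrative data: y falls as x rises, with only a little noise.
x = rng.uniform(0, 10, size=200)
y = -3.0 * x + rng.normal(scale=1.0, size=200)

# Pearson correlation coefficient; values near -1 indicate a strong negative
# linear relationship.
r = np.corrcoef(x, y)[0, 1]
```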
Which outlier handling technique would be suitable for a dataset with numerous extreme values distributed on both ends?
- Binning
- Removal
- Transformation
Transformation is a suitable technique for handling outliers when the dataset contains numerous extreme values distributed on both ends, as it can pull in these extreme values and make the data distribution more symmetrical.
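The text does not name a specific transformation, so the following is only a sketch under an assumption: it uses a cube-root transform, which is defined for negative values and therefore compresses both tails. The heavy-tailed sample and the `tail_ratio` helper are hypothetical, introduced purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Heavy-tailed synthetic sample (an assumption): Student's t with 2 degrees of
# freedom produces extreme values in both tails.
raw = rng.standard_t(df=2, size=1000)

# The cube root handles negative numbers and shrinks large magnitudes on both sides.
transformed = np.cbrt(raw)


def tail_ratio(values):
    """Largest absolute value relative to the median absolute value."""
    abs_values = np.abs(values)
    return abs_values.max() / np.median(abs_values)


ratio_before = tail_ratio(raw)
ratio_after = tail_ratio(transformed)
```

Comparing `ratio_before` with `ratio_after` gives a rough sense of how far the extremes sit from the bulk of the data before and after the transform. Other choices, such as a signed logarithm, would follow the same pattern.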
The ________ is a measure of dispersion that considers the spread of the middle 50% of data.
- Interquartile Range
- Range
- Standard Deviation
- Variance
The interquartile range (IQR) is a measure of statistical dispersion equal to the difference between the upper quartile (Q3) and the lower quartile (Q1), so it describes the spread of the middle 50% of the data. Like the standard deviation or variance it measures dispersion, but it is much more robust against outliers.
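A short sketch of the calculation, using a small made-up sample and NumPy's default (linear interpolation) percentile method. Adding one extreme value shows how little the IQR moves compared with the standard deviation.

```python
import numpy as np

# Small illustrative sample (made up for this example).
data = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 12])

# Q1 and Q3 bound the middle 50% of the data; the IQR is their difference.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Append a single extreme value and recompute both measures of spread.
with_outlier = np.append(data, 1000)
q1_out, q3_out = np.percentile(with_outlier, [25, 75])
iqr_out = q3_out - q1_out

std = data.std(ddof=1)
std_out = with_outlier.std(ddof=1)
```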
How does regularization help address multicollinearity?
- By constraining the coefficient estimates, potentially setting some to zero.
- By increasing model complexity.
- By increasing the variance of the model.
- By reducing model bias.
Regularization techniques, such as Ridge and Lasso regression, can help address multicollinearity by adding a penalty term to the loss function that constrains the coefficients. In particular, Lasso regression can set some coefficients to zero, effectively performing feature selection.
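The effect can be sketched with scikit-learn on synthetic data. Everything below is an illustrative assumption: the features (one nearly duplicated, one pure noise), the true coefficient, and the `alpha` values, which would normally be tuned by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(42)
n = 200

x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly a copy of x1: multicollinearity
x3 = rng.normal(size=n)                   # unrelated noise feature
X = np.column_stack([x1, x2, x3])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

# Unpenalized least squares: coefficients on x1 and x2 are poorly determined.
ols = LinearRegression().fit(X, y)

# Ridge adds an L2 penalty, shrinking the coefficient vector toward zero.
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso adds an L1 penalty, which can set some coefficients exactly to zero.
lasso = Lasso(alpha=0.1).fit(X, y)

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
```

Inspecting `ols.coef_`, `ridge.coef_`, and `lasso.coef_` side by side shows the constraint at work: the ridge coefficients are smaller in norm than the least-squares ones, and the Lasso fit typically drops the redundant or irrelevant features.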
You are working with a healthcare dataset and you observe that a patient's health status influences the probability of missingness in the data. What type of missing data is this?
- MAR
- MCAR
- NMAR
- Not missing data
This would be MAR (Missing at Random) because the missingness is related to observed data (the patient's health status, assuming it is recorded). The missingness is not completely random, but it does not depend on the unobserved values themselves.
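A small simulation can make the mechanism concrete. The sketch below is hypothetical throughout: the variable names, the blood-pressure values, and the missingness probabilities are assumptions chosen only to mimic missingness that depends on an observed health status.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

# Observed covariate (hypothetical): 0 = healthy, 1 = ill.
health = rng.integers(0, 2, size=n)

# Variable that will have gaps; ill patients have higher values on average.
blood_pressure = rng.normal(120, 10, size=n) + 15 * health

# MAR: the probability of missingness depends only on the observed health status.
p_missing = np.where(health == 1, 0.5, 0.1)
missing = rng.random(n) < p_missing
observed_bp = np.where(missing, np.nan, blood_pressure)

# Missingness rates differ by group, so the data are not MCAR.
rate_ill = missing[health == 1].mean()
rate_healthy = missing[health == 0].mean()

# Ignoring the mechanism biases the overall mean toward the healthy group.
naive_mean = np.nanmean(observed_bp)
true_mean = blood_pressure.mean()
```

Because the missingness is explained by a recorded variable, an analysis that conditions on `health` (for example, an imputation model that includes it) can correct the bias seen in `naive_mean`.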
What are the advantages and disadvantages of using a violin plot versus a box plot?
- All of the above
- Box plots are more visually complex than violin plots
- Violin plots can be harder to read for a non-technical audience
- Violin plots provide less information than box plots
Violin plots, while they provide more information (including the density of the distribution), can be harder to read for a non-technical audience. Box plots, while less information-rich, are generally easier to interpret.
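The trade-off is easiest to see on data with more than one mode. The sketch below draws both plot types for the same bimodal sample; the sample itself and the layout are illustrative assumptions, and the "Agg" backend is used only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)

# Bimodal synthetic sample (an assumption): two well-separated clusters.
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])

fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(8, 3), sharey=True)

# The box plot summarizes quartiles and whiskers but hides the two modes.
ax_box.boxplot(data)
ax_box.set_title("Box plot")

# The violin plot adds a density estimate, so both modes are visible.
ax_violin.violinplot(data, showmedians=True)
ax_violin.set_title("Violin plot")

fig.tight_layout()
```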