For data with outliers, the _____ is typically a better measure of central tendency as it is less sensitive to extreme values.
- Mean
- Median
- Mode
- Variance
The "Median" is less sensitive to extreme values, or outliers, in a dataset. Therefore, it's often a better measure of central tendency when outliers are present.
If you are working with a large data set and need to produce interactive visualizations for a web application, which Python library would be the most suitable?
- Bokeh
- Matplotlib
- Plotly
- Seaborn
Plotly is well-suited for creating interactive visualizations and can handle large data sets efficiently. It also supports rendering in web applications, making it ideal for this scenario.
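A minimal Plotly sketch, using the iris sample dataset that ships with Plotly Express, showing how an interactive figure can be exported as standalone HTML for embedding in a web page:

```python
import plotly.express as px

# Built-in sample dataset bundled with Plotly Express
df = px.data.iris()

fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.write_html("plot.html")  # self-contained, interactive HTML file
```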
What type of bias could be introduced by mean/median/mode imputation, particularly if the data is not missing at random?
- Confirmation bias
- Overfitting bias
- Selection bias
- Underfitting bias
Mean/median/mode imputation, particularly when data is not missing at random, can introduce selection bias. Replacing missing entries with a single central value ignores the mechanism behind the missingness, so the imputed sample no longer represents the underlying population: variability is underestimated and the true relationships between variables can be distorted.
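A small, hypothetical pandas example of how mean imputation understates the spread of a column whose high values happen to be missing:

```python
import pandas as pd
import numpy as np

# Hypothetical income column where the highest values are the ones missing
income = pd.Series([30, 35, 40, 45, np.nan, np.nan])

imputed = income.fillna(income.mean())  # fills with 37.5, the mean of observed values

print(income.std())   # ~6.45 on the observed values
print(imputed.std())  # 5.0  -- spread is artificially reduced after imputation
```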
How can regularization techniques contribute to feature selection?
- By adding a penalty term to the loss function
- By avoiding overfitting
- By reducing model complexity
- By shrinking coefficients towards zero
Regularization techniques contribute to feature selection by shrinking the coefficients of less important features towards zero. With an L1 penalty (as in Lasso), coefficients can reach exactly zero, which effectively removes those features from the model and thus performs feature selection.
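For example, a short scikit-learn sketch on synthetic data using Lasso, whose L1 penalty drives the coefficients of irrelevant features to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=200)  # only the first two features matter

model = Lasso(alpha=0.5).fit(X, y)
print(model.coef_)  # coefficients of the three irrelevant features are shrunk to (or near) zero
```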
What is the impact on training time if missing data is incorrectly handled in a large dataset?
- Decreases dramatically.
- Depends on the specific dataset.
- Increases dramatically.
- Remains largely the same.
If missing data is handled incorrectly, particularly in a large dataset, training time can increase dramatically. The model struggles to fit distorted or inconsistent values, so it needs more iterations (and therefore more time) to converge.
The _______ method of feature selection involves removing features one by one until the removal of further features decreases model accuracy.
- Backward elimination
- Forward selection
- Recursive feature elimination
- Stepwise selection
The backward elimination method of feature selection starts with a model trained on all features and iteratively removes the least important one, stopping when removing any further feature would reduce model accuracy.
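One way to sketch this in scikit-learn is with SequentialFeatureSelector in backward mode; the example below uses the built-in diabetes dataset and an arbitrary target of 5 retained features:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Start from all 10 features and drop them one at a time,
# keeping the subset whose removal hurts cross-validated performance the least
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="backward"
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the retained features
```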
High degrees of multicollinearity can inflate the _________ of the estimated regression coefficients.
- Bias
- Distribution
- Efficiency
- Variance
High degrees of multicollinearity can inflate the variance of the estimated regression coefficients. This means that the coefficients become highly sensitive to minor changes in the model, which can make them unreliable and difficult to interpret.
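This inflation is commonly quantified with the variance inflation factor (VIF); here is a small synthetic sketch using statsmodels:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)  # nearly collinear with x1
x3 = rng.normal(size=100)

X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(dict(zip(X.columns, vif)))  # x1 and x2 show very large VIFs; x3 stays near 1
```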
You are dealing with a dataset where outliers significantly affect the mean of the distribution but not the median. What approach would you suggest to handle these outliers?
- Binning
- Removal
- Transformation
In this case, a transformation such as a log or square root transformation might be suitable. These transformations pull in high values, thereby reducing their impact on the mean.
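A toy example (made-up values) showing how a log transformation damps the influence of a single extreme value:

```python
import numpy as np

values = np.array([12, 15, 14, 13, 900])  # 900 is an extreme outlier

print(values.mean())            # ~190.8 -- heavily distorted by the outlier
print(np.log1p(values).mean())  # on the log scale, the outlier's influence is far smaller
```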
The process of replacing each missing data point with a set of plausible values, creating multiple complete datasets, is known as ____________.
- Mean Imputation
- Mode Imputation
- Multiple Imputation
- Regression Imputation
This process is called multiple imputation. It generates several plausible completed datasets; the analysis is run on each one and the results are pooled to produce the final estimates.
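A rough sketch of the idea using scikit-learn's IterativeImputer: running it several times with sample_posterior=True and different random seeds yields multiple plausible completed datasets (a simplified stand-in for full multiple imputation):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

# Draw several plausible completed datasets by re-running the imputer
# with posterior sampling and different random seeds
completed = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]
print(len(completed))  # 5 complete datasets, each analysed separately and then pooled
```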
What is the relationship between the Z-score of a data point and its distance from the mean?
- The Z-score is independent of the distance from the mean
- The higher the Z-score, the closer the data point is to the mean
- The higher the Z-score, the further the data point is from the mean
- The lower the Z-score, the further the data point is from the mean
The higher the Z-score (in absolute value), the further the data point lies from the mean, measured in standard deviations. A Z-score of 0 indicates that the data point is identical to the mean.
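A quick sketch of the calculation on made-up values:

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Z-score: how many standard deviations each point lies from the mean
z = (values - values.mean()) / values.std()
print(z)  # points far from the mean get large |z|; a point equal to the mean gets z = 0
```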