What type of data visualization method is typically color-coded to represent different values?

  • Heatmap
  • Histogram
  • Line plot
  • Scatter plot
Heatmaps are typically color-coded to represent different values. In a heatmap, each data value is encoded as a color, which makes it well suited to spotting patterns in large tables of data, such as a correlation matrix of many variables.
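
As a rough illustration of how a value becomes a color, the sketch below linearly maps each cell of a small grid to a blue-to-red RGB triple. The function names and the sample correlation values are hypothetical, and real plotting libraries use richer colormaps than this two-color ramp.

```python
def value_to_rgb(value, vmin, vmax):
    """Linearly map a value in [vmin, vmax] to a blue (low) to red (high) RGB triple."""
    if vmax == vmin:
        t = 0.5  # flat grid: use the middle of the ramp
    else:
        t = (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))
    return (round(255 * t), 0, round(255 * (1 - t)))


def heatmap_colors(grid):
    """Return a grid of RGB triples, scaled to the grid's own min and max."""
    flat = [v for row in grid for v in row]
    vmin, vmax = min(flat), max(flat)
    return [[value_to_rgb(v, vmin, vmax) for v in row] for row in grid]


# Hypothetical correlation matrix for three variables.
correlations = [
    [1.0, 0.25, -0.5],
    [0.25, 1.0, 0.25],
    [-0.5, 0.25, 1.0],
]
colors = heatmap_colors(correlations)
for row in colors:
    print(row)
```

The strongest positive values come out pure red and the most negative ones pure blue, so the diagonal of a correlation matrix stands out at a glance.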

What is the potential disadvantage of using listwise deletion for handling missing data?

  • It causes overfitting
  • It discards valuable data
  • It introduces random noise
  • It leads to multicollinearity
The potential disadvantage of using listwise deletion for handling missing data is that it can discard valuable data, since an entire observation is dropped even if only one of its values is missing. If the data are not missing completely at random, the deletion can also bias the results, because certain types of observations are excluded more often than others.
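
A minimal sketch of the data loss, using a made-up five-row dataset in which `None` marks a missing value. The records and field names are hypothetical.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "score": 0.71},
    {"age": 45, "income": None, "score": 0.64},
    {"age": None, "income": 48000, "score": 0.80},
    {"age": 29, "income": 39000, "score": None},
    {"age": 51, "income": 61000, "score": 0.58},
]


def listwise_delete(records):
    """Keep only the records in which no field is missing."""
    return [r for r in records if all(v is not None for v in r.values())]


complete = listwise_delete(rows)
total_cells = sum(len(r) for r in rows)
missing_cells = sum(v is None for r in rows for v in r.values())
rows_lost = len(rows) - len(complete)

print(f"missing cells: {missing_cells}/{total_cells}")
print(f"rows dropped:  {rows_lost}/{len(rows)}")
```

Here only a fifth of the individual cells are missing, yet more than half of the rows are thrown away, along with all the observed values they contained.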

If a data point's Z-score is 0, it indicates that the data point is _______.

  • above the mean
  • an outlier
  • below the mean
  • on the mean
A Z-score measures how many standard deviations a data point lies from the mean, so a Z-score of 0 indicates that the data point is on the mean.
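
The sketch below computes Z-scores as z = (x - mean) / standard deviation for a small made-up sample. Using the population standard deviation (`statistics.pstdev`) is an assumption; the sample version (`statistics.stdev`) would change the magnitudes but not which point scores 0.

```python
import statistics


def z_scores(values):
    """Return (x - mean) / std for every x, using the population std."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mean) / std for v in values]


data = [2, 4, 6, 8, 10]  # hypothetical sample; its mean is 6
scores = z_scores(data)
for value, z in zip(data, scores):
    print(f"{value:>2} -> z = {z:+.2f}")
```

The value equal to the mean gets a Z-score of 0, values below the mean get negative scores, and values above it get positive scores.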

How does incorrect imputation of missing data influence the accuracy of a predictive model?

  • Decreases accuracy.
  • Depends on the specific model.
  • Increases accuracy.
  • No effect on accuracy.
Incorrect imputation of missing data can lead the model to learn patterns that are not present in the real data, which in turn can significantly decrease the accuracy of its predictions.

Why is it important to check the normality of residuals in regression analysis?

  • To ensure the accuracy of the model's predictive ability
  • To ensure the model is not overfitting
  • To make sure the regression line is the best fit
  • To satisfy one of the key assumptions of linear regression
It is important to check the normality of residuals in regression analysis because normally distributed errors are one of the key assumptions of linear regression. If the residuals are approximately normal, the hypothesis tests and confidence intervals for the coefficients can be trusted, particularly in small samples. Normal residuals alone do not guarantee a well-fitting or accurate model.
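
One possible way to check this numerically is the Jarque-Bera statistic, which measures how far the skewness and excess kurtosis of the residuals are from the values expected under normality. The source does not name a specific test, so treat this as one illustrative choice; the data below is simulated, and the statistic is only asymptotically chi-square distributed, so it is a rough guide for small samples.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def jarque_bera(residuals):
    """JB = n/6 * (skewness^2 + excess_kurtosis^2 / 4)."""
    n = len(residuals)
    m = sum(residuals) / n
    m2 = sum((r - m) ** 2 for r in residuals) / n
    m3 = sum((r - m) ** 3 for r in residuals) / n
    m4 = sum((r - m) ** 4 for r in residuals) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return n / 6 * (skew ** 2 + excess_kurtosis ** 2 / 4)


# Simulated data (an assumption for the demo): a line plus Gaussian noise.
random.seed(0)
x = [i / 10 for i in range(200)]
y = [1.0 + 2.0 * xi + random.gauss(0, 1) for xi in x]

slope, intercept = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
jb = jarque_bera(residuals)
print(f"JB statistic: {jb:.2f} (5% critical value for chi-square(2) is about 5.99)")
```

A statistic well above the critical value suggests the residuals depart from normality. A histogram or Q-Q plot of the residuals is a common visual complement.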

Which type of graph is frequently used to represent an estimate of a variable's probability density function?

  • Bar chart
  • Kernel Density plot
  • Pie chart
  • Scatter plot
A Kernel Density Plot is frequently used to represent an estimate of a variable's probability density function. This type of plot places a smoothing kernel on each observation and sums the kernels into a continuous curve, scaled so that the area under the curve equals 1.
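
A minimal sketch of the idea with a Gaussian kernel, followed by a numerical check that the area under the estimated curve is close to 1. The sample values and the bandwidth of 0.5 are arbitrary assumptions; in practice the bandwidth is usually chosen by a rule of thumb or by cross-validation.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function: the average of Gaussian bumps, one per sample."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


samples = [1.2, 1.9, 2.1, 2.4, 3.8, 4.0, 4.3]  # hypothetical observations
density = gaussian_kde(samples, bandwidth=0.5)

# Midpoint Riemann sum over a range wide enough to cover the tails.
lo, hi, steps = -5.0, 10.0, 3000
dx = (hi - lo) / steps
area = sum(density(lo + (i + 0.5) * dx) for i in range(steps)) * dx
print(f"area under the curve: {area:.4f}")
```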

You're in the 'explore' phase of the EDA process and you notice a potential error back in the 'wrangle' phase. How should you proceed?

  • Conclude the analysis with the current data.
  • Go back to the wrangling phase to correct the error.
  • Ignore the error and continue with the exploration.
  • Inform the stakeholders about the error.
If you notice a potential error in the 'wrangle' phase while you are in the 'explore' phase, you should go back to the 'wrangle' phase to correct the error. Ensuring the accuracy and quality of the data during the 'wrangle' phase is crucial for the validity of the insights drawn in subsequent phases.

What is the impact on training time if missing data is incorrectly handled in a large dataset?

  • Decreases dramatically.
  • Depends on the specific dataset.
  • Increases dramatically.
  • Remains largely the same.
If missing data is handled incorrectly, particularly in a large dataset, training time can increase significantly. The model may struggle to fit the distorted data and need more iterations to converge.

The _______ method of feature selection involves removing features one by one until the removal of further features decreases model accuracy.

  • Backward elimination
  • Forward selection
  • Recursive feature elimination
  • Stepwise selection
The backward elimination method of feature selection involves removing features one by one until the removal of further features decreases model accuracy. This process starts with a model trained on all features and iteratively removes the least important feature until the overall model performance declines.
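
The loop described above can be sketched as follows. The `score_fn` argument stands in for whatever validation metric is used, and `toy_score` with its feature names and weights is entirely hypothetical: it pretends that two features carry signal and that every extra feature costs a little accuracy.

```python
def backward_elimination(features, score_fn):
    """Start from all features and drop one at a time until any removal lowers the score."""
    selected = list(features)
    best = score_fn(selected)
    while len(selected) > 1:
        # Score every subset that omits exactly one of the remaining features.
        candidates = [
            (score_fn([f for f in selected if f != drop]), drop)
            for drop in selected
        ]
        candidate_score, drop = max(candidates)
        if candidate_score < best:
            break  # every possible removal now hurts the score, so stop
        selected.remove(drop)
        best = candidate_score
    return selected, best


# Hypothetical contribution of each feature to accuracy.
SIGNAL = {"age": 0.30, "income": 0.25, "zip_code": 0.0, "row_id": 0.0}


def toy_score(subset):
    """Stand-in for validation accuracy: signal gained minus a small cost per feature."""
    return 0.40 + sum(SIGNAL[f] for f in subset) - 0.01 * len(subset)


selected, score = backward_elimination(list(SIGNAL), toy_score)
print(selected, round(score, 2))
```

With a real model, `score_fn` would retrain on the given subset and return a cross-validated score, which is why backward elimination can be expensive when there are many features.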

High degrees of Multicollinearity can inflate the _________ of the estimated regression coefficients.

  • Bias
  • Distribution
  • Efficiency
  • Variance
High degrees of multicollinearity can inflate the variance of the estimated regression coefficients. This means that the coefficients become highly sensitive to minor changes in the model or the data, which can make them unreliable and difficult to interpret.
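
The size of this inflation is commonly summarized by the variance inflation factor (VIF). For a model with exactly two predictors it reduces to 1 / (1 - r^2), where r is the correlation between them. The sketch below uses that two-predictor special case with made-up data; with more predictors, each VIF comes from regressing one predictor on all the others.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF for a two-predictor model: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical predictors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2_collinear = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]  # roughly 2 * x1
x2_unrelated = [5.0, 3.0, 4.0, 4.0, 3.0, 5.0]    # uncorrelated with x1

vif_high = vif_two_predictors(x1, x2_collinear)
vif_low = vif_two_predictors(x1, x2_unrelated)
print(f"VIF with near-duplicate predictor: {vif_high:.1f}")
print(f"VIF with unrelated predictor:      {vif_low:.1f}")
```

A VIF of 1 means no inflation. Values above roughly 5 to 10 are often treated as a warning sign, although that threshold is a convention rather than a strict rule.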