Incorrect handling of missing data can lead to a(n) ________ in model performance.

  • amplification
  • boost
  • degradation
  • improvement
Incorrectly handling missing data distorts the distribution the model learns from, which impairs its ability to learn accurately and leads to a degradation in model performance.

The ________ correlation coefficient is based on the ranks of data rather than the actual values.

  • Covariance
  • Kendall's Tau
  • Pearson's
  • Spearman's
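Spearman's correlation coefficient is computed from the ranks of the data rather than the actual values: it is Pearson's correlation applied to the ranks. This makes it suitable for ordinal variables and resistant to outliers. (Kendall's Tau also depends only on the ordering of the data, but it is computed from concordant and discordant pairs rather than from the ranks directly.)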
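
As an illustration of the rank-based definition, here is a minimal sketch that computes Spearman's coefficient as Pearson's correlation on ranks, using only the standard library. The helper names (average_ranks, pearson, spearman) and the sample data are assumptions made for this example, not part of the original material.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's coefficient: Pearson's correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: perfectly monotonic, with one extreme value in y.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 1000]

rho = spearman(x, y)  # depends only on the ordering
r = pearson(x, y)     # pulled down by the extreme value
print(f"Spearman: {rho:.3f}, Pearson: {r:.3f}")
```

Because the ordering of y matches the ordering of x exactly, the rank-based coefficient is 1 even though the extreme value lowers Pearson's coefficient.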

The choice of graph for data visualization largely depends on the __________ of the dataset.

  • File format
  • Shape
  • Size
  • Type of variables
The choice of graph for data visualization largely depends on the type of variables in the dataset. For example, categorical variables are best represented with bar charts or pie charts, while continuous variables might be better shown with histograms or box plots.

What are the key statistical tools used in Confirmatory Data Analysis (CDA)?

  • Box-Plot, Scatter Plot, Histogram, and Density Plots
  • Hypothesis Testing, Regression Analysis, Chi-Squared Test, and ANOVA
  • PCA, LDA, t-SNE, and UMAP
  • Random Forests, SVM, Neural Networks, and Gradient Boosting
In CDA, the primary goal is to confirm or refute the hypotheses that were generated during EDA. Key statistical tools used in CDA include Hypothesis Testing, Regression Analysis, Chi-Squared Test, and Analysis of Variance (ANOVA).
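
To make one of these tools concrete, the following sketch computes a Pearson chi-squared test of independence for a 2x2 contingency table with the standard library only. The function name chi_squared_2x2 and the observed counts are hypothetical; the p-value formula relies on the fact that, with one degree of freedom, the chi-squared tail probability equals erfc(sqrt(statistic / 2)), and no continuity correction is applied.

```python
import math

def chi_squared_2x2(table):
    """Chi-squared statistic and p-value for a 2x2 table (df = 1, no correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the null hypothesis of independence.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    # Upper-tail probability of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: rows are two groups, columns are two outcomes.
observed = [[30, 10], [20, 40]]
stat, p_value = chi_squared_2x2(observed)
print(f"chi-squared = {stat:.3f}, p = {p_value:.6f}")
```

A small p-value is evidence against the hypothesis that group and outcome are independent, which is the kind of confirm-or-refute decision CDA is concerned with.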

When would a scatter plot be less effective in identifying outliers?

  • When the data has no correlation
  • When the data is normally distributed
  • When the data points are closely grouped
  • When there are many data points
A scatter plot may be less effective at identifying outliers when the data points are closely grouped, because overlapping points make it hard to see which ones lie far from the others.

How can regularization techniques contribute to feature selection?

  • By adding a penalty term to the loss function
  • By avoiding overfitting
  • By reducing model complexity
  • By shrinking coefficients towards zero
Regularization techniques contribute to feature selection by shrinking the coefficients of less important features towards zero. With an L1 penalty (lasso), some coefficients become exactly zero, which removes those features from the model; an L2 penalty (ridge) shrinks coefficients but generally leaves them nonzero.
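
The shrinkage effect can be sketched in the simplified case of an orthonormal design, where the lasso and ridge solutions have closed forms in terms of the unregularized (OLS) coefficients. This is an illustrative assumption, not a general solver: the coefficient values, feature names, and penalty strength lam below are made up, and the forms assume the objectives 1/2 * squared error + lam * L1 norm (lasso) and 1/2 * squared error + lam/2 * squared L2 norm (ridge).

```python
def soft_threshold(beta, lam):
    """Lasso solution for one coefficient under an orthonormal design."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0  # small coefficients are set exactly to zero

def ridge_shrink(beta, lam):
    """Ridge solution for one coefficient under an orthonormal design."""
    return beta / (1 + lam)  # scaled down, but zero only if beta is zero

# Hypothetical OLS coefficients for four features.
ols = {"x1": 3.0, "x2": -0.4, "x3": 0.1, "x4": -2.0}
lam = 0.5

lasso = {name: soft_threshold(b, lam) for name, b in ols.items()}
ridge = {name: ridge_shrink(b, lam) for name, b in ols.items()}

# Features whose lasso coefficient survived the penalty.
selected = [name for name, b in lasso.items() if b != 0.0]
print("lasso:", lasso)
print("ridge:", ridge)
print("selected by lasso:", selected)
```

In this sketch the two weak features (x2 and x3) are dropped by the L1 penalty, while the L2 penalty keeps all four with smaller magnitudes.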

What type of data visualization method is typically color-coded to represent different values?

  • Heatmap
  • Histogram
  • Line plot
  • Scatter plot
Heatmaps are typically color-coded to represent different values. In a heatmap, data values are represented as colors, making it an excellent tool for visualizing large amounts of data and the correlation between different variables.

What is the potential disadvantage of using listwise deletion for handling missing data?

  • It causes overfitting
  • It discards valuable data
  • It introduces random noise
  • It leads to multicollinearity
The potential disadvantage of using listwise deletion for handling missing data is that it can discard valuable data: an entire observation is dropped even if only one of its values is missing. If the values are not missing completely at random, dropping those observations can also bias the results, because certain types of observations are excluded more often than others.
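
A small sketch of how much data listwise deletion can discard, using plain Python dictionaries as rows. The records and field names are hypothetical and chosen only so that each missing value falls in a different row.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "score": 0.71},
    {"age": None, "income": 61000, "score": 0.64},
    {"age": 45, "income": None, "score": 0.58},
    {"age": 29, "income": 48000, "score": None},
    {"age": 51, "income": 75000, "score": 0.80},
]

# Listwise deletion: keep only rows with no missing values at all.
complete = [r for r in rows if all(v is not None for v in r.values())]
incomplete = [r for r in rows if any(v is None for v in r.values())]

total_cells = sum(len(r) for r in rows)
missing_cells = sum(v is None for r in rows for v in r.values())
# Observed (non-missing) values that are thrown away with the dropped rows.
discarded_values = sum(v is not None for r in incomplete for v in r.values())

print(f"missing cells: {missing_cells} of {total_cells}")
print(f"rows dropped: {len(incomplete)} of {len(rows)}")
print(f"observed values discarded: {discarded_values}")
```

Here only 3 of 15 cells are missing, yet 3 of 5 rows are dropped and 6 observed values go with them.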

If a data point's Z-score is 0, it indicates that the data point is _______.

  • above the mean
  • an outlier
  • below the mean
  • on the mean
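A Z-score of 0 indicates that the data point is on the mean. Since z = (x - mean) / standard deviation, z is 0 exactly when x equals the mean; positive values lie above the mean and negative values below it.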
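
A minimal sketch of the Z-score calculation with the standard library, assuming the population standard deviation is used. The function name z_scores and the data are illustrative only.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Z-score of each value: (value - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical data whose mean is 30.
data = [10, 20, 30, 40, 50]
z = z_scores(data)

for value, score in zip(data, z):
    print(f"{value}: z = {score:+.3f}")
```

The value 30 equals the mean, so its Z-score is 0; values below the mean get negative scores and values above it get positive scores.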

How does incorrect imputation of missing data influence the accuracy of a predictive model?

  • Decreases accuracy.
  • Depends on the specific model.
  • Increases accuracy.
  • No effect on accuracy.
Incorrect imputation of missing data replaces the missing values with values that do not reflect the true data, so the model learns incorrect patterns, which in turn can significantly decrease the accuracy of its predictions.
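
The distortion can be sketched with two simple imputation strategies applied to the same column. The values are hypothetical, and the example only compares summary statistics; it does not train a model.

```python
from statistics import mean, pvariance

# Hypothetical measurements; None marks a missing value.
observed = [50.0, 52.0, None, 49.0, None, 51.0]
known = [v for v in observed if v is not None]

# Strategy 1: fill with 0, a value far outside the observed range.
zero_filled = [0.0 if v is None else v for v in observed]
# Strategy 2: fill with the mean of the observed values.
mean_filled = [mean(known) if v is None else v for v in observed]

print(f"observed only: mean={mean(known):.2f}, variance={pvariance(known):.3f}")
print(f"zero-filled:   mean={mean(zero_filled):.2f}, variance={pvariance(zero_filled):.3f}")
print(f"mean-filled:   mean={mean(mean_filled):.2f}, variance={pvariance(mean_filled):.3f}")
```

Zero-filling drags the mean far below the observed values, while mean-filling preserves the mean but understates the variance. Either distortion can carry through to a model trained on the imputed column.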