What are the key statistical tools used in Confirmatory Data Analysis (CDA)?

  • Box-Plot, Scatter Plot, Histogram, and Density Plots
  • Hypothesis Testing, Regression Analysis, Chi-Squared Test, and ANOVA
  • PCA, LDA, t-SNE, and UMAP
  • Random Forests, SVM, Neural Networks, and Gradient Boosting
In CDA, the primary goal is to confirm or refute the hypotheses that were generated during EDA. Key statistical tools used in CDA include Hypothesis Testing, Regression Analysis, Chi-Squared Test, and Analysis of Variance (ANOVA).
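As one illustration of a confirmatory test, the sketch below runs a Pearson chi-squared test of independence on a 2x2 table using only the standard library. The counts and the function name `chi_squared_2x2` are hypothetical, no continuity correction is applied, and the p-value shortcut via `math.erfc` only holds for 1 degree of freedom.

```python
import math

def chi_squared_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under the null hypothesis of independence.
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: rows are two groups, columns are outcome yes / no.
observed = [[30, 70], [50, 50]]
stat, p_value = chi_squared_2x2(observed)
print(f"chi-squared = {stat:.3f}, p-value = {p_value:.4f}")
```

A small p-value here would count as evidence against the hypothesis that group and outcome are independent.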

When would a scatter plot be less effective in identifying outliers?

  • When the data has no correlation
  • When the data is normally distributed
  • When the data points are closely grouped
  • When there are many data points
A scatter plot may be less effective at identifying outliers when the data points are closely grouped: dense, overlapping markers make it hard to see which points actually sit apart from the rest.

The _____ Distribution is used for modeling the number of times an event occurs in an interval of time or space.

  • Binomial
  • Normal
  • Poisson
  • Uniform
The Poisson Distribution is used for modeling the number of times an event occurs in an interval of time or space.
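For an average rate λ per interval, the Poisson probability of observing exactly k events is λ^k e^(-λ) / k!. A minimal sketch of that formula follows; the rate of 3 events per interval is an assumed value chosen only for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate per interval is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 3.0  # assumed average number of events per interval
for k in range(6):
    print(k, round(poisson_pmf(k, lam), 4))

# The probabilities over all k sum to 1; 30 terms are plenty for lam = 3.
total = sum(poisson_pmf(k, lam) for k in range(30))
```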

How does 'binning' help in dealing with outliers in a dataset?

  • By dividing the data into intervals and replacing outlier values
  • By eliminating irrelevant variables
  • By identifying and removing outliers
  • By normalizing the data
Binning deals with outliers by dividing the data into intervals, or 'bins', and replacing the values in each bin, outliers included, with a summary statistic such as the bin mean or median.
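The idea can be sketched with equal-frequency bins and smoothing by the bin median. This is one of several binning schemes and is picked here as an assumption. The data values and the helper name `smooth_by_bin_median` are hypothetical.

```python
import statistics

def smooth_by_bin_median(values, bin_size):
    """Sort the values, chunk them into bins of bin_size, and replace each value by its bin's median."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    smoothed = [None] * len(values)
    for start in range(0, len(order), bin_size):
        idx = order[start:start + bin_size]
        bin_median = statistics.median([values[i] for i in idx])
        for i in idx:
            smoothed[i] = bin_median  # written back at the value's original position
    return smoothed

data = [4, 8, 9, 15, 21, 21, 24, 25, 300]  # 300 is the outlier
smoothed = smooth_by_bin_median(data, 3)
print(smoothed)
```

With equal-frequency bins the outlier shares a bin with ordinary values, so the median pulls it back toward them. With equal-width bins an extreme value can end up alone in its bin and stay unchanged.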

Suppose you have a data set with many missing values and outliers. In which step of the EDA process would you primarily deal with these issues?

  • In the communicating phase
  • In the exploring phase
  • In the questioning phase
  • In the wrangling phase
During the 'wrangling' phase of the EDA process, data analysts carry out data cleaning tasks, which include handling missing values and dealing with outliers. Data wrangling involves transforming and cleaning data to enable further exploration and analysis.

How does incorrect imputation of missing data influence the accuracy of a predictive model?

  • Decreases accuracy.
  • Depends on the specific model.
  • Increases accuracy.
  • No effect on accuracy.
Incorrect imputation of missing data can lead to the model learning incorrect patterns, which in turn can significantly decrease the accuracy of predictions.
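One way imputation can distort what a model sees, sketched with made-up numbers: filling every gap with the mean keeps the mean intact but shrinks the variance, so the data look less spread out than they really are. The values below are hypothetical and `None` marks a missing entry.

```python
import statistics

observed = [2.0, 4.0, 6.0, 8.0, 10.0]
with_gaps = observed + [None, None, None]  # three missing entries

# Naive mean imputation: every gap receives the same value.
mean_value = statistics.mean(observed)
imputed = [mean_value if v is None else v for v in with_gaps]

var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)
print(f"variance before: {var_observed:.2f}, after mean imputation: {var_imputed:.2f}")
```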

Why is it important to check the normality of residuals in regression analysis?

  • To ensure the accuracy of the model's predictive ability
  • To ensure the model is not overfitting
  • To make sure the regression line is the best fit
  • To satisfy one of the key assumptions of linear regression
Checking the normality of residuals matters because normally distributed errors are one of the key assumptions of linear regression. When the residuals look approximately normal, that assumption is supported, and the hypothesis tests and confidence intervals built on it are more trustworthy.
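A rough way to check this without plotting is to correlate the sorted residuals with standard normal quantiles, which is the numeric counterpart of a Q-Q plot. The sketch below uses synthetic data with normal noise; the helper names, the seed, and the true coefficients are all assumptions made for the example. In practice a formal test such as Shapiro-Wilk is a common alternative.

```python
import random
from statistics import NormalDist, mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5

def qq_correlation(residuals):
    """Correlation between sorted residuals and standard normal quantiles."""
    n = len(residuals)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return pearson(sorted(residuals), theoretical)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
x = [i / 10 for i in range(200)]
y = [1.5 + 2.0 * xi + rng.gauss(0, 1) for xi in x]  # synthetic data, normal noise

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
r = qq_correlation(residuals)
print(f"slope = {b:.3f}, Q-Q correlation = {r:.4f}")
```

A correlation close to 1 is consistent with normal residuals; a clearly lower value suggests skew or heavy tails worth inspecting.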

Which type of graph is frequently used to represent an estimate of a variable's probability density function?

  • Bar chart
  • Kernel Density plot
  • Pie chart
  • Scatter plot
A Kernel Density Plot is frequently used to represent an estimate of a variable's probability density function. This type of plot uses a smoothing kernel to produce a continuous curve, and the area under that curve equals 1.
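A minimal sketch of the estimate behind such a plot, assuming a Gaussian kernel and a hand-picked bandwidth. The sample values are hypothetical. The final Riemann sum checks numerically that the curve encloses an area of about 1.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density using a Gaussian kernel."""
    n = len(samples)
    norm = n * bandwidth * math.sqrt(2 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm
    return density

samples = [1.2, 1.9, 2.1, 2.4, 3.0, 3.3, 4.8]   # hypothetical observations
density = gaussian_kde(samples, bandwidth=0.5)  # bandwidth chosen by hand

# Riemann-sum check over a grid from -5 to 10 that the area is about 1.
step = 0.01
grid = [-5 + i * step for i in range(1501)]
area = sum(density(x) for x in grid) * step
print(f"area under the estimated density: {area:.4f}")
```

A smaller bandwidth gives a bumpier curve and a larger one a smoother curve, but the area stays at 1 either way.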

You are analyzing a data set that includes the number of visitors to a website per day. How would you categorize this data type?

  • Continuous data
  • Discrete data
  • Nominal data
  • Ordinal data
The number of visitors to a website per day is discrete data: it is a count, so it can take only whole-number values (0, 1, 2, ...) and nothing in between.

For data with outliers, the _____ is typically a better measure of central tendency as it is less sensitive to extreme values.

  • Mean
  • Median
  • Mode
  • Variance
The median is less sensitive to extreme values, or outliers, in a dataset. Therefore, it's often a better measure of central tendency when outliers are present.
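A quick numeric illustration, using hypothetical values: adding a single extreme observation moves the mean a long way while the median barely changes.

```python
import statistics

values = [30, 32, 35, 36, 38]   # hypothetical observations
with_outlier = values + [500]   # same data plus one extreme value

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:.1f} -> {mean_after:.1f}")
print(f"median: {median_before} -> {median_after}")
```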