In a survey about income levels, some individuals chose not to disclose their earnings. How would you categorize this missing data?

  • MAR
  • MCAR
  • NMAR
  • Not missing data
This is NMAR (Not Missing at Random): whether the income value is missing depends on the unobserved value itself (for example, people with very high or very low incomes may be more likely to withhold it).
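As a minimal sketch of why this kind of missingness matters, the snippet below uses made-up incomes and a hypothetical, deliberately extreme rule: anyone earning above 100,000 withholds their income. Neither the figures nor the threshold come from the survey in the question; they only illustrate how the observed data can misrepresent the full data.

```python
from statistics import mean

# Hypothetical incomes; not taken from any real survey.
incomes = [28_000, 35_000, 42_000, 51_000, 60_000, 75_000, 120_000, 250_000]

# Assumed NMAR rule: respondents above 100,000 do not disclose.
# The chance of being missing depends on the missing value itself.
reported = [x if x <= 100_000 else None for x in incomes]
observed = [x for x in reported if x is not None]

true_mean = mean(incomes)
observed_mean = mean(observed)

print("true mean:    ", true_mean)
print("observed mean:", observed_mean)
```

Because the highest earners are the ones missing, the mean of the disclosed incomes understates the true mean, and simply dropping the missing rows would not fix that.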

_____ data can only take certain values with gaps between them.

  • Continuous
  • Discrete
  • Nominal
  • Ordinal
Discrete data can only take certain values (usually integers) and there are gaps between the values.

You have a data set with a large number of outliers. Which measure of dispersion should you use to best describe the data set, and why?

  • Interquartile range (IQR) because it is robust to outliers
  • Range because it covers all values
  • Standard deviation because it gives the average spread
  • Variance because it squares the differences
When dealing with a large number of outliers in a data set, the "Interquartile range (IQR)" is the most suitable measure of dispersion. It measures the spread between the 25th and 75th percentiles, so extreme values in the tails have little effect on it, whereas the range, variance, and standard deviation are all inflated by them.
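The comparison can be sketched with Python's standard library. The data values are invented for illustration, and `statistics.quantiles` is used with its default quartile convention; other tools may compute quartiles slightly differently.

```python
from statistics import quantiles, stdev

def iqr(data):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(data, n=4)
    return q3 - q1

# Illustrative values only.
clean = [10, 12, 13, 14, 15, 16, 17, 18, 20]
with_outliers = clean + [200, 250]

iqr_clean, iqr_dirty = iqr(clean), iqr(with_outliers)
sd_clean, sd_dirty = stdev(clean), stdev(with_outliers)
range_clean = max(clean) - min(clean)
range_dirty = max(with_outliers) - min(with_outliers)

print("IQR:  ", iqr_clean, "->", iqr_dirty)
print("SD:   ", round(sd_clean, 2), "->", round(sd_dirty, 2))
print("Range:", range_clean, "->", range_dirty)
```

In this sketch the IQR moves only slightly when the two extreme values are added, while the standard deviation and the range grow by more than an order of magnitude.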

A teacher is analyzing test scores and finds that the distribution is bimodal, with one peak at 70 and another at 90. Which measure of central tendency might not be the best choice in this situation, and why?

  • Mean, because it doesn't reflect the peaks
  • Median, because it doesn't reflect the bimodality
  • Mode, because there are two peaks
  • None, because all are suitable
The "Mean" might not be the best choice in this situation because it does not reflect the two peaks. The mean gives a single central value, likely somewhere between 70 and 90 where few students actually scored, so it does not represent either of the two distinct groups in a bimodal distribution.
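A small sketch with invented test scores clustered around 70 and 90 shows the effect. The scores are assumptions chosen for illustration, not the teacher's data.

```python
from statistics import mean, multimode

# Hypothetical scores: one cluster near 70, another near 90.
scores = [68, 69, 70, 70, 70, 71, 72,
          88, 89, 90, 90, 90, 91, 92]

avg = mean(scores)
modes = multimode(scores)  # every value tied for most frequent

# Count how many students scored within 5 points of the mean.
near_mean = [s for s in scores if abs(s - avg) <= 5]

print("mean:", avg)
print("modes:", modes)
print("scores within 5 points of the mean:", len(near_mean))
```

Here the mean lands in the empty gap between the two clusters, while `multimode` reports both peaks.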

You are given a dataset where the salaries of a company are reported. The CEO's salary is significantly higher than the rest of the employees. Which measure of central tendency would give a more representative measure of the typical salary?

  • Mean
  • Median
  • Mode
  • None would be representative
The "Median" would be a more representative measure of the typical salary. Because the CEO's salary is an outlier and would significantly skew the mean, the median provides a more accurate central measure by considering the middle value in the sorted data.
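As a quick sketch with made-up salaries (six employees plus one very large CEO salary, all figures hypothetical):

```python
from statistics import mean, median

# Hypothetical salaries; the last entry plays the role of the CEO.
salaries = [48_000, 52_000, 55_000, 58_000, 61_000, 64_000, 2_500_000]

mean_salary = mean(salaries)
median_salary = median(salaries)

print("mean:  ", round(mean_salary))
print("median:", median_salary)
```

The mean ends up far above what any non-CEO employee earns, whereas the median stays at the middle employee's salary.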

Which of the following graphs can help identify outliers in a univariate dataset?

  • Bar Chart
  • Box Plot
  • Line Graph
  • Pie Chart
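A box plot can help identify outliers in a univariate dataset: by the usual convention, points lying more than 1.5 times the IQR beyond the quartiles are drawn individually past the whiskers. Bar charts, line graphs, and pie charts summarize categories or trends and do not single out extreme values.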
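The rule that many box plots use to flag outliers can be sketched without any plotting library. This assumes the common 1.5 × IQR fence convention and the default quartile method of `statistics.quantiles`; plotting packages may compute quartiles slightly differently, and the data are invented.

```python
from statistics import quantiles

def fence_outliers(data, k=1.5):
    """Return values lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(data, n=4)
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [x for x in data if x < lower or x > upper]

# Illustrative values with one extreme point.
values = [10, 12, 13, 14, 15, 16, 17, 18, 20, 95]
outliers = fence_outliers(values)

print("flagged as outliers:", outliers)
```

Only the extreme value falls outside the fences; on a box plot it would appear as an isolated point beyond the whisker.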

How does the Spearman's correlation handle ties compared to Kendall's Tau?

  • It doesn't handle ties
  • It handles ties better than Kendall's Tau
  • It handles ties worse than Kendall's Tau
  • The method of handling ties is the same
Spearman's correlation coefficient handles ties worse than Kendall's Tau. Both are rank correlation coefficients, but Spearman's deals with ties only by assigning each tied group the mean of the ranks its members would have received if they weren't tied, whereas Kendall's Tau (in its tau-b form) adjusts explicitly for tied pairs.
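The mean-rank treatment of ties can be sketched in a few lines. `average_ranks` is a hypothetical helper written for this illustration, not a function from any statistics library.

```python
def average_ranks(values):
    """Rank values from 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of values equal to the one at `pos`.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Positions pos..end correspond to 1-based ranks pos+1..end+1.
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

ranks = average_ranks([10, 20, 20, 30])
print(ranks)
```

The two tied values would have taken ranks 2 and 3, so each receives 2.5. Spearman's coefficient is then computed on these ranks.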

In a correlation matrix, the value -1 signifies a perfect _____ correlation between two variables.

  • negative
  • neutral
  • positive
  • random
In a correlation matrix, a value of -1 signifies a perfect negative correlation between two variables. This means that as one variable increases, the other decreases in a perfectly consistent way (for Pearson correlation, along an exact straight line), and vice versa.
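A minimal sketch of a perfect negative Pearson correlation, using a hand-written `pearson` helper (hypothetical, for illustration) and invented data in which `y` falls by exactly 2 for every unit increase in `x`:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [10, 8, 6, 4, 2]  # y = 12 - 2x, an exact decreasing line

r = pearson(x, y)
print("r =", r)
```

Any exact decreasing straight-line relationship gives -1, regardless of how steep the slope is.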

Outliers can make a histogram appear ____, hence, distorting the true distribution of the data.

  • skewed
  • spread out
  • symmetrical
  • uniform
Outliers can cause a histogram to appear skewed: they create isolated bars far from the main body of the data, stretching the axis into a long tail that distorts the apparent shape of the distribution.
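The effect can be sketched with a text-only histogram. The data, the bin width of 5, and the `bin_counts` helper are all assumptions made for this illustration.

```python
from collections import Counter

def bin_counts(data, width):
    """Count values per fixed-width bin, from bin 0 up to the largest value."""
    counts = Counter((x // width) * width for x in data)
    last_start = (max(data) // width) * width
    return [counts.get(start, 0) for start in range(0, last_start + width, width)]

# Illustrative non-negative integers with a single extreme value.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 30]
counts = bin_counts(data, width=5)

for i, c in enumerate(counts):
    print(f"{i * 5:>2}-{i * 5 + 4:<2} {'#' * c}")
```

Nearly all of the values sit in the first bin, followed by a run of empty bins and one lone bar at the far right, which is what gives the histogram its long right tail.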

Imagine you're working on a data project where the 'wrangle' phase is taking significantly longer than expected. How might this impact the rest of your EDA process?

  • It could delay subsequent steps and overall analysis timeline.
  • The communication phase will be quicker.
  • The explore phase might be shortened to make up for lost time.
  • The rest of the process will not be impacted.
If the 'wrangle' phase takes significantly longer than expected, it could delay subsequent steps and the overall timeline for the analysis. The EDA process is often iterative, so a delay in one phase reduces the time available for the phases that follow. Proper time management and planning are crucial for a successful data analysis project.