If missingness depends on unobserved data, the missing data mechanism is usually categorized as __________.

  • All missing data
  • MAR
  • MCAR
  • NMAR
If missingness depends on unobserved data, the missing data mechanism is usually categorized as NMAR (Not Missing at Random).
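A minimal sketch of why NMAR matters, assuming simulated data: the values, the cutoff of 55, and the 70% drop rate are all made up for illustration. Because the chance of a value going missing depends on the value itself, the observed mean ends up below the true mean.

```python
import random
import statistics

random.seed(1)

# Hypothetical "true" measurements; in practice these are never fully observed.
true_values = [random.gauss(50, 10) for _ in range(10_000)]

# NMAR: a value above 55 is dropped with probability 0.7, so missingness
# depends on the very quantity that goes unobserved.
observed = [v for v in true_values if not (v > 55 and random.random() < 0.7)]

true_mean = statistics.mean(true_values)
observed_mean = statistics.mean(observed)
print(f"true mean: {true_mean:.2f}, observed mean: {observed_mean:.2f}")
```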

What measure of central tendency is also known as the 50th percentile or the second quartile?

  • Mean
  • Median
  • Mode
The median is the measure of central tendency that is also known as the 50th percentile or the second quartile. When the data points are ordered from smallest to largest, the median is the value that separates the lower half of the data set from the upper half.
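As a quick check, assuming a small made-up sample, the median and the second quartile can be computed separately with Python's `statistics` module and compared:

```python
import statistics

data = [7, 1, 3, 9, 5]  # illustrative values, not from any real data set

median = statistics.median(data)

# With n=4, quantiles() returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(median, q2)
```

Both values coincide here because the second quartile is, by definition, the point below which 50% of the ordered data falls.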

A researcher measures the heights of a large group of individuals and finds that the data is symmetrically distributed with most of the values clustered around the mean. Which distribution does the data most likely follow?

  • Binomial Distribution
  • Normal Distribution
  • Poisson Distribution
  • Uniform Distribution
Given the characteristics of the data (a symmetric distribution with most values clustered around the mean), the data most likely follows a Normal Distribution.

How does the presence of outliers affect the range and interquartile range?

  • Decreases both
  • Increases IQR, but doesn't affect range
  • Increases both
  • Increases range, but doesn't affect IQR
Outliers strongly affect the range, because it measures the distance between the largest and smallest values, so a single extreme value increases it. The Interquartile Range (IQR) covers only the middle 50% of the data (Q3 minus Q1), so it is largely unaffected by outliers.
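The contrast can be sketched with a small made-up sample; `value_range` and `iqr` are helper names introduced here for illustration. Adding one extreme value changes the range dramatically, while the IQR shifts only slightly.

```python
import statistics

def value_range(values):
    """Distance between the largest and smallest value."""
    return max(values) - min(values)

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

clean = [10, 12, 13, 14, 15, 16, 18]   # illustrative values
with_outlier = clean + [100]           # same data plus one extreme value

print("range:", value_range(clean), "->", value_range(with_outlier))
print("IQR:  ", iqr(clean), "->", iqr(with_outlier))
```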

Which technique for handling missing data replaces missing values with the median of the available data?

  • Listwise Deletion
  • Median Imputation
  • Mode Imputation
  • Regression Imputation
Median Imputation replaces missing values with the median of the available data. It is useful because the median, unlike the mean, is robust to outliers. However, filling many gaps with the same value reduces the variance and can distort the original distribution of the data.
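A minimal sketch of the technique, assuming missing entries are represented as `None`; the function name `impute_median` and the sample values are made up for illustration.

```python
import statistics

def impute_median(values):
    """Return a copy of values with None replaced by the median of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

incomes = [32, 35, None, 38, 41, None, 250]  # 250 is a deliberate outlier
filled = impute_median(incomes)
print(filled)

# Spread before and after: the imputed values all sit at the median,
# so the variance of the filled data is smaller than that of the observed data.
observed = [v for v in incomes if v is not None]
print(statistics.pvariance(observed), statistics.pvariance(filled))
```

The outlier does not pull the fill value upward, but the repeated fill value does compress the spread, which is the distortion to watch for.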

While EDA is often conducted at the _______ of the data analysis process, CDA is usually done towards the _______.

  • end, start
  • middle, end
  • start, end
  • start, middle
EDA (Exploratory Data Analysis) is typically the first step in the data analysis process, where we explore the data. CDA (Confirmatory Data Analysis) is conducted towards the end to confirm or refute the hypotheses formed during EDA.

How does the variance affect the shape of a distribution?

  • Higher variance leads to a more skewed distribution
  • Higher variance leads to a more uniform distribution
  • Higher variance leads to a narrower distribution
  • Higher variance leads to a wider distribution
"Higher Variance" leads to a "Wider Distribution". Variance measures how far a set of numbers is spread out from their average value, thus a higher variance means a wider spread or dispersion.

How does the Central Limit Theorem relate to the Normal Distribution?

  • The Central Limit Theorem and the Normal Distribution are unrelated
  • The Central Limit Theorem states that any distribution can be transformed into a Normal Distribution
  • The Central Limit Theorem states that large samples will always follow a Normal Distribution
  • The Central Limit Theorem states that the sum of independent and identically distributed random variables tends toward a Normal Distribution
The Central Limit Theorem states that the sum (or mean) of a large number of independent and identically distributed random variables with finite variance, once standardized, tends toward a Normal Distribution as the number of variables increases, irrespective of the shape of the underlying distribution.
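A small simulation can make this concrete. The sketch below assumes Exponential(1) variables (a strongly skewed distribution with mean 1 and variance 1, chosen only for illustration) and checks what share of standardized sums falls within one standard deviation of the mean. For a Normal Distribution that share is about 68%.

```python
import math
import random

random.seed(42)

def standardized_sum(n):
    """Sum n Exponential(1) draws, then subtract the mean (n) and divide by the sd (sqrt(n))."""
    total = sum(random.expovariate(1.0) for _ in range(n))
    return (total - n) / math.sqrt(n)

def share_within_one_sd(n, trials=5_000):
    """Fraction of simulated standardized sums with absolute value at most 1."""
    hits = sum(abs(standardized_sum(n)) <= 1 for _ in range(trials))
    return hits / trials

single = share_within_one_sd(1)    # one skewed variable on its own
summed = share_within_one_sd(50)   # sum of 50 such variables

print(f"n=1:  {single:.3f}")
print(f"n=50: {summed:.3f}")
```

With a single variable the share is far from the Normal benchmark, while the sum of 50 variables comes close to it.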

Your organization has collected a large dataset from their latest marketing campaign and they want you to generate actionable insights from this data. Which type of data analysis would be the most suitable for this situation?

  • All are equally suitable
  • CDA
  • EDA
  • Predictive Modeling
EDA would be the most suitable initial approach as it involves exploring and understanding the dataset to identify patterns, trends, and potential relationships that can lead to actionable insights.

You have to present the sales data of a company over 10 years to the board of directors. What type of graph should you choose and why?

  • Histogram, because it shows distributions
  • Line graph, because it shows trends over time
  • Pie chart, because it shows proportions
  • Scatter plot, because it shows relationships between variables
A line graph would be the most suitable choice for presenting sales data over time. Line graphs are excellent for showing continuous data over time set at equal intervals, like years in this case. This would allow the board of directors to easily see trends, patterns, and fluctuations in sales data.