If missingness depends on unobserved data, the missing data mechanism is usually categorized as __________.

  • All missing data
  • MAR
  • MCAR
  • NMAR
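If missingness depends on unobserved data, the missing data mechanism is usually categorized as NMAR (Not Missing at Random). Under NMAR, the probability that a value is missing depends on the missing value itself, even after accounting for the observed data.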

What are some of the major limitations of Matplotlib that Plotly and Seaborn help to overcome?

  • All of the above
  • Lack of interactivity
  • Lack of statistical plots
  • Limited 3D plotting
Matplotlib, while powerful, has several limitations, including lack of interactivity and limited support for statistical plots. Both Seaborn and Plotly address these limitations – Seaborn adds high-level, attractive statistical plots while Plotly adds interactive capabilities.

Your organization has collected a large dataset from their latest marketing campaign and they want you to generate actionable insights from this data. Which type of data analysis would be the most suitable for this situation?

  • All are equally suitable
  • CDA
  • EDA
  • Predictive Modeling
EDA would be the most suitable initial approach as it involves exploring and understanding the dataset to identify patterns, trends, and potential relationships that can lead to actionable insights.

How does the Central Limit Theorem relate to the Normal Distribution?

  • The Central Limit Theorem and the Normal Distribution are unrelated
  • The Central Limit Theorem states that any distribution can be transformed into a Normal Distribution
  • The Central Limit Theorem states that large samples will always follow a Normal Distribution
  • The Central Limit Theorem states that the sum of independent and identically distributed random variables tends toward a Normal Distribution
The Central Limit Theorem states that the sum (or mean) of a large number of independent and identically distributed random variables with finite variance, once standardized, tends towards a Normal Distribution as the number of variables increases, regardless of the shape of the underlying distribution.
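
A small simulation can make this concrete. The sketch below is illustrative only: the choice of an Exponential(1) source distribution, the sample sizes, the seed, and the helper names `standardized_sample_means` and `skewness` are all assumptions made for this example. It uses skewness as a rough indicator of how far the distribution of the sample mean is from the symmetric Normal shape.

```python
import random
import statistics

def standardized_sample_means(n, trials, seed=0):
    """Return `trials` standardized means of n Exponential(1) draws."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exponential(1)
    results = []
    for _ in range(trials):
        mean = sum(rng.expovariate(1.0) for _ in range(n)) / n
        # Standardize: subtract the true mean, divide by the standard error
        results.append((mean - mu) / (sigma / n ** 0.5))
    return results

def skewness(xs):
    """Population skewness; close to 0 for a symmetric distribution."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

# Skewness of the standardized mean for increasing sample sizes
skew_by_n = {}
for n in (1, 5, 50):
    z = standardized_sample_means(n, trials=5000)
    skew_by_n[n] = skewness(z)
    print(f"n={n:>2}  skewness={skew_by_n[n]:.2f}")
```

With a strongly right-skewed source distribution, the printed skewness should shrink towards 0 as n grows, which is what the theorem predicts.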

How does the variance affect the shape of a distribution?

  • Higher variance leads to a more skewed distribution
  • Higher variance leads to a more uniform distribution
  • Higher variance leads to a narrower distribution
  • Higher variance leads to a wider distribution
Higher variance leads to a wider distribution. Variance measures how far a set of values is spread out from its mean, so a higher variance means greater spread, or dispersion, around the center.

While EDA is often conducted at the _______ of the data analysis process, CDA is usually done towards the _______.

  • end, start
  • middle, end
  • start, end
  • start, middle
EDA (Exploratory Data Analysis) is typically the first step in the data analysis process, where we explore the data. CDA (Confirmatory Data Analysis) is conducted towards the end to confirm or refute the hypotheses formed during EDA.

Which technique for handling missing data replaces missing values with the median of the available data?

  • Listwise Deletion
  • Median Imputation
  • Mode Imputation
  • Regression Imputation
Median Imputation replaces missing values with the median of the available data. It is useful because the median is robust to outliers, but filling many gaps with the same value can distort the original distribution, for example by understating its variance.
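
As a minimal sketch of the idea, the snippet below fills gaps with the median of the observed values. The helper name `impute_median`, the use of `None` to mark missing entries, and the sample `incomes` data are assumptions made for illustration.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]

# Hypothetical data with two missing entries and one extreme value
incomes = [32.0, 35.0, None, 38.0, 41.0, None, 400.0]
observed = [v for v in incomes if v is not None]
filled = impute_median(incomes)

print("median of observed:", statistics.median(observed))
print("mean of observed:  ", statistics.fmean(observed))
print("filled:", filled)
# Imputing a constant pulls values towards the center, so variance shrinks
print("variance before:", statistics.pvariance(observed))
print("variance after: ", statistics.pvariance(filled))
```

The extreme value drags the mean far above the typical value while the median stays put, and the variance after imputation is smaller than before, which is the distortion mentioned above.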

How does the presence of outliers affect the range and interquartile range?

  • Decreases both
  • Increases IQR, but doesn't affect range
  • Increases both
  • Increases range, but doesn't affect IQR
Outliers significantly increase the range, because it is the distance between the largest and smallest values. The Interquartile Range (IQR), which measures the spread of the middle 50% of the data, is largely unaffected by outliers.
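
The contrast can be checked on a tiny dataset. In this sketch the data, the helper names `value_range` and `iqr`, and the scenario (a recording error turning the largest value into 500) are assumptions for illustration; the quartiles come from `statistics.quantiles` with its default method.

```python
import statistics

def value_range(xs):
    """Distance between the largest and smallest value."""
    return max(xs) - min(xs)

def iqr(xs):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q3 - q1

clean = [float(i) for i in range(1, 12)]   # 1.0, 2.0, ..., 11.0
contaminated = clean[:-1] + [500.0]        # largest value replaced by an outlier

for name, data in (("clean", clean), ("contaminated", contaminated)):
    print(f"{name:>12}: range={value_range(data):.1f}  IQR={iqr(data):.1f}")
```

Only the extreme value changes here, so the quartiles, and therefore the IQR, stay the same while the range grows sharply.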

You're visualizing a bivariate data set using a scatter plot and notice an isolated group of points far from the main concentration of data. How would you categorize these points?

  • Negative correlation
  • Normal data points
  • Outliers
  • Positive correlation
In a scatter plot, a group of points that are isolated from the main concentration of data could be categorized as outliers.
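
Visual inspection can be backed by a simple numeric check. The sketch below is one possible heuristic, not a standard prescribed by the question: it flags points whose distance from the median center is far above the typical distance, using the median absolute deviation (MAD) as a robust scale. The function name `flag_isolated`, the threshold `k=5.0`, and the sample points are assumptions.

```python
import math
import statistics

def flag_isolated(points, k=5.0):
    """Flag points unusually far from the median center of the data."""
    cx = statistics.median(p[0] for p in points)
    cy = statistics.median(p[1] for p in points)
    dists = [math.hypot(x - cx, y - cy) for x, y in points]
    med = statistics.median(dists)
    # MAD: median of absolute deviations from the median distance
    mad = statistics.median(abs(d - med) for d in dists)
    return [p for p, d in zip(points, dists) if d > med + k * mad]

# Hypothetical bivariate data: a tight cluster plus two far-away points
points = [
    (1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3),
    (0.9, 0.8), (1.0, 1.2), (1.3, 1.0), (0.7, 0.9),
    (8.0, 8.0), (8.2, 7.9),
]
flagged = flag_isolated(points)
print(flagged)
```

Medians are used for the center and scale so that the isolated points themselves do not shift the reference they are measured against.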

In a longitudinal study on childhood development, some data points are missing randomly due to logistical issues during data collection. How would you classify this missing data?

  • MAR
  • MCAR
  • NMAR
  • Not missing data
This would be MCAR (Missing Completely at Random) because the reason for the missing data (logistical issues) has nothing to do with the observed or unobserved data. It's entirely random.
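
The practical difference between MCAR and NMAR shows up in simple summaries. The simulation below is a sketch under stated assumptions: the scores are synthetic Normal(100, 15) values, the 30% drop rate for MCAR and the rule "80% of values above 110 go missing" for NMAR are invented for illustration, and the variable names are hypothetical.

```python
import random
import statistics

rng = random.Random(42)
scores = [rng.gauss(100, 15) for _ in range(20000)]

# MCAR: every value has the same 30% chance of being lost,
# regardless of what the value is
mcar_observed = [x for x in scores if rng.random() > 0.3]

# NMAR: the chance of being lost depends on the (unobserved) value itself;
# here high scores are much more likely to go missing
nmar_observed = [x for x in scores if not (x > 110 and rng.random() < 0.8)]

print("true mean:         ", round(statistics.fmean(scores), 2))
print("mean under MCAR:   ", round(statistics.fmean(mcar_observed), 2))
print("mean under NMAR:   ", round(statistics.fmean(nmar_observed), 2))
```

Under MCAR the observed mean should stay close to the true mean, since the remaining data are still a random subset. Under NMAR the observed mean is pulled downward, because the values that went missing were systematically high.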

How does EDA contribute to the model building process in Machine Learning?

  • By defining the ML algorithm to be used
  • By fine-tuning the hyperparameters of the ML model
  • By providing insights into the nature of data, and identifying trends and outliers
  • By testing the performance of the ML model
EDA is integral to the model-building process in Machine Learning as it provides insights into the nature of the data, and identifies trends, patterns and outliers. These insights help to determine which Machine Learning models might be most appropriate to apply and can guide the feature engineering process.

During the '______' phase of the EDA process, you might use visualization techniques to understand the patterns in your data.

  • communicating
  • exploring
  • questioning
  • wrangling
During the 'exploring' phase of the EDA process, you might use visualization techniques to understand the patterns in your data. This step involves delving into the data to discover patterns, spot anomalies, test hypotheses, and check assumptions.