A researcher measures the heights of a large group of individuals and finds that the data is symmetrically distributed with most of the values clustered around the mean. Which distribution does the data most likely follow?

  • Binomial Distribution
  • Normal Distribution
  • Poisson Distribution
  • Uniform Distribution
The data most likely follows a Normal Distribution. A symmetric shape with most values clustered around the mean is the defining bell-curve pattern, and human height is a standard example of an approximately normal variable.
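
As a rough illustration of those two properties, the sketch below simulates heights and checks them numerically. The mean of 170 cm and standard deviation of 8 cm are assumed values chosen for the example, not figures from the question.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical heights in cm (assumed mean 170, assumed standard deviation 8)
heights = [random.gauss(170, 8) for _ in range(10_000)]

mean = statistics.fmean(heights)
median = statistics.median(heights)
sd = statistics.stdev(heights)

# Share of values that fall within one standard deviation of the mean
within_one_sd = sum(abs(h - mean) <= sd for h in heights) / len(heights)

print(f"mean={mean:.1f}  median={median:.1f}  within 1 SD={within_one_sd:.1%}")
```

For normally distributed data the mean and median nearly coincide (symmetry) and roughly 68% of values lie within one standard deviation of the mean (clustering).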

Which type of missing data relies on information that is not included in the dataset?

  • MAR
  • MCAR
  • NMAR
  • nan
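NMAR (Not Missing At Random, also written MNAR) is the type of missing data that relies on information not included in the dataset: the probability that a value is missing depends on the unobserved value itself.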

You are working on a dataset with ordinal variables. You are interested in the correlation between these variables. Which correlation coefficient would be the best choice and why?

  • Covariance
  • Kendall's Tau
  • Pearson's correlation coefficient
  • Spearman's correlation coefficient
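In a dataset with ordinal variables, Spearman's correlation coefficient would be the best choice. It works on ranks rather than raw values, so it does not assume normally distributed data or equal spacing between categories, which makes it suitable for ordinal data. Kendall's Tau is also rank-based and is a reasonable alternative, particularly for small samples or data with many ties.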
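
To make "works with ranks" concrete, here is a minimal sketch that computes Spearman's coefficient as Pearson's correlation applied to average ranks. The helper names and the education and satisfaction scores are hypothetical, invented for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ordinal data: education level (1-4) and satisfaction (1-5)
education = [1, 2, 2, 3, 3, 4, 4, 4]
satisfaction = [1, 2, 3, 3, 4, 4, 5, 5]

rho = spearman(education, satisfaction)
print(f"Spearman's rho = {rho:.3f}")
```

Because only the ordering enters the calculation, relabeling the categories with any other increasing codes leaves the result unchanged.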

During the '______' phase of the EDA process, you might use visualization techniques to understand the patterns in your data.

  • communicating
  • exploring
  • questioning
  • wrangling
During the 'exploring' phase of the EDA process, you might use visualization techniques to understand the patterns in your data. This step involves delving into the data to discover patterns, spot anomalies, test hypotheses, and check assumptions.

How does EDA contribute to the model building process in Machine Learning?

  • By defining the ML algorithm to be used
  • By fine-tuning the hyperparameters of the ML model
  • By providing insights into the nature of data, and identifying trends and outliers
  • By testing the performance of the ML model
EDA is integral to the model-building process in Machine Learning as it provides insights into the nature of the data, and identifies trends, patterns and outliers. These insights help to determine which Machine Learning models might be most appropriate to apply and can guide the feature engineering process.

In a longitudinal study on childhood development, some data points are missing randomly due to logistical issues during data collection. How would you classify this missing data?

  • MAR
  • MCAR
  • NMAR
  • Not missing data
This would be MCAR (Missing Completely at Random) because the reason for the missing data (logistical issues) has nothing to do with the observed or unobserved data. It's entirely random.
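
The practical difference between the mechanisms can be sketched with a small simulation. The scores, the 20% loss rate, and the rule that values above 110 go missing 60% of the time are all assumptions made up for this example.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical development scores (assumed mean 100, assumed SD 15)
scores = [random.gauss(100, 15) for _ in range(5000)]

# MCAR: every value has the same 20% chance of being lost,
# regardless of what the value is (e.g. a logistical failure)
mcar_observed = [s for s in scores if random.random() > 0.2]

# NMAR: the chance of being lost depends on the unobserved value itself;
# here, scores above 110 are dropped 60% of the time
nmar_observed = [s for s in scores if not (s > 110 and random.random() < 0.6)]

true_mean = statistics.fmean(scores)
mcar_mean = statistics.fmean(mcar_observed)
nmar_mean = statistics.fmean(nmar_observed)

print(f"true={true_mean:.1f}  MCAR={mcar_mean:.1f}  NMAR={nmar_mean:.1f}")
```

Under MCAR the observed mean stays close to the true mean, so the loss mainly costs sample size. Under NMAR the observed mean is biased, and nothing in the remaining data reveals why.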

You're visualizing a bivariate data set using a scatter plot and notice an isolated group of points far from the main concentration of data. How would you categorize these points?

  • Negative correlation
  • Normal data points
  • Outliers
  • Positive correlation
In a scatter plot, a group of points that are isolated from the main concentration of data could be categorized as outliers.
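
One way to flag such points numerically is to measure each point's distance from the median point and compare it with a robust threshold. This is a heuristic sketch, not a standard named test; the function name, the multiplier `k=3.0`, and the sample points are assumptions for illustration.

```python
import math
import statistics


def flag_outliers(points, k=3.0):
    """Return points whose distance from the coordinate-wise median exceeds
    median distance + k * MAD of the distances (a heuristic rule)."""
    cx = statistics.median(p[0] for p in points)
    cy = statistics.median(p[1] for p in points)
    dists = [math.hypot(x - cx, y - cy) for x, y in points]
    med = statistics.median(dists)
    # Median absolute deviation of the distances: a robust measure of spread
    mad = statistics.median(abs(d - med) for d in dists)
    threshold = med + k * mad
    return [p for p, d in zip(points, dists) if d > threshold]


# Hypothetical bivariate data: a tight cluster plus two isolated points
points = [
    (1.0, 1.2), (1.1, 0.9), (0.9, 1.0), (1.2, 1.1), (1.0, 0.8),
    (0.8, 1.1), (1.3, 1.0), (1.1, 1.3), (0.9, 0.9), (1.0, 1.0),
    (8.0, 8.2), (8.3, 7.9),
]

print(flag_outliers(points))
```

Medians are used instead of means so that the isolated points cannot drag the center or the threshold toward themselves. If the distances have zero spread the rule degenerates, so real data would need a more careful method.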

Multicollinearity refers to a situation where _________.

  • All variables in a model are perfectly uncorrelated
  • Two or more predictors in a regression model are highly correlated
  • Two variables are uncorrelated
  • Two variables have a correlation coefficient of zero
Multicollinearity refers to a situation in which two or more predictors in a regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy.
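
A quick check for two predictors is their pairwise correlation and the variance inflation factor, which for the two-predictor case reduces to 1 / (1 - r^2). The house-size and room-count figures below are hypothetical, and the cutoffs sometimes quoted for a "high" VIF (around 5 or 10) are rules of thumb rather than fixed standards.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


# Hypothetical predictors for a house-price regression
size_m2 = [50, 65, 80, 95, 110, 130, 150]
rooms = [2, 3, 3, 4, 5, 5, 6]

r = pearson(size_m2, rooms)

# With exactly two predictors, R^2 from regressing one on the other equals r^2
vif = 1 / (1 - r ** 2)

print(f"r = {r:.3f}, VIF = {vif:.1f}")
```

A correlation this close to 1 means the two predictors carry nearly the same information, so a regression cannot separate their individual effects reliably.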

You have to present the sales data of a company over 10 years to the board of directors. What type of graph should you choose and why?

  • Histogram, because it shows distributions
  • Line graph, because it shows trends over time
  • Pie chart, because it shows proportions
  • Scatter plot, because it shows relationships between variables
A line graph would be the most suitable choice for presenting sales data over time. Line graphs are excellent for showing continuous data over time set at equal intervals, like years in this case. This would allow the board of directors to easily see trends, patterns, and fluctuations in sales data.

The _____ is the most appropriate measure of central tendency when the distribution of data is heavily skewed.

  • Mean
  • Median
  • Mode
  • Standard Deviation
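The "Median" is the most appropriate measure of central tendency when the distribution of data is heavily skewed. Extreme values in the tail pull the mean toward them, whereas the median depends only on the middle of the ordered data and so is far less affected by outliers and skew.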
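
A small worked example shows the effect. The income figures are hypothetical, with one very large value added to create a right skew.

```python
import statistics

# Hypothetical annual incomes in thousands; the last value creates a right skew
incomes = [28, 31, 33, 35, 38, 40, 42, 45, 50, 400]

mean_income = statistics.mean(incomes)      # pulled upward by the 400
median_income = statistics.median(incomes)  # average of the two middle values

print(f"mean = {mean_income}, median = {median_income}")
```

Here the mean is 74.2, higher than nine of the ten values, while the median of 39.0 still describes a typical income.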