The mode is the only measure of central tendency that can be used for _____ data.

Categorical
Interval
Numerical
Ordinal

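The mode is the only measure of central tendency that can be used for categorical data, specifically nominal data whose categories have no natural order. It simply identifies the most frequently occurring category, whereas the mean and median need values that can be averaged or ranked.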
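
As a small illustration, the mode of a categorical variable can be found by counting labels. The color values below are made-up sample data, not taken from any real dataset.

```python
from collections import Counter

# Hypothetical categorical observations (illustrative data only).
colors = ["red", "blue", "red", "green", "blue", "red"]

# Count each category and take the most frequent one.
counts = Counter(colors)
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)  # red 3
```

No arithmetic is performed on the labels themselves, which is why this works where a mean would not.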

You are developing a linear regression model and notice that despite a high R-squared value, none of your independent variables are statistically significant. What might be the potential issue here?

Data leakage
High variance
Multicollinearity
Underfitting

This is most likely multicollinearity. When independent variables are strongly correlated with each other, the variances of the coefficient estimates are inflated, so individual significance tests may show none of them as statistically significant. The predictors can still explain the response well jointly, which is why the R-squared value stays high.
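
One common way to check for this is the variance inflation factor (VIF). The sketch below assumes the simplest case of exactly two predictors, where the VIF reduces to 1 / (1 - r^2) with r the correlation between them. The predictor values and the helper name `pearson_r` are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical predictors: x2 is almost a copy of x1 (illustrative data only).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

r = pearson_r(x1, x2)
# With exactly two predictors, each one has VIF = 1 / (1 - r^2).
vif = 1.0 / (1.0 - r ** 2)

print(f"r = {r:.4f}, VIF = {vif:.1f}")
```

A VIF above roughly 10 is often used as a rule of thumb for problematic multicollinearity, though the threshold is a convention rather than a strict cutoff.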

You have a dataset with an odd number of observations. If you were to calculate both the mean and median, how would adding a very large value to the dataset affect these measures of central tendency?

Both would increase
Both would remain unchanged
Only the mean would increase
Only the median would increase

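Adding a very large value to the dataset would increase the mean, because the mean takes every value in the dataset into account. The median is far less sensitive: with one extra observation the count becomes even, so the median becomes the average of the old middle value and the next larger one, which leaves it unchanged or only slightly higher. The mean is therefore the measure that is materially affected.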
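
The difference in sensitivity can be seen with a quick check using Python's `statistics` module. The numbers are made up for illustration.

```python
import statistics

# Hypothetical dataset with an odd number of observations.
data = [2, 4, 6, 8, 10]
extended = data + [1000]  # add one very large value

mean_before = statistics.mean(data)
median_before = statistics.median(data)
mean_after = statistics.mean(extended)
median_after = statistics.median(extended)

print(mean_before, median_before)   # mean 6, median 6
print(mean_after, median_after)     # mean jumps past 170, median is 7.0
```

Here the mean grows by more than a factor of 25, while the median moves only from 6 to the midpoint of 6 and 8.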

In _____ deletion, all data from a participant is discarded if any single value is missing.

Listwise
Pairwise
Random
Systematic

In listwise deletion, all data from a participant is discarded if any single value is missing. It is the simplest way to handle missing data, but it can discard a large share of the dataset, and it can bias the results when the data are not missing completely at random.
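
A minimal sketch of listwise deletion in plain Python follows. The records and field names are hypothetical, and `None` is assumed to mark a missing value.

```python
# Hypothetical survey records; None marks a missing value (illustrative data only).
records = [
    {"id": 1, "age": 34, "income": 52000, "score": 7},
    {"id": 2, "age": None, "income": 48000, "score": 6},
    {"id": 3, "age": 29, "income": 61000, "score": 9},
    {"id": 4, "age": 41, "income": None, "score": 5},
]

# Listwise deletion: keep a participant only if no field is missing.
complete_cases = [r for r in records if all(v is not None for v in r.values())]

kept_ids = [r["id"] for r in complete_cases]
print(kept_ids)  # [1, 3]
```

Half of the participants are dropped even though only two individual values are missing, which shows how quickly information is lost.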

A correlation coefficient of +1 between two variables indicates what kind of relationship?

No relationship
Perfect negative linear relationship
Perfect positive linear relationship
Weak relationship

A correlation coefficient of +1 between two variables indicates a perfect positive linear relationship. This means all points lie exactly on an upward-sloping straight line: whenever one variable increases, the other increases at a constant rate.

You've 'explored' the data and drawn some conclusions, but upon 'communicating' your findings, stakeholders have additional questions. What would be the next step in the EDA process?

Direct the stakeholders to the raw data
Ignore the questions and conclude the analysis
Revisit the questioning phase with these new questions
Wrap up the communication phase quickly

In this situation, the next step should be to revisit the 'questioning' phase with these new questions from stakeholders. Additional questions from stakeholders might reflect aspects of the data that have not been covered or require further investigation. Revisiting the questioning phase will allow these aspects to be incorporated into the analysis.

What kind of distribution is indicated by a skewness of zero?

A bimodal distribution.
A negatively skewed distribution.
A normal distribution.
A positively skewed distribution.

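A skewness of zero indicates a symmetric distribution, and among the options the normal distribution is the one that fits. In a perfect normal distribution both tails are equal and balance each other out, so the skewness is zero. Zero skewness alone does not prove normality, since other symmetric distributions, such as the uniform, also have zero skewness.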
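
The following sketch computes skewness as the third central moment divided by the cubed standard deviation (the population form; libraries may apply a small-sample correction). The function name `skewness` and both datasets are hypothetical.

```python
def skewness(values):
    """Population skewness: third central moment divided by sd cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5, 6, 7]       # symmetric about 4, but not bell-shaped
right_skewed = [1, 1, 2, 2, 3, 4, 15]   # long right tail

skew_symmetric = skewness(symmetric)
skew_right = skewness(right_skewed)

print(skew_symmetric)   # 0.0
print(skew_right > 0)   # True
```

The first dataset is evenly spread rather than normal, yet its skewness is still zero, because skewness measures symmetry only.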

If missing data is not properly addressed, the model's ________ can be significantly affected.

F1 score
accuracy
precision
recall

If missing data is not handled correctly, it can introduce bias and shrink the usable sample, both of which can adversely affect the model's accuracy.

In a clinical trial, the average recovery time for a new drug is drastically increased due to a patient who took an unusually long time to recover. How would this patient's data point be classified in this context?

A normal data point
An error
An outlier
nan

This patient's data point would be classified as an outlier as it deviates significantly from the other data points.
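
One widely used way to flag such a point is the 1.5 x IQR rule, sketched below with `statistics.quantiles`. The recovery times are invented for illustration, and the rule is one convention among several rather than a definition of an outlier.

```python
import statistics

# Hypothetical recovery times in days; the last patient is unusually slow.
recovery_days = [5, 6, 6, 7, 7, 8, 8, 9, 45]

# Quartiles (default "exclusive" method), then the interquartile range.
q1, _, q3 = statistics.quantiles(recovery_days, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside [lower, upper] is flagged as an outlier.
outliers = [d for d in recovery_days if d < lower or d > upper]
print(outliers)  # [45]

# The single outlier pulls the average recovery time up sharply.
print(statistics.mean(recovery_days[:-1]), statistics.mean(recovery_days))
```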

A researcher measures the heights of a large group of individuals and finds that the data is symmetrically distributed with most of the values clustered around the mean. Which distribution does the data most likely follow?

Binomial Distribution
Normal Distribution
Poisson Distribution
Uniform Distribution

Given the characteristics of the data - symmetric distribution and most values clustered around the mean, it is most likely that the data follows a Normal Distribution.
