How does the Central Limit Theorem relate to the Normal Distribution?

The Central Limit Theorem and the Normal Distribution are unrelated
The Central Limit Theorem states that any distribution can be transformed into a Normal Distribution
The Central Limit Theorem states that large samples will always follow a Normal Distribution
The Central Limit Theorem states that the sum of independent and identically distributed random variables tends toward a Normal Distribution

The Central Limit Theorem states that the suitably normalized sum (or mean) of a large number of independent and identically distributed random variables with finite variance tends toward a Normal Distribution as the number of variables increases, irrespective of the shape of the underlying distribution.
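
As a rough illustration, the simulation below sums draws from a strongly skewed Exponential(1) distribution and standardizes each sum. The helper names `standardized_sums` and `skewness`, the seed, and the trial counts are assumptions made for this sketch. If the theorem holds, the skewness of the standardized sums should shrink toward 0 (the skewness of a Normal Distribution) as more variables are summed.

```python
import random
import statistics

def standardized_sums(n_vars, n_trials, seed=0):
    """Standardized sums of n_vars i.i.d. Exponential(1) draws."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exponential(1)
    sums = []
    for _ in range(n_trials):
        total = sum(rng.expovariate(1.0) for _ in range(n_vars))
        # Subtract the expected sum and divide by its standard deviation.
        sums.append((total - n_vars * mu) / (sigma * n_vars ** 0.5))
    return sums

def skewness(values):
    """Sample skewness: the mean of the cubed z-scores."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Skewness should fall toward 0 as the number of summed variables grows.
for n_vars in (1, 5, 50):
    z = standardized_sums(n_vars, n_trials=5000)
    print(n_vars, round(skewness(z), 2))
```

For Exponential(1) the theoretical skewness of the sum is 2 divided by the square root of the number of variables, so the printed values should decrease roughly at that rate.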

You're visualizing a bivariate data set using a scatter plot and notice an isolated group of points far from the main concentration of data. How would you categorize these points?

Negative correlation
Normal data points
Outliers
Positive correlation

In a scatter plot, a group of points that lies far from the main concentration of data is categorized as outliers.
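
One simple way to flag such points in code is sketched below. The rule used here (distance from the coordinate-wise median greater than `k` times the median distance) is a hypothetical rule of thumb chosen for illustration, not a standard definition of an outlier, and the data are made up.

```python
import math
import statistics

def flag_outliers(points, k=3.0):
    """Return points whose distance from the coordinate-wise median
    exceeds k times the median distance (an illustrative rule only)."""
    cx = statistics.median(x for x, _ in points)
    cy = statistics.median(y for _, y in points)
    dists = [math.hypot(x - cx, y - cy) for x, y in points]
    cutoff = k * statistics.median(dists)
    return [p for p, d in zip(points, dists) if d > cutoff]

# Made-up data: a tight cloud near (1, 1) plus two isolated points.
main_cloud = [(1.0, 1.2), (1.1, 0.9), (0.9, 1.0), (1.2, 1.1), (1.0, 0.8), (0.8, 1.1)]
isolated = [(8.0, 8.5), (8.2, 8.1)]

print(flag_outliers(main_cloud + isolated))
```

Medians are used instead of means so that the isolated points themselves do not drag the center toward them.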

Multicollinearity refers to a situation where _________.

All variables in a model are perfectly uncorrelated
Two or more predictors in a regression model are highly correlated
Two variables are uncorrelated
Two variables have a correlation coefficient of zero

Multicollinearity refers to a situation in which two or more predictors in a regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy.
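
A minimal sketch of how this can be checked for two predictors: compute their Pearson correlation r and the variance inflation factor, which in the two-predictor case reduces to 1 / (1 - r^2). The variable names (`size`, `rooms`, `age`) and the simulated data are assumptions made for this example.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def vif_two_predictors(x1, x2):
    """VIF of either predictor when the model has exactly two: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)

rng = random.Random(42)
size = [rng.uniform(50, 200) for _ in range(200)]   # hypothetical predictor
rooms = [s / 25 + rng.gauss(0, 0.3) for s in size]  # almost a linear function of size
age = [rng.uniform(0, 40) for _ in range(200)]      # generated independently of size

print("size vs rooms:", round(vif_two_predictors(size, rooms), 1))
print("size vs age:  ", round(vif_two_predictors(size, age), 2))
```

A VIF near 1 indicates little shared linear information, while large values (a common rule of thumb is above 5 or 10) point to multicollinearity.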

You have to present the sales data of a company over 10 years to the board of directors. What type of graph should you choose and why?

Histogram, because it shows distributions
Line graph, because it shows trends over time
Pie chart, because it shows proportions
Scatter plot, because it shows relationships between variables

A line graph would be the most suitable choice for presenting sales data over time. Line graphs are excellent for showing continuous data over time set at equal intervals, like years in this case. This would allow the board of directors to easily see trends, patterns, and fluctuations in sales data.

The _____ is the most appropriate measure of central tendency when the distribution of data is heavily skewed.

Mean
Median
Mode
Standard Deviation

The median is the most appropriate measure of central tendency when the distribution of data is heavily skewed. Unlike the mean, which is pulled toward the long tail, the median depends only on the middle of the ordered data, so it is far less affected by outliers and skew.
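
A small worked example, using made-up incomes (in thousands) with one extreme value, shows how far the two measures can diverge:

```python
import statistics

# Made-up, right-skewed data: seven typical incomes and one extreme value.
incomes = [30, 32, 35, 38, 40, 42, 45, 1000]

mean_income = statistics.mean(incomes)      # pulled up by the single extreme value
median_income = statistics.median(incomes)  # average of the two middle values, 38 and 40

print(mean_income)    # 157.75
print(median_income)  # 39.0
```

The mean is larger than seven of the eight values, while the median still sits in the middle of the typical incomes.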

The kernel in a kernel density plot is a _____ function.

Exponential
Linear
Logarithmic
Smoothing

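In a kernel density plot, the kernel is a smoothing function. A kernel (commonly a Gaussian) is centered on each data point and the kernels are averaged, which smooths the raw data into a continuous curve showing the distribution and density of the data. The bandwidth controls how much smoothing is applied.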
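
The idea can be sketched by hand with a Gaussian kernel. The function names, the sample data, and the bandwidth of 0.5 are assumptions chosen for illustration; plotting libraries typically pick a bandwidth automatically.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(data, x, bandwidth):
    """Kernel density estimate at x: the average of kernels centered on each point."""
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in data) / (len(data) * bandwidth)

data = [1.0, 1.2, 1.5, 2.0, 2.1, 5.0]  # made-up sample
bandwidth = 0.5                        # hypothetical choice; larger means smoother

# Evaluate the smooth curve on a grid, as a plotting routine would.
for x in (0.0, 1.5, 3.5, 5.0):
    print(x, round(kde(data, x, bandwidth), 3))
```

The estimate is high near the cluster between 1 and 2, low in the gap around 3.5, and shows a smaller bump at the isolated point 5.0.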

Consider a situation where a Spearman's correlation coefficient between two variables was found to be significantly different from zero, but the Pearson's correlation was not. How would you explain this?

Data has outliers
Data is normally distributed
Relationship between variables is linear
Relationship between variables is non-linear

If Spearman's correlation is significantly different from zero but Pearson's correlation is not, it suggests that the relationship between the variables is monotonic but not linear. Spearman's correlation works on ranks, so unlike Pearson's it can capture monotonic relationships that are not necessarily linear.
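
The difference between the two coefficients can be seen on a made-up monotonic but strongly non-linear relationship, y = exp(x). This sketch only compares the coefficient values and does not run a significance test; the simple `ranks` helper assumes there are no tied values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """1-based ranks; assumes all values are distinct (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

x = list(range(1, 11))
y = [math.exp(v) for v in x]  # increases with x, but far from a straight line

print("Pearson: ", round(pearson(x, y), 3))
print("Spearman:", round(spearman(x, y), 3))
```

Because y always increases with x, the ranks agree perfectly and Spearman's coefficient reaches 1, while Pearson's coefficient is noticeably lower.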

Which type of missing data relies on information that is not included in the dataset?

MAR
MCAR
NMAR
nan

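NMAR (Not Missing At Random, also written MNAR) is the type of missing data that relies on information not included in the dataset: whether a value is missing depends on unobserved information, typically the missing value itself.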
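
The practical consequence can be sketched with a small simulation. All numbers here are assumptions: simulated incomes, a 30% random dropout for the MCAR case, and an 80% dropout for incomes above 60 for the NMAR case. Because the NMAR missingness depends on the hidden value itself, the observed data alone cannot reveal or correct the resulting bias.

```python
import random
import statistics

rng = random.Random(7)
incomes = [rng.gauss(50, 10) for _ in range(5000)]  # simulated complete data

# MCAR: every value has the same 30% chance of being missing.
mcar_observed = [v for v in incomes if rng.random() > 0.3]

# NMAR: high incomes (above 60) are withheld 80% of the time,
# so missingness depends on the value that goes unrecorded.
nmar_observed = [v for v in incomes if not (v > 60 and rng.random() < 0.8)]

true_mean = statistics.mean(incomes)
mcar_mean = statistics.mean(mcar_observed)
nmar_mean = statistics.mean(nmar_observed)

print("complete data:", round(true_mean, 2))
print("MCAR observed:", round(mcar_mean, 2))
print("NMAR observed:", round(nmar_mean, 2))
```

The MCAR mean should stay close to the complete-data mean, while the NMAR mean should be visibly lower because the largest values are disproportionately missing.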

You are working on a dataset with ordinal variables. You are interested in the correlation between these variables. Which correlation coefficient would be the best choice and why?

Covariance
Kendall's Tau
Pearson's correlation coefficient
Spearman's correlation coefficient

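In a dataset with ordinal variables, Spearman's correlation coefficient would be the best choice. This is because Spearman's correlation coefficient does not assume that the data is normally distributed and works with ranks rather than raw values, making it suitable for ordinal data. Kendall's Tau is also rank-based and is a reasonable alternative, particularly for small samples or data with many ties.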
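
Ordinal data usually contain ties, so the sketch below assigns average ranks to tied values before correlating the ranks. The coded variables (`education` and `satisfaction`) are made-up examples, and the helper names are assumptions for this illustration.

```python
def average_ranks(values):
    """1-based ranks where tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Made-up ordinal codes: education level (1-4) and satisfaction rating (1-5).
education = [1, 1, 2, 2, 3, 3, 4, 4]
satisfaction = [2, 1, 2, 3, 3, 4, 4, 5]

print(round(spearman(education, satisfaction), 3))
```

Only the ordering of the codes matters here: relabeling the categories with any other increasing set of numbers leaves the coefficient unchanged.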

During the '______' phase of the EDA process, you might use visualization techniques to understand the patterns in your data.

communicating
exploring
questioning
wrangling

During the 'exploring' phase of the EDA process, you might use visualization techniques to understand the patterns in your data. This step involves delving into the data to discover patterns, spot anomalies, test hypotheses, and check assumptions.
