The Central Limit Theorem states that the sum of a large number of independent and identically distributed variables will approximately follow a _____ Distribution, regardless of the shape of the original distribution.
- Binomial
- Normal
- Poisson
- Uniform
The Central Limit Theorem states that the sum (or mean) of a large number of independent and identically distributed variables with finite variance will approximately follow a Normal Distribution, regardless of the shape of the original distribution.
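The effect can be checked with a small simulation. The following is a minimal sketch, assuming NumPy is available; the exponential source distribution, the sample size of 50, the seed, and the helper name `skewness` are illustrative choices, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# 20,000 independent samples of size 50 from a strongly right-skewed
# exponential distribution (illustrative choice of source distribution).
n, trials = 50, 20_000
draws = rng.exponential(scale=1.0, size=(trials, n))
sums = draws.sum(axis=1)  # one sum per sample


def skewness(x):
    """Sample skewness: the mean of the cubed standardized values."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))


raw_skew = skewness(draws.ravel())  # skewness of the individual draws
sum_skew = skewness(sums)           # skewness of the sums

print(f"skewness of raw draws: {raw_skew:.2f}")
print(f"skewness of sums:      {sum_skew:.2f}")
```

The individual draws are heavily skewed, while the sums are far more symmetric, which is the behavior the theorem describes. Increasing `n` should bring the skewness of the sums closer to zero.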
You are using a box plot to analyze a dataset and observe that the upper whisker is much longer than the lower whisker. What could this indicate about your data?
- Data has negative skewness
- Data has positive skewness
- Data is evenly distributed
- Data is normally distributed
If the upper whisker in a box plot is much longer than the lower whisker, the data is likely positively (right) skewed: values above the median are spread over a wider range than values below it, forming a long right tail.
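As a rough illustration, the whisker lengths can be computed directly. This sketch assumes NumPy, a synthetic log-normal sample, and the common convention that whiskers reach the most extreme points within 1.5 × IQR of the box; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic right-skewed data (an assumption made for illustration).
data = rng.lognormal(mean=0.0, sigma=1.0, size=1_000)

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Whiskers end at the most extreme observations inside the 1.5 * IQR fences.
lower_whisker = data[data >= q1 - 1.5 * iqr].min()
upper_whisker = data[data <= q3 + 1.5 * iqr].max()

lower_len = q1 - lower_whisker
upper_len = upper_whisker - q3

print(f"lower whisker length: {lower_len:.2f}")
print(f"upper whisker length: {upper_len:.2f}")
```

For this sample the upper whisker comes out several times longer than the lower one, matching the positive-skew reading of the plot.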
_______ is typically used when the data analyst has no specific expectations from the data, whereas _______ is used when the analyst wants to confirm certain assumptions.
- CDA, EDA
- EDA, CDA
- EDA, Predictive Modeling
- Predictive Modeling, EDA
EDA (Exploratory Data Analysis) is typically used when the data analyst does not have specific expectations or hypotheses about the data. It is an open-ended process where we aim to discover patterns and anomalies in the data. CDA (Confirmatory Data Analysis), on the other hand, is used when the analyst wants to confirm or refute certain assumptions or hypotheses.
Imagine a dataset with a negative skewness and a low kurtosis. How would this influence your data interpretation and statistical tests?
- It would not impact the interpretation or statistical tests.
- The data would be less likely to have outliers and the distribution would be wider.
- The data would be more likely to have outliers and the distribution would be narrow.
- The mean of the dataset would be greater than the median.
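Negative skewness means that the tail of the distribution extends towards lower values while most observations cluster on the right, at the higher end of the range, so the mean is typically below the median. Low kurtosis (a platykurtic distribution) means lighter tails than a normal distribution, so extreme outliers are less likely and the distribution tends to look flatter and broader. Tests that assume normality may therefore need a transformation or a non-parametric alternative.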
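A quick numerical check of this combination is sketched below. It assumes NumPy and uses a Beta(2, 1) sample, chosen only because it happens to be both left-skewed and platykurtic; the helper `standardized_moment` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(2)
# Beta(2, 1) is an illustrative distribution with negative skewness
# and negative excess kurtosis.
x = rng.beta(2.0, 1.0, size=100_000)


def standardized_moment(values, order):
    """Mean of the standardized values raised to the given power."""
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** order))


skew = standardized_moment(x, 3)
excess_kurt = standardized_moment(x, 4) - 3.0  # 0 for a normal distribution
mean, median = float(x.mean()), float(np.median(x))

print(f"skewness:        {skew:.2f}")
print(f"excess kurtosis: {excess_kurt:.2f}")
print(f"mean {mean:.3f} < median {median:.3f}")
```

Both statistics come out negative and the mean falls below the median, consistent with the explanation above.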
How does the Z-score method perform when the data is not normally distributed?
- It performs better
- It performs the same
- It performs worse
- Its performance is independent of the data distribution
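The Z-score method is calibrated to a Gaussian distribution: thresholds such as |z| > 3 assume that the mean and standard deviation summarize the data well. When the data is not normally distributed, the method can perform poorly, leading to over- or under-identification of outliers.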
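The sketch below compares the |z| > 3 rule on normal and skewed data. It assumes NumPy and synthetic samples; the threshold of 3 and the log-normal example are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000


def z_scores(x):
    """Standardize with the sample mean and standard deviation."""
    return (x - x.mean()) / x.std()


normal = rng.standard_normal(n)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=n)

# Share of points flagged as outliers by the |z| > 3 rule.
frac_normal = float(np.mean(np.abs(z_scores(normal)) > 3))

z_skewed = z_scores(skewed)
frac_skewed = float(np.mean(np.abs(z_skewed) > 3))
flagged_low = int(np.sum(z_skewed < -3))  # flags on the low side

print(f"flagged (normal): {frac_normal:.4f}")
print(f"flagged (skewed): {frac_skewed:.4f}")
print(f"low-side flags on skewed data: {flagged_low}")
```

On the skewed sample the rule flags several times more points than on the normal sample, all of them in the upper tail, so unusually small values can never be flagged.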
Define kurtosis in statistical data analysis.
- It's the measure of how outliers are present in the data.
- It's the measure of how the data is centered around the mean.
- It's the measure of the "tailedness" of the distribution.
- It's the measure of the spread of data.
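Kurtosis in statistical data analysis is the measure of the "tailedness" of the distribution: how much of its probability mass lies in the tails, and therefore how prone it is to extreme values, relative to a normal distribution. It reflects both tails together; asymmetry between the two tails is measured by skewness. Although kurtosis is often described as peakedness, it is driven mainly by the tails rather than by the shape of the peak.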
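A minimal sketch of the computation, assuming NumPy and synthetic samples: excess kurtosis is taken here as the fourth standardized moment minus 3, so a normal distribution scores about 0. The helper name `excess_kurtosis` and the three example distributions are illustrative.

```python
import numpy as np


def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)


rng = np.random.default_rng(4)
n = 200_000

k_uniform = excess_kurtosis(rng.uniform(-1.0, 1.0, n))  # light tails
k_normal = excess_kurtosis(rng.standard_normal(n))      # reference case
k_laplace = excess_kurtosis(rng.laplace(0.0, 1.0, n))   # heavy tails

print(f"uniform: {k_uniform:+.2f}")
print(f"normal:  {k_normal:+.2f}")
print(f"laplace: {k_laplace:+.2f}")
```

The ordering (uniform below normal below Laplace) follows the weight of the tails.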
When outliers are present in the dataset, we prefer to use _____ scaling.
- Min-Max
- Robust
- Standard
- Z-score
When outliers are present in the dataset, we prefer to use Robust scaling. Robust scaling centers on the median and divides by the interquartile range, so it is less affected by outliers than methods such as Min-Max and Z-score (standard) scaling, which depend on the minimum and maximum or on the mean and standard deviation.
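The difference is easy to see on a tiny example. This is a minimal sketch assuming NumPy and a made-up seven-value dataset with one outlier; it applies the median/IQR formula by hand rather than calling any particular library's scaler.

```python
import numpy as np

# Made-up data: six ordinary values and one outlier (500).
x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 500.0])

# Min-Max scaling: the outlier defines the range.
min_max = (x - x.min()) / (x.max() - x.min())

# Robust scaling: center on the median, divide by the interquartile range.
q1, med, q3 = np.percentile(x, [25, 50, 75])
robust = (x - med) / (q3 - q1)

# How much of the scaled range the six ordinary values occupy.
inlier_spread_min_max = min_max[:-1].max() - min_max[:-1].min()
inlier_spread_robust = robust[:-1].max() - robust[:-1].min()

print(f"Min-Max spread of inliers: {inlier_spread_min_max:.4f}")
print(f"Robust spread of inliers:  {inlier_spread_robust:.4f}")
```

Under Min-Max scaling the ordinary values are squeezed into less than 1% of the output range, whereas robust scaling keeps them well separated and leaves the outlier as a large value.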
In a scenario where you are dealing with stock return data, the returns are exhibiting high positive kurtosis. What does this imply?
- The stock return data has a high degree of negative skewness.
- The stock return data is less likely to experience extreme events.
- The stock return data is more likely to experience extreme events.
- The stock return data is normally distributed.
High positive kurtosis in stock return data, known as leptokurtosis, means that the returns are prone to extreme jumps, i.e., the distribution has fatter tails than a normal distribution. Therefore, the stock is more likely to experience extreme events than it would be if its returns were normally distributed.
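To make "more likely to experience extreme events" concrete, the sketch below counts moves beyond four standard deviations. It assumes NumPy and uses synthetic data only: a Student's t distribution with 5 degrees of freedom stands in for fat-tailed returns and is not fitted to any real stock.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500_000


def extreme_share(x, k=4.0):
    """Share of observations more than k standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(np.abs(z) > k))


normal_returns = rng.standard_normal(n)             # thin-tailed reference
fat_tailed_returns = rng.standard_t(df=5, size=n)   # leptokurtic stand-in

share_normal = extreme_share(normal_returns)
share_fat = extreme_share(fat_tailed_returns)

print(f"beyond 4 sd (normal):     {share_normal:.5f}")
print(f"beyond 4 sd (fat-tailed): {share_fat:.5f}")
```

After standardizing both series, the fat-tailed sample shows four-sigma moves far more often than the normal sample, which is why normality-based risk estimates can understate extreme outcomes.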
What is the main goal of data visualization?
- To display all data in a single graph
- To make data look colorful and appealing
- To transform data into a graphical format
- To understand complex data through graphical representation
The main goal of data visualization is to help understand complex data sets by transforming them into a graphical representation. Good visualizations simplify complex data and make it understandable and interpretable, enabling more informed decision-making.
Suppose the Variance Inflation Factor (VIF) of a variable in your model is 10. What does this imply and what actions would you take?
- The variable is causing overfitting.
- The variable is highly correlated with other predictors.
- The variable is not correlated with other predictors.
- The variable is not important in predicting the output.
A high VIF value (generally greater than 5 or 10) indicates that a predictor is highly correlated with other predictors in the model. Actions to rectify this might include removing the variable from the model, combining it with other variables, or using techniques like PCA.
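For reference, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The following is a minimal sketch of that definition, assuming NumPy and synthetic predictors; the function name `vif` and the variables `x1`, `x2`, `x3` are hypothetical, and a library routine would normally be used in practice.

```python
import numpy as np


def vif(X):
    """VIF of each column of X (n_samples x n_features), from its definition."""
    X = np.asarray(X, dtype=float)
    result = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()  # R^2 of predictor j on the others
        result.append(1.0 / (1.0 - r2))
    return np.array(result)


rng = np.random.default_rng(6)
n = 2_000
x1 = rng.standard_normal(n)
x2 = x1 + 0.1 * rng.standard_normal(n)  # nearly a copy of x1
x3 = rng.standard_normal(n)             # unrelated predictor

vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

The two nearly collinear predictors get VIFs far above 10, while the unrelated one stays close to 1. Dropping either `x1` or `x2` and recomputing should bring the remaining values back near 1.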