The Central Limit Theorem states that the sum of a large number of independent and identically distributed variables will approximately follow a _____ Distribution, regardless of the shape of the original distribution.

  • Binomial
  • Normal
  • Poisson
  • Uniform
The Central Limit Theorem states that the sum of a large number of independent and identically distributed variables with finite variance will approximately follow a Normal Distribution, regardless of the shape of the original distribution.
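The statement can be checked with a small simulation. The sketch below is only an illustration: the choice of an Exponential(1) source distribution, the sample sizes, and the helper names `standardized_sums` and `share_within` are all assumptions. It sums skewed draws, standardizes the sums, and compares the share falling within one and two standard deviations against the normal values.

```python
import random

def standardized_sums(n_terms, n_trials, seed=0):
    """Return standardized sums of n_terms Exponential(1) draws."""
    rng = random.Random(seed)
    # Exponential(1) has mean 1 and variance 1, so a sum of n_terms draws
    # has mean n_terms and standard deviation sqrt(n_terms).
    mean, sd = n_terms, n_terms ** 0.5
    return [
        (sum(rng.expovariate(1.0) for _ in range(n_terms)) - mean) / sd
        for _ in range(n_trials)
    ]

def share_within(values, k):
    """Fraction of values whose absolute value is below k."""
    return sum(abs(v) < k for v in values) / len(values)

sums = standardized_sums(n_terms=30, n_trials=10_000)
within_one = share_within(sums, 1.0)
within_two = share_within(sums, 2.0)
print(f"within 1 sd: {within_one:.3f} (normal: 0.683)")
print(f"within 2 sd: {within_two:.3f} (normal: 0.954)")
```

Even though a single exponential draw is strongly right-skewed, the standardized sums should land close to the normal shares.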

When data is normally distributed, approximately 95% of the data falls within ________ standard deviations of the mean.

  • Four
  • One
  • Three
  • Two
When data is normally distributed, approximately 95% of the data falls within two standard deviations of the mean. This is known as the empirical rule, or the 68-95-99.7 rule: about 68%, 95%, and 99.7% of values lie within one, two, and three standard deviations of the mean, respectively.
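The percentages in the rule follow directly from the normal cumulative distribution function. A minimal sketch using Python's standard-library `statistics.NormalDist` (the helper name `coverage` is an assumption for illustration):

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(k):
    """Probability that a normal value lies within k standard deviations of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {coverage(k):.4f}")

# "Two" is a rounded figure: the multiplier that gives exactly 95% is slightly smaller.
exact_95 = std_normal.inv_cdf(0.975)
print(f"multiplier for exactly 95%: {exact_95:.3f}")
```

This prints coverages of about 0.6827, 0.9545, and 0.9973, and a multiplier of about 1.960, which is why "two standard deviations" is described as approximate.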

How do filter, wrapper, and embedded methods for feature selection differ from each other?

  • By the bias-variance tradeoff
  • By the computational complexity
  • By the problem-solving approach
  • By their use of machine learning models
The three families differ in their use of machine learning models. Filter methods score each feature with a statistical measure of its relationship to the target variable (for example, correlation) and do not involve any machine learning algorithm. Wrapper methods train a specific machine learning algorithm on candidate feature subsets and keep the features that contribute most to model performance. Embedded methods perform feature selection as part of the model training process itself.
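One way to see the difference is to run one representative of each family on the same data. The sketch below assumes scikit-learn is installed and uses a synthetic dataset. The specific choices (ANOVA F-test for the filter, recursive feature elimination for the wrapper, random-forest importances for the embedded method) are illustrative picks, not the only options.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): 10 features, 3 of them informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Filter: score each feature with a univariate statistic; no model is trained.
filter_mask = SelectKBest(score_func=f_classif, k=3).fit(X, y).get_support()

# Wrapper: repeatedly fit a model and drop the weakest feature until 3 remain.
wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
wrapper_mask = wrapper.support_

# Embedded: importances are a by-product of training the model itself;
# SelectFromModel keeps features whose importance is at least the mean importance.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
embedded_mask = SelectFromModel(forest).fit(X, y).get_support()

for name, mask in [("filter", filter_mask), ("wrapper", wrapper_mask), ("embedded", embedded_mask)]:
    print(f"{name:9s} keeps feature indices {[i for i, keep in enumerate(mask) if keep]}")
```

The filter step never fits a model, the wrapper fits one many times, and the embedded step fits one once and reads the selection off the trained model.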

The process of presenting data in a graphical format to help people understand the significance of the data is called ____________.

  • Data manipulation
  • Data transformation
  • Data validation
  • Data visualization
Data Visualization is the process of representing raw data in a graphical format that reveals the inherent patterns, correlations, trends, outliers, and significant features of the data, making it easy to comprehend and interpret.

_______ is typically used when the data analyst has no specific expectations from the data, whereas _______ is used when the analyst wants to confirm certain assumptions.

  • CDA, EDA
  • EDA, CDA
  • EDA, Predictive Modeling
  • Predictive Modeling, EDA
EDA (Exploratory Data Analysis) is typically used when the data analyst does not have specific expectations or hypotheses about the data. It is an open-ended process where we aim to discover patterns and anomalies in the data. CDA (Confirmatory Data Analysis), on the other hand, is used when the analyst wants to confirm or refute certain assumptions or hypotheses.

Imagine a dataset with a negative skewness and a low kurtosis. How would this influence your data interpretation and statistical tests?

  • It would not impact the interpretation or statistical tests.
  • The data would be less likely to have outliers and the distribution would be wider.
  • The data would be more likely to have outliers and the distribution would be narrow.
  • The mean of the dataset would be greater than the median.
Negative skewness means that the tail of the distribution extends towards lower values while most values are clustered on the right side, so the mean is typically less than the median. Low kurtosis (a platykurtic distribution) means the tails are lighter than those of a normal distribution, so extreme outliers are less likely; such distributions often look flatter and wider.
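A quick numerical illustration, assuming a Beta(4, 2) sample as a stand-in for such a dataset (this distribution has negative skewness and negative excess kurtosis). The helper `standardized_moment` is a hypothetical name and uses the simple population formulas without bias correction.

```python
import random
import statistics

def standardized_moment(values, order):
    """Average of ((x - mean) / sd) ** order over the sample."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** order for v in values) / len(values)

rng = random.Random(0)
# Hypothetical data: Beta(4, 2) piles up near the upper end with a tail to the left.
data = [rng.betavariate(4, 2) for _ in range(20_000)]

skewness = standardized_moment(data, 3)
excess_kurtosis = standardized_moment(data, 4) - 3.0  # 0 for a normal distribution
mean = statistics.fmean(data)
median = statistics.median(data)

print(f"skewness        = {skewness:.2f}")
print(f"excess kurtosis = {excess_kurtosis:.2f}")
print(f"mean = {mean:.3f}, median = {median:.3f}")
```

Both moments should come out negative, and the mean should sit below the median, which is the opposite of the last answer option.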

How does the Z-score method perform when the data is not normally distributed?

  • It performs better
  • It performs the same
  • It performs worse
  • Its performance is independent of the data distribution
The Z-score method implicitly assumes a roughly Gaussian distribution, so it can perform worse when the data is not normally distributed, possibly leading to over- or under-identification of outliers.
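The over-identification case can be sketched with a simulation. The example below assumes the common cutoff of |z| > 3 and compares a normal sample with a right-skewed exponential sample that contains no injected outliers. The function name `zscore_flags` and the sample sizes are illustrative.

```python
import random
import statistics

def zscore_flags(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [abs((v - mean) / sd) > threshold for v in values]

rng = random.Random(0)
n = 50_000
normal_data = [rng.gauss(0.0, 1.0) for _ in range(n)]
skewed_data = [rng.expovariate(1.0) for _ in range(n)]  # right-skewed, uncontaminated

normal_rate = sum(zscore_flags(normal_data)) / n
skewed_rate = sum(zscore_flags(skewed_data)) / n
print(f"flagged in normal data: {normal_rate:.4f}")
print(f"flagged in skewed data: {skewed_rate:.4f}")
```

For normal data the cutoff flags roughly 0.3% of points, while the skewed sample has a noticeably higher flag rate even though every point is a legitimate draw from the same distribution.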

Define kurtosis in statistical data analysis.

  • It's the measure of how outliers are present in the data.
  • It's the measure of how the data is centered around the mean.
  • It's the measure of the "tailedness" of the distribution.
  • It's the measure of the spread of data.
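Kurtosis in statistical data analysis is the measure of the "tailedness" of the distribution. It describes how much of the distribution's weight lies in the extreme values of both tails combined, relative to a normal distribution; asymmetry between the two tails is measured by skewness instead. Although kurtosis is often described as measuring the peak of a distribution, it is driven mainly by the tails.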
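As a rough illustration, the sketch below computes excess kurtosis (the fourth standardized moment minus 3) for three simulated samples with light, normal, and heavy tails. The function name `excess_kurtosis` and the choice of distributions are assumptions made for this example.

```python
import random
import statistics

def excess_kurtosis(values):
    """Fourth standardized moment minus 3, so a normal distribution scores about 0."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 4 for v in values) / len(values) - 3.0

rng = random.Random(0)
n = 50_000
samples = {
    "uniform (light tails)": [rng.uniform(-1.0, 1.0) for _ in range(n)],
    "normal (reference)": [rng.gauss(0.0, 1.0) for _ in range(n)],
    # The difference of two exponential draws follows a Laplace distribution.
    "laplace (heavy tails)": [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(n)],
}

kurt = {name: excess_kurtosis(values) for name, values in samples.items()}
for name, value in kurt.items():
    print(f"{name:22s} excess kurtosis = {value:+.2f}")
```

The theoretical values are -1.2, 0, and 3 respectively, so the sample estimates should rise as the tails get heavier.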

When outliers are present in the dataset, we prefer to use _____ scaling.

  • Min-Max
  • Robust
  • Standard
  • Z-score
When outliers are present in the dataset, we prefer to use Robust scaling. Robust scaling centers on the median and divides by the interquartile range, so it is less affected by outliers than methods such as Min-Max and Z-score (Standard) scaling, which rely on the minimum, maximum, mean, and standard deviation.
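The effect is easy to see on a tiny made-up dataset with one extreme value. The sketch below is a standard-library approximation of the two scalers, not a library implementation. The function names are hypothetical, and the quartiles use the default convention of `statistics.quantiles`, which may differ slightly from other tools.

```python
import statistics

def min_max_scale(values):
    """Map the minimum to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    return [(v - median) / (q3 - q1) for v in values]

# Made-up data: seven ordinary values and one outlier (500).
data = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 500.0]

min_max = min_max_scale(data)
robust = robust_scale(data)
print("min-max:", [round(v, 3) for v in min_max])
print("robust: ", [round(v, 3) for v in robust])
```

With Min-Max scaling the outlier defines the range, so the seven ordinary values are squeezed into roughly the bottom 1% of the interval. With robust scaling they keep a usable spread and the outlier simply maps to a large value.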

In a scenario where you are dealing with stock return data, the returns are exhibiting high positive kurtosis. What does this imply?

  • The stock return data has a high degree of negative skewness.
  • The stock return data is less likely to experience extreme events.
  • The stock return data is more likely to experience extreme events.
  • The stock return data is normally distributed.
High positive kurtosis in stock return data, known as leptokurtosis, means that the returns are prone to extreme jumps, i.e., the distribution has fatter tails than a normal distribution. Therefore, the stock is more likely to experience extreme events than it would be if its returns were normally distributed.
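To make "more extreme events" concrete, the sketch below counts how often simulated daily returns fall more than four standard deviations from their mean. All numbers are hypothetical. The fat-tailed series is a simple mixture of calm and turbulent days, chosen only because it is leptokurtic, and `extreme_share` is an illustrative helper.

```python
import random
import statistics

def extreme_share(returns, k=4.0):
    """Share of observations more than k sample standard deviations from the mean."""
    mean = statistics.fmean(returns)
    sd = statistics.pstdev(returns)
    return sum(abs(r - mean) > k * sd for r in returns) / len(returns)

rng = random.Random(0)
n = 100_000

# Hypothetical normally distributed daily returns with a 1% standard deviation.
normal_returns = [rng.gauss(0.0, 0.01) for _ in range(n)]

# Hypothetical fat-tailed returns: 90% calm days (sd 1%), 10% turbulent days (sd 3%).
fat_tailed_returns = [
    rng.gauss(0.0, 0.03 if rng.random() < 0.10 else 0.01) for _ in range(n)
]

normal_share = extreme_share(normal_returns)
fat_share = extreme_share(fat_tailed_returns)
print(f"normal returns beyond 4 sd:     {normal_share:.5f}")
print(f"fat-tailed returns beyond 4 sd: {fat_share:.5f}")
```

Because each series is measured against its own standard deviation, the gap between the two shares reflects tail weight rather than overall volatility.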