How many variables can a heatmap typically visualize at once?

  • Any number
  • Four
  • Three
  • Two
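A heatmap laid out as a matrix can summarize any number of variables at once. In a correlation heatmap, for example, each row and each column corresponds to one variable, and the color of each cell encodes the value for that pair. Any single cell still encodes only three things: its row, its column, and its color value.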

In a scenario where you are dealing with stock return data, the returns are exhibiting high positive kurtosis. What does this imply?

  • The stock return data has a high degree of negative skewness.
  • The stock return data is less likely to experience extreme events.
  • The stock return data is more likely to experience extreme events.
  • The stock return data is normally distributed.
High positive kurtosis in stock return data, known as leptokurtosis, means the distribution has fatter tails than a normal distribution, so the returns are prone to extreme jumps. The stock is therefore more likely to experience extreme events than it would be if its returns were normally distributed.
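
The fat-tail effect described above can be checked numerically. The sketch below is illustrative only: both return series are simulated (they are assumptions, not real market data), and `excess_kurtosis` is a hypothetical helper that computes the fourth standardized moment minus 3, so a normal distribution scores close to zero.

```python
import random
import statistics


def excess_kurtosis(xs):
    """Population excess kurtosis: fourth standardized moment minus 3."""
    n = len(xs)
    mean = statistics.fmean(xs)
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0


rng = random.Random(0)

# Simulated "calm" returns: plain Gaussian noise (assumed 1% volatility).
normal_returns = [rng.gauss(0.0, 0.01) for _ in range(20_000)]

# Simulated "jumpy" returns: usually calm, but about 5% of days are
# drawn with five times the volatility, which fattens the tails.
fat_tailed_returns = [
    rng.gauss(0.0, 0.05 if rng.random() < 0.05 else 0.01)
    for _ in range(20_000)
]

print("normal     :", round(excess_kurtosis(normal_returns), 2))
print("fat-tailed :", round(excess_kurtosis(fat_tailed_returns), 2))
```

The Gaussian series should land near zero, while the mixed-volatility series should show a clearly positive value.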

What is the main goal of data visualization?

  • To display all data in a single graph
  • To make data look colorful and appealing
  • To transform data into a graphical format
  • To understand complex data through graphical representation
The main goal of data visualization is to help understand complex data sets by transforming them into a graphical representation. Good visualizations simplify complex data and make it understandable and interpretable, enabling more informed decision-making.

Suppose the Variance Inflation Factor (VIF) of a variable in your model is 10. What does this imply and what actions would you take?

  • The variable is causing overfitting.
  • The variable is highly correlated with other predictors.
  • The variable is not correlated with other predictors.
  • The variable is not important in predicting the output.
A high VIF value (thresholds of 5 or 10 are commonly used) indicates that a predictor is highly correlated with other predictors in the model. A VIF of 10 corresponds to an R² of 0.9 when that predictor is regressed on the others. Actions to rectify this might include removing the variable from the model, combining it with other variables, or using dimensionality-reduction techniques like PCA.
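
As a minimal sketch of where the number comes from: VIF is defined as 1 / (1 - R²), where R² comes from regressing one predictor on the others. In the special case of exactly two predictors, R² is just their squared Pearson correlation, which is what the code below assumes. The function names and the sample data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)


# Made-up predictors: x2 is roughly 2 * x1, so they are nearly collinear.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

print("VIF:", round(vif_two_predictors(x1, x2), 1))  # far above 10
```

With more than two predictors, R² has to come from a full regression of each predictor on all the others, which this sketch does not attempt.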

When data is normally distributed, approximately 95% of the data falls within ________ standard deviations of the mean.

  • Four
  • One
  • Three
  • Two
When data is normally distributed, approximately 95% of the data falls within two standard deviations of the mean. This is known as the empirical rule, or the 68-95-99.7 rule, a shorthand used to remember the percentage of values that lie within one, two, and three standard deviations of the mean in a normal distribution.
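
The three percentages can be reproduced from the normal cumulative distribution function. This is a small sketch using Python's standard-library `statistics.NormalDist`; `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```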

How do filter, wrapper, and embedded methods for feature selection differ from each other?

  • By the bias-variance tradeoff
  • By the computational complexity
  • By the problem-solving approach
  • By their use of machine learning models
Filter methods for feature selection evaluate the relevance of the input features using statistical measures, such as their correlation with the target variable, and do not involve any specific machine learning algorithm. Wrapper methods use a specific machine learning algorithm: they train the model on candidate feature subsets and keep the features that improve its performance. Embedded methods integrate feature selection into the model training process itself.
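
To make the filter idea concrete, here is a minimal sketch that ranks features by the absolute value of their Pearson correlation with the target and keeps the top k, with no model involved. It illustrates only the filter approach; wrapper and embedded methods need a learning algorithm and are not shown. The feature names, the data, and both function names are hypothetical.

```python
import math


def abs_correlation(xs, ys):
    """Absolute Pearson correlation between a feature column and the target."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return abs(cov) / math.sqrt(var_x * var_y)


def filter_select(features, target, k):
    """Keep the k features with the highest |correlation| with the target."""
    scores = {name: abs_correlation(col, target) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up data: "size" and "rooms" track the target, "noise" does not.
features = {
    "size": [1, 2, 3, 4, 5, 6],
    "rooms": [1, 1, 2, 2, 3, 3],
    "noise": [1, -1, 1, -1, 1, -1],
}
target = [10, 20, 30, 40, 50, 60]

print(filter_select(features, target, k=2))
```

Because the scores are computed once and independently of any model, this approach is cheap, but it cannot see interactions between features.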

The process of presenting data in a graphical format to help people understand the significance of the data is called ____________.

  • Data manipulation
  • Data transformation
  • Data validation
  • Data visualization
Data visualization is the process of representing raw data in a graphical format that reveals the inherent patterns, correlations, trends, outliers, and significant features of the data, making the data easier to comprehend and interpret.

_______ is typically used when the data analyst has no specific expectations from the data, whereas _______ is used when the analyst wants to confirm certain assumptions.

  • CDA, EDA
  • EDA, CDA
  • EDA, Predictive Modeling
  • Predictive Modeling, EDA
EDA (Exploratory Data Analysis) is typically used when the data analyst does not have specific expectations or hypotheses about the data. It is an open-ended process where we aim to discover patterns and anomalies in the data. CDA (Confirmatory Data Analysis), on the other hand, is used when the analyst wants to confirm or refute certain assumptions or hypotheses.

Imagine a dataset with a negative skewness and a low kurtosis. How would this influence your data interpretation and statistical tests?

  • It would not impact the interpretation or statistical tests.
  • The data would be less likely to have outliers and the distribution would be wider.
  • The data would be more likely to have outliers and the distribution would be narrow.
  • The mean of the dataset would be greater than the median.
Negative skewness means that the tail of the distribution extends towards more negative values while most values are clustered on the right side of the distribution, so the mean is typically less than the median. Low kurtosis (a platykurtic distribution) suggests lighter tails than a normal distribution: the data tends to look flatter and more spread out, with less likelihood of extreme outliers.
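
The link between negative skewness and the mean falling below the median can be seen on a small sample. The data below is made up for illustration, and `skewness` is a hypothetical helper computing the third standardized moment.

```python
import statistics


def skewness(xs):
    """Population skewness: third central moment divided by sd cubed."""
    n = len(xs)
    mean = statistics.fmean(xs)
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


# Most values sit at the high end; one low value forms the left tail.
left_skewed = [1, 6, 7, 8, 8, 9, 9, 9, 10, 10]

print("skewness:", round(skewness(left_skewed), 2))  # negative
print("mean    :", statistics.fmean(left_skewed))    # pulled down by the tail
print("median  :", statistics.median(left_skewed))   # stays with the bulk
```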

How does the Z-score method perform when the data is not normally distributed?

  • It performs better
  • It performs the same
  • It performs worse
  • Its performance is independent of the data distribution
The Z-score method assumes a Gaussian distribution and can perform poorly when the data is not normally distributed, possibly leading to over- or under-identification of outliers.
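
A rough way to see this is to apply the same Z-score rule to a Gaussian sample and to a right-skewed sample. The sketch below assumes the common convention of flagging points with |z| > 3. Both samples are simulated, and `zscore_outliers` is a hypothetical helper.

```python
import random
import statistics


def zscore_outliers(xs, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs, mu=mean)
    return [x for x in xs if abs(x - mean) / sd > threshold]


rng = random.Random(0)
n = 20_000

gaussian = [rng.gauss(0.0, 1.0) for _ in range(n)]
skewed = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]  # right-skewed

flag_gauss = zscore_outliers(gaussian)
flag_skewed = zscore_outliers(skewed)

print("gaussian flagged:", len(flag_gauss) / n)
print("skewed flagged  :", len(flag_skewed) / n)
print("skewed flags all on the high side:",
      all(x > statistics.fmean(skewed) for x in flag_skewed))
```

On the Gaussian sample the flagged share should stay near the roughly 0.3% implied by the empirical rule. On the skewed sample the rule flags a noticeably larger share, all of it in the long right tail, and can never flag anything on the low side.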