You are conducting a study on the effectiveness of a new drug. Patients rate their pain levels before and after the treatment on a scale of 1-10. What type of data are these ratings?

  • Continuous data
  • Nominal data
  • Ordinal data
  • Ratio data
Patients' pain levels are ordinal data as they're categorized into an order (1-10) but the intervals between the levels might not be equivalent.

How does the Z-score method perform when the data is not normally distributed?

  • It performs better
  • It performs the same
  • It performs worse
  • Its performance is independent of the data distribution
Z-score method assumes a Gaussian distribution and can perform poorly when data is not normally distributed, possibly leading to an over or under identification of outliers.

Define kurtosis in statistical data analysis.

  • It's the measure of how outliers are present in the data.
  • It's the measure of how the data is centered around the mean.
  • It's the measure of the "tailedness" of the distribution.
  • It's the measure of the spread of data.
Kurtosis in statistical data analysis is the measure of the "tailedness" of the distribution. It describes the extreme values in one versus the other tail. It is used to describe the peak of a distribution.

When outliers are present in the dataset, we prefer to use _____ scaling.

  • Min-Max
  • Robust
  • Standard
  • Z-score
When outliers are present in the dataset, we prefer to use Robust scaling. Robust scaling uses the median and interquartile range for scaling, thus it is less affected by outliers than other methods such as Min-Max and Z-score.

In a scenario where you are dealing with stock return data, the returns are exhibiting high positive kurtosis. What does this imply?

  • The stock return data has a high degree of negative skewness.
  • The stock return data is less likely to experience extreme events.
  • The stock return data is more likely to experience extreme events.
  • The stock return data is normally distributed.
High positive kurtosis in stock return data, known as leptokurtosis, means that the returns are prone to extreme jumps, i.e., the distribution has fatter tails. Therefore, the stock is more likely to experience extreme events than a normally distributed return.

What is the main goal of data visualization?

  • To display all data in a single graph
  • To make data look colorful and appealing
  • To transform data into a graphical format
  • To understand complex data through graphical representation
The main goal of data visualization is to help understand complex data sets by transforming them into a graphical representation. Good visualizations simplify complex data and make it understandable and interpretable, enabling more informed decision-making.

Suppose the Variance Inflation Factor (VIF) of a variable in your model is 10. What does this imply and what actions would you take?

  • The variable is causing overfitting.
  • The variable is highly correlated with other predictors.
  • The variable is not correlated with other predictors.
  • The variable is not important in predicting the output.
A high VIF value (generally greater than 5 or 10) indicates that a predictor is highly correlated with other predictors in the model. Actions to rectify this might include removing the variable from the model, combining it with other variables, or using techniques like PCA.

When data is normally distributed, approximately 95% of the data falls within ________ standard deviations of the mean.

  • Four
  • One
  • Three
  • Two
When data is normally distributed, approximately "95%" of the data falls within "Two" standard deviations of the mean. This is known as the empirical rule, or the 68-95-99.7 rule, a shorthand used to remember the percentage of values that lie within a band around the mean in a normal distribution.

How do filter, wrapper, and embedded methods for feature selection differ from each other?

  • By the bias-variance tradeoff
  • By the computational complexity
  • By the problem-solving approach
  • By their use of machine learning models
Filter methods for feature selection evaluate the relevance of the input features based on their correlation with the target variable, and do not involve the use of any specific machine learning algorithm. Wrapper methods involve the use of a specific machine learning algorithm and select features that contribute to the performance of the model. Embedded methods integrate feature selection as part of the model training process.

The process of presenting data in a graphical format to help people understand the significance of the data is called ____________.

  • Data manipulation
  • Data transformation
  • Data validation
  • Data visualization
Data Visualization is the process of representing raw data in a graphical format that reveals the inherent patterns, correlations, trends, outliers, and significant features of the data, making it easy to comprehend and interpret.