Which of the following types of analysis makes the fewest assumptions about the data: EDA, CDA, or Predictive Modeling?

  • CDA
  • EDA
  • Predictive Modeling
  • They all make the same number of assumptions.
EDA makes the fewest assumptions about the data. While CDA and Predictive Modeling typically require some assumptions about the data's distribution or the relationships between variables, EDA is a more open-ended exploration of the data's structure and patterns.

In which situations would you prefer a heatmap over a scatter plot?

  • When dealing with a single variable
  • When the data is non-numeric
  • When visualizing a time series
  • When visualizing the correlation between multiple variables
You would prefer a heatmap over a scatter plot when visualizing the correlation between multiple variables. A heatmap of the correlation matrix shows every pairwise correlation in a single grid, which makes it particularly effective when the dataset contains many variables; a scatter plot shows only one pair of variables at a time.
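
As a rough illustration, the grid that a correlation heatmap colors is just the matrix of pairwise Pearson correlations. The sketch below computes that matrix with the standard library only; the dataset and the variable names (`height`, `weight`, `commute`) are made up for the example, and a plotting library would normally be used to draw the result.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical toy dataset with three numeric variables.
data = {
    "height": [150, 160, 170, 180, 190],
    "weight": [52, 58, 69, 77, 88],
    "commute": [30, 12, 45, 20, 25],
}
names = list(data)

# One row and one column per variable: this is what a heatmap would color.
corr = [[pearson(data[a], data[b]) for b in names] for a in names]

for name, row in zip(names, corr):
    print(f"{name:>8}", " ".join(f"{v:+.2f}" for v in row))
```

With three variables the matrix already summarizes three distinct pairs; with twenty variables it would summarize 190, which is where a single heatmap becomes far easier to read than a wall of scatter plots.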

Improper handling of missing data can affect the ________ of a model, thereby impacting its ability to generalize on unseen data.

  • bias-variance tradeoff
  • overfitting
  • regularization
  • underfitting
Improper handling of missing data can adversely affect the bias-variance tradeoff of a model. This can lead to issues such as overfitting or underfitting, which impact the model's ability to generalize to unseen data.

What is the effect of standardization (z-score) on the mean and standard deviation of the dataset?

  • It changes the mean to 0 and standard deviation to 1
  • It changes the mean to 1 and standard deviation to 0
  • It changes the mean to the median of the dataset and standard deviation to 1
  • It doesn't affect the mean and standard deviation
The effect of standardization on a dataset is that it changes the mean to 0 and the standard deviation to 1. Standardization only shifts and rescales the values; it does not change the shape of the distribution, so the result follows a standard normal distribution only if the original data were normally distributed.
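
A minimal sketch of z-score standardization, using only the standard library and a made-up right-skewed sample. It checks that the mean becomes 0 and the (population) standard deviation becomes 1, and also that the skewness is unchanged by the transformation. The helper names are illustrative.

```python
from statistics import mean, pstdev

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def skewness(values):
    """Population skewness: mean of the cubed z-scores."""
    mu, sigma = mean(values), pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])

# Hypothetical right-skewed sample (one large value).
raw = [1, 2, 2, 3, 3, 3, 4, 20]
z = standardize(raw)

print("mean after:", round(mean(z), 6))
print("std after: ", round(pstdev(z), 6))
print("skew before/after:", round(skewness(raw), 3), round(skewness(z), 3))
```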

In what scenario would you choose standardization over Min-Max scaling?

  • When the algorithm requires features to be on the same scale and the data is normally distributed
  • When the maximum and minimum values are unknown
  • When there are no outliers in the data
  • When you need to normalize the distribution
You would choose standardization over Min-Max scaling when the algorithm requires features to be on the same scale and the data is normally distributed. Unlike Min-Max scaling, standardization does not bound values to a fixed range, which suits algorithms that do not require input features to lie within a specific interval.
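
The difference in output range can be seen in a small sketch. The sample below is invented and deliberately contains one large value; `min_max_scale` and `z_score` are illustrative helper names, not a library API.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical feature with one unusually large value.
values = [10, 12, 14, 16, 18, 100]

mm = min_max_scale(values)
zs = z_score(values)

print("min-max:", [round(v, 2) for v in mm])   # always inside [0, 1]
print("z-score:", [round(v, 2) for v in zs])   # no fixed bounds
```

Min-Max output is pinned to [0, 1] by construction, while z-scores can fall anywhere on the real line depending on how far a value sits from the mean.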

Why is standardization (z-score) often used in machine learning algorithms?

  • Because it brings features to a comparable scale and is not bounded to a specific range
  • Because it's easy to compute
  • Because it's not affected by outliers
  • Because it's the only way to handle numerical data
Standardization, also known as Z-score normalization, is a scaling technique that subtracts the mean and divides by the standard deviation. It is often used in machine learning as it can handle features that are measured in different units by bringing them to a comparable scale. It also doesn't bound values to a specific range.

One major advantage of _______ methods over filter methods for feature selection is that they can capture the interaction between input features.

  • Embedded
  • Filter
  • PCA
  • Wrapper
One major advantage of wrapper methods over filter methods for feature selection is that they can capture the interaction between input features. Filter methods typically score each feature independently of the others, whereas wrapper methods train and evaluate a model on candidate subsets of features, so they can detect features that are informative only in combination.
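
The toy sketch below illustrates the point under strong simplifying assumptions: the data are invented so that the target is the XOR of two features, the filter score is the absolute Pearson correlation with the target, and the wrapper "model" is a simple lookup-table classifier scored by training accuracy. A real wrapper would use a proper estimator and cross-validation.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical toy data: y = x0 XOR x1, and x2 is unrelated to y.
X = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1),
     (0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 0)]
y = [a ^ b for a, b, _ in X]

def abs_corr(xs, ys):
    """Filter score: absolute Pearson correlation between a feature and y."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sqrt(sum((a - mx) ** 2 for a in xs))
    sy = sqrt(sum((b - my) ** 2 for b in ys))
    return abs(cov / (sx * sy))

def subset_accuracy(cols):
    """Wrapper score: training accuracy of a majority-vote lookup table."""
    votes = defaultdict(Counter)
    for row, label in zip(X, y):
        votes[tuple(row[c] for c in cols)][label] += 1
    correct = sum(max(counter.values()) for counter in votes.values())
    return correct / len(X)

# Filter: each feature is scored on its own.
filter_scores = [abs_corr([row[c] for row in X], y) for c in range(3)]

# Wrapper: every non-empty subset of features is scored with the model.
wrapper_scores = {
    cols: subset_accuracy(cols)
    for k in (1, 2, 3)
    for cols in combinations(range(3), k)
}
best_subset = max(wrapper_scores, key=wrapper_scores.get)

print("filter scores:", filter_scores)
print("best wrapper subset:", best_subset, wrapper_scores[best_subset])
```

Individually, every feature looks useless to the filter score, yet the subset search finds that `x0` and `x1` together predict the target perfectly. The cost is that a wrapper must fit a model per subset, which grows quickly with the number of features.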

_____ data is a type of qualitative data that can be sorted into non-numerical categories.

  • Nominal
  • Ordinal
  • Qualitative
  • Quantitative
Nominal data is a type of qualitative data that can be sorted into non-numerical categories, with no order or priority.

In which plot can we see the distribution, median, quartiles, and outliers all at once?

  • Bar chart
  • Box plot
  • Pie chart
  • Scatter plot
A box plot, also known as a box-and-whisker plot, displays a summary of a set of data values including the minimum, first quartile (25th percentile), median, third quartile (75th percentile), and maximum. Outliers are often drawn as individual markers beyond the whiskers.
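
The numbers behind a box plot can be sketched as follows. This example assumes the common convention of flagging points more than 1.5 times the interquartile range beyond the quartiles; the data are made up, and quartile values can differ slightly between tools because several interpolation methods exist.

```python
from statistics import quantiles

# Hypothetical sample with one unusually large value.
data = sorted([2, 4, 4, 5, 6, 7, 8, 9, 30])

# Quartiles using the statistics module's default ("exclusive") method.
q1, median, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Assumed convention: fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in data if v < lower_fence or v > upper_fence]
inliers = [v for v in data if lower_fence <= v <= upper_fence]

# Whiskers extend to the most extreme points that are not outliers.
whisker_low, whisker_high = min(inliers), max(inliers)

print("Q1, median, Q3:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```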

The degree of tailedness in a distribution is measured by _________.

  • Kurtosis
  • Skewness
  • Standard Deviation
  • Variance
Kurtosis is a statistical measure of the 'tailedness' of a distribution, that is, how heavy its tails are relative to a normal distribution. It is often described as a measure of peakedness or flatness, but its value is driven mainly by the tails and by outliers.
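
A small sketch of the calculation, using the moment-based definition of excess kurtosis (fourth central moment divided by the squared variance, minus 3 so that a normal distribution scores 0). Both samples are invented for illustration.

```python
from statistics import mean

def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3."""
    mu = mean(values)
    m2 = mean([(v - mu) ** 2 for v in values])  # variance
    m4 = mean([(v - mu) ** 4 for v in values])  # fourth central moment
    return m4 / m2 ** 2 - 3

# Hypothetical samples: evenly spread values vs. a tight cluster with two
# extreme values in the tails.
light_tailed = list(range(1, 11))
heavy_tailed = [5] * 8 + [-20, 30]

print("light-tailed:", round(excess_kurtosis(light_tailed), 3))
print("heavy-tailed:", round(excess_kurtosis(heavy_tailed), 3))
```

The evenly spread sample gives a negative value and the sample with extreme tail values gives a positive one, which matches the reading of kurtosis as a measure of tail weight.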

The diagonals of a pairplot often show the _____ of the individual variables.

  • frequency distribution
  • mean
  • mode
  • standard deviation
In a pairplot, the diagonals often show the frequency distribution of each individual variable, typically as a histogram or density curve. This provides an understanding of the distribution of individual variables in addition to their relationships with other variables.

What is the key visual feature of a scatter plot that may indicate the presence of outliers?

  • Color coding
  • Legends
  • Points far away from the general grouping
  • Trend line
Points that are far away from the general grouping in a scatter plot may indicate the presence of outliers.
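
One simple way to turn "far away from the general grouping" into a check is to measure each point's distance from a robust center. The sketch below is only an illustration: the points are invented, the center is the coordinate-wise median, and the cutoff of 3 times the median distance is an arbitrary assumption rather than a standard rule.

```python
from math import dist
from statistics import median

# Hypothetical scatter-plot points: five follow a trend, one sits far away.
points = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10), (3, 30)]

# Robust center: coordinate-wise median of the points.
center = (median([x for x, _ in points]), median([y for _, y in points]))

# Euclidean distance of every point from that center.
distances = [dist(p, center) for p in points]

# Assumed cutoff: flag points more than 3x the median distance away.
threshold = 3 * median(distances)
flagged = [p for p, d in zip(points, distances) if d > threshold]

print("center:", center)
print("flagged as possible outliers:", flagged)
```

A flagged point is only a candidate; whether it is a genuine outlier or a valid extreme value still needs to be judged in context.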