In what scenario would you choose standardization over Min-Max scaling?
- When the algorithm requires features to be on the same scale and the data is normally distributed
- When the maximum and minimum values are unknown
- When there are no outliers in the data
- When you need to normalize the distribution
You would choose standardization over Min-Max scaling when the algorithm requires features to be on the same scale and the data is approximately normally distributed. Unlike Min-Max scaling, standardization does not bound values to a fixed range such as [0, 1]; it centers each feature at 0 and rescales it to unit standard deviation, which suits algorithms that do not require inputs to lie within a fixed range.
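The difference between the two scalers can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the sample values are made up for the example and do not come from any particular library.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population std dev."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Min-Max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample with one extreme value.
data = [10.0, 12.0, 11.0, 13.0, 100.0]

z = standardize(data)
mm = min_max_scale(data)

print("standardized:", [round(v, 2) for v in z])
print("min-max:     ", [round(v, 2) for v in mm])
```

The Min-Max output always lies in [0, 1], while the standardized values are centered on 0 and can fall outside that interval.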
What is the effect of standardization (z-score) on the mean and standard deviation of the dataset?
- It changes the mean to 0 and standard deviation to 1
- It changes the mean to 1 and standard deviation to 0
- It changes the mean to the median of the dataset and standard deviation to 1
- It doesn't affect the mean and standard deviation
Standardization shifts the mean of the dataset to 0 and rescales its standard deviation to 1. It does not change the shape of the distribution: the result follows a standard normal distribution only if the original data was normally distributed.
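The effect on the mean and standard deviation can be checked numerically. The sketch below is an illustration on made-up data, not a reference implementation; it also computes the moment-based skewness before and after, since z-scoring is a linear rescaling and leaves that quantity unchanged.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score each value using the population mean and std dev."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu, sigma = mean(values), pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])


# Hypothetical, clearly non-normal sample (one large value on the right).
data = [1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 20.0]
z = standardize(data)

print("mean after:", round(mean(z), 6))
print("std after: ", round(pstdev(z), 6))
print("skewness before/after:", round(skewness(data), 4), round(skewness(z), 4))
```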
Improper handling of missing data can affect the ________ of a model, thereby impacting its ability to generalize on unseen data.
- bias-variance tradeoff
- overfitting
- regularization
- underfitting
Improper handling of missing data can adversely affect the bias-variance tradeoff of a model. This can lead to issues such as overfitting or underfitting, which impact the model's ability to generalize to unseen data.
In which situations would you prefer a heatmap over a scatter plot?
- When dealing with a single variable
- When the data is non-numeric
- When visualizing a time series
- When visualizing the correlation between multiple variables
You would prefer a heatmap over a scatter plot when visualizing the correlation between multiple variables. A scatter plot shows one pair of variables at a time, whereas a correlation heatmap shows every pairwise correlation in a single color-coded grid, which makes it particularly effective when the dataset contains many variables.
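The grid behind a correlation heatmap is simply a matrix of pairwise correlation coefficients. The sketch below builds that matrix with a hand-written Pearson correlation and prints it as text; the column names and values are invented for illustration, and a plotting library would color each cell rather than print it.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical dataset with three numeric variables.
columns = {
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 58, 69, 80, 92],
    "speed": [9, 8, 6, 5, 3],
}

names = list(columns)
# corr[i][j] is the correlation between variable i and variable j.
corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]

for name, row in zip(names, corr):
    print(f"{name:>8}", " ".join(f"{r:+.2f}" for r in row))
```

Each additional variable adds one row and one column to the grid, whereas scatter plots would need a separate panel for every pair.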
Which of the following types of analysis makes the fewest assumptions about data: EDA, CDA, or Predictive Modeling?
- CDA
- EDA
- Predictive Modeling
- They all make the same number of assumptions.
EDA makes the fewest assumptions about data. While CDA and Predictive Modeling typically require some assumptions about the data's distribution or the relationships between variables, EDA is a more open-ended exploration of the data's structure and patterns.
You have a scatter plot with a strong positive correlation, but a few points are far from the correlation line. What might these points represent?
- Correlated data points
- False positives
- Normal data points
- Outliers
In a scatter plot, points that are far away from the correlation line often represent outliers.
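"Far from the correlation line" can be made concrete in several ways. One possible sketch, under stated assumptions: fit a least-squares line and flag points whose residual exceeds two standard deviations of all residuals. The data, the function names, and the threshold of 2 are hypothetical choices for this example, not a standard prescribed by the text.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def flag_far_points(xs, ys, k=2.0):
    """Return (x, y) pairs whose residual is more than k residual std devs (k is an assumption)."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    limit = k * pstdev(residuals)
    return [(x, y) for x, y, r in zip(xs, ys, residuals) if abs(r) > limit]


# Made-up data: roughly y = 2x, with one point pulled far above the trend.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 25.0, 14.1, 16.0, 17.9, 20.2]

flagged = flag_far_points(xs, ys)
print(flagged)
```

A flagged point is only a candidate outlier; whether it is a data error or a genuine extreme value still needs to be checked against the source of the data.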
You create a histogram of a dataset and notice that the frequency count is very high on the far right of the distribution but drops significantly after that. What can be inferred from this?
- Data has a negative skewness
- Data has a positive skewness
- Data is evenly distributed
- Data is normally distributed
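If the frequency count in a histogram peaks at the far right and the remaining values trail off toward the left, the data has a negative skewness (it is left-skewed): the long tail points toward the lower values. Positive skewness is the mirror case, with the peak on the left and the tail stretching to the right.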
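The sign of the skewness can be checked numerically. The sketch below computes the moment-based skewness for two made-up samples: one with its peak at low values and a long tail to the right, and its mirror image with the peak at high values and a long tail to the left.

```python
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu, sigma = mean(values), pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])


# Peak at low values, long tail toward high values.
right_tailed = [1, 1, 1, 2, 2, 3, 4, 6, 9]
# Mirror image: peak at high values, long tail toward low values.
left_tailed = [10 - v for v in right_tailed]

print("right-tailed skewness:", round(skewness(right_tailed), 3))  # positive
print("left-tailed skewness: ", round(skewness(left_tailed), 3))   # negative
```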
In a box plot, outliers are typically represented as ______.
- boxes
- dots
- lines
- whiskers
In a box plot, outliers are typically represented as dots or points that fall outside the whiskers of the box.
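Which points end up drawn as dots depends on the rule used to place the whiskers. A common convention, assumed here because the text does not name one, is Tukey's rule: anything more than 1.5 times the interquartile range beyond the first or third quartile counts as an outlier. The sketch uses `statistics.quantiles` from the standard library; the sample values are invented.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is the assumed convention)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample with one value far above the rest.
data = [12, 13, 14, 14, 15, 15, 16, 17, 18, 45]

print(iqr_outliers(data))
```

Quartile definitions differ slightly between libraries, so the exact cut-offs can vary from one plotting tool to another.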
In what way does improper handling of missing data affect the generalization capability of a model?
- Depends on the amount of missing data.
- Hampers generalization.
- Improves generalization.
- No effect on generalization.
Improper handling of missing data can lead to the model learning incorrect or misleading patterns from the data. This can hamper the model's ability to generalize well to unseen data.
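One concrete example of a misleading pattern, offered as an illustration rather than a general rule: filling every gap with the column mean keeps the mean unchanged but shrinks the variance, so the data looks less spread out than it really is. The values below are made up.

```python
from statistics import mean, pvariance

# Hypothetical column where None marks a missing value.
observed = [2.0, 4.0, None, 8.0, None, 10.0]

present = [v for v in observed if v is not None]
fill = mean(present)

# Naive mean imputation: every missing entry gets the same value.
imputed = [fill if v is None else v for v in observed]

print("variance of observed values:", pvariance(present))
print("variance after imputation:  ", round(pvariance(imputed), 3))
```

A model trained on the imputed column sees an artificially narrow spread, which is one way its behavior on unseen data can be affected.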
What is the key visual feature of a scatter plot that may indicate the presence of outliers?
- Color coding
- Legends
- Points far away from the general grouping
- Trend line
Points that are far away from the general grouping in a scatter plot may indicate the presence of outliers.