When a model has very high variance and is too complex, which problem is it likely facing?
- Bias
- Noise
- Overfitting
- Underfitting
When a model has high variance and complexity, it is likely facing overfitting. Overfit models perform well on training data but poorly on new, unseen data, as they've learned to capture noise in the data, not the underlying patterns.
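The train-versus-test gap described above can be sketched with a deliberately over-complex model. This is a hypothetical illustration (the data and polynomial degree are assumptions, not from the question): a degree-9 polynomial fit to 10 noisy points memorizes the training noise, so its training error is far below its error on held-out points.

```python
import numpy as np

# Hypothetical overfitting demo: a high-variance degree-9 polynomial
# fit to only 10 noisy training points. It nearly interpolates the
# training set but generalizes poorly to held-out test points.
rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)          # assumed true signal

x_train = np.linspace(0, 1, 10)
x_test = np.linspace(0.05, 0.95, 10)
y_train = f(x_train) + rng.normal(0, 0.2, x_train.size)
y_test = f(x_test) + rng.normal(0, 0.2, x_test.size)

coeffs = np.polyfit(x_train, y_train, deg=9)  # too complex for 10 points
train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
```

A lower-degree fit (or regularization) would shrink the gap between `train_mse` and `test_mse`.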
A common measure of performance in the multi-armed bandit problem is the cumulative ________ over time.
- Rewards
- Q-values
- States
- Actions
The cumulative rewards over time are a common measure of performance in the multi-armed bandit problem, as you aim to maximize total reward.
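A minimal sketch of tracking cumulative reward, assuming a 3-armed Bernoulli bandit with made-up payout rates and epsilon-greedy action selection (the rates, epsilon, and step count are all illustrative assumptions):

```python
import numpy as np

# Epsilon-greedy on a 3-armed Bernoulli bandit; performance is
# measured as the cumulative reward collected over n_steps pulls.
rng = np.random.default_rng(42)
true_probs = np.array([0.2, 0.5, 0.8])   # hypothetical arm payout rates
n_arms, n_steps, eps = 3, 2000, 0.1

counts = np.zeros(n_arms)
values = np.zeros(n_arms)      # running Q-value estimate per arm
cumulative_reward = 0.0

for t in range(n_steps):
    if rng.random() < eps:
        arm = int(rng.integers(n_arms))      # explore
    else:
        arm = int(np.argmax(values))         # exploit current best
    reward = float(rng.random() < true_probs[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    cumulative_reward += reward
```

Maximizing this running total (equivalently, minimizing regret against always pulling the best arm) is the usual objective.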
How does a high kurtosis value in a data set impact the Z-score method for outlier detection?
- It decreases the number of detected outliers
- It does not impact the detection of outliers
- It improves the accuracy of outlier detection
- It increases the number of detected outliers
A high kurtosis value means that the data has heavy tails or outliers. This can impact the Z-score method by increasing the number of detected outliers, because the Z-score is sensitive to extreme values.
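The effect can be sketched by comparing Z-score flagging on a light-tailed sample against a heavy-tailed one (the Student-t draw is an assumed stand-in for high-kurtosis data):

```python
import numpy as np

# Flag points with |z| > 3 in two samples: a normal sample (low
# excess kurtosis) and a Student-t(df=3) sample (heavy tails, high
# kurtosis). The heavy-tailed sample triggers many more flags.
rng = np.random.default_rng(1)
light = rng.normal(size=10_000)
heavy = rng.standard_t(df=3, size=10_000)

def n_outliers(x, thresh=3.0):
    z = (x - x.mean()) / x.std()
    return int(np.sum(np.abs(z) > thresh))

n_light = n_outliers(light)
n_heavy = n_outliers(heavy)
```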
In what way does improper handling of missing data affect regularization techniques in a machine learning model?
- Depends on the regularization technique used.
- Does not impact regularization.
- Makes regularization less effective.
- Makes regularization more effective.
If missing data are not handled correctly, they can skew the model's learning and distort its effective complexity, making regularization techniques (which aim to control model complexity) less effective.
You are analyzing a dataset with a high degree of negative skewness. How might this affect your choice of machine learning model?
- It might lead to a preference for models that are based on median values.
- It might lead to a preference for models that are not sensitive to outliers.
- It might lead to a preference for models that are sensitive to outliers.
- It would not affect the choice of the machine learning model.
A high degree of negative skewness indicates the possibility of extreme values toward the negative end of the distribution. This might steer the choice of machine learning model toward those that are not sensitive to outliers, such as tree-based models, or toward those that make fewer assumptions about the data distribution.
You are given a dataset with several missing values that are missing at random. You decided to use multiple imputation. What steps will you follow in applying this method?
- Create several imputed datasets, analyze separately, then average results
- Create several imputed datasets, analyze them together, then interpret results
- Impute only once, then analyze
- Impute several times using different methods, then analyze
The correct approach for multiple imputation is to create several imputed datasets, analyze them separately, and then combine (pool) the results. This accounts for the uncertainty around the missing values and yields valid statistical inferences.
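The steps above can be sketched for a single numeric column with values missing at random. This is a simplified illustration, not a full MICE implementation: each round draws a different plausible imputation from the observed values' distribution, the analysis (here, just the mean) is run separately per completed dataset, and the per-dataset results are then pooled.

```python
import numpy as np

# Toy multiple-imputation workflow: impute several times, analyze each
# completed dataset separately, then combine the separate estimates.
rng = np.random.default_rng(7)
data = np.array([1.2, 2.3, np.nan, 3.1, np.nan, 2.8])  # assumed toy column
observed = data[~np.isnan(data)]

estimates = []
for m in range(5):                            # 5 imputed datasets
    completed = data.copy()
    # stochastic imputation: sample from the observed distribution
    draws = rng.normal(observed.mean(), observed.std(),
                       int(np.isnan(data).sum()))
    completed[np.isnan(completed)] = draws
    estimates.append(completed.mean())        # analyze each one separately

pooled_estimate = float(np.mean(estimates))   # pool the results
```

In practice the pooling step also combines within- and between-imputation variance (Rubin's rules), not just the point estimates.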
What is skewness in the context of data analysis?
- The asymmetry of the distribution.
- The peak of the distribution.
- The range of the distribution.
- The symmetry of the distribution.
Skewness refers to a distortion or asymmetry that deviates from the symmetrical bell curve, or normal distribution, in a set of data. If the curve of a data distribution is skewed to the left or to the right, it means the data are asymmetrical.
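Skewness can be computed directly from its moment definition, E[(x − mean)³] / std³: roughly zero for symmetric data, positive for a long right tail, negative for a long left tail. A sketch with assumed sample distributions:

```python
import numpy as np

# Sample skewness from the moment definition:
# mean of cubed deviations, scaled by the standard deviation cubed.
def skewness(x):
    x = np.asarray(x, dtype=float)
    m, s = x.mean(), x.std()
    return float(np.mean((x - m) ** 3) / s ** 3)

rng = np.random.default_rng(3)
symmetric = rng.normal(size=50_000)           # skewness near 0
right_tailed = rng.exponential(size=50_000)   # positive (right) skew
```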
While using regression imputation, you encounter a situation where the predicted value for the missing data is outside the expected range. How might you resolve this issue?
- Constrain the predictions within the expected range
- Ignore the problem
- Transform the data
- Use a different imputation method
When the predicted value for missing data is outside the expected range, you might want to constrain the predictions within the expected range. By setting logical bounds, you can make sure that the imputed values are consistent with the known characteristics of the data.
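A minimal sketch of the constraining step, assuming a hypothetical feature (exam scores) with a known valid range of [0, 100]; the raw predictions are invented for illustration:

```python
import numpy as np

# Regression output can fall outside the feature's valid range;
# np.clip bounds the imputed values to the known range [0, 100].
LOWER, UPPER = 0.0, 100.0                         # assumed valid range
predicted = np.array([42.0, 108.3, -5.1, 97.6])   # raw regression output
imputed = np.clip(predicted, LOWER, UPPER)
```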
The square root of the ________ gives the standard deviation of a data set.
- Mean
- Median
- Range
- Variance
The "Variance" of a dataset is the average of the squared differences from the mean. The "Standard Deviation" is the square root of the variance. This means it's in the same unit as the data, which helps us understand the dispersion better.
Continuous data typically can be divided into which two main types?
- Discrete and ordinal data
- Interval and ratio data
- Ordinal and nominal data
- Qualitative and quantitative data
Continuous data can typically be divided into two main types: interval and ratio data. Interval data have a consistent scale but no true zero, while ratio data have a consistent scale and a true zero.
What is the Interquartile Range (IQR)?
- The average spread of the data
- The range of all the data
- The range of the middle 50% of the data
- The spread of the most common data
The Interquartile Range (IQR) is the "Range of the middle 50% of the data". It is calculated as the difference between the upper quartile (Q3) and the lower quartile (Q1).
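The calculation can be sketched with numpy's percentile function on an arbitrary sample (values 1 through 11; note that quartile conventions vary, and numpy's default uses linear interpolation):

```python
import numpy as np

# IQR = Q3 - Q1: the range spanned by the middle 50% of the data.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
```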
In what scenario would a Poisson Distribution be a better fit than a Normal Distribution?
- When modeling the number of times an event occurs in a fixed interval
- When the data are continuous
- When the data are negatively skewed
- When the data are positively skewed
A Poisson Distribution would be a better fit when modeling the number of times an event occurs in a fixed interval of time or space. The Poisson Distribution is discrete while the Normal Distribution is continuous.
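A quick sketch of why the Poisson fits count-per-interval data (the rate `lam=4`, e.g. calls per hour, is an assumed example): draws are non-negative integers, and the distribution's variance roughly equals its mean, unlike a Normal where mean and variance are independent parameters.

```python
import numpy as np

# Poisson-distributed event counts per fixed interval, assumed
# average rate of 4 events per interval.
rng = np.random.default_rng(11)
counts = rng.poisson(lam=4, size=100_000)

sample_mean = counts.mean()
sample_var = counts.var()   # for a Poisson, variance ~= mean
```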