What is the Interquartile Range (IQR)?
- The average spread of the data
- The range of all the data
- The range of the middle 50% of the data
- The spread of the most common data
The Interquartile Range (IQR) is the "Range of the middle 50% of the data". It is calculated as the difference between the upper quartile (Q3) and the lower quartile (Q1).
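As a quick check of the definition, here is a minimal sketch that computes the IQR with Python's standard `statistics` module. The helper name `iqr` and the sample list are illustrative assumptions. Quartile conventions also differ between tools, so the "inclusive" method used here is only one reasonable choice.

```python
import statistics

def iqr(values):
    """Return Q3 - Q1 using the 'inclusive' quartile convention (an assumption)."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical sample
print(iqr(scores))  # Q1 = 3, Q3 = 7, so the IQR is 4.0
```

Other quartile conventions may give slightly different Q1 and Q3 for the same data, especially with small samples.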
Continuous data typically can be divided into which two main types?
- Discrete and ordinal data
- Interval and ratio data
- Ordinal and nominal data
- Qualitative and quantitative data
Continuous data can typically be divided into two main types: interval and ratio data. Interval data have a consistent scale but no true zero, while ratio data have a consistent scale and a true zero.
The square root of the ________ gives the standard deviation of a data set.
- Mean
- Median
- Range
- Variance
The "Variance" of a dataset is the average of the squared differences from the mean. The "Standard Deviation" is the square root of the variance. Because of the square root, the standard deviation is expressed in the same units as the data, which makes the dispersion easier to interpret.
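The relationship can be sketched in a few lines. This example follows the "average of the squared differences" definition, which is the population variance (dividing by n). The sample variance, which divides by n - 1, is a common alternative not shown here. The function names and data are illustrative assumptions.

```python
import math

def population_variance(values):
    """Average of the squared differences from the mean (divides by n)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def population_std(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(population_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample with mean 5
print(population_variance(data))  # 4.0
print(population_std(data))       # 2.0
```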
Mishandling missing data can lead to a high level of ________, impacting model performance.
- bias
- precision
- recall
- variance
If missing data is handled improperly, it can lead to biased training data, which can cause the model to learn incorrect or irrelevant patterns and, as a result, adversely affect its performance.
How does multiple imputation handle missing data?
- It deletes rows with missing data
- It estimates multiple values for each missing value
- It fills missing data with mode values
- It replaces missing data with a single value
Multiple imputation estimates multiple values for each missing value, instead of filling in a single value for each missing point. It reflects the uncertainty around the true value and provides more realistic estimates.
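The idea can be illustrated with a deliberately simplified sketch. It assumes missing entries are marked with `None` and draws each replacement from a normal distribution fitted to the observed values. Practical multiple imputation usually draws from a predictive model and also accounts for uncertainty in the model's parameters, so treat this as a toy illustration rather than a full procedure. All names here are hypothetical.

```python
import random
import statistics

def impute_once(values, rng):
    """Fill each None with one random draw based on the observed values."""
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    return [v if v is not None else rng.gauss(mu, sigma) for v in values]

def multiple_imputation_mean(values, m=5, seed=0):
    """Create m completed datasets, estimate the mean on each, then pool."""
    rng = random.Random(seed)
    completed = [impute_once(values, rng) for _ in range(m)]
    estimates = [statistics.mean(c) for c in completed]
    pooled = statistics.mean(estimates)             # pooled point estimate
    between_var = statistics.variance(estimates)    # spread across imputations
    return completed, pooled, between_var

data = [4.1, None, 5.0, 4.7, None, 5.3]  # hypothetical column with gaps
completed, pooled, between_var = multiple_imputation_mean(data, m=5, seed=0)
print(pooled, between_var)
```

The spread of the estimates across the completed datasets is what carries the uncertainty about the missing values; a single imputed value would hide it.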
Your EDA reveals a non-normal distribution of data in your dataset. How might this insight affect your choice of machine learning models or algorithms?
- You should always normalize your data
- You should use only non-parametric models
- You should use only unsupervised learning models
- Your choice of ML models might be influenced, as some models make certain assumptions about the data distribution
The distribution of data can influence the choice of machine learning models or algorithms. Some models make assumptions about the data distribution: for example, classical inference for linear regression assumes normally distributed residuals, and Gaussian naive Bayes assumes normally distributed features within each class. If these assumptions are violated, the model may perform poorly or its estimates may be unreliable. Therefore, understanding the data distribution can guide you in choosing the most appropriate models or in deciding whether to transform your data.
What is the key characteristic of a Uniform Distribution?
- All values are equally likely
- Most values are around the mean
- Values are skewed to the left
- Values are skewed to the right
In a Uniform Distribution, all values have the same frequency/probability. That is, they are all equally likely.
How is the whisker of a box plot usually calculated?
- Mean ± Standard Deviation
- Median ± Interquartile Range
- Minimum and maximum values of the dataset
- Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
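The whiskers of a box plot are typically bounded by the limits Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. In the common convention, each whisker extends to the most extreme data point that still lies within these limits, and points beyond them are plotted individually as outliers.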
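A minimal sketch of the 1.5 * IQR rule follows. It assumes the common convention that each whisker stops at the most extreme data point inside the limits, and it uses the "inclusive" quartile method from Python's `statistics` module. The function name and sample data are illustrative assumptions.

```python
import statistics

def box_plot_whiskers(values, k=1.5):
    """Return the fences, whisker ends, and outliers under the k * IQR rule."""
    data = sorted(values)
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers end at the most extreme points that still lie within the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical data with one extreme value
result = box_plot_whiskers(sample)
print(result)
```

For this sample, Q1 = 3.25 and Q3 = 7.75, so the limits are -3.5 and 14.5; the whiskers end at 1 and 9, and 30 is flagged as an outlier.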
Given a machine learning algorithm that is highly sensitive to the range of input values, which scaling technique should you implement?
- Min-Max scaling because it scales all values between 0 and 1
- No scaling, as the original data values should be maintained
- Robust scaling because it is not affected by outliers
- Z-score standardization because it creates a normal distribution
Min-Max scaling is suitable when the algorithm is sensitive to the range of input values, as it scales all feature values into a specified range (usually 0-1). This ensures that all features have the same scale.
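Min-Max scaling can be sketched as follows. The function name, the default target range, and the choice to map a constant feature to the lower bound are assumptions made for illustration.

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Linearly rescale values so the minimum maps to lo and the maximum to hi."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant feature has no spread; map everything to lo (an assumption).
        return [lo for _ in values]
    return [lo + (x - v_min) * (hi - lo) / (v_max - v_min) for x in values]

feature = [10, 20, 30, 50]  # hypothetical feature values
print(min_max_scale(feature))  # [0.0, 0.25, 0.5, 1.0]
```

Because the result depends on the observed minimum and maximum, a single extreme value can compress the remaining values into a narrow part of the range.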
Your data shows a notable difference between the mean and the median values. Which type of scaling would be least affected by this discrepancy?
- All scaling methods are affected by this discrepancy
- Min-Max scaling because it scales all values between 0 and 1
- Robust scaling because it uses median and quartile ranges
- Z-score standardization because it creates a normal distribution
A notable gap between the mean and the median usually signals skew or outliers. Robust scaling uses the median and interquartile range to scale the data, so it does not depend on the mean and is thus least affected by a discrepancy between the mean and the median.
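A minimal sketch of robust scaling, (x - median) / IQR, shows why an extreme value has little effect on the other scaled values. The function name, the "inclusive" quartile method, and the sample data are illustrative assumptions.

```python
import statistics

def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    median = statistics.median(values)
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; robust scaling is undefined for this data")
    return [(x - median) / iqr for x in values]

with_outlier = robust_scale([1, 2, 3, 4, 5, 6, 7, 8, 100])
without_outlier = robust_scale([1, 2, 3, 4, 5, 6, 7, 8, 9])

# The median (5) and IQR (4) are the same in both cases, so the first
# eight scaled values are identical; only the extreme point differs.
print(with_outlier[:8] == without_outlier[:8])  # True
```

Replacing 9 with 100 shifts the mean from 5 to about 15.1, but the median and quartiles stay put, which is the property the explanation above relies on.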