You notice that using the Z-score method for a particular data set is yielding too many outliers. What modifications can you make to the method to reduce the number of outliers detected?
- Decrease the Z-score threshold
- Increase the Z-score threshold
- Use the IQR method instead
- Use the modified Z-score method instead
Increasing the Z-score threshold means fewer points will exceed it, so fewer outliers will be identified.
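As a quick check, here is a minimal sketch (assuming NumPy and a synthetic normal sample) that counts how many points are flagged at thresholds of 2 versus 3:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=5, size=1000)  # synthetic sample

z = (data - data.mean()) / data.std()          # standard Z-scores

for threshold in (2.0, 3.0):
    n_outliers = np.sum(np.abs(z) > threshold)
    print(f"threshold={threshold}: {n_outliers} outliers flagged")
```

Raising the threshold from 2 to 3 typically cuts the flagged count sharply, since far fewer points lie more than three standard deviations from the mean.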
A market research survey collects data on customer age, gender, and preference for a product (Yes/No). Identify the types of data present in this survey.
- Age: continuous, Gender: nominal, Preference: ordinal
- Age: nominal, Gender: ordinal, Preference: interval
- Age: ordinal, Gender: interval, Preference: ratio
- Age: ratio, Gender: ordinal, Preference: nominal
Age is continuous because it can take any value within a range. Gender is nominal: categorical with no inherent order. Preference is treated as ordinal here because the two responses can be ranked (Yes indicates a stronger preference than No).
If the variance of a data set is zero, then all data points are ________.
- Equal
- Infinite
- Negative
- Positive
If the "Variance" of a data set is zero, then all data points are "Equal". Variance is a measure of how far a set of numbers is spread out from their average value. A variance of zero indicates that all the values within a set of data are identical.
As a data scientist, you've realized that your dataset contains missing values. How would you handle this situation as part of your EDA process?
- Always replace missing values with the mean or median
- Choose an appropriate imputation method depending on the nature of the data and the type of missingness
- Ignore the missing values and proceed with analysis
- Remove all instances with missing values
Handling missing values is an important part of the EDA process. The method used to handle them depends on the nature of the data and the type of missingness: MCAR (missing completely at random), MAR (missing at random), or NMAR (not missing at random). Simple mean/median/mode imputation can work for MCAR or MAR data, while advanced methods such as regression imputation, multiple imputation, or model-based approaches are better suited to NMAR data.
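For illustration, a minimal sketch of simple median imputation, assuming pandas, scikit-learn's SimpleImputer, and a small hypothetical frame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with missing entries
df = pd.DataFrame({"income": [42_000, np.nan, 58_000, 61_000, np.nan],
                   "age":    [25, 31, np.nan, 47, 52]})

# Median imputation, a reasonable default for MCAR/MAR numeric data
imputer = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)
```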
You are working on a dataset and found that the model performance is poor. On further inspection, you found some data points that are far from the rest. What could be a possible reason for the poor performance of your model?
- Outliers
- Overfitting
- Underfitting
The poor performance of the model might be due to outliers in the dataset. Many models, such as least-squares regression, are sensitive to extreme values, so a few distant points can significantly distort the fitted parameters and degrade performance.
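A small sketch (assuming NumPy and scikit-learn, with synthetic data) showing how a single injected outlier can drag a regression slope away from the true value of 3:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X.ravel() + rng.normal(0, 1, size=50)  # clean linear relation

X_out = np.vstack([X, [[9.5]]])                # inject one extreme point
y_out = np.append(y, 120.0)

clean = LinearRegression().fit(X, y)
noisy = LinearRegression().fit(X_out, y_out)
print(f"slope without outlier: {clean.coef_[0]:.2f}")
print(f"slope with outlier:    {noisy.coef_[0]:.2f}")
```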
What is a primary assumption when using regression imputation?
- All data is normally distributed
- Missing data is missing completely at random (MCAR)
- Missing values are negligible
- The relationship between variables is linear
A primary assumption when using regression imputation is that the relationship between variables is linear. This is because regression imputation uses a regression model to predict missing values, and the basic form of regression models assumes a linear relationship between predictor and response variables.
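A minimal sketch of regression imputation, assuming pandas and scikit-learn and a hypothetical height/weight frame: the observed rows train a linear model that then fills the missing entries.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"height": [160, 165, 170, 175, 180, 185],
                   "weight": [55, 60, np.nan, 70, np.nan, 82]})

# Fit on the fully observed rows
observed = df.dropna()
model = LinearRegression().fit(observed[["height"]], observed["weight"])

# Predict the missing weights from height
missing = df["weight"].isna()
df.loc[missing, "weight"] = model.predict(df.loc[missing, ["height"]])
print(df)
```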
Suppose you are comparing the dispersion of two different data sets. One has a higher range, but a lower IQR than the other. What might this tell you about each data set?
- The one with the higher range has more outliers
- The one with the higher range has more variability
- The one with the lower IQR has more variability
- The one with the lower IQR is more skewed
If one dataset has a higher range but a lower IQR than the other, it could suggest that "The one with the higher range has more outliers". The range is sensitive to extreme values, while the IQR focuses on the middle 50% of data and is not affected by outliers.
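To see this in numbers, a short sketch (assuming NumPy and two hand-picked toy datasets): dataset A contains one extreme value, so its range is large while its IQR stays small; dataset B is evenly spread, so the opposite holds.

```python
import numpy as np

a = np.array([1, 4, 5, 5, 6, 7, 8, 50])     # one extreme value
b = np.array([1, 3, 6, 9, 12, 15, 18, 20])  # evenly spread

for name, x in (("A", a), ("B", b)):
    q1, q3 = np.percentile(x, [25, 75])
    print(f"{name}: range={x.max() - x.min()}, IQR={q3 - q1}")
```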
You have a dataset where the relationships between variables are not linear. Which correlation method is better to use and why?
- Covariance
- Kendall's Tau
- Pearson's correlation coefficient
- Spearman's correlation coefficient
For non-linear relationships between variables, Spearman's correlation coefficient would be a better choice. This is because Spearman's correlation measures the monotonic relationship between two variables and does not require the relationship to be linear.
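A quick comparison (assuming SciPy and a synthetic monotonic but non-linear relationship, y = x cubed): Spearman scores the relationship as perfect while Pearson understates it.

```python
import numpy as np
from scipy import stats

x = np.linspace(1, 10, 50)
y = x ** 3                     # monotonic but strongly non-linear

print(f"Pearson:  {stats.pearsonr(x, y)[0]:.3f}")   # below 1, understates the link
print(f"Spearman: {stats.spearmanr(x, y)[0]:.3f}")  # exactly 1 for monotonic data
```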
You've created a histogram of your data and you notice a few bars standing alone far from the main distribution. What might this suggest?
- Data is evenly distributed
- Normal distribution
- Outliers
- Skewness
In a histogram, bars that stand alone far from the main distribution often suggest the presence of outliers.
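For example, a minimal sketch (assuming NumPy and Matplotlib, with synthetic data) where two injected extreme values show up as isolated bars:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
data = np.append(rng.normal(50, 5, 500), [95, 98])  # two extreme points

plt.hist(data, bins=40)
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()  # isolated bars near 95-98 sit far from the main mass
```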
What is the term for the measure of how spread out the values in a data set are?
- Central Tendency
- Dispersion
- Kurtosis
- Skewness
The measure of how spread out the values in a data set are is called "Dispersion". Common dispersion measures include the range, interquartile range (IQR), variance, and standard deviation.
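These measures are straightforward to compute; a short sketch assuming NumPy:

```python
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9])
q1, q3 = np.percentile(x, [25, 75])

print(f"range:    {x.max() - x.min()}")
print(f"IQR:      {q3 - q1}")
print(f"variance: {np.var(x)}")  # population variance, here 4.0
print(f"std dev:  {np.std(x)}")  # square root of the variance, here 2.0
```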
What range of values does a dataset typically have after Min-Max scaling?
- -1 to 1
- 0 to 1
- Depends on the dataset
- Depends on the feature
Min-Max scaling transforms each feature to a given range by subtracting the feature minimum and dividing by the feature range, and the default range is 0 to 1. Therefore, after Min-Max scaling, the dataset will typically have values ranging from 0 to 1.
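The transformation is x' = (x - min) / (max - min). A minimal sketch assuming NumPy:

```python
import numpy as np

x = np.array([10.0, 20.0, 35.0, 50.0])

x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)  # [0.  0.25  0.625  1.], endpoints map to 0 and 1
```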
Consider a data distribution with a positive skewness and a high kurtosis. What does this scenario indicate about the distribution?
- It has a symmetrical distribution.
- It has evenly spread out values.
- It has many values clustered around the left tail with potential outliers.
- It has many values clustered around the right tail with potential outliers.
Positive skewness and high kurtosis indicate that the distribution has a long right tail and a sharp peak. Most of the values are concentrated on the left (lower) side of the distribution, with potential outliers among the larger, more positive values.
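As an illustration (assuming SciPy and a synthetic right-skewed exponential sample):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.exponential(scale=2.0, size=5000)  # right-skewed, heavy-tailed

print(f"skewness: {stats.skew(data):.2f}")      # > 0, long right tail
print(f"kurtosis: {stats.kurtosis(data):.2f}")  # > 0, sharper peak than normal
```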