When would it be appropriate to use 'transformation' as an outlier handling method?
- When the outliers are a result of data duplication
- When the outliers are errors in data collection
- When the outliers are extreme but legitimate data points
- When the outliers do not significantly impact the data analysis
Transformation is appropriate as an outlier handling method when the outliers are extreme but legitimate data points that carry valuable information, because transforming the variable compresses their scale without discarding them.
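As a rough illustration (the variable name and values below are made up), a log transform keeps an extreme but legitimate point in the data while shrinking its influence:

```python
# Sketch: a log transform compresses an extreme but legitimate value
# (assumes a strictly positive, right-skewed feature; values are illustrative).
import numpy as np

incomes = np.array([30_000, 45_000, 52_000, 61_000, 2_500_000])  # one legitimate extreme value
log_incomes = np.log1p(incomes)  # log(1 + x) keeps the point but reduces its leverage

print(log_incomes.round(2))
```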
Suppose you are comparing the dispersion of two different data sets. One has a higher range, but a lower IQR than the other. What might this tell you about each data set?
- The one with the higher range has more outliers
- The one with the higher range has more variability
- The one with the lower IQR has more variability
- The one with the lower IQR is more skewed
If one dataset has a higher range but a lower IQR than the other, it suggests that the one with the higher range has more outliers. The range is sensitive to extreme values, while the IQR focuses on the middle 50% of the data and is largely unaffected by outliers.
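A small sketch with made-up numbers shows how this can happen:

```python
# Sketch comparing range and IQR on two illustrative datasets.
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 100])       # tight middle, one extreme value
b = np.array([1, 10, 20, 30, 40, 50, 60, 70])  # spread-out middle, no extreme value

for name, x in [("a", a), ("b", b)]:
    data_range = x.max() - x.min()
    q1, q3 = np.percentile(x, [25, 75])
    print(f"{name}: range={data_range}, IQR={q3 - q1}")
# Dataset a has the higher range (driven by the outlier) but the lower IQR.
```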
What is a primary assumption when using regression imputation?
- All data is normally distributed
- Missing data is missing completely at random (MCAR)
- Missing values are negligible
- The relationship between variables is linear
A primary assumption when using regression imputation is that the relationship between variables is linear. Regression imputation fits a regression model to predict the missing values, and standard linear regression assumes a linear relationship between the predictor and response variables.
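A minimal sketch of regression imputation (the column names and values are illustrative): fit a linear model on the complete rows, then predict the missing entries.

```python
# Sketch: regression imputation of a numeric column using a linear model.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "hours_studied": [2, 4, 6, 8, 10, 12],
    "exam_score":    [55, 60, np.nan, 75, np.nan, 90],
})

known = df.dropna(subset=["exam_score"])      # rows with the target present
missing = df[df["exam_score"].isna()]          # rows to impute

model = LinearRegression().fit(known[["hours_studied"]], known["exam_score"])
df.loc[df["exam_score"].isna(), "exam_score"] = model.predict(missing[["hours_studied"]])

print(df)
```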
You are working on a dataset and found that the model performance is poor. On further inspection, you found some data points that are far from the rest. What could be a possible reason for the poor performance of your model?
- Outliers
- Overfitting
- Underfitting
The poor performance of the model might be due to outliers in the dataset. Outliers can have a significant impact on machine learning models because many algorithms (least-squares regression, for example) try to minimize error on every point, so a few extreme values can pull the fitted model away from the bulk of the data.
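One common way to spot such points is the 1.5 × IQR rule; the values below are made up for illustration:

```python
# Sketch: flag points far from the rest using the 1.5 * IQR rule.
import numpy as np

values = np.array([10, 12, 11, 13, 12, 14, 95, 11, 13, 12])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print("outliers:", outliers)
```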
As a data scientist, you've realized that your dataset contains missing values. How would you handle this situation as part of your EDA process?
- Always replace missing values with the mean or median
- Choose an appropriate imputation method depending on the nature of the data and the type of missingness
- Ignore the missing values and proceed with analysis
- Remove all instances with missing values
Handling missing values is an important part of the EDA process. The method used to handle them depends on the nature of the data and the type of missingness (MCAR, MAR, or NMAR). Simple mean/median/mode imputation is generally only safe when data are MCAR; for MAR (and especially NMAR) data, more advanced methods such as regression imputation, multiple imputation, or model-based approaches are preferable.
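As a sketch of two of these options (the DataFrame is illustrative), scikit-learn offers both a simple and a model-based imputer:

```python
# Sketch: simple mean imputation vs. a model-based iterative imputer.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to use IterativeImputer)
from sklearn.impute import SimpleImputer, IterativeImputer

df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0], "y": [2.1, 4.2, 6.1, np.nan]})

mean_filled = SimpleImputer(strategy="mean").fit_transform(df)
model_filled = IterativeImputer(random_state=0).fit_transform(df)

print(mean_filled)
print(model_filled)
```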
You've created a histogram of your data and you notice a few bars standing alone far from the main distribution. What might this suggest?
- Data is evenly distributed
- Normal distribution
- Outliers
- Skewness
In a histogram, bars that stand alone far from the main distribution often suggest the presence of outliers.
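A quick sketch with synthetic data of the kind of histogram described, where a couple of isolated bars sit far to the right of the main bulk:

```python
# Sketch: histogram of mostly normal data plus two far-away points.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 5, 500), [120, 125]])  # two isolated extreme values

plt.hist(data, bins=40)
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()
```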
You have a dataset where the relationships between variables are not linear. Which correlation method is better to use and why?
- Covariance
- Kendall's Tau
- Pearson's correlation coefficient
- Spearman's correlation coefficient
For non-linear relationships between variables, Spearman's correlation coefficient would be a better choice. This is because Spearman's correlation measures the monotonic relationship between two variables and does not require the relationship to be linear.
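A short sketch (synthetic data, y = exp(x)) showing Spearman capturing a monotonic but non-linear relationship that Pearson understates:

```python
# Sketch: Pearson vs. Spearman on a monotonic, strongly non-linear relationship.
import numpy as np
from scipy import stats

x = np.linspace(0, 10, 50)
y = np.exp(x)  # monotonic but far from linear

r_pearson, _ = stats.pearsonr(x, y)
r_spearman, _ = stats.spearmanr(x, y)
print(f"Pearson:  {r_pearson:.3f}")   # well below 1
print(f"Spearman: {r_spearman:.3f}")  # exactly 1, since the relationship is perfectly monotonic
```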
Which of the following is a type of data distribution?
- Age Bracket Distribution
- Binomial Distribution
- Household Distribution
- Sales Distribution
The Binomial Distribution is a type of probability distribution that describes the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success.
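For example, with n = 10 trials and success probability p = 0.3 (parameters chosen for illustration), SciPy gives the probabilities directly:

```python
# Sketch: binomial probabilities for n = 10 trials with p = 0.3.
from scipy import stats

n, p = 10, 0.3
print(stats.binom.pmf(3, n, p))  # P(exactly 3 successes)
print(stats.binom.cdf(3, n, p))  # P(at most 3 successes)
```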
How does Robust scaling minimize the effect of outliers?
- By ignoring them during the scaling process
- By removing the outliers
- By scaling based on the median and interquartile range instead of mean and variance
- By transforming the outliers
Robust scaling minimizes the effect of outliers by scaling based on the median and the interquartile range, instead of the mean and variance used by standardization. The interquartile range is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile). Because the median and interquartile range are much less affected by extreme values than the mean and variance, this method is robust to outliers.
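A small sketch (values made up) comparing scikit-learn's StandardScaler and RobustScaler on data with one extreme value:

```python
# Sketch: standardization vs. robust scaling in the presence of an outlier.
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

print(StandardScaler().fit_transform(X).ravel().round(2))  # non-outliers squashed together
print(RobustScaler().fit_transform(X).ravel().round(2))    # centered on median, scaled by IQR
```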
Which measure of dispersion is defined as the difference between the largest and smallest values in a data set?
- Interquartile Range (IQR)
- Range
- Standard Deviation
- Variance
The "Range" is the measure of dispersion that is defined as the difference between the largest and smallest values in a data set.