A market research survey collects data on customer age, gender, and preference for a product (Yes/No). Identify the types of data present in this survey.
- Age: continuous, Gender: nominal, Preference: nominal
- Age: nominal, Gender: ordinal, Preference: interval
- Age: ordinal, Gender: interval, Preference: ratio
- Age: ratio, Gender: ordinal, Preference: nominal
Age is continuous (ratio-scale) because it can take any value within a range and has a true zero. Gender is nominal because its categories have no inherent order. Preference (Yes/No) is also nominal: it is a binary category, and the two responses are labels rather than ranks on a scale.
You notice that using the Z-score method for a particular data set is yielding too many outliers. What modifications can you make to the method to reduce the number of outliers detected?
- Decrease the Z-score threshold
- Increase the Z-score threshold
- Use the IQR method instead
- Use the modified Z-score method instead
Increasing the Z-score threshold (for example, from 2 to 3) means fewer points exceed it, so fewer points are flagged as outliers.
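The effect of the threshold can be illustrated with a minimal sketch. The data values and the helper name `zscore_outliers` are hypothetical, and the population standard deviation is an assumption made for simplicity.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs((v - mu) / sigma) > threshold]

# Hypothetical data with two large values
data = [10, 12, 11, 13, 12, 11, 10, 12, 30, 45]

flagged_low = zscore_outliers(data, threshold=1.0)   # [30, 45]
flagged_high = zscore_outliers(data, threshold=2.0)  # [45]
flagged_strict = zscore_outliers(data, threshold=3.0)  # []
```

Each increase in the threshold flags the same points or fewer, never more.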
How can EDA assist in identifying errors or anomalies in the dataset?
- By conducting a statistical test of normality
- By creating a correlation matrix of the variables
- By running the dataset through a predefined ML model
- By summarizing and visualizing the data, which can reveal unexpected values or patterns
EDA helps identify errors and anomalies by summarizing and visualizing the data. Summary statistics and graphical representations often make it easier to spot unexpected values, patterns, or aberrations that are not apparent in the raw data.
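As a rough illustration, a basic summary can expose values that should not exist. The ages below are made up, and the 0 to 120 plausibility bounds are an assumption chosen for this sketch.

```python
from statistics import mean, median

# Hypothetical survey ages containing two data-entry errors
ages = [34, 29, 41, 38, 27, 999, 33, -5, 45, 31]

summary = {
    "count": len(ages),
    "min": min(ages),        # -5: a negative age is impossible
    "median": median(ages),  # 33.5: barely affected by the bad values
    "mean": mean(ages),      # 127.2: pulled far upward by 999
    "max": max(ages),        # 999: likely a placeholder or typo
}

# Plausibility check suggested by the summary (assumed bounds: 0 to 120)
suspicious = [a for a in ages if not 0 <= a <= 120]
```

The gap between the mean and the median, together with the minimum and maximum, is what points to the two bad entries.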
When applying regression imputation, what factors need to be taken into consideration?
- Both dependent and independent variables
- None of the variables
- Only the dependent variable
- Only the independent variables
When applying regression imputation, both dependent and independent variables need to be taken into consideration. The variable with missing values acts as the dependent variable, and the other variables act as independent variables (predictors). A regression model is built using the complete cases, and this model is then used to predict the missing values in the incomplete cases. It is therefore important to choose carefully which variables to include in the regression model.
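The procedure can be sketched with a single predictor and ordinary least squares. The data and the helper name `fit_line` are hypothetical, and a real application would usually involve several predictors and a statistics library.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical rows of (independent x, dependent y); None marks a missing y
rows = [(1, 2.0), (2, 4.0), (3, None), (4, 8.0), (5, None)]

# 1. Fit the model on the complete cases only
complete = [(x, y) for x, y in rows if y is not None]
slope, intercept = fit_line([x for x, _ in complete], [y for _, y in complete])

# 2. Predict the dependent variable wherever it is missing
imputed = [(x, y if y is not None else slope * x + intercept) for x, y in rows]
```

Because the complete cases here lie exactly on y = 2x, the imputed values come out at approximately 6 and 10.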
When would it be appropriate to use 'transformation' as an outlier handling method?
- When the outliers are a result of data duplication
- When the outliers are errors in data collection
- When the outliers are extreme but legitimate data points
- When the outliers do not significantly impact the data analysis
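Transformation is appropriate as an outlier handling method when the outliers are extreme but legitimate data points that carry valuable information. A transformation such as a logarithm compresses the scale, reducing the influence of these points without discarding them.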
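One common choice is a logarithmic transformation. The sketch below assumes strictly positive values, and the income figures and the helper name `spread_ratio` are invented for illustration.

```python
import math

# Hypothetical incomes: one extreme but legitimate value
incomes = [32_000, 41_000, 38_000, 45_000, 2_500_000]

# log10 requires strictly positive inputs
log_incomes = [math.log10(v) for v in incomes]

def spread_ratio(values):
    """Ratio of the largest to the smallest value."""
    return max(values) / min(values)

raw_spread = spread_ratio(incomes)      # about 78
log_spread = spread_ratio(log_incomes)  # about 1.4
```

Every observation is kept, but the extreme income no longer dominates the scale.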
Which of the following is a type of data distribution?
- Age Bracket Distribution
- Binomial Distribution
- Household Distribution
- Sales Distribution
The Binomial Distribution is a probability distribution that describes the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success.
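For concreteness, the probability of exactly k successes in n trials is C(n, k) * p^k * (1 - p)^(n - k). A small sketch of that formula follows, with `binomial_pmf` as an illustrative helper name.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 3 heads in 10 fair coin flips: 120 / 1024
p_three_heads = binomial_pmf(3, 10, 0.5)

# The probabilities over all possible counts sum to 1
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
```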
How does Robust scaling minimize the effect of outliers?
- By ignoring them during the scaling process
- By removing the outliers
- By scaling based on the median and interquartile range instead of mean and standard deviation
- By transforming the outliers
Robust scaling minimizes the effect of outliers by using the median and the interquartile range (IQR) for scaling, instead of the mean and standard deviation used by standardization. The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile). Because the median and IQR are far less sensitive to extreme values than the mean and standard deviation, this method is robust to outliers.
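A minimal sketch of the idea is shown below. The function name `robust_scale` is hypothetical, and the quartiles use the "inclusive" method of `statistics.quantiles`, which is a choice made here; other libraries may compute quartiles slightly differently.

```python
from statistics import median, quantiles

def robust_scale(values):
    """Scale values as (x - median) / IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; values cannot be robustly scaled")
    med = median(values)
    return [(v - med) / iqr for v in values]

# Hypothetical data with one outlier
scaled = robust_scale([1, 2, 3, 4, 100])
# [-1.0, -0.5, 0.0, 0.5, 48.5]

# Making the outlier ten times larger leaves the other scaled values unchanged
scaled_bigger = robust_scale([1, 2, 3, 4, 1000])
```

The outlier still maps to a large scaled value, but it does not distort how the remaining points are scaled.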
Which measure of dispersion is defined as the difference between the largest and smallest values in a data set?
- Interquartile Range (IQR)
- Range
- Standard Deviation
- Variance
The range is the measure of dispersion defined as the difference between the largest and smallest values in a data set.
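As a quick worked example, with made-up scores and an illustrative helper name `value_range`:

```python
def value_range(values):
    """Difference between the largest and smallest values."""
    return max(values) - min(values)

scores = [12, 7, 3, 18, 9]
score_range = value_range(scores)  # 18 - 3 = 15
```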
The missing data mechanism where missingness is related only to the observed data is referred to as _________.
- All missing data
- MAR
- MCAR
- NMAR
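In MAR (Missing at Random), the probability that a value is missing depends only on the observed data, not on the missing values themselves. By contrast, in MCAR the missingness is unrelated to any data, and in NMAR it depends on the unobserved values.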
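A small simulation can make the MAR mechanism concrete. Everything here is hypothetical: the variables, the age cutoff of 50, and the missingness probabilities are assumptions chosen only to illustrate the definition.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

records = []
for _ in range(2000):
    age = rng.randint(20, 70)  # always observed
    income = 20_000 + 500 * age + rng.gauss(0, 5_000)
    # MAR: the chance that income is missing depends only on the observed age,
    # never on the income value itself
    p_missing = 0.6 if age >= 50 else 0.05
    if rng.random() < p_missing:
        income = None
    records.append({"age": age, "income": income})

def missing_rate(rows):
    """Fraction of rows whose income is missing."""
    return sum(r["income"] is None for r in rows) / len(rows)

older = [r for r in records if r["age"] >= 50]
younger = [r for r in records if r["age"] < 50]
```

Comparing `missing_rate(older)` with `missing_rate(younger)` shows missingness varying with an observed variable, which is the signature of MAR.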
You are given a dataset for an upcoming data analysis project. What initial EDA steps would you take before moving to model building?
- Explore the structure of the dataset, summarize the data, and create visualizations
- Perform a detailed statistical analysis
- Run a quick ML model to test the data
- Start cleaning and wrangling the data
Before moving to model building, it's important to first understand the dataset you're working with. The initial EDA steps would typically include exploring the structure of the dataset, summarizing the data (such as calculating central tendency measures and dispersion), and creating visualizations to uncover patterns, trends, and relationships.
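These three steps can be sketched on a tiny, made-up survey table. The column names and records are hypothetical, and in practice a data-frame library and a plotting library would normally replace the plain-Python code.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical survey records; None marks a missing value
records = [
    {"age": 34, "gender": "F", "prefers_product": "Yes"},
    {"age": 29, "gender": "M", "prefers_product": "No"},
    {"age": None, "gender": "F", "prefers_product": "Yes"},
    {"age": 41, "gender": "M", "prefers_product": "Yes"},
]

# 1. Structure: size, columns, and missing values per column
n_rows = len(records)
columns = list(records[0])
missing = {c: sum(r[c] is None for r in records) for c in columns}

# 2. Summaries: central tendency and spread for numeric data, counts for categories
ages = [r["age"] for r in records if r["age"] is not None]
age_summary = {"min": min(ages), "median": median(ages),
               "mean": mean(ages), "max": max(ages)}
preference_counts = Counter(r["prefers_product"] for r in records)

# 3. Visualization: a text bar chart of the preference counts
for label, count in preference_counts.items():
    print(f"{label:>3} | {'#' * count}")
```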