You've created a histogram of your data and you notice a few bars standing alone far from the main distribution. What might this suggest?
- Data is evenly distributed
- Normal distribution
- Outliers
- Skewness
In a histogram, bars that stand alone far from the main distribution often suggest the presence of outliers.
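As a rough illustration of that reading, the sketch below bins a small made-up sample and flags any non-empty bin whose neighbouring bins are both empty. The data, the bin width of 5, and the "both neighbours empty" rule are assumptions chosen for the example, not a standard outlier test.

```python
from collections import Counter

# Hypothetical sample: most values sit between 8 and 14, two sit far away.
data = [8, 9, 10, 11, 11, 12, 12, 12, 13, 13, 14, 41, 43]

bin_width = 5
# Map each value to the left edge of its bin, then count values per bin.
counts = Counter((x // bin_width) * bin_width for x in data)

# Print a text histogram, including the empty bins in between.
for left in range(min(counts), max(counts) + bin_width, bin_width):
    label = f"{left}-{left + bin_width - 1}"
    print(f"{label:<6} {'#' * counts.get(left, 0)}")

# Simple heuristic: a non-empty bin with empty bins on both sides stands alone.
isolated = sorted(
    left for left in counts
    if counts.get(left - bin_width, 0) == 0
    and counts.get(left + bin_width, 0) == 0
)
print("isolated bins start at:", isolated)
```

The values that land in the flagged bin are candidates for a closer look; whether they are errors or genuine extreme observations still has to be judged from context.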
You have a dataset where the relationships between variables are not linear. Which correlation method is better to use and why?
- Covariance
- Kendall's Tau
- Pearson's correlation coefficient
- Spearman's correlation coefficient
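For non-linear but monotonic relationships between variables, Spearman's correlation coefficient is the better choice. Spearman's correlation is computed on the ranks of the values, so it measures how consistently one variable increases or decreases with the other and does not require the relationship to be linear. Kendall's Tau is also rank-based and suits similar situations, whereas Pearson's coefficient and covariance only capture linear association. None of these measures detects non-monotonic patterns such as a U-shape.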
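To make the difference concrete, here is a minimal sketch that computes both coefficients by hand on an exponential relationship. The helper names (`pearson`, `ranks`, `spearman`) and the sample data are illustrative assumptions; in practice a statistics library would normally be used instead.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical data: y grows exponentially with x (monotonic, not linear).
x = list(range(1, 11))
y = [math.exp(v) for v in x]

print("Pearson: ", round(pearson(x, y), 3))
print("Spearman:", round(spearman(x, y), 3))
```

Because y always increases with x, the ranks line up perfectly and Spearman's coefficient reaches 1, while Pearson's coefficient is noticeably lower since the points do not fall on a straight line.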
Which of the following is a type of data distribution?
- Age Bracket Distribution
- Binomial Distribution
- Household Distribution
- Sales Distribution
The Binomial Distribution is a type of probability distribution that describes the number of successes in a fixed number of independent Bernoulli trials each with the same probability of success.
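The definition can be sketched directly from the formula P(X = k) = C(n, k) p^k (1 - p)^(n - k). The function name `binomial_pmf` and the choice of n = 10, p = 0.5 are assumptions made for this example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical setting: 10 fair coin flips, counting heads.
n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(f"P(X = {k:>2}) = {prob:.4f}")

# The probabilities over all possible outcomes should add up to 1.
print("total:", sum(pmf))
```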
How does Robust scaling minimize the effect of outliers?
- By ignoring them during the scaling process
- By removing the outliers
- By scaling based on the median and interquartile range instead of mean and variance
- By transforming the outliers
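Robust scaling minimizes the effect of outliers by centering on the median and dividing by the interquartile range, instead of using the mean and standard deviation (the square root of the variance) as standardization does. The interquartile range is the distance between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile). Because the median and interquartile range depend on the middle of the ordered data, a few extreme values barely change them, whereas a single extreme value can shift the mean and inflate the standard deviation. The outliers themselves are kept and still end up far from zero after scaling, but they no longer distort the scale of the remaining values.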
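The contrast with standardization can be sketched in a few lines. The helper names, the sample data, and the use of the "inclusive" quantile method from Python's `statistics` module are assumptions for this illustration; library scalers may compute quartiles slightly differently.

```python
import statistics

def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - med) / iqr for v in values]

def standardize(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical data: eight ordinary values and one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]

robust = robust_scale(data)
standard = standardize(data)

print("robust:  ", [round(v, 2) for v in robust])
print("standard:", [round(v, 2) for v in standard])

# Spread of the eight ordinary values after each kind of scaling.
print("inlier spread (robust):  ", round(robust[7] - robust[0], 2))
print("inlier spread (standard):", round(standard[7] - standard[0], 2))
```

With standardization, the single extreme value inflates the standard deviation, so the eight ordinary values are squeezed into a narrow band. With robust scaling they keep a usable spread, and the extreme value simply lands far from zero.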
Which measure of dispersion is defined as the difference between the largest and smallest values in a data set?
- Interquartile Range (IQR)
- Range
- Standard Deviation
- Variance
The "Range" is the measure of dispersion that is defined as the difference between the largest and smallest values in a data set.
The missing data mechanism where missingness is related only to the observed data is referred to as _________.
- All missing data
- MAR
- MCAR
- NMAR
In MAR (Missing at Random), the missingness is related only to the observed data.
You are given a dataset for an upcoming data analysis project. What initial EDA steps would you take before moving to model building?
- Explore the structure of the dataset, summarize the data, and create visualizations
- Perform a detailed statistical analysis
- Run a quick ML model to test the data
- Start cleaning and wrangling the data
Before moving to model building, it's important to first understand the dataset you're working with. The initial EDA steps would typically include exploring the structure of the dataset, summarizing the data (such as calculating central tendency measures and dispersion), and creating visualizations to uncover patterns, trends, and relationships.
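A first pass along these lines might look like the sketch below. The records, column names, and the text bar chart are all hypothetical stand-ins; on a real project a dataframe library and a plotting library would usually do this work.

```python
import statistics
from collections import Counter

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "city": "Leeds"},
    {"age": 29, "income": None, "city": "York"},
    {"age": 41, "income": 61000, "city": "Leeds"},
    {"age": 38, "income": 58000, "city": None},
    {"age": 52, "income": 75000, "city": "York"},
]

# 1. Structure: size, column names, and missing values per column.
columns = list(rows[0])
print(f"{len(rows)} rows x {len(columns)} columns: {columns}")
missing = {c: sum(r[c] is None for r in rows) for c in columns}
print("missing per column:", missing)

# 2. Summary: central tendency and dispersion for numeric columns.
summary = {}
for c in columns:
    values = [r[c] for r in rows if isinstance(r[c], (int, float))]
    if values:
        summary[c] = {
            "mean": statistics.fmean(values),
            "median": statistics.median(values),
            "min": min(values),
            "max": max(values),
        }
        print(c, summary[c])

# 3. Visualization: a crude text bar chart of a categorical column.
city_counts = Counter(r["city"] for r in rows if r["city"] is not None)
for city, count in city_counts.most_common():
    print(f"{city:<6} {'#' * count}")
```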
Which measure of central tendency divides a data set into two equal halves?
- Mean
- Median
- Mode
The "Median" is the measure of central tendency that divides a data set into two equal halves. It is the middle score for a set of ordered data such that 50% of the scores are above it, and 50% are below it.
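A short sketch of the idea, using Python's `statistics` module on made-up samples. It also shows why the median is often preferred when a sample contains an extreme value.

```python
import statistics

# Hypothetical samples with an odd and an even number of values.
odd = [3, 1, 4, 1, 5]
even = [3, 1, 4, 1, 5, 9]

median_odd = statistics.median(odd)    # middle value of the sorted data
median_even = statistics.median(even)  # mean of the two middle values
print(median_odd, median_even)

# Half of the values fall below the median and half above it.
below = sum(v < median_even for v in even)
above = sum(v > median_even for v in even)
print(below, above)

# One extreme value pulls the mean upward but leaves the median in place.
skewed = [30, 32, 35, 38, 400]
print("mean:  ", statistics.fmean(skewed))
print("median:", statistics.median(skewed))
```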
__________ missing data occurs when the probability of an observation being missing depends on both observed and unobserved data.
- All missing data
- MAR
- MCAR
- NMAR
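NMAR (Not Missing at Random, also written MNAR) missing data occurs when the probability of a value being missing depends on unobserved data, typically the missing value itself, possibly in addition to the observed data. Because the cause of the missingness is not fully visible in the data, it cannot be corrected using the observed values alone.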
How does the assumption of MAR differ from MCAR in terms of data missingness?
- MAR assumes the missingness is only related to the observed data
- MAR assumes the missingness is related to the unobserved data
- MAR assumes the missingness is unrelated to any variable
- There's no difference between MAR and MCAR
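In MCAR (Missing Completely at Random), the missingness is completely random and does not depend on any variable, observed or unobserved. In MAR, the missingness is not completely random: its probability depends on the observed data, but once those observed values are taken into account it does not depend on the unobserved (missing) values themselves.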
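The three mechanisms can be compared with a small simulation. Everything here is an assumption made for illustration: the synthetic `ages` and `incomes`, the missingness probabilities, and the thresholds of 40 years and 40,000. Age is always observed; income is the variable that goes missing.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000
ages = [rng.randint(20, 60) for _ in range(n)]
# Synthetic incomes that rise with age, plus noise.
incomes = [1000 * a + rng.gauss(0, 5000) for a in ages]

def make_mask(prob_missing):
    """True where income is missing, given a rule prob_missing(age, income)."""
    return [rng.random() < prob_missing(a, inc) for a, inc in zip(ages, incomes)]

# MCAR: the same probability for everyone.
mcar = make_mask(lambda a, inc: 0.3)
# MAR: depends only on age, which is always observed.
mar = make_mask(lambda a, inc: 0.5 if a < 40 else 0.1)
# NMAR: depends on the income value that may itself go missing.
nmar = make_mask(lambda a, inc: 0.5 if inc > 40_000 else 0.1)

def missing_rate(mask, keep):
    """Share of missing incomes among the rows selected by keep."""
    selected = [m for m, k in zip(mask, keep) if k]
    return sum(selected) / len(selected)

def observed_mean(mask):
    """Mean income over the rows where income is not missing."""
    seen = [inc for inc, m in zip(incomes, mask) if not m]
    return sum(seen) / len(seen)

young = [a < 40 for a in ages]
older = [a >= 40 for a in ages]
true_mean = sum(incomes) / n

for name, mask in [("MCAR", mcar), ("MAR", mar), ("NMAR", nmar)]:
    print(
        f"{name:<4} missing rate young={missing_rate(mask, young):.2f} "
        f"older={missing_rate(mask, older):.2f} "
        f"observed mean={observed_mean(mask):.0f} (true {true_mean:.0f})"
    )
```

Under MCAR the missing rate is about the same in both age groups and the observed mean stays close to the true mean. Under MAR the missing rate differs by age, which is visible in the data and can therefore be adjusted for. Under NMAR high incomes disappear more often, so the observed mean is biased downward, and the cause cannot be confirmed from the observed values alone.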