The missing data mechanism where missingness is related only to the observed data is referred to as _________.
- All missing data
- MAR
- MCAR
- NMAR
In MAR (Missing at Random), the probability that a value is missing depends only on the observed data, not on the unobserved values themselves.
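A minimal sketch of what MAR looks like in practice, assuming a hypothetical dataset with an always-observed `age` column and an `income` column that goes missing more often for older respondents. The column names, the cutoff at 50, and the probabilities are illustrative assumptions, not taken from any real dataset.

```python
import random

def make_mar(rows, seed=0):
    """Blank out 'income' with a probability that depends only on the observed 'age'."""
    rng = random.Random(seed)
    masked = []
    for row in rows:
        # Assumed rule: the chance of missingness is driven by age alone,
        # never by the income value itself (that would be NMAR).
        p_missing = 0.6 if row["age"] >= 50 else 0.1
        income = None if rng.random() < p_missing else row["income"]
        masked.append({"age": row["age"], "income": income})
    return masked

def missing_rate(records):
    """Fraction of records whose income is missing."""
    return sum(r["income"] is None for r in records) / len(records)

# Hypothetical data: ages cycle through 20..69.
rows = [{"age": 20 + (i % 50), "income": 30000 + 500 * i} for i in range(1000)]
masked = make_mar(rows)

older = [r for r in masked if r["age"] >= 50]
younger = [r for r in masked if r["age"] < 50]
print(missing_rate(older), missing_rate(younger))
```

Because the missingness rate differs between age groups but is explained entirely by `age`, conditioning on `age` is enough to account for it.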
You are given a dataset for an upcoming data analysis project. What initial EDA steps would you take before moving to model building?
- Explore the structure of the dataset, summarize the data, and create visualizations
- Perform a detailed statistical analysis
- Run a quick ML model to test the data
- Start cleaning and wrangling the data
Before moving to model building, it's important to first understand the dataset you're working with. The initial EDA steps would typically include exploring the structure of the dataset, summarizing the data (such as calculating central tendency measures and dispersion), and creating visualizations to uncover patterns, trends, and relationships.
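The first two steps (structure and summary) could be sketched as follows using only the Python standard library. The `records` list and the `describe` helper are hypothetical; in practice a dataframe library would usually do this, and the visualization step is left out here.

```python
import statistics

# Hypothetical toy dataset; None marks a missing value.
records = [
    {"age": 34, "city": "Austin", "income": 52000.0},
    {"age": 29, "city": "Boston", "income": None},
    {"age": 45, "city": "Austin", "income": 61000.0},
    {"age": 51, "city": "Denver", "income": 75000.0},
]

def describe(records):
    """Per-column structure (type, missing count) and summary statistics."""
    summary = {}
    for col in records[0]:
        values = [r[col] for r in records]
        present = [v for v in values if v is not None]
        info = {
            "type": type(present[0]).__name__,
            "missing": len(values) - len(present),
        }
        if isinstance(present[0], (int, float)):
            # Central tendency and dispersion for numeric columns.
            info["mean"] = statistics.mean(present)
            info["median"] = statistics.median(present)
            info["stdev"] = statistics.stdev(present)
        else:
            # Cardinality for categorical columns.
            info["unique"] = len(set(present))
        summary[col] = info
    return summary

summary = describe(records)
for col, info in summary.items():
    print(col, info)
```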
How does standardization (z-score) affect the distribution of data?
- It doesn't affect the shape of the distribution
- It makes the distribution normal
- It makes the distribution uniform
- It skews the distribution
Standardization does not change the shape of a feature's distribution; it only changes its location and scale. Skewness and kurtosis stay the same, while the data is re-centered to a mean of 0 and rescaled to a standard deviation of 1.
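A small check of this claim, assuming the population form of the standard deviation and a made-up right-skewed sample. The helper names `skewness` and `standardize` are illustrative.

```python
import statistics

def skewness(xs):
    """Third standardized moment (population form)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def standardize(xs):
    """z-score each value: subtract the mean, divide by the standard deviation."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return [(x - m) / s for x in xs]

data = [1, 2, 2, 3, 3, 3, 4, 10, 25]  # hypothetical right-skewed sample
z = standardize(data)

print("mean after:", statistics.fmean(z))
print("std after:", statistics.pstdev(z))
print("skewness before/after:", skewness(data), skewness(z))
```

The skewness matches before and after (up to floating-point error), so a skewed feature stays skewed; making it look normal would need a different transform, such as a log or power transform.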
You are analyzing the number of calls received by a call center per hour. Which distribution would be most suitable for modeling this data and why?
- Binomial Distribution because it represents the number of successes in a given number of trials
- Normal Distribution because it represents continuous data
- Poisson Distribution because it models the number of events occurring in a fixed interval of time
- Uniform Distribution because all outcomes are equally likely
The Poisson Distribution is most suitable for modeling the number of calls received by a call center per hour because it models the number of events (calls) occurring in a fixed interval of time (per hour).
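As an illustration, the Poisson probability mass function P(X = k) = e^(-λ) λ^k / k! can be evaluated directly. The rate of 12 calls per hour below is an assumed value for the example, not something given in the question.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events in an interval with average rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 12  # assumed average number of calls per hour

# Probability of exactly 10 calls in a given hour.
p_10 = poisson_pmf(10, lam)

# Probability of 20 or more calls, via the complement of 0..19.
p_20_plus = 1 - sum(poisson_pmf(k, lam) for k in range(20))

print(p_10, p_20_plus)
```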
Consider a data distribution with a positive skewness and a high kurtosis. What does this scenario indicate about the distribution?
- It has a symmetrical distribution.
- It has evenly spread out values.
- It has many values clustered around the left tail with potential outliers.
- It has many values clustered around the right tail with potential outliers.
Positive skewness means the distribution has a long right tail, so most values sit at the lower (left) end. High kurtosis means the tails are heavy, so extreme values are more likely than under a normal distribution. Together, they indicate data concentrated on the left with potential outliers toward the larger values.
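A quick way to see both properties on a sample, assuming population-form standardized moments and a made-up dataset with one large value. The helper `standardized_moment` is an illustrative name.

```python
import statistics

def standardized_moment(xs, k):
    """k-th standardized moment: the mean of ((x - mean) / std) ** k."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** k for x in xs) / len(xs)

# Hypothetical sample: most values are small, one is far to the right.
data = [1, 1, 2, 2, 2, 3, 3, 4, 5, 30]

skew = standardized_moment(data, 3)
excess_kurtosis = standardized_moment(data, 4) - 3  # 0 for a normal distribution

print("skewness:", skew)
print("excess kurtosis:", excess_kurtosis)
print("median vs mean:", statistics.median(data), statistics.fmean(data))
```

Both statistics come out positive here, and the mean is pulled above the median by the single large value.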
What range of values does a dataset typically have after Min-Max scaling?
- -1 to 1
- 0 to 1
- Depends on the dataset
- Depends on the feature
Min-Max scaling transforms features by scaling each feature to a given range. The default range for the Min-Max scaling technique is 0 to 1. Therefore, after Min-Max scaling, the dataset will typically have values ranging from 0 to 1.
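The transformation is x' = lo + (x - min) * (hi - lo) / (max - min). A minimal sketch, with `min_max_scale` as a hypothetical helper and made-up input values:

```python
def min_max_scale(xs, lo=0.0, hi=1.0):
    """Linearly map xs so the smallest value becomes lo and the largest becomes hi."""
    x_min, x_max = min(xs), max(xs)
    if x_max == x_min:
        # A constant feature has no range to stretch.
        raise ValueError("cannot min-max scale a constant feature")
    return [lo + (x - x_min) * (hi - lo) / (x_max - x_min) for x in xs]

values = [10, 20, 40]
scaled = min_max_scale(values)               # default range 0 to 1
scaled_sym = min_max_scale(values, -1.0, 1.0)  # custom range -1 to 1

print(scaled)
print(scaled_sym)
```

The 0 to 1 range is the default, not a fixed property; passing a different target range changes the output bounds.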
What is the term for the measure of how spread out the values in a data set are?
- Central Tendency
- Dispersion
- Kurtosis
- Skewness
The measure of how spread out the values in a data set are is called "Dispersion". Common measures of dispersion include the range, interquartile range (IQR), variance, and standard deviation.
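These four measures can be computed with the standard library. The sample below is made up, and the variance and standard deviation use the sample (n - 1) form.

```python
import statistics

data = [4, 8, 15, 16, 23, 42]  # hypothetical sample

data_range = max(data) - min(data)

# quantiles(n=4) returns the three quartile cut points: Q1, Q2, Q3.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

variance = statistics.variance(data)  # sample variance
std_dev = statistics.stdev(data)      # sample standard deviation

print(data_range, iqr, variance, std_dev)
```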
What is a key difference between qualitative data and quantitative data when it comes to analysis methods?
- All types of data are analyzed in the same way
- Qualitative data is always easier to analyze
- Qualitative data typically requires textual analysis, while quantitative data can be analyzed mathematically
- Quantitative data can't be used for statistical analysis
Qualitative data often requires textual or thematic analysis, categorizing the data based on traits or characteristics. Quantitative data, being numerical, can be analyzed using mathematical or statistical methods.
The _________ method in regression analysis can help reduce the impact of Multicollinearity.
- Chi-Square
- Least squares
- Logistic Regression
- Ridge Regression
Ridge Regression is a regularization technique that can help reduce the impact of multicollinearity. It adds a penalty proportional to the sum of the squared coefficients (an L2 penalty) to the loss function, which shrinks the coefficients of correlated predictors and stabilizes their estimates.
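A minimal sketch of the effect, assuming two centered, nearly collinear predictors and no intercept, so the ridge solution is the closed form (X^T X + λI)^(-1) X^T y solved as a 2x2 system. The function `ridge_2d`, the data, and the choice λ = 1 are all illustrative assumptions.

```python
import math

def ridge_2d(x1, x2, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for two predictors without an intercept."""
    a = sum(v * v for v in x1) + lam          # X^T X [0][0] + lam
    b = sum(u * v for u, v in zip(x1, x2))    # X^T X [0][1]
    d = sum(v * v for v in x2) + lam          # X^T X [1][1] + lam
    c1 = sum(u * v for u, v in zip(x1, y))    # X^T y [0]
    c2 = sum(u * v for u, v in zip(x2, y))    # X^T y [1]
    det = a * d - b * b
    return ((d * c1 - b * c2) / det, (a * c2 - b * c1) / det)

# Hypothetical centered data: x2 is almost a copy of x1.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-2.1, -0.9, 0.1, 0.9, 2.0]
y = [-4.2, -1.8, 0.1, 1.8, 4.1]

ols = ridge_2d(x1, x2, y, lam=0.0)    # lam = 0 reduces to ordinary least squares
ridge = ridge_2d(x1, x2, y, lam=1.0)

print("OLS coefficients:  ", ols)
print("Ridge coefficients:", ridge)
print("norms:", math.hypot(*ols), math.hypot(*ridge))
```

With λ = 0 the near-singular X^T X makes the coefficients sensitive to small changes in the data; adding λ to the diagonal makes the system better conditioned and shrinks the coefficient vector.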
Which measure of central tendency divides a data set into two equal halves?
- Mean
- Median
- Mode
The "Median" is the measure of central tendency that divides a data set into two equal halves. It is the middle score for a set of ordered data such that 50% of the scores are above it, and 50% are below it.
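The definition can be sketched directly: sort the values and take the middle one, or the average of the two middle ones when the count is even. The `median` helper and the sample are illustrative; `statistics.median` from the standard library does the same job.

```python
import statistics

def median(xs):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

scores = [7, 1, 9, 3, 5, 100]  # hypothetical sample with one extreme value
m = median(scores)

below = sum(x < m for x in scores)
above = sum(x > m for x in scores)

print("median:", m, "below:", below, "above:", above)
print("mean:", statistics.fmean(scores))
```

Half of the values fall on each side of the median, and unlike the mean it is barely affected by the extreme value.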