You are analyzing customer purchasing behavior and the data exhibits high skewness. What could be the potential challenges and how can you address them?
- Data normality assumptions may be violated; address this with transformation techniques.
- No challenges would be encountered.
- Skewness would make the data easier to analyze.
- The mean would become more reliable, no action is needed.
High skewness can violate the normality assumptions required by many statistical tests and machine learning models, and it makes the mean a less reliable summary because extreme values pull it away from the typical observation. A common remedy is a data transformation, such as a log, square root, or inverse transformation, that makes the distribution more symmetrical.
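As a rough illustration, here is a minimal pandas sketch of a log transformation; the `purchase_amount` column name and the sample values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed purchase amounts (column name is assumed)
df = pd.DataFrame({"purchase_amount": [12, 15, 18, 22, 25, 30, 45, 60, 250, 900]})

print("Skewness before:", df["purchase_amount"].skew())

# log1p = log(1 + x), which handles zeros safely; np.sqrt or 1/x are alternatives
df["purchase_amount_log"] = np.log1p(df["purchase_amount"])

print("Skewness after:", df["purchase_amount_log"].skew())
```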
In the context of EDA, you find that certain features in your dataset are highly correlated. How would you interpret this finding and how might it affect your analysis?
- The presence of multicollinearity may require you to consider it in your model selection or feature engineering steps
- You should combine the correlated features into one
- You should remove all correlated features
- You should use only correlated features in your analysis
High correlation between features indicates multicollinearity. This can be problematic in certain types of models (like linear regression) as it can destabilize the model and make the effects of predictor variables hard to separate. Depending on the severity of multicollinearity, you may need to consider it during model selection or feature engineering steps, such as removing highly correlated variables, combining them, or using regularization techniques.
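One way to surface such pairs during EDA is to scan the correlation matrix for values above a chosen threshold. The sketch below uses made-up features (the column names and the 0.8 cutoff are assumptions, not a rule):

```python
import numpy as np
import pandas as pd

# Hypothetical feature matrix; "e" is built to be strongly correlated with "a"
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 4)), columns=["a", "b", "c", "d"])
df["e"] = df["a"] * 0.9 + rng.random(100) * 0.1

corr = df.corr().abs()
# Keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_corr = [(r, c, round(upper.loc[r, c], 2))
             for r in upper.index for c in upper.columns
             if upper.loc[r, c] > 0.8]
print(high_corr)  # e.g. [('a', 'e', 0.99)]
```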
Why is multicollinearity a potential issue in data analysis and predictive modeling?
- It can cause instability in the coefficient estimates of regression models.
- It can cause the data to be skewed.
- It can cause the mean and median of the data to be significantly different.
- It can lead to overfitting in machine learning models.
Multicollinearity can cause instability in the coefficient estimates of regression models. This means that small changes in the data can lead to large changes in the model, making the interpretation of the output problematic and unreliable.
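A small simulated sketch (entirely made-up data, not from the original material) shows this instability: fitting the same regression on two bootstrap resamples of nearly collinear predictors can produce very different coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # x2 is nearly identical to x1
y = 3 * x1 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x1, x2])

# Fit the same model on two bootstrap resamples and compare coefficients
for seed in (1, 2):
    idx = np.random.default_rng(seed).integers(0, n, size=n)
    print(LinearRegression().fit(X[idx], y[idx]).coef_)
# The coefficient vectors can swing wildly between resamples even though
# the fitted predictions barely change: the hallmark of multicollinearity.
```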
During a data analysis project, your team came up with a novel hypothesis after examining patterns and trends in your dataset. Which type of analysis will be the best for further exploring this hypothesis?
- All are equally suitable
- CDA
- EDA
- Predictive Modeling
EDA would be most suitable in this case as it provides a flexible framework for exploring patterns, trends, and relationships in the data, allowing for a deeper understanding and further exploration of the novel hypothesis.
Which method of handling missing data removes only the instances where certain variables are missing, preserving the rest of the data in the row?
- Listwise Deletion
- Mean Imputation
- Pairwise Deletion
- Regression Imputation
The 'Pairwise Deletion' method excludes a case only from calculations that involve the variable that is missing, so the rest of the data in that row is still used. This retains as much data as possible, but it can lead to inconsistencies (different analyses are based on different subsets of cases) and bias if the data are not missing completely at random.
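The contrast with listwise deletion can be seen in a short pandas sketch; the columns and values below are invented for illustration. `df.corr()` uses pairwise-complete observations, while `df.dropna()` is listwise deletion.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29],
    "income": [40000, np.nan, 52000, 61000, 45000],
    "spend":  [1200, 1500, 1100, np.nan, 1300],
})

# Listwise deletion: drop the entire row if any value is missing
print(df.dropna().shape)   # only the fully complete rows remain

# Pairwise deletion: each pairwise statistic uses every row where BOTH
# variables are present, so different cells rest on different subsets
print(df.corr())
```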
How can incorrect handling of missing data impact the bias-variance trade-off in a machine learning model?
- Does not affect the bias-variance trade-off.
- Increases bias and reduces variance.
- Increases both bias and variance.
- Increases variance and reduces bias.
Improper handling of missing data, for example naive mean imputation, tends to increase bias and reduce variance: every missing value is replaced by the same central value, which artificially shrinks the spread of the data while pulling estimates toward the mean, so the model learns systematically distorted patterns.
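A quick simulated sketch (made-up normal data, 30% of values removed at random) shows how mean imputation shrinks the measured spread:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
s = pd.Series(rng.normal(loc=100, scale=20, size=1000))

# Remove 30% of values at random, then mean-impute them
s_missing = s.mask(rng.random(s.size) < 0.3)
s_imputed = s_missing.fillna(s_missing.mean())

print("Std before:", round(s.std(), 2))
print("Std after mean imputation:", round(s_imputed.std(), 2))  # noticeably smaller
```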
How does the IQR method categorize a data point as an outlier?
- By comparing it to the mean
- By comparing it to the median
- By comparing it to the standard deviation
- By checking whether it falls below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR
The IQR method flags a data point as an outlier if it falls below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 − Q1 is the interquartile range.
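A minimal pandas sketch of the rule, using invented sample values:

```python
import pandas as pd

s = pd.Series([5, 7, 8, 9, 10, 11, 12, 13, 14, 95])  # 95 looks suspicious

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(s[(s < lower) | (s > upper)])  # flags 95 as an outlier
```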
You're working with a data set that does not follow a normal distribution. Which method, Z-score or IQR, should be used for detecting outliers?
- Both are suitable
- IQR
- Neither is suitable
- Z-score
In this case, the IQR method is the better choice because it does not assume any particular distribution, unlike the Z-score method, which assumes the data are normally distributed.
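To illustrate the difference, the sketch below applies both rules to simulated right-skewed (lognormal) data; the distribution, thresholds, and sample size are assumptions for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
s = pd.Series(rng.lognormal(mean=3, sigma=1, size=1000))  # right-skewed data

# Z-score rule (implicitly assumes roughly normal data)
z = (s - s.mean()) / s.std()
z_flagged = (z.abs() > 3).sum()

# IQR rule (makes no distributional assumption)
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_flagged = ((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum()

print(z_flagged, iqr_flagged)  # the two rules disagree on skewed data
```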
You are visualizing a heatmap and notice a row with colors drastically different than the rest. What might this indicate about the corresponding variable?
- The variable has a unique distribution
- The variable has many missing values
- The variable is an outlier
- The variable is unrelated to the others
If a row in a heatmap has colors drastically different from the rest, it may indicate that the corresponding variable is unrelated to the others, i.e., it has very weak or very different relationships with the other variables in the dataset.
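For illustration, a small seaborn sketch of a correlation heatmap in which one made-up variable ("noise") is deliberately unrelated to the rest, so its row and column stand out:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical features; "noise" is deliberately unrelated to the others
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "visits": base + rng.normal(scale=0.3, size=200),
    "purchases": base + rng.normal(scale=0.3, size=200),
    "revenue": base + rng.normal(scale=0.3, size=200),
    "noise": rng.normal(size=200),
})

# The "noise" row/column shows near-zero correlations, so it stands out
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```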
How does standard deviation differ in a sample versus a population?
- The denominator in the calculation of the sample standard deviation is (n-1)
- The standard deviation of a sample is always larger
- The standard deviation of a sample is always smaller
- They are calculated in the same way
The standard deviation of a sample differs from that of a population in how it is calculated: for a sample, the denominator is (n − 1) instead of n. This is Bessel's correction, which compensates for the fact that a sample otherwise tends to underestimate the population variance.
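In NumPy the difference is controlled by the `ddof` argument; the sample values below are invented for illustration.

```python
import numpy as np

sample = np.array([4.0, 8.0, 6.0, 5.0, 3.0])

pop_std = sample.std(ddof=0)    # population formula: divide by n
samp_std = sample.std(ddof=1)   # sample formula: divide by (n - 1), Bessel's correction

print(pop_std, samp_std)        # the (n - 1) version is slightly larger
```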