_______ is typically used when the data analyst has no specific expectations from the data, whereas _______ is used when the analyst wants to confirm certain assumptions.
- CDA, EDA
- EDA, CDA
- EDA, Predictive Modeling
- Predictive Modeling, EDA
EDA (Exploratory Data Analysis) is typically used when the data analyst does not have specific expectations or hypotheses about the data. It is an open-ended process where we aim to discover patterns and anomalies in the data. CDA (Confirmatory Data Analysis), on the other hand, is used when the analyst wants to confirm or refute certain assumptions or hypotheses.
Imagine a dataset with negative skewness and low kurtosis. How would this influence your data interpretation and statistical tests?
- It would not impact the interpretation or statistical tests.
- The data would be less likely to have outliers and the distribution would be wider.
- The data would be more likely to have outliers and the distribution would be narrow.
- The mean of the dataset would be greater than the median.
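Negative skewness means that the tail of the distribution extends towards lower values while most values are clustered on the right, at the higher end, so the mean typically falls below the median. Low kurtosis (a platykurtic distribution) means the tails are lighter than those of a normal distribution: the data is flatter and more spread out, and extreme outliers are less likely. Both properties are departures from normality, so statistical tests that assume a normal distribution should be applied with caution.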
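The two measures above can be computed directly from their definitions as the third and fourth standardized moments. The following is a minimal sketch using only the Python standard library; the function names and the `scores` sample are hypothetical, chosen for illustration.

```python
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((x - mean) / sd) ** 3 for x in values) / len(values)

def excess_kurtosis(values):
    """Population kurtosis minus 3, so a normal distribution scores about 0."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((x - mean) / sd) ** 4 for x in values) / len(values) - 3.0

# Hypothetical sample: counts rise towards the high end, with a short tail on the left.
scores = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]

print("skewness:", skewness(scores))                # negative: tail points left
print("excess kurtosis:", excess_kurtosis(scores))  # negative: flatter than normal
print("mean:", statistics.fmean(scores), "median:", statistics.median(scores))
```

For this sample both statistics come out negative and the mean sits below the median, which matches the description of a left-skewed, platykurtic distribution.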
How does the Z-score method perform when the data is not normally distributed?
- It performs better
- It performs the same
- It performs worse
- Its performance is independent of the data distribution
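The Z-score method assumes a Gaussian distribution and can perform poorly when the data is not normally distributed. Skewed or heavy-tailed data distort the mean and standard deviation that the Z-score relies on, which can lead to over- or under-identification of outliers.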
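To illustrate the point, here is a small sketch of a Z-score outlier rule applied to skewed data. The helper `zscore_outliers`, the threshold of 3, and the synthetic sample are assumptions made for this example, not a prescribed implementation.

```python
import math
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [x for x in values if abs(x - mean) / sd > threshold]

# Hypothetical right-skewed sample: evenly spaced quantiles of an exponential
# distribution, so every point is a legitimate observation, not an anomaly.
n = 200
skewed = [-math.log(1 - (i + 0.5) / n) for i in range(n)]

flagged = zscore_outliers(skewed)
sample_mean = statistics.fmean(skewed)
print("flagged:", len(flagged), "of", n)
print("all flagged values lie above the mean:", all(x > sample_mean for x in flagged))
```

In this sample the rule flags only points in the long upper tail, even though none of them is unusual for the distribution, and no value on the low side comes close to the threshold.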
Why is variance considered a squared measure?
- Because it involves squaring the difference from the mean
- Because it is always a perfect square
- Because it's derived from the square of the data values
- Because it's the square root of the standard deviation
Variance is considered a squared measure because it involves squaring the difference of each value from the mean. Squaring prevents positive and negative differences from cancelling each other out, and it means that variance is expressed in the squared units of the data.
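A short worked example makes the cancellation visible. This is a sketch on a hypothetical data set, using only the Python standard library.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
mean = statistics.fmean(data)

deviations = [x - mean for x in data]
squared = [d ** 2 for d in deviations]

# Raw deviations cancel out, so their sum carries no information about spread.
print("sum of deviations:", sum(deviations))

# Squared deviations cannot cancel; their average is the population variance.
variance = sum(squared) / len(data)
print("variance:", variance)

# Taking the square root returns to the original units (standard deviation).
print("standard deviation:", math.sqrt(variance))
```

The standard deviation is the square root of the variance, not the other way round, which is why the last answer option is incorrect.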
What type of data is based on measurements or counts?
- Nominal data
- Ordinal data
- Qualitative data
- Quantitative data
Quantitative data is based on measurements or counts. It's typically numerical and can be used in mathematical and statistical operations.
Which measure of central tendency is calculated by adding all the numbers and dividing by the number of numbers?
- Mean
- Median
- Mode
The mean is calculated by adding all the numbers in the data set and then dividing by how many numbers there are. It is often referred to as the average and summarizes the center of the data in a single value.
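As a quick illustration, the calculation can be written out by hand and compared with the other two measures of central tendency. The `values` list is a hypothetical example.

```python
import statistics

values = [3, 5, 5, 8, 14]  # hypothetical data set

# Mean: add all the numbers, then divide by how many there are.
manual_mean = sum(values) / len(values)

print("mean:", manual_mean)
print("median:", statistics.median(values))  # middle value of the sorted data
print("mode:", statistics.mode(values))      # most frequent value
```

Note how the single large value 14 pulls the mean above the median, while the median and mode are unaffected by it.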
What are some common methods to handle Multicollinearity in a dataset?
- All of these methods can be used.
- Increasing the sample size
- Performing Principal Component Analysis
- Removing highly correlated variables
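All of the listed methods can help with multicollinearity; which one to choose depends on its severity and the specific context. Removing one variable from each highly correlated pair eliminates the redundancy directly. Principal Component Analysis (PCA) replaces the correlated variables with a smaller set of uncorrelated components. Increasing the sample size does not remove the correlation itself, but it reduces the variance of the coefficient estimates and so lessens its impact.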
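Of these options, removing highly correlated variables is the easiest to sketch in a few lines. The example below assumes a greedy rule with a hypothetical cutoff of 0.9 on the absolute Pearson correlation; the feature names and values are made up for illustration, and the sketch assumes no column is constant.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def drop_correlated(features, threshold=0.9):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical features: height is recorded twice, in centimetres and in inches.
height_cm = [150, 160, 165, 170, 180, 175]
features = {
    "height_cm": height_cm,
    "height_in": [h / 2.54 for h in height_cm],
    "age_years": [30, 22, 45, 28, 35, 51],
}

print(drop_correlated(features))  # height_in is dropped as redundant
```

The result depends on the order of the features, since the first variable of each correlated pair is the one that is kept. In practice the choice of which variable to keep is usually guided by domain knowledge.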
Which type of data can take on any value within a certain range?
- Categorical data
- Continuous data
- Discrete data
- Nominal data
Continuous data can take on any value within a certain range. For example, the height of a person can be any value within the range of human heights.
Suppose your model is overfitting. You identify that missing data was incorrectly filled with a constant value. How might this have contributed to overfitting?
- The model became too complex.
- The model learned noise from the data.
- The model was under-regularized.
- The model's hyperparameters were not optimized.
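Filling missing data with a constant value can introduce noise into the data, because the constant is an artificial value that does not reflect the true distribution of the feature. The model may then learn this noise along with the underlying patterns, which leads to overfitting.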
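The distortion caused by a constant fill is easy to see in summary statistics. The sketch below compares a constant fill of 0 with median imputation, which is used here only as one common alternative; the `incomes` data and the helper `fill_missing` are hypothetical.

```python
import statistics

def fill_missing(values, fill_value):
    """Replace every None entry with fill_value."""
    return [fill_value if v is None else v for v in values]

# Hypothetical incomes in thousands, with two missing entries.
incomes = [42, 45, None, 47, 50, None, 44, 48]
observed = [v for v in incomes if v is not None]

constant_filled = fill_missing(incomes, 0)
median_filled = fill_missing(incomes, statistics.median(observed))

for label, column in [("constant 0", constant_filled), ("median", median_filled)]:
    print(label, "mean:", statistics.fmean(column), "stdev:", statistics.pstdev(column))
```

The constant fill drags the mean down and inflates the spread by creating two artificial extreme values. Median imputation keeps the center of the observed data but slightly understates its spread, so it is not free of side effects either.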
Which type of data analysis helps the most in feature selection for Machine Learning?
- All of them equally contribute.
- CDA
- EDA
- Predictive Modeling
EDA plays a significant role in feature selection for Machine Learning. Through the exploration of relationships between features and the target variable, and the identification of potential data issues like multicollinearity, EDA can help analysts determine which features are most relevant for a given machine learning model.