Why might you prefer to use multiple imputation over a simpler method like mean imputation?
- Mean imputation always leads to bias
- Multiple imputation is easier to use
- Multiple imputation is quicker
- Multiple imputation provides more accurate estimates
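You might prefer multiple imputation over a simpler method like mean imputation because it provides more accurate estimates. It generates several plausible values for each missing entry, analyzes each completed dataset, and pools the results, so the uncertainty around the true value carries through to the final standard errors. It also better preserves the relationships between variables, whereas mean imputation shrinks the variance of the imputed variable and weakens its correlations with other variables.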
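The variance shrinkage caused by mean imputation can be seen on a toy example. This is a minimal sketch using only the standard library; the sample values are hypothetical, and it illustrates the weakness of mean imputation rather than implementing multiple imputation itself.

```python
import statistics

# Hypothetical toy sample; None marks a missing value.
observed = [2.0, 4.0, None, 8.0, None, 10.0]

# Values that were actually observed.
present = [x for x in observed if x is not None]
mean_value = statistics.mean(present)

# Mean imputation: every gap is filled with the same value.
mean_imputed = [mean_value if x is None else x for x in observed]

# Sample variance before and after imputation.
var_present = statistics.variance(present)
var_imputed = statistics.variance(mean_imputed)

print(f"variance of observed values: {var_present:.2f}")
print(f"variance after mean imputation: {var_imputed:.2f}")
```

The imputed series looks less spread out than the data really are, which is one reason estimates built on it tend to be overconfident.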
_______ is a type of data analysis that helps in formulating hypotheses while the primary purpose of _______ is to test the formulated hypotheses.
- CDA, EDA
- EDA, CDA
- EDA, Predictive Modeling
- Predictive Modeling, EDA
EDA (Exploratory Data Analysis) is used to understand data patterns or trends and to formulate hypotheses, while CDA (Confirmatory Data Analysis) is applied to test those formulated hypotheses.
What is the name of the statistical measure that shows the degree of the relationship between two variables?
- Correlation coefficient
- Mean
- Standard deviation
- Variance
The statistical measure that shows the degree of relationship between two variables is called the correlation coefficient. It quantifies the direction and strength of the linear relationship between a pair of variables. Values range between -1 and +1, with -1 indicating a perfect negative correlation, +1 a perfect positive correlation, and 0 no linear correlation (a nonlinear relationship may still exist).
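As an illustration, the Pearson correlation coefficient can be computed by hand. The sketch below assumes the Pearson (linear) form of the coefficient; the function name `pearson_r` and the sample data are made up for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])    # y rises with x in a straight line
r_neg = pearson_r(x, [10, 8, 6, 4, 2])    # y falls with x in a straight line
# y = x**2 on a symmetric range: clearly related, but not linearly.
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_pos, r_neg, r_curve)
```

The last case shows why a coefficient of 0 should be read as "no linear relationship" rather than "no relationship".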
What is the role of Principal Component Analysis (PCA) in handling Multicollinearity?
- PCA assigns weights to variables based on their importance.
- PCA creates new uncorrelated variables from the original set of correlated variables.
- PCA eliminates variables with low variance.
- PCA increases the dimensionality of the data set.
PCA creates a new set of uncorrelated variables (principal components) from the original set of correlated variables. These principal components are linear combinations of the original variables and are orthogonal to each other, so using them as predictors in place of the original variables removes the multicollinearity.
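A small numerical check can make this concrete. The sketch below assumes NumPy is available and uses synthetic data in which the second variable is almost a multiple of the first; it performs PCA through an eigendecomposition of the covariance matrix, which is one common way to do it.

```python
import numpy as np

# Synthetic, nearly collinear predictors (assumed data for illustration).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2])

# Center the data, then eigendecompose the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Project onto the eigenvectors to get the principal component scores.
scores = Xc @ eigvecs

corr_original = np.corrcoef(X, rowvar=False)[0, 1]
corr_components = np.corrcoef(scores, rowvar=False)[0, 1]

print(f"correlation between original variables: {corr_original:.3f}")
print(f"correlation between components: {corr_components:.1e}")
```

The original columns are strongly correlated, while the component scores are uncorrelated up to floating-point error.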
In which scenarios would the distinction between continuous and discrete data become crucial?
- All scenarios
- When cleaning the data
- When developing a regression model
- When selecting a data visualization technique
The distinction between continuous and discrete data becomes crucial when developing regression models, because the appropriate model depends on the type of the outcome variable. For instance, linear regression is used when the outcome is continuous, while logistic regression is used when the outcome is binary or categorical.
When we use Min-Max scaling, the transformed data will fall into the range of ____ to ____.
- -1 to 1
- -5 to 5
- 0 to 1
- 0 to 10
In Min-Max scaling, the transformed data will fall into the range of 0 to 1. This is because Min-Max scaling subtracts the minimum value from each data point and then divides by the range of the data: x' = (x - min) / (max - min). The minimum therefore maps to 0 and the maximum maps to 1.
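The calculation can be sketched in a few lines. The helper name `min_max_scale` and the sample values are hypothetical; the guard for a constant column is an added assumption, since the formula is undefined when the maximum equals the minimum.

```python
def min_max_scale(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: the range is zero, so scaling is undefined.
        raise ValueError("cannot scale a constant sequence")
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([10, 20, 25, 40])
print(scaled)
```

The smallest input becomes 0, the largest becomes 1, and everything else lands proportionally in between.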
What do we call the technique of deleting pairs of data where one is missing in a pairwise analysis?
- Listwise Deletion
- Mean Imputation
- Pairwise Deletion
- Regression Imputation
The technique of deleting pairs of data where one is missing in a pairwise analysis is called 'Pairwise Deletion'. This method maximizes the amount of data retained, as it only removes the specific pairs with a missing value rather than the entire row, but it can lead to inconsistent results due to different pairs being used in different analyses.
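The difference from listwise deletion shows up in how many observations each analysis keeps. This is a minimal sketch with hypothetical records and a made-up helper, `complete_pairs`, that keeps only the rows where both variables of a given pair are present.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"a": 1.0, "b": 2.0, "c": None},
    {"a": 2.0, "b": None, "c": 5.0},
    {"a": 3.0, "b": 6.0, "c": 7.0},
    {"a": 4.0, "b": 8.0, "c": 9.0},
]

def complete_pairs(rows, x, y):
    """Pairwise deletion: keep rows where both x and y are observed."""
    return [(r[x], r[y]) for r in rows if r[x] is not None and r[y] is not None]

# Listwise deletion: drop any row with at least one missing value.
listwise = [r for r in rows if all(v is not None for v in r.values())]

pairs_ab = complete_pairs(rows, "a", "b")
pairs_ac = complete_pairs(rows, "a", "c")
pairs_bc = complete_pairs(rows, "b", "c")

print(len(listwise), len(pairs_ab), len(pairs_ac), len(pairs_bc))
```

Each pair is analyzed on a different subset of rows, which is where both the extra retained data and the potential inconsistency come from.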
During your EDA process, you identify several outliers in your dataset. How does this finding impact your subsequent steps in data analysis?
- You may need to collect more data
- You may need to ignore these outliers as they are anomalies
- You might consider robust methods or outlier treatment methods for your analysis
- You might decide to use a different dataset
Identifying outliers during the EDA process would influence the subsequent steps in data analysis. The outliers could indicate errors, but they could also be true data points. Depending on the context, you might need to investigate the reasons for their presence, treat them appropriately (for example, using robust statistical methods, data transformations, or outlier removal), or revise your analysis techniques to accommodate them.
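One simple example of a robust method is summarizing with the median instead of the mean. The sketch below uses hypothetical values and the standard library only; it is meant to show the idea, not to recommend a particular treatment.

```python
import statistics

# Hypothetical measurements, with and without one extreme value.
clean = [10, 12, 13, 15, 16, 18, 20]
with_outlier = [10, 12, 13, 15, 16, 18, 200]

mean_clean = statistics.mean(clean)
mean_outlier = statistics.mean(with_outlier)
median_clean = statistics.median(clean)
median_outlier = statistics.median(with_outlier)

print(f"mean:   {mean_clean:.2f} -> {mean_outlier:.2f}")
print(f"median: {median_clean} -> {median_outlier}")
```

The mean is pulled strongly toward the extreme value, while the median does not move.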
In a scatter plot, outliers often appear as points that are far removed from the ___________.
- axes
- main concentration of data
- origin
- trend line
In a scatter plot, outliers are often represented as points that are far removed from the main concentration of data.
________ is a measure of dispersion that is particularly useful when the data set has outliers.
- Interquartile Range
- Range
- Standard Deviation
- Variance
The Interquartile Range (IQR) is particularly useful when the dataset has outliers because it measures only the spread of the middle 50% of the data (Q3 - Q1), so a few extreme values in the tails do not affect it. This makes it a robust measure of dispersion.
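The robustness of the IQR can be checked by comparing it with the plain range on data with and without an extreme value. This sketch uses hypothetical values and `statistics.quantiles` with the "inclusive" method; other quartile conventions give slightly different numbers, so the choice of method here is an assumption.

```python
import statistics

def iqr(values):
    """Interquartile range: Q3 - Q1, using the inclusive quartile method."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical measurements, with and without one extreme value.
clean = [10, 12, 13, 15, 16, 18, 20]
with_outlier = [10, 12, 13, 15, 16, 18, 200]

range_clean = max(clean) - min(clean)
range_outlier = max(with_outlier) - min(with_outlier)

print(f"range: {range_clean} -> {range_outlier}")
print(f"IQR:   {iqr(clean)} -> {iqr(with_outlier)}")
```

The range jumps when the largest value is replaced by an outlier, while the IQR stays the same because the quartiles are unchanged.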