_______ is a type of data analysis that helps in formulating hypotheses, while the primary purpose of _______ is to test the formulated hypotheses.
- CDA, EDA
- EDA, CDA
- EDA, Predictive Modeling
- Predictive Modeling, EDA
EDA (Exploratory Data Analysis) is used to understand data patterns or trends and to formulate hypotheses, while CDA (Confirmatory Data Analysis) is applied to test those formulated hypotheses.
What is the name of the statistical measure that shows the degree of the relationship between two variables?
- Correlation coefficient
- Mean
- Standard deviation
- Variance
The statistical measure that shows the degree of relationship between two variables is called the correlation coefficient. It quantifies the direction and strength of the relationship between pairs of variables. Values range between -1 and +1, with -1 indicating a perfect negative correlation, +1 a perfect positive correlation, and 0 no correlation.
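As a quick illustration, here is a minimal sketch that computes the Pearson correlation coefficient with NumPy on made-up data:

```python
import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is Pearson's r = cov(x, y) / (std(x) * std(y))
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.3f}")  # close to +1: strong positive relationship
```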
What is the role of Principal Component Analysis (PCA) in handling Multicollinearity?
- PCA assigns weights to variables based on their importance.
- PCA creates new uncorrelated variables from the original set of correlated variables.
- PCA eliminates variables with low variance.
- PCA increases the dimensionality of the data set.
PCA creates a new set of uncorrelated variables (principal components) from the original set of correlated variables. These principal components are linear combinations of the original variables, and they are orthogonal to each other, hence eliminating multicollinearity.
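To make this concrete, the following sketch on synthetic, deliberately collinear data shows PCA producing uncorrelated components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical predictors: x2 is essentially x1 plus a little noise
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)
X = np.column_stack([x1, x2])
print(np.corrcoef(X, rowvar=False)[0, 1])  # near 1: strong multicollinearity

# PCA rotates the data onto orthogonal axes (principal components)
Z = PCA().fit_transform(X)
print(np.corrcoef(Z, rowvar=False)[0, 1])  # ~0: components are uncorrelated
```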
In which scenarios would the distinction between continuous and discrete data become crucial?
- All scenarios
- When cleaning the data
- When developing a regression model
- When selecting a data visualization technique
The distinction between continuous and discrete data becomes crucial when developing regression models, since the outcome type determines which model is appropriate. For instance, linear regression models a continuous outcome, while logistic regression models a discrete (categorical) outcome.
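A brief sketch, using hypothetical data with scikit-learn, of how the outcome type drives the model choice:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1))

# Continuous outcome -> linear regression
y_continuous = 3 * X[:, 0] + rng.normal(scale=0.5, size=100)
linear_model = LinearRegression().fit(X, y_continuous)

# Discrete (binary) outcome -> logistic regression
y_binary = (X[:, 0] > 0).astype(int)
logistic_model = LogisticRegression().fit(X, y_binary)
```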
When we use Min-Max scaling, the transformed data will fall into the range of ____ to ____.
- -1 to 1
- -5 to 5
- 0 to 1
- 0 to 10
In Min-Max scaling, the transformed data falls into the range of 0 to 1. This is because each data point is transformed as (x - min) / (max - min): the minimum value maps to 0, the maximum maps to 1, and every other value falls in between.
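A minimal sketch of the transformation on made-up numbers:

```python
import numpy as np

x = np.array([10.0, 20.0, 35.0, 50.0])  # hypothetical data

# Min-Max scaling: (x - min) / (max - min)
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)  # [0.    0.25  0.625 1.   ] -- always within [0, 1]
```

scikit-learn's `MinMaxScaler` performs the same transformation column-wise on a feature matrix.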
What do we call the technique of deleting pairs of data where one is missing in a pairwise analysis?
- Listwise Deletion
- Mean Imputation
- Pairwise Deletion
- Regression Imputation
The technique of deleting pairs of data where one value is missing in a pairwise analysis is called 'Pairwise Deletion'. This method maximizes the amount of data retained: a case is excluded only from calculations that involve its missing variable, rather than from the entire dataset. However, it can produce inconsistent results, because different subsets of cases are used in different analyses.
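For example, pandas applies pairwise deletion by default when computing correlations, which makes the contrast with listwise deletion easy to see on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with scattered missing values
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [2.0, np.nan, 3.0, 5.0, 6.0],
    "c": [1.0, 2.0, 3.0, np.nan, 5.0],
})

# Pairwise deletion: each correlation uses all rows complete for that pair,
# so different correlations may be based on different subsets of rows
print(df.corr())

# Listwise deletion for contrast: drop any row with a missing value first
print(df.dropna().corr())
```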
What kind of data visualization would be most suitable for high-dimensional datasets?
- Bar chart
- Parallel coordinates or a scatter plot matrix
- Pie chart
- Scatter plot
Visualizing high-dimensional datasets (those with many variables) can be challenging. However, techniques like parallel coordinates or a scatter plot matrix can help. In a parallel coordinates plot, each variable gets its own vertical axis, and each data point is drawn as a line connecting its values across those axes. A scatter plot matrix shows all pairwise scatter plots of the variables.
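Both techniques are available in pandas; a minimal sketch on a small hypothetical dataset:

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates, scatter_matrix

# Hypothetical four-dimensional dataset with a class label
df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [4.0, 3.0, 2.0, 1.0],
    "f3": [2.0, 2.5, 3.0, 3.5],
    "f4": [1.5, 1.0, 2.5, 3.0],
    "label": ["a", "a", "b", "b"],
})

# Parallel coordinates: one vertical axis per variable, one line per row
parallel_coordinates(df, class_column="label")
plt.show()

# Scatter plot matrix: all pairwise scatter plots of the numeric variables
scatter_matrix(df.drop(columns="label"))
plt.show()
```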
In a scenario where you have to visualize real-time data for a live audience, what factors would you consider in your data visualization strategy?
- Complexity of the graph, because it needs to impress the audience
- Simplicity and clarity, because the audience needs to understand the data quickly
- The amount of data, because more data is always better
- The color scheme, because it needs to be eye-catching
When visualizing real-time data for a live audience, simplicity and clarity are key factors. The audience needs to understand the data quickly as it updates in real time. A clear and straightforward graph type, simple labels, and a thoughtful color scheme can help achieve this.
Why might you prefer to use multiple imputation over a simpler method like mean imputation?
- Mean imputation always leads to bias
- Multiple imputation is easier to use
- Multiple imputation is quicker
- Multiple imputation provides more accurate estimates
You might prefer to use multiple imputation over a simpler method like mean imputation because multiple imputation provides more accurate estimates. This is because it generates several plausible values for each missing entry, producing multiple completed datasets whose variation reflects the uncertainty about the true values. It also preserves the relationships between variables better than substituting a single mean.
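scikit-learn's `IterativeImputer` can approximate this: it performs a single imputation per run, but drawing from the posterior and repeating with different seeds yields several completed datasets in the spirit of multiple imputation (MICE). A sketch on synthetic data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan  # ~10% of values missing at random

# Several plausible completed datasets; the spread across them reflects
# the uncertainty about the true values of the missing entries
completed = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]
```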
During your EDA process, you identify several outliers in your dataset. How does this finding impact your subsequent steps in data analysis?
- You may need to collect more data
- You may need to ignore these outliers as they are anomalies
- You might consider robust methods or outlier treatment methods for your analysis
- You might decide to use a different dataset
Identifying outliers during the EDA process would influence the subsequent steps in data analysis. The outliers could indicate errors, but they could also be true data points. Depending on the context, you might need to investigate the reasons for their presence, treat them appropriately (for example, using robust statistical methods, data transformations, or outlier removal), or revise your analysis techniques to accommodate them.
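As one concrete approach, the common 1.5 * IQR rule flags outliers, and comparing the mean with the median shows why robust statistics matter (made-up numbers):

```python
import numpy as np

# Hypothetical sample with one extreme value
x = np.array([9.8, 10.1, 10.3, 9.9, 10.0, 55.0])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(outliers)  # [55.]

# The mean is dragged upward by the outlier; the median barely moves
print(np.mean(x), np.median(x))  # ~17.52 vs 10.05
```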