You are working on a dataset with ordinal variables. You are interested in the correlation between these variables. Which correlation coefficient would be the best choice and why?
- Covariance
- Kendall's Tau
- Pearson's correlation coefficient
- Spearman's correlation coefficient
In a dataset with ordinal variables, Spearman's correlation coefficient would be the best choice. This is because Spearman's correlation coefficient does not assume that data is normally distributed and works with ranks, making it suitable for ordinal data.
Which type of missing data relies on information that is not included in the dataset?
- MAR
- MCAR
- NMAR
- nan
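NMAR (Not Missing At Random) is the type of missing data that relies on information not included in the dataset: the probability that a value is missing depends on unobserved data, such as the missing value itself.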
Consider a situation where a Spearman's correlation coefficient between two variables was found to be significantly different from zero, but the Pearson's correlation was not. How would you explain this?
- Data has outliers
- Data is normally distributed
- Relationship between variables is linear
- Relationship between variables is non-linear
If Spearman's correlation is significantly different from zero but Pearson's correlation is not, it suggests that the relationship between the variables is not linear. Spearman's correlation works on ranks, so it can capture any monotonic relationship, whereas Pearson's measures only the strength of a linear one.
The kernel in a kernel density plot is a _____ function.
- Exponential
- Linear
- Logarithmic
- Smoothing
In a kernel density plot, the kernel is a smoothing function. It takes the raw data and smooths it into a continuous curve, providing a clear picture of the distribution and density of data.
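To make the smoothing idea concrete, here is a minimal sketch of a kernel density estimate with a Gaussian kernel. The function names, the sample values, and the hand-picked bandwidth are all assumptions for illustration; plotting libraries typically choose the bandwidth automatically.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def kde(data, bandwidth):
    """Return a function that estimates the density at any point x."""
    n = len(data)

    def density(x):
        # Each observation contributes one kernel "bump" centred on itself.
        total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in data)
        return total / (n * bandwidth)

    return density


sample = [1.2, 1.9, 2.1, 2.4, 3.0, 3.3, 5.8]  # hypothetical observations
estimate = kde(sample, bandwidth=0.5)         # bandwidth chosen by hand

for x in (1.0, 2.0, 3.0, 4.0, 5.0, 6.0):
    print(f"x = {x:.1f}  density = {estimate(x):.3f}")
```

A larger bandwidth gives a smoother, flatter curve; a smaller one follows the individual observations more closely.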
The _____ is the most appropriate measure of central tendency when the distribution of data is heavily skewed.
- Mean
- Median
- Mode
- Standard Deviation
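The median is the most appropriate measure of central tendency when the distribution of data is heavily skewed. Extreme values in the long tail pull the mean toward them, while the median depends only on the middle of the ordered data and is therefore much less affected by outliers and skew.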
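A quick illustration with made-up salary figures, where a single extreme value skews the distribution to the right:

```python
from statistics import mean, median

# Hypothetical salaries in thousands; one very large value creates the skew.
salaries = [30, 32, 35, 38, 40, 42, 45, 1000]

print(f"mean   = {mean(salaries)}")    # 157.75, pulled up by the extreme value
print(f"median = {median(salaries)}")  # 39.0, close to a typical salary
```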
You have to present the sales data of a company over 10 years to the board of directors. What type of graph should you choose and why?
- Histogram, because it shows distributions
- Line graph, because it shows trends over time
- Pie chart, because it shows proportions
- Scatter plot, because it shows relationships between variables
A line graph would be the most suitable choice for presenting sales data over time. Line graphs are well suited to values recorded at equal time intervals, such as yearly sales figures, because connecting consecutive points makes the direction of change visible. This would allow the board of directors to easily see trends, patterns, and fluctuations in the sales data.
Multicollinearity refers to a situation where _________.
- All variables in a model are perfectly uncorrelated
- Two or more predictors in a regression model are highly correlated
- Two variables are uncorrelated
- Two variables have a correlation coefficient of zero
Multicollinearity refers to a situation in which two or more predictors in a regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy.
Imagine you are analyzing a dataset with continuous variables with outliers. The focus is on understanding the linear relationship between these variables. What type of correlation coefficient should you consider?
- Covariance
- Kendall's Tau
- Pearson's correlation coefficient
- Spearman's correlation coefficient
In this case, considering that the data is continuous and there's an interest in understanding the linear relationship, Pearson's correlation coefficient should be used. However, one should be cautious about outliers as Pearson's correlation is sensitive to them.
The primary goal of EDA is to ________.
- Build a final model
- Predict the future outcomes
- Prove a preconceived notion
- Understand the underlying structure of the data
The primary goal of EDA is to understand the underlying structure of the data. This understanding includes gaining insights about the distribution, variability, and relationships among variables in the data, and it helps guide the choice of appropriate models or further data transformations.
The IQR method defines an outlier as any value below Q1 - _______ or above Q3 + _______.
- 1.5*IQR
- 2*IQR
- 2.5*IQR
- 3*IQR
In the IQR method, an outlier is any value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, where IQR = Q3 - Q1.
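The rule can be sketched as follows. The function name `iqr_outliers` and the data are hypothetical, and the quartiles use the "inclusive" method of `statistics.quantiles`; other quartile conventions can give slightly different fences.

```python
from statistics import quantiles


def iqr_outliers(data, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in data if v < lower or v > upper]


values = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]  # hypothetical measurements
print(iqr_outliers(values))  # [102]
```

For this data Q1 = 12 and Q3 = 13.75, so IQR = 1.75 and the fences are 9.375 and 16.375; only 102 falls outside them.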
You have built a model for credit risk assessment with 100 features. Upon evaluation, you find that only 20 features have significant predictive power. How would you proceed?
- Increase the number of features
- Keep all the features
- Retrain the model using only the 20 significant features
- Use all the features for model training
If only 20 features have significant predictive power, it might be beneficial to retrain the model using only these features. Reducing the number of features can make the model simpler, easier to interpret, and faster to train. It can also reduce the risk of overfitting.
How does EDA facilitate the identification of important variables in a dataset?
- By exploring relationships between variables and their relation to the outcome variable
- By fitting a predictive model to the data
- By performing a cost-benefit analysis of each variable
- By running a feature importance algorithm on the dataset
EDA facilitates the identification of important variables by exploring relationships between variables and their relation to the outcome variable. Visualizations and summary statistics can highlight which variables have strong relationships with the outcome variable, and these variables are often important for predictive modeling.