What type of data is Spearman's correlation most suitable for?
- Categorical data
- Continuous, normally distributed data
- Nominal data
- Ordinal data
Spearman's correlation is most suitable for ordinal data. It measures how well the relationship between two variables can be described by a monotonic function. Because it is computed on ranks rather than raw values, it works with ordinal data, where the order of the values is meaningful but the size of the differences between them is not.
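As a minimal sketch of the rank-based idea, Spearman's coefficient can be computed as the Pearson correlation of the ranks. The helper names (`average_ranks`, `spearman`) and the survey ratings below are made up for illustration; in practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    # Replace the ranks of tied values with their average.
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation computed on the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

# Hypothetical ordinal data: survey ratings coded 1 (low) to 5 (high).
satisfaction = [1, 2, 2, 3, 4, 5]
loyalty = [1, 1, 2, 3, 5, 5]
rho_ordinal = spearman(satisfaction, loyalty)

# A monotonic but nonlinear relationship: y always rises with x.
x = np.array([1, 2, 3, 4, 5])
y = x ** 3
rho_monotonic = spearman(x, y)                      # ranks agree perfectly
pearson_monotonic = float(np.corrcoef(x, y)[0, 1])  # linear fit is imperfect

print(rho_ordinal, rho_monotonic, pearson_monotonic)
```

For the cubic example the ranks of `x` and `y` match exactly, so Spearman's coefficient reaches 1 while Pearson's stays below 1.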
How is Multicollinearity typically detected in a dataset?
- By calculating the Variance Inflation Factor (VIF).
- By performing a simple linear regression.
- By performing a t-test.
- By visually inspecting the data.
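Multicollinearity is typically detected by calculating the Variance Inflation Factor (VIF) for each independent variable. A VIF of 1 means the variable is uncorrelated with the other predictors; a high VIF (a common rule of thumb is above 5 or 10) indicates a high degree of multicollinearity between the independent variables.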
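The calculation can be sketched as follows, assuming the usual definition VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors. The function name `vif` and the synthetic columns are illustrative only.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    result = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the other columns, with an intercept term.
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        result.append(1.0 / (1.0 - r_squared))
    return np.array(result)

# Synthetic predictors: c is almost a copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)

vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

In this setup the near-duplicate columns `a` and `c` should receive large VIFs, while the independent column `b` stays close to 1.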
After exploring and interpreting your data, you would '______' your findings in the EDA process.
- communicate
- conclude
- question
- wrangle
After exploring and interpreting your data, you would 'conclude' your findings in the EDA process. This is the step where you draw actionable insights from the data you have explored and analyzed.
Which type of graph would be most suitable for showing the relationship between two variables?
- Bar graph
- Histogram
- Pie chart
- Scatter plot
A scatter plot is most suitable for showing the relationship between two variables. Each point on the plot corresponds to one pair of data values, with its position along the x-axis and y-axis representing the values of the two variables. This allows patterns and relationships to be identified visually.
You are required to create a complex statistical plot to identify and present possible correlations between multiple variables in your dataset. Which Python library would be the most appropriate for this task?
- Bokeh
- Matplotlib
- Plotly
- Seaborn
Seaborn is best suited for creating complex statistical plots. It provides high-level, attractive statistical plots and integrates well with pandas DataFrames, allowing direct use of column names for the axes and other arguments.
How does kurtosis impact the interpretation of data distribution?
- It affects how we perceive the outliers and tail risks.
- It affects the reliability of the mean.
- It changes the standard deviation of the dataset.
- It influences the choice of graph to use.
Kurtosis impacts the interpretation of a data distribution by affecting how we perceive outliers and tail risk. High kurtosis indicates heavy tails, so extreme outcomes are more likely than under a normal distribution, whereas low kurtosis indicates light tails and a lower chance of extreme outcomes.
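A small sketch of this comparison, assuming the excess-kurtosis convention (fourth standardized moment minus 3, so a normal distribution scores about 0). The function name `excess_kurtosis` and the simulated samples are illustrative.

```python
import numpy as np

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (about 0 for normal data)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    m4 = np.mean(centered ** 4)
    return float(m4 / m2 ** 2 - 3.0)

rng = np.random.default_rng(42)
normal_sample = rng.normal(size=100_000)   # reference tails
heavy_sample = rng.laplace(size=100_000)   # heavier tails than normal
light_sample = rng.uniform(size=100_000)   # lighter tails than normal

k_normal = excess_kurtosis(normal_sample)
k_heavy = excess_kurtosis(heavy_sample)
k_light = excess_kurtosis(light_sample)
print(k_normal, k_heavy, k_light)
```

The Laplace sample should give a clearly positive value and the uniform sample a clearly negative one, matching the heavy-tail and light-tail reading above.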
You are given the variance of a data set. How can you use this information to find the standard deviation, and why might you want to do this?
- Add up all the variances to get the standard deviation
- Divide the variance by the number of data points to get the standard deviation
- Square the variance to get the standard deviation
- Take the square root of the variance to get the standard deviation
If you are given the variance, take its square root to get the standard deviation. This is useful because the standard deviation is expressed in the same units as the original data (the variance is in squared units), which makes it easier to interpret.
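A short worked example using only the standard library; the data values are made up for illustration and treated as a full population.

```python
import math
import statistics

# Hypothetical measurements, treated as a complete population.
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.pvariance(data)  # population variance, in squared units
std_dev = math.sqrt(variance)          # back in the original units

print(variance, std_dev)
```

Here the mean is 5, the squared deviations sum to 32, so the variance is 32 / 8 = 4 and the standard deviation is 2.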
What plot is particularly useful for comparing the distribution of data across levels of a categorical variable?
- Bar chart
- Pie chart
- Scatter plot
- Violin plot
Violin plots are useful for comparing the distribution of data across levels of a categorical variable. They combine the characteristics of box plots and density plots. The violin plot features a kernel density estimation of the underlying distribution of the data.
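A minimal sketch with Seaborn's `violinplot`, assuming Seaborn, pandas, and Matplotlib are installed. The `group` and `score` columns are synthetic and exist only for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: one numeric column measured in three categories.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], 100),
    "score": np.concatenate([
        rng.normal(50, 5, 100),          # narrow, symmetric
        rng.normal(60, 10, 100),         # wider, symmetric
        rng.exponential(10, 100) + 40,   # skewed to the right
    ]),
})

fig, ax = plt.subplots()
# One violin per level of the categorical variable.
sns.violinplot(data=df, x="group", y="score", ax=ax)
ax.set_title("Score distribution by group")
```

Each violin's width follows the kernel density estimate for that group, so differences in spread and skew between the groups are visible side by side.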
How does a scatter plot differ from a pairplot when representing bivariate relationships?
- A scatter plot can only represent one bivariate relationship at a time
- A scatter plot cannot represent bivariate relationships
- A scatter plot is only used for categorical variables
- A scatter plot uses colors to differentiate variables
A scatter plot differs from a pairplot in that it only represents one bivariate relationship at a time, while a pairplot shows all pairwise relationships between multiple variables.
What is the main characteristic of Robust Scaling?
- It is not affected by outliers
- It scales features to a specific range
- It scales the data to unit variance
- It's the most complex scaling technique
The main characteristic of robust scaling is that it is largely unaffected by outliers. The method subtracts the median and scales the data by the interquartile range (IQR), which is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile). Because the median and the IQR change little when a few extreme values are present, the scaling itself is not distorted by them.
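A minimal NumPy sketch of the formula described above, (x - median) / IQR. The function names and sample values are illustrative; scikit-learn's `RobustScaler` offers a library implementation of the same idea.

```python
import numpy as np

def robust_scale(x):
    """Center on the median and divide by the interquartile range."""
    x = np.asarray(x, dtype=float)
    median = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    return (x - median) / (q3 - q1)

def standard_scale(x):
    """Center on the mean and divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
with_outlier = values.copy()
with_outlier[-1] = 1000.0  # replace the largest value with an outlier

# Compare how the eight ordinary values are scaled in each case.
robust_clean = robust_scale(values)[:-1]
robust_dirty = robust_scale(with_outlier)[:-1]
standard_clean = standard_scale(values)[:-1]
standard_dirty = standard_scale(with_outlier)[:-1]

print(robust_clean, robust_dirty)
print(standard_clean, standard_dirty)
```

In this example the median and quartiles are the same with or without the outlier, so the robustly scaled ordinary values do not change, while the mean and standard deviation shift and the standard-scaled values change with them.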