Which of the following best describes qualitative data?
- Data that can be categorized
- Data that can be ordered
- Data that can take any value
- Data that is numerical in nature
Qualitative data refers to non-numerical information that can be categorized based on traits and characteristics. It captures information that cannot be simply expressed in numbers.
What plot is particularly useful for comparing the distribution of data across levels of a categorical variable?
- Bar chart
- Pie chart
- Scatter plot
- Violin plot
Violin plots are useful for comparing the distribution of data across levels of a categorical variable. They combine the characteristics of box plots and density plots. The violin plot features a kernel density estimation of the underlying distribution of the data.
How does a scatter plot differ from a pairplot when representing bivariate relationships?
- A scatter plot can only represent one bivariate relationship at a time
- A scatter plot cannot represent bivariate relationships
- A scatter plot is only used for categorical variables
- A scatter plot uses colors to differentiate variables
A scatter plot differs from a pairplot in that it only represents one bivariate relationship at a time, while a pairplot shows all pairwise relationships between multiple variables.
What is the main characteristic of Robust Scaling?
- It is not affected by outliers
- It scales features to a specific range
- It scales the data to unit variance
- It's the most complex scaling technique
Robust scaling uses techniques that are robust to outliers. This method removes the median and scales the data according to the quantile range (Interquartile Range: IQR). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
In a scenario where a machine learning model is showing unexpectedly high training time, how could incorrect handling of missing data be a factor?
- Missing data might have created outliers in the data.
- Missing data might have increased the complexity of the model.
- Missing data might have increased the dimensionality of the data.
- Missing data might have introduced multicollinearity in the data.
Incorrectly handling missing data (such as one-hot encoding missing values) can increase the dimensionality of the dataset, leading to a longer training time due to the curse of dimensionality.
A __________ graph would be most suitable for visualizing a dataset with two numerical variables.
- Bar chart
- Line chart
- Pie chart
- Scatter plot
A scatter plot would be most suitable for visualizing a dataset with two numerical variables. It provides a graphical view of the correlation, or relationship between two sets of data.
In the EDA process, where does the 'communication' step typically occur?
- After concluding
- After exploring
- Before questioning
- Before wrangling
In the EDA process, the 'communication' step typically occurs after concluding. It involves effectively conveying the findings, insights, or conclusions drawn from the data to relevant stakeholders.
Given a boxplot of a data set, how can you determine the IQR, and what does it tell you about the data?
- Add the value of the lower quartile to the upper quartile
- Divide the range by 2
- Subtract the value of the lower quartile from the upper quartile
- Take the square root of the range
From a boxplot, you can determine the "Interquartile Range (IQR)" by "Subtracting the value of the lower quartile from the upper quartile". The IQR measures the range of the middle 50% of the data, which gives you a sense of the spread of the central data.
Suppose you are dealing with time series data with some missing values and you decided to use regression imputation. What potential issues might arise and how could you address them?
- May lead to overfitting; Address by adding more data
- May violate independence assumption; Address by considering time dependence
- May violate uniform distribution; Address by transforming data
- No issues might arise
In time series data, observations are usually dependent on time, so the independence assumption of regression imputation may be violated. This issue can be addressed by considering time dependence in the regression model used for imputation, for example by including lagged variables.
How is Multicollinearity typically detected in a dataset?
- By calculating the Variance Inflation Factor (VIF).
- By performing a simple linear regression.
- By performing a t-test.
- By visually inspecting the data.
Multicollinearity is typically detected by calculating the Variance Inflation Factor (VIF). A high VIF indicates a high degree of multicollinearity between the independent variables.