What does a correlation coefficient close to 0 indicate about the relationship between two variables?
- A perfect negative linear relationship
- A perfect positive linear relationship
- A very strong linear relationship
- No linear relationship
A correlation coefficient close to 0 indicates little or no linear relationship between the two variables: changes in one variable are not consistently associated with changes in the other. It does not mean the variables are unrelated, since a strong non-linear relationship can still produce a coefficient near 0.
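The caveat about non-linear relationships can be checked numerically. The following is a minimal sketch in plain Python; the helper `pearson_r` and the data are illustrative assumptions, not part of the original question. It compares a linear relationship with y = x^2 on a range symmetric around 0.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = list(range(-5, 6))            # -5 .. 5, symmetric around 0
linear = [2 * x + 1 for x in xs]   # perfectly linear in x
quadratic = [x ** 2 for x in xs]   # fully determined by x, but not linear

r_linear = pearson_r(xs, linear)
r_quadratic = pearson_r(xs, quadratic)

print(r_linear, r_quadratic)
```

Here `quadratic` depends entirely on `xs`, yet its coefficient is 0 because the positive and negative halves cancel out.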
What step comes after 'wrangling' in the EDA process?
- Communicating
- Concluding
- Exploring
- Questioning
Once the data has been 'wrangled', i.e., cleaned and transformed, the next step in the EDA process is 'exploring'. This stage involves examining the data through statistical analysis and visual methods.
In a dataset with a categorical variable missing for some rows, why might mode imputation not be the best strategy?
- All of the above
- It can introduce bias if the data is not missing at random
- It could distort the original data distribution
- It may not capture the underlying data pattern
Mode imputation is simple to implement, but all three concerns apply: it may fail to capture the underlying data pattern, it can introduce bias if the data is not missing at random, and it distorts the original data distribution by overrepresenting the mode.
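The overrepresentation effect can be seen on a small example. This is a sketch with a made-up categorical column (the values and variable names are assumptions for illustration), comparing the share of the most frequent category before and after mode imputation.

```python
from collections import Counter

# Hypothetical categorical column; None marks a missing value.
colors = ["red", "blue", "red", None, "green", None, "blue", "red", None, None]

observed = [c for c in colors if c is not None]
mode_value = Counter(observed).most_common(1)[0][0]

# Mode imputation: every missing entry becomes the most frequent category.
imputed = [c if c is not None else mode_value for c in colors]

# Share of the mode among observed values vs. after imputation.
share_before = Counter(observed)[mode_value] / len(observed)
share_after = Counter(imputed)[mode_value] / len(imputed)

print(mode_value, share_before, share_after)
```

In this toy data the mode's share rises from half of the observed values to 70% of the imputed column, even though nothing is known about the true missing categories.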
In a scenario where your dataset has a Gaussian distribution, which scaling method is typically recommended and why?
- All scaling methods work equally well with Gaussian distributed data
- Min-Max scaling because it scales all values between 0 and 1
- Robust scaling because it is not affected by outliers
- Z-score standardization because it creates a normal distribution
Z-score standardization is typically recommended for a dataset with a Gaussian distribution. It does not create a normal distribution, because it leaves the shape of the data unchanged; it rescales the data to a mean of 0 and a standard deviation of 1, so data that is already Gaussian ends up matching the standard normal distribution.
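As a worked illustration, z-score standardization computes z = (x - mean) / standard deviation for each value. The sketch below uses Python's `statistics` module on made-up numbers; using the population standard deviation (`pstdev`) rather than the sample one is an assumption made here for simplicity.

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative data

mu = statistics.mean(values)        # centre of the data
sigma = statistics.pstdev(values)   # population standard deviation

# Standardize: subtract the mean, divide by the standard deviation.
z_scores = [(v - mu) / sigma for v in values]

print(z_scores)
print(statistics.mean(z_scores), statistics.pstdev(z_scores))
```

The standardized values have mean 0 and standard deviation 1, while their ordering and relative spacing are the same as in the original data.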
How can mishandling missing data in a feature affect the feature's importance in a machine learning model?
- Decreases the feature's importance.
- Depends on the feature's initial importance.
- Has no effect on the feature's importance.
- Increases the feature's importance.
Mishandling missing data can distort the feature's distribution and skew its statistical properties, which can reduce the importance the model assigns to that feature during training.
You're using a model that is sensitive to multicollinearity. How can feature selection help improve your model's performance?
- By adding more features
- By removing highly correlated features
- By transforming the features
- By using all features
If you're using a model that is sensitive to multicollinearity, feature selection can help by removing highly correlated features. Multicollinearity can make some models unstable and hurt their performance, and dropping features that are highly correlated with others alleviates the problem.
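One simple way to do this is a greedy correlation filter, sketched below in plain Python. The feature names, the data, the helpers `pearson_r` and `drop_correlated`, and the 0.9 threshold are all assumptions chosen for illustration; this is one possible approach, not the only one.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical features: height_in is nearly a rescaled copy of height_cm.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [35.0, 22.0, 41.0, 29.0, 33.0],
}

def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = []
    for name, column in features.items():
        if all(abs(pearson_r(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

selected = drop_correlated(features)
print(selected)
```

The filter keeps the first feature of each highly correlated group, so the result depends on the order of the features; in practice one might instead keep whichever feature is more interpretable or more predictive.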
What is the process of removing an entire row when any single data point within it is missing called?
- Listwise Deletion
- Mean Imputation
- Pairwise Deletion
- Regression Imputation
The process of removing an entire row when any single data point within it is missing is called 'Listwise Deletion'. Also known as 'Complete Case Analysis', this technique is straightforward and fast, but it can potentially discard valuable data and introduce bias if the missingness is not completely at random.
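A minimal sketch of listwise deletion in plain Python, using made-up records in which `None` marks a missing value (the field names and data are assumptions for illustration):

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "city": "Oslo"},
    {"age": None, "income": 61000, "city": "Lima"},
    {"age": 29, "income": None, "city": "Pune"},
    {"age": 45, "income": 48000, "city": None},
    {"age": 51, "income": 75000, "city": "Kyiv"},
]

# Listwise deletion: keep a row only if every field is present.
complete_cases = [row for row in rows if all(v is not None for v in row.values())]
dropped = len(rows) - len(complete_cases)

print(len(complete_cases), dropped)
```

Each of the three dropped rows is missing only one field, yet all of its other values are discarded too, which shows how quickly this method can shrink a dataset. In pandas, `DataFrame.dropna()` with its default arguments performs the same row-wise removal.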
What functionality does the Seaborn library add over Matplotlib?
- 3D plotting
- Interactive plotting
- Real-time plotting
- Statistical plotting
Matplotlib is a powerful library for creating a wide range of plots. Seaborn builds on it with high-level statistical plotting capabilities, allowing users to create more informative and attractive visualizations with fewer lines of code.
Which measure of central tendency can be used for both quantitative and qualitative data?
- Mean
- Median
- Mode
- nan
The mode is the measure of central tendency that can be used for both quantitative and qualitative data. It is the value that appears most frequently in a data set, and it is the only measure of central tendency that can be used with nominal data.
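For illustration, Python's `statistics.mode` accepts both kinds of data. The two small data sets below are made up for this sketch.

```python
import statistics

shirt_sizes = ["M", "L", "M", "S", "M", "L"]  # qualitative (category labels)
exam_scores = [70, 85, 85, 90, 70, 85]        # quantitative

# The mode only counts occurrences, so it needs no arithmetic on the values.
size_mode = statistics.mode(shirt_sizes)
score_mode = statistics.mode(exam_scores)

print(size_mode, score_mode)
```

The mean and median, by contrast, require values that can be added or at least ordered, which plain category labels cannot.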
Which method for dealing with missing data might introduce bias if the data is not missing completely at random?
- Listwise Deletion
- Mean/Median/Mode Imputation
- Pairwise Deletion
- Regression Imputation
Mean/Median/Mode Imputation might introduce bias if the data is not missing completely at random. If the missingness follows a systematic pattern, replacing missing values with the mean, median, or mode can understate the variability of the data and lead to biased results.
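The effect on variability can be shown with a small sketch. The numbers below are made up, and the sketch assumes half of the values are missing; it compares the variance of the observed values with the variance after mean imputation.

```python
import statistics

observed = [10.0, 12.0, 14.0, 16.0, 18.0]  # illustrative observed values
n_missing = 5                               # assumed number of missing entries

# Mean imputation: every missing entry is replaced by the observed mean.
mean_value = statistics.mean(observed)
imputed = observed + [mean_value] * n_missing

var_observed = statistics.pvariance(observed)
var_imputed = statistics.pvariance(imputed)

print(mean_value, var_observed, var_imputed)
```

The mean is unchanged, but the variance is halved, because the imputed entries all sit exactly at the centre. If the truly missing values were systematically high or low, the imputed mean itself would also be off.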