Which outlier detection method is less sensitive to extreme values in a dataset?
- IQR method
- Standard deviation method
- Z-score method
The IQR (Interquartile Range) method is less sensitive to extreme values than the z-score or standard deviation methods. The IQR is the difference between the upper (75th percentile) and lower (25th percentile) quartiles, and quartiles depend on the ranks of values rather than their magnitudes, so a few extreme points barely move them.
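A minimal sketch of the common 1.5 × IQR rule, assuming NumPy is available and using made-up values:

```python
import numpy as np

# Hypothetical data containing a couple of extreme values.
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 17, 19, 107])

q1, q3 = np.percentile(data, [25, 75])          # lower and upper quartiles
iqr = q3 - q1                                   # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # common outlier fences

outliers = data[(data < lower) | (data > upper)]
print(outliers)   # flags the extreme values (102 and 107) without being skewed by them
```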
Imagine you're working with a dataset where the standard deviation is very small. How might this impact the effectiveness of z-score standardization?
- It will make the z-score standardization more effective
- It will not affect the z-score standardization
- The scaled values will be very large due to the small standard deviation
- The scaled values will be very small due to the small standard deviation
Z-score standardization scales data by subtracting the mean and dividing by the standard deviation. If the standard deviation is very small, dividing by it amplifies even modest deviations from the mean, so the scaled values can become very large.
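A minimal sketch of this effect, using made-up numbers and a plain NumPy computation:

```python
import numpy as np

# Hypothetical reference data with a tiny spread (standard deviation ~0.001).
train = np.array([100.001, 100.002, 100.000, 100.003, 99.999])
mu, sigma = train.mean(), train.std()

new_point = 100.5                  # only 0.5 above the mean in absolute terms
z = (new_point - mu) / sigma
print(z)                           # hundreds of "standard deviations" -> a very large scaled value
```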
What is the first step in the Exploratory Data Analysis process?
- Concluding
- Exploring
- Questioning
- Wrangling
The first step in the EDA process is questioning, i.e., defining the questions the analysis aims to answer based on the problem's context and the available data.
How does the Variance Inflation Factor (VIF) quantify the severity of Multicollinearity in a regression analysis?
- By calculating the square root of the variance of a predictor.
- By comparing the variance of a predictor to the variance of the outcome variable.
- By measuring how much the variance of an estimated regression coefficient is increased due to multicollinearity.
- By summing up the variances of all the predictors.
VIF quantifies multicollinearity by measuring how much the variance of an estimated regression coefficient is inflated because that predictor is correlated with the other predictors: VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing predictor i on the remaining predictors. If the predictors are uncorrelated, each VIF equals 1; the higher the VIF, the more severe the multicollinearity.
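A minimal sketch of computing a VIF per predictor, assuming statsmodels and pandas are available and using simulated, deliberately correlated data:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=200)   # strongly correlated with x1
x3 = rng.normal(size=200)                         # roughly independent of the others
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)   # x1 and x2 show inflated VIFs; x3 stays close to 1
```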
What information is needed to calculate a Z-score for a particular data point?
- Only the mean of the dataset
- Only the standard deviation of the dataset
- The mean and standard deviation of the dataset
- The median and interquartile range of the dataset
To calculate a Z-score for a particular data point, you need to know the mean and standard deviation of the dataset. The Z-score is calculated by subtracting the mean from the data point and then dividing by the standard deviation.
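A minimal worked example with made-up numbers:

```python
# z = (x - mean) / std
x, mean, std = 75, 70, 5
z = (x - mean) / std
print(z)   # 1.0 -> the point lies one standard deviation above the mean
```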
What are some factors to consider when choosing between a scatter plot, pairplot, correlation matrix, and heatmap?
- Just the number of variables
- Just the type of data
- Number of variables, Type of data, Audience's familiarity with the plots, All of these
- Only the audience's familiarity with the plots
Choosing between a scatter plot, pairplot, correlation matrix, and heatmap depends on several factors, including the number of variables you want to visualize, the type of data you're working with, and how familiar your audience is with these types of plots.
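A minimal sketch of the plot types, assuming seaborn and matplotlib are available and using seaborn's bundled "iris" dataset purely for illustration:

```python
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("iris")

# Scatter plot: one relationship between two variables at a time.
sns.scatterplot(data=df, x="sepal_length", y="petal_length", hue="species")
plt.show()

# Pairplot: every pairwise relationship for a handful of numeric variables.
sns.pairplot(df, hue="species")
plt.show()

# Correlation matrix rendered as a heatmap: a compact summary for many variables.
sns.heatmap(df.drop(columns="species").corr(), annot=True, cmap="coolwarm")
plt.show()
```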
Which machine learning models are more susceptible to the issue of feature redundancy?
- All of the above
- Decision Trees
- Linear Models
- Neural Networks
Linear models are more susceptible to feature redundancy because their coefficients are estimated under the assumption that each feature contributes independent information. Redundant (highly correlated) features introduce multicollinearity, which inflates the variance of the coefficient estimates and makes them unstable and hard to interpret.
Which of the following scenarios is an example of Multicollinearity?
- The age and the size of a car.
- The amount of time studying and the grade in an exam.
- The size of a house and its price.
- The temperature outside and the amount of sunlight in a day.
The temperature outside and the amount of sunlight in a day are likely to be highly correlated, since more sunlight generally corresponds to higher temperatures. When both are used as predictors in the same regression model, this correlation is an example of multicollinearity.
When a dataset is normally distributed, the mean, median, and mode will all be _____.
- Different
- The same
- Undefined
- Zero
In a normal distribution, the mean, median, and mode are all the same, falling at the center of the distribution.
You're working on a high-dimensional dataset with many redundant features. Which feature selection methods might help reduce the dimensionality while maintaining the essential information?
- Embedded methods
- Filter methods
- Principal Component Analysis (PCA)
- Wrapper methods
Principal Component Analysis (PCA) is a dimensionality reduction technique that can be used when dealing with high-dimensional datasets with many redundant features. PCA transforms the original features into a new set of uncorrelated features, capturing the most variance in the data, thus helping to maintain the essential information while reducing the dimensionality.
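A minimal sketch, assuming scikit-learn is available and using simulated redundant features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
signals = rng.normal(size=(500, 5))                # 5 underlying signals
redundant = signals @ rng.normal(size=(5, 20))     # 20 features built from those 5 signals

pca = PCA(n_components=0.95)                       # keep enough components for 95% of the variance
reduced = pca.fit_transform(redundant)

print(redundant.shape, "->", reduced.shape)        # roughly (500, 20) -> (500, 5)
print(pca.explained_variance_ratio_.sum())         # variance retained by the kept components
```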
The process of converting an actual range of values in a numeric feature column into a standard range of values is known as _____.
- Binning
- Data Encoding
- Data Integration
- Data Scaling
The process of converting an actual range of values in a numeric feature column into a standard range of values is known as Data Scaling. This is a fundamental step in data preprocessing, particularly important for machine learning algorithms that are sensitive to feature magnitudes, such as distance-based and gradient-based methods.
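A minimal sketch of two common scaling approaches, assuming scikit-learn is available and using made-up values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[18.0], [25.0], [40.0], [60.0], [75.0]])

print(MinMaxScaler().fit_transform(ages).ravel())    # rescaled to the 0-1 range
print(StandardScaler().fit_transform(ages).ravel())  # rescaled to mean 0, unit variance
```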
Which scaling technique is most affected by the presence of outliers?
- Min-Max scaling
- Robust scaling
- Standardization
The Min-Max scaling technique, which scales the data to a fixed range (usually 0 to 1), is highly sensitive to the presence of outliers. Because the minimum and maximum values define the scale, a single extreme value stretches the range and compresses the remaining, non-outlier values into a narrow slice of it.
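A minimal sketch of this effect, assuming scikit-learn is available and using made-up values with one deliberate outlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

values = np.array([[10.0], [12.0], [11.0], [13.0], [12.0], [500.0]])  # 500 is an outlier

print(MinMaxScaler().fit_transform(values).ravel())
# the non-outlier points are squeezed into a tiny slice near 0

print(RobustScaler().fit_transform(values).ravel())
# median/IQR-based scaling keeps the bulk of the data well spread out
```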