What is the main characteristic of Robust Scaling?
- It is not affected by outliers
- It scales features to a specific range
- It scales the data to unit variance
- It's the most complex scaling technique
Robust scaling relies on statistics that are robust to outliers. It subtracts the median from each value and divides by the interquartile range (IQR), which is the distance between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile). Because the median and IQR barely move when a few extreme values are present, the scaling of the remaining data is not distorted by them.
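As a rough illustration of the median and IQR scaling described above, here is a minimal standard-library sketch. The function name `robust_scale` and the sample data are made up for this example, and the "inclusive" quartile method is only one of several quartile conventions, so other tools may report slightly different quartiles.

```python
from statistics import median, quantiles

def robust_scale(values):
    """Center on the median and divide by the IQR (Q3 - Q1)."""
    # Quartile convention is an assumption; "inclusive" interpolates linearly.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the data cannot be scaled this way")
    med = median(values)
    return [(v - med) / iqr for v in values]

# Hypothetical feature with one extreme value (100).
data = [1, 2, 3, 4, 100]
scaled = robust_scale(data)
print(scaled)  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The extreme value stays extreme after scaling, but it does not compress the other four values, which is what happens when the minimum and maximum are used instead.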
In a scenario where a machine learning model is showing unexpectedly high training time, how could incorrect handling of missing data be a factor?
- Missing data might have created outliers in the data.
- Missing data might have increased the complexity of the model.
- Missing data might have increased the dimensionality of the data.
- Missing data might have introduced multicollinearity in the data.
Incorrect handling of missing data (such as one-hot encoding missing values as an extra category) can increase the dimensionality of the dataset. More features mean more computation per training step, and the curse of dimensionality makes the model harder to fit, so training takes longer.
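The effect on dimensionality can be sketched with a toy example. Everything below is hypothetical (the rows, the column names, and the helpers `one_hot` and `impute_mode`); it only compares the number of encoded columns when missing values become their own category versus when they are filled with the most frequent value first.

```python
from statistics import mode

# Hypothetical categorical data; None marks a missing value.
rows = [
    {"color": "red", "size": "S"},
    {"color": None, "size": "M"},
    {"color": "blue", "size": None},
    {"color": "red", "size": "S"},
]
columns = ["color", "size"]

def one_hot(rows, columns):
    """One-hot encode the columns; None is treated as its own category."""
    categories = {
        col: sorted({str(row[col]) for row in rows}) for col in columns
    }
    encoded = []
    for row in rows:
        out = {}
        for col in columns:
            for cat in categories[col]:
                out[f"{col}={cat}"] = 1 if str(row[col]) == cat else 0
        encoded.append(out)
    return encoded

def impute_mode(rows, columns):
    """Replace None with the most frequent observed value in each column."""
    fill = {
        col: mode([row[col] for row in rows if row[col] is not None])
        for col in columns
    }
    return [
        {col: fill[col] if row[col] is None else row[col] for col in columns}
        for row in rows
    ]

naive = one_hot(rows, columns)
imputed = one_hot(impute_mode(rows, columns), columns)
print(len(naive[0]), len(imputed[0]))  # 6 4
```

With many categorical columns, the extra "missing" column per feature adds up, which is one way training time can grow unexpectedly.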
A __________ would be most suitable for visualizing a dataset with two numerical variables.
- Bar chart
- Line chart
- Pie chart
- Scatter plot
A scatter plot would be most suitable for visualizing a dataset with two numerical variables. It places one variable on each axis, giving a graphical view of the correlation or relationship between them.
In the EDA process, where does the 'communication' step typically occur?
- After concluding
- After exploring
- Before questioning
- Before wrangling
In the EDA process, the 'communication' step typically occurs after concluding. It involves effectively conveying the findings, insights, or conclusions drawn from the data to relevant stakeholders.
Given a boxplot of a data set, how can you determine the IQR, and what does it tell you about the data?
- Add the value of the lower quartile to the upper quartile
- Divide the range by 2
- Subtract the value of the lower quartile from the upper quartile
- Take the square root of the range
From a boxplot, you can determine the interquartile range (IQR) by subtracting the value of the lower quartile (Q1) from the upper quartile (Q3); these are the two edges of the box. The IQR is the range of the middle 50% of the data, which gives you a sense of the spread of the central values.
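The subtraction can be checked numerically. This is a small sketch on made-up data; the "inclusive" quartile method is an assumption, and plotting libraries may use a slightly different convention when drawing the box.

```python
from statistics import quantiles

# Hypothetical data set; the box of its boxplot would span Q1 to Q3.
data = [2, 4, 4, 5, 7, 9, 10, 12, 15]

q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # upper quartile minus lower quartile

print(q1, q2, q3)  # 4.0 7.0 10.0
print(iqr)         # 6.0
```

Here the middle 50% of the values lie between 4 and 10, so the box is 6 units long.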
When is it more appropriate to use a correlation matrix instead of a pairplot?
- When the dataset is very large
- When the dataset is very small
- When the variables are not numeric
- When there are only two variables
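A correlation matrix is the more appropriate choice when the dataset is very large, in particular when it has a large number of variables. A pairplot draws one panel for every pair of variables, so it becomes too complex and unreadable as the number of variables increases, whereas a correlation matrix condenses each pair into a single number.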
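To show how a correlation matrix condenses each pair of variables into one number, here is a minimal sketch that computes Pearson correlations by hand. The feature names and values are invented for illustration, and the helper `pearson` is hypothetical; in practice a data-frame library would normally do this.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical features: "b" rises with "a", "c" falls as "a" rises.
features = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 4, 3, 2, 1],
}
names = list(features)

# One number per pair of variables, instead of one plot per pair.
corr = {
    row: {col: pearson(features[row], features[col]) for col in names}
    for row in names
}

for row in names:
    print(row, [round(corr[row][col], 2) for col in names])
```

For n variables the matrix holds n × n numbers, while a pairplot needs n × n panels, which is why the matrix stays readable for longer.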
What is variance in the context of a data set?
- The average deviation from the mean
- The average squared deviation from the mean
- The range of the data
- The square root of the average deviation from the mean
Variance in the context of a data set is the average squared deviation from the mean. It measures how far data points spread around the mean, and its square root is the standard deviation.
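The definition can be worked through on a small made-up data set. The sketch below computes the average squared deviation by hand and compares it with `statistics.pvariance`, which implements the same population formula; note that `statistics.variance` divides by n - 1 instead and would give a different number.

```python
from statistics import pstdev, pvariance

# Hypothetical data set with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)
# Average of the squared deviations from the mean.
variance = sum((x - mean) ** 2 for x in data) / len(data)

print(mean)              # 5.0
print(variance)          # 4.0
print(pvariance(data))   # same population variance from the standard library
print(pstdev(data))      # 2.0, the square root of the variance
```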
What does Min-Max scaling do to the dataset?
- It reduces the dimensionality of the dataset
- It removes the mean and scales the data to unit variance
- It scales the data based on median and interquartile range
- It scales the dataset so that all feature values are in the range 0 to 1
Min-Max scaling, also known as normalization, transforms features by scaling each feature to a specific range, typically 0 to 1. This is done using each feature's own minimum and maximum values: x' = (x - min) / (max - min).
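A minimal sketch of this rescaling for a single feature follows. The function name `min_max_scale`, its `feature_range` parameter, and the sample values are assumptions made for illustration.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Linearly rescale values so the minimum maps to the low end of the
    range and the maximum maps to the high end."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; min-max scaling is undefined")
    a, b = feature_range
    return [a + (v - lo) * (b - a) / (hi - lo) for v in values]

# Hypothetical feature values.
scaled = min_max_scale([10, 20, 30, 50])
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

Because the minimum and maximum define the range, a single extreme value would squeeze all other values toward one end.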
How does the 'hue' parameter in Seaborn alter the visual presentation of data?
- Changes the color of elements
- Changes the shape of markers
- Changes the size of markers
- Rotates the plot
In Seaborn, the 'hue' parameter changes the color of elements. It is used to provide a color encoding for a third (typically categorical) variable in addition to two numeric variables.
What is an 'outlier' in the context of data analysis?
- A data point that lies an abnormal distance from other values
- A method to visualize data
- A variable that is not significant
- An error in data collection
In data analysis, an outlier is a data point that lies an abnormal distance from other values in a random sample from a population.
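What counts as an "abnormal distance" depends on the rule chosen. One common convention, which is an assumption here and not part of the definition above, flags values lying more than 1.5 × IQR below the first quartile or above the third quartile. A minimal sketch with a hypothetical helper `iqr_outliers` and made-up data:

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR].

    The 1.5 * IQR fences are a convention, not the only possible rule.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical measurements with one value far from the rest.
data = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(data))  # [95]
```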