What is the main characteristic of Robust Scaling?

  • It is not affected by outliers
  • It scales features to a specific range
  • It scales the data to unit variance
  • It's the most complex scaling technique
Robust scaling relies on statistics that are insensitive to outliers. It subtracts the median from each feature and divides the result by the interquartile range (IQR), which is the distance between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
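
As a minimal sketch of the formula described above (subtract the median, divide by the IQR), the following uses only the Python standard library. The helper name `robust_scale` and the sample data are illustrative assumptions, and the quartiles are computed with linear interpolation (`method="inclusive"`), which is one of several common quartile conventions.

```python
import statistics

def robust_scale(values):
    """Subtract the median, then divide by the IQR (Q3 - Q1)."""
    # method="inclusive" interpolates linearly between the sorted data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    median = statistics.median(values)
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; this feature cannot be robust-scaled")
    return [(v - median) / iqr for v in values]

data = [1, 2, 3, 4, 100]  # hypothetical feature; 100 is an outlier
scaled = robust_scale(data)
print(scaled)  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier is still far from the rest after scaling, but it does not influence the center (median) or the scale (IQR) applied to the other points.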

In a scenario where a machine learning model is showing unexpectedly high training time, how could incorrect handling of missing data be a factor?

  • Missing data might have created outliers in the data.
  • Missing data might have increased the complexity of the model.
  • Missing data might have increased the dimensionality of the data.
  • Missing data might have introduced multicollinearity in the data.
Incorrectly handling missing data (such as one-hot encoding missing values) can increase the dimensionality of the dataset. The model then has more features to process and the data becomes sparser (the curse of dimensionality), which leads to a longer training time.
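
To illustrate how one-hot encoding missing values adds columns, here is a small sketch in plain Python. The function `one_hot_columns`, the feature names, and the rows are all hypothetical; the point is only to count the resulting columns with and without a "missing" category.

```python
def one_hot_columns(rows, treat_missing_as_category):
    """Return the one-hot column names produced for a list of dict rows."""
    columns = set()
    for row in rows:
        for feature, value in row.items():
            if value is None and not treat_missing_as_category:
                continue  # missing values produce no column of their own
            columns.add(f"{feature}={value}")
    return sorted(columns)

# Hypothetical rows; None marks a missing value
rows = [
    {"color": "red", "size": "S"},
    {"color": None, "size": "M"},
    {"color": "blue", "size": None},
]

without_missing = one_hot_columns(rows, treat_missing_as_category=False)
with_missing = one_hot_columns(rows, treat_missing_as_category=True)
print(len(without_missing), len(with_missing))  # 4 6
```

Every feature that contains missing values gains one extra column, so the effect grows with the number of affected features.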

A __________ graph would be most suitable for visualizing a dataset with two numerical variables.

  • Bar chart
  • Line chart
  • Pie chart
  • Scatter plot
A scatter plot would be most suitable for visualizing a dataset with two numerical variables. It places one variable on each axis and gives a graphical view of the correlation, or relationship, between them.

In the EDA process, where does the 'communication' step typically occur?

  • After concluding
  • After exploring
  • Before questioning
  • Before wrangling
In the EDA process, the 'communication' step typically occurs after concluding. It involves effectively conveying the findings, insights, or conclusions drawn from the data to relevant stakeholders.

Given a boxplot of a data set, how can you determine the IQR, and what does it tell you about the data?

  • Add the value of the lower quartile to the upper quartile
  • Divide the range by 2
  • Subtract the value of the lower quartile from the upper quartile
  • Take the square root of the range
From a boxplot, you can determine the interquartile range (IQR) by subtracting the value of the lower quartile from the value of the upper quartile; these are the two edges of the box. The IQR measures the range of the middle 50% of the data, which gives you a sense of the spread of the central data.
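
The subtraction can be checked numerically. The sketch below assumes a hypothetical helper `interquartile_range` and made-up data, and uses the standard library's linear-interpolation quartiles; a plotting library may use a slightly different quartile convention when drawing the box.

```python
import statistics

def interquartile_range(values):
    """Return (Q1, Q3, IQR), where IQR = Q3 - Q1."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3, q3 - q1

q1, q3, iqr = interquartile_range([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(q1, q3, iqr)  # 3.0 7.0 4.0
```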

When is it more appropriate to use a correlation matrix instead of a pairplot?

  • When the dataset is very large
  • When the dataset is very small
  • When the variables are not numeric
  • When there are only two variables
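When the dataset is very large, and in particular when it has a large number of variables, a correlation matrix can be a more appropriate choice than a pairplot. A pairplot draws one panel for every pair of variables, so it becomes too complex and unreadable as the number of variables increases, whereas a correlation matrix summarizes each pair as a single number.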
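
For a sense of what a correlation matrix contains, here is a minimal sketch that computes Pearson correlations in plain Python. The helpers `pearson` and `correlation_matrix` and the three columns are illustrative assumptions; in practice a data-analysis library would normally do this.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Map each pair of column names to its Pearson correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical columns
columns = {
    "x": [1, 2, 3, 4, 5],
    "double_x": [2, 4, 6, 8, 10],
    "negative_x": [-1, -2, -3, -4, -5],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, [round(value, 2) for value in row.values()])
```

Each additional variable adds one row and one column of numbers here, while a pairplot would add a full row and column of plots.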

What is variance in the context of a data set?

  • The average deviation from the mean
  • The average squared deviation from the mean
  • The range of the data
  • The square root of the average deviation from the mean
Variance in the context of a data set is the average squared deviation from the mean. It measures how far the data points spread around the mean, and its square root is the standard deviation.

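The definition of variance as the average squared deviation from the mean can be written out directly. This sketch assumes the population form (dividing by n); the function name `population_variance` and the data are illustrative.

```python
def population_variance(values):
    """Average squared deviation from the mean (divides by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample, mean = 5
variance = population_variance(data)
std_dev = variance ** 0.5
print(variance, std_dev)  # 4.0 2.0
```

The standard library's `statistics.pvariance` computes the same quantity, while `statistics.variance` divides by n - 1 to estimate the variance from a sample.
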
What does Min-Max scaling do to the dataset?

  • It reduces the dimensionality of the dataset
  • It removes the mean and scales the data to unit variance
  • It scales the data based on median and interquartile range
  • It scales the dataset so that all feature values are in the range 0 to 1
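Min-Max scaling, also known as normalization, transforms each feature by rescaling it to a specific range, typically 0 to 1. This is done using the minimum and maximum values of that feature: the minimum is subtracted from each value and the result is divided by the difference between the maximum and the minimum.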
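
A minimal sketch of this rescaling for a single feature follows. The function `min_max_scale`, its `new_min` and `new_max` parameters, and the sample values are assumptions made for illustration.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so min -> new_min and max -> new_max."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("All values are equal; the feature has no range to scale")
    span = new_max - new_min
    return [new_min + (v - lowest) / (highest - lowest) * span for v in values]

print(min_max_scale([10, 20, 30, 50]))  # [0.0, 0.25, 0.5, 1.0]
```

Because the minimum and maximum define the scale, a single extreme value compresses all other values into a narrow part of the range.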

How does the 'hue' parameter in Seaborn alter the visual presentation of data?

  • Changes the color of elements
  • Changes the shape of markers
  • Changes the size of markers
  • Rotates the plot
In Seaborn, the 'hue' parameter changes the color of elements. It is used to provide a color encoding for a third (typically categorical) variable in addition to two numeric variables.
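
A short sketch of the 'hue' parameter, assuming seaborn, pandas, and matplotlib are installed. The DataFrame, its column names, and its values are made up for illustration; `sns.scatterplot` colors the points by the categorical column passed as `hue`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import pandas as pd
import seaborn as sns

# Hypothetical data: two numeric variables and one categorical variable
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 165, 175],
    "weight_kg": [50, 58, 68, 80, 62, 74],
    "group": ["a", "a", "b", "b", "a", "b"],
})

# Points are colored by the value of the "group" column
ax = sns.scatterplot(data=df, x="height_cm", y="weight_kg", hue="group")
```

In an interactive session, calling `matplotlib.pyplot.show()` (with an interactive backend) would display the figure with a legend for the two groups.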

What is an 'outlier' in the context of data analysis?

  • A data point that lies an abnormal distance from other values
  • A method to visualize data
  • A variable that is not significant
  • An error in data collection
In data analysis, an outlier is a data point that lies an abnormal distance from other values in a random sample from a population.
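
The definition above does not say how large an "abnormal distance" is. One common convention, used here as an assumption and not taken from the text, is to flag points lying more than 1.5 times the IQR below the lower quartile or above the upper quartile. The function `find_outliers` and the data are illustrative.

```python
import statistics

def find_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed 1.5*IQR rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

print(find_outliers([1, 2, 3, 4, 100]))  # [100]
```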