What is the main characteristic of Robust Scaling?

It is not affected by outliers
It scales features to a specific range
It scales the data to unit variance
It's the most complex scaling technique

Robust scaling relies on statistics that are insensitive to outliers. It subtracts the median from each value and divides by the interquartile range (IQR), which is the difference between the 3rd quartile (75th percentile) and the 1st quartile (25th percentile).
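
As a minimal sketch of the idea, the snippet below centers a small made-up sample on its median and divides by its IQR, using only the Python standard library. The function name `robust_scale` and the sample values are illustrative assumptions, and the "inclusive" quartile method is just one of several common conventions.

```python
import statistics

def robust_scale(values):
    """Center on the median and divide by the IQR (Q3 - Q1)."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; cannot scale")
    return [(v - median) / iqr for v in values]

# Hypothetical sample with one extreme value (100).
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
scaled = robust_scale(data)
print(scaled)
```

The extreme value stays extreme after scaling, but it does not influence the median or the IQR, so the other values are scaled as if it were absent.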

In a scenario where a machine learning model is showing unexpectedly high training time, how could incorrect handling of missing data be a factor?

Missing data might have created outliers in the data.
Missing data might have increased the complexity of the model.
Missing data might have increased the dimensionality of the data.
Missing data might have introduced multicollinearity in the data.

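Handling missing data incorrectly (for example, one-hot encoding missing values as extra categories) can increase the dimensionality of the dataset. More features mean more computation per training pass, and the curse of dimensionality can make the model slower to fit, so training time grows.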
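
One hypothetical way this can happen is sketched below: a categorical column stores missing values under several inconsistent placeholders, so one-hot encoding turns each placeholder into its own indicator column. The column contents, the `MISSING` set, and the helper `one_hot_width` are assumptions made up for illustration.

```python
def one_hot_width(column):
    """Number of indicator columns one-hot encoding would create."""
    return len(set(column))

# Hypothetical column where missing values appear in four different forms.
raw = ["red", "blue", None, "NA", "n/a", "", "red"]

# Map every placeholder to a single "missing" category before encoding.
MISSING = {None, "NA", "n/a", ""}
cleaned = [v if v not in MISSING else "missing" for v in raw]

print(one_hot_width(raw), one_hot_width(cleaned))
```

In this toy column, consolidating the placeholders halves the number of indicator columns; on a real dataset with many such columns the difference in width can be much larger.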

A __________ graph would be most suitable for visualizing a dataset with two numerical variables.

Bar chart
Line chart
Pie chart
Scatter plot

A scatter plot is the most suitable choice for visualizing a dataset with two numerical variables. It plots one variable against the other, so any correlation or other relationship between them is visible at a glance.

In the EDA process, where does the 'communication' step typically occur?

After concluding
After exploring
Before questioning
Before wrangling

In the EDA process, the 'communication' step typically occurs after concluding. It involves effectively conveying the findings, insights, or conclusions drawn from the data to relevant stakeholders.

Given a boxplot of a data set, how can you determine the IQR, and what does it tell you about the data?

Add the value of the lower quartile to the upper quartile
Divide the range by 2
Subtract the value of the lower quartile from the upper quartile
Take the square root of the range

From a boxplot, you can determine the interquartile range (IQR) by subtracting the value of the lower quartile (the bottom edge of the box) from the upper quartile (the top edge of the box). The IQR spans the middle 50% of the data, so it describes the spread of the central values without being influenced by the extremes.
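
The subtraction can be sketched numerically as follows. The sample values are made up, and the 1.5 × IQR fences are the common boxplot whisker convention rather than something stated in this quiz, so treat them as an added assumption.

```python
import statistics

# Hypothetical sample; 30 sits far from the rest.
values = [2, 4, 4, 5, 6, 7, 8, 9, 30]

# Lower quartile, median, upper quartile (the box edges and center line).
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # upper quartile minus lower quartile

# Conventional fences: points outside them are usually drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
flagged = [v for v in values if v < lower_fence or v > upper_fence]

print(iqr, lower_fence, upper_fence, flagged)
```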

When is it more appropriate to use a correlation matrix instead of a pairplot?

When the dataset is very large
When the dataset is very small
When the variables are not numeric
When there are only two variables

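When the dataset is very large, and in particular when it has many variables, a correlation matrix is usually a better choice than a pairplot. A pairplot draws one panel per pair of variables, so it becomes cluttered and unreadable as the number of variables grows, whereas a correlation matrix summarizes each pair as a single number.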

What is variance in the context of a data set?

The average deviation from the mean
The average squared deviation from the mean
The range of the data
The square root of the average deviation from the mean

Variance in the context of a data set is the average squared deviation from the mean. It measures how far the data points spread around the mean, and its square root is the standard deviation.
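
A short sketch of that definition, using made-up values, computes the variance by hand and compares it with the standard library. It uses the population form (dividing by n); the sample variance, which divides by n - 1, is available as `statistics.variance`.

```python
import math
import statistics

# Hypothetical data set.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)
# Average of the squared deviations from the mean (population variance).
variance = sum((x - mean) ** 2 for x in data) / len(data)
std_dev = math.sqrt(variance)

# The standard library's population variance should agree.
assert math.isclose(variance, statistics.pvariance(data))
print(mean, variance, std_dev)
```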

What does Min-Max scaling do to the dataset?

It reduces the dimensionality of the dataset
It removes the mean and scales the data to unit variance
It scales the data based on median and interquartile range
It scales the dataset so that all feature values are in the range 0 to 1

Min-Max scaling, also known as normalization, rescales each feature to a specific range, typically 0 to 1. For each feature, it subtracts that feature's minimum value and divides by the difference between its maximum and minimum values.
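
A minimal sketch for a single feature, assuming the default target range of 0 to 1; the function name `min_max_scale` and the sample values are illustrative.

```python
def min_max_scale(values):
    """Rescale one feature to [0, 1] using its own minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("All values are equal; cannot scale")
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical feature column.
feature = [10, 20, 25, 40]
normalized = min_max_scale(feature)
print(normalized)
```

Because the minimum and maximum define the scale, a single extreme value compresses all the other values toward one end of the range.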

How does the 'hue' parameter in Seaborn alter the visual presentation of data?

Changes the color of elements
Changes the shape of markers
Changes the size of markers
Rotates the plot

In Seaborn, the 'hue' parameter changes the color of elements. It is used to provide a color encoding for a third (typically categorical) variable in addition to two numeric variables.

What is an 'outlier' in the context of data analysis?

A data point that lies an abnormal distance from other values
A method to visualize data
A variable that is not significant
An error in data collection

In data analysis, an outlier is a data point that lies an abnormal distance from other values in a random sample from a population.
