How does multicollinearity affect feature selection?

It affects the accuracy of the model
It causes unstable parameter estimates
It makes the model less interpretable
It results in high variance of the model

Multicollinearity, a high correlation among predictor variables, makes parameter estimates unstable: small changes in the data can produce large swings in the fitted coefficients. Because the coefficients no longer reliably reflect each predictor's individual contribution, feature selection that relies on them becomes less accurate.
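The instability can be seen in a small sketch. It assumes a two-predictor least-squares fit without an intercept, solved from the normal equations; the helper names and the toy data are hypothetical and chosen only to illustrate the effect.

```python
def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y ~ b1*x1 + b2*x2 (no intercept) via the normal equations."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    t1 = sum(a * c for a, c in zip(x1, y))
    t2 = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12  # close to zero when x1 and x2 are nearly collinear
    b1 = (s22 * t1 - s12 * t2) / det
    b2 = (s11 * t2 - s12 * t1) / det
    return b1, b2

x1 = [1.0, 2.0, 3.0, 4.0]
x2_collinear = [1.01, 1.99, 3.01, 3.99]  # almost a copy of x1
x2_distinct = [1.0, -1.0, 1.0, -1.0]     # not collinear with x1
noise = [0.1, -0.1, 0.1, -0.1]           # small perturbation of the target

def coefficients_before_and_after(x2):
    """Fit once on a clean target and once on the slightly perturbed target."""
    y = [a + b for a, b in zip(x1, x2)]  # true coefficients are (1, 1)
    y_noisy = [v + e for v, e in zip(y, noise)]
    return fit_two_predictors(x1, x2, y), fit_two_predictors(x1, x2, y_noisy)

collinear_before, collinear_after = coefficients_before_and_after(x2_collinear)
distinct_before, distinct_after = coefficients_before_and_after(x2_distinct)
print(collinear_before, collinear_after)
print(distinct_before, distinct_after)
```

With the nearly collinear pair, the same small perturbation moves the coefficients from about (1, 1) to about (-9, 11); with the non-collinear pair they move only to about (1, 1.1).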


What is the main characteristic of Robust Scaling?

It is not affected by outliers
It scales features to a specific range
It scales the data to unit variance
It's the most complex scaling technique

Robust scaling is resistant to outliers. It subtracts the median from each feature and divides by the interquartile range (IQR), the distance between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile). Because the median and the IQR are barely influenced by extreme values, a few outliers do not distort the scale of the remaining data.
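As a minimal sketch of that calculation, the snippet below uses only Python's standard library. The function name `robust_scale` is hypothetical, and the `"inclusive"` quartile convention is an assumption; libraries differ slightly in how they interpolate quartiles.

```python
from statistics import median, quantiles

def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the feature cannot be scaled this way")
    center = median(values)
    return [(v - center) / iqr for v in values]

feature = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 is an outlier
scaled = robust_scale(feature)
print(scaled)  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier remains extreme after scaling, but it does not compress the other four values, which is what would happen if the minimum and maximum were used instead.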


In a scenario where a machine learning model is showing unexpectedly high training time, how could incorrect handling of missing data be a factor?

Missing data might have created outliers in the data.
Missing data might have increased the complexity of the model.
Missing data might have increased the dimensionality of the data.
Missing data might have introduced multicollinearity in the data.

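Handling missing data incorrectly, for example by one-hot encoding missing values as extra categories, can increase the dimensionality of the dataset. Every additional column adds computation to each training step, and high-dimensional data is harder to learn from (the curse of dimensionality), so training takes longer.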
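A rough illustration of the growth in column count, assuming records stored as dictionaries with `None` marking a missing value. The helper `one_hot_columns` and the sample records are hypothetical.

```python
def one_hot_columns(rows, encode_missing):
    """Return the sorted one-hot column names for a list of dict records.

    When encode_missing is True, a missing value (None) becomes its own category.
    """
    columns = set()
    for row in rows:
        for feature, value in row.items():
            if value is None:
                if encode_missing:
                    columns.add(f"{feature}=<missing>")
            else:
                columns.add(f"{feature}={value}")
    return sorted(columns)

records = [
    {"color": "red", "size": "S", "city": "Oslo"},
    {"color": None, "size": "M", "city": None},
    {"color": "blue", "size": None, "city": "Rome"},
]

without_missing = one_hot_columns(records, encode_missing=False)
with_missing = one_hot_columns(records, encode_missing=True)
print(len(without_missing), len(with_missing))  # 6 9
```

Each categorical feature that contains a gap gains one extra column, so the effect scales with the number of such features.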


A __________ would be most suitable for visualizing a dataset with two numerical variables.

Bar chart
Line chart
Pie chart
Scatter plot

A scatter plot would be most suitable for visualizing a dataset with two numerical variables. Plotting one variable against the other shows how the two relate, including the direction and strength of any correlation between them.


In the EDA process, where does the 'communication' step typically occur?

After concluding
After exploring
Before questioning
Before wrangling

In the EDA process, the 'communication' step typically occurs after concluding. It involves effectively conveying the findings, insights, or conclusions drawn from the data to relevant stakeholders.


Given a boxplot of a data set, how can you determine the IQR, and what does it tell you about the data?

Add the value of the lower quartile to the upper quartile
Divide the range by 2
Subtract the value of the lower quartile from the upper quartile
Take the square root of the range

From a boxplot, you can determine the interquartile range (IQR) by subtracting the value of the lower quartile (the bottom edge of the box) from the upper quartile (the top edge of the box). The IQR is the range of the middle 50% of the data, so it indicates how spread out the central values are.
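The same subtraction can be sketched numerically. The data values are made up for illustration, the `"inclusive"` quartile convention is an assumption, and the 1.5 × IQR fences are the common convention for whiskers and outliers rather than something the question itself specifies.

```python
from statistics import quantiles

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]

# The quartiles correspond to the box edges and the median line in a boxplot.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # upper quartile minus lower quartile

# Assumption: the common 1.5 * IQR rule for flagging outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print(q1, q3, iqr)  # 4.0 8.0 4.0
print(outliers)     # [30]
```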


Suppose you are dealing with time series data with some missing values and you decided to use regression imputation. What potential issues might arise and how could you address them?

May lead to overfitting; Address by adding more data
May violate independence assumption; Address by considering time dependence
May violate uniform distribution; Address by transforming data
No issues might arise

In time series data, observations are usually dependent on time, so the independence assumption of regression imputation may be violated. This issue can be addressed by considering time dependence in the regression model used for imputation, for example by including lagged variables.
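One way to bring time dependence into the imputation is to regress each value on the previous one. The sketch below is an assumption-laden illustration, not a recommended production method: it uses a single lag, ordinary least squares, a hypothetical function name `impute_with_lag`, and `None` as the missing-value marker.

```python
def impute_with_lag(series):
    """Fill None gaps using a regression of each value on the previous one.

    Fits y_t = a + b * y_(t-1) on consecutive observed pairs, then fills
    gaps left to right so an imputed value can feed the next prediction.
    A gap at the very first position is left as None.
    """
    pairs = [(prev, cur) for prev, cur in zip(series, series[1:])
             if prev is not None and cur is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two consecutive observed pairs")
    mean_x = sum(p for p, _ in pairs) / len(pairs)
    mean_y = sum(c for _, c in pairs) / len(pairs)
    sxx = sum((p - mean_x) ** 2 for p, _ in pairs)
    if sxx == 0:
        raise ValueError("lagged values are constant; slope is undefined")
    sxy = sum((p - mean_x) * (c - mean_y) for p, c in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    filled = list(series)
    for t in range(1, len(filled)):
        if filled[t] is None and filled[t - 1] is not None:
            filled[t] = intercept + slope * filled[t - 1]
    return filled

readings = [10.0, 12.0, 14.0, None, 18.0, 20.0]
filled = impute_with_lag(readings)
print(filled)
```

In this toy series every observed value is 2 higher than the one before it, so the gap is filled with a value of about 16, which follows the trend instead of pulling toward the overall mean.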


How is multicollinearity typically detected in a dataset?

By calculating the Variance Inflation Factor (VIF).
By performing a simple linear regression.
By performing a t-test.
By visually inspecting the data.

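Multicollinearity is typically detected by calculating the Variance Inflation Factor (VIF) for each independent variable. The VIF measures how much the variance of a coefficient is inflated because that variable can be predicted from the others. A high VIF (a common rule of thumb is a value above 5 or 10) indicates a high degree of multicollinearity.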
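For predictor j, VIF = 1 / (1 - R²), where R² comes from regressing predictor j on all the other predictors. The sketch below covers only the special case of exactly two predictors, where that R² equals the squared Pearson correlation between them. The function names and data are hypothetical; with more predictors a full regression per variable would be needed.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def vif_two_predictors(x1, x2):
    """VIF for either predictor when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

hours = [1.0, 2.0, 3.0, 4.0, 5.0]
minutes_noisy = [2.1, 3.9, 6.2, 7.8, 10.1]  # nearly a multiple of hours
vif_collinear = vif_two_predictors(hours, minutes_noisy)

centered = [-2.0, -1.0, 0.0, 1.0, 2.0]
squared = [4.0, 1.0, 0.0, 1.0, 4.0]         # zero linear correlation with centered
vif_uncorrelated = vif_two_predictors(centered, squared)

print(vif_collinear, vif_uncorrelated)
```

The nearly collinear pair yields a VIF in the hundreds, far above the usual thresholds, while the linearly uncorrelated pair yields a VIF of 1, the minimum possible value.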


After exploring and interpreting your data, you would '______' your findings in the EDA process.

communicate
conclude
question
wrangle

After exploring and interpreting your data, you would 'conclude' your findings in the EDA process. This is where you draw actionable insights from the data that you have analyzed and explored.


Which type of graph would be most suitable for showing the relationship between two variables?

Bar graph
Histogram
Pie chart
Scatter plot

A scatter plot is most suitable for showing the relationship between two variables. Each point on the plot corresponds to one observation, with its positions along the x-axis and y-axis giving the values of the two variables. This allows patterns and relationships to be identified visually.
