You've received feedback that your box plots are not providing a clear visual of the distribution of your dataset. What alternative plot could you use and why?

Bar graph
Line graph
Scatter plot
Violin plot

If box plots are not providing a clear visualization, a violin plot is a good alternative. Violin plots are similar to box plots, but they also show the estimated probability density of the data at different values. This gives a more detailed picture of the distribution's shape, for example whether it has more than one peak.
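
As a rough illustration of the difference, the sketch below draws the same sample as a box plot and as a violin plot using Matplotlib. The bimodal sample is made up for the example; the two clusters are invisible in the box plot but show up as two bulges in the violin.

```python
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

random.seed(0)
# Hypothetical bimodal sample: two clusters centred at 2 and 8
data = [random.gauss(2, 0.5) for _ in range(200)] + \
       [random.gauss(8, 0.5) for _ in range(200)]

fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(8, 4), sharey=True)

ax_box.boxplot(data)  # shows median, quartiles, whiskers only
ax_box.set_title("Box plot")

parts = ax_violin.violinplot(data, showmedians=True)  # adds the density shape
ax_violin.set_title("Violin plot")

fig.tight_layout()
# Call fig.savefig("box_vs_violin.png") or plt.show() to view the result.
```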

A ________ correlation indicates a strong negative relationship between two variables.

Negative
Neutral
Positive
Zero

A negative correlation indicates a negative relationship between two variables: as one variable increases, the other tends to decrease. The relationship is strong when the correlation coefficient is close to -1.
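
A small sketch of how the sign and the strength show up in the Pearson coefficient. The helper `pearson_r` and the sample numbers are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

hours_tv = [1, 2, 3, 4, 5]
score_strong = [90, 80, 70, 60, 50]  # falls in lockstep with hours_tv
score_weak = [70, 75, 60, 72, 62]    # falls only loosely

r_strong = pearson_r(hours_tv, score_strong)  # close to -1: strong negative
r_weak = pearson_r(hours_tv, score_weak)      # negative, but nearer to 0
print(r_strong, r_weak)
```

Both coefficients are negative, but only the first is near -1, which is what a strong negative relationship looks like.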

What are the pitfalls to avoid when trying to improve the readability of a graph?

Avoiding color altogether
Making the graph too simple
Overloading the graph with too much information
Using uncommon graph types

While improving readability, a common pitfall is overloading the graph with too much information. Too many data points, variables, or details can confuse the audience and obscure the main message. It's crucial to strike a balance, providing enough information to convey the message accurately, but not so much that it overwhelms the audience.

How could the handling of missing data influence the interpretability of a machine learning model?

Depends on the model used.
Does not impact model interpretability.
Makes the model less interpretable.
Makes the model more interpretable.

If missing data are handled incorrectly, the model may learn from distorted inputs and make inaccurate predictions. Its decisions then become harder to explain, which reduces its interpretability.

Which Python visualization library would be most suited to creating a complex, layered, "small multiple" style plot?

Bokeh
Matplotlib
Plotly
Seaborn

Seaborn is particularly well-suited for creating complex, layered "small multiple" style plots. The 'FacetGrid' class in Seaborn makes this type of plot easy to create.

In EDA, "data wrangling" involves ________.

Building predictive models
Cleaning and transforming raw data
Performing statistical tests
Visualizing data

In EDA, "data wrangling" involves cleaning and transforming raw data into a more suitable format for analysis. This could include handling missing values, dealing with outliers, encoding categorical variables, and other data preprocessing steps.
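
The steps listed above might look like this in pandas. The data, the column names, and the cap of 100 on `age` are assumptions made up for the sketch.

```python
import pandas as pd

# Hypothetical raw data with missing values and an implausible age
raw = pd.DataFrame({
    "age": [25, None, 40, 31, 120],
    "city": ["Paris", "Lima", "Paris", None, "Lima"],
})

clean = raw.copy()
# Handle missing values: median for numbers, a placeholder for categories
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["city"] = clean["city"].fillna("unknown")
# Deal with outliers: cap ages at an assumed plausible maximum
clean["age"] = clean["age"].clip(upper=100)
# Encode the categorical column as indicator (one-hot) columns
clean = pd.get_dummies(clean, columns=["city"])

print(clean)
```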

How can a logarithmic transformation of the axes affect the identification of outliers in a scatter plot?

It can convert outliers to normal data points
It can hide outliers
It can highlight outliers
It does not affect outlier identification

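A logarithmic transformation of the axes can highlight outliers in a scatter plot when the data span several orders of magnitude. On a linear scale most points are crowded near the origin; a log scale spreads them out, so points that deviate from the overall trend become easier to see. Note that a log scale also compresses large values, so a very large outlier can look less extreme than it does on a linear scale.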
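
One way to see the effect numerically, without drawing anything: the sketch below uses invented data that follow y = x**2 except for one point that is ten times too large, and measures how big that deviation is relative to the axis range on each scale.

```python
import math

# Hypothetical data following y = x**2, except at x = 10 (1000 instead of 100)
xs = [1, 10, 100, 1000, 10000]
ys = [1, 1000, 10**4, 10**6, 10**8]

expected = xs[1] ** 2  # value the odd point would have on the trend

# Size of the deviation as a fraction of the full y-axis range
linear_fraction = (ys[1] - expected) / (max(ys) - min(ys))
log_fraction = (math.log10(ys[1]) - math.log10(expected)) / (
    math.log10(max(ys)) - math.log10(min(ys))
)

print(linear_fraction)  # tiny: the point is indistinguishable on a linear axis
print(log_fraction)     # an eighth of the axis: clearly off the trend line
```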

Replacing missing data with a constant value can introduce ________ in a machine learning model.

bias
noise
precision
variance

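Filling missing values with a constant ignores the variability of the data: every gap receives the same value, regardless of what the true value was likely to be. The imputed entries act as artificial noise, and they can also shift the mean and shrink the variance of the feature, which can mislead the model during training.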
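
A quick sketch of how a constant fill distorts a feature's summary statistics. The numbers are invented: five observed values plus two gaps, filled once with 0 and once with the observed mean.

```python
import statistics

observed = [10, 12, 14, 16, 18]  # hypothetical observed values
n_missing = 2                    # two further entries are missing

filled_zero = observed + [0] * n_missing
filled_mean = observed + [statistics.mean(observed)] * n_missing

print(statistics.mean(observed), statistics.mean(filled_zero))
print(statistics.pvariance(observed), statistics.pvariance(filled_mean))
```

Filling with 0 drags the mean from 14 down to 10, while filling with the mean keeps the mean but reduces the variance, because the filled entries add no spread.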

How can outlier handling techniques potentially impact the accuracy of a predictive model?

They can decrease the accuracy by removing important information
They can either increase or decrease the accuracy depending on the dataset and model
They can increase the accuracy by reducing noise

Outlier handling techniques can either increase or decrease the accuracy of a predictive model depending on the dataset and model. Properly handled outliers can improve model accuracy, but incorrectly handled outliers or the removal of important information can decrease model accuracy.
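
The helpful case can be sketched with NumPy: a made-up linear dataset with one corrupted measurement, fitted before and after dropping points outside the usual 1.5 × IQR fences. The data and the choice of the IQR rule are assumptions for illustration.

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 2 * x + 1   # hypothetical true relationship: slope 2, intercept 1
y[9] = 100      # one corrupted measurement

# Fit a straight line to all points, outlier included
slope_raw, intercept_raw = np.polyfit(x, y, 1)

# Drop points outside the 1.5 * IQR fences, then fit again
q1, q3 = np.percentile(y, [25, 75])
iqr = q3 - q1
keep = (y >= q1 - 1.5 * iqr) & (y <= q3 + 1.5 * iqr)
slope_clean, intercept_clean = np.polyfit(x[keep], y[keep], 1)

print(slope_raw, slope_clean)
```

Here the filtered fit recovers the true slope. Had the extreme point been a genuine observation rather than an error, the same filter would have thrown away real information, which is the case where accuracy drops.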

Imagine you have a dataset where only 5% of the rows contain missing values. What potential problems could arise if you choose to use listwise deletion?

It could cause all of the above problems
It may distort the original data distribution
It may lead to a significant reduction in sample size
It might introduce selection bias

Even though only 5% of the rows contain missing values, listwise deletion discards each of those rows entirely, which can still reduce the sample size noticeably, particularly in a small dataset. If the values are not missing completely at random, the remaining rows may also distort the original data distribution and introduce selection bias. These problems can reduce the statistical power and the representativeness of the analysis.
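
A small pandas sketch of the selection-bias case. The dataset is invented: 20 rows, with `income` missing only for the oldest person, so the missingness depends on `age`.

```python
import pandas as pd

# Hypothetical data: 20 people aged 21..40 with steadily rising income
df = pd.DataFrame({
    "age": range(21, 41),
    "income": [30000.0 + 1000 * i for i in range(20)],
})
# 1 of 20 rows (5%) has a missing income, and it is the oldest person
df.loc[df["age"] == 40, "income"] = None

complete = df.dropna()  # listwise deletion: drop any row with a missing value

print(len(df), len(complete))
print(df["age"].mean(), complete["age"].mean())
```

The mean age drops after deletion, even though `age` itself had no missing values, because the dropped row was not a random one.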
