You are analyzing a dataset where the variable 'income' has a skewed distribution due to a few high-income individuals. What method would you recommend to handle these outliers?

  • Binning
  • Removal
  • Transformation
In this case, a transformation such as the log transformation is the best fit. It reduces the skewness of the data by compressing the high values toward the rest of the distribution while keeping every observation and preserving their order.
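As a minimal sketch of this effect, the snippet below computes skewness before and after a log transformation. The income values and the `skewness` helper are hypothetical and exist only for illustration.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / std) ** 3 for v in values])


# Hypothetical incomes: most are moderate, two are very high.
incomes = [28_000, 32_000, 35_000, 41_000, 45_000,
           52_000, 60_000, 75_000, 900_000, 2_500_000]

# The log transformation compresses the largest values the most.
log_incomes = [math.log(v) for v in incomes]

raw_skew = skewness(incomes)
log_skew = skewness(log_incomes)
print(f"skewness before: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

With this sample the skewness drops but does not vanish, which is typical: the transformation pulls in the tail rather than removing it.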

A ________ correlation indicates a strong negative relationship between two variables.

  • Negative
  • Neutral
  • Positive
  • Zero
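A negative correlation indicates a negative (inverse) relationship between two variables: as one variable increases, the other tends to decrease. The relationship is strong when the correlation coefficient is close to -1 and weak when it is close to 0.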
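To make the sign and strength concrete, here is a small sketch that computes the Pearson correlation coefficient by hand. The `pearson_r` helper and the data are hypothetical, chosen for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: hours of TV per day versus two sets of test scores.
hours_tv = [1, 2, 3, 4, 5, 6]
scores_strong = [90, 85, 78, 74, 66, 60]  # falls steadily as hours rise
scores_weak = [80, 62, 85, 70, 75, 68]    # only a loose downward tendency

r_strong = pearson_r(hours_tv, scores_strong)
r_weak = pearson_r(hours_tv, scores_weak)
print(f"strong: {r_strong:.2f}, weak: {r_weak:.2f}")
```

Both coefficients are negative, but only the first is close to -1.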

What are the pitfalls to avoid when trying to improve the readability of a graph?

  • Avoiding color altogether
  • Making the graph too simple
  • Overloading the graph with too much information
  • Using uncommon graph types
While improving readability, a common pitfall is overloading the graph with too much information. Too many data points, variables, or details can confuse the audience and obscure the main message. It's crucial to strike a balance, providing enough information to convey the message accurately, but not so much that it overwhelms the audience.

How could the handling of missing data influence the interpretability of a machine learning model?

  • Depends on the model used.
  • Does not impact model interpretability.
  • Makes the model less interpretable.
  • Makes the model more interpretable.
If missing data are handled incorrectly, the model may learn inaccurate patterns and make inaccurate predictions. Its decisions then become harder to explain, which reduces its interpretability.

Which Python visualization library would be most suited to creating a complex, layered, "small multiple" style plot?

  • Bokeh
  • Matplotlib
  • Plotly
  • Seaborn
Seaborn is particularly well-suited for creating complex, layered "small multiple" style plots. The 'FacetGrid' class in Seaborn makes this type of plot easy to create.
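A minimal sketch of a small-multiples plot with `FacetGrid` follows. The DataFrame and its column names (`region`, `ad_spend`, `sales`) are hypothetical; the non-interactive `Agg` backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs

import pandas as pd
import seaborn as sns

# Hypothetical data: one row per observation, with a categorical "region" column.
df = pd.DataFrame({
    "region": ["North"] * 4 + ["South"] * 4 + ["East"] * 4,
    "ad_spend": [1, 2, 3, 4] * 3,
    "sales": [2, 4, 5, 8, 1, 3, 3, 5, 3, 4, 7, 9],
})

# One panel per region, all sharing the same axes by default.
grid = sns.FacetGrid(df, col="region")
grid.map_dataframe(sns.scatterplot, x="ad_spend", y="sales")
grid.set_axis_labels("Ad spend", "Sales")
```

Calling `grid.savefig("small_multiples.png")` afterwards would write the figure to disk; further layers can be added with additional `map_dataframe` calls.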

In EDA, "data wrangling" involves ________.

  • Building predictive models
  • Cleaning and transforming raw data
  • Performing statistical tests
  • Visualizing data
In EDA, "data wrangling" involves cleaning and transforming raw data into a more suitable format for analysis. This could include handling missing values, dealing with outliers, encoding categorical variables, and other data preprocessing steps.
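The steps named above might look like this in pandas. This is only a sketch on a hypothetical four-row table; the column names and the choice of median imputation are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw data with a missing value, messy text, and numbers stored as strings.
raw = pd.DataFrame({
    "age": [34, None, 29, 41],
    "city": [" paris", "Berlin ", "PARIS", "berlin"],
    "income": ["52000", "61000", "48000", "75000"],
})

clean = raw.copy()
clean["age"] = clean["age"].fillna(clean["age"].median())  # handle missing values
clean["city"] = clean["city"].str.strip().str.title()      # normalize text
clean["income"] = clean["income"].astype(int)              # fix the data type
clean = pd.get_dummies(clean, columns=["city"])            # encode the categorical column
print(clean)
```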

How can a logarithmic transformation of the axes affect the identification of outliers in a scatter plot?

  • It can convert outliers to normal data points
  • It can hide outliers
  • It can highlight outliers
  • It does not affect outlier identification
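A logarithmic transformation of the axes compresses large values and stretches small ones. When most points are crowded near the origin on a linear scale, the log scale spreads that dense cluster out, so points that depart from the overall pattern become easier to spot. Very large values are drawn closer to the rest of the data than on a linear scale, so the effect depends on where the potential outliers lie.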

Replacing missing data with a constant value can introduce ________ in a machine learning model.

  • bias
  • noise
  • precision
  • variance
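Filling missing values with a constant ignores the variability of the observed data, so the imputed entries act as artificial noise: values that do not reflect the true underlying measurements. They can also distort the feature's distribution, for example by creating a spike at the constant, and this can mislead the model during training.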
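A small sketch of the distortion, using a hypothetical list of ages in which `None` marks a missing value and 0 is the (assumed) fill constant:

```python
import statistics

# Hypothetical ages; None marks a missing value.
ages = [23, 31, None, 45, 52, None, 38, None, 29, 61]

observed = [a for a in ages if a is not None]
filled = [0 if a is None else a for a in ages]  # constant imputation with 0

observed_mean = statistics.fmean(observed)
filled_mean = statistics.fmean(filled)
print(f"mean of observed values: {observed_mean:.1f}")
print(f"mean after filling with 0: {filled_mean:.1f}")
```

The three filled entries sit at a value no respondent actually had, and they drag the mean well below that of the observed ages.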

How can outlier handling techniques potentially impact the accuracy of a predictive model?

  • They can decrease the accuracy by removing important information
  • They can either increase or decrease the accuracy depending on the dataset and model
  • They can increase the accuracy by reducing noise
Outlier handling techniques can either increase or decrease the accuracy of a predictive model depending on the dataset and model. Properly handled outliers can improve model accuracy, but incorrectly handled outliers or the removal of important information can decrease model accuracy.

Imagine you have a dataset where only 5% of the rows contain missing values. What potential problems could arise if you choose to use listwise deletion?

  • It may distort the original data distribution
  • It may lead to a significant reduction in sample size
  • It might introduce selection bias
  • It could cause all of the above problems
Even though only 5% of the rows contain missing values, listwise deletion could still reduce the sample size, distort the original data distribution, and introduce selection bias, particularly if the values are not missing completely at random. These problems may affect the statistical power and the representativeness of the analysis.
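The sketch below illustrates the point on a hypothetical survey of 20 respondents in which exactly one row (5%) is missing `income`, and the missing row belongs to the oldest respondent. All values are invented for illustration.

```python
import statistics

# Hypothetical survey: 19 respondents aged 25 to 43, plus one aged 80.
ages = list(range(25, 44)) + [80]
rows = [{"age": age, "income": 1_000 * age} for age in ages]
rows[-1]["income"] = None  # the oldest respondent did not report income

# Listwise deletion: keep only rows with no missing values.
complete = [r for r in rows if all(v is not None for v in r.values())]

mean_age_all = statistics.fmean([r["age"] for r in rows])
mean_age_complete = statistics.fmean([r["age"] for r in complete])
print(f"rows: {len(rows)} -> {len(complete)}")
print(f"mean age: {mean_age_all:.1f} -> {mean_age_complete:.1f}")
```

Dropping a single row shifts the mean of `age`, a column with no missing values at all, because the missingness is related to age.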