The _____ Distribution is used for modeling the number of times an event occurs in an interval of time or space.

Binomial
Normal
Poisson
Uniform

The Poisson Distribution is used for modeling the number of times an event occurs in an interval of time or space.
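As a rough illustration of such count modeling, the sketch below evaluates the Poisson probability mass function, P(X = k) = e^(-λ) λ^k / k!, using only the standard library. The help-desk scenario and the rate of 3 calls per hour are assumptions made up for the example.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing exactly k events when the mean count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Hypothetical scenario: a help desk that averages 3 calls per hour.
rate = 3.0
for k in range(6):
    print(f"P({k} calls in an hour) = {poisson_pmf(k, rate):.4f}")
```

The probabilities over all possible counts sum to 1, and the single parameter λ is both the mean and the variance of the distribution.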

How does 'binning' help in dealing with outliers in a dataset?

By dividing the data into intervals and replacing outlier values
By eliminating irrelevant variables
By identifying and removing outliers
By normalizing the data

Binning helps in dealing with outliers by dividing the data into intervals, or 'bins', and replacing the values in each bin, outliers included, with a summary statistic such as the bin mean or median.
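One way to sketch this is smoothing by bin medians with equal-frequency bins, so that an extreme value shares a bin with its nearest neighbours. The function name `smooth_by_bin_median` and the `ages` data are hypothetical, chosen only for illustration.

```python
import statistics

def smooth_by_bin_median(values, n_bins):
    """Equal-frequency binning: replace each value with the median of its bin."""
    # Indices of the values in ascending order, so results can be written
    # back to the original positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bin_size = -(-len(values) // n_bins)  # ceiling division
    smoothed = [None] * len(values)
    for start in range(0, len(order), bin_size):
        bin_indices = order[start:start + bin_size]
        bin_median = statistics.median(values[i] for i in bin_indices)
        for i in bin_indices:
            smoothed[i] = bin_median
    return smoothed

# 250 stands in for a hypothetical data-entry error.
ages = [22, 25, 27, 31, 34, 38, 41, 45, 250]
print(smooth_by_bin_median(ages, n_bins=3))
# prints [25, 25, 25, 34, 34, 34, 45, 45, 45]
```

Equal-frequency bins are a deliberate choice here: with equal-width bins a lone extreme value tends to land in a bin by itself, so its bin median would be the outlier unchanged.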


Suppose you have a data set with many missing values and outliers. In which step of the EDA process would you primarily deal with these issues?

In the communicating phase
In the exploring phase
In the questioning phase
In the wrangling phase

During the 'wrangling' phase of the EDA process, data analysts carry out data cleaning tasks, which include handling missing values and dealing with outliers. Data wrangling involves transforming and cleaning data to enable further exploration and analysis.


How can one interpret the colors in a heatmap?

Colors have no significance in a heatmap
Colors represent different categories of data
Colors represent the magnitude of the data
Darker colors always mean higher values

In a heatmap, colors represent the magnitude of the data. Usually, a color scale is provided for reference, where darker colors often correspond to higher values and lighter colors to lower values. However, the color scheme can vary.


In what situations is it more appropriate to use the interquartile range instead of the standard deviation to measure dispersion?

When the data has no outliers
When the data is normally distributed
When the data is perfectly symmetrical
When the data is skewed or has outliers

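The Interquartile Range (IQR) is a more appropriate measure of dispersion when the data is skewed or has outliers. It depends only on the middle 50% of the data, so it is largely unaffected by extreme values, whereas the standard deviation is inflated by them.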
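A small comparison can make the difference concrete. The sketch below uses the standard library's `statistics.quantiles` with its default method; the two data sets and the helper name `iqr` are invented for the example.

```python
import statistics

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

clean = [10, 12, 13, 15, 16, 18, 19, 21]
with_outlier = clean + [200]  # one hypothetical extreme value

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(f"{name}: stdev = {statistics.stdev(data):.1f}, IQR = {iqr(data):.1f}")
```

Adding the single extreme value multiplies the standard deviation many times over, while the IQR moves only slightly.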


For data with outliers, the _____ is typically a better measure of central tendency as it is less sensitive to extreme values.

Mean
Median
Mode
Variance

The "Median" is less sensitive to extreme values, or outliers, in a dataset. Therefore, it's often a better measure of central tendency when outliers are present.
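The effect is easy to see on a tiny example. The salary figures below are hypothetical; the code uses only the standard library.

```python
import statistics

salaries = [42_000, 45_000, 47_000, 50_000, 52_000]  # hypothetical values
with_outlier = salaries + [1_000_000]                # one extreme salary added

print("mean:  ", statistics.mean(salaries), "->", statistics.mean(with_outlier))
print("median:", statistics.median(salaries), "->", statistics.median(with_outlier))
```

Here the mean jumps from 47,200 to 206,000, while the median only moves from 47,000 to 48,500.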


If you are working with a large data set and need to produce interactive visualizations for a web application, which Python library would be the most suitable?

Bokeh
Matplotlib
Plotly
Seaborn

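Plotly is well suited to creating interactive visualizations and can handle large data sets efficiently. Its figures render in the browser and can be embedded in web applications, making it ideal for this scenario. Bokeh is also designed for interactive web visualization, whereas Matplotlib and Seaborn primarily produce static plots.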


What type of bias could be introduced by mean/median/mode imputation, particularly if the data is not missing at random?

Confirmation bias
Overfitting bias
Selection bias
Underfitting bias

Mean/median/mode imputation can introduce 'Selection Bias', particularly when the data is not missing at random. The substituted values do not reflect the reasons behind the missingness, so the imputed data set can understate variability and distort the true relationships between variables.
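A minimal sketch of this effect, assuming a made-up income survey in which the highest earners are the ones who did not respond (so the data is not missing at random):

```python
import statistics

true_incomes = [30, 35, 40, 45, 50, 80, 90, 100]  # hypothetical, in thousands

# Assumption for the example: everyone earning 80 or more skipped the question.
observed = [x if x < 80 else None for x in true_incomes]

# Mean imputation: fill each gap with the mean of the observed values.
observed_mean = statistics.mean(x for x in observed if x is not None)
imputed = [observed_mean if x is None else x for x in observed]

print("true:   ", statistics.mean(true_incomes), round(statistics.stdev(true_incomes), 1))
print("imputed:", statistics.mean(imputed), round(statistics.stdev(imputed), 1))
```

Because the missing values were systematically high, the imputed data underestimates both the mean and the spread of the true incomes.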


How can regularization techniques contribute to feature selection?

By adding a penalty term to the loss function
By avoiding overfitting
By reducing model complexity
By shrinking coefficients towards zero

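Regularization techniques contribute to feature selection by shrinking the coefficients of less important features towards zero. L1 regularization (lasso) can set some coefficients to exactly zero, which removes those features from the model and thus achieves feature selection. L2 regularization (ridge) shrinks coefficients but does not set them exactly to zero.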
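The shrinkage can be sketched under a simplifying assumption: with an orthonormal design matrix, the lasso solution is the soft-thresholded least-squares coefficient, and the ridge solution is the least-squares coefficient divided by (1 + λ). The coefficient values, feature names, and helper names below are hypothetical.

```python
def lasso_shrink(beta: float, lam: float) -> float:
    """Soft-thresholding: the L1 (lasso) solution under an orthonormal design."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0  # small coefficients are set exactly to zero

def ridge_shrink(beta: float, lam: float) -> float:
    """The L2 (ridge) solution under the same assumption: pure rescaling."""
    return beta / (1.0 + lam)

# Hypothetical least-squares coefficients for four features.
ols_coefficients = {"x1": 2.5, "x2": -0.3, "x3": 0.1, "x4": -1.2}
lam = 0.5

lasso = {name: lasso_shrink(b, lam) for name, b in ols_coefficients.items()}
ridge = {name: ridge_shrink(b, lam) for name, b in ols_coefficients.items()}

selected = [name for name, b in lasso.items() if b != 0.0]
print("lasso keeps:", selected)  # prints lasso keeps: ['x1', 'x4']
print("ridge keeps:", [name for name, b in ridge.items() if b != 0.0])
```

In this sketch the L1 penalty drops x2 and x3 entirely, while the L2 penalty keeps all four features with smaller coefficients.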


What type of data visualization method is typically color-coded to represent different values?

Heatmap
Histogram
Line plot
Scatter plot

Heatmaps are typically color-coded to represent different values. In a heatmap, data values are represented as colors, making it an excellent tool for visualizing large amounts of data and the correlation between different variables.
