How does improper handling of missing data impact the precision-recall trade-off in a model?
- Degrades both precision and recall.
- Degrades precision but improves recall.
- Improves both precision and recall.
- Improves precision but degrades recall.
Incorrectly handling missing data can distort what the model learns, leading to misclassification and degrading both precision (more false positives among the predicted positives) and recall (more true positives missed).
Suppose you have a data set with many missing values and outliers. In which step of the EDA process would you primarily deal with these issues?
- In the communicating phase
- In the exploring phase
- In the questioning phase
- In the wrangling phase
During the 'wrangling' phase of the EDA process, analysts perform data-cleaning tasks, which include handling missing values and dealing with outliers. Data wrangling involves transforming and cleaning data to enable further exploration and analysis.
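As a minimal wrangling sketch (the column name and values are hypothetical), filling missing values with the median and clipping outliers with the 1.5×IQR rule might look like:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one missing value and one obvious outlier (240).
df = pd.DataFrame({"age": [25, 31, np.nan, 29, 240, 27, 33]})

# Step 1: impute the missing value with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Step 2: bound outliers using the 1.5 * IQR rule.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["age"] = df["age"].clip(lower, upper)
```

Whether to clip, drop, or keep the outliers depends on the analysis; clipping is shown here only because it keeps the row count intact.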
How does 'binning' help in dealing with outliers in a dataset?
- By dividing the data into intervals and replacing outlier values
- By eliminating irrelevant variables
- By identifying and removing outliers
- By normalizing the data
Binning helps in dealing with outliers by dividing the data into intervals or 'bins' and replacing outlier values with summary statistics like the bin mean or median.
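A short sketch of this idea, using hypothetical values and pandas' quantile binning, replaces each value with the median of its bin so the extreme value is pulled toward typical ones:

```python
import pandas as pd

# Hypothetical measurements with one extreme outlier (95).
s = pd.Series([2, 3, 4, 5, 6, 7, 8, 9, 10, 95])

# Split into two equal-frequency bins, then replace each value
# with the median of its own bin (smoothing by bin medians).
bins = pd.qcut(s, q=2)
smoothed = s.groupby(bins).transform("median")
```

Here the outlier 95 lands in the upper bin and is replaced by that bin's median, 9, while the lower-bin values are all replaced by 4.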
The _____ Distribution is used for modeling the number of times an event occurs in an interval of time or space.
- Binomial
- Normal
- Poisson
- Uniform
The Poisson Distribution is used for modeling the number of times an event occurs in an interval of time or space, such as the number of calls a call center receives per minute.
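A small worked example, using only the standard library and the Poisson formula P(X = k) = e^(−λ) λ^k / k!, with a hypothetical rate of 4 calls per minute:

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson random variable with rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Average of 4 calls per minute: probability of exactly 2 calls
# arriving in a one-minute interval.
p_two = poisson_pmf(2, 4.0)  # ~0.1465
```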
When would a scatter plot be less effective in identifying outliers?
- When the data has no correlation
- When the data is normally distributed
- When the data points are closely grouped
- When there are many data points
A scatter plot may be less effective in identifying outliers when the data points are closely grouped because it would be hard to visually identify points that are far away from the others.
What are the key statistical tools used in Confirmatory Data Analysis (CDA)?
- Box-Plot, Scatter Plot, Histogram, and Density Plots
- Hypothesis Testing, Regression Analysis, Chi-Squared Test, and ANOVA
- PCA, LDA, t-SNE, and UMAP
- Random Forests, SVM, Neural Networks, and Gradient Boosting
In CDA, the primary goal is to confirm or refute the hypotheses that were generated during EDA. Key statistical tools used in CDA include Hypothesis Testing, Regression Analysis, Chi-Squared Test, and Analysis of Variance (ANOVA).
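As a sketch of the CDA step, suppose EDA suggested that one group scores higher than another (the groups and values below are hypothetical); a two-sample t-test from `scipy.stats` tests that hypothesis formally:

```python
from scipy import stats

# Hypothetical EDA finding: group B appears to score higher than group A.
group_a = [12.1, 11.8, 12.4, 11.9, 12.0, 12.2]
group_b = [12.9, 13.1, 12.8, 13.3, 12.7, 13.0]

# CDA step: a two-sample t-test of the null hypothesis that
# both groups have the same mean.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
# A small p-value (e.g. < 0.05) supports rejecting the null hypothesis.
```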
The choice of graph for data visualization largely depends on the __________ of the dataset.
- File format
- Shape
- Size
- Type of variables
The choice of graph for data visualization largely depends on the type of variables in the dataset. For example, categorical variables are best represented with bar charts or pie charts, while continuous variables might be better shown with histograms or box plots.
The ________ correlation coefficient is based on the ranks of data rather than the actual values.
- Covariance
- Kendall's Tau
- Pearson's
- Spearman's
Spearman's correlation coefficient is based on the ranks of the data rather than the actual values. This makes it suitable for ordinal variables and resistant to outliers.
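A quick sketch with made-up data shows the rank-based behavior: one large outlier weakens the Pearson coefficient, but because the outlier does not change the ranks, Spearman's coefficient still reports a perfect monotonic relationship:

```python
from scipy import stats

# Hypothetical monotonic data; the last y value is an extreme outlier.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 500]

pearson_r, _ = stats.pearsonr(x, y)     # pulled away from 1 by the outlier
spearman_rho, _ = stats.spearmanr(x, y) # exactly 1: the ranks still agree
```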
Incorrect handling of missing data can lead to a(n) ________ in model performance.
- amplification
- boost
- degradation
- improvement
Incorrectly handling missing data can distort the data, thereby negatively affecting the model's ability to learn accurately from it and leading to a degradation in the model's performance.
In what situations is it more appropriate to use the interquartile range instead of the standard deviation to measure dispersion?
- When the data has no outliers
- When the data is normally distributed
- When the data is perfectly symmetrical
- When the data is skewed or has outliers
The interquartile range (IQR) is a more appropriate measure of dispersion when the data is skewed or contains outliers, because it is not affected by extreme values.
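A minimal numeric sketch (hypothetical values) makes the contrast concrete: a single extreme value inflates the standard deviation dramatically while leaving the IQR near the spread of the typical values:

```python
import numpy as np

# Six typical values plus one extreme outlier.
data = np.array([10, 11, 12, 13, 14, 15, 1000])

std = data.std()                         # inflated by the outlier
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                            # reflects the typical spread
```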
How can one interpret the colors in a heatmap?
- Colors have no significance in a heatmap
- Colors represent different categories of data
- Colors represent the magnitude of the data
- Darker colors always mean higher values
In a heatmap, colors represent the magnitude of the data. Usually, a color scale is provided for reference, where darker colors often correspond to higher values and lighter colors to lower values. However, the color scheme can vary.
What type of bias could be introduced by mean/median/mode imputation, particularly if the data is not missing at random?
- Confirmation bias
- Overfitting bias
- Selection bias
- Underfitting bias
Mean/median/mode imputation, particularly when data is not missing at random, can introduce selection bias. The substituted values may not reflect the reasons behind the missingness, leading to underestimated variability and a distorted picture of the true relationships between variables.
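A toy sketch (fabricated numbers, chosen only to illustrate the mechanism) shows the bias: if high incomes are the ones that go missing, mean imputation fills the gaps with the mean of the remaining low values, pulling the estimated average well below the true one:

```python
import numpy as np

# Hypothetical incomes; suppose high earners tend not to report (not
# missing at random), so the three highest values become missing.
true_incomes = np.array([30, 35, 40, 45, 90, 95, 100], dtype=float)
observed = true_incomes.copy()
observed[true_incomes > 80] = np.nan

# Mean imputation uses only the observed (low) values,
# biasing the imputed mean downward and shrinking the variance.
fill = np.nanmean(observed)
imputed = np.where(np.isnan(observed), fill, observed)
```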