In what scenario would a modified Z-score be beneficial to use for outlier detection?

When data is bimodal
When data is normally distributed
When data is skewed or has outliers
When data is uniformly distributed

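A modified Z-score is beneficial for outlier detection when data is skewed or already contains outliers. It replaces the mean and standard deviation with the median and the median absolute deviation (MAD), so extreme values distort it far less than they distort the traditional Z-score.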
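
As a minimal sketch of the idea above, the following computes modified Z-scores from the median and MAD. The scaling constant 0.6745 and the cutoff of 3.5 are commonly cited conventions, not something the question specifies, and the function name and sample data are made up for illustration.

```python
from statistics import median

def modified_z_scores(values):
    """Return modified Z-scores based on the median and the MAD."""
    med = median(values)
    # Median absolute deviation: median of |x - median|
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    # 0.6745 is a conventional scaling constant (assumed here)
    return [0.6745 * (v - med) / mad for v in values]

data = [10, 12, 11, 13, 12, 11, 95]  # hypothetical sample with one extreme value
scores = modified_z_scores(data)

# 3.5 is a commonly used cutoff, treated here as an assumption
outliers = [v for v, s in zip(data, scores) if abs(s) > 3.5]
print(outliers)
```

Because the median and MAD barely move when 95 is added, the extreme value stands out clearly instead of inflating the scale it is measured against.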

How is the model-based method different from the other two imputation methods?

It deletes missing data
It estimates missing values based on a statistical model
It is not different from the others
It uses the mode value for imputation

The model-based method is different from the other imputation methods as it estimates missing values based on a statistical model. This method assumes a specific statistical model (like a linear regression, logistic regression, etc.) that generates the data, and missing values are filled in based on this model.
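
One possible illustration of model-based imputation, assuming a simple linear relationship between a fully observed variable `x` and a partly missing variable `y`. The data, the helper `fit_line`, and the choice of ordinary least squares are all assumptions made for this sketch.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical rows of (x, y); None marks a missing y value
rows = [(1, 2.0), (2, 4.0), (3, None), (4, 8.0), (5, 10.0)]

# Fit the model on complete rows only
observed = [(x, y) for x, y in rows if y is not None]
slope, intercept = fit_line([x for x, _ in observed], [y for _, y in observed])

# Fill each missing y with the model's prediction for that x
imputed = [(x, y if y is not None else slope * x + intercept) for x, y in rows]
print(imputed)
```

The filled-in value depends on the row's own `x`, which is what separates this approach from filling every gap with a single summary value.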

Under what conditions would the median be a better measure of central tendency than the mean?

When the data has an even number of observations
When the data has outliers or is skewed
When the data is normally distributed
When the data is uniformly distributed

The median would be a better measure of central tendency than the mean when the data has outliers or is skewed. In these cases, the mean can be heavily influenced by the extreme values, while the median, being the middle value, remains more robust and representative of the central location of the data.
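
A small sketch of this effect, using made-up salary figures (in thousands) where a single extreme value is appended:

```python
from statistics import mean, median

salaries = [30, 32, 35, 36, 38]   # hypothetical values
with_outlier = salaries + [500]   # one extreme observation added

print(mean(salaries), median(salaries))
print(mean(with_outlier), median(with_outlier))
```

The mean jumps from about 34 to well over 100, while the median moves only from 35 to 35.5.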

How does the 'subplot' function in Matplotlib differ from 'FacetGrid' in Seaborn?

FacetGrid allows the creation of multi-plot grids based on row and column-wise grouping of the data
FacetGrid supports interactive plotting
Subplot can create only single plots
Subplot does not allow the sharing of axes

The 'subplot' function in Matplotlib places individual subplots within a single figure, but each one must be positioned and populated manually, so it offers no direct way to build a grid of plots from categorical variables. 'FacetGrid' in Seaborn, by contrast, creates multi-plot grids automatically by grouping the data along rows and columns.

_____ data is numerical in nature and can be ordered or measured.

Nominal
Ordinal
Qualitative
Quantitative

Quantitative data is numerical, measurable, and can be used with mathematical operations.

In a situation where the initial 'questioning' phase did not yield actionable insights, what might be the next step in the EDA process?

Jump to the concluding phase to draw insights
Proceed to the exploring phase without adjustment
Revisit the questioning phase to refine or develop new questions
Skip to the communication phase

If the initial 'questioning' phase does not yield actionable insights, it is necessary to revisit the questioning phase to refine or develop new questions. The questions set the direction of the analysis and are crucial for subsequent steps. If the questions are not well defined or not actionable, it could lead to an ineffective analysis.

________ is a measure of dispersion that is particularly useful when the data set has outliers.

Interquartile Range
Range
Standard Deviation
Variance

The Interquartile Range (IQR) is particularly useful when the dataset has outliers because it considers only the middle 50% of the data. Values in the tails do not affect it, which makes it a robust measure of dispersion, unlike the range, variance, and standard deviation.
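
To make the robustness concrete, here is a rough comparison on two made-up samples that differ only in their largest value. The quartiles come from `statistics.quantiles` with its default method; other quartile conventions can give slightly different numbers.

```python
from statistics import quantiles, stdev

def iqr(values):
    """Interquartile range: Q3 - Q1 (default quantile method of the statistics module)."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

clean = [10, 12, 13, 14, 15, 16, 18]          # hypothetical sample
contaminated = [10, 12, 13, 14, 15, 16, 180]  # same sample with one extreme value

for name, sample in (("clean", clean), ("contaminated", contaminated)):
    value_range = max(sample) - min(sample)
    print(name, "IQR:", iqr(sample), "range:", value_range, "stdev:", round(stdev(sample), 2))
```

The IQR is identical for both samples, while the range and standard deviation grow sharply once the extreme value is present.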

In a scatter plot, outliers often appear as points that are far removed from the ___________.

axes
main concentration of data
origin
trend line

In a scatter plot, outliers are often represented as points that are far removed from the main concentration of data.

How does the role of data visualization differ in EDA, CDA, and Predictive Modeling?

Data visualization is not essential in any of these processes.
Data visualization is only used in EDA.
Data visualization is used in EDA to explore, in CDA to confirm, and in Predictive Modeling to represent the final model.
Data visualization plays the same role in EDA, CDA, and Predictive Modeling.

Data visualization plays different roles in each of these processes. In EDA, it is used to explore data and identify initial patterns or anomalies. In CDA, it can be used to represent statistical tests and confirm hypotheses. In Predictive Modeling, it is often used to represent the final model or visualize prediction results.

Can the Binomial Distribution be used to model the number of successes in a fixed number of Bernoulli trials?

No
Only for large sample sizes
Only for small sample sizes
Yes

Yes, the Binomial Distribution is used exactly for this purpose. It models the number of successes in a fixed number of independent Bernoulli trials each with the same probability of success.
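
For reference, the probability of exactly k successes in n such trials is C(n, k) * p^k * (1 - p)^(n - k). A short sketch of that formula, with n = 10 and p = 0.5 chosen purely as example values:

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5  # example values, e.g. ten fair coin flips
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

print(pmf[5])    # probability of exactly 5 successes
print(sum(pmf))  # probabilities over all k should sum to 1 (up to rounding)
```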

How are outliers usually represented in a boxplot?

As points outside the box
As points outside the whiskers
As the median of the boxplot
As the quartiles of the boxplot

In a boxplot, outliers are typically represented as points that fall outside of the whiskers (the lines extending from the box, indicating variability outside the upper and lower quartiles).

Can you describe the basic idea behind the Interquartile Range (IQR) method for outlier detection?

It calculates the difference between the 75th and 25th percentile
It involves the calculation of Z-scores
It is based on mean
It is based on standard deviation

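The basic idea behind the Interquartile Range (IQR) method for outlier detection is to calculate the difference between the 75th percentile (Q3) and the 25th percentile (Q1). This range represents the middle 50% of the data. Points that fall far outside it, conventionally below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, are flagged as potential outliers.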
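
A minimal sketch of the method, assuming the conventional multiplier of 1.5 for the fences and the default quartile method of Python's `statistics.quantiles`. The function name and the data are illustrative only.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 13, 14, 15, 16, 180]  # hypothetical sample
flagged = iqr_outliers(data)
print(flagged)
```

Since no mean or standard deviation is involved, the extreme value cannot widen the fences enough to hide itself.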
