Which of the following could be a possible reason for the presence of outliers in a dataset?

All of these
Data entry errors
Data processing errors
Measurement errors

Outliers can arise from data entry errors, measurement errors, and data processing errors. For example, typing an extra digit while recording a value, or using a miscalibrated measurement device, can produce values that lie far from the rest of the data.
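To make the data entry case concrete, here is a minimal sketch that flags an extra-digit typo using Tukey's 1.5 × IQR rule. The function name `iqr_outliers` and the height values are hypothetical, chosen only for illustration; the IQR rule is one common convention, not the only way to detect outliers.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical heights in cm; 1720 stands in for 172 typed with an extra digit.
heights_cm = [165, 170, 172, 168, 175, 1720, 169, 171]
flagged = iqr_outliers(heights_cm)
print(flagged)  # [1720]
```

A flagged value is only a candidate for review: whether it is an error or a genuine extreme observation has to be decided from context.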


You're conducting a study and have encountered missing data. You opt for a model-based imputation method. Under what circumstances might this approach introduce bias?

If the chosen model fits poorly to the data
If the chosen model is a complex one
If the chosen model is a perfect fit for the data
If the missing data is missing completely at random

Model-based imputation can introduce bias if the chosen model fits the data poorly. If the model does not reflect the true data-generating process, the imputed values may be systematically too high or too low, which can lead to incorrect conclusions.
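The effect of a poorly fitting imputation model can be sketched with synthetic data. In the example below, everything is an assumption made for illustration: the true relationship is taken to be y = x², the values of y are taken to be missing for the largest x, and a straight line (the hypothetical helper `fit_line`) plays the role of the misspecified model.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

# Synthetic data: the true relationship is quadratic, y = x^2.
xs = list(range(1, 11))
true_ys = [x ** 2 for x in xs]

# Assume y is missing for the three largest x values.
observed = [(x, y) for x, y in zip(xs, true_ys) if x <= 7]
missing_xs = [x for x in xs if x > 7]

# Misspecified imputation model: a straight line fitted to the observed rows.
a, b = fit_line([x for x, _ in observed], [y for _, y in observed])
imputed = {x: a + b * x for x in missing_xs}

# Imputation error (imputed minus true value) for each missing row.
errors = {x: imputed[x] - x ** 2 for x in missing_xs}
print(errors)
```

Every error has the same sign: the linear model underestimates each missing value, so the errors do not cancel out. That one-directional error is what systematic bias means here.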


Why is EDA considered a crucial step before proceeding with Confirmatory Data Analysis (CDA)?

Because EDA helps to formulate hypotheses that can be tested in CDA
Because EDA involves applying ML models to the data
Because EDA is a requirement for most regulatory bodies
Because EDA results in a finalized data report

EDA is considered a crucial step before proceeding with CDA because it helps to formulate hypotheses that can be tested in CDA. EDA involves exploring the data to understand its main characteristics and patterns, which can then inform the formulation of hypotheses in the confirmatory phase.


In which stage of the data analysis process is Confirmatory Data Analysis (CDA) typically used?

After EDA
After Predictive Modeling
Before EDA
Before data collection

CDA typically comes after the EDA stage in the data analysis process. EDA allows analysts to explore the data and generate hypotheses, while CDA applies statistical tests to confirm or refute those hypotheses.


In a data set where values are uniformly distributed across the range, how would the mean, median and mode compare?

Mean would be the highest
Median would be the highest
Mode would be the highest
They would all be the same

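In a uniform distribution, all values occur with the same frequency. The distribution is symmetric, so the mean and the median coincide at the center of the range. Strictly speaking, the mode is not unique, because every value ties for most frequent; the center is one of those modes, which is the sense in which the mean, median, and mode are treated as the same here.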
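A quick check with Python's `statistics` module shows how the three measures behave on a uniform sample. The sample below is hypothetical (each integer from 1 to 9 appearing exactly twice). Note that `multimode` reports every value as a mode, since all frequencies tie.

```python
import statistics

# Hypothetical discrete uniform sample: each value from 1 to 9 appears twice.
values = [v for v in range(1, 10) for _ in range(2)]

mean = statistics.mean(values)
median = statistics.median(values)
modes = statistics.multimode(values)  # all values that tie for most frequent

print(mean, median)  # both equal 5, the center of the range
print(modes)         # every value from 1 to 9 is a mode
```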


In your EDA process, you notice that one particular feature has negligible variance. How would you interpret this in the context of your analysis and the overall dataset?

This feature is the least important one
This feature is the most important one
This feature should be converted into a binary feature
This feature should be used to create new features

A feature with negligible variance is likely to have little influence on the outcome variable. Because it is nearly constant, it provides almost no information that distinguishes one observation from another, so a model has little to learn from it. Depending on the context and the objectives of your analysis, you might consider dropping this feature.
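One way to spot such features during EDA is to compute each column's variance and compare it against a cutoff. The sketch below assumes a hypothetical helper `low_variance_features`, made-up column names, and an arbitrary threshold of 1e-3. A sensible threshold depends on each feature's scale, so treat the number as a placeholder.

```python
import statistics

def low_variance_features(columns, threshold=1e-3):
    """Return names of columns whose population variance is below threshold."""
    return [name for name, vals in columns.items()
            if statistics.pvariance(vals) < threshold]

# Hypothetical dataset stored as column name -> list of values.
data = {
    "age": [23, 35, 41, 29, 52],
    "is_active": [1, 1, 1, 1, 1],                      # constant
    "sensor_offset": [0.50, 0.50, 0.51, 0.50, 0.50],   # nearly constant
}

candidates = low_variance_features(data)
print(candidates)  # ['is_active', 'sensor_offset']
```

The returned names are candidates for removal, not an automatic drop list; a rare but meaningful value in a near-constant column can still matter for some analyses.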


What type of graph is often used to understand the underlying frequency distribution of data?

Bar chart
Histogram
Line graph
Pie chart

A histogram is often used to understand the underlying frequency distribution of data. It groups values into consecutive ranges (bins), and the height of each bar shows how many values fall into that range.
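The binning step behind a histogram can be sketched in a few lines of plain Python. The function `histogram_counts`, the scores, and the bin width of 10 are all hypothetical choices for illustration; plotting libraries offer their own binning options.

```python
def histogram_counts(values, bin_width, start):
    """Count values per half-open bin [lo, lo + bin_width), keyed by lo."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)
        lo = start + index * bin_width
        counts[lo] = counts.get(lo, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical exam scores.
scores = [52, 57, 61, 64, 66, 68, 71, 73, 75, 78, 82, 95]
BIN_WIDTH = 10
counts = histogram_counts(scores, bin_width=BIN_WIDTH, start=50)

# Text rendering: one '#' per value in the bin.
for lo, n in counts.items():
    print(f"[{lo}, {lo + BIN_WIDTH}) {'#' * n}")
```

This sketch only records bins that contain at least one value; an empty bin would simply be absent from the result.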


Why might an outlier not be visible in a box plot?

If the box plot is not correctly drawn
If the data is normally distributed
If the dataset is very large
If the outlier is close to the whisker

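An outlier might not be visible in a box plot if it lies close to the whisker. Box plots conventionally draw as separate points only the values beyond the whisker limits (commonly 1.5 times the interquartile range past the quartiles). A value that falls within that range is absorbed into the whisker and is not marked as an outlier, even if it is unusual for the dataset.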
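The whisker rule can be checked numerically. The sketch below assumes the common 1.5 × IQR convention and the "inclusive" quartile method from Python's `statistics` module; plotting libraries may compute quartiles slightly differently. The function `boxplot_outliers` and both datasets are hypothetical.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return values a box plot would draw as separate points (beyond k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Two hypothetical samples that differ only in their largest value.
near_whisker = [10, 12, 13, 14, 15, 16, 17, 18, 22]
past_whisker = [10, 12, 13, 14, 15, 16, 17, 18, 25]

# For both samples Q1 = 13 and Q3 = 17, so the upper limit is 17 + 1.5 * 4 = 23.
print(boxplot_outliers(near_whisker))  # [] because 22 is inside the limit
print(boxplot_outliers(past_whisker))  # [25]
```

The value 22 sits well above the bulk of the data yet stays inside the limit, so the whisker simply extends to it and no separate point is drawn.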


What role does the audience play in choosing the right graph for data visualization?

The audience should be consulted during the design process
The audience's familiarity with different types of graphs is irrelevant
The audience's preferences should always dictate the type of graph
Understanding the audience's knowledge and familiarity with different types of graphs can help choose the most effective one

Understanding the audience's knowledge and familiarity with different types of graphs can significantly influence the choice of graph. For a general audience, simpler graphs like bar charts or line graphs may be more suitable, whereas a more technical audience might be comfortable with more complex visualizations like heatmaps or network diagrams.


What are some of the adverse impacts of Multicollinearity on the coefficients of a linear regression model?

All of these.
It inflates the standard errors of the coefficients.
It makes the model unstable.
It weakens the statistical power of the model.

Multicollinearity affects the coefficients of a linear regression model in three ways. It makes them unstable, so that small changes in the data cause large swings in the estimates. It inflates their standard errors, so that individual coefficients are less likely to appear statistically significant. As a result, it weakens the statistical power of the model, decreasing the chances of detecting effects that are really there.
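The inflation of standard errors is often summarized by the variance inflation factor (VIF). For a model with exactly two predictors, the VIF reduces to 1 / (1 - r²), where r is the correlation between the predictors, and the standard error of each coefficient grows by a factor of √VIF. The sketch below uses that two-predictor special case only; the helper names and the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical predictors: x2 is nearly a rescaled copy of x1, x3 is not.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
x3 = [5, 1, 7, 3, 8, 2, 6, 4]

vif_collinear = vif_two_predictors(x1, x2)
vif_independent = vif_two_predictors(x1, x3)
print(f"x1 with x2: VIF = {vif_collinear:.1f}")
print(f"x1 with x3: VIF = {vif_independent:.3f}")
```

A VIF close to 1 indicates almost no inflation, while the nearly collinear pair produces a VIF in the hundreds. With more than two predictors, each VIF comes from regressing one predictor on all the others, which this sketch does not cover.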
