Which phase of data analysis does EDA typically precede?

  • Data Cleaning
  • Data Gathering
  • Data Modelling
  • Data Visualization
In the data analysis pipeline, EDA typically precedes Data Modelling. EDA provides a critical foundation before the modeling phase: it builds an in-depth understanding of the data, highlights key trends, identifies outliers and anomalies, and uncovers underlying patterns and relationships. This knowledge informs the choice of the most appropriate models and helps validate their assumptions.

Why is EDA an essential step in data analysis?

  • All of the mentioned
  • It can help to detect errors in the data
  • It facilitates more accurate hypothesis or model selection
  • It helps to understand the underlying structure of data
EDA is essential because it allows analysts to understand the underlying structure of the data, detect potential issues such as outliers and errors, and formulate more accurate hypotheses for later stages of analysis. By conducting EDA, analysts can also assess the quality and cleanliness of the data, decide on the necessary preprocessing steps, and determine the most suitable analytical models.
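A minimal sketch of a first EDA pass with pandas, assuming a hypothetical `data.csv` file (the file name and columns are placeholders for illustration):

```python
import pandas as pd

# Load a hypothetical dataset (file name is an assumption for illustration)
df = pd.read_csv("data.csv")

# Understand the underlying structure: column types and non-null counts
print(df.info())

# Summary statistics help spot implausible values and potential errors
print(df.describe())

# Quantify missingness before deciding on preprocessing steps
print(df.isna().sum())
```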

Which of the following could be a possible reason for the presence of outliers in a dataset?

  • All of these
  • Data entry errors
  • Data processing errors
  • Measurement errors
Outliers can arise from various sources, including data entry errors, measurement errors, and data processing errors. For example, an extra digit typed while recording a value, or a miscalibrated measurement device, can each produce points that fall far outside the expected range.
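A common way to surface such suspect values during EDA is the interquartile-range (IQR) rule. The sketch below uses made-up numbers that include a data-entry error (an extra digit); it is illustrative only:

```python
import pandas as pd

# A value of 520 instead of 52 (extra digit) slips in as a data entry error
ages = pd.Series([23, 31, 27, 45, 52, 38, 520, 29])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as potential outliers
outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]
print(outliers)  # the erroneous 520 is flagged
```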

How is the standard deviation related to the variance of a data set?

  • It's the average of the variance
  • It's the median of the variance
  • It's the square of the variance
  • It's the square root of the variance
The "Standard Deviation" is the "Square root of the Variance" of a data set. It measures the average distance that the data points deviate from the mean.

Which technique is NOT commonly used for handling outliers in a dataset?

  • Discretization
  • Smoothing
  • Standardization
  • Truncation
Smoothing is not commonly used for handling outliers; it is typically applied to remove noise from data. Truncation, by contrast, directly caps extreme values, as illustrated below.
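A minimal sketch of truncation (often called clipping or winsorizing) with made-up values, capping at the 1.5 * IQR fences:

```python
import pandas as pd

values = pd.Series([12, 15, 14, 13, 200, 16, 11])  # 200 is an outlier

# Truncation: cap extreme values at the 1.5 * IQR fences instead of dropping them
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
truncated = values.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

print(truncated.tolist())  # 200 is pulled back to the upper fence
```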

Which Python library is typically used for creating basic 2D plots?

  • Matplotlib
  • Pandas
  • Plotly
  • Seaborn
Matplotlib is the primary Python library for creating basic 2D plots, and it also supports static, animated, and interactive visualizations. It is highly customizable and forms the foundation for many other visualization libraries, such as Seaborn.
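A minimal Matplotlib example of a basic 2D line plot (the data is arbitrary):

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

plt.plot(x, y, marker="o")   # basic 2D line plot with point markers
plt.xlabel("x")
plt.ylabel("y = x^2")
plt.title("Basic 2D plot with Matplotlib")
plt.show()
```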

You're faced with a dataset where missing values are missing not at random (MNAR). What advanced imputation method would you choose and why?

  • Mean imputation, as it's simple
  • Model-based method, as it can model the missing data mechanism
  • Multiple imputation, as it can handle large data
  • Regression imputation, as it considers relationships
For data missing not at random (MNAR), a model-based method is preferable because it allows the missing-data mechanism to be modeled explicitly. Accounting for the systematic pattern in the missingness reduces bias in the imputed values.
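One concrete option in this family is scikit-learn's IterativeImputer, which models each feature with missing values as a function of the other features. Whether it adequately captures a given MNAR mechanism depends on the problem, so treat this as a sketch rather than a recipe; the toy array is invented:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data: the second column has missing entries (np.nan)
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [4.0, np.nan],
    [5.0, 10.0],
])

# Each feature with missing values is regressed on the remaining features
imputer = IterativeImputer(random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```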

Imagine you need to compare the distribution of ages across different genders. Which plot would you use and why?

  • Bar graph
  • Line graph
  • Scatter plot
  • Violin plot
A violin plot would be an ideal choice for comparing the distribution of ages across genders. It shows a kernel density estimate of the distribution for each category, making it easy to compare the shape, spread, and central tendency of ages between groups.
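A seaborn sketch with a small made-up DataFrame (the column names and values are assumptions for illustration):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up ages and genders, purely for illustration
df = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "age":    [23,  35,  29,  41,  31,  27,  45,  38],
})

# One violin per gender: the width at each age reflects the estimated density
sns.violinplot(data=df, x="gender", y="age")
plt.title("Age distribution by gender")
plt.show()
```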

In what scenarios could mean/median/mode imputation lead to a misleading interpretation of the data?

  • When data has many outliers
  • When data is missing completely at random
  • When data is missing systematically
  • When data is normally distributed
Mean/median/mode imputation can lead to a misleading interpretation of the data when values are missing systematically, i.e. 'not at random'. In that case the imputation introduces bias, because the filled-in values do not reflect the mechanism behind the missingness, and it distorts the true distribution of the data.
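The distortion is easy to see in a toy example where high values are systematically missing (the numbers are invented):

```python
import numpy as np
import pandas as pd

# Suppose older respondents tend to skip the age question (systematic missingness)
ages = pd.Series([22, 25, 28, 30, 33, np.nan, np.nan, np.nan])

# Mean imputation fills the gaps with the mean of the *observed* values only
imputed = ages.fillna(ages.mean())

print(ages.mean())     # mean of observed values, biased low for this sample
print(imputed.mean())  # unchanged by imputation, so the bias persists
print(ages.std(), imputed.std())  # spread is artificially reduced
```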

In which type of graph would the presence of outliers possibly distort the overall distribution of the data?

  • Bar Chart
  • Histogram
  • Line Graph
  • Pie Chart
Outliers can distort the overall distribution shown by a Histogram. Because the bins must span the full range of the data, a few extreme values stretch that range, so the bins become very wide (or mostly empty) and the bulk of the data is squeezed into a small part of the plot.
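A quick Matplotlib sketch of the effect (the values are made up):

```python
import matplotlib.pyplot as plt

# Mostly values between 10 and 20, plus one extreme outlier at 500
data = [12, 14, 15, 15, 16, 17, 18, 18, 19, 500]

# With a fixed bin count, the outlier stretches the bin range:
# nearly all points collapse into the first bin and the shape is lost
plt.hist(data, bins=10)
plt.title("Histogram distorted by a single outlier")
plt.show()
```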