Which phase of data analysis does EDA typically precede?

  • Data Cleaning
  • Data Gathering
  • Data Modelling
  • Data Visualization
In the data analysis pipeline, EDA typically precedes Data Modelling. EDA provides a critical foundation before the modeling phase: it builds an in-depth understanding of the data, highlights key trends, identifies outliers and anomalies, and uncovers underlying patterns and relationships. This knowledge informs the choice of the most appropriate models and helps validate their assumptions.

Why is EDA an essential step in data analysis?

  • All of the mentioned
  • It can help to detect errors in the data
  • It facilitates more accurate hypothesis or model selection
  • It helps to understand the underlying structure of data
EDA is essential because it allows analysts to understand the underlying structure of the data, detect potential issues such as outliers and errors, and formulate more accurate hypotheses for later stages of analysis. By conducting EDA, analysts can also assess the quality and cleanliness of the data, decide on the necessary preprocessing steps, and determine the most suitable analytical models.
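A minimal sketch of a first EDA pass with pandas, assuming a hypothetical `data.csv` file (the file name and columns are placeholders for illustration):

```python
import pandas as pd

# Load a hypothetical dataset (file name is an assumption for illustration)
df = pd.read_csv("data.csv")

# Understand the underlying structure: column types and non-null counts
print(df.info())

# Summary statistics help spot implausible values and potential errors
print(df.describe())

# Quantify missingness before deciding on preprocessing steps
print(df.isna().sum())
```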

Which of the following could be a possible reason for the presence of outliers in a dataset?

  • All of these
  • Data entry errors
  • Data processing errors
  • Measurement errors
Outliers can arise from various sources, including data entry errors, measurement errors, and data processing errors. For example, an extra digit typed while recording a value, or a miscalibrated measurement device, can each produce points that fall far outside the expected range.
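A common way to surface such suspect values during EDA is the interquartile-range (IQR) rule. The sketch below uses made-up numbers that include a data-entry error (an extra digit); it is illustrative only:

```python
import pandas as pd

# A value of 520 instead of 52 (extra digit) slips in as a data entry error
ages = pd.Series([23, 31, 27, 45, 52, 38, 520, 29])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as potential outliers
outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]
print(outliers)  # the erroneous 520 is flagged
```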

How is the standard deviation related to the variance of a data set?

  • It's the average of the variance
  • It's the median of the variance
  • It's the square of the variance
  • It's the square root of the variance
The "Standard Deviation" is the "Square root of the Variance" of a data set. It measures the average distance that the data points deviate from the mean.

Which technique is NOT commonly used for handling outliers in a dataset?

  • Discretization
  • Smoothing
  • Standardization
  • Truncation
Smoothing is not commonly used for handling outliers; it is typically applied to remove noise from data. Truncation, by contrast, directly caps extreme values, as illustrated below.
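A minimal sketch of truncation (often called clipping or winsorizing) with made-up values, capping at the 1.5 * IQR fences:

```python
import pandas as pd

values = pd.Series([12, 15, 14, 13, 200, 16, 11])  # 200 is an outlier

# Truncation: cap extreme values at the 1.5 * IQR fences instead of dropping them
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
truncated = values.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

print(truncated.tolist())  # 200 is pulled back to the upper fence
```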

Which Python library is typically used for creating basic 2D plots?

  • Matplotlib
  • Pandas
  • Plotly
  • Seaborn
Matplotlib is the primary Python library for creating basic 2D plots, and it also supports static, animated, and interactive visualizations. It is highly customizable and forms the foundation for many other visualization libraries, such as Seaborn.
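A minimal Matplotlib example of a basic 2D line plot (the data is arbitrary):

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

plt.plot(x, y, marker="o")   # basic 2D line plot with point markers
plt.xlabel("x")
plt.ylabel("y = x^2")
plt.title("Basic 2D plot with Matplotlib")
plt.show()
```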

You're faced with a dataset where missing values are missing not at random (MNAR). What advanced imputation method would you choose and why?

  • Mean imputation, as it's simple
  • Model-based method, as it can model the missing data mechanism
  • Multiple imputation, as it can handle large data
  • Regression imputation, as it considers relationships
For data missing not at random (MNAR), a model-based method is preferable because it allows the missing-data mechanism to be modeled explicitly. Accounting for the systematic pattern in the missingness reduces bias in the imputed values.
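One concrete option in this family is scikit-learn's IterativeImputer, which models each feature with missing values as a function of the other features. Whether it adequately captures a given MNAR mechanism depends on the problem, so treat this as a sketch rather than a recipe; the toy array is invented:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data: the second column has missing entries (np.nan)
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [4.0, np.nan],
    [5.0, 10.0],
])

# Each feature with missing values is regressed on the remaining features
imputer = IterativeImputer(random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```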

Imagine you need to compare the distribution of ages across different genders. Which plot would you use and why?

  • Bar graph
  • Line graph
  • Scatter plot
  • Violin plot
A violin plot would be an ideal choice for comparing the distribution of ages across genders. It shows a kernel density estimate of the distribution for each category, making it easy to compare the shape, spread, and central tendency of ages between groups.
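A seaborn sketch with a small made-up DataFrame (the column names and values are assumptions for illustration):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up ages and genders, purely for illustration
df = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "age":    [23,  35,  29,  41,  31,  27,  45,  38],
})

# One violin per gender: the width at each age reflects the estimated density
sns.violinplot(data=df, x="gender", y="age")
plt.title("Age distribution by gender")
plt.show()
```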

In what scenarios could mean/median/mode imputation lead to a misleading interpretation of the data?

  • When data has many outliers
  • When data is missing completely at random
  • When data is missing systematically
  • When data is normally distributed
Mean/median/mode imputation can lead to a misleading interpretation of the data when values are missing systematically, i.e. 'not at random'. In that case the imputation introduces bias, because the filled-in values do not reflect the mechanism behind the missingness, and it distorts the true distribution of the data.
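The distortion is easy to see in a toy example where high values are systematically missing (the numbers are invented):

```python
import numpy as np
import pandas as pd

# Suppose older respondents tend to skip the age question (systematic missingness)
ages = pd.Series([22, 25, 28, 30, 33, np.nan, np.nan, np.nan])

# Mean imputation fills the gaps with the mean of the *observed* values only
imputed = ages.fillna(ages.mean())

print(ages.mean())     # mean of observed values, biased low for this sample
print(imputed.mean())  # unchanged by imputation, so the bias persists
print(ages.std(), imputed.std())  # spread is artificially reduced
```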

In which type of graph would the presence of outliers possibly distort the overall distribution of the data?

  • Bar Chart
  • Histogram
  • Line Graph
  • Pie Chart
Outliers can distort the overall distribution shown by a Histogram. Because the bins must span the full range of the data, a few extreme values stretch that range, so the bins become very wide (or mostly empty) and the bulk of the data is squeezed into a small part of the plot.
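A quick Matplotlib sketch of the effect (the values are made up):

```python
import matplotlib.pyplot as plt

# Mostly values between 10 and 20, plus one extreme outlier at 500
data = [12, 14, 15, 15, 16, 17, 18, 18, 19, 500]

# With a fixed bin count, the outlier stretches the bin range:
# nearly all points collapse into the first bin and the shape is lost
plt.hist(data, bins=10)
plt.title("Histogram distorted by a single outlier")
plt.show()
```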