In a survey about income levels, some individuals chose not to disclose their earnings. How would you categorize this missing data?

  • MAR
  • MCAR
  • NMAR
  • Not missing data
This is NMAR (Not Missing at Random) because the probability that an income is missing depends on the unobserved income value itself (i.e., people with higher or lower incomes may be more likely to omit this information).
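To see why NMAR matters, here is a minimal simulation sketch. The income distribution, the 70%/10% non-response rates, and the helper name `is_disclosed` are all assumptions made for illustration; they do not come from any real survey.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical "true" incomes; the log-normal shape is an arbitrary choice.
true_incomes = [random.lognormvariate(10.5, 0.6) for _ in range(10_000)]
threshold = statistics.median(true_incomes)

def is_disclosed(income):
    # Assumed NMAR mechanism: above-median earners skip the question
    # 70% of the time, everyone else only 10% of the time.
    p_missing = 0.7 if income > threshold else 0.1
    return random.random() >= p_missing

observed = [x for x in true_incomes if is_disclosed(x)]

true_mean = statistics.mean(true_incomes)
observed_mean = statistics.mean(observed)
print(f"true mean:     {true_mean:,.0f}")
print(f"observed mean: {observed_mean:,.0f}")  # lower than the true mean
```

Because the missingness depends on the value being hidden, the disclosed incomes under-represent high earners, so simply dropping the missing rows gives a biased estimate.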

Replacing missing values with the median of the existing values is known as _____ imputation.

  • Mean
  • Median
  • Mode
  • Pairwise
Replacing missing values with the median of the existing values is known as 'median' imputation. This technique is useful for skewed distributions as the median is less affected by outliers than the mean.
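A minimal sketch of median imputation using only the standard library. The sample values are hypothetical, and `None` is assumed to be the marker for a missing entry.

```python
import statistics

# Hypothetical skewed sample; None marks a missing value.
values = [30, 32, None, 35, 400, None, 31]

observed = [v for v in values if v is not None]
median_value = statistics.median(observed)  # 32
mean_value = statistics.mean(observed)      # 105.6, pulled up by the 400

# Fill each gap with the median of the observed values.
imputed = [median_value if v is None else v for v in values]
print(imputed)  # [30, 32, 32, 35, 400, 32, 31]
```

Here the single large value drags the mean far above the typical observation, while the median stays close to the bulk of the data.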

Suppose you are using a correlation matrix to understand the relationships between multiple features. You come across a correlation coefficient of -0.85 between two features. What does this indicate?

  • A strong negative linear relationship
  • A strong positive linear relationship
  • A weak positive linear relationship
  • No relationship
A correlation coefficient of -0.85 indicates a strong negative linear relationship between two features. This means that as one feature increases, the other tends to decrease.
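The sketch below computes a Pearson correlation coefficient by hand to show what a strongly negative value looks like. The feature names and numbers are made up for illustration, and `pearson` is a hypothetical helper, not a library function.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical features: y mostly falls as x rises, with some noise.
hours_of_tv = [1, 2, 3, 4, 5]
test_score = [10, 8, 7, 3, 4]

r = pearson(hours_of_tv, test_score)
print(round(r, 2))  # -0.93
```

Note that the last score rises slightly even though the overall trend is downward; a strong negative correlation describes the tendency, not every individual pair.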

What are the potential downsides of removing outliers from your dataset?

  • It always improves the quality of the dataset
  • It might discard important information
  • It might introduce noise into the dataset
Removing outliers might discard potentially important information that could significantly influence the analysis results.
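As an illustration, the sketch below flags outliers with the 1.5 × IQR rule, which is one common convention chosen here as an assumption (the question does not prescribe a method). The order counts are hypothetical.

```python
import statistics

# Hypothetical daily order counts; 95 could be a data-entry error
# or a genuine sales spike worth investigating.
orders = [12, 13, 14, 15, 15, 16, 17, 18, 95]

q1, _, q3 = statistics.quantiles(orders, n=4)  # quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in orders if v < lower or v > upper]
kept = [v for v in orders if lower <= v <= upper]

print(outliers)                 # [95]
print(statistics.mean(orders))  # mean including the spike
print(statistics.mean(kept))    # mean after removal
```

Dropping the flagged value makes the summary look tidier, but if the spike was real, the cleaned data no longer records that it happened.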

If you want to represent both the distribution and density of data, a _____ plot is a good choice.

  • Bar
  • Line
  • Scatter
  • Violin
A Violin plot is a good choice to represent both the distribution and density of data. It combines aspects of both a box plot and a density plot, giving a fuller picture of the distribution.

Imagine you're working on a data project where the 'wrangle' phase is taking significantly longer than expected. How might this impact the rest of your EDA process?

  • It could delay subsequent steps and overall analysis timeline.
  • The communication phase will be quicker.
  • The explore phase might be shortened to make up for lost time.
  • The rest of the process will not be impacted.
If the 'wrangling' phase takes significantly longer than expected, it could delay subsequent steps and the overall timeline for the analysis. The EDA process is often iterative, and delays in one phase could impact the time available for later phases. Proper time management and planning are crucial for a successful data analysis project.

Outliers can make a histogram appear ____, hence, distorting the true distribution of the data.

  • skewed
  • spread out
  • symmetrical
  • uniform
Outliers can cause a histogram to appear skewed or distorted as they can create bars that stand alone far from the main distribution.
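A rough text-only sketch of this effect, assuming equal-width bins spanning the minimum to the maximum value. The data and the helper `bin_counts` are hypothetical.

```python
def bin_counts(values, n_bins=10):
    """Count values in n_bins equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical sample: 19 values between 10 and 20, plus one outlier.
data = [10, 11, 12, 12, 13, 13, 14, 14, 14, 15,
        15, 15, 16, 16, 17, 17, 18, 19, 20, 100]

with_outlier = bin_counts(data)
without_outlier = bin_counts(data[:-1])

for label, counts in [("with outlier", with_outlier),
                      ("without outlier", without_outlier)]:
    print(label)
    for i, c in enumerate(counts):
        print(f"  bin {i}: {'#' * c}")
```

With the outlier included, almost everything collapses into the first bin and a lone bar sits at the far end; without it, the same bins reveal the roughly bell-shaped bulk of the data.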

In a correlation matrix, the value -1 signifies a perfect _____ correlation between two variables.

  • negative
  • neutral
  • positive
  • random
In a correlation matrix, a value of -1 signifies a perfect negative correlation between two variables. This means the points fall exactly on a downward-sloping straight line: as one variable increases, the other decreases by a fixed amount per unit, and vice versa.

What happens to a model's performance when missing data is not handled correctly?

  • It depends on the model.
  • It deteriorates.
  • It improves.
  • It remains the same.
When missing data is not handled correctly, it can distort the underlying data distribution and lead to incorrect model learning, ultimately deteriorating the model's performance.

In the EDA process, what does 'wrangling' refer to?

  • Cleaning and transforming data
  • Formulating hypotheses
  • Interpreting data
  • Visualizing data
Wrangling in the EDA process refers to the cleaning and transforming of data to facilitate subsequent analysis. This could involve addressing missing values, correcting inconsistencies, or reshaping the data structure.
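A small sketch of what such cleaning might look like. The rows, field names, and the helper `clean_row` are hypothetical, and the specific rules (trim whitespace, title-case text, strip thousands separators, map empty strings to `None`) are assumptions chosen for illustration.

```python
# Hypothetical raw survey rows with typical problems:
# stray whitespace, inconsistent casing, numbers stored as text.
raw_rows = [
    {"name": " Alice ", "city": "new york", "income": "52000"},
    {"name": "BOB", "city": "New York ", "income": ""},
    {"name": "carol", "city": "NEW YORK", "income": "61,500"},
]

def clean_row(row):
    """Return a cleaned copy of one row."""
    income_text = row["income"].replace(",", "").strip()
    return {
        "name": row["name"].strip().title(),
        "city": row["city"].strip().title(),
        # An empty string becomes an explicit missing value.
        "income": float(income_text) if income_text else None,
    }

clean_rows = [clean_row(r) for r in raw_rows]
for row in clean_rows:
    print(row)
```

After cleaning, all three rows share the same city label and the income column is numeric, with the missing entry marked explicitly so it can be handled in a later step.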

A correlation coefficient of 0 implies ________.

  • A strong negative relationship
  • A strong positive relationship
  • An uncertain relationship
  • No linear relationship
A correlation coefficient of 0 implies no linear relationship between the variables. However, it doesn't necessarily mean that there is no relationship at all, as the relationship could be non-linear.

How does EDA help in understanding the underlying structure of data?

  • By cleaning data
  • By modelling data
  • By summarizing data
  • By visualizing data
EDA, particularly data visualization, plays a crucial role in understanding the underlying structure of data. Visual techniques such as histograms, scatterplots, or box plots, can uncover patterns, trends, relationships, or outliers that would remain hidden in raw, numerical data. Visual exploration can guide statistical analysis and predictive modeling by revealing the underlying structure and suggesting hypotheses.