What type of plot is ideal for visualizing relationships among more than two variables?

  • Bar plot
  • Box plot
  • Pairplot
  • Scatter plot
A pairplot is ideal for visualizing relationships among more than two variables. It creates a grid of axes in which each variable in the data is shared on the y-axis across a single row and on the x-axis across a single column, so every pairwise relationship gets its own panel.
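A minimal sketch using seaborn's pairplot, assuming the bundled iris dataset is available; the dataset and the "species" column are only illustrative choices:

```python
import seaborn as sns

# Load a sample dataset bundled with seaborn (illustrative choice).
iris = sns.load_dataset("iris")

# pairplot builds the grid described above: each variable's distribution on the
# diagonal, pairwise scatter plots off the diagonal, optionally colored by a category.
grid = sns.pairplot(iris, hue="species")
grid.savefig("iris_pairplot.png")
```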

How does the uncertainty level differ in EDA, CDA, and Predictive Modeling?

  • Uncertainty is equally distributed among all three.
  • Uncertainty is highest in CDA, lower in Predictive Modeling, and lowest in EDA.
  • Uncertainty is highest in EDA, lower in CDA, and lowest in Predictive Modeling.
  • Uncertainty is highest in Predictive Modeling, lower in CDA, and lowest in EDA.
In EDA, where the primary aim is to explore patterns and relationships in the data, uncertainty is highest. It is lower in CDA, which seeks to confirm the hypotheses generated during EDA, and lowest in Predictive Modeling, which builds on the outcomes of EDA and CDA to make predictions about future data.

Can the steps of the EDA process be re-ordered or are they strictly sequential?

  • Some steps can be reordered, but not all.
  • The order of steps depends on the data set size.
  • They are strictly sequential and cannot be reordered.
  • They can be reordered based on the analysis needs.
The EDA process is generally sequential, starting from questioning and ending in communication. However, depending on the nature and needs of the analysis, some steps might be revisited. For instance, new questions might emerge during the explore phase, necessitating going back to the questioning phase. Or, additional data wrangling might be needed after exploring the data.

In what scenarios would it be more appropriate to use Kendall's Tau over Spearman's correlation coefficient?

  • Datasets with many tied ranks
  • Datasets with normally distributed data
  • Datasets without outliers
  • Large datasets with ordinal data
Kendall's Tau may be more appropriate than Spearman's correlation coefficient when a dataset contains many tied ranks: the commonly used tau-b variant explicitly corrects for ties in either variable, so it is generally the safer choice for heavily tied ordinal data.
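A small sketch comparing the two coefficients on ordinal data with many ties, using SciPy; the sample ratings are made up for illustration:

```python
from scipy import stats

# Two ordinal ratings on a 1-5 scale with many tied ranks.
x = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
y = [1, 1, 2, 2, 3, 3, 3, 4, 4, 5]

tau, tau_p = stats.kendalltau(x, y)   # tau-b, which corrects for ties
rho, rho_p = stats.spearmanr(x, y)

print(f"Kendall's tau-b: {tau:.3f} (p = {tau_p:.3f})")
print(f"Spearman's rho:  {rho:.3f} (p = {rho_p:.3f})")
```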

In a study on job satisfaction, employees with lower satisfaction scores are less likely to complete surveys. How would you categorize this missing data?

  • MAR
  • MCAR
  • NMAR
  • Not missing data
This would be NMAR (Not Missing at Random) because the missingness depends on the unobserved data itself (i.e., the job satisfaction score). If employees with lower job satisfaction are less likely to complete the survey, the missingness is related to the missing satisfaction scores.
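A rough simulation of this NMAR mechanism, assuming hypothetical satisfaction scores on a 1-10 scale; the dropout probabilities are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical satisfaction scores on a 1-10 scale.
satisfaction = rng.integers(1, 11, size=1_000)

# NMAR mechanism: the lower the (unobserved) score, the more likely the
# survey is never returned, so missingness depends on the value itself.
p_missing = np.clip(0.8 - 0.07 * satisfaction, 0.05, 0.95)
observed = np.where(rng.random(1_000) < p_missing, np.nan, satisfaction)

df = pd.DataFrame({"true_score": satisfaction, "observed_score": observed})
# The observed mean is biased upward because low scores go missing more often.
print(df["true_score"].mean(), df["observed_score"].mean())
```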

You are analyzing customer purchasing behavior and the data exhibits high skewness. What could be the potential challenges and how can you address them?

  • Data normality assumptions may be violated, address this by transformation techniques.
  • No challenges would be encountered.
  • Skewness would make the data easier to analyze.
  • The mean would become more reliable, no action is needed.
High skewness can violate the normality assumptions required by many statistical tests and machine learning models, and it makes the mean a less representative summary. A common way to address this is a data transformation, such as a log, square root, or inverse transformation, to make the distribution more symmetrical.
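A minimal sketch of the transformation idea, using a synthetic log-normal sample as a stand-in for skewed purchase amounts:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed "purchase amounts" (log-normal stand-in).
amounts = rng.lognormal(mean=3.0, sigma=1.0, size=5_000)
print("skewness before:", round(float(stats.skew(amounts)), 2))

# log1p handles zeros safely and pulls in the long right tail.
log_amounts = np.log1p(amounts)
print("skewness after: ", round(float(stats.skew(log_amounts)), 2))
```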

In the context of EDA, you find that certain features in your dataset are highly correlated. How would you interpret this finding and how might it affect your analysis?

  • The presence of multicollinearity may require you to consider it in your model selection or feature engineering steps
  • You should combine the correlated features into one
  • You should remove all correlated features
  • You should use only correlated features in your analysis
High correlation between features indicates multicollinearity. This can be problematic in certain types of models (like linear regression) as it can destabilize the model and make the effects of predictor variables hard to separate. Depending on the severity of multicollinearity, you may need to consider it during model selection or feature engineering steps, such as removing highly correlated variables, combining them, or using regularization techniques.
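One common way to surface such pairs is to scan the absolute correlation matrix; a sketch with synthetic features, where x2 is deliberately a near copy of x1:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic features: x2 is nearly a copy of x1, x3 is independent.
x1 = rng.normal(size=500)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.05, size=500),
    "x3": rng.normal(size=500),
})

# Keep only the upper triangle so each pair is checked once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_review = [col for col in upper.columns if (upper[col] > 0.9).any()]

print(corr.round(2))
print("candidates to drop, combine, or regularize:", to_review)
```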

In what circumstances can the IQR method lead to incorrect detection of outliers?

  • When data has a high standard deviation
  • When data is heavily skewed or bimodal
  • When data is normally distributed
  • When data is uniformly distributed
The IQR method might lead to incorrect detection of outliers in heavily skewed or bimodal distributions. With strong skew, the symmetric 1.5×IQR fences flag many legitimate values in the long tail as outliers; with bimodal data, the quartiles can straddle the gap between modes, so the fences are poorly calibrated for either cluster.
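A short sketch of the standard 1.5×IQR rule applied to a heavily skewed sample, showing how many ordinary tail points get flagged; the log-normal data are synthetic:

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

rng = np.random.default_rng(0)

# Heavily right-skewed sample: the upper fence flags many ordinary tail values.
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1_000)
print("points flagged in skewed data:", len(iqr_outliers(skewed)))

# Roughly symmetric sample for comparison.
normal = rng.normal(size=1_000)
print("points flagged in normal data:", len(iqr_outliers(normal)))
```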

A potential drawback of using regression imputation is that it can underestimate the ___________.

  • Mean
  • Median
  • Mode
  • Variance
One potential drawback of regression imputation is that it can underestimate the variance. Because it uses the relationship with other variables to estimate missing values, the imputed values fall exactly on the fitted regression line, with none of the residual scatter of real observations, so the filled-in data show less variability than they should.
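A small demonstration of that effect with a single predictor, fitting the regression on complete cases and imputing the rest; all numbers here are simulated:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated variables; 30% of y is missing completely at random.
x = rng.normal(size=1_000)
y = 2.0 * x + rng.normal(scale=1.0, size=1_000)
missing = rng.random(1_000) < 0.3

# Fit on the observed cases, then impute the missing y from x.
slope, intercept = np.polyfit(x[~missing], y[~missing], deg=1)
y_imputed = y.copy()
y_imputed[missing] = slope * x[missing] + intercept

# Imputed values sit exactly on the regression line, so the spread shrinks.
print("variance of the true y:   ", round(float(y.var()), 3))
print("variance after imputation:", round(float(y_imputed.var()), 3))
```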

To ensure that the audience doesn't misinterpret a data visualization, it's important to avoid __________.

  • Bias and misleading scales
  • Using interactive elements
  • Using more than one type of graph
  • Using too many colors
To avoid misinterpretation of a data visualization, it's essential to avoid bias and misleading scales. These could skew the representation of the data and thus lead to inaccurate conclusions.
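A quick matplotlib illustration of the misleading-scales point: the same four values plotted with a truncated y-axis and with a zero baseline; the numbers are made up:

```python
import matplotlib.pyplot as plt

labels = ["Q1", "Q2", "Q3", "Q4"]
values = [102, 104, 103, 105]

fig, (ax_trunc, ax_zero) = plt.subplots(1, 2, figsize=(8, 3))

# Truncating the y-axis exaggerates small differences between the bars.
ax_trunc.bar(labels, values)
ax_trunc.set_ylim(100, 106)
ax_trunc.set_title("Misleading: truncated scale")

# A zero baseline keeps the visual comparison proportional.
ax_zero.bar(labels, values)
ax_zero.set_ylim(0, 110)
ax_zero.set_title("Clearer: zero baseline")

fig.tight_layout()
fig.savefig("scale_comparison.png")
```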