How can outliers influence the mean of a dataset?

  • Can either increase or decrease the mean
  • Decrease the mean
  • Does not affect the mean
  • Increase the mean
Outliers can have a big impact on the mean because the mean uses every value in the dataset. An outlier far above the other values pulls the mean up, and one far below pulls it down, so the mean may no longer reflect a typical value.
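As a small illustration, the sketch below adds one high and one low outlier to a made-up list of values and compares the mean with the median. The numbers are hypothetical and chosen only to make the effect visible.

```python
from statistics import mean, median

# Hypothetical values with no outliers.
values = [10, 12, 11, 13, 12]

# The same values with one extreme point added on either side.
with_high = values + [100]
with_low = values + [-80]

# The mean moves toward whichever outlier is present.
print("mean:  ", mean(values), mean(with_high), mean(with_low))

# The median barely moves, which is why it is often reported alongside the mean.
print("median:", median(values), median(with_high), median(with_low))
```

The direction of the shift depends only on which side of the data the outlier falls.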

What is the biggest challenge in the 'wrangle' phase of the EDA process?

  • Communicating the insights
  • Dealing with missing values and other inconsistencies in the data
  • Defining the right questions
  • Drawing conclusions from the data
The wrangling phase of the EDA process can be challenging as it involves dealing with various data quality issues. These can include missing values, inconsistent data entries, outliers, and other anomalies. The analyst might need to make informed decisions about how to handle these issues without introducing bias or distorting the underlying information in the data.

How does the choice of model in a model-based method impact the imputation process?

  • The choice of model can cause overfitting
  • The choice of model can influence the accuracy of the imputations
  • The choice of model can introduce unnecessary complexity
  • The choice of model has no impact
The choice of model in a model-based method can significantly influence the accuracy of the imputations. If the chosen model closely matches the actual data-generating process, the imputations tend to be accurate. If the model is a poor fit, the imputed values may be far from the true values, leading to biased results.
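A minimal sketch of this effect, assuming NumPy is available and using made-up data in which y grows roughly linearly with x. The "true" missing values are known only because the example is synthetic. It compares two imputation models: the observed mean (which ignores x) and a straight-line fit.

```python
import numpy as np

# Hypothetical toy data: y is roughly 2 * x, and two y values are missing.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, np.nan, 8.2, np.nan, 12.1])
true_missing = np.array([6.0, 10.0])  # known only because the data are synthetic

observed = ~np.isnan(y)

# Model A: ignore x and fill every gap with the mean of the observed y values.
mean_imputed = np.full((~observed).sum(), y[observed].mean())

# Model B: fit y = slope * x + intercept on the observed rows, then predict the gaps.
slope, intercept = np.polyfit(x[observed], y[observed], deg=1)
line_imputed = slope * x[~observed] + intercept

# Mean absolute error of each set of imputations against the known truth.
mean_error = np.abs(mean_imputed - true_missing).mean()
line_error = np.abs(line_imputed - true_missing).mean()
print(f"mean imputation error: {mean_error:.2f}")
print(f"line imputation error: {line_error:.2f}")
```

Here the linear model matches how the data were generated, so its imputations land closer to the true values. On data with a different structure, the ranking could reverse.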

A company has asked you to build a model that can predict customer churn based on a set of features. Which type of data analysis will you perform?

  • All are equally suitable
  • CDA
  • EDA
  • Predictive Modeling
Predictive Modeling would be most suitable in this case. It involves the application of machine learning algorithms to the data in order to make predictions about future outcomes, in this case, customer churn.

Which category of missing data implies that the probability of missingness is related to the observed data?

  • MAR
  • MCAR
  • NMAR
  • nan
MAR, which stands for Missing At Random, implies that the probability of missingness is related to the observed data but not to the missing values themselves.
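One way to look for this pattern is to compare missing rates across groups of an observed variable. The sketch below uses hypothetical survey records in which age is always recorded and income is sometimes missing. The field names and the age cut-off of 30 are assumptions for illustration.

```python
# Hypothetical survey records: age is always observed, income is sometimes missing.
records = [
    {"age": 22, "income": None},
    {"age": 25, "income": None},
    {"age": 28, "income": 31000},
    {"age": 41, "income": 52000},
    {"age": 47, "income": 58000},
    {"age": 53, "income": None},
    {"age": 58, "income": 61000},
    {"age": 62, "income": 64000},
]

def missing_rate(rows):
    """Fraction of rows whose income is missing."""
    return sum(r["income"] is None for r in rows) / len(rows)

under_30 = [r for r in records if r["age"] < 30]
thirty_plus = [r for r in records if r["age"] >= 30]

# A clear gap between groups suggests missingness depends on the observed age.
print(f"under 30: {missing_rate(under_30):.2f}")
print(f"30 plus:  {missing_rate(thirty_plus):.2f}")
```

A gap like this is consistent with MAR rather than MCAR, but the observed data alone cannot rule out NMAR, since that would depend on the income values that were never recorded.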

In the context of data visualization, what is a pairplot primarily used for?

  • Comparing multiple variables at once
  • Showing the correlation between two variables
  • Visualizing the distribution of a single variable
  • Visualizing the relationship between two variables
A pairplot is primarily used for comparing multiple variables at once. It creates a grid with one scatter plot for each pair of variables, which helps in understanding the relationships among all of them.

The '______' step in the EDA process involves formulating the questions you want to answer with your data.

  • communicating
  • concluding
  • questioning
  • wrangling
The first step in the EDA process, 'questioning,' involves formulating the questions that you want to answer with your data. It's during this step that you define what you want to achieve with your analysis and what problems you are trying to solve.

Under what circumstances is NMAR typically observed in a dataset?

  • All of the above
  • When data missingness is associated with the missing data itself
  • When data missingness is random
  • When data missingness is unrelated to observed and unobserved data
NMAR (Not Missing At Random) is typically observed when the missingness is related to the value of the missing data itself. This is the most challenging type of missingness to handle because the mechanism depends on values that were never observed.

________ correlation is more appropriate when dealing with ordinal variables.

  • Covariance
  • Kendall's Tau
  • Pearson's
  • Spearman's
Spearman's correlation is more appropriate when dealing with ordinal variables. Unlike Pearson's, Spearman's correlation works with ranks, which makes it suitable for ordinal data.
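To make the "works with ranks" point concrete, the sketch below computes Spearman's correlation as Pearson's correlation applied to ranks, using only the standard library. The helper names and the sample data are hypothetical.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical ordinal ratings (1 to 5) against a strongly non-linear response.
rating = [1, 2, 3, 4, 5]
response = [1, 4, 9, 16, 1000]

print("Pearson: ", round(pearson(rating, response), 3))
print("Spearman:", round(spearman(rating, response), 3))
```

Because the response rises whenever the rating rises, the ranks agree perfectly even though the raw relationship is far from a straight line.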

Anomalies or outliers in the dataset can be identified through the process of ________.

  • CDA
  • EDA
  • Machine Learning
  • Predictive Modeling
Anomalies or outliers in the dataset can be identified through the process of EDA. Various techniques such as visualization methods (like box plots and scatter plots) and statistical methods (like the IQR method or the Z-score method) can be used to detect outliers during EDA.
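The two statistical methods mentioned above can be sketched with the standard library as follows. The data are made up, the 1.5 multiplier is the conventional IQR fence, and the Z-score cut-off of 2.5 is an assumption chosen for this small sample.

```python
from statistics import mean, pstdev, quantiles

# Hypothetical measurements with one suspicious value.
data = [12, 14, 13, 15, 14, 13, 16, 15, 14, 95]

# IQR method: flag points beyond 1.5 * IQR outside the quartiles.
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < lower or x > upper]

# Z-score method: flag points far from the mean in standard-deviation units.
# With only 10 points and the population standard deviation, |z| cannot
# exceed 3, so a cut-off of 3 would flag nothing here; 2.5 is used instead.
mu, sigma = mean(data), pstdev(data)
z_outliers = [x for x in data if abs(x - mu) / sigma > 2.5]

print("IQR fences:", lower, upper)
print("IQR outliers:", iqr_outliers)
print("Z-score outliers:", z_outliers)
```

Note that the outlier itself inflates both the mean and the standard deviation, which is one reason the IQR method is often preferred on small or heavily skewed samples.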

When creating a dashboard for monthly sales data, which type of visualization would be best to show trends over time?

  • Bar Chart
  • Line Chart
  • Pie Chart
  • Scatter Plot
A line chart is the most suitable visualization for displaying trends over time, making it easy to observe how a specific metric, like monthly sales data, changes over a period. It connects data points with lines, allowing for a clear view of trends.

When considering scalability, what challenge might a stateful application present as opposed to a stateless one?

  • Stateful applications are inherently more scalable
  • Stateful applications require fewer resources
  • Stateful applications retain client session data, making load balancing complex
  • Stateless applications consume more bandwidth
Stateful applications, unlike stateless ones, retain client session data. This makes load balancing complex: a client's requests must either keep reaching the server that holds its session, or the session data must be kept consistent across servers. Both options add overhead and can limit scalability.
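The difference can be illustrated with a toy round-robin load balancer. This is only a sketch under simplifying assumptions: the class names, the shopping-cart scenario, and the plain dictionary standing in for a shared session store are all hypothetical.

```python
from itertools import cycle

class StatefulServer:
    """Keeps session data in its own memory."""
    def __init__(self):
        self.sessions = {}

    def add_to_cart(self, session_id, item):
        self.sessions.setdefault(session_id, []).append(item)
        return list(self.sessions[session_id])

class StatelessServer:
    """Keeps no session data; reads and writes a shared store instead."""
    def __init__(self, store):
        self.store = store

    def add_to_cart(self, session_id, item):
        self.store.setdefault(session_id, []).append(item)
        return list(self.store[session_id])

def run(servers, items):
    """Send each request to the next server in round-robin order."""
    balancer = cycle(servers)
    cart = []
    for item in items:
        cart = next(balancer).add_to_cart("user-1", item)
    return cart

items = ["book", "pen", "lamp"]

# Two stateful servers: each one only sees the requests routed to it.
stateful_cart = run([StatefulServer(), StatefulServer()], items)

# Two stateless servers sharing one store: any server can handle any request.
shared_store = {}
stateless_cart = run(
    [StatelessServer(shared_store), StatelessServer(shared_store)], items
)

print("stateful: ", stateful_cart)
print("stateless:", stateless_cart)
```

With plain round-robin routing, the stateful servers lose part of the cart, which is why such deployments usually need sticky sessions or session replication. The stateless servers return the full cart from whichever instance handles the request.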