You're working with a data set where a few observations are vastly different from the rest. Which method, Z-score or IQR, would be more robust to use for outlier detection?
- Either would work equally well
- IQR
- Neither would be effective
- Z-score
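The IQR method is more robust than the Z-score in this scenario. Z-scores are computed from the mean and standard deviation, both of which are pulled toward extreme values, so a few very large observations can inflate the standard deviation enough to mask themselves. The IQR is built from quartiles, which are largely unaffected by a handful of extreme observations.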
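A minimal sketch of the difference, using only the Python standard library. The sample data, the function names, and the conventional cutoffs (|z| > 3 and 1.5 × IQR) are assumptions chosen for illustration, not part of the question.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []
    return [v for v in values if abs((v - mean) / std) > threshold]

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample: eight typical values and two extreme ones.
data = [10, 11, 12, 12, 13, 13, 14, 15, 500, 520]

print(zscore_outliers(data))  # [] (the extremes inflate the std and hide themselves)
print(iqr_outliers(data))     # [500, 520]
```

Here the two extreme values raise the standard deviation so much that neither reaches a Z-score of 3, while the quartile-based fences still flag both.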
What is the underlying JavaScript library that Plotly uses to render its graphics?
- D3.js
- Node.js
- React.js
- jQuery
Plotly uses D3.js (Data-Driven Documents) under the hood to render its graphics. D3.js is a JavaScript library for producing dynamic and interactive data visualizations in web browsers.
Readability in data visualization refers to how easily the audience can __________.
- Analyze the underlying code
- Download the graph
- Interact with the graph
- Understand the represented data
Readability in data visualization refers to how easily the audience can understand the represented data. This includes the clarity of text elements (labels, title, caption), color scheme, and whether the choice of plot type makes sense for the represented data.
In the context of handling missing data, what does 'imputation' mean?
- Adding artificial data
- Deleting data points
- Filling in missing data with substituted values
- Transforming data
In the context of handling missing data, 'imputation' refers to the process of filling in missing data with substituted values. These values can be determined in a variety of ways such as using measures of central tendency (mean, median, mode), predictive models, or other techniques.
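As a rough illustration of imputation with measures of central tendency, the sketch below fills `None` entries in a list. The `impute` helper, its `strategy` parameter, and the sample `ages` list are hypothetical names introduced here; only the standard library is used.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    if strategy == "mean":
        fill = statistics.fmean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values]

# Hypothetical column with two missing entries.
ages = [22, None, 25, 25, None, 40]

print(impute(ages, "mean"))    # missing entries become 28.0
print(impute(ages, "median"))  # missing entries become 25.0
print(impute(ages, "mode"))    # missing entries become 25
```

Predictive-model imputation follows the same pattern, except that the fill value is predicted per row rather than shared across all missing entries.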
Imagine you are examining a correlation matrix and find that two variables have a correlation coefficient close to -1. What does this imply about the relationship between these two variables?
- Their relationship is random
- They are unrelated
- They have a strong negative relationship
- They have a weak positive relationship
A correlation coefficient close to -1 implies that the two variables have a strong negative linear relationship. As one variable increases, the other tends to decrease, and vice versa.
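To make the coefficient concrete, here is a small sketch that computes the Pearson correlation by hand. The `pearson` function and the two sample lists are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)

# Hypothetical data: scores fall as hours increase.
hours_gaming = [1, 2, 3, 4, 5]
exam_score = [90, 85, 70, 65, 50]

r = pearson(hours_gaming, exam_score)
print(r)  # close to -1: a strong negative linear relationship
```

A value near -1 only describes how well a falling straight line fits the data; it says nothing about causation.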
What is the difference between skewness and kurtosis?
- Skewness measures asymmetry, kurtosis measures variability.
- Skewness measures center, kurtosis measures spread.
- Skewness measures spread, kurtosis measures center.
- Skewness measures symmetry, kurtosis measures tailedness.
The difference between skewness and kurtosis is that skewness measures the asymmetry of a data distribution around its mean, whereas kurtosis measures the "tailedness" of a data distribution. So, skewness is about the symmetry, and kurtosis is about the tails of the distribution.
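One common way to compute both quantities is from central moments: skewness as m3 / m2^1.5 and excess kurtosis as m4 / m2^2 - 3. The sketch below uses these population (biased) formulas; libraries may apply bias corrections, so their results can differ slightly. The function names and sample lists are illustrative assumptions.

```python
def central_moment(values, k):
    """k-th central moment (population form)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    """Moment-based skewness: 0 for symmetric data, > 0 for a long right tail."""
    m2 = central_moment(values, 2)
    return central_moment(values, 3) / m2 ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    m2 = central_moment(values, 2)
    return central_moment(values, 4) / m2 ** 2 - 3.0

# Hypothetical samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 20]

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

The symmetric sample has zero skewness and negative excess kurtosis (light tails), while the sample with one large value shows positive skewness and heavier tails.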
Even after concluding, it's crucial to '______' effectively in the EDA process, as this step is where your findings are shared and potentially acted upon.
- communicate
- conclude
- question
- wrangle
Even after concluding, it's crucial to 'communicate' effectively in the EDA process, as this step is where your findings are shared and potentially acted upon. Communication is not only about presenting the findings, but also about making sure that they are understood and can be acted upon.
Suppose you are using a correlation matrix to understand the relationships between multiple features. You come across a correlation coefficient of -0.85 between two features. What does this indicate?
- A strong negative linear relationship
- A strong positive linear relationship
- A weak positive linear relationship
- No relationship
A correlation coefficient of -0.85 indicates a strong negative linear relationship between the two features. This means that as one feature increases, the other tends to decrease.
Replacing missing values with the median of the existing values is known as _____ imputation.
- Mean
- Median
- Mode
- Pairwise
Replacing missing values with the median of the existing values is known as 'median' imputation. This technique is useful for skewed distributions as the median is less affected by outliers than the mean.
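The effect of skew on the fill value can be seen in a short sketch. The `incomes` list is made-up data with one very large value, and missing entries are represented as `None` by assumption.

```python
import statistics

# Hypothetical skewed column: one very large income and one missing entry.
incomes = [30_000, 32_000, 35_000, None, 38_000, 1_000_000]

observed = [v for v in incomes if v is not None]
mean_fill = statistics.fmean(observed)     # pulled upward by the extreme value
median_fill = statistics.median(observed)  # stays near the typical incomes

# Median imputation: substitute the median for each missing entry.
imputed = [median_fill if v is None else v for v in incomes]

print(mean_fill)    # 227000.0
print(median_fill)  # 35000
print(imputed)
```

Filling with the mean here would assign the missing person an income far above every typical value in the column, which is why the median is often preferred for skewed data.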
In a survey about income levels, some individuals chose not to disclose their earnings. How would you categorize this missing data?
- MAR
- MCAR
- NMAR
- Not missing data
This would be NMAR (Not Missing at Random), because whether the income value is missing depends on the unobserved value itself: people with very high or very low incomes may be more likely to withhold this information.