In a _____ plot, the width of the "violin" indicates the frequency or density of data.
- Bar
- Box
- Scatter
- Violin
In a violin plot, the width of the "violin" (the mirrored density curve on each side) tracks the estimated density of data points at each value. The wider the plot at a given value, the more data points fall near that value.
During an experiment, you discover that a certain variable shows a high number of outliers. What might this suggest about your data collection process?
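To make the link between width and density concrete, here is a minimal sketch of a Gaussian kernel density estimate in plain Python. The sample values and the bandwidth of 0.5 are assumptions chosen for illustration; plotting libraries pick their own bandwidths and may use other kernels.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a function estimating the density of `sample` at a point."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample
        )

    return density

# Hypothetical sample: most values cluster near 5, a few sit near 9.
sample = [4.6, 4.8, 4.9, 5.0, 5.0, 5.1, 5.2, 5.4, 8.9, 9.1]
density = gaussian_kde(sample, bandwidth=0.5)

# A violin's half-width at each level is proportional to this density.
levels = [3.0, 5.0, 7.0, 9.0]
widths = {level: density(level) for level in levels}
for level, width in widths.items():
    print(f"value {level:>4}: {'#' * round(width * 40)}")
```

The bar printed for 5.0 is the longest because most of the sample sits there, which is where a violin drawn from this data would be widest.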
- Both are possible
- Data collection process is accurate
- Data collection process is flawed
- Neither of these is possible
A high number of outliers may point to problems in the data collection process, such as measurement or recording errors.
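One common way to count outliers before questioning the collection process is the 1.5 × IQR rule. The sketch below assumes that rule and uses hypothetical sensor readings; the threshold `k=1.5` is a convention, not a requirement.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sensor readings; 98.0 and -40.0 look like recording faults.
readings = [20.1, 19.8, 20.4, 20.0, 98.0, 19.9, 20.2, -40.0, 20.3, 20.1]
flagged = iqr_outliers(readings)
print(flagged)
```

If a large share of a variable's values is flagged this way, it is worth checking how those values were measured and recorded before analyzing them further.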
What is the primary cause of outliers in normally distributed data?
- All of these
- Data entry errors
- Data processing errors
- Measurement errors
Outliers in normally distributed data can be a result of various factors such as data entry errors, measurement errors, or errors in data processing.
In Plotly, the ________ object is the top-level container for all plot attributes.
- Diagram
- Figure
- Graph
- Plot
In Plotly, the 'Figure' object is the top-level container in which all plot-related attributes such as data and layout are stored.
How does EDA help in understanding the underlying structure of data?
- By cleaning data
- By modelling data
- By summarizing data
- By visualizing data
EDA, particularly data visualization, plays a crucial role in understanding the underlying structure of data. Visual techniques such as histograms, scatterplots, and box plots can uncover patterns, trends, relationships, or outliers that would remain hidden in raw numerical data. Visual exploration can guide statistical analysis and predictive modeling by revealing that structure and suggesting hypotheses.
What is a primary assumption when using regression imputation?
- All data is normally distributed
- Missing data is missing completely at random (MCAR)
- Missing values are negligible
- The relationship between variables is linear
A primary assumption when using regression imputation is that the relationship between variables is linear. This is because regression imputation uses a regression model to predict missing values, and the basic form of regression models assumes a linear relationship between predictor and response variables.
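The idea can be sketched with a one-predictor least-squares fit in plain Python. The data and the function names `fit_line` and `regression_impute` are hypothetical, and the sketch assumes the linear relationship described above actually holds.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def regression_impute(rows):
    """Fill missing y values (None) with predictions from the observed rows."""
    observed = [(x, y) for x, y in rows if y is not None]
    slope, intercept = fit_line(*zip(*observed))
    return [
        (x, y if y is not None else slope * x + intercept) for x, y in rows
    ]

# Hypothetical (hours_studied, score) pairs; two scores are missing.
rows = [(1, 52.0), (2, 54.0), (3, None), (4, 58.0), (5, None), (6, 62.0)]
imputed = regression_impute(rows)
print(imputed)
```

Because the imputed values lie exactly on the fitted line, they carry no residual noise; if the true relationship is not linear, the filled-in values will be systematically off.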
You are working on a dataset and found that the model performance is poor. On further inspection, you found some data points that are far from the rest. What could be a possible reason for the poor performance of your model?
- Outliers
- Overfitting
- Underfitting
- None of these
Data points that lie far from the rest are outliers, and they are a likely cause of the poor performance. Outliers can distort a model's fit, particularly for models that minimize squared error, where large deviations carry disproportionate weight.
As a data scientist, you've realized that your dataset contains missing values. How would you handle this situation as part of your EDA process?
- Always replace missing values with the mean or median
- Choose an appropriate imputation method depending on the nature of the data and the type of missingness
- Ignore the missing values and proceed with analysis
- Remove all instances with missing values
Handling missing values is an important part of the EDA process. The right method depends on the nature of the data and the type of missingness (MCAR, MAR, or NMAR). Simple mean/median/mode imputation is best suited to MCAR data with few gaps. Regression imputation, multiple imputation, and other model-based methods are better suited to MAR data, where missingness depends on observed variables. NMAR data generally requires modeling the missingness mechanism itself, because the missingness depends on the unobserved values.
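The choice of method matters because even simple imputation changes the data's properties. As a sketch with hypothetical ages, mean imputation preserves the mean but shrinks the variance, since every gap receives the same central value.

```python
import statistics

# Hypothetical ages; None marks a missing response.
ages = [23, 35, None, 41, 29, None, 52, 38]

observed = [a for a in ages if a is not None]
mean_age = statistics.mean(observed)

# Mean imputation: every gap receives the same value.
imputed = [a if a is not None else mean_age for a in ages]

print(f"mean before:     {statistics.mean(observed):.2f}")
print(f"mean after:      {statistics.mean(imputed):.2f}")
print(f"variance before: {statistics.variance(observed):.2f}")
print(f"variance after:  {statistics.variance(imputed):.2f}")
```

The more values are missing, the stronger this shrinkage becomes, which is one reason to check the amount and type of missingness before choosing a method.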
If the variance of a data set is zero, then all data points are ________.
- Equal
- Infinite
- Negative
- Positive
If the variance of a data set is zero, then all data points are equal. Variance measures how far a set of numbers is spread out from its mean. Because it is an average of squared deviations, it can only be zero when every deviation is zero, that is, when all values in the data set are identical.
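A quick check with Python's standard library, using two hypothetical data sets and the population variance:

```python
import statistics

identical = [7, 7, 7, 7]
spread = [5, 6, 8, 9]

# Population variance: the mean of squared deviations from the mean.
var_identical = statistics.pvariance(identical)
var_spread = statistics.pvariance(spread)

print(var_identical)
print(var_spread)
```

The first data set has no deviation from its mean, so its variance is zero; the second has the same mean but a positive variance.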
A market research survey collects data on customer age, gender, and preference for a product (Yes/No). Identify the types of data present in this survey.
- Age: continuous, Gender: nominal, Preference: ordinal
- Age: nominal, Gender: ordinal, Preference: interval
- Age: ordinal, Gender: interval, Preference: ratio
- Age: ratio, Gender: ordinal, Preference: nominal
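Of the options given, the first is the best fit. Age is continuous because it can take on any value within a range. Gender is nominal because it is categorical with no inherent order. A Yes/No preference is a binary categorical variable: it is usually classed as nominal, and treating it as ordinal, as this option does, relies on reading Yes as ranking above No.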