How does EDA help in understanding the underlying structure of data?

  • By cleaning data
  • By modelling data
  • By summarizing data
  • By visualizing data
EDA, particularly data visualization, plays a crucial role in understanding the underlying structure of data. Visual techniques such as histograms, scatter plots, and box plots can uncover patterns, trends, relationships, and outliers that would remain hidden in raw numerical data. Visual exploration can guide statistical analysis and predictive modeling by revealing the underlying structure and suggesting hypotheses.
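A minimal, illustrative sketch of those three techniques on synthetic data (matplotlib only; the variables and figure layout are invented for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)              # synthetic feature
y = 2 * x + rng.normal(0, 10, 200)       # synthetic related variable

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(x, bins=20)        # histogram: distribution shape
axes[1].scatter(x, y, s=10)     # scatter plot: relationship between x and y
axes[2].boxplot(x)              # box plot: spread and potential outliers
fig.savefig("eda_overview.png")
```

Each panel answers a different structural question: shape of a distribution, relationship between two variables, and spread with potential outliers.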

What are the disadvantages of using backward elimination in feature selection?

  • It assumes a linear relationship
  • It can be computationally expensive
  • It can result in overfitting
  • It's sensitive to outliers
Backward elimination starts with all candidate variables and removes the least significant one at each step, refitting the model after every removal. Because each pass requires fitting the model once per remaining feature, the process can be computationally expensive, especially for datasets with a large number of features.
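The cost is easy to see in a hand-rolled sketch: every elimination pass refits the model once per remaining feature. The 1.10 stopping tolerance, the synthetic data, and the function name below are all illustrative assumptions, not a standard API:

```python
import numpy as np

def backward_eliminate(X, y, min_features=1):
    """Greedy backward elimination on a least-squares fit.

    Each pass refits the model once per remaining feature, so the
    total number of fits grows roughly quadratically with the
    feature count -- the source of the computational expense."""
    def sse(cols):
        Xc = X[:, cols]
        beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
        resid = y - Xc @ beta
        return resid @ resid

    kept = list(range(X.shape[1]))
    while len(kept) > min_features:
        # one refit per candidate feature in this pass
        trials = [(sse([c for c in kept if c != f]), f) for f in kept]
        best_sse, weakest = min(trials)
        if best_sse > 1.10 * sse(kept):  # stop once removal clearly hurts
            break
        kept.remove(weakest)
    return kept

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=100)
selected = backward_eliminate(X, y)
```

On this toy data only the two informative columns survive; with hundreds of features, the per-pass refits are what make the method expensive.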

The 'style' and 'context' functions in Seaborn are used to set the ___________ of the plots.

  • aesthetic and context
  • axis labels
  • layout and structure
  • size and color
The 'style' function in Seaborn (`seaborn.set_style`) is used to set the overall aesthetic look of the plot, including background color, grids, and spines. The 'context' function (`seaborn.set_context`) sets the context parameters, which adjust the scale of plot elements for the medium in which the plot will be presented (e.g., paper, notebook, talk, poster).

How does the choice of the threshold affect the number of identified outliers using the Z-score method?

  • A higher threshold identifies more outliers
  • A lower threshold identifies more outliers
  • It has no effect
  • The threshold value is irrelevant in the Z-score method
With the Z-score method, a point is flagged as an outlier when its absolute Z-score exceeds the chosen threshold. The lower the threshold, the more data points exceed it, and thus more outliers are identified.
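A small numpy sketch of that effect (synthetic data; thresholds of 2 and 3 are conventional but otherwise arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=15, size=1000)

# Z-score: distance from the mean in units of standard deviation
z = (data - data.mean()) / data.std()

# the threshold is the cutoff on |z|: points beyond it are flagged
strict = int(np.sum(np.abs(z) > 3.0))  # conservative: fewer outliers
loose = int(np.sum(np.abs(z) > 2.0))   # lower threshold: more outliers
```

Because every point flagged at the stricter cutoff is also flagged at the looser one, lowering the threshold can only grow the outlier set.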

Outliers are _________ observations that lie an abnormal distance from other values in a dataset.

  • Anomalous
  • Erroneous
  • Random
  • Statistical
Anomalous is the correct term. Outliers are anomalous observations that lie an abnormal distance from other values in a random sample from a population.

Suppose you are comparing the dispersion of two different data sets. One has a higher range, but a lower IQR than the other. What might this tell you about each data set?

  • The one with the higher range has more outliers
  • The one with the higher range has more variability
  • The one with the lower IQR has more variability
  • The one with the lower IQR is more skewed
If one dataset has a higher range but a lower IQR than the other, this suggests that the one with the higher range has more outliers. The range is sensitive to extreme values, while the IQR describes the spread of the middle 50% of the data and is largely unaffected by outliers.
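The two toy datasets below (values invented for illustration) show exactly this pattern: one is tightly clustered with a single extreme value, the other is evenly spread with no outliers:

```python
import numpy as np

# Dataset A: tightly clustered, but with one extreme value
a = np.array([10., 11., 11., 12., 12., 13., 95.])
# Dataset B: no outliers, but the bulk of the data is more spread out
b = np.array([5., 15., 25., 35., 45., 55., 65.])

def data_range(x):
    return x.max() - x.min()

def iqr(x):
    q1, q3 = np.percentile(x, [25, 75])
    return q3 - q1
```

Here A has the larger range (driven entirely by the value 95) but the far smaller IQR, because the middle 50% of A sits in a narrow band.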

What is a primary assumption when using regression imputation?

  • All data is normally distributed
  • Missing data is missing completely at random (MCAR)
  • Missing values are negligible
  • The relationship between variables is linear
A primary assumption when using regression imputation is that the relationship between variables is linear. This is because regression imputation uses a regression model to predict missing values, and the basic form of regression models assumes a linear relationship between predictor and response variables.
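A minimal sketch of single-predictor regression imputation using `numpy.polyfit` (the column names, the synthetic linear relationship, and the choice of which rows go missing are all assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({"x": rng.uniform(0, 10, 50)})
df["y"] = 3 * df["x"] + 5 + rng.normal(0, 1, 50)  # linear by construction
df.loc[::10, "y"] = np.nan                        # knock out some y values

# fit y ~ x on the complete rows only (assumes the relation is linear)
complete = df.dropna()
slope, intercept = np.polyfit(complete["x"], complete["y"], deg=1)

# predict the missing y values from the fitted line
missing = df["y"].isna()
df.loc[missing, "y"] = slope * df.loc[missing, "x"] + intercept
```

If the true relationship were curved, the fitted line would systematically mis-impute, which is exactly why linearity is the key assumption here.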

You are working on a dataset and found that the model performance is poor. On further inspection, you found some data points that are far from the rest. What could be a possible reason for the poor performance of your model?

  • Outliers
  • Overfitting
  • Underfitting
The poor performance of the model might be due to outliers in the dataset. Outliers can have a significant impact on the performance of machine learning models.
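A sketch of that impact (synthetic data; the injected point at (10, 200) is an invented extreme value): fitting a least-squares line with and without a single outlier shifts the estimated slope noticeably:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 30)
y = 2 * x + rng.normal(0, 0.5, 30)   # true slope is 2

# inject one extreme point far from the rest of the data
x_out = np.append(x, 10.0)
y_out = np.append(y, 200.0)

slope_clean, _ = np.polyfit(x, y, 1)        # close to the true slope
slope_out, _ = np.polyfit(x_out, y_out, 1)  # pulled away by one point
```

Least-squares fits minimize squared error, so a single distant point can dominate the loss and drag the whole model toward it.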

As a data scientist, you've realized that your dataset contains missing values. How would you handle this situation as part of your EDA process?

  • Always replace missing values with the mean or median
  • Choose an appropriate imputation method depending on the nature of the data and the type of missingness
  • Ignore the missing values and proceed with analysis
  • Remove all instances with missing values
Handling missing values is an important part of the EDA process. The method used to handle them depends on the nature of the data and the type of missingness (MCAR, MAR, or NMAR). Various imputation methods can be used, such as mean/median/mode imputation for MCAR or MAR data, and advanced imputation methods like regression imputation, multiple imputation, or model-based methods for NMAR data.
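A hedged pandas sketch of the simple end of that spectrum (column names and values are invented; mean/median/mode imputation is only reasonable when missingness is plausibly MCAR or MAR):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 28],
    "income": [40_000, np.nan, 52_000, 61_000, np.nan],
    "city": ["NY", "LA", "NY", np.nan, "NY"],
})

# inspect the extent of missingness before choosing a strategy
counts = df.isna().sum()

# simple imputations for plausibly-MCAR/MAR data:
df["age"] = df["age"].fillna(df["age"].median())        # numeric -> median
df["income"] = df["income"].fillna(df["income"].mean()) # numeric -> mean
df["city"] = df["city"].fillna(df["city"].mode()[0])    # categorical -> mode
```

For NMAR data these simple fills can bias the analysis, which is when model-based approaches such as regression or multiple imputation become worth the extra effort.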

If the variance of a data set is zero, then all data points are ________.

  • Equal
  • Infinite
  • Negative
  • Positive
If the variance of a data set is zero, then all data points are equal. Variance measures how far a set of numbers is spread out from its average value; a variance of zero indicates that every value in the set is identical.
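A one-line check with numpy (toy values chosen for illustration):

```python
import numpy as np

identical = np.array([7.0, 7.0, 7.0, 7.0])
spread = np.array([5.0, 7.0, 9.0])

var_identical = identical.var()  # 0.0: every point equals the mean
var_spread = spread.var()        # positive whenever any point differs
```

Since variance is the mean of squared deviations from the mean, it can only be zero when every deviation is zero.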