How does the 'explore' step in the EDA process aid in hypothesis generation?
- It aids in cleaning and transforming data.
- It helps in communicating the findings to stakeholders.
- It helps in defining the questions for analysis.
- It uncovers patterns, trends, relationships, and anomalies in the data.
The explore phase in the EDA process involves analyzing and investigating the data using statistical techniques and visualization methods. This step uncovers patterns, trends, relationships, and anomalies in the data, which can help in forming or refining hypotheses that could be formally tested in subsequent analysis steps.
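As a minimal sketch of how exploration can surface an anomaly worth a hypothesis, the snippet below applies the common 1.5 × IQR rule to a small list. The `daily_orders` values are made up for illustration, and the IQR rule is only one of several possible anomaly checks.

```python
import statistics

# Hypothetical daily order counts; the numbers are made up for illustration.
daily_orders = [12, 14, 15, 15, 16, 17, 18, 95]

# Quartiles split the sorted data into four parts; we need Q1 and Q3.
q1, _, q3 = statistics.quantiles(daily_orders, n=4)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as anomalies.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
anomalies = [x for x in daily_orders if x < lower_fence or x > upper_fence]

print("median:", statistics.median(daily_orders))
print("anomalies:", anomalies)
```

A flagged value like this might prompt a hypothesis (for example, that a promotion ran on that day), which could then be tested formally.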
A company surveyed its customers for their satisfaction scores, ranging from 1-10. The scores were heavily skewed to the left, with a few customers giving a score of 1 or 2. Which measure of central tendency should the company use to present a typical customer experience?
- All are equally valid
- Mean
- Median
- Mode
The median is the best measure of central tendency in this scenario. Because the scores are heavily skewed to the left, the median represents a typical customer's experience more accurately than the mean, which is dragged down by the few low scores.
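The effect of a few very low scores on the mean can be checked with a small sketch. The `scores` list below is a hypothetical sample, not data from the survey described above.

```python
import statistics

# Hypothetical satisfaction scores: mostly high, with two very low values.
scores = [1, 2, 8, 8, 9, 9, 9, 9, 10, 10]

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)

# The low scores pull the mean down, while the median stays near the bulk.
print("mean:", mean_score)
print("median:", median_score)
```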
The process of 'binning' to handle outliers involves grouping data into ________.
- Bins
- Deciles
- Percentiles
- Quartiles
In binning, the data is grouped into intervals called bins. To reduce the influence of outliers, each value can then be replaced with a summary statistic of its bin, such as the bin mean, median, or mode.
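The sketch below shows one common variant, equal-frequency binning with smoothing by bin means. The text does not say which binning scheme is intended, so the bin size and the helper name `smooth_by_bin_means` are assumptions made for illustration.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values, split them into bins of `bin_size`, and replace
    each value with the mean of its bin (hypothetical helper)."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

# Made-up sample data.
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(data, bin_size=3)
print(smoothed)
```

Using the bin median instead of the mean would make each bin's replacement value less sensitive to an extreme value inside that bin.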
When is it more appropriate to use a correlation matrix instead of a pairplot?
- When the dataset is very large
- When the dataset is very small
- When the variables are not numeric
- When there are only two variables
A correlation matrix is the better choice when the dataset is very large, particularly when it contains many variables. A pairplot draws one panel for every pair of variables, so it becomes cluttered and hard to read as the number of variables grows, whereas a correlation matrix summarizes each pair as a single number.
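To illustrate how compact a correlation matrix is, here is a minimal pure-Python sketch that computes Pearson correlations for three made-up columns. The column names and values are hypothetical; in practice a library such as pandas would normally be used.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical columns, made up for illustration.
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
}

# One number per pair of variables, instead of one plot per pair.
corr_matrix = {
    row: {col: pearson(columns[row], columns[col]) for col in columns}
    for row in columns
}

for name, row in corr_matrix.items():
    print(name, [round(value, 2) for value in row.values()])
```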
What is variance in the context of a data set?
- The average deviation from the mean
- The average squared deviation from the mean
- The range of the data
- The square root of the average deviation from the mean
Variance in the context of a data set is the average squared deviation from the mean. It measures how widely the data points are spread around the mean, and its square root is the standard deviation.
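The definition can be verified by hand on a small sample. The sketch below computes the population variance (dividing by n) and compares it with Python's `statistics.pvariance`. The data is made up, and note that the sample variance (`statistics.variance`) divides by n - 1 instead.

```python
import statistics

# Made-up sample data.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)

# Average of the squared deviations from the mean (population variance).
manual_variance = sum((x - mean) ** 2 for x in data) / len(data)
std_dev = manual_variance ** 0.5

print("population variance:", manual_variance)
print("standard deviation:", std_dev)
print("sample variance:", statistics.variance(data))
```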
What does Min-Max scaling do to the dataset?
- It reduces the dimensionality of the dataset
- It removes the mean and scales the data to unit variance
- It scales the data based on median and interquartile range
- It scales the dataset so that all feature values are in the range 0 to 1
Min-Max scaling, also known as normalization, transforms features by scaling each feature to a specific range, typically 0 to 1. This is done using the minimum and maximum values of each feature in the dataset.
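A minimal sketch of the idea, assuming the data is a list of rows with one numeric value per feature. The helper name `min_max_scale_columns` and the sample rows are hypothetical, and mapping a constant feature to 0.0 is a choice made here, not something the text specifies.

```python
def min_max_scale_columns(rows):
    """Scale each column (feature) to [0, 1] using its own min and max."""
    columns = list(zip(*rows))
    scaled_columns = []
    for column in columns:
        low, high = min(column), max(column)
        span = high - low
        # Assumption: a constant feature has no spread, so map it to 0.0.
        scaled_columns.append(
            [(value - low) / span if span else 0.0 for value in column]
        )
    # Transpose back to row-major order.
    return [list(row) for row in zip(*scaled_columns)]

# Made-up rows of (age, income).
rows = [(20, 30000), (30, 50000), (40, 90000)]
scaled = min_max_scale_columns(rows)
for row in scaled:
    print(row)
```

Each feature is scaled independently, so the large income values do not affect how the ages are scaled.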
How does the 'hue' parameter in Seaborn alter the visual presentation of data?
- Changes the color of elements
- Changes the shape of markers
- Changes the size of markers
- Rotates the plot
In Seaborn, the 'hue' parameter changes the color of plot elements. It maps an additional variable (typically categorical) to color, on top of the variables already assigned to the x and y axes.
How might the transformation method for handling outliers impact the overall shape of your data distribution?
- It can introduce multimodality into the distribution
- It can make the distribution more skewed
- It can make the distribution more symmetrical
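The transformation method can make the distribution more symmetrical. A transformation such as a logarithm or square root compresses extreme values more than typical ones, pulling a long tail in toward the rest of the data.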
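As a rough illustration, the sketch below compares the skewness of a made-up sample before and after a log transform. The data, the choice of `math.log1p`, and the moment-based `skewness` helper are all assumptions for this example.

```python
import math

def skewness(values):
    """Moment-based (population) skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up sample with two large outliers in the upper tail.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]

# log1p(x) = log(1 + x); it compresses large values much more than small ones.
logged = [math.log1p(v) for v in raw]

print("skewness before:", round(skewness(raw), 2))
print("skewness after: ", round(skewness(logged), 2))
```

The transformed sample is still skewed, but noticeably less so; a transformation reduces the pull of extreme values rather than removing them.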
Which method of analysis focuses on the exploration of patterns and relationships in the data?
- CDA
- Data Wrangling
- EDA
- Predictive Modeling
EDA (Exploratory Data Analysis) focuses on exploring patterns and relationships in the data. It is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
What is the formula used in the calculation of Min-Max scaling?
- (value - mean) / standard deviation
- (value - min) / (max - min)
- value - min
- value / max
The formula used in the calculation of Min-Max scaling is (value - min) / (max - min). This transformation scales and translates the feature to be within the range of 0 and 1.
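As a worked example of the formula, assume a feature whose observed minimum is 1 and maximum is 10 (values chosen purely for illustration). A value of 7 is then scaled as follows:

```latex
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
   = \frac{7 - 1}{10 - 1}
   = \frac{6}{9}
   \approx 0.667
```

The minimum itself maps to 0 and the maximum maps to 1, which is why the result always falls within that range.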