When is it more appropriate to use a correlation matrix instead of a pairplot?

When the dataset is very large
When the dataset is very small
When the variables are not numeric
When there are only two variables

A correlation matrix is the more appropriate choice when the dataset is large, especially when it has many variables. A pairplot draws one panel for every pair of variables, so it quickly becomes cluttered and unreadable as the number of variables grows, whereas a correlation matrix summarizes each pair as a single number.
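
As a rough illustration of why the matrix scales better, the sketch below builds a small Pearson correlation matrix using only the standard library. The column names and values are hypothetical toy data, and the `pearson` helper is written here for illustration rather than taken from any library.

```python
from math import sqrt
from statistics import fmean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical toy dataset: three numeric variables, five observations each.
data = {
    "area": [50, 60, 80, 100, 120],
    "rooms": [2, 2, 3, 4, 5],
    "age": [30, 25, 20, 10, 5],
}

names = list(data)

# One number per pair of variables.
corr = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

# A pairplot of the same data would need one panel per pair of variables.
n_panels = len(names) ** 2

for a in names:
    print(a.ljust(6), "  ".join(f"{corr[a][b]:+.2f}" for b in names))
print("pairplot panels needed:", n_panels)
```

With pandas, `DataFrame.corr()` computes the same kind of matrix in one call. Either way, 20 variables would mean a 20 × 20 table of numbers versus 400 plot panels.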


What is variance in the context of a data set?

The average deviation from the mean
The average squared deviation from the mean
The range of the data
The square root of the average deviation from the mean

The variance of a data set is the average squared deviation from the mean. It measures how far the data points spread around the mean, and its square root is the standard deviation.
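
The definition can be checked by hand on a small example. The sketch below uses made-up numbers and computes the population variance (dividing by n), which matches the "average squared deviation" wording. The sample variance, which divides by n - 1, is a separate convention.

```python
from statistics import fmean, pvariance

# Hypothetical data set, chosen so the arithmetic comes out evenly.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = fmean(data)

# Average of the squared deviations from the mean (population variance).
variance = sum((x - mean) ** 2 for x in data) / len(data)

# The standard deviation is the square root of the variance.
std_dev = variance ** 0.5

print(mean, variance, std_dev)

# The standard library's population variance agrees with the manual result.
print(pvariance(data))
```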


What does Min-Max scaling do to the dataset?

It reduces the dimensionality of the dataset
It removes the mean and scales the data to unit variance
It scales the data based on median and interquartile range
It scales the dataset so that all feature values are in the range 0 to 1

Min-Max scaling, also known as normalization, rescales each feature to a fixed range, typically 0 to 1. The rescaling uses each feature's own minimum and maximum values.


How does the 'hue' parameter in Seaborn alter the visual presentation of data?

Changes the color of elements
Changes the shape of markers
Changes the size of markers
Rotates the plot

In Seaborn, the 'hue' parameter changes the color of elements. It is used to provide a color encoding for a third (typically categorical) variable in addition to two numeric variables.


What is an 'outlier' in the context of data analysis?

A data point that lies an abnormal distance from other values
A method to visualize data
A variable that is not significant
An error in data collection

In data analysis, an outlier is a data point that lies an abnormal distance from other values in a random sample from a population.


Imagine you're analyzing a dataset for a real estate company. You observe that a few houses have an extraordinarily high price compared to the rest. What would these represent in your analysis?

Anomalies
Data manipulation
Errors in data collection
Outliers

These could represent outliers. In the context of a dataset, outliers are individual data points that are distant from other observations.
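
The quiz does not name a detection rule, so the sketch below assumes one common convention: flag any value more than 1.5 times the interquartile range (IQR) beyond the quartiles. The house prices are hypothetical, in thousands.

```python
from statistics import quantiles

# Hypothetical house prices (in thousands); the last two are far above the rest.
prices = [210, 225, 240, 250, 255, 260, 275, 290, 310, 1500, 2200]

# Quartiles of the data; quantiles() sorts its input internally.
q1, _median, q3 = quantiles(prices, n=4)
iqr = q3 - q1

# Assumed rule: values beyond 1.5 * IQR from the quartiles count as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = [p for p in prices if p < lower or p > upper]
print(outliers)
```

Whether a flagged value is a genuine luxury property or a data-entry mistake is a separate question that the rule itself cannot answer.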


How does 'questioning' in the EDA process differ from 'concluding'?

Questioning involves data cleaning while concluding involves data visualization.
Questioning involves defining variables, while concluding focuses on outlier detection.
Questioning is about data transformation, while concluding is about hypothesis testing.
Questioning sets the analysis goals, while concluding involves drawing insights from the explored data.

In the EDA process, questioning is the stage where the goals of the analysis are set. These are typically in the form of questions that the analysis aims to answer. On the other hand, concluding involves drawing meaningful insights from the data that have been analyzed in the explore phase. This could involve formal or informal hypothesis testing and aids in shaping subsequent data analysis steps, reporting, or decision-making.


How might the transformation method for handling outliers impact the overall shape of your data distribution?

It can introduce multimodality into the distribution
It can make the distribution more skewed
It can make the distribution more symmetrical

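Transforming the data, for example with a log or square-root transform, compresses extreme values more than typical ones. This pulls in the long tail and can make a skewed distribution more symmetrical.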
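
A small sketch of this effect, assuming a log transform and a hypothetical right-skewed data set chosen so the result is easy to see (each value is double the previous one). The `skewness` helper is an illustrative moment-based definition written for this example.

```python
from math import log
from statistics import fmean


def skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    m = fmean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed values: a few very large numbers stretch the tail.
prices = [100, 200, 400, 800, 1600, 3200, 6400]

# The log transform shrinks large values far more than small ones.
logged = [log(p) for p in prices]

print("skewness before:", round(skewness(prices), 3))
print("skewness after: ", round(skewness(logged), 3))
```

On real data the skewness rarely drops all the way to zero. The point is only that the long right tail is pulled toward the bulk of the values.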


Which method of analysis focuses on the exploration of patterns and relationships in the data?

CDA
Data Wrangling
EDA
Predictive Modeling

EDA (Exploratory Data Analysis) focuses on exploring patterns and relationships in the data. It is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.


What is the formula used in the calculation of Min-Max scaling?

(value - mean) / standard deviation
(value - min) / (max - min)
value - min
value / max

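The formula for Min-Max scaling is (value - min) / (max - min), where min and max are the smallest and largest values of the feature. Subtracting min shifts the smallest value to 0, and dividing by (max - min) scales the largest value to 1, so every value ends up in the range 0 to 1.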
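
The formula translates directly into code. The sketch below applies it to a hypothetical list of values. The function name `min_max_scale` and the handling of a constant feature (returning all zeros to avoid division by zero) are assumptions made for this example.

```python
def min_max_scale(values):
    """Rescale values to the range 0 to 1 using (value - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature has no spread, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature values.
feature = [10, 20, 25, 40]
scaled = min_max_scale(feature)

print(scaled)  # the minimum maps to 0.0 and the maximum to 1.0
```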
