_____ data provides numerical measurements and it can be broken down into two subcategories: continuous and discrete.

Nominal
Ordinal
Qualitative
Quantitative

Quantitative data provides numerical measurements and it can be divided into two types: continuous (data that can take any value within a range) and discrete (data that can only take certain values).

Given a machine learning algorithm that is highly sensitive to the range of input values, which scaling technique should you implement?

Min-Max scaling because it scales all values between 0 and 1
No scaling, as the original data values should be maintained
Robust scaling because it is not affected by outliers
Z-score standardization because it creates a normal distribution

Min-Max scaling suits algorithms that are sensitive to the range of input values because it rescales every feature into a fixed range (usually 0 to 1), so all features share the same scale. Because the observed minimum and maximum define that range, the result is itself sensitive to outliers.
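
As a rough illustration of the rescaling described above, the following sketch applies the usual formula (x - min) / (max - min) to two features with very different ranges. The function name `min_max_scale` and the sample values are assumptions made for this example.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread; map every value to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


# Hypothetical features measured on very different scales.
ages = [18, 30, 45, 60]
incomes = [20_000, 50_000, 80_000, 120_000]

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)

print(scaled_ages)
print(scaled_incomes)
```

After scaling, both features lie between 0 and 1, so neither dominates a distance or gradient computation merely because of its units.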

How is the whisker of a box plot usually calculated?

Mean ± Standard Deviation
Median ± Interquartile Range
Minimum and maximum values of the dataset
Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

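The whiskers of a box plot are typically bounded by the fences Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, where IQR = Q3 - Q1. Each whisker extends to the most extreme data point that still lies within its fence; points beyond the fences are plotted individually as outliers.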
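
The 1.5 * IQR rule can be sketched with the standard library as shown here. The function name `box_plot_whiskers`, the sample data, and the choice of the "inclusive" quartile method are assumptions; plotting libraries may compute quartiles slightly differently.

```python
import statistics


def box_plot_whiskers(data, k=1.5):
    """Return the fences, whisker ends, and outliers under the k * IQR rule."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical data with one extreme value
result = box_plot_whiskers(sample)
print(result)
```

For this sample the quartiles are 3.25 and 7.75, so the fences fall at -3.5 and 14.5, the whiskers end at 1 and 9, and 30 is flagged as an outlier.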

What is the key characteristic of a Uniform Distribution?

All values are equally likely
Most values are around the mean
Values are skewed to the left
Values are skewed to the right

In a Uniform Distribution, all values have the same frequency/probability. That is, they are all equally likely.
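
A small sketch of the discrete case: a fair six-sided die, where every face receives the same probability. The helper name `discrete_uniform_pmf` is an assumption for this example, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def discrete_uniform_pmf(outcomes):
    """Assign the same probability 1/n to each of the n outcomes."""
    outcomes = list(outcomes)
    p = Fraction(1, len(outcomes))
    return {outcome: p for outcome in outcomes}


die = discrete_uniform_pmf(range(1, 7))
expected_value = sum(face * prob for face, prob in die.items())

print(die)
print(expected_value)
```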

Your EDA reveals a non-normal distribution of data in your dataset. How might this insight affect your choice of machine learning models or algorithms?

You should always normalize your data
You should use only non-parametric models
You should use only unsupervised learning models
Your choice of ML models might be influenced, as some models make certain assumptions about the data distribution

The distribution of data can influence the choice of machine learning models or algorithms because some models make distributional assumptions. For example, inference in linear regression assumes normally distributed residuals (not normally distributed inputs), and Gaussian Naive Bayes and linear discriminant analysis assume normally distributed features within each class. If such assumptions are violated, the model may perform poorly or its confidence intervals may be unreliable. Understanding the data distribution therefore helps you choose appropriate models or decide whether to transform your data.
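
One way to act on this during EDA is to measure skewness and check whether a transform helps. The sketch below uses the moment-based skewness m3 / m2^1.5 and a log transform; the function name `sample_skewness` and the data are assumptions chosen so the effect is easy to see, and a log transform is only one of several possible choices.

```python
import math


def sample_skewness(values):
    """Moment-based skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed feature (each value doubles the previous one).
raw = [1, 2, 4, 8, 16, 32, 64]
logged = [math.log(v) for v in raw]

skew_raw = sample_skewness(raw)
skew_logged = sample_skewness(logged)

print(f"skewness before log transform: {skew_raw:.3f}")
print(f"skewness after log transform:  {skew_logged:.3f}")
```

The raw values are strongly right-skewed, while their logarithms are evenly spaced and therefore symmetric, which brings the skewness to approximately zero.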

How does multiple imputation handle missing data?

It deletes rows with missing data
It estimates multiple values for each missing value
It fills missing data with mode values
It replaces missing data with a single value

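Multiple imputation generates several plausible values for each missing value, producing several completed datasets instead of one. Each dataset is analyzed separately and the results are pooled, so the final estimates reflect the uncertainty about the true values rather than treating a single imputed value as if it had been observed.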
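
The idea can be sketched in a deliberately simplified form. Here each missing value is drawn from a normal distribution fitted to the observed values, the mean is estimated on each completed dataset, and the results are pooled with Rubin's rules (total variance = within + (1 + 1/m) * between). The function name, the data, and the simple normal draw are assumptions; practical implementations usually impute from a predictive model and also account for parameter uncertainty.

```python
import random
import statistics


def multiple_imputation_mean(values, m=5, seed=0):
    """Estimate the mean of `values` (None = missing) from m imputed datasets."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)

    estimates, variances = [], []
    for _ in range(m):
        # Fill each gap with a fresh random draw, so every dataset differs.
        completed = [v if v is not None else rng.gauss(mu, sigma) for v in values]
        estimates.append(statistics.mean(completed))
        # Squared standard error of the mean for this completed dataset.
        variances.append(statistics.variance(completed) / len(completed))

    within = statistics.mean(variances)      # average within-imputation variance
    between = statistics.variance(estimates) # variance across the m estimates
    total = within + (1 + 1 / m) * between   # Rubin's rules
    return {
        "pooled_mean": statistics.mean(estimates),
        "within": within,
        "between": between,
        "total_variance": total,
        "estimates": estimates,
    }


measurements = [4.1, None, 5.0, 4.7, None, 5.3, 4.4, 4.9]  # hypothetical data
pooled = multiple_imputation_mean(measurements, m=5, seed=0)
print(pooled["pooled_mean"], pooled["total_variance"])
```

The between-imputation term is what a single imputation cannot provide: it widens the variance to account for not knowing the missing values.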

Mishandling missing data can lead to a high level of ________, impacting model performance.

bias
precision
recall
variance

If missing data is handled improperly, it can lead to biased training data, which can cause the model to learn incorrect or irrelevant patterns and, as a result, adversely affect its performance.
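
A toy example of how this bias can arise: suppose values are missing because of their size (here, high earners do not report their income) and the incomplete rows are simply dropped. All numbers are hypothetical and chosen only to make the effect visible.

```python
import statistics

# Hypothetical true incomes in thousands; in practice these are unknown.
true_incomes = [20, 25, 30, 35, 40, 60, 80, 120]

# Assumed missingness mechanism: incomes of 60 or more go unreported.
reported = [x if x < 60 else None for x in true_incomes]

# Naive handling: drop every row with a missing value.
complete_cases = [x for x in reported if x is not None]

true_mean = statistics.mean(true_incomes)
complete_case_mean = statistics.mean(complete_cases)

print(f"true mean:          {true_mean}")
print(f"complete-case mean: {complete_case_mean}")
```

Dropping the incomplete rows lowers the estimated mean from 51.25 to 30, and a model trained on the remaining rows would never see the upper part of the income range.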

Your data shows a notable difference between the mean and the median values. Which type of scaling would be least affected by this discrepancy?

All scaling methods are affected by this discrepancy
Min-Max scaling because it scales all values between 0 and 1
Robust scaling because it uses median and quartile ranges
Z-score standardization because it creates a normal distribution

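Robust scaling centers the data on the median and scales by the interquartile range, so it does not use the mean at all. A notable gap between the mean and the median usually signals skew or outliers, and the median and interquartile range are far less affected by these than the mean, standard deviation, minimum, or maximum used by the other methods.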
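
The following sketch contrasts robust scaling, (x - median) / IQR, with z-score standardization on two hypothetical datasets that differ only in how extreme the last value is. The function names and data are assumptions, and quartiles use the standard library's "inclusive" method.

```python
import statistics


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # assumed non-zero for this sketch
    return [(v - median) / iqr for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


mild = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
extreme = [1, 2, 3, 4, 5, 6, 7, 8, 9, 3000]

# Scaled values of the nine ordinary points under each method.
robust_mild, robust_extreme = robust_scale(mild)[:9], robust_scale(extreme)[:9]
z_mild, z_extreme = z_score(mild)[:9], z_score(extreme)[:9]

print(robust_mild == robust_extreme)
print(z_mild == z_extreme)
```

Making the outlier a hundred times larger leaves the robust-scaled values of the other nine points unchanged, whereas their z-scores shift because the mean and standard deviation both move.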

How does platykurtic kurtosis shape the data distribution?

It results in a distribution with heavier tails and a flatter peak.
It results in a distribution with lighter tails and a flatter peak.
It results in a distribution with lighter tails and a higher peak.
It results in a perfectly symmetrical distribution.

A platykurtic distribution (one with negative excess kurtosis) has lighter tails and a flatter peak than a normal distribution. This means extreme values or outliers occur less often.
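
Excess kurtosis can be computed as m4 / m2^2 - 3, where m2 and m4 are the second and fourth central moments; negative values indicate a platykurtic shape. The sketch below compares a flat, evenly spread sample with a sharply peaked, heavy-tailed one. The function name and both samples are assumptions for illustration.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


flat = list(range(1, 101))                  # evenly spread values, no extremes
peaked = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]  # tight center with two extreme values

kurt_flat = excess_kurtosis(flat)
kurt_peaked = excess_kurtosis(peaked)

print(f"flat sample:   {kurt_flat:.3f}")
print(f"peaked sample: {kurt_peaked:.3f}")
```

The flat sample has an excess kurtosis of about -1.2 (platykurtic), while the peaked sample has an excess kurtosis of 2 (leptokurtic).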

How does the application of Predictive Modeling differ from EDA and CDA in data-driven decision making?

Predictive Modeling does not play a role in data-driven decision making.
Predictive Modeling is used after EDA and CDA to make future predictions based on the data.
Predictive Modeling is used before EDA and CDA to anticipate the outcomes.
Predictive Modeling, EDA, and CDA all serve the same purpose.

Predictive Modeling, which is often performed after EDA and CDA, is used to make future predictions based on the data. These predictions can inform decision-making processes, particularly in data-driven organizations.

Which type of correlation is based on ranks and perfect for ordinal data?

Kendall's Tau
Pearson's correlation
Point-Biserial Correlation
Spearman's correlation

Spearman's correlation, also known as Spearman's rank correlation, is computed on ranks rather than raw values, which makes it well suited to ordinal data. It measures how well the relationship between two variables can be described by a monotonic function. Compared to Pearson's correlation, it is less sensitive to outliers and can capture monotonic relationships that are not linear.
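
Spearman's correlation can be computed as Pearson's correlation applied to the ranks of the data. The sketch below does this with average ranks for ties and compares both coefficients on a monotonic but non-linear relationship. The helper names and sample data are assumptions for illustration.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson's product-moment correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [v ** 2 for v in x]  # monotonic but not linear

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson:  {r_pearson:.4f}")
print(f"Spearman: {r_spearman:.4f}")
```

Because y always increases with x, the ranks agree perfectly and Spearman's coefficient is 1, while Pearson's coefficient is slightly below 1 because the relationship is not a straight line.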

In the context of a Binomial Distribution, a "success" is defined as _____.

a positive outcome
a random event
an outcome of interest
an outcome that occurs most frequently

In the context of a Binomial Distribution, a "success" is defined as an outcome of interest, which could be positive, negative, or neutral depending on the context.
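
To illustrate that a "success" need not be a desirable outcome, the sketch below treats a defective item as the success when applying the binomial formula P(X = k) = C(n, k) * p^k * (1 - p)^(n - k). The batch size and defect rate are hypothetical.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


# Here a "success" is the outcome of interest: an item being defective.
n_items = 10
p_defective = 0.1

pmf = [binomial_pmf(k, n_items, p_defective) for k in range(n_items + 1)]
prob_no_defects = pmf[0]
prob_at_least_one = 1 - prob_no_defects

print(f"P(no defective items)    = {prob_no_defects:.4f}")
print(f"P(at least one defective) = {prob_at_least_one:.4f}")
```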
