How does standard deviation differ in a sample versus a population?

The denominator in the calculation of the sample standard deviation is (n-1)
The standard deviation of a sample is always larger
The standard deviation of a sample is always smaller
They are calculated in the same way

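The standard deviation of a sample differs from that of a population in how it is calculated. For a sample, the denominator is (n-1) instead of n. This is Bessel's correction: because deviations are measured from the sample mean rather than the true population mean, dividing by n would systematically underestimate the population variance.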
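
As a small illustration of the two denominators, the sketch below uses Python's standard `statistics` module, where `pstdev` divides by n and `stdev` divides by n-1. The data values are made up for the example.

```python
import statistics

# Made-up values, used only to illustrate the two formulas.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Population standard deviation: the squared deviations are divided by n.
population_sd = statistics.pstdev(data)

# Sample standard deviation: the squared deviations are divided by n - 1.
sample_sd = statistics.stdev(data)

print(population_sd, sample_sd)
```

On the same set of values, the (n-1) version comes out slightly larger. That does not mean a sample drawn from a population always has a larger standard deviation than the population itself.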

You are visualizing a heatmap and notice a row with colors drastically different than the rest. What might this indicate about the corresponding variable?

The variable has a unique distribution
The variable has many missing values
The variable is an outlier
The variable is unrelated to the others

A row whose colors differ drastically from the rest suggests that the corresponding variable is unrelated to the other variables in the dataset, or relates to them very differently than they relate to one another. This reading assumes the heatmap shows pairwise relationships, such as correlations.

You're working with a data set that does not follow a normal distribution. Which method, Z-score or IQR, should be used for detecting outliers?

Both are suitable
IQR
Neither is suitable
Z-score

In this case, the IQR method is the better choice. It relies on quartiles and makes no assumption about the shape of the distribution, whereas the Z-score method relies on the mean and standard deviation and works best when the data is approximately normal.
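
To see why the Z-score can struggle on skewed data, here is a minimal sketch using only the standard library. The sample values and the |z| > 3 cutoff are assumptions chosen for illustration; the cutoff is a common convention, not a fixed rule.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, two are very large.
data = [1, 1, 2, 2, 3, 3, 4, 5, 8, 50, 120]

mean = statistics.mean(data)
sd = statistics.pstdev(data)

# Assumed convention: flag a point when its absolute Z-score exceeds 3.
z_scores = [(x - mean) / sd for x in data]
z_outliers = [x for x, z in zip(data, z_scores) if abs(z) > 3]

print(z_outliers)
```

The extreme values pull the mean upward and inflate the standard deviation, so even 120 stays under the cutoff and nothing is flagged.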

How does the IQR method categorize a data point as an outlier?

By comparing it to the mean
By comparing it to the median
By comparing it to the standard deviation
By seeing if it falls below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR

The IQR method categorizes a data point as an outlier if it falls below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 - Q1.
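
The rule can be sketched in a few lines with the standard library. The helper name `iqr_outliers` and the sample data are hypothetical. Quartile conventions also differ between tools, so the exact fences may vary slightly; this sketch uses the "inclusive" method of `statistics.quantiles`.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles() with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical right-skewed sample.
data = [1, 1, 2, 2, 3, 3, 4, 5, 8, 50, 120]
outliers = iqr_outliers(data)
print(outliers)
```

Because the fences are built from quartiles, the two extreme values do not shift them much, and both are flagged.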

How can incorrect handling of missing data impact the bias-variance trade-off in a machine learning model?

Does not affect the bias-variance trade-off.
Increases bias and reduces variance.
Increases both bias and variance.
Increases variance and reduces bias.

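Improper handling of missing data, for example naive imputation with a single value such as the mean, can increase bias and decrease variance. The imputed values may not reflect the true ones, so the model learns distorted patterns (higher bias), while filling many gaps with the same value makes the data look more uniform than it really is (lower variance).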
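
One way to see the effect of naive imputation on the data itself is to compare the spread of a feature before and after mean imputation. This is only a sketch with made-up values; it shows the shrinking spread of the feature, not a full bias-variance analysis of a model.

```python
import statistics

# Made-up feature with two missing entries, represented as None.
observed = [10, 12, None, 15, None, 20, 23]

known = [x for x in observed if x is not None]
fill_value = statistics.mean(known)

# Naive imputation: every missing entry gets the same value, the mean.
imputed = [fill_value if x is None else x for x in observed]

spread_before = statistics.stdev(known)
spread_after = statistics.stdev(imputed)

print(spread_before, spread_after)
```

The mean is unchanged, but the standard deviation drops because the filled-in points sit exactly at the center of the distribution.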

You're using a model that is sensitive to multicollinearity. How can feature selection help improve your model's performance?

By adding more features
By removing highly correlated features
By transforming the features
By using all features

If you're using a model that is sensitive to multicollinearity, feature selection can improve its performance by removing highly correlated features. Multicollinearity makes the estimates of some models unstable, linear regression coefficients being a typical example, and dropping features that are highly correlated with others alleviates this problem.
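
A simple greedy filter illustrates the idea: keep a feature only if it is not strongly correlated with any feature already kept. The helper names, the 0.9 threshold, and the toy columns below are all assumptions for illustration, not a recommended recipe.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def drop_correlated(features, threshold=0.9):
    """Return the names of features kept by a greedy correlation filter."""
    kept = []
    for name, column in features.items():
        # Keep the feature only if it is not highly correlated with any kept one.
        if all(abs(pearson(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy data: height in inches is nearly a rescaled copy of height in centimeters.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [34, 21, 45, 29, 38],
}
kept = drop_correlated(features)
print(kept)
```

The result depends on the order in which features are visited, which is one reason this is only a starting point.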

How can mishandling missing data in a feature affect the feature's importance in a machine learning model?

Decreases the feature's importance.
Depends on the feature's initial importance.
Has no effect on the feature's importance.
Increases the feature's importance.

Mishandling missing data can distort the data distribution and skew the feature's statistical properties, which might lead to a decrease in its importance when the model is learning.

In a scenario where your dataset has a Gaussian distribution, which scaling method is typically recommended and why?

All scaling methods work equally well with Gaussian distributed data
Min-Max scaling because it scales all values between 0 and 1
Robust scaling because it is not affected by outliers
Z-score standardization because it creates a normal distribution

Z-score standardization is typically recommended for a dataset with a Gaussian distribution. Although it doesn't create a normal distribution, it scales the data such that it has a mean of 0 and a standard deviation of 1, which aligns with the properties of a standard normal distribution.
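
A minimal sketch of Z-score standardization with the standard library follows. The values are made up, and the population standard deviation is used here as an assumption, which is one common choice among several.

```python
import statistics

# Made-up feature values.
values = [10, 20, 30, 40, 50]

mean = statistics.mean(values)
sd = statistics.pstdev(values)

# Subtract the mean and divide by the standard deviation.
standardized = [(v - mean) / sd for v in values]

print(statistics.mean(standardized), statistics.pstdev(standardized))
```

The result has a mean of 0 and a standard deviation of 1 up to floating-point rounding, but the shape of the distribution is unchanged: the points keep their relative spacing.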

Which measure of central tendency can be used for both quantitative and qualitative data?

Mean
Median
Mode

The "Mode" is the measure of central tendency that can be used for both quantitative and qualitative data. It is the value that appears most frequently in a data set, and it is the only measure of central tendency that can be used with nominal data.
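
For instance, the standard library's `statistics.mode` accepts both numbers and labels. The sample values here are made up.

```python
import statistics

# Nominal (qualitative) data: the mode is simply the most frequent label.
colors = ["red", "blue", "red", "green", "blue", "red"]
most_common_color = statistics.mode(colors)

# Quantitative data works the same way.
scores = [3, 5, 5, 7, 5, 3]
most_common_score = statistics.mode(scores)

print(most_common_color, most_common_score)
```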

What functionality does the Seaborn library add over Matplotlib?

3D plotting
Interactive plotting
Real-time plotting
Statistical plotting

While Matplotlib is a powerful library for creating a wide range of plots, Seaborn builds on it by providing high-level statistical plotting functions, allowing users to create more informative and attractive visualizations with fewer lines of code.

What is the process of removing an entire row when any single data point within it is missing called?

Listwise Deletion
Mean Imputation
Pairwise Deletion
Regression Imputation

The process of removing an entire row when any single data point within it is missing is called 'Listwise Deletion'. Also known as 'Complete Case Analysis', this technique is straightforward and fast, but it can potentially discard valuable data and introduce bias if the missingness is not completely at random.
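
Listwise deletion can be sketched in plain Python as follows. The records are made up, and `None` stands in for a missing value. In pandas, `DataFrame.dropna()` with its default arguments performs the same operation.

```python
# Made-up records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 29, "income": None},
    {"age": 41, "income": 48000},
]

# Listwise deletion: keep a row only if every field is present.
complete_rows = [
    row for row in rows
    if all(value is not None for value in row.values())
]

print(len(rows), len(complete_rows))
```

Half of the rows are discarded here even though each dropped row still held one usable value, which is the data loss the explanation warns about.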

What type of plot is often used for visualizing the relationship between two continuous variables?

Bar plot
Box plot
Histogram
Scatter plot

Scatter plots are ideal for visualizing the relationship between two continuous variables. Each point in the scatter plot corresponds to the values of two variables.
