Principal Component Analysis (PCA) is a technique that reduces dimensionality by creating new uncorrelated variables called _______. These new variables retain most of the variability in the original dataset.

Eigenvalues
Eigenvectors
Factors
Principal components

Principal Component Analysis (PCA) reduces dimensionality by creating new uncorrelated variables called principal components, ordered so that the first few retain most of the variability in the original dataset. PCA works by projecting the original data onto a new set of axes, the principal components, which are orthogonal to each other, so the projected variables are uncorrelated.
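
The projection described above can be sketched with NumPy. This is a minimal illustration that computes the components through a singular value decomposition of the centered data; the function name `pca` and the synthetic three-feature dataset are assumptions made for the example, not part of the quiz.

```python
import numpy as np

def pca(X, n_components):
    """Project X (rows = samples) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions,
    # sorted by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    scores = X_centered @ components.T
    explained_variance_ratio = (S ** 2) / np.sum(S ** 2)
    return scores, components, explained_variance_ratio[:n_components]

rng = np.random.default_rng(0)
# Hypothetical data: two strongly correlated features plus a small noise feature.
x = rng.normal(size=200)
X = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    rng.normal(scale=0.1, size=200),
])

scores, components, ratio = pca(X, n_components=2)
print("explained variance ratio:", ratio)
print("covariance of scores:\n", np.cov(scores.T))
```

In this sketch the off-diagonal entries of the score covariance come out numerically zero, which is what "uncorrelated" means here, and the first ratio shows how much of the total variance a single component captures.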

Why is 'communication' crucial even after 'conclusion' in the EDA process?

Communication ensures the insights derived are effectively conveyed to the relevant stakeholders.
Communication helps in data cleaning after conclusion.
Communication is not crucial after conclusion.
Communication is only crucial for large datasets.

In the EDA process, communication is the final step and is crucial as it ensures the insights, findings, or conclusions derived from the analysis are effectively conveyed to the relevant stakeholders. This stage might involve the preparation of reports, presentation decks, or visual dashboards, and it helps in facilitating data-driven decision making.

You have a dataset in which the 'income' feature has some missing values. You decided to use mode imputation. Why could this lead to misleading results?

All of the above
Income is usually a continuous variable, and mode may not be an appropriate measure of central tendency
It could cause overfitting
It might introduce selection bias

Income is typically a continuous variable, so filling its missing values with the mode can lead to misleading results. The mode is a measure of central tendency better suited to categorical variables: in continuous data, values rarely repeat, so the most frequent value may be arbitrary and need not reflect the center or shape of the underlying distribution.
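
A small sketch of the issue, using only the standard library. The income figures are made up for illustration, and the median is shown purely as a point of comparison (an assumption of this example, not a recommendation from the quiz).

```python
import statistics

# Hypothetical incomes; None marks a missing value.
incomes = [32_000, 41_500, 41_500, 58_200, 67_900, None, 120_000, None, 250_000]

observed = [v for v in incomes if v is not None]
mode_value = statistics.mode(observed)      # most frequent observed value
median_value = statistics.median(observed)  # middle of the sorted observed values

# Fill the gaps with each candidate statistic.
mode_imputed = [mode_value if v is None else v for v in incomes]
median_imputed = [median_value if v is None else v for v in incomes]

print("mode:", mode_value, "median:", median_value)
print("mode-imputed:", mode_imputed)
```

Here the mode exists only because one value happens to appear twice, and imputing it stacks even more records on that single value.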

What are the advantages and disadvantages of using a violin plot versus a box plot?

All of the above
Box plots are more visually complex than violin plots
Violin plots can be harder to read for a non-technical audience
Violin plots provide less information than box plots

Violin plots, while they provide more information (including the density of the distribution), can be harder to read for a non-technical audience. Box plots, while less information-rich, are generally easier to interpret.

You are working with a healthcare dataset and you observe that a patient's health status influences the probability of missingness in the data. What type of missing data is this?

MAR
MCAR
NMAR
Not missing data

This would be MAR (Missing at Random) because the missingness is related to observed data (the patient's health status). The missingness is not completely random, but it does not depend on the unobserved values themselves.
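
The MAR mechanism can be illustrated with a short simulation. Everything here is hypothetical: the `status` labels, the blood pressure variable, and the missingness probabilities of 0.6 and 0.1 are assumptions chosen to make the pattern visible.

```python
import random

random.seed(42)

records = []
for _ in range(2000):
    status = random.choice(["healthy", "ill"])  # always observed
    blood_pressure = random.gauss(120, 15)      # value that may go missing
    # Missingness depends only on the observed status,
    # not on the blood pressure value itself.
    p_missing = 0.6 if status == "ill" else 0.1
    if random.random() < p_missing:
        blood_pressure = None
    records.append((status, blood_pressure))

def missing_rate(group):
    """Share of missing blood pressure values within one status group."""
    values = [bp for status, bp in records if status == group]
    return sum(bp is None for bp in values) / len(values)

rate_ill = missing_rate("ill")
rate_healthy = missing_rate("healthy")
print(f"missing rate (ill): {rate_ill:.2f}, (healthy): {rate_healthy:.2f}")
```

Comparing missing rates across an observed variable, as done at the end, is one simple way to spot this kind of pattern during EDA.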

How does regularization help address multicollinearity?

By constraining the coefficient estimates, potentially setting some to zero.
By increasing model complexity.
By increasing the variance of the model.
By reducing model bias.

Regularization techniques, such as Ridge and Lasso regression, can help address multicollinearity by adding a penalty term to the loss function that constrains the coefficients. Ridge shrinks the coefficients toward zero without eliminating them, whereas Lasso can set some coefficients exactly to zero, effectively performing feature selection.
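
As a rough illustration of the penalty's effect, the sketch below fits ordinary least squares and Ridge on two nearly identical features using the closed-form Ridge solution. The helper `ridge_fit`, the synthetic data, and the penalty strength `alpha=1.0` are assumptions for the example; Lasso is not shown because it has no closed form.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution (no intercept): (X^T X + alpha*I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)  # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=100)

ols_coef = ridge_fit(X, y, alpha=0.0)    # alpha = 0 reduces to ordinary least squares
ridge_coef = ridge_fit(X, y, alpha=1.0)  # penalized fit

print("OLS coefficients:  ", ols_coef)
print("Ridge coefficients:", ridge_coef)
```

With collinear inputs the unpenalized coefficients can be large and unstable, while the penalized ones are constrained to a smaller norm and tend to share the effect between the two features.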

The ________ is a measure of dispersion that considers the spread of the middle 50% of data.

Interquartile Range
Range
Standard Deviation
Variance

The "Interquartile Range" (IQR) is a measure of statistical dispersion equal to the difference between the upper and lower quartiles. Like the standard deviation or variance it describes spread, but it is much more robust against outliers because it ignores the lowest and highest 25% of the data.
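
The robustness claim can be checked on a toy sample. The numbers are hypothetical, and `statistics.quantiles` uses its default "exclusive" method; other quartile conventions give slightly different values.

```python
import statistics

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 21]
with_outlier = data + [200]  # one extreme value appended

print("IQR:  ", iqr(data), "->", iqr(with_outlier))
print("stdev:", statistics.stdev(data), "->", statistics.stdev(with_outlier))
```

Adding a single extreme value barely moves the IQR, while the standard deviation grows many times over.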

Which outlier handling technique would be suitable for a dataset with numerous extreme values distributed on both ends?

Binning
Removal
Transformation

Transformation is a suitable technique for handling outliers when the dataset contains numerous extreme values distributed on both ends, as it can pull in these extreme values and make the data distribution more symmetrical.
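
The quiz does not name a specific transformation. As one possible sketch, a signed logarithm compresses large magnitudes on both the negative and positive side while preserving order and sign; the function name `signed_log` and the sample values are assumptions for illustration.

```python
import math

def signed_log(x):
    """Compress magnitude with log1p while keeping the sign of x."""
    return math.copysign(math.log1p(abs(x)), x)

# Hypothetical values with extremes on both ends.
values = [-5000, -800, -3, -1, 0, 2, 4, 900, 7000]
transformed = [signed_log(v) for v in values]

for original, new in zip(values, transformed):
    print(f"{original:>6} -> {new:.2f}")
```

The extremes are pulled toward the center and no observations are dropped, which is the main difference from removal.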

Consider a scatter plot displaying a tight, downward sloping distribution of points. What can be inferred about the relationship between the two plotted variables?

There is a random relationship
There is a strong negative relationship
There is a strong positive relationship
There is no relationship

A tight, downward sloping distribution of points in a scatter plot implies a strong negative relationship between the two plotted variables. As one variable increases, the other variable decreases.
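
The visual impression of a tight, downward slope can be quantified with the Pearson correlation coefficient, which approaches -1 in that case. The sketch below computes it by hand on made-up data; the variable names are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: scores fall steadily as hours increase.
hours_tv = [1, 2, 3, 4, 5, 6, 7, 8]
score = [92, 88, 85, 79, 76, 70, 67, 61]

r = pearson_r(hours_tv, score)
print(f"r = {r:.3f}")
```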

________ is one potential cause of outliers in a dataset.

Measurement error
Overfitting
Underfitting

Measurement error is one potential cause of outliers in a dataset. This can occur due to inaccuracies in data collection, recording, or entry.

What role does EDA play in formulating hypotheses or selecting a model in data analysis?

All of the above
It assists in defining the variables to be used in the model
It enables an understanding of the relationships among the variables
It helps in determining the type of model to apply

EDA plays a fundamental role in hypothesis formulation and model selection. It can guide the choice of the most suitable models based on the understanding of data structure and relationships between variables. It helps define the variables to use in the model, identify potential outliers, detect multicollinearity, and assess the need for variable transformation or creation. Therefore, EDA forms the foundation for further statistical or machine learning analysis.

The detection of outliers using histograms can be influenced by the choice of _________.

axis scale
bin size
color
orientation

The choice of bin size in a histogram can influence the detection of outliers. If the bins are too wide, outliers may not be visible, while if they're too narrow, normal variation in the data may appear as outliers.
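
The effect of bin width can be seen by counting the same values with two different widths. This is a minimal sketch on hypothetical integer data; the helper `bin_counts` is written for the example and assumes the range divides evenly into bins.

```python
def bin_counts(values, bin_width, low, high):
    """Count integer values per bin of equal width over [low, high]."""
    n_bins = (high - low) // bin_width
    counts = [0] * n_bins
    for v in values:
        # Clamp so a value equal to `high` falls in the last bin.
        index = min((v - low) // bin_width, n_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical data: a cluster between 12 and 58 plus one extreme value, 95.
values = [12, 18, 23, 25, 31, 34, 36, 42, 47, 51, 55, 58, 95]

wide = bin_counts(values, bin_width=50, low=0, high=100)
narrow = bin_counts(values, bin_width=10, low=0, high=100)

print("width 50:", wide)
print("width 10:", narrow)
```

With a width of 50 the extreme value shares a bin with ordinary observations and disappears; with a width of 10 it sits alone after a run of empty bins.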
