If you are working with a large data set and need to produce interactive visualizations for a web application, which Python library would be the most suitable?

  • Bokeh
  • Matplotlib
  • Plotly
  • Seaborn
Plotly is well-suited for creating interactive visualizations and can handle large data sets efficiently. It also supports rendering in web applications, making it ideal for this scenario.

For data with outliers, the _____ is typically a better measure of central tendency as it is less sensitive to extreme values.

  • Mean
  • Median
  • Mode
  • Variance
The median is less sensitive to extreme values (outliers) than the mean, so it is often the better measure of central tendency when outliers are present.
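A quick illustration with the standard-library `statistics` module (the numbers are made up): a single extreme value drags the mean sharply while the median barely moves.

```python
# One outlier shifts the mean a lot but the median very little.
from statistics import mean, median

values = [10, 12, 11, 13, 12]
with_outlier = values + [200]

print(mean(values), median(values))              # 11.6 12
print(mean(with_outlier), median(with_outlier))  # mean jumps to 43; median stays at 12
```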

You are analyzing a data set that includes the number of visitors to a website per day. How would you categorize this data type?

  • Continuous data
  • Discrete data
  • Nominal data
  • Ordinal data
The number of visitors to a website per day is discrete data: it can only take countable, whole-number values (you cannot have 3.5 visitors), unlike continuous data, which can take any value within a range.

Which type of graph is frequently used to represent an estimate of a variable's probability density function?

  • Bar chart
  • Kernel Density plot
  • Pie chart
  • Scatter plot
A kernel density plot is frequently used to represent an estimate of a variable's probability density function. It places a smoothing kernel at each observation and sums them into a smooth curve; the total area under the curve equals 1.
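As a sketch of the underlying computation (assuming SciPy is installed), `scipy.stats.gaussian_kde` builds the density estimate, and a numerical integral over a grid confirms that the area under the curve is approximately 1:

```python
# Kernel density estimate of a sample, with a numerical check that the
# estimated density integrates to (approximately) 1.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=500)

kde = gaussian_kde(sample)          # bandwidth chosen by Scott's rule
grid = np.linspace(-6, 6, 1000)
density = kde(grid)

# Riemann-sum approximation of the area under the estimated curve
area = float(np.sum(density) * (grid[1] - grid[0]))
```

Plotting `density` against `grid` gives the familiar smooth KDE curve that libraries like seaborn draw for you.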

Why is it important to check the normality of residuals in regression analysis?

  • To ensure the accuracy of the model's predictive ability
  • To ensure the model is not overfitting
  • To make sure the regression line is the best fit
  • To satisfy one of the key assumptions of linear regression
It is important to check the normality of residuals because it is one of the key assumptions of linear regression. When the residuals are approximately normally distributed, the hypothesis tests and confidence intervals for the model's coefficients are valid.
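One common way to run this check is the Shapiro-Wilk test on the residuals of a fitted model. The sketch below (assuming SciPy is installed; the data are synthetic) fits a line with NumPy and tests the residuals:

```python
# Fit a simple linear regression and test the residuals for normality
# with the Shapiro-Wilk test.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=100)  # normal noise by construction

slope, intercept = np.polyfit(x, y, deg=1)  # coefficients, highest degree first
residuals = y - (slope * x + intercept)

stat, p_value = shapiro(residuals)
# A large p-value (e.g. > 0.05) gives no evidence against normality.
```

A Q-Q plot of the residuals is a useful visual companion to this numeric test.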

How does incorrect imputation of missing data influence the accuracy of a predictive model?

  • Decreases accuracy.
  • Depends on the specific model.
  • Increases accuracy.
  • No effect on accuracy.
Incorrect imputation of missing data can lead to the model learning incorrect patterns, which in turn can significantly decrease the accuracy of predictions.
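A small sketch of how a careless imputation choice distorts the data a model will learn from (the values are made up for illustration): mean imputation preserves the observed average, while imputing a constant far outside the true range drags it badly.

```python
# Mean imputation preserves the observed mean; imputing 0 distorts it.
import numpy as np

ages = np.array([23.0, 25.0, np.nan, 27.0, np.nan, 24.0])

# Reasonable: fill missing entries with the mean of the observed values.
mean_imputed = np.where(np.isnan(ages), np.nanmean(ages), ages)

# Careless: fill with 0, injecting values far outside the plausible range.
zero_imputed = np.where(np.isnan(ages), 0.0, ages)

print(mean_imputed.mean())  # 24.75, same as the observed mean
print(zero_imputed.mean())  # 16.5, pulled sharply toward 0
```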

If a data point's Z-score is 0, it indicates that the data point is _______.

  • above the mean
  • an outlier
  • below the mean
  • on the mean
A Z-score of 0 indicates that the data point lies exactly on the mean: since Z = (x − μ) / σ, the score is zero only when x equals μ.
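A tiny worked example with the standard-library `statistics` module (the data are illustrative):

```python
# Z = (x - mean) / std; a point equal to the mean scores exactly 0.
import statistics

data = [4, 8, 6, 5, 7]
mu = statistics.mean(data)       # 6
sigma = statistics.stdev(data)   # sample standard deviation

z = (6 - mu) / sigma             # a data point exactly on the mean
print(z)                         # 0.0
```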

What is the potential disadvantage of using listwise deletion for handling missing data?

  • It causes overfitting
  • It discards valuable data
  • It introduces random noise
  • It leads to multicollinearity
The main disadvantage of listwise deletion is that it discards valuable data: every observed value in a row is thrown away because of a single missing one. If the values are not missing completely at random, dropping those rows can also bias the results by systematically excluding certain types of observations.
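In pandas, listwise deletion is what `DataFrame.dropna()` does by default. The sketch below (with made-up data, assuming pandas is installed) shows how a single missing value per row can wipe out most of a data set:

```python
# Listwise deletion drops every row containing any NaN, discarding the
# observed values in those rows as well.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, np.nan, 41],
    "income": [50_000, np.nan, 62_000, 70_000],
    "city": ["Oslo", "Bergen", "Oslo", np.nan],
})

complete_cases = df.dropna()  # listwise deletion
print(len(df), len(complete_cases))  # 4 rows before, 1 after
```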

What type of data visualization method is typically color-coded to represent different values?

  • Heatmap
  • Histogram
  • Line plot
  • Scatter plot
Heatmaps are typically color-coded to represent different values. In a heatmap, data values are represented as colors, making it an excellent tool for visualizing large amounts of data and the correlation between different variables.
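A typical sketch is a color-coded correlation matrix with seaborn (assuming seaborn and matplotlib are installed; the `Agg` backend avoids needing a display, and the data are random for illustration):

```python
# Correlation-matrix heatmap: each cell's color encodes a correlation value.
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display required
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

corr = df.corr()  # 3x3 correlation matrix
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
```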

How can regularization techniques contribute to feature selection?

  • By adding a penalty term to the loss function
  • By avoiding overfitting
  • By reducing model complexity
  • By shrinking coefficients towards zero
Regularization techniques, in particular L1 (lasso) regularization, contribute to feature selection by shrinking the coefficients of less important features toward zero. Coefficients that reach exactly zero remove those features from the model, achieving feature selection.
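A sketch of this effect with scikit-learn's `Lasso` on synthetic data (one informative feature, one irrelevant one; the data and penalty strength are chosen for illustration):

```python
# L1 regularization (lasso) drives the coefficient of an irrelevant
# feature to (near-)zero, effectively deselecting it.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200
informative = rng.normal(size=n)
irrelevant = rng.normal(size=n)          # unrelated to the target
X = np.column_stack([informative, irrelevant])
y = 3.0 * informative + rng.normal(scale=0.1, size=n)

model = Lasso(alpha=0.5).fit(X, y)
print(model.coef_)  # first coefficient large, second (near-)zero
```

An L2 (ridge) penalty, by contrast, shrinks coefficients but almost never sets them exactly to zero, which is why lasso is the one associated with feature selection.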

You're in the 'explore' phase of the EDA process and you notice a potential error back in the 'wrangle' phase. How should you proceed?

  • Conclude the analysis with the current data.
  • Go back to the wrangling phase to correct the error.
  • Ignore the error and continue with the exploration.
  • Inform the stakeholders about the error.
If you notice a potential error in the 'wrangle' phase while you are in the 'explore' phase, you should go back to the 'wrangle' phase to correct the error. Ensuring the accuracy and quality of the data during the 'wrangle' phase is crucial for the validity of the insights drawn in subsequent phases.

Can multiple imputation be applied when data are missing completely at random (MCAR)?

  • No
  • Only if data is numerical
  • Only in rare cases
  • Yes
Yes, multiple imputation can be applied when data are missing completely at random (MCAR). It is a flexible method that applies under MCAR and MAR (missing at random), and, with additional modeling assumptions, even under MNAR (missing not at random).
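A sketch of the idea using scikit-learn's `IterativeImputer` (an implementation inspired by MICE, available behind an experimental import flag): running the imputer several times with `sample_posterior=True` and different random seeds yields multiple completed data sets, whose estimates would then be pooled. The tiny array here is illustrative only.

```python
# Multiple completed data sets via repeated stochastic imputation.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [np.nan, 8.0],
])

completed = []
for seed in range(3):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed.append(imputer.fit_transform(X))

# Each completed data set has no missing values; in full multiple
# imputation, estimates from each would be pooled (e.g. Rubin's rules).
```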