For data with outliers, the _____ is typically a better measure of central tendency as it is less sensitive to extreme values.
- Mean
- Median
- Mode
- Variance
The "Median" is less sensitive to extreme values, or outliers, in a dataset. Therefore, it's often a better measure of central tendency when outliers are present.
If you are working with a large data set and need to produce interactive visualizations for a web application, which Python library would be the most suitable?
- Bokeh
- Matplotlib
- Plotly
- Seaborn
Plotly is well-suited for creating interactive visualizations and can handle large data sets efficiently. It also supports rendering in web applications, making it ideal for this scenario.
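A minimal Plotly sketch, using the iris sample dataset that ships with Plotly Express, showing how an interactive figure can be exported as standalone HTML for embedding in a web page:

```python
import plotly.express as px

# Built-in sample dataset bundled with Plotly Express
df = px.data.iris()

fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.write_html("plot.html")  # self-contained, interactive HTML file
```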
What type of bias could be introduced by mean/median/mode imputation, particularly if the data is not missing at random?
- Confirmation bias
- Overfitting bias
- Selection bias
- Underfitting bias
Mean/median/mode imputation, particularly when data is not missing at random, can introduce selection bias. Replacing missing entries with a single central value ignores the mechanism behind the missingness, so the imputed sample no longer represents the underlying population: variability is underestimated and the true relationships between variables can be distorted.
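A small, hypothetical pandas example of how mean imputation understates the spread of a column whose high values happen to be missing:

```python
import pandas as pd
import numpy as np

# Hypothetical income column where the highest values are the ones missing
income = pd.Series([30, 35, 40, 45, np.nan, np.nan])

imputed = income.fillna(income.mean())  # fills with 37.5, the mean of observed values

print(income.std())   # ~6.45 on the observed values
print(imputed.std())  # 5.0  -- spread is artificially reduced after imputation
```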
How can regularization techniques contribute to feature selection?
- By adding a penalty term to the loss function
- By avoiding overfitting
- By reducing model complexity
- By shrinking coefficients towards zero
Regularization techniques contribute to feature selection by shrinking the coefficients of less important features towards zero. With an L1 penalty (as in Lasso), coefficients can reach exactly zero, which effectively removes those features from the model and thus performs feature selection.
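For example, a short scikit-learn sketch on synthetic data using Lasso, whose L1 penalty drives the coefficients of irrelevant features to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=200)  # only the first two features matter

model = Lasso(alpha=0.5).fit(X, y)
print(model.coef_)  # coefficients of the three irrelevant features are shrunk to (or near) zero
```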
What is the impact on training time if missing data is incorrectly handled in a large dataset?
- Decreases dramatically.
- Depends on the specific dataset.
- Increases dramatically.
- Remains largely the same.
If missing data is handled incorrectly, particularly in a large dataset, training time can increase dramatically. The model struggles to fit distorted or inconsistent values, so it needs more iterations (and therefore more time) to converge.
The _______ method of feature selection involves removing features one by one until the removal of further features decreases model accuracy.
- Backward elimination
- Forward selection
- Recursive feature elimination
- Stepwise selection
The backward elimination method of feature selection starts with a model trained on all features and iteratively removes the least important one, stopping when removing any further feature would reduce model accuracy.
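One way to sketch this in scikit-learn is with SequentialFeatureSelector in backward mode; the example below uses the built-in diabetes dataset and an arbitrary target of 5 retained features:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Start from all 10 features and drop them one at a time,
# keeping the subset whose removal hurts cross-validated performance the least
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="backward"
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the retained features
```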
High degrees of multicollinearity can inflate the _________ of the estimated regression coefficients.
- Bias
- Distribution
- Efficiency
- Variance
High degrees of multicollinearity can inflate the variance of the estimated regression coefficients. This means that the coefficients become highly sensitive to minor changes in the model, which can make them unreliable and difficult to interpret.
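This inflation is commonly quantified with the variance inflation factor (VIF); here is a small synthetic sketch using statsmodels:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)  # nearly collinear with x1
x3 = rng.normal(size=100)

X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(dict(zip(X.columns, vif)))  # x1 and x2 show very large VIFs; x3 stays near 1
```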
You are dealing with a dataset where outliers significantly affect the mean of the distribution but not the median. What approach would you suggest to handle these outliers?
- Binning
- Removal
- Transformation
In this case, a transformation such as a log or square root transformation might be suitable. These transformations pull in high values, thereby reducing their impact on the mean.
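A toy example (made-up values) showing how a log transformation damps the influence of a single extreme value:

```python
import numpy as np

values = np.array([12, 15, 14, 13, 900])  # 900 is an extreme outlier

print(values.mean())            # ~190.8 -- heavily distorted by the outlier
print(np.log1p(values).mean())  # on the log scale, the outlier's influence is far smaller
```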
The process of replacing each missing data point with a set of plausible values, creating multiple complete datasets, is known as ____________.
- Mean Imputation
- Mode Imputation
- Multiple Imputation
- Regression Imputation
This process is called multiple imputation. It generates several plausible completed datasets; the analysis is run on each one and the results are pooled to produce the final estimates.
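A rough sketch of the idea using scikit-learn's IterativeImputer: running it several times with sample_posterior=True and different random seeds yields multiple plausible completed datasets (a simplified stand-in for full multiple imputation):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

# Draw several plausible completed datasets by re-running the imputer
# with posterior sampling and different random seeds
completed = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]
print(len(completed))  # 5 complete datasets, each analysed separately and then pooled
```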
What is the relationship between the Z-score of a data point and its distance from the mean?
- The Z-score is independent of the distance from the mean
- The higher the Z-score, the closer the data point is to the mean
- The higher the Z-score, the further the data point is from the mean
- The lower the Z-score, the further the data point is from the mean
The higher the Z-score (in absolute value), the further the data point lies from the mean, measured in standard deviations. A Z-score of 0 indicates that the data point is identical to the mean.
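A quick sketch of the calculation on made-up values:

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Z-score: how many standard deviations each point lies from the mean
z = (values - values.mean()) / values.std()
print(z)  # points far from the mean get large |z|; a point equal to the mean gets z = 0
```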