When would a scatter plot be less effective in identifying outliers?
- When the data has no correlation
- When the data is normally distributed
- When the data points are closely grouped
- When there are many data points
A scatter plot is less effective at identifying outliers when the data points are closely grouped, because overlapping points make it hard to see which ones lie apart from the rest.
The _____ Distribution is used for modeling the number of times an event occurs in an interval of time or space.
- Binomial
- Normal
- Poisson
- Uniform
The Poisson Distribution is used for modeling the number of times an event occurs in an interval of time or space.
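As a minimal sketch of what this means in practice, the snippet below evaluates the Poisson probability mass function, P(X = k) = e^(-λ) λ^k / k!, using only the standard library. The function name `poisson_pmf` and the rate of 3 events per interval are illustrative assumptions, not part of the question.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events in an interval whose mean count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Hypothetical scenario: a help desk that averages 3 calls per hour.
lam = 3.0
for k in range(6):
    print(f"P(X = {k}) = {poisson_pmf(k, lam):.4f}")
```

Summed over all possible counts, the probabilities add up to 1, which is a quick sanity check for any implementation.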
How does 'binning' help in dealing with outliers in a dataset?
- By dividing the data into intervals and replacing outlier values
- By eliminating irrelevant variables
- By identifying and removing outliers
- By normalizing the data
Binning helps in dealing with outliers by dividing the data into intervals or 'bins' and replacing outlier values with summary statistics like the bin mean or median.
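One way to picture this is smoothing by bin medians. The sketch below assumes equal-frequency bins of a fixed size; the helper name `smooth_by_bin_median` and the sample values are made up for illustration, and other binning schemes (equal-width bins, bin means) are equally valid.

```python
import statistics


def smooth_by_bin_median(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace
    every value with the median of its bin. Returns values in sorted order."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        smoothed.extend([statistics.median(chunk)] * len(chunk))
    return smoothed


# Illustrative data: 300 is an obvious outlier.
data = [4, 8, 9, 15, 21, 21, 24, 25, 300]
print(smooth_by_bin_median(data, bin_size=3))
# [8, 8, 8, 21, 21, 21, 25, 25, 25]
```

The outlier 300 falls into the last bin and is replaced by that bin's median, 25, so it no longer dominates later calculations.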
Suppose you have a data set with many missing values and outliers. In which step of the EDA process would you primarily deal with these issues?
- In the communicating phase
- In the exploring phase
- In the questioning phase
- In the wrangling phase
During the 'wrangling' phase of the EDA process, data analysts carry out data cleaning tasks, which include handling missing values and dealing with outliers. Data wrangling involves transforming and cleaning data to enable further exploration and analysis.
How can one interpret the colors in a heatmap?
- Colors have no significance in a heatmap
- Colors represent different categories of data
- Colors represent the magnitude of the data
- Darker colors always mean higher values
In a heatmap, colors represent the magnitude of the data. Usually, a color scale is provided for reference, where darker colors often correspond to higher values and lighter colors to lower values. However, the color scheme can vary.
You are analyzing a data set that includes the number of visitors to a website per day. How would you categorize this data type?
- Continuous data
- Discrete data
- Nominal data
- Ordinal data
The number of visitors to a website per day is discrete data: it is a count that takes only whole-number values (0, 1, 2, and so on), never fractions.
For data with outliers, the _____ is typically a better measure of central tendency as it is less sensitive to extreme values.
- Mean
- Median
- Mode
- Variance
The median is less sensitive to extreme values, or outliers, than the mean, because it depends only on the middle of the sorted data rather than on every value. It is therefore often a better measure of central tendency when outliers are present.
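A small comparison makes the difference concrete. The numbers below are invented for illustration (salaries in thousands, with one extreme value added); only the standard-library `statistics` module is used.

```python
import statistics

# Illustrative values only, in thousands.
salaries = [42, 45, 47, 50, 52]
with_outlier = salaries + [900]

print("mean  :", statistics.mean(salaries), "->", statistics.mean(with_outlier))
print("median:", statistics.median(salaries), "->", statistics.median(with_outlier))
```

Adding the single value 900 pulls the mean from 47.2 to roughly 189, while the median only moves from 47 to 48.5.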
If you are working with a large data set and need to produce interactive visualizations for a web application, which Python library would be the most suitable?
- Bokeh
- Matplotlib
- Plotly
- Seaborn
Plotly is well suited to creating interactive visualizations, copes with large data sets, and renders in the browser, which makes it a good fit for embedding in a web application.
What type of bias could be introduced by mean/median/mode imputation, particularly if the data is not missing at random?
- Confirmation bias
- Overfitting bias
- Selection bias
- Underfitting bias
Mean/median/mode imputation, particularly when the data is not missing at random, can introduce selection bias. The substituted values ignore the reasons behind the missingness, so the imputed data set may understate variability and distort the true relationships between variables.
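The effect on variability can be seen with a tiny example. This is a sketch with made-up numbers, where `None` marks a missing value. It shows only the shrinkage in spread caused by mean imputation, not the missingness mechanism itself.

```python
import statistics

# Made-up measurements; None marks a missing value.
observed = [10.0, 12.0, None, 15.0, None, 20.0, 23.0]

present = [x for x in observed if x is not None]
fill = statistics.mean(present)
imputed = [fill if x is None else x for x in observed]

spread_before = statistics.pstdev(present)  # spread of the values actually seen
spread_after = statistics.pstdev(imputed)   # spread after mean imputation

print(f"fill value: {fill}")
print(f"std dev before: {spread_before:.3f}, after: {spread_after:.3f}")
```

Every imputed value sits exactly at the mean, so the standard deviation after imputation is smaller than before, even though no new information was added.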
How can regularization techniques contribute to feature selection?
- By adding a penalty term to the loss function
- By avoiding overfitting
- By reducing model complexity
- By shrinking coefficients towards zero
Regularization techniques contribute to feature selection by shrinking the coefficients of less important features towards zero. With an L1 penalty (lasso), some coefficients become exactly zero, which removes those features from the model and thus achieves feature selection.
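The shrinking step can be sketched with the soft-thresholding operator. Under the simplifying assumption of orthonormal features, and up to how the penalty is scaled, the lasso solution amounts to soft-thresholding the ordinary least-squares coefficients. The coefficient values, feature names, and penalty below are hypothetical.

```python
def soft_threshold(coefficient: float, penalty: float) -> float:
    """Shrink a coefficient towards zero by penalty; set it to zero
    if its magnitude does not exceed the penalty."""
    if coefficient > penalty:
        return coefficient - penalty
    if coefficient < -penalty:
        return coefficient + penalty
    return 0.0


# Hypothetical least-squares coefficients for four features.
ols_coefficients = {"age": 2.5, "income": -1.8, "noise_a": 0.3, "noise_b": -0.1}
penalty = 0.5  # assumed regularization strength

lasso_coefficients = {
    name: soft_threshold(value, penalty) for name, value in ols_coefficients.items()
}
selected = [name for name, value in lasso_coefficients.items() if value != 0.0]

print(lasso_coefficients)
print("selected features:", selected)
```

The two small coefficients are set to zero and drop out, while the larger ones are kept at reduced magnitude. A larger penalty would remove more features.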