Which of the following could be a possible reason for the presence of outliers in a dataset?
- All of these
- Data entry errors
- Data processing errors
- Measurement errors
Outliers can arise from data entry errors, measurement errors, and data processing errors, so all of the listed options apply. Examples include typing an extra digit while recording a value or taking readings from a poorly calibrated measurement device.
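As an illustration of how a data entry error can surface as an outlier, here is a minimal sketch using the common 1.5 x IQR rule. The `iqr_outliers` helper and the `ages` values are hypothetical and made up for this example; other outlier rules would work as well.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical ages; 340 simulates "34" typed with an extra digit.
ages = [29, 31, 34, 35, 36, 38, 41, 340]
flagged = iqr_outliers(ages)
print(flagged)
```

A flagged value is only a candidate for review: whether it is an error or a genuine extreme still has to be checked against the data source.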
In which stage of the data analysis process is Confirmatory Data Analysis (CDA) typically used?
- After EDA
- After Predictive Modeling
- Before EDA
- Before data collection
CDA typically comes after the EDA stage in the data analysis process. EDA lets analysts explore the data and generate hypotheses, while CDA applies statistical tests to confirm or refute those hypotheses.
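To make the confirmatory step concrete, the sketch below runs an exact two-sided permutation test on the difference in group means. This is one possible test among many, chosen here only because it needs no external libraries; the `control` and `treated` values are hypothetical, standing in for a hypothesis that EDA might have suggested.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    extreme = total = 0
    # Enumerate every way of relabelling the pooled values into two groups.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        extreme += diff >= observed - 1e-12  # tolerance for float rounding
        total += 1
    return extreme / total

# Hypothetical measurements for two groups.
control = [12, 14, 11, 13, 15]
treated = [18, 20, 17, 19, 21]
p_value = permutation_p_value(control, treated)
print(f"p = {p_value:.4f}")
```

A small p-value here would count as evidence against the "no difference" hypothesis; exhaustive enumeration is only practical for small samples like this one.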
In a data set where values are uniformly distributed across the range, how would the mean, median and mode compare?
- Mean would be the highest
- Median would be the highest
- Mode would be the highest
- They would all be the same
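In a perfectly uniform distribution, all values occur with the same frequency, so the distribution is symmetric and the mean and median coincide at its center. Strictly speaking, no single value is more frequent than the others, so the mode is not unique; the quiz treats it as matching the mean and median, which makes "They would all be the same" the intended answer.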
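A quick check on a small, evenly spread sample can make this tangible. The `values` list below is hypothetical (each integer from 1 to 9 appears exactly twice). Note that `statistics.multimode` returns every tied value, which shows that the mode of such data is not a single number.

```python
import statistics

# Hypothetical data: each value from 1 to 9 appears exactly twice.
values = [v for v in range(1, 10) for _ in range(2)]

mean = statistics.mean(values)
median = statistics.median(values)
modes = statistics.multimode(values)  # all values tied for most frequent

print("mean:", mean)
print("median:", median)
print("modes:", modes)
```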
In your EDA process, you notice that one particular feature has negligible variance. How would you interpret this in the context of your analysis and the overall dataset?
- This feature is the least important one
- This feature is the most important one
- This feature should be converted into a binary feature
- This feature should be used to create new features
A feature with negligible variance is nearly constant across observations, so it carries little information that could help distinguish one outcome from another and is likely to have little influence on the outcome variable. Among the options given, that makes it the least important feature. Depending on the context and the objectives of your analysis, you might consider dropping it.
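One way to spot such features during EDA is to compare each column's variance against a small cutoff. The sketch below is only an illustration: the `near_constant_features` helper, the column names, and the `threshold` value are all assumptions, and because variance depends on the unit of measurement, a sensible cutoff has to be chosen per dataset.

```python
import statistics

def near_constant_features(columns, threshold=1e-3):
    """Return names of columns whose population variance is below threshold."""
    return [
        name
        for name, vals in columns.items()
        if statistics.pvariance(vals) < threshold
    ]

# Hypothetical feature columns.
columns = {
    "height_cm": [160.0, 172.5, 181.0, 168.2, 175.9],
    "sensor_flag": [1.0, 1.0, 1.0, 1.0, 1.0],
    "unit_scale": [0.999, 1.001, 1.000, 1.000, 1.000],
}

low_variance = near_constant_features(columns)
print(low_variance)
```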
What type of graph is often used to understand the underlying frequency distribution of data?
- Bar chart
- Histogram
- Line graph
- Pie chart
A histogram is often used to understand the underlying frequency distribution of data. It groups values into ranges (bins), and the height of each bar shows how many values fall into that range.
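The binning idea can be sketched without any plotting library. The `histogram` helper and the `scores` data below are hypothetical; the text bars stand in for the bars a charting tool would draw.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count values per half-open bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical integer test scores.
scores = [52, 55, 61, 64, 67, 68, 70, 71, 73, 75, 78, 82, 85, 91]
bins = histogram(scores, bin_width=10)

# Text rendering: one '#' per value in the bin (labels assume integer data).
for start, count in bins.items():
    print(f"{start:>3}-{start + 9:<3} {'#' * count}")
```

Changing `bin_width` changes the apparent shape of the distribution, which is why it is worth trying more than one bin size.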
Why might an outlier not be visible in a box plot?
- If the box plot is not correctly drawn
- If the data is normally distributed
- If the dataset is very large
- If the outlier is close to the whisker
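An outlier might not be visible in a box plot if it lies close to the whisker. Box plots usually draw points as separate outlier markers only when they fall beyond the whisker limits (commonly 1.5 times the interquartile range from the quartiles), so an unusual value just inside that limit is absorbed into the whisker and treated as part of the typical range of the data.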
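The effect can be checked numerically, assuming the common convention that whiskers extend at most 1.5 x IQR beyond the quartiles (plotting tools may use other settings). The `whisker_fences` and `flagged` helpers and the `base` data are hypothetical.

```python
import statistics

def whisker_fences(values, k=1.5):
    """Return the (lower, upper) limits beyond which points are drawn as outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flagged(values):
    """Return the values a box plot would mark as separate outlier points."""
    lower, upper = whisker_fences(values)
    return [v for v in values if v < lower or v > upper]

# Hypothetical data; only the largest value differs between the two cases.
base = [10, 12, 13, 14, 15, 16, 17, 18, 20]

print(whisker_fences(base + [24]))
print(flagged(base + [24]))  # 24 sits just inside the upper fence
print(flagged(base + [27]))  # 27 lies beyond the upper fence
```

In the first case the unusual value simply lengthens the whisker, so nothing in the plot marks it as an outlier.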
What role does the audience play in choosing the right graph for data visualization?
- The audience should be consulted during the design process
- The audience's familiarity with different types of graphs is irrelevant
- The audience's preferences should always dictate the type of graph
- Understanding the audience's knowledge and familiarity with different types of graphs can help choose the most effective one
Understanding the audience's knowledge and familiarity with different types of graphs can significantly influence the choice of graph. For a general audience, simpler graphs like bar charts or line graphs may be more suitable, whereas a more technical audience might be comfortable with more complex visualizations like heatmaps or network diagrams.
What are some of the adverse impacts of Multicollinearity on the coefficients of a linear regression model?
- All of the above.
- It inflates the standard errors of the coefficients.
- It makes the model unstable.
- It weakens the statistical power of the model.
Multicollinearity affects the coefficients of a linear regression model in several ways: it makes them unstable (small changes in the data cause large swings in the estimates), it inflates their standard errors (making real effects less likely to appear statistically significant), and it weakens the statistical power of the model (decreasing the chances of detecting valid effects).
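The inflation of standard errors is often quantified with the variance inflation factor (VIF). The sketch below covers only the special case of two predictors, where the VIF reduces to 1 / (1 - r^2) with r the Pearson correlation between them; the helper names and the data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical predictors.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2_related = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 7.9]  # nearly a copy of x1
x2_unrelated = [5, 1, 8, 2, 7, 3, 6, 4]                # weakly related to x1

vif_high = vif_two_predictors(x1, x2_related)
vif_low = vif_two_predictors(x1, x2_unrelated)
print(f"VIF with near-duplicate predictor: {vif_high:.1f}")
print(f"VIF with weakly related predictor: {vif_low:.2f}")
```

The coefficient's variance is inflated by the VIF, so its standard error grows by the square root of that factor. With more than two predictors, each VIF comes from regressing one predictor on all the others.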
_______ methods for feature selection assess the relevance of a subset of features by considering their ability to predict the outcome with a particular learning algorithm.
- Embedded
- Filter
- PCA
- Wrapper
Wrapper methods for feature selection assess the relevance of a subset of features by considering their ability to predict the outcome with a particular learning algorithm. These methods involve training a model multiple times and selecting the subset of features that maximizes the model's performance.
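A minimal sketch of the wrapper idea follows, under several assumptions: the learning algorithm is a simple nearest-centroid classifier, performance is leave-one-out accuracy, and every feature subset is tried exhaustively (practical only for a handful of features). All function names and the toy data are hypothetical.

```python
from itertools import combinations

def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose class centroid is closest to x."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, t in zip(train_X, train_y) if t == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loo_accuracy(X, y, features):
    """Leave-one-out accuracy using only the given feature indices."""
    proj = [[row[j] for j in features] for row in X]
    hits = 0
    for i in range(len(proj)):
        train_X = proj[:i] + proj[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += nearest_centroid_predict(train_X, train_y, proj[i]) == y[i]
    return hits / len(proj)

def wrapper_select(X, y):
    """Retrain the model on every feature subset and keep the best scorer."""
    best_subset, best_score = None, -1.0
    for size in range(1, len(X[0]) + 1):
        for subset in combinations(range(len(X[0])), size):
            score = loo_accuracy(X, y, subset)
            if score > best_score:  # ties keep the earlier, smaller subset
                best_subset, best_score = subset, score
    return best_subset, best_score

# Hypothetical data: feature 0 separates the classes, features 1 and 2 are noise.
X = [
    [1.0, 50, 3], [1.2, 10, 7], [0.8, 90, 5], [1.1, 30, 1],
    [3.0, 20, 6], [3.2, 80, 2], [2.9, 60, 8], [3.1, 40, 4],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

best_subset, best_score = wrapper_select(X, y)
print(best_subset, best_score)
```

The repeated retraining inside `wrapper_select` is what distinguishes a wrapper method from a filter method, and it is also why wrapper methods become expensive as the number of features grows.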
You're conducting a study and have encountered missing data. You opt for the model-based method for imputation. Under what circumstances might this approach introduce bias?
- If the chosen model fits poorly to the data
- If the chosen model is a complex one
- If the chosen model is a perfect fit for the data
- If the missing data is missing completely at random
Bias might be introduced in model-based imputation if the chosen model fits the data poorly. If the model does not reflect the true data-generating process, the imputed values can be systematically biased, leading to incorrect conclusions.
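The following sketch shows how a poorly fitting imputation model can bias the filled-in values. It assumes a hypothetical dataset in which the true relationship is y = x**2, the values of y are missing whenever x > 5, and a straight line fitted to the observed rows is used for imputation; all names are made up for this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs)
    )
    return slope, my - slope * mx

# Hypothetical data: true relationship is quadratic, y missing for x > 5.
xs = list(range(1, 11))
true_ys = [x ** 2 for x in xs]
observed = [(x, y) for x, y in zip(xs, true_ys) if x <= 5]
missing_xs = [x for x in xs if x > 5]

# Fit a (misspecified) linear model to the observed rows only.
slope, intercept = fit_line(*zip(*observed))

# Impute the missing values and compare them with the known truth.
imputed = [slope * x + intercept for x in missing_xs]
errors = [imp - x ** 2 for imp, x in zip(imputed, missing_xs)]
mean_error = sum(errors) / len(errors)

print("imputed:", imputed)
print("errors:", errors)
print("mean error:", mean_error)
```

Every imputed value falls below the true one, so the error is systematic rather than random noise that would average out.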