The mode is the only measure of central tendency that can be used for _____ data.
- Categorical
- Interval
- Numerical
- Ordinal
The mode is the only measure of central tendency that can be used for categorical (nominal) data. Categories have no numeric value or inherent order, so a mean or median cannot be computed, but the most frequently occurring category can always be counted.
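As a minimal sketch of counting the most frequent category, the example below uses `collections.Counter` from the Python standard library. The `blood_types` list is hypothetical data made up for illustration.

```python
from collections import Counter

# Hypothetical nominal data: labels have no order and support no arithmetic.
blood_types = ["O", "A", "O", "B", "A", "O", "AB"]

counts = Counter(blood_types)

# most_common(1) returns a one-element list holding the (category, count)
# pair with the highest count.
mode_value, mode_count = counts.most_common(1)[0]

print(f"mode = {mode_value} (appears {mode_count} times)")
```

Taking a mean or median of these labels would be meaningless, while the count-based mode is well defined.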
You are developing a linear regression model and notice that despite a high R-squared value, none of your independent variables are statistically significant. What might be the potential issue here?
- Data leakage
- High variance
- Multicollinearity
- Underfitting
The likely issue is multicollinearity. When independent variables are highly correlated with one another, the variances (and therefore the standard errors) of the coefficient estimates are inflated, so the individual t-tests can all fail to reach significance. The predictors can still explain most of the variance jointly, which is why the R-squared value remains high.
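One common way to quantify this inflation is the variance inflation factor (VIF). The sketch below assumes a model with exactly two predictors, in which case the VIF of each predictor is 1 / (1 - r^2), where r is the Pearson correlation between them. The data and the helper `pearson_r` are hypothetical, written only for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical predictors: x2 is almost a copy of x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

r = pearson_r(x1, x2)

# With only two predictors, each one's VIF reduces to 1 / (1 - r^2).
vif = 1.0 / (1.0 - r ** 2)

print(f"r = {r:.4f}, VIF = {vif:.1f}")
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity; the exact cutoff is a convention, not a hard rule.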
You have a dataset with an odd number of observations. If you were to calculate both the mean and median, how would adding a very large value to the dataset affect these measures of central tendency?
- Both would increase
- Both would remain unchanged
- Only the mean would increase
- Only the median would increase
Adding a very large value increases the mean noticeably, because the mean takes every value in the data set into account. The median is resistant to extreme values: it depends only on the middle of the ordered data, so it either stays the same or moves at most to the midpoint between the old median and the next larger value.
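The effect can be checked with a small sketch using the standard `statistics` module. The numbers are hypothetical and chosen so the arithmetic is easy to follow.

```python
import statistics

# Hypothetical data set with an odd number of observations.
data = [3, 5, 7, 9, 11]
with_extreme = data + [1000]

mean_before = statistics.mean(data)            # 35 / 5
median_before = statistics.median(data)        # middle value
mean_after = statistics.mean(with_extreme)     # 1035 / 6
median_after = statistics.median(with_extreme) # average of 7 and 9

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```

Here the mean jumps from 7 to 172.5, while the median only moves from 7 to 8, because the count became even and the two middle values are now averaged.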
In _____ deletion, all data from a participant is discarded if any single value is missing.
- Listwise
- Pairwise
- Random
- Systematic
In 'listwise' deletion, all data from a participant is discarded if any single value is missing. It is the simplest way of dealing with missing data, but it can discard a large share of the sample, and it can bias the results if the data is not missing completely at random.
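A minimal sketch of listwise deletion in plain Python follows. The `rows` records and their field names are hypothetical, and `None` is assumed to mark a missing value.

```python
# Hypothetical participant records; None marks a missing value.
rows = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 48000},
    {"id": 3, "age": 29, "income": None},
    {"id": 4, "age": 41, "income": 61000},
]

# Listwise deletion: keep a record only if none of its values is missing.
complete_cases = [
    row for row in rows
    if all(value is not None for value in row.values())
]

dropped = len(rows) - len(complete_cases)
print(f"kept {len(complete_cases)} of {len(rows)} records, dropped {dropped}")
```

Half of the records are lost here even though each dropped record is missing only one field, which illustrates how quickly listwise deletion can shrink a sample.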
A correlation coefficient of +1 between two variables indicates what kind of relationship?
- No relationship
- Perfect negative linear relationship
- Perfect positive linear relationship
- Weak relationship
A correlation coefficient of +1 between two variables indicates a perfect positive linear relationship. All data points lie exactly on a straight line with positive slope, so whenever one variable increases, the other increases at a constant rate.
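To illustrate, the sketch below computes the Pearson correlation by hand for hypothetical points that lie exactly on the line y = 2x + 3, and for the same points with the sign of y flipped. The function name `pearson_r` is an illustrative helper, not a library call.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

xs = [1, 2, 3, 4, 5]
ys_up = [2 * x + 3 for x in xs]   # exact line with positive slope
ys_down = [-y for y in ys_up]     # same line mirrored, negative slope

r_up = pearson_r(xs, ys_up)
r_down = pearson_r(xs, ys_down)

# Floating-point rounding may leave the results a hair away from +1 and -1.
print(r_up, r_down)
```

The slope itself (2 here) does not affect the coefficient; only the sign of the slope and how tightly the points follow the line matter.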
You've 'explored' the data and drawn some conclusions, but upon 'communicating' your findings, stakeholders have additional questions. What would be the next step in the EDA process?
- Direct the stakeholders to the raw data
- Ignore the questions and conclude the analysis
- Revisit the questioning phase with these new questions
- Wrap up the communication phase quickly
In this situation, the next step should be to revisit the 'questioning' phase with these new questions from stakeholders. Additional questions from stakeholders might reflect aspects of the data that have not been covered or require further investigation. Revisiting the questioning phase will allow these aspects to be incorporated into the analysis.
What kind of distribution is indicated by a skewness of zero?
- A bimodal distribution.
- A negatively skewed distribution.
- A normal distribution.
- A positively skewed distribution.
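Of the options given, a skewness of zero points to a normal distribution. In a perfect normal distribution the two tails mirror each other, so they balance out and the skewness is zero. Strictly speaking, zero skewness indicates symmetry in general: every normal distribution has zero skewness, but other symmetric distributions do too, so zero skewness alone does not prove normality.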
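The sketch below computes the moment-based (Fisher-Pearson) skewness coefficient, defined as the third central moment divided by the second central moment raised to the power 1.5. The function name and the three small data sets are hypothetical and chosen so the result is easy to verify by hand.

```python
def skewness(values):
    """Fisher-Pearson skewness: m3 / m2**1.5 using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

symmetric_flat = [1, 2, 3, 4, 5]           # symmetric, evenly spread
symmetric_bimodal = [1, 1, 1, 5, 5, 5]     # symmetric, two peaks
right_skewed = [1, 2, 3, 4, 20]            # long right tail

skew_flat = skewness(symmetric_flat)
skew_bimodal = skewness(symmetric_bimodal)
skew_right = skewness(right_skewed)

print(skew_flat, skew_bimodal, skew_right)
```

Both symmetric data sets have zero skewness although neither is bell-shaped, while the long right tail produces a positive value.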
If missing data is not properly addressed, the model's ________ can be significantly affected.
- F1 score
- accuracy
- precision
- recall
If missing data is not handled correctly, it can bias the data the model learns from, which in turn can significantly reduce the model's accuracy.
In a clinical trial, the average recovery time for a new drug is drastically increased due to a patient who took an unusually long time to recover. How would this patient's data point be classified in this context?
- A normal data point
- An error
- An outlier
This patient's data point would be classified as an outlier as it deviates significantly from the other data points.
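One widely used heuristic for flagging such points is the 1.5 x IQR rule. The sketch below applies it with `statistics.quantiles` from the standard library (Python 3.8+). The recovery times are hypothetical, and the 1.5 multiplier is a convention, not a property of the data.

```python
import statistics

# Hypothetical recovery times in days; one patient took far longer.
recovery_days = [5, 6, 6, 7, 7, 8, 8, 9, 45]

# quantiles(..., n=4) returns the three quartile cut points.
q1, q2, q3 = statistics.quantiles(recovery_days, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as a potential outlier.
outliers = [d for d in recovery_days if d < lower_fence or d > upper_fence]
typical = [d for d in recovery_days if lower_fence <= d <= upper_fence]

print("outliers:", outliers)
print("mean with outlier:   ", statistics.mean(recovery_days))
print("mean without outlier:", statistics.mean(typical))
```

Flagging a point this way does not by itself justify removing it; in a clinical setting the long recovery may be a genuine and important observation.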
A researcher measures the heights of a large group of individuals and finds that the data is symmetrically distributed with most of the values clustered around the mean. Which distribution does the data most likely follow?
- Binomial Distribution
- Normal Distribution
- Poisson Distribution
- Uniform Distribution
The data is symmetric and most values cluster around the mean, which are the defining features of the bell-shaped Normal Distribution, so that is the distribution the data most likely follows.
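As a rough illustration of what "clustered around the mean" implies for a normal distribution, the sketch below uses `statistics.NormalDist` (Python 3.8+) to compute the share of values within one and two standard deviations of the mean. The mean of 170 cm and standard deviation of 10 cm are assumed values for illustration, not figures from the question.

```python
from statistics import NormalDist

# Assumed parameters for illustration only: mean 170 cm, std dev 10 cm.
heights = NormalDist(mu=170, sigma=10)

# Probability mass between two points is the difference of the CDF values.
within_one_sd = heights.cdf(180) - heights.cdf(160)
within_two_sd = heights.cdf(190) - heights.cdf(150)

print(f"within 1 sd: {within_one_sd:.4f}")
print(f"within 2 sd: {within_two_sd:.4f}")
```

These proportions (about 68% and 95%) are the same for any normal distribution, regardless of the chosen mean and standard deviation.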