A correlation coefficient of +1 between two variables indicates what kind of relationship?
- No relationship
- Perfect negative linear relationship
- Perfect positive linear relationship
- Weak relationship
A correlation coefficient of +1 between two variables indicates a perfect positive linear relationship. Every data point lies exactly on a straight line with a positive slope, so whenever one variable increases, the other increases by a proportional amount.
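As a quick check of the explanation above, the following minimal sketch computes the Pearson coefficient by hand for made-up data in which one variable is an exact linear function of the other. The names `pearson_r`, `hours`, and `score` are illustrative assumptions, not part of the original question.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]                 # hypothetical data
score = [2 * h + 10 for h in hours]     # exact line with positive slope

r = pearson_r(hours, score)
print(r)  # 1.0
```

Flipping the sign of the slope in this sketch would give a coefficient of -1 instead, the perfect negative linear relationship listed among the options.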
You've 'explored' the data and drawn some conclusions, but upon 'communicating' your findings, stakeholders have additional questions. What would be the next step in the EDA process?
- Direct the stakeholders to the raw data
- Ignore the questions and conclude the analysis
- Revisit the questioning phase with these new questions
- Wrap up the communication phase quickly
In this situation, the next step should be to revisit the 'questioning' phase with these new questions from stakeholders. Additional questions from stakeholders might reflect aspects of the data that have not been covered or require further investigation. Revisiting the questioning phase will allow these aspects to be incorporated into the analysis.
What kind of distribution is indicated by a skewness of zero?
- A bimodal distribution.
- A negatively skewed distribution.
- A normal distribution.
- A positively skewed distribution.
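A skewness of zero is consistent with a normal distribution. In a perfect normal distribution both tails are mirror images, so they balance each other out and the skewness is zero. Strictly speaking, zero skewness indicates symmetry rather than normality: any symmetric distribution (uniform, or a symmetric bimodal one, for example) also has a skewness of zero, but the normal distribution is the best answer among the options given.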
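Skewness measures asymmetry, which a small sketch can make concrete. The example below uses the population moment formula (third central moment divided by the second central moment raised to the power 1.5) on two made-up samples; the function name `skewness` and both samples are assumptions for illustration only.

```python
def skewness(values):
    """Population skewness: m3 / m2**1.5, using central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

symmetric = [1, 1, 2, 8, 9, 9]       # mirror image around 5, yet not bell-shaped
right_tailed = [1, 1, 2, 2, 3, 20]   # one long tail toward large values

print(skewness(symmetric))      # 0.0
print(skewness(right_tailed))   # positive
```

The first sample has zero skewness even though it is clearly not bell-shaped, so a skewness of zero is evidence of symmetry and only one of several properties a normal distribution must satisfy.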
If missing data is not properly addressed, the model's ________ can be significantly affected.
- F1 score
- accuracy
- precision
- recall
If missing data is not handled correctly, it can lead to biases in the data, which can adversely affect the model's accuracy.
In a clinical trial, the average recovery time for a new drug is drastically increased due to a patient who took an unusually long time to recover. How would this patient's data point be classified in this context?
- A normal data point
- An error
- An outlier
This patient's data point would be classified as an outlier as it deviates significantly from the other data points.
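One common way to flag such a point is the 1.5 × IQR rule, sketched below with invented recovery times in days. The rule and the data are assumptions chosen for illustration; the question itself does not prescribe a detection method.

```python
from statistics import mean, quantiles

# Hypothetical recovery times in days; the last patient took unusually long.
recovery_days = [5, 6, 6, 7, 7, 8, 8, 9, 45]

# Quartiles and the conventional 1.5 * IQR fences.
q1, _, q3 = quantiles(recovery_days, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [d for d in recovery_days if d < lower or d > upper]
typical = [d for d in recovery_days if lower <= d <= upper]

print(outliers)                 # [45]
print(mean(recovery_days))      # average pulled up by the outlier
print(mean(typical))            # average of the remaining patients
```

Flagging a point this way only marks it for review. Whether it is a genuine slow recovery or a recording error is a separate, domain-level judgement.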
What are the effects of outliers on the results of a hypothesis testing procedure?
- All of these
- Can affect the statistical significance
- Can lead to type I errors
- Can lead to type II errors
Outliers can affect the results of a hypothesis testing procedure in several ways. They can lead to Type I or Type II errors, and can also affect the statistical significance of the test, thereby potentially leading to incorrect conclusions.
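To illustrate one of these effects, the sketch below computes a one-sample t-statistic by hand, with and without a single extreme value. The measurements, the reference value `mu0`, and the helper `t_statistic` are all made up for this example.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(sample, mu0):
    """One-sample t-statistic: (mean - mu0) / (s / sqrt(n))."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n))

mu0 = 5.0                                   # hypothesised population mean
clean = [5.1, 5.3, 5.2, 5.4, 5.0, 5.3]      # hypothetical measurements
with_outlier = clean + [9.0]                # same data plus one extreme value

t_clean = t_statistic(clean, mu0)
t_outlier = t_statistic(with_outlier, mu0)

print(round(t_clean, 2), round(t_outlier, 2))
```

In this sketch the outlier moves the sample mean further from `mu0`, yet it inflates the standard deviation so much that the t-statistic shrinks. A real effect could be missed as a result (a Type II error). With other data an outlier can push the statistic the other way and produce a false positive (a Type I error).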
What role does bin size play in outlier detection when using a histogram?
- Bin size can influence outlier detection
- Bin size does not influence outlier detection
- Larger bin size always increases outlier visibility
- Smaller bin size always increases outlier visibility
The bin size in a histogram can influence the visibility of outliers. Depending on how the data is binned, an outlier may or may not be clearly visible.
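The effect of bin width can be seen by counting the same made-up values with two different widths. The helper `histogram` below is a hypothetical fixed-width binning function written for this sketch, not a library call.

```python
def histogram(values, bin_width, start=0):
    """Count values per bin; bin i covers [start + i*w, start + (i+1)*w)."""
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // bin_width)] += 1
    return counts

data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 28]   # 28 is the suspicious value

wide = histogram(data, bin_width=30)
narrow = histogram(data, bin_width=5)

print(wide)    # [10]
print(narrow)  # [6, 3, 0, 0, 0, 1]
```

With a width of 30 every value lands in one bar and the extreme value is invisible. With a width of 5 it sits in its own bar, separated from the rest by empty bins.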
The mode is the only measure of central tendency that can be used for _____ data.
- Categorical
- Interval
- Numerical
- Ordinal
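The "Mode" is the only measure of central tendency that can be used for "Categorical" data, specifically unordered (nominal) categories. Because such categories cannot be added or ranked, the mean and median are undefined, whereas the mode simply reports the most frequently occurring category. (For ordinal data, which does have an order, the median can also be used.)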
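Finding the mode of categorical data amounts to counting labels, as in the minimal sketch below. The list `colors` is invented for the example.

```python
from collections import Counter

# Hypothetical nominal data: labels with no numeric meaning or order.
colors = ["blue", "red", "blue", "green", "blue", "red"]

counts = Counter(colors)
mode_label, mode_count = counts.most_common(1)[0]

print(mode_label, mode_count)  # blue 3
```

No arithmetic is performed on the labels themselves, which is why the mode remains meaningful where an average would not.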
You are developing a linear regression model and notice that despite a high R-squared value, none of your independent variables are statistically significant. What might be the potential issue here?
- Data leakage
- High variance
- Multicollinearity
- Underfitting
This could be due to multicollinearity. When independent variables are strongly correlated with one another, the variances of the coefficient estimates are inflated, so the individual t-tests may all fail to reach significance. The predictors can still jointly explain much of the variation in the response, which is why the R-squared value remains high and the overall model can still be significant.
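A common diagnostic is the variance inflation factor (VIF). For the special case of exactly two predictors, the VIF of each equals 1 / (1 - r²), where r is their correlation. The sketch below applies that formula to made-up data; the variable names and the frequently quoted rule of thumb that a VIF above 10 is a concern are assumptions, not part of the original question.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

# Hypothetical predictors: x2 is almost exactly twice x1.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

r = pearson_r(x1, x2)
vif = 1 / (1 - r ** 2)   # valid only for the two-predictor case

print(round(r, 3), vif > 10)
```

With more than two predictors, each VIF would instead come from regressing that predictor on all the others, which this sketch does not attempt.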
You have a dataset with an odd number of observations. If you were to calculate both the mean and median, how would adding a very large value to the dataset affect these measures of central tendency?
- Both would increase
- Both would remain unchanged
- Only the mean would increase
- Only the median would increase
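Adding a very large value to the dataset would substantially increase the "Mean" because it takes into account all values in the data set. The "Median" is far more robust: with an odd number of observations, the new value makes the count even, so the median becomes the average of the old middle value and the next larger one. It therefore shifts only slightly, if at all, and not in proportion to the size of the added value.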
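The difference in sensitivity is easy to see on a small made-up dataset, using Python's standard `statistics` module. The values are invented for illustration.

```python
from statistics import mean, median

data = [3, 5, 7, 9, 11]          # hypothetical dataset, odd number of values
extended = data + [1000]         # same data plus one very large value

print(mean(data), median(data))          # 7 7
print(mean(extended), median(extended))  # 172.5 8.0
```

The mean jumps from 7 to 172.5, while the median only moves to the midpoint of the two middle values, 7 and 9.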