The graphical representation of residuals versus predicted values is known as a ________ plot.
- Box
- Histogram
- Residual
- Scatter
A residual plot is a graph that shows the residuals on the vertical axis and the predicted (fitted) values, or alternatively an independent variable, on the horizontal axis. If the points are randomly scattered around the horizontal line at zero with no visible pattern, a linear regression model is a reasonable fit for the data. A curved pattern suggests a non-linear relationship, and a funnel shape suggests non-constant variance.
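As a minimal sketch of what a residual plot is built from, the snippet below fits a straight line by ordinary least squares and returns the (predicted, residual) pairs that would be plotted. The data and the helper names `fit_line` and `residual_points` are hypothetical, chosen only for illustration. The data follow a quadratic trend, so the residuals form a U-shaped pattern rather than random scatter.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residual_points(xs, ys):
    """Return (predicted, residual) pairs: the coordinates of a residual plot."""
    intercept, slope = fit_line(xs, ys)
    predicted = [intercept + slope * x for x in xs]
    return [(p, y - p) for p, y in zip(predicted, ys)]


# Hypothetical data with a curved trend (y = x squared).
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]
points = residual_points(xs, ys)
for predicted, residual in points:
    print(f"predicted={predicted:6.2f}  residual={residual:6.2f}")
# Residuals come out as 2, -1, -2, -1, 2: positive at the ends and negative
# in the middle, a pattern that points away from a linear model.
```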
What can the Mann-Whitney U test tell you about the shape of your distributions?
- It can confirm if your distributions are normal
- It can confirm if your distributions are skewed
- It can confirm if your distributions have equal variances
- It cannot tell you anything about the shape of your distributions
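The Mann-Whitney U test does not provide information about the shape of the distributions. It is a non-parametric, rank-based test that does not assume normality; it asks whether values from one group tend to be larger than values from the other. Interpreting a significant result as a difference in medians additionally requires that the two distributions have a similar shape.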
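One way to see why the test says nothing about shape is that the U statistic depends only on the ordering of the observations. The sketch below computes U by direct pair counting (the helper `mann_whitney_u` and the data are hypothetical, and no p-value is computed) and shows that a strictly increasing transformation, which changes the shape of both samples, leaves U unchanged.

```python
import math


def mann_whitney_u(a, b):
    """U statistic for sample a: count of pairs (x, y) with x > y; ties count 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Hypothetical samples.
a = [1.1, 2.3, 2.9, 4.0]
b = [3.5, 5.2, 6.8, 7.1]

u_a = mann_whitney_u(a, b)
u_b = mann_whitney_u(b, a)
print(u_a, u_b)  # the two statistics always sum to len(a) * len(b)

# exp() is strictly increasing: it stretches the right tail (changing the
# shape of each sample) but preserves the ordering, so U does not change.
u_a_transformed = mann_whitney_u([math.exp(x) for x in a],
                                 [math.exp(y) for y in b])
print(u_a_transformed == u_a)
```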
What is the purpose of multiple linear regression analysis?
- To classify data into different categories
- To cluster data into different groups
- To examine the relationship between several independent variables and a dependent variable
- To predict the outcome of a binary dependent variable
Multiple linear regression analysis is used to understand the relationship between several independent (explanatory) variables and a dependent (response) variable. It can also be used for predicting the mean value of the dependent variable given the values of the independent variables.
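A minimal sketch of fitting such a model, assuming NumPy is available: the data are hypothetical and generated without noise from y = 1 + 2*x1 + 3*x2, so least squares should recover those coefficients. The variable names are illustrative only.

```python
import numpy as np

# Hypothetical data: two independent variables, five observations.
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
# Response generated exactly from y = 1 + 2*x1 + 3*x2 (no noise).
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient is the intercept.
design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, b1, b2 = coef
print(intercept, b1, b2)

# Predicted mean response at a new point x1 = 2.5, x2 = 0.5.
predicted_mean = np.array([1.0, 2.5, 0.5]) @ coef
print(predicted_mean)
```

With real, noisy data the coefficients would only approximate the underlying relationship, and each one would be read as the expected change in the response per unit change in that variable with the others held fixed.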
What is the relationship between the eigenvalue of a component and the variance of that component in PCA?
- It depends on the dataset
- There is no relationship
- They are directly proportional
- They are inversely proportional
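The eigenvalue of a component in PCA is directly proportional to the variance of that component. When the components are computed from the covariance matrix, each eigenvalue is equal to the variance of the data along the corresponding principal component, so a larger eigenvalue means that the component explains a larger share of the total variance.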
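The relationship can be checked numerically. The sketch below, assuming NumPy and using randomly generated hypothetical data, takes the eigendecomposition of the sample covariance matrix, projects the centered data onto the eigenvectors, and compares the variance of each projected column with the matching eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated data: 500 observations of 3 variables.
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.3, 0.2]])
data = rng.normal(size=(500, 3)) @ mixing

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)  # sample covariance (divides by n - 1)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # largest eigenvalue first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Variance of the data along each principal component.
scores = centered @ eigenvectors
score_variances = scores.var(axis=0, ddof=1)

explained_ratio = eigenvalues / eigenvalues.sum()
print(eigenvalues)
print(score_variances)
print(explained_ratio)
```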
_________ sampling is a method where every individual in the population has an equal chance of being selected.
- Cluster
- Simple Random
- Stratified
- Systematic
Simple random sampling is a basic sampling method in which each individual in the population has an equal chance of being selected. This guards against selection bias, so the sample tends to be representative of the population, particularly when the sample is large. It does not guarantee that any one sample is representative, but it supports valid inferences about the whole population.
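The equal-chance property can be illustrated with a small simulation. This sketch uses a hypothetical population of 10 units and Python's `random.sample`, which draws without replacement. Over many repeated samples of size 3, each unit should be selected in roughly 3/10 of the samples.

```python
import random
from collections import Counter

population = list(range(10))  # hypothetical population of 10 units
sample_size = 3
trials = 20000

rng = random.Random(42)  # fixed seed so the run is repeatable
counts = Counter()
for _ in range(trials):
    # Simple random sample without replacement.
    counts.update(rng.sample(population, sample_size))

# Fraction of samples in which each unit appeared.
frequencies = {unit: counts[unit] / trials for unit in population}
for unit, freq in frequencies.items():
    print(unit, round(freq, 3))
```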
In a 95% confidence interval, if the true population parameter lies outside of the interval, it is considered a _______ error.
- Alpha
- Standard
- Type I
- Type II
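A 95% confidence interval is built by a procedure that misses the true population parameter about 5% of the time. When that happens, the corresponding hypothesis test at the 5% significance level would reject the parameter's true value, which is a Type I error: the null hypothesis is true but is incorrectly rejected.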
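A simulation sketch of that miss rate, under stated assumptions: the population is normal with a known standard deviation, so the interval is the textbook z-interval (sample mean plus or minus 1.96 standard errors). The parameter values are hypothetical.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population parameters and study design.
true_mean, sigma, n, trials = 10.0, 2.0, 30, 5000

# Half-width of a 95% z-interval when sigma is known.
half_width = 1.96 * sigma / math.sqrt(n)

misses = 0
for _ in range(trials):
    sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
    # The interval misses when the true mean lies outside it.
    if abs(sample_mean - true_mean) > half_width:
        misses += 1

miss_rate = misses / trials
print(miss_rate)  # expected to be close to 0.05
```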
How does PCA help in reducing the dimensionality of the dataset?
- By creating new uncorrelated variables
- By grouping similar data together
- By removing unnecessary data
- By rotating the data to align with axes
PCA reduces the dimensionality of a dataset by creating new uncorrelated variables that successively maximize variance. These new variables, the principal components, are ordered by the variance they capture, so keeping only the first few in place of the original variables reduces the data's dimensionality while retaining most of its variance.
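A minimal sketch of the reduction step, assuming NumPy and using the singular value decomposition of the centered data (one common way to compute PCA). The function name `pca_reduce` and the data are hypothetical. Three variables, two of them nearly collinear, are replaced by two uncorrelated components.

```python
import numpy as np


def pca_reduce(data, k):
    """Project centered data onto its first k principal directions."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing
    # singular value (and therefore by decreasing variance).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T


rng = np.random.default_rng(1)
x = rng.normal(size=200)
# Hypothetical data: the second column is almost a multiple of the first.
data = np.column_stack([
    x,
    2.0 * x + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

reduced = pca_reduce(data, 2)
print(reduced.shape)  # (200, 2)

# The new variables are uncorrelated: the off-diagonal entry is ~0.
corr = np.corrcoef(reduced, rowvar=False)
print(corr[0, 1])
```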
What are the implications of the Central Limit Theorem on statistical testing?
- It asserts that all statistical tests must involve the normal distribution.
- It eliminates the need for statistical testing.
- It guarantees that all results of statistical tests will be accurate.
- It states that sample means will be normally distributed regardless of the shape of the population distribution.
The Central Limit Theorem (CLT) states that the mean of a sufficiently large number of independent, identically distributed random variables with finite variance will be approximately normally distributed, regardless of the shape of the original distribution. This underpins many statistical methods, including hypothesis tests and confidence intervals, which rely on approximate normality of the sampling distribution.
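The effect can be seen in a short simulation. This sketch draws from an exponential distribution, which is strongly right-skewed, and compares the skewness of individual draws with the skewness of means of 50 draws. The sample sizes, seed, and the helper `skewness` are illustrative choices, not part of the theorem.

```python
import random


def skewness(values):
    """Moment-based sample skewness; close to 0 for a symmetric distribution."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(7)  # fixed seed so the run is repeatable

# Individual draws from an exponential distribution (theoretical skewness 2).
raw_draws = [rng.expovariate(1.0) for _ in range(5000)]

# Means of 50 draws each; the CLT says these are approximately normal.
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(50)) / 50
    for _ in range(5000)
]

raw_skew = skewness(raw_draws)
mean_skew = skewness(sample_means)
print(raw_skew, mean_skew)  # the second value should be much closer to 0
```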
Which type of plot is particularly useful for identifying outliers in a dataset?
- Bar plot
- Box plot
- Histogram
- Scatter plot
Box plots are particularly useful for identifying outliers in a dataset. The box shows the first quartile, median, and third quartile, and the whiskers extend toward the minimum and maximum. By a common convention, the whiskers stop at the most extreme values within 1.5 times the interquartile range of the box, and any points beyond them are drawn individually as potential outliers.
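The outlier rule that many box plot implementations apply can be sketched without any plotting. The snippet below assumes the common 1.5 x IQR convention and uses `statistics.quantiles` with its default method. Other quartile definitions give slightly different fences. The data and the helper `iqr_outliers` are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical data with one value far from the rest.
data = [2, 3, 4, 5, 6, 7, 8, 9, 50]
outliers = iqr_outliers(data)
print(outliers)  # [50]
```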
A __________ is a subset of a population that is used to represent the entire group as a whole.
- Dataset
- Parameter
- Sample
- Statistic
A sample in statistics is a subset of individuals or observations from a larger population. Sampling is a key concept in statistics and data science because it allows us to collect and analyze a manageable amount of data that represents a larger group.