A ________ is a smaller group selected from the population of interest.

  • distribution
  • parameter
  • population
  • sample
In statistics, a sample is a smaller group, or subset, selected from the population of interest and used to represent the whole. For example, if the population is all people living in a city, a sample might be 1,000 individuals selected at random from that city.
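
To make the idea concrete, here is a minimal Python sketch (the city size, the use of integer IDs for residents, and the values below are illustrative assumptions, not part of the question):

    import random

    # Hypothetical population: 100,000 resident IDs for a city.
    population = list(range(1, 100_001))

    # Draw a simple random sample of 1,000 individuals without replacement.
    sample = random.sample(population, k=1_000)

    print(len(sample))   # 1000
    print(sample[:5])    # first few sampled IDs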

In cluster analysis, a ________ is a group of similar data points.

  • cluster
  • factor
  • matrix
  • model
In cluster analysis, a cluster is a group of similar data points. The goal of cluster analysis is to group, or cluster, observations that are similar to each other.
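
As a small illustration, the sketch below groups synthetic two-dimensional points into two clusters with k-means (NumPy and scikit-learn are assumed; the blob locations and sizes are made up):

    import numpy as np
    from sklearn.cluster import KMeans

    # Two synthetic "blobs" of similar points, far apart from each other.
    rng = np.random.default_rng(0)
    group_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
    group_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
    X = np.vstack([group_a, group_b])

    # k-means assigns each observation to one of two clusters of similar points.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_[:5], kmeans.labels_[-5:])  # cluster assignments
    print(kmeans.cluster_centers_)                  # one center per cluster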

What happens to the width of the confidence interval when the sample variability increases?

  • The interval becomes narrower
  • The interval becomes skewed
  • The interval becomes wider
  • The interval does not change
The width of the confidence interval increases as the variability in the sample increases. Greater variability leads to a larger standard error, which in turn leads to wider confidence intervals.
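
A quick numerical sketch of this relationship, using the usual large-sample interval for a mean, x̄ ± z·s/√n, so the width is 2·z·s/√n (the standard deviations and sample size below are made-up values):

    import math

    def ci_width(sample_sd: float, n: int, z: float = 1.96) -> float:
        """Width of an approximate 95% CI for a mean: 2 * z * s / sqrt(n)."""
        return 2 * z * sample_sd / math.sqrt(n)

    # Holding n fixed, a larger sample standard deviation gives a wider interval.
    print(ci_width(sample_sd=5, n=100))    # ~1.96
    print(ci_width(sample_sd=10, n=100))   # ~3.92 (doubling s doubles the width)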

What can be the effect of overfitting in polynomial regression?

  • The model will be easier to interpret
  • The model will have high bias
  • The model will perform poorly on new data
  • The model will perform well on new data
Overfitting in polynomial regression means that the model fits the training data too closely, capturing not only the underlying pattern but also the noise. As a result, the model will perform well on the training data but poorly on new, unseen data. This is because the model has essentially 'memorized' the training data and fails to generalize well to new situations.
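
The sketch below illustrates this with NumPy: data generated from a simple quadratic plus noise, fit with a low-degree and a high-degree polynomial. The high-degree fit typically shows a lower training error but a higher error on fresh test points (all data and degrees are synthetic and for illustration only):

    import numpy as np

    rng = np.random.default_rng(1)
    x_train = np.linspace(-3, 3, 20)
    y_train = x_train**2 + rng.normal(scale=2.0, size=x_train.shape)
    x_test = np.linspace(-3, 3, 200)
    y_test = x_test**2 + rng.normal(scale=2.0, size=x_test.shape)

    # The degree-9 fit typically tracks the training noise and does worse on test data.
    for degree in (2, 9):
        coefs = np.polyfit(x_train, y_train, deg=degree)
        train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
        print(f"degree {degree}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")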

What are the consequences of violating the homoscedasticity assumption in multiple linear regression?

  • The R-squared value becomes negative
  • The estimated regression coefficients are biased
  • The regression line is not straight
  • The standard errors are no longer valid
Violating the homoscedasticity assumption (constant variance of the errors) makes the usual standard errors invalid, which can lead to incorrect inferences about the regression coefficients. The coefficient estimates themselves remain unbiased, although they are no longer efficient.
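
One common remedy is to report heteroscedasticity-robust standard errors. The sketch below, assuming statsmodels and synthetic data whose error spread grows with x, compares the conventional and robust (HC3) standard errors:

    import numpy as np
    import statsmodels.api as sm

    # Simulate heteroscedastic errors: the error standard deviation depends on x.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=200)
    y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x)

    X = sm.add_constant(x)
    ols_fit = sm.OLS(y, X).fit()
    robust_fit = sm.OLS(y, X).fit(cov_type="HC3")

    print(ols_fit.bse)     # conventional standard errors (not valid here)
    print(robust_fit.bse)  # heteroscedasticity-robust standard errors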

The null hypothesis, represented as H0, is a statement about the population that either is believed to be _______ or is used to put forth an argument unless it can be shown to be incorrect beyond a reasonable doubt.

  • FALSE
  • Irrelevant
  • Neutral
  • TRUE
The null hypothesis is the status quo or the statement of no effect or no difference, which is assumed to be true until evidence suggests otherwise.
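
A minimal illustration with SciPy: a one-sample t-test where H0 states that the population mean is 50. Because the (synthetic) data below are generated consistently with H0, the test typically fails to reject it; the mean, scale, and sample size are made-up values:

    import numpy as np
    from scipy import stats

    # H0: the population mean is 50; we retain it unless the evidence is strong.
    rng = np.random.default_rng(0)
    sample = rng.normal(loc=50, scale=5, size=30)

    t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
    if p_value < 0.05:
        print("Reject H0")
    else:
        print("Fail to reject H0: no evidence against the null")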

What are the assumptions made when using the VIF (Variance Inflation Factor) to detect multicollinearity?

  • The data should follow a normal distribution.
  • The relationship between variables should be linear.
  • The response variable should be binary.
  • There should be no outliers in the data.
The Variance Inflation Factor (VIF) assumes a linear relationship between the predictor variables. This is because VIF is derived from the R-squared value of the regression of one predictor on all the others.
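
The sketch below computes VIF directly from that definition: regress each predictor on the others with an ordinary linear model (which is where the linearity assumption enters) and take 1 / (1 − R²). It assumes NumPy and scikit-learn, and the nearly collinear predictors are synthetic:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def vif(X: np.ndarray, j: int) -> float:
        """VIF for predictor j: 1 / (1 - R^2) from regressing column j on the rest."""
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        return 1.0 / (1.0 - r2)

    # Two nearly collinear predictors plus an independent one.
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.1, size=200)   # almost a copy of x1
    x3 = rng.normal(size=200)
    X = np.column_stack([x1, x2, x3])

    print([round(vif(X, j), 1) for j in range(X.shape[1])])  # large VIFs for x1, x2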

How is the F-statistic used in the context of a multiple linear regression model?

  • It measures the correlation between the dependent and independent variables
  • It measures the degree of multicollinearity
  • It tests the overall significance of the model
  • It tests the significance of individual coefficients
The F-statistic in a multiple linear regression model is used to test the overall significance of the model. The null hypothesis is that all of the regression coefficients are equal to zero, against the alternative that at least one coefficient is nonzero.
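
With statsmodels, the overall F-statistic and its p-value are available on the fitted results object; the sketch below uses synthetic data with three predictors, one of which has no real effect:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 3))
    # X[:, 2] has no effect on y; the F-test asks whether any slope is nonzero.
    y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=100)

    fit = sm.OLS(y, sm.add_constant(X)).fit()
    print(fit.fvalue)    # F-statistic for the overall significance of the model
    print(fit.f_pvalue)  # its p-value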

What are the strategies to address the issue of overfitting in polynomial regression?

  • Add more independent variables
  • Increase the degree of the polynomial
  • Increase the number of observations
  • Use regularization techniques
Overfitting in polynomial regression can be addressed with regularization techniques, such as Ridge or Lasso, which add a penalty term to the loss function to constrain the magnitude of the coefficients, resulting in a simpler model. Other strategies include reducing the degree of the polynomial or using cross-validation to tune the complexity of the model.
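
As an illustration, the sketch below fits a degree-10 polynomial with a ridge penalty using a scikit-learn pipeline; the alpha value and the synthetic quadratic data are illustrative assumptions:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(3)
    x = np.sort(rng.uniform(-3, 3, size=30)).reshape(-1, 1)
    y = x.ravel() ** 2 + rng.normal(scale=2.0, size=30)

    # The ridge penalty (alpha) shrinks the polynomial coefficients,
    # taming the wiggliness of an unpenalized high-degree fit.
    model = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=1.0))
    model.fit(x, y)
    print(model.named_steps["ridge"].coef_)  # shrunken polynomial coefficients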

When are non-parametric statistical methods most useful?

  • When the data does not meet the assumptions for parametric methods
  • When the data follows a normal distribution
  • When the data is free from outliers
  • When there is a large amount of data
Non-parametric statistical methods are most useful when the data does not meet the assumptions for parametric methods. For example, if the data does not follow a normal distribution, or if there are concerns about outliers or skewness, non-parametric methods may be appropriate.
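
For example, the Mann-Whitney U test is a non-parametric alternative to the two-sample t-test that does not assume normally distributed data. The sketch below, assuming SciPy and synthetic skewed (exponential) samples, applies it to two groups:

    import numpy as np
    from scipy import stats

    # Two heavily skewed samples, for which a t-test's normality assumption is dubious.
    rng = np.random.default_rng(4)
    group_a = rng.exponential(scale=1.0, size=40)
    group_b = rng.exponential(scale=1.5, size=40)

    u_stat, p_value = stats.mannwhitneyu(group_a, group_b)
    print(f"U = {u_stat:.1f}, p = {p_value:.3f}")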