How can the problem of heteroscedasticity be resolved in linear regression?

  • By adding more predictors
  • By changing the estimation method
  • By collecting more data
  • By transforming the dependent variable
Heteroscedasticity can often be remedied by transforming the dependent variable, typically with a logarithmic transformation, which tends to stabilize the variance of the residuals across different levels of the predictors. The log transform works especially well when the error is multiplicative on the original scale.
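A quick numpy sketch with hypothetical data can make this concrete: the noise below is multiplicative on the raw scale, so the residual spread grows with x, while a log transform of y stabilizes it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical heteroscedastic data: multiplicative (lognormal) noise,
# so the spread of y grows with x on the original scale.
x = np.linspace(1, 10, 500)
y = np.exp(1.0 + 0.5 * x + rng.normal(0, 0.4, size=x.size))

# Fit y ~ x directly: residual spread grows with x.
b1, b0 = np.polyfit(x, y, 1)
resid_raw = y - (b0 + b1 * x)

# Fit log(y) ~ x: the log transform stabilizes the residual variance.
c1, c0 = np.polyfit(x, np.log(y), 1)
resid_log = np.log(y) - (c0 + c1 * x)

# Compare residual spread in the low-x vs high-x halves of the data.
half = x.size // 2
ratio_raw = resid_raw[half:].std() / resid_raw[:half].std()
ratio_log = resid_log[half:].std() / resid_log[:half].std()
```

On the raw scale the high-x residuals are far more spread out than the low-x ones (`ratio_raw` well above 1), while after the log transform the two halves have similar spread (`ratio_log` near 1).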

When is a Poisson distribution used?

  • When each event is dependent on the previous event
  • When the events are independent and occur at a constant rate
  • When the events are normally distributed
  • When the events have only two possible outcomes
A Poisson distribution is used when we are counting the number of times an event happens over a fixed interval of time or space, and the events are independent and occur at a constant average rate. It's often used to model random events such as calls to a call center or arrivals at a website.
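A small numpy sketch of the call-center example (hypothetical rate) illustrates a defining property of the Poisson distribution: its mean and variance both equal the rate parameter.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical example: calls arrive independently at an average
# rate of 4 per hour, so hourly call counts follow Poisson(4).
rate = 4.0
counts = rng.poisson(rate, size=100_000)

# A defining property of the Poisson distribution:
# mean == variance == rate.
mean, var = counts.mean(), counts.var()
```

With a large simulation both sample moments land very close to the rate of 4.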

What is multicollinearity and how does it affect simple linear regression?

  • It is the correlation between dependent variables and it has no effect on regression
  • It is the correlation between errors and it makes the regression model more accurate
  • It is the correlation between independent variables and it can cause instability in the regression coefficients
  • It is the correlation between residuals and it causes bias in the regression coefficients
Multicollinearity refers to a high correlation among the independent variables in a regression model, so it can only arise when the model includes two or more predictors. It does not necessarily reduce the predictive power or reliability of the model as a whole, but it inflates the variance of the individual coefficient estimates, making them unstable and difficult to interpret.
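A numpy sketch with a hypothetical near-collinear design shows the effect: the individual coefficients are poorly determined while their combined effect remains stable, and the condition number of X'X flags the problem.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Hypothetical design: x2 is almost an exact copy of x1, so the two
# predictors are nearly perfectly correlated (multicollinear).
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.01, n)
y = 1.0 + 2.0 * x1 + rng.normal(0, 1, n)

# Ordinary least squares fit with intercept, x1, and x2.
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# The individual coefficients on x1 and x2 are unstable, but their
# sum (the combined effect of the pair) is estimated reliably.
combined = coef[1] + coef[2]

# A huge condition number of X'X signals the instability.
cond = np.linalg.cond(X.T @ X)
```

The sum of the two slope estimates stays near the true combined effect of 2 even when the individual estimates wander far from (2, 0).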

The distribution of all possible sample means is known as a __________.

  • Normal Distribution
  • Population Distribution
  • Sampling Distribution
  • Uniform Distribution
The sampling distribution in statistics is the probability distribution of a given statistic based on a random sample. For a statistic that is calculated from a sample, each different sample could (and likely will) provide a different value of that statistic. The sampling distribution shows us how those calculated statistics would be distributed.
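A simulation makes the idea tangible: below, sample means are drawn repeatedly from a hypothetical (deliberately non-normal) population, and the resulting sampling distribution is centred on the population mean with standard deviation sigma / sqrt(n), the standard error.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical population: exponential with mean 5 (clearly non-normal).
population = rng.exponential(scale=5.0, size=1_000_000)

# Draw many samples of size n and record each sample's mean.
n, n_samples = 50, 10_000
sample_means = rng.choice(population, size=(n_samples, n)).mean(axis=1)

# The sampling distribution of the mean is centred on the population
# mean, with standard deviation sigma / sqrt(n) (the standard error).
center = sample_means.mean()
spread = sample_means.std()
expected_spread = population.std() / np.sqrt(n)
```

Even though each individual observation is skewed, the distribution of the means concentrates around 5 with the predicted spread.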

How is 'K-means' clustering different from 'hierarchical' clustering?

  • Hierarchical clustering creates a hierarchy of clusters, while K-means does not
  • Hierarchical clustering uses centroids, while K-means does not
  • K-means requires the number of clusters to be defined beforehand, while hierarchical clustering does not
  • K-means uses a distance metric to group instances, while hierarchical clustering does not
K-means clustering requires the number of clusters to be defined beforehand, while hierarchical clustering does not. Hierarchical clustering forms a dendrogram from which the user can choose the number of clusters based on the problem requirements.
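A minimal numpy implementation of K-means (a sketch on hypothetical blob data, not a production routine) makes the contrast explicit: `k` must be passed in before any fitting happens, whereas a hierarchical method would defer that choice to a cut of the dendrogram.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical 2-D data: three well-separated blobs.
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
data = np.vstack([c + rng.normal(0, 0.5, size=(100, 2))
                  for c in blob_centers])

def kmeans(X, k, iters=20, seed=0):
    """Minimal K-means: k must be chosen *before* fitting."""
    r = np.random.default_rng(seed)
    centroids = X[r.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = ((X[:, None] - centroids) ** 2).sum(axis=-1)
        labels = np.argmin(dists, axis=1)
        # Move each non-empty centroid to the mean of its points.
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels, centroids

labels, centroids = kmeans(data, k=3)
```

Note that `k=3` is fixed up front; changing the desired number of clusters means re-running the whole algorithm, whereas a dendrogram from hierarchical clustering can be cut at any level after a single fit.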

Under what conditions does a binomial distribution approximate a normal distribution?

  • When the events are not independent
  • When the number of trials is large and the probability of success is not too close to 0 or 1
  • When the number of trials is small
  • When the probability of success changes with each trial
The binomial distribution approaches the normal distribution as the number of trials n grows large, provided that the probability of success p is not too close to 0 or 1. This is known as the De Moivre–Laplace theorem. A common rule of thumb is that the approximation is reasonable when both np and n(1 − p) are at least 10.
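A simulation with hypothetical parameters (n = 400, p = 0.3, so np = 120 and n(1 − p) = 280 comfortably satisfy the rule of thumb) checks the approximation: the draws should be centred on np with roughly the normal share of mass within one standard deviation.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical check: Binomial(n=400, p=0.3) against its normal
# approximation N(np, np(1-p)), per the De Moivre-Laplace theorem.
n, p = 400, 0.3
draws = rng.binomial(n, p, size=200_000)

mu = n * p                         # 120
sigma = np.sqrt(n * p * (1 - p))   # ~9.17

# For a normal distribution, ~68% of the mass lies within one sigma.
within_1sd = np.mean(np.abs(draws - mu) <= sigma)
```

The empirical share within one sigma lands near the normal value of about 0.68 (slightly above it here because the binomial is discrete), confirming the approximation at these parameter values.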

How does stratified random sampling differ from simple random sampling?

  • Stratified random sampling always involves larger sample sizes than simple random sampling
  • Stratified random sampling involves dividing the population into subgroups and selecting individuals from each subgroup
  • Stratified random sampling is the same as simple random sampling
  • Stratified random sampling only selects individuals from a single subgroup
Stratified random sampling differs from simple random sampling in that it first divides the population into non-overlapping groups, or strata, based on specific characteristics, and then selects a simple random sample from each stratum. This can ensure that each subgroup is adequately represented in the sample, which can increase the precision of estimates.
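The procedure can be sketched in a few lines of numpy on a hypothetical population with three unequal strata: draw a simple random sample within each stratum, with sample sizes proportional to stratum size.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical population of 10,000 people split into three strata
# (e.g. age groups) of unequal sizes.
strata = {"young": 6000, "middle": 3000, "senior": 1000}
sample_frac = 0.05

stratified_sample = {}
for name, size in strata.items():
    members = np.arange(size)        # ids within this stratum
    k = int(size * sample_frac)      # proportional allocation
    # Simple random sample *within* each stratum, without replacement.
    stratified_sample[name] = rng.choice(members, size=k, replace=False)

sizes = {name: len(s) for name, s in stratified_sample.items()}
```

Because allocation is proportional, each subgroup is guaranteed its share of the sample (300, 150, and 50 here), something a single simple random sample of 500 from the whole population would only achieve on average.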

Why are bar plots commonly used in data analysis?

  • To compare the frequency of categorical variables
  • To show the change of a variable over time
  • To show the distribution of a single variable
  • To show the relationship between two continuous variables
Bar plots are commonly used in data analysis to compare the frequency, count, or proportion of categorical variables. Each category is represented by a separate bar, and the length or height of the bar represents its corresponding value.
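The starting point of any bar plot is a frequency table of the categories; the sketch below tallies hypothetical survey responses and renders the bars as text (a plotting library such as matplotlib would draw the same counts graphically).

```python
from collections import Counter

# Hypothetical categorical data: responses to a survey question.
responses = ["yes"] * 12 + ["no"] * 7 + ["undecided"] * 4

# One bar per category; bar height equals the category's count.
counts = Counter(responses)

# Text rendering of the bar plot (one '#' per observation).
for category, count in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{category:>10} | {'#' * count} ({count})")
```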

What does inference in multiple linear regression primarily involve?

  • Calculating the mean of the residuals
  • Creating the scatter plot
  • Drawing the best fit line
  • Interpreting the coefficients
Inference in multiple linear regression primarily involves interpreting the coefficients of the model: each coefficient represents the expected change in the response variable for a one-unit change in the corresponding explanatory variable, holding all other variables constant. Inference also includes testing whether each coefficient differs significantly from zero and constructing confidence intervals for it.
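A numpy sketch with hypothetical data (true model y = 3 + 2·x1 − 1.5·x2 + noise) shows both pieces of this: the coefficient estimates that get interpreted, and the standard errors that hypothesis tests and confidence intervals are built from.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500

# Hypothetical model: y = 3 + 2*x1 - 1.5*x2 + noise.
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0, 1, n)

# Ordinary least squares fit.
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, b1, b2 = coef

# b1 estimates the expected change in y for a one-unit increase in
# x1, holding x2 constant (and likewise b2 for x2).

# Standard errors, the basis for t-tests and confidence intervals.
resid = y - X @ coef
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
```

The fitted coefficients recover the true values (3, 2, −1.5) to within a couple of standard errors, and each `se` entry feeds directly into a t-statistic or a confidence interval for that coefficient.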

What are the degrees of freedom in a Chi-square test for goodness of fit?

  • The number of categories minus 1
  • The number of categories plus 1
  • The number of observations minus 1
  • The number of observations plus 1
In a Chi-square test for goodness of fit, the degrees of freedom are calculated as the number of categories minus 1. Because the category counts must sum to the total number of observations, only k − 1 of the k counts are free to vary.
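A worked example with hypothetical die-roll data shows the calculation: six categories give 6 − 1 = 5 degrees of freedom, and the chi-square statistic is the usual sum of (observed − expected)² / expected.

```python
import numpy as np

# Hypothetical goodness-of-fit test: is a six-sided die fair?
observed = np.array([18, 22, 16, 25, 19, 20])   # rolls per face
expected = np.full(6, observed.sum() / 6)       # 20 each if fair

# Chi-square statistic: sum of (O - E)^2 / E over the categories.
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = len(observed) - 1   # number of categories minus 1
```

Here `chi2` comes out to 2.5 on 5 degrees of freedom, well below typical critical values, so these counts give no evidence against fairness.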