What assumption about the residuals of a linear regression model does homoscedasticity refer to?

  • The residuals are independent
  • The residuals are normally distributed
  • The residuals have a linear relationship with the dependent variable
  • The residuals have constant variance
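Homoscedasticity refers to the assumption that the residuals (errors) have constant variance at each level of the independent variable(s). When this assumption is violated (heteroscedasticity), the ordinary least squares coefficient estimates remain unbiased, but their standard errors, and therefore the confidence intervals and hypothesis tests built on them, become unreliable.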
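
As a rough illustration of what "constant variance" means, the sketch below fits a line by least squares and compares the residual variance in the upper half of the x range with that in the lower half. The helper names (`fit_line`, `residual_variance_ratio`) and the simulated data are assumptions made for this example; a ratio near 1 is consistent with homoscedasticity, while a ratio far from 1 suggests the spread changes with x. This is an informal check, not a formal test.

```python
import random
from statistics import mean, pvariance

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def residual_variance_ratio(x, y):
    """Residual variance for the upper half of x divided by the lower half."""
    slope, intercept = fit_line(x, y)
    # Pair each x with its residual and order the pairs by x.
    pairs = sorted((xi, yi - (intercept + slope * xi)) for xi, yi in zip(x, y))
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return pvariance(high) / pvariance(low)

rng = random.Random(0)  # fixed seed so the simulated data is reproducible
x = [rng.uniform(1, 10) for _ in range(2000)]
# Noise with the same spread everywhere (homoscedastic).
y_constant = [2 * xi + rng.gauss(0, 1) for xi in x]
# Noise whose spread grows with x (heteroscedastic).
y_growing = [2 * xi + rng.gauss(0, 0.3 * xi) for xi in x]

ratio_constant = residual_variance_ratio(x, y_constant)
ratio_growing = residual_variance_ratio(x, y_growing)
print(f"constant noise: {ratio_constant:.2f}, growing noise: {ratio_growing:.2f}")
```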

How does stratified random sampling differ from simple random sampling?

  • Stratified random sampling always involves larger sample sizes than simple random sampling
  • Stratified random sampling involves dividing the population into subgroups and selecting individuals from each subgroup
  • Stratified random sampling is the same as simple random sampling
  • Stratified random sampling only selects individuals from a single subgroup
Stratified random sampling differs from simple random sampling in that it first divides the population into non-overlapping groups, or strata, based on specific characteristics, and then selects a simple random sample from each stratum. This can ensure that each subgroup is adequately represented in the sample, which can increase the precision of estimates.
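
The procedure described above can be sketched as follows. The function name `stratified_sample`, the `stratum_of` callback, and the toy population are hypothetical; the sketch assumes proportional allocation (the same sampling fraction in every stratum) with at least one item per stratum.

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, fraction, rng):
    """Draw a simple random sample of the given fraction from each stratum."""
    # Step 1: divide the population into non-overlapping strata.
    strata = defaultdict(list)
    for item in population:
        strata[stratum_of(item)].append(item)
    # Step 2: take a simple random sample within each stratum.
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

# Toy population: 80 members of group "A" and 20 members of group "B".
population = [("A", i) for i in range(80)] + [("B", i) for i in range(20)]
sample = stratified_sample(population, lambda item: item[0], 0.1, random.Random(42))
counts = {label: sum(1 for s in sample if s[0] == label) for label in ("A", "B")}
print(counts)
```

With a 10% fraction, the smaller group "B" is guaranteed two sampled members, whereas a simple random sample of ten could by chance contain none.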

Why are bar plots commonly used in data analysis?

  • To compare the frequency of categorical variables
  • To show the change of a variable over time
  • To show the distribution of a single variable
  • To show the relationship between two continuous variables
Bar plots are commonly used in data analysis to compare the frequency, count, or proportion of categorical variables. Each category is represented by a separate bar, and the length or height of the bar represents its corresponding value.

Conditional independence of A and B given C means that knowing that C has occurred does not change the ________ between A and B.

  • Difference
  • Intersection
  • Ratio
  • Relationship
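The expected answer is "Relationship". More precisely, conditional independence of A and B given C means that, once C is known to have occurred, A and B are independent: P(A ∩ B | C) = P(A | C) · P(B | C). In other words, given C, learning whether A occurred provides no further information about B. This does not imply that A and B are independent when C is ignored.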
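
A small numeric sketch may help here. Formally, A and B are conditionally independent given C when P(A ∩ B | C) = P(A | C) · P(B | C). The probabilities below are made up for illustration: the joint distribution is built so that A and B are independent within each value of C, and the code then shows that they are still dependent once C is ignored.

```python
from fractions import Fraction
from itertools import product

# Illustrative (assumed) probabilities: C makes both A and B more likely.
p_c = {1: Fraction(1, 2), 0: Fraction(1, 2)}
p_a_given_c = {1: Fraction(9, 10), 0: Fraction(1, 10)}
p_b_given_c = {1: Fraction(9, 10), 0: Fraction(1, 10)}

def bern(p, value):
    """Probability that a 0/1 event with success probability p equals value."""
    return p if value == 1 else 1 - p

# Joint distribution over (a, b, c), built so A and B are independent given C.
joint = {
    (a, b, c): p_c[c] * bern(p_a_given_c[c], a) * bern(p_b_given_c[c], b)
    for a, b, c in product((0, 1), repeat=3)
}

def prob(event):
    """Total probability of the outcomes (a, b, c) that satisfy event."""
    return sum(p for outcome, p in joint.items() if event(*outcome))

def conditionally_independent(c):
    """Check P(A and B | C=c) == P(A | C=c) * P(B | C=c)."""
    pc = prob(lambda a, b, cc: cc == c)
    p_ab = prob(lambda a, b, cc: a == 1 and b == 1 and cc == c) / pc
    p_a = prob(lambda a, b, cc: a == 1 and cc == c) / pc
    p_b = prob(lambda a, b, cc: b == 1 and cc == c) / pc
    return p_ab == p_a * p_b

# Marginal probabilities, ignoring C.
p_a = prob(lambda a, b, c: a == 1)
p_b = prob(lambda a, b, c: b == 1)
p_ab = prob(lambda a, b, c: a == 1 and b == 1)

print(conditionally_independent(0), conditionally_independent(1))
print(p_ab, "vs", p_a * p_b)  # unequal: A and B are marginally dependent
```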

What is the assumption made when computing the Pearson correlation coefficient?

  • The correlation is zero
  • The variables are independent
  • The variables are normally distributed
  • There is a linear relationship between variables
The Pearson correlation coefficient measures the strength of a linear relationship, so it assumes that the relationship between the variables is linear. It is also typically assumed that the variables are continuous and that the data are homoscedastic (i.e., the spread of one variable is roughly the same across all levels of the other).
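
To see why linearity matters, the sketch below computes the coefficient by hand for two made-up data sets: one perfectly linear and one perfectly quadratic. The function name `pearson_r` and the data are assumptions for this example. The quadratic case shows that a strong but non-linear relationship can yield a coefficient of zero.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

x = [-2, -1, 0, 1, 2]
r_linear = pearson_r(x, [3 * a + 1 for a in x])   # y depends linearly on x
r_quadratic = pearson_r(x, [a ** 2 for a in x])   # y depends on x, but not linearly

print(r_linear, r_quadratic)
```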

How is the variance related to the standard deviation in a data set?

  • The variance is the average of the standard deviation
  • The variance is the square of the standard deviation
  • The variance is the square root of the standard deviation
  • The variance is twice the standard deviation
The variance is the square of the standard deviation (equivalently, the standard deviation is the square root of the variance). Both measure the dispersion of a dataset, but the standard deviation is expressed in the same units as the data, while the variance is expressed in squared units.
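
A quick check with Python's `statistics` module illustrates the relationship. The data set is made up, and the population versions (`pvariance`, `pstdev`) are used here; the same relationship holds for the sample versions (`variance`, `stdev`).

```python
from math import isclose
from statistics import pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values with mean 5

sd = pstdev(data)       # population standard deviation
var = pvariance(data)   # population variance

print(sd, var)
print(isclose(sd ** 2, var))  # the variance is the squared standard deviation
```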

What does kurtosis measure in a dataset?

  • Central tendency
  • Dispersion
  • Skewness
  • The "tailedness" of the distribution
Kurtosis is a statistical measure that describes how heavy the tails of a distribution are relative to the tails of a normal distribution. In other words, kurtosis indicates how prone a distribution is to producing extreme values.
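
One common way to quantify this is excess kurtosis: the fourth central moment divided by the squared variance, minus 3, so that a normal distribution scores 0. The sketch below uses the population (biased) formula and two made-up data sets; the function name `excess_kurtosis` is an assumption for this example.

```python
from statistics import mean

def excess_kurtosis(data):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    m = mean(data)
    n = len(data)
    m2 = sum((v - m) ** 2 for v in data) / n  # second central moment (variance)
    m4 = sum((v - m) ** 4 for v in data) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

evenly_spread = [1, 2, 3, 4, 5]             # no extreme values
heavy_tailed = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]  # mostly central, two extremes

k_even = excess_kurtosis(evenly_spread)
k_heavy = excess_kurtosis(heavy_tailed)
print(k_even, k_heavy)  # negative for the flat data, positive for the heavy tails
```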

A statistical test has more power to detect an effect if the effect size is ______.

  • Equal to the sample size
  • Large
  • Small
  • Unchanged
The power of a test is influenced by the effect size - the magnitude of the difference or relationship you're testing for. Larger effect sizes increase the power of a test because they create a larger signal relative to the noise, making it easier to detect an effect if one exists.
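
The effect of effect size on power can be sketched with a normal approximation. The example assumes a two-sided one-sample z-test with known variance, where `effect_size` is the standardized mean difference; the function name `z_test_power` and the chosen values of n and alpha are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect_size, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test.

    effect_size is the true mean shift in standard deviation units.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # rejection threshold
    shift = effect_size * sqrt(n)               # mean of the test statistic
    # Probability that the statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

powers = {d: z_test_power(d, n=30) for d in (0.2, 0.5, 0.8)}
for d, power in powers.items():
    print(f"effect size {d}: power {power:.2f}")
```

With the sample size and significance level held fixed, power rises as the effect size grows; at an effect size of zero the rejection probability is simply alpha.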

How does the height of a bar in a histogram relate to the frequency of the data?

  • It has no relation with the frequency
  • It represents the cumulative frequency
  • It represents the mean frequency
  • It represents the relative frequency
The height of a bar in a histogram represents the frequency (or relative frequency) of the data in that particular bin, assuming bins of equal width. The taller the bar, the more observations fall into that interval.
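
The bar heights can be computed directly by counting observations per bin. The sketch below assumes equal-width bins spanning the data range, with the maximum value placed in the last bin; `histogram_counts` and the data are made up for this example.

```python
def histogram_counts(data, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in data:
        # The maximum value would index one past the end, so clamp it.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
counts = histogram_counts(data, bins=4)         # bar heights as frequencies
relative = [c / len(data) for c in counts]      # bar heights as relative frequencies

for count in counts:
    print("#" * count)  # crude text histogram, one row per bin
```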

What is the purpose of 'normalization' or 'standardization' in the pre-processing step of cluster analysis?

  • To decrease the number of clusters
  • To ensure that all features contribute equally to the distance calculation
  • To handle missing values
  • To increase the computational complexity
Normalization or standardization ensures that all features contribute equally to the final distance calculation, regardless of their original scale. Without this step, features with larger scales would dominate the distance calculation, potentially leading to misleading clusters.
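
The sketch below illustrates the effect with z-score standardization (subtract the column mean, divide by the column standard deviation). The `standardize` helper and the (age, income) rows are made up for this example, and it assumes no feature is constant, since that would give a standard deviation of zero.

```python
from math import dist
from statistics import mean, pstdev

def standardize(rows):
    """Rescale every feature (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) for col in columns]  # assumes no constant column
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]

# Illustrative rows of (age in years, income in dollars).
rows = [(25, 30_000), (45, 32_000), (30, 90_000), (50, 88_000)]
scaled = standardize(rows)

# Distances from the first row to a row with a different age (index 1)
# and to a row with a different income (index 2).
raw_ratio = dist(rows[0], rows[2]) / dist(rows[0], rows[1])
scaled_ratio = dist(scaled[0], scaled[2]) / dist(scaled[0], scaled[1])
print(f"raw ratio: {raw_ratio:.1f}, scaled ratio: {scaled_ratio:.1f}")
```

On the raw data the income difference dwarfs the age difference, so Euclidean distance is driven almost entirely by income; after standardization the two distances are of comparable size.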