What is the purpose of hypothesis testing in statistics?

  • To compare the sample mean to the population mean
  • To make inferences about a population based on sample data
  • To understand the distribution of the data
  • To visualize the data
Hypothesis testing is a statistical method for making decisions from sample or experimental data. It is a form of inferential statistics: it lets us judge whether the observed results are consistent with chance variation under the null hypothesis or instead point to a real effect in the population.

The p-value in a hypothesis test is the probability of getting a test statistic at least as extreme as the one observed, given that the _______ hypothesis is true.

  • Alternative
  • Null
  • Original
  • Random
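In the context of hypothesis testing, the p-value is the probability of observing a test statistic at least as extreme as the one calculated from the sample, assuming that the null hypothesis is true. A small p-value means the observed data would be unusual if the null hypothesis held.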
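
As an illustration, the sketch below computes a two-sided p-value for a one-sample z-test. The function name and the numbers (sample mean 103, hypothesized mean 100, known standard deviation 15, sample size 100) are made up for the example, and the test assumes the population standard deviation is known.

```python
import math
from statistics import NormalDist


def two_sided_p_value(sample_mean, mu0, sigma, n):
    """Two-sided p-value for a one-sample z-test (sigma assumed known)."""
    # Standardize the distance between the sample mean and the null value.
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Probability, under the null hypothesis, of a |Z| at least this large.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical data: H0 says the population mean is 100.
z, p = two_sided_p_value(sample_mean=103.0, mu0=100.0, sigma=15.0, n=100)
print(f"z = {z:.2f}, p = {p:.4f}")  # z = 2.00, p = 0.0455
```

With these made-up numbers the p-value is just under 0.05, so the null hypothesis would be rejected at the 5% level but not at the 1% level.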

What are the assumptions required for a distribution to be considered a Poisson distribution?

  • The events are dependent on each other
  • The events are occurring at a constant mean rate and independently of the time since the last event
  • The events have more than two possible outcomes
  • The number of trials is fixed
The key assumptions for a Poisson distribution are that events occur at a constant mean rate and independently of the time since the last event. The Poisson distribution is often used to model the number of times an event occurs in a fixed interval of time or space.
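
To make this concrete, here is a minimal sketch of the Poisson probability mass function. The rate of 3 events per interval is a hypothetical value, and the sums are truncated at 50 events, where the remaining probability is negligible for this rate.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


lam = 3.0  # hypothetical average of 3 events per interval
probs = [poisson_pmf(k, lam) for k in range(50)]

# The probabilities sum to (almost exactly) 1, and for a Poisson
# distribution the mean and the variance both equal the rate.
total = sum(probs)
mean = sum(k * p for k, p in enumerate(probs))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
print(f"total = {total:.6f}, mean = {mean:.6f}, variance = {variance:.6f}")
```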

What is the relationship between the mean and the standard deviation in a normal distribution?

  • The mean is always larger than the standard deviation
  • The mean is the midpoint of the distribution, and the standard deviation measures the spread
  • The standard deviation is always larger than the mean
  • There is no relationship between the mean and the standard deviation
In a normal distribution, the mean is the center of the distribution and represents the "average" value. The standard deviation measures the dispersion around the mean. Roughly 68% of the data falls within one standard deviation of the mean in a normal distribution.
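
The 68% figure can be checked with a short sketch using Python's standard library. The mean of 50 and standard deviation of 10 are arbitrary; the result is the same for any normal distribution.

```python
from statistics import NormalDist

# Hypothetical parameters; any mean and positive standard deviation work.
dist = NormalDist(mu=50.0, sigma=10.0)

# Probability mass between (mean - 1 sd) and (mean + 1 sd).
lower = dist.mean - dist.stdev
upper = dist.mean + dist.stdev
within_one_sd = dist.cdf(upper) - dist.cdf(lower)
print(f"{within_one_sd:.4f}")  # 0.6827
```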

_______ is a measure of how spread out the numbers in a dataset are around the mean.

  • Median
  • Range
  • Standard Deviation
  • Variance
Standard deviation is a measure of how spread out the numbers in a dataset are around the mean. It is the square root of the variance, that is, the square root of the average squared deviation from the mean, so it is expressed in the same units as the data. The higher the standard deviation, the more spread out the data is.
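
The calculation can be sketched step by step on a small made-up dataset. This uses the population formula (dividing by n); the mean absolute deviation is included only for comparison, to show that it is a different quantity from the standard deviation.

```python
import math
from statistics import mean, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
mu = mean(data)

# Variance: the average of the squared deviations from the mean.
variance = sum((x - mu) ** 2 for x in data) / len(data)
# Standard deviation: the square root of the variance.
sd = math.sqrt(variance)
# Mean absolute deviation: the average distance from the mean.
mad = sum(abs(x - mu) for x in data) / len(data)

print(mu, variance, sd, mad)  # 5 4.0 2.0 1.5
print(pstdev(data))           # 2.0, the library's population standard deviation
```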

In the context of cluster analysis, what is the 'centroid'?

  • The average distance between clusters
  • The geometric center of a cluster
  • The largest point in a cluster
  • The smallest point in a cluster
The centroid is the geometric center of a cluster. In other words, it is the point whose coordinates are the means of the corresponding coordinates of all the points in that cluster.
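
A minimal sketch of the computation, using a hypothetical helper named `centroid` and three made-up two-dimensional points:

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))


cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]  # hypothetical cluster
c = centroid(cluster)
print(c)  # (3.0, 2.0)
```

Note that the centroid does not have to coincide with any actual data point in the cluster.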

What is the effect of monotonic transformations on Spearman’s rank correlation coefficient?

  • They decrease the coefficient
  • They don't affect the coefficient
  • They increase the coefficient
  • They make the coefficient negative
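Monotonically increasing transformations do not affect Spearman’s rank correlation coefficient. This is because Spearman’s correlation is based on the rank order of the data, and increasing transformations preserve this order. A monotonically decreasing transformation reverses the order, which flips the sign of the coefficient but leaves its magnitude unchanged.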
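
The invariance can be checked with a small sketch that computes Spearman's coefficient as the Pearson correlation of the ranks. The helper names and data are made up, and the ranking function assumes there are no tied values.

```python
import math


def ranks(values):
    """Ranks 1..n of the values (assumes no ties, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1.0, 2.5, 3.0, 4.2, 5.1]
y = [2.0, 1.0, 4.0, 3.5, 6.0]

rho = spearman(x, y)
rho_exp = spearman([math.exp(v) for v in x], y)  # increasing transform of x
rho_neg = spearman([-v for v in x], y)           # decreasing transform of x
print(rho, rho_exp, rho_neg)
```

The exponential transform leaves the coefficient unchanged, while negating x reverses the ranks and flips the sign.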

What's the difference between a histogram and a bar plot?

  • Bar plots are for continuous data, histograms for categorical data
  • Both are for continuous data only
  • Histograms are for continuous data, bar plots for categorical data
  • There is no difference
The main difference between a histogram and a bar plot is the type of data they represent. A histogram is used for continuous data, where the bins represent ranges of data, while a bar plot is used for categorical data to compare the frequency or count of different categories.
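
The distinction shows up in how the counts behind each chart are produced. In the sketch below (made-up data, with an assumed bin width of 10 starting at 150), continuous values are grouped into numeric ranges for a histogram, while categorical values are simply counted per category for a bar plot.

```python
from collections import Counter

# Continuous data: group values into bins that cover ranges.
heights_cm = [151.2, 158.9, 160.4, 163.0, 167.5, 171.1, 174.8, 179.9]
start, bin_width = 150, 10  # assumed binning choices
hist_counts = Counter(
    start + bin_width * int((h - start) // bin_width) for h in heights_cm
)
for left_edge in sorted(hist_counts):
    print(f"[{left_edge}, {left_edge + bin_width}): {hist_counts[left_edge]}")

# Categorical data: count occurrences of each distinct category.
colors = ["red", "blue", "red", "green", "blue", "red"]
bar_counts = Counter(colors)
for category, count in bar_counts.most_common():
    print(f"{category}: {count}")
```

Changing the bin width changes the shape of the histogram, whereas the bar plot's categories are fixed by the data and can be drawn in any order.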

What is the error term in a simple linear regression model?

  • It is the dependent variable
  • It is the difference between the observed and predicted values
  • It is the independent variable
  • It is the slope of the regression line
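The error term in a simple linear regression model is the difference between the observed value of the dependent variable and the value given by the regression line. It captures the variability in the dependent variable that is not explained by the independent variable in the model. Strictly speaking, the error term refers to the true (unknown) regression line; its sample counterpart, the residual, is the difference between the observed value and the value predicted by the fitted line.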
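
As a rough sketch, the residuals of a least-squares fit can be computed as follows. The data and the helper name `fit_line` are invented for the example, and the residuals here are sample estimates of the unobserved error terms.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


x = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical independent variable
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical dependent variable

intercept, slope = fit_line(x, y)
predicted = [intercept + slope * xi for xi in x]
# Residual = observed value minus value predicted by the fitted line.
residuals = [yi - pi for yi, pi in zip(y, predicted)]
print(intercept, slope)
print(residuals)
```

For a least-squares line with an intercept, the residuals always sum to zero (up to floating-point rounding).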

What can be inferred if the residuals are not randomly distributed in the residual plot?

  • The data has no outliers
  • The data is perfectly linear
  • The linear regression model is a perfect fit for the data
  • The linear regression model is not a good fit for the data
If the residuals are not randomly distributed (e.g., if they form a pattern), it suggests that the linear regression model is not a good fit for the data. This could be because the relationship between the variables is not linear, or because the data exhibits heteroscedasticity (unequal variances of errors), among other reasons.
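
One way to see such a pattern is to fit a straight line to data that is actually curved. The sketch below uses made-up data following y = x squared; the helper name `fit_line` is hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [xi ** 2 for xi in x]  # a quadratic, not linear, relationship

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
print(residuals)  # [5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0]
```

The residuals are positive at both ends and negative in the middle, a U-shaped pattern indicating that a straight line misses the curvature in the data.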