What is the effect of 'binning' on the overall variance of the dataset?

  • It can either increase or decrease the variance
  • It decreases the variance
  • It does not affect the variance
  • It increases the variance
Binning reduces the variance of a dataset by replacing individual values with a summary statistic for their bin, such as the bin mean or median, thereby reducing the spread of the data.
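As a quick illustration, smoothing by bin means removes only the within-bin spread, so the overall variance can never increase (a toy pure-Python sketch; `bin_by_mean` is a hypothetical helper using equal-width bins):

```python
import statistics

def bin_by_mean(values, n_bins=3):
    """Smooth values by replacing each one with the mean of its equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bucket = lambda v: min(int((v - lo) / width), n_bins - 1)
    bins = {}
    for v in values:
        bins.setdefault(bucket(v), []).append(v)
    means = {i: statistics.mean(b) for i, b in bins.items()}
    return [means[bucket(v)] for v in values]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
smoothed = bin_by_mean(data)
# By the law of total variance, binning removes the within-bin component,
# so the smoothed variance can never exceed the original.
assert statistics.pvariance(smoothed) <= statistics.pvariance(data)
```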

Describe the impact of skewness and kurtosis on parametric testing.

  • They can improve the accuracy of parametric testing.
  • They can invalidate the results of parametric testing.
  • They can reduce the variance in parametric testing.
  • They do not impact parametric testing.
Skewness and kurtosis can invalidate the results of parametric testing. Many parametric tests assume that the data follows a normal distribution. If the data is highly skewed or has high kurtosis, these assumptions are violated, and the test results may not be valid.
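Checking skewness before running a parametric test is straightforward; a minimal moment-based sketch in pure Python (libraries such as SciPy provide the equivalent `scipy.stats.skew`):

```python
import statistics

def skewness(xs):
    """Moment-based (population) skewness: m3 / m2**1.5. Zero for symmetric data."""
    mu = statistics.fmean(xs)
    m2 = sum((x - mu) ** 2 for x in xs) / len(xs)
    m3 = sum((x - mu) ** 3 for x in xs) / len(xs)
    return m3 / m2 ** 1.5

assert abs(skewness([1, 2, 3, 4, 5])) < 1e-9   # symmetric: skewness ~ 0
assert skewness([1, 1, 1, 2, 10]) > 1          # long right tail: strong positive skew
```

A large absolute skewness is a warning sign that the normality assumption behind a parametric test may be violated.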

If a distribution is leptokurtic, what does it signify about the data?

  • The data has a high variance.
  • The data is heavily tailed with potential outliers.
  • The data is less outlier-prone.
  • The data is normally distributed.
A leptokurtic distribution signifies that the data has heavy tails and a sharp peak, meaning it contains substantial outliers (extreme values). This kind of distribution often indicates more frequent large deviations from the mean.
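Heavy-tailedness can be quantified with excess kurtosis, which is 0 for a normal distribution and positive for leptokurtic data (a minimal pure-Python sketch; libraries such as SciPy provide `scipy.stats.kurtosis`):

```python
import statistics

def excess_kurtosis(xs):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    mu = statistics.fmean(xs)
    m2 = sum((x - mu) ** 2 for x in xs) / len(xs)
    m4 = sum((x - mu) ** 4 for x in xs) / len(xs)
    return m4 / m2 ** 2 - 3

# Tightly clustered values plus a few extremes -> heavy tails (leptokurtic)
assert excess_kurtosis([0] * 20 + [10, -10]) > 0
# Evenly spread values -> light tails (platykurtic)
assert excess_kurtosis([1, 2, 3, 4, 5]) < 0
```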

A potential drawback of the Z-score method for outlier detection is that it assumes the data is _______ distributed.

  • exponentially
  • logistically
  • normally
  • uniformly
The Z-score method assumes that the data is normally distributed. This assumption does not hold for all datasets, which is the method's main drawback.
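A minimal sketch of the method (hypothetical helper; the conventional cutoff is |z| > 3, which is only meaningful if the data is roughly normal):

```python
import statistics

def zscore_outliers(xs, threshold=3.0):
    """Flag points whose |z-score| exceeds the threshold (assumes ~normal data)."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return [x for x in xs if abs((x - mu) / sigma) > threshold]

# Nineteen typical values and one extreme one: only the extreme value is flagged
assert zscore_outliers([10] * 19 + [100]) == [100]
```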

Can the IQR method be applied to multimodal data sets for outlier detection? Explain.

  • No, it can only be applied to normally distributed data
  • No, it only works with unimodal distributions
  • Yes, but it may not be effective
  • Yes, it works well with any distribution
The IQR method can be applied to multimodal datasets for outlier detection, but it may not be effective: the quartiles are computed globally across all modes, so the fences blur the separate clusters, and values that are unusual relative to their own mode can still fall inside the global fences.
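A sketch of the standard 1.5 × IQR fence (hypothetical helper; note that `statistics.quantiles` uses the "exclusive" method by default, so cut points may differ slightly from other libraries):

```python
import statistics

def iqr_outliers(xs):
    """Classic 1.5*IQR fence; percentile-based, so no normality assumption."""
    q1, _, q3 = statistics.quantiles(xs, n=4)  # 25th, 50th, 75th percentiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]

# Works on any distribution, but the fences are global, not per-mode
assert iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 50]) == [50]
```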

In _____ scaling, we scale the data between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).

  • Min-Max
  • Robust
  • Standard
  • Z-score
In Robust scaling, the data is centered on the median and scaled by the interquartile range, the spread between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile). Because these statistics are insensitive to extreme values, this approach minimizes the impact of outliers.
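A sketch of robust scaling in pure Python (center on the median, divide by the IQR, mirroring the defaults of scikit-learn's `RobustScaler`; quantile interpolation may differ slightly between implementations):

```python
import statistics

def robust_scale(xs):
    """(x - median) / IQR: median maps to 0, quartiles to roughly +/-0.5."""
    q1, med, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    return [(x - med) / iqr for x in xs]

scaled = robust_scale(list(range(1, 10)))
assert scaled[4] == 0.0  # the median value scales to exactly zero
```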

What is the primary goal of Exploratory Data Analysis (EDA)?

  • To confirm a pre-existing hypothesis
  • To create an aesthetic representation of the data
  • To make precise predictions about future events
  • To understand the underlying structure of the data
The primary goal of EDA is to understand the underlying structure of the data, including distribution, variability, and relationships among variables. EDA allows analysts to make informed decisions about further data processing steps and analysis.
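In practice, a first EDA pass often starts with a simple numeric summary of each variable; a minimal sketch (`quick_summary` is a hypothetical helper; pandas' `DataFrame.describe()` plays a similar role):

```python
import statistics

def quick_summary(xs):
    """A minimal EDA-style numeric summary: size, center, and spread."""
    return {
        "n": len(xs),
        "mean": statistics.fmean(xs),
        "stdev": statistics.pstdev(xs),
        "min": min(xs),
        "median": statistics.median(xs),
        "max": max(xs),
    }

summary = quick_summary([1, 2, 3, 4, 5])
assert summary["mean"] == 3 and summary["median"] == 3
```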

Data that follows a _____ Distribution has its values spread evenly across the range of possible outcomes.

  • Binomial
  • Normal
  • Poisson
  • Uniform
Data that follows a Uniform Distribution has its values spread evenly across the range of possible outcomes.
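This even spread can be seen by simulation: draws from `random.uniform` land in equal-width bins at roughly equal rates (a quick sketch; the seed, sample size, and tolerance are arbitrary choices):

```python
import random

random.seed(42)
sample = [random.uniform(0.0, 10.0) for _ in range(10_000)]

# Count how many draws land in each of 5 equal-width bins
counts = [0] * 5
for x in sample:
    counts[min(int(x / 2), 4)] += 1

# Each bin holds roughly 1/5 of the draws
assert all(abs(c - 2000) < 200 for c in counts)
```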

What is a key difference between qualitative data and quantitative data when it comes to analysis methods?

  • All types of data are analyzed in the same way
  • Qualitative data is always easier to analyze
  • Qualitative data typically requires textual analysis, while quantitative data can be analyzed mathematically
  • Quantitative data can't be used for statistical analysis
Qualitative data often requires textual or thematic analysis, categorizing the data based on traits or characteristics. Quantitative data, being numerical, can be analyzed using mathematical or statistical methods.
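A tiny illustration of the difference (the variable names and values are made up): categorical responses are counted or grouped, while numeric ratings support arithmetic directly:

```python
import statistics
from collections import Counter

responses = ["satisfied", "satisfied", "neutral", "unsatisfied"]  # qualitative
ratings = [4, 5, 3, 2]                                            # quantitative

# Qualitative: categorize and count
assert Counter(responses).most_common(1)[0][0] == "satisfied"
# Quantitative: apply mathematical/statistical operations
assert statistics.fmean(ratings) == 3.5
```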

The _________ method in regression analysis can help reduce the impact of Multicollinearity.

  • Chi-Square
  • Least squares
  • Logistic Regression
  • Ridge Regression
Ridge Regression is a regularization technique that can help reduce the impact of multicollinearity. It adds a penalty proportional to the square of the magnitude of the coefficients to the loss function, thereby shrinking the coefficients of correlated predictors and reducing their impact.
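To see the shrinkage effect, consider the one-feature closed form, where the ridge estimate is Sxy / (Sxx + λ) rather than the least-squares Sxy / Sxx (a toy sketch with a made-up helper; real data would use a library such as scikit-learn's `Ridge`):

```python
def ridge_coef_1d(xs, ys, lam):
    """Closed-form ridge estimate for one centered feature: Sxy / (Sxx + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [-2, -1, 0, 1, 2]
ys = [-4, -2, 0, 2, 4]  # true slope 2

ols = ridge_coef_1d(xs, ys, 0.0)      # lam=0 recovers ordinary least squares
shrunk = ridge_coef_1d(xs, ys, 10.0)  # the penalty shrinks the coefficient toward 0
assert ols == 2.0
assert 0 < shrunk < ols
```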