You are analyzing the number of calls received by a call center per hour. Which distribution would be most suitable for modeling this data and why?

Binomial Distribution because it represents the number of successes in a given number of trials
Normal Distribution because it represents continuous data
Poisson Distribution because it models the number of events occurring in a fixed interval of time
Uniform Distribution because all outcomes are equally likely

The Poisson Distribution is the most suitable choice because it models the count of events (calls) occurring in a fixed interval of time (one hour), assuming the calls arrive independently at a roughly constant average rate.
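
As a rough illustration of the Poisson model, the sketch below computes a few call-count probabilities from the Poisson probability mass function. The rate of 12 calls per hour and the helper name `poisson_pmf` are assumptions made for this example, not values from the question.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing exactly k events when the mean count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Hypothetical average rate: 12 calls per hour (assumed for illustration).
calls_per_hour = 12.0

# Probability of receiving exactly 10 calls in one hour.
p_exactly_10 = poisson_pmf(10, calls_per_hour)

# Probability of receiving more than 20 calls in one hour.
p_more_than_20 = 1.0 - sum(poisson_pmf(k, calls_per_hour) for k in range(21))

print(f"P(X = 10) = {p_exactly_10:.4f}")
print(f"P(X > 20) = {p_more_than_20:.4f}")
```

A tail probability like `p_more_than_20` is the kind of quantity a staffing decision might use, provided the constant-rate assumption is reasonable for the hour being modeled.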

How does standardization (z-score) affect the distribution of data?

It doesn't affect the shape of the distribution
It makes the distribution normal
It makes the distribution uniform
It skews the distribution

Standardization does not change the shape of a feature's distribution; it is a linear transformation that only shifts and rescales the values. Skewness and kurtosis stay the same, while the data end up centered at zero with a standard deviation of 1.
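
One way to check this is to compare skewness before and after computing z-scores. The sketch below uses a small made-up, right-skewed sample; the helper names `skewness` and `standardize` are introduced here for illustration only.

```python
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def standardize(values):
    """Return z-scores: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Made-up right-skewed sample (assumed for illustration).
data = [1, 2, 2, 3, 3, 3, 4, 10, 25]
z = standardize(data)

print("mean of z-scores: ", round(statistics.fmean(z), 6))
print("stdev of z-scores:", round(statistics.pstdev(z), 6))
print("skewness before:  ", round(skewness(data), 6))
print("skewness after:   ", round(skewness(z), 6))
```

The two skewness values agree up to floating-point error, so a skewed feature stays skewed after standardization.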

You are given a dataset for an upcoming data analysis project. What initial EDA steps would you take before moving to model building?

Explore the structure of the dataset, summarize the data, and create visualizations
Perform a detailed statistical analysis
Run a quick ML model to test the data
Start cleaning and wrangling the data

Before moving to model building, it's important to first understand the dataset you're working with. The initial EDA steps would typically include exploring the structure of the dataset, summarizing the data (such as calculating central tendency measures and dispersion), and creating visualizations to uncover patterns, trends, and relationships.
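
A minimal sketch of those first steps, using only the Python standard library, might look like the following. The toy rows, column names, and the `summarize` helper are all hypothetical; in practice a library such as pandas would usually handle this.

```python
import statistics
from collections import Counter

# Hypothetical toy dataset; None marks a missing value.
rows = [
    {"age": 34, "city": "Pune", "income": 52000},
    {"age": 29, "city": "Delhi", "income": None},
    {"age": 45, "city": "Pune", "income": 88000},
    {"age": None, "city": "Mumbai", "income": 61000},
    {"age": 52, "city": "Delhi", "income": 97000},
]

# 1. Structure: size, column names, and missing values per column.
columns = list(rows[0].keys())
n_rows = len(rows)
missing = {c: sum(r[c] is None for r in rows) for c in columns}
print(n_rows, "rows; columns:", columns, "; missing:", missing)

# 2. Summary: central tendency and dispersion of a numeric column.
def summarize(col):
    vals = [r[col] for r in rows if r[col] is not None]
    return {
        "count": len(vals),
        "mean": statistics.fmean(vals),
        "median": statistics.median(vals),
        "stdev": statistics.stdev(vals),
        "min": min(vals),
        "max": max(vals),
    }

print("age:", summarize("age"))

# 3. Visualization: a crude text bar chart of a categorical column.
city_counts = Counter(r["city"] for r in rows)
for city, count in city_counts.most_common():
    print(f"{city:<8}{'#' * count}")
```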

The missing data mechanism where missingness is related only to the observed data is referred to as _________.

All missing data
MAR
MCAR
NMAR

In MAR (Missing at Random), the probability that a value is missing depends only on the observed data, not on the missing values themselves.

Which measure of dispersion is defined as the difference between the largest and smallest values in a data set?

Interquartile Range (IQR)
Range
Standard Deviation
Variance

The "Range" is the measure of dispersion that is defined as the difference between the largest and smallest values in a data set.
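
For comparison, the sketch below computes the range alongside the other dispersion measures listed in the options. The sample values are made up, and the quartiles rely on the default method of `statistics.quantiles`, so other tools may report slightly different IQR values.

```python
import statistics

# Made-up sample (assumed for illustration).
data = [4, 8, 15, 16, 23, 42]

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# Interquartile range: third quartile minus first quartile.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Sample variance and sample standard deviation.
variance = statistics.variance(data)
std_dev = statistics.stdev(data)

print("range:", data_range)
print("IQR:", iqr)
print("variance:", variance)
print("standard deviation:", std_dev)
```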

How does Robust scaling minimize the effect of outliers?

By ignoring them during the scaling process
By removing the outliers
By scaling based on the median and interquartile range instead of mean and variance
By transforming the outliers

Robust scaling minimizes the effect of outliers by centering on the median and dividing by the interquartile range (IQR), instead of using the mean and standard deviation as standardization does. The IQR is the distance between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile). Because the median and IQR depend on the middle of the data, a few extreme values change them very little, so the scaling of the remaining points is robust to outliers.
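
The difference is easy to see on a small sample with one extreme value. The sketch below is a hand-rolled illustration, not a library implementation; the data, the helper names, and the choice of the `inclusive` quantile method are assumptions.

```python
import statistics

def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]

def z_score(values):
    """Center on the mean and divide by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Nine ordinary values plus one extreme outlier (assumed for illustration).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

robust = robust_scale(data)
standard = z_score(data)

# Spread of the nine ordinary values after each kind of scaling.
robust_spread = max(robust[:9]) - min(robust[:9])
standard_spread = max(standard[:9]) - min(standard[:9])

print("robust spread of inliers:  ", round(robust_spread, 4))
print("z-score spread of inliers: ", round(standard_spread, 4))
```

With z-scores, the outlier inflates the mean and standard deviation so much that the nine ordinary values are squeezed into a very narrow band. With robust scaling they keep a usable spread, while the outlier itself remains large rather than being removed.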

Which of the following is a type of data distribution?

Age Bracket Distribution
Binomial Distribution
Household Distribution
Sales Distribution

The Binomial Distribution is a probability distribution that describes the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success. The other options are not standard probability distributions.
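
To make the definition concrete, here is a small sketch of the binomial probability mass function. The coin-flip setting (10 trials with success probability 0.5) and the helper name `binomial_pmf` are assumptions chosen for illustration.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Hypothetical example: 10 fair coin flips, counting heads as successes.
n_trials = 10
p_success = 0.5

# Probability of exactly 5 heads.
p_five_heads = binomial_pmf(5, n_trials, p_success)

# The probabilities over all possible counts 0..n should sum to 1.
total = sum(binomial_pmf(k, n_trials, p_success) for k in range(n_trials + 1))

print(f"P(5 heads in 10 flips) = {p_five_heads:.4f}")
print(f"sum over all k = {total:.4f}")
```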

What is the effect of 'binning' on the overall variance of the dataset?

It can either increase or decrease the variance
It decreases the variance
It does not affect the variance
It increases the variance

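Binning typically reduces the variance of a dataset. Every value in a bin is replaced by a single representative, such as the bin mean or median, so the variation within each bin is removed and only the variation between bins remains. Extreme values are pulled toward their bin's representative, which reduces the overall spread.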
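
The effect can be checked numerically. The sketch below assumes equal-width bins and replaces each value with its bin mean; the data and the helper `bin_by_mean` are made up for this example, and the helper assumes the values are not all identical.

```python
import statistics

def bin_by_mean(values, n_bins):
    """Replace each value with the mean of its equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    # Assign each value to a bin; the maximum falls into the last bin.
    indices = [min(int((v - lo) / width), n_bins - 1) for v in values]
    bin_means = {}
    for b in set(indices):
        members = [v for v, i in zip(values, indices) if i == b]
        bin_means[b] = statistics.fmean(members)
    return [bin_means[i] for i in indices]

# Made-up sample with one large value (assumed for illustration).
data = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 90]
binned = bin_by_mean(data, n_bins=4)

print("variance before binning:", round(statistics.pvariance(data), 2))
print("variance after binning: ", round(statistics.pvariance(binned), 2))
```

Replacing values with bin means keeps the overall mean unchanged and removes the within-bin variance, so the variance cannot increase. Other representatives, such as bin midpoints, do not carry the same guarantee.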

How does the assumption of MAR differ from MCAR in terms of data missingness?

MAR assumes the missingness is only related to the observed data
MAR assumes the missingness is related to the unobserved data
MAR assumes the missingness is unrelated to any variable
There's no difference between MAR and MCAR

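In MCAR (Missing Completely at Random), the missingness is completely random and does not depend on any variable, observed or unobserved. In MAR, the probability of missingness may depend on observed variables, but once those are accounted for it does not depend on the unobserved (missing) values.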
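
A small simulation can make the distinction visible. In the sketch below, a hypothetical income value is dropped either with a fixed probability (MCAR) or with a probability that depends on an always-observed age column (MAR). The variable names, the age threshold, and the probabilities are assumptions chosen for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

# Age is always observed in this hypothetical dataset.
ages = [rng.randint(20, 70) for _ in range(n)]

def missing_rate_by_age_group(is_missing):
    """Return the missing rate for (age <= 50, age > 50)."""
    young = [m for a, m in zip(ages, is_missing) if a <= 50]
    old = [m for a, m in zip(ages, is_missing) if a > 50]
    return sum(young) / len(young), sum(old) / len(old)

# MCAR: every income value is dropped with the same probability.
mcar_missing = [rng.random() < 0.3 for _ in range(n)]

# MAR: the drop probability depends only on the observed age.
mar_missing = [rng.random() < (0.6 if a > 50 else 0.1) for a in ages]

mcar_young, mcar_old = missing_rate_by_age_group(mcar_missing)
mar_young, mar_old = missing_rate_by_age_group(mar_missing)

print(f"MCAR missing rate: young={mcar_young:.2f}, old={mcar_old:.2f}")
print(f"MAR  missing rate: young={mar_young:.2f}, old={mar_old:.2f}")
```

Under MCAR the two age groups show about the same missing rate, while under MAR the rate differs sharply by age, which is something an analyst can detect and adjust for because age is observed.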

__________ missing data occurs when the probability of an observation being missing depends on both observed and unobserved data.

All missing data
MAR
MCAR
NMAR

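NMAR (Not Missing at Random, also written MNAR for Missing Not at Random) occurs when the probability of a value being missing depends on the unobserved data, that is, on the missing values themselves, possibly in addition to the observed data.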
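
The practical consequence is bias that cannot be corrected from the observed data alone. The sketch below simulates a hypothetical income variable in which high incomes are more likely to go unreported; the distribution, threshold, and probabilities are assumptions for illustration.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical true incomes, most of which we would like to observe.
incomes = [rng.gauss(50_000, 15_000) for _ in range(10_000)]

# NMAR: the chance of a value being missing depends on the value itself.
# Here, incomes above 60,000 are dropped far more often than the rest.
observed = [
    x for x in incomes
    if rng.random() >= (0.7 if x > 60_000 else 0.1)
]

true_mean = statistics.fmean(incomes)
observed_mean = statistics.fmean(observed)

print(f"true mean:     {true_mean:,.0f}")
print(f"observed mean: {observed_mean:,.0f}")
print(f"values kept:   {len(observed)} of {len(incomes)}")
```

The observed mean underestimates the true mean because the missing values are systematically the large ones.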

Which measure of central tendency divides a data set into two equal halves?

Mean
Median
Mode

The "Median" is the measure of central tendency that divides a data set into two equal halves. It is the middle score for a set of ordered data such that 50% of the scores are above it, and 50% are below it.
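
As a quick illustration, the sketch below computes the median for samples with an odd and an even number of values, and compares it with the mean when an outlier is present. All sample values are made up.

```python
import statistics

# Odd number of values: the median is the middle value after sorting.
odd_sample = [7, 1, 5, 3, 9]
odd_median = statistics.median(odd_sample)    # 5

# Even number of values: the median is the average of the two middle values.
even_sample = [7, 1, 5, 3]
even_median = statistics.median(even_sample)  # 4.0

# With an outlier, the mean shifts a lot while the median barely moves.
salaries = [30, 32, 35, 38, 400]
salary_mean = statistics.fmean(salaries)      # 107.0
salary_median = statistics.median(salaries)   # 35

print(odd_median, even_median, salary_mean, salary_median)
```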

The _________ method in regression analysis can help reduce the impact of Multicollinearity.

Chi-Square
Least squares
Logistic Regression
Ridge Regression

Ridge Regression is a regularization technique that can help reduce the impact of multicollinearity. It adds a penalty proportional to the sum of the squared coefficients (an L2 penalty) to the least-squares loss function. This shrinks the coefficients of correlated predictors and makes their estimates more stable.
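
The shrinkage can be seen in a tiny example with two nearly identical predictors. The sketch below solves the ridge normal equations, (XᵀX + λI)β = Xᵀy, directly for the two-predictor case without an intercept. The data, the penalty value, and the helper name `ridge_two_predictors` are assumptions made for illustration.

```python
import math

def ridge_two_predictors(x1, x2, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for two predictors, no intercept."""
    a = sum(u * u for u in x1) + lam
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + lam
    r1 = sum(u * t for u, t in zip(x1, y))
    r2 = sum(v * t for v, t in zip(x2, y))
    det = a * d - b * b
    # Apply the inverse of the 2x2 matrix [[a, b], [b, d]] to (r1, r2).
    return ((d * r1 - b * r2) / det, (a * r2 - b * r1) / det)

# Made-up data: x2 is almost a copy of x1, so the predictors are collinear.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.01, 1.99, 3.02, 3.98, 5.0]
y = [2.1, 3.8, 6.15, 7.9, 10.05]

ols_beta = ridge_two_predictors(x1, x2, y, lam=0.0)    # ordinary least squares
ridge_beta = ridge_two_predictors(x1, x2, y, lam=1.0)  # ridge with lambda = 1

print("OLS coefficients:  ", ols_beta)
print("ridge coefficients:", ridge_beta)
print("OLS norm:  ", math.hypot(*ols_beta))
print("ridge norm:", math.hypot(*ridge_beta))
```

Setting `lam=0.0` recovers ordinary least squares, so the two results differ only by the penalty. The ridge coefficient vector has a smaller overall magnitude, which is the shrinkage that makes the estimates less sensitive to noise when the predictors are collinear.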
