What is a primary assumption when using regression imputation?

  • All data is normally distributed
  • Missing data is missing completely at random (MCAR)
  • Missing values are negligible
  • The relationship between variables is linear
A primary assumption when using regression imputation is that the relationship between variables is linear. This is because regression imputation uses a regression model to predict missing values, and the basic form of regression models assumes a linear relationship between predictor and response variables.
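A minimal sketch of the idea, using only NumPy (the data and the 2x + 1 relationship are made up for illustration): fit a line on the complete cases, then fill the missing value from that line. The imputation step is exactly where the linearity assumption matters.

```python
import numpy as np

# Toy data: y is roughly 2*x + 1, with one missing value (np.nan).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, np.nan, 9.2, 11.0])

# Fit a simple linear regression on the complete cases only.
observed = ~np.isnan(y)
slope, intercept = np.polyfit(x[observed], y[observed], 1)

# Impute the missing value from the fitted line.
y_imputed = y.copy()
y_imputed[~observed] = slope * x[~observed] + intercept
```

If the true relationship were strongly non-linear, the fitted line would systematically mis-predict the missing values.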

You are working on a dataset and found that the model performance is poor. On further inspection, you found some data points that are far from the rest. What could be a possible reason for the poor performance of your model?

  • Outliers
  • Overfitting
  • Underfitting
The poor performance of the model might be due to outliers in the dataset. Outliers can distort parameter estimates and disproportionately influence loss functions such as squared error, so even a few extreme points can significantly degrade a model's performance.
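One common way to flag such points during inspection is the 1.5 × IQR rule (the data here is a made-up example with one obvious extreme value):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 is far from the rest

# Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
```

Whether to remove, cap, or keep the flagged points depends on whether they are errors or genuine extreme observations.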

As a data scientist, you've realized that your dataset contains missing values. How would you handle this situation as part of your EDA process?

  • Always replace missing values with the mean or median
  • Choose an appropriate imputation method depending on the nature of the data and the type of missingness
  • Ignore the missing values and proceed with analysis
  • Remove all instances with missing values
Handling missing values is an important part of the EDA process. The method used to handle them depends on the nature of the data and the type of missingness (MCAR, MAR, or NMAR). Simple mean/median/mode imputation is reasonable when data is MCAR; regression imputation or multiple imputation can exploit relationships with observed variables when data is MAR; and NMAR data requires methods that explicitly model the missingness mechanism itself.
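A small pandas sketch of the two simplest options (the income figures are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"income": [52_000, 48_000, None, 61_000, None, 55_000]})

# Mean imputation -- a reasonable default if missingness is plausibly MCAR.
mean_filled = df["income"].fillna(df["income"].mean())

# Median imputation -- more robust when the variable is skewed.
median_filled = df["income"].fillna(df["income"].median())
```

For MAR data, a model-based method (such as the regression imputation sketched earlier) that uses the other observed columns will generally do better than either of these.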

If the variance of a data set is zero, then all data points are ________.

  • Equal
  • Infinite
  • Negative
  • Positive
If the variance of a data set is zero, then all data points are equal. Variance is a measure of how far a set of numbers is spread out from their average value; a variance of zero indicates that all the values within a set of data are identical.
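This is easy to verify numerically (toy arrays, for illustration):

```python
import numpy as np

identical = np.array([7.0, 7.0, 7.0, 7.0])
varied = np.array([5.0, 7.0, 9.0])

# Every deviation from the mean is zero, so the variance is zero.
v_identical = np.var(identical)

# Any spread at all produces a strictly positive variance.
v_varied = np.var(varied)
```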

A market research survey collects data on customer age, gender, and preference for a product (Yes/No). Identify the types of data present in this survey.

  • Age: continuous, Gender: nominal, Preference: ordinal
  • Age: nominal, Gender: ordinal, Preference: interval
  • Age: ordinal, Gender: interval, Preference: ratio
  • Age: ratio, Gender: ordinal, Preference: nominal
Age is a continuous data type because it can take on any value within a range. Gender is nominal as it's categorical with no order or priority. Preference is ordinal as it's categorical with a clear order (Yes is preferred to No).
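In pandas, the distinction between nominal and ordinal can be made explicit with categorical dtypes. The sketch below follows the classification in the answer above, treating preference as ordered (the survey values are made up):

```python
import pandas as pd

survey = pd.DataFrame({
    "age": [23.5, 41.0, 35.2],          # continuous (float)
    "gender": ["F", "M", "F"],          # nominal: categories, no order
    "preference": ["No", "Yes", "Yes"], # treated as ordered, per the answer above
})

# Nominal: a plain categorical with no ordering.
survey["gender"] = pd.Categorical(survey["gender"])

# Ordinal: an ordered categorical, so comparisons like > are meaningful.
survey["preference"] = pd.Categorical(
    survey["preference"], categories=["No", "Yes"], ordered=True
)
```

Declaring the ordering matters downstream: ordered categoricals support comparisons and order-aware encodings, while nominal ones do not.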

Which measure of dispersion is defined as the difference between the largest and smallest values in a data set?

  • Interquartile Range (IQR)
  • Range
  • Standard Deviation
  • Variance
The range is the measure of dispersion defined as the difference between the largest and smallest values in a data set.
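In NumPy this is simply max minus min; `np.ptp` ("peak to peak") computes the same thing directly (the data is a toy example):

```python
import numpy as np

data = np.array([4, 9, 2, 15, 7])

# Range = largest value minus smallest value.
data_range = data.max() - data.min()

# np.ptp computes the same quantity in one call.
same_range = np.ptp(data)
```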

The missing data mechanism where missingness is related only to the observed data is referred to as _________.

  • All missing data
  • MAR
  • MCAR
  • NMAR
In MAR (Missing at Random), the missingness is related only to the observed data.
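A small simulation makes the definition concrete (all variables and rates here are invented for illustration): income goes missing with a probability that depends only on the observed variable age, never on income itself, which is exactly the MAR mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(20, 70, size=1000)                    # fully observed
income = 1000 * age + rng.normal(0, 5000, size=1000)    # partially observed

# MAR: the missingness probability depends only on the OBSERVED age
# (older respondents skip the question more often), not on income.
p_missing = np.where(age > 50, 0.4, 0.05)
missing = rng.random(1000) < p_missing
income_mar = np.where(missing, np.nan, income)

# Missingness rates by age group.
rate_older = np.isnan(income_mar[age > 50]).mean()
rate_younger = np.isnan(income_mar[age <= 50]).mean()
```

Under MCAR the two rates would be equal in expectation; under NMAR the probability would depend on the (unobserved) income values themselves.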

You are given a dataset for an upcoming data analysis project. What initial EDA steps would you take before moving to model building?

  • Explore the structure of the dataset, summarize the data, and create visualizations
  • Perform a detailed statistical analysis
  • Run a quick ML model to test the data
  • Start cleaning and wrangling the data
Before moving to model building, it's important to first understand the dataset you're working with. The initial EDA steps would typically include exploring the structure of the dataset, summarizing the data (such as calculating central tendency measures and dispersion), and creating visualizations to uncover patterns, trends, and relationships.
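The first of those steps might look like this in pandas (the DataFrame is a toy example; visualization calls such as `df.hist()` are omitted so the snippet stays self-contained):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 41, 35, None],
    "city": ["NY", "LA", "NY", "SF"],
})

# Structure: dimensions and missing values per column.
shape = df.shape
missing_per_column = df.isna().sum()

# Summary statistics (count, mean, std, quartiles) for a numeric column.
summary = df["age"].describe()
```

Skimming these outputs before modeling surfaces problems early: unexpected dtypes, missing values, and suspicious ranges.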

How does standardization (z-score) affect the distribution of data?

  • It doesn't affect the shape of the distribution
  • It makes the distribution normal
  • It makes the distribution uniform
  • It skews the distribution
Standardization does not change the shape of the distribution of a feature; it only rescales it. Skewness and kurtosis are unchanged, but the data is centered at zero with a standard deviation of 1.
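This can be checked directly on a deliberately skewed (exponential) sample: after the z-score transform, the mean is ~0 and the standard deviation ~1, but the skewness is unchanged.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=10_000)  # right-skewed sample

# Z-score standardization.
z = (data - data.mean()) / data.std()

def skewness(x):
    """Standardized third moment."""
    return np.mean(((x - x.mean()) / x.std()) ** 3)

skew_before = skewness(data)
skew_after = skewness(z)
```

Because the z-score is an affine transform, every shape statistic (skewness, kurtosis) is preserved exactly; only location and scale change.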

You are analyzing the number of calls received by a call center per hour. Which distribution would be most suitable for modeling this data and why?

  • Binomial Distribution because it represents the number of successes in a given number of trials
  • Normal Distribution because it represents continuous data
  • Poisson Distribution because it models the number of events occurring in a fixed interval of time
  • Uniform Distribution because all outcomes are equally likely
The Poisson Distribution is most suitable for modeling the number of calls received by a call center per hour because it models the number of events (calls) occurring in a fixed interval of time (per hour).
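A quick simulation illustrates the defining property used here (the rate of 12 calls per hour is a hypothetical figure): for a Poisson distribution, the mean and the variance both equal the rate parameter lambda.

```python
import numpy as np

rng = np.random.default_rng(7)
lam = 12  # hypothetical average number of calls per hour

# Simulate 100,000 hours of call counts.
calls_per_hour = rng.poisson(lam=lam, size=100_000)

sample_mean = calls_per_hour.mean()
sample_var = calls_per_hour.var()
```

If the observed variance were much larger than the mean (overdispersion), a negative binomial distribution would usually be a better fit than the Poisson.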