How is the shape of a Normal Distribution usually described?

  • Bell-shaped
  • Skewed to the left
  • Skewed to the right
  • Uniformly flat
A Normal Distribution is described as bell-shaped. It is symmetric around the mean, and most of the data falls close to the mean with fewer values further away.
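
As a rough illustration of the symmetry and concentration described above, the sketch below uses `statistics.NormalDist` from Python's standard library. The mean of 0 and standard deviation of 1 are arbitrary choices for the example, not values taken from the question.

```python
from statistics import NormalDist

# Standard normal, chosen purely for illustration (assumed parameters).
dist = NormalDist(mu=0.0, sigma=1.0)

# Symmetry: the density is equal at the same distance either side of the mean.
left_density = dist.pdf(-1.5)
right_density = dist.pdf(1.5)

# Concentration: probability mass within one and two standard deviations.
within_one_sd = dist.cdf(1.0) - dist.cdf(-1.0)
within_two_sd = dist.cdf(2.0) - dist.cdf(-2.0)

print(f"pdf(-1.5) = {left_density:.4f}, pdf(1.5) = {right_density:.4f}")
print(f"within 1 sd: {within_one_sd:.3f}, within 2 sd: {within_two_sd:.3f}")
```

Roughly 68% of the probability lies within one standard deviation of the mean and roughly 95% within two, which is what gives the curve its bell shape.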

Suppose your machine learning model shows a significant shift in performance when transitioning from the training set to the test set. How could mishandling missing data contribute to this issue?

  • It may have caused an imbalance in the data distribution between the sets.
  • It may have caused overfitting.
  • It may have led to the model learning irrelevant patterns.
  • It may have led to underfitting.
If missing data is not handled consistently between the training and test sets (for example, values are imputed in one set but rows are dropped in the other), the two sets can end up with different data distributions, causing the model's performance to shift.
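
One common way to keep the handling consistent is to learn the fill value from the training set only and reuse it on the test set. The following is a minimal sketch under that assumption; the helper names `fit_mean` and `impute` and the sample values are hypothetical, and mean imputation is used only because it is simple.

```python
from statistics import mean

def fit_mean(values):
    """Return the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    return mean(observed)

def impute(values, fill):
    """Replace every missing (None) entry with the given fill value."""
    return [fill if v is None else v for v in values]

# Hypothetical feature column, split into training and test portions.
train = [10.0, 12.0, None, 14.0]
test = [11.0, None, 13.0]

# The fill value is learned from the training data only...
train_mean = fit_mean(train)

# ...and the same value is applied to both sets.
train_filled = impute(train, train_mean)
test_filled = impute(test, train_mean)

print(train_filled)
print(test_filled)
```

Computing a separate fill value on the test set, or dropping rows in one set while imputing in the other, would be the kind of inconsistency the explanation above warns about.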

What kind of bias might be introduced into a model if missing data is not appropriately addressed?

  • All of the above.
  • Confirmation bias.
  • Observation bias.
  • Sampling bias.
Inappropriate handling of missing data can lead to sampling bias: the model is trained on a non-representative subset of the data, so its predictions may be biased.

What are the key components to focus on during the 'communicate' step in EDA?

  • Cleaning and transforming data
  • Ensuring the insights are effectively conveyed to relevant stakeholders
  • Only sharing the raw data
  • Reordering the EDA steps
During the communication phase of the EDA process, the key focus is to ensure that the insights, findings, or conclusions drawn from the analysis are effectively conveyed to the relevant stakeholders. This might involve presenting the insights in a simple and understandable manner, making use of visualizations, and tailoring the communication to the audience's needs and context.

If a row with at least one missing value is deleted, the process is known as _____.

  • Listwise Deletion
  • Mean Imputation
  • Mode Imputation
  • Pairwise Deletion
If a row with at least one missing value is deleted, the process is known as 'listwise deletion'. Although it is a simple method, it discards the observed values in every dropped row, and it can bias the results if the data is not missing completely at random.
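
A minimal sketch of listwise deletion in plain Python follows. The records and the helper name `listwise_delete` are hypothetical, and missing values are assumed to be represented as `None`.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
]

def listwise_delete(records):
    """Keep only the records in which every field is observed."""
    return [r for r in records if all(v is not None for v in r.values())]

complete = listwise_delete(rows)

print(f"kept {len(complete)} of {len(rows)} rows")
```

Note that the two dropped rows each still contained one observed value, which is the information loss mentioned above.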

Given a set of data that follows a Binomial Distribution, how would you estimate the parameters of the distribution?

  • By applying the Central Limit Theorem
  • By computing the mean and standard deviation
  • By taking the square root of the data
  • By using a chi-squared test
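The parameters of a Binomial Distribution can be estimated by computing the mean and standard deviation of the data. For n trials with success probability p, the mean is np and the variance is np(1 - p), so matching these expressions to the sample mean and variance (the method of moments) yields estimates of n and p.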
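
The sketch below shows one way to turn a sample mean and variance into parameter estimates by the method of moments, using the binomial identities mean = np and variance = np(1 - p). The function name and the sample counts are hypothetical. In practice n is often known in advance, in which case p can be estimated simply as mean / n.

```python
from statistics import mean, pvariance

def binomial_moments(samples):
    """Method-of-moments estimates (n_hat, p_hat) for binomial counts.

    Solves mean = n*p and variance = n*p*(1 - p) for n and p.
    """
    m = mean(samples)
    v = pvariance(samples)
    if not 0 < v < m:
        # A binomial variance is always below the mean; otherwise the
        # moment equations have no valid solution.
        raise ValueError("sample variance must be positive and below the mean")
    p_hat = 1 - v / m
    n_hat = m / p_hat
    return n_hat, p_hat

# Hypothetical success counts from repeated experiments.
counts = [3, 4, 5, 6, 7]
n_hat, p_hat = binomial_moments(counts)

print(f"n is approximately {n_hat:.2f}, p is approximately {p_hat:.2f}")
```

The estimate of n is generally not an integer and can be unstable for small samples, so it is best read as a rough guide rather than an exact value.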

The _____ of a histogram can significantly influence the representation of data.

  • Bin width
  • Color
  • Shape
  • Size
The bin width of a histogram is critical in data representation. If it's too large, it may smooth over the details of the distribution. If it's too small, the histogram may be too cluttered or noisy.
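
To make the effect of bin width concrete, here is a small hand-rolled histogram in plain Python. The `histogram` helper and the data are hypothetical, and bins are assumed to start at the minimum value.

```python
import math

def histogram(data, bin_width):
    """Count values per bin; bins start at min(data) and have equal width."""
    lo = min(data)
    counts = {}
    for x in data:
        idx = math.floor((x - lo) / bin_width)
        counts[idx] = counts.get(idx, 0) + 1
    n_bins = max(counts) + 1
    return [counts.get(i, 0) for i in range(n_bins)]

# Hypothetical data with two clusters, around 2 and around 9.
data = [1, 2, 2, 3, 8, 9, 9, 10]

wide = histogram(data, 10)   # one bin: the two clusters are merged
medium = histogram(data, 3)  # the two clusters are visible
narrow = histogram(data, 1)  # one bin per value: many sparse bars

print(wide)
print(medium)
print(narrow)
```

With the widest bins the two clusters collapse into a single bar, while the narrowest bins scatter the same eight values over ten bars.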

Outliers can potentially _______ the interpretation of the data.

  • Complicate
  • Improve
  • Simplify
  • Skew
Outliers can skew the interpretation of the data. They can affect the mean and standard deviation, thus distorting the overall understanding of the data.
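
The effect on the mean and standard deviation can be seen with a short sketch. The values below are made up for illustration; the median is included only as a point of comparison.

```python
from statistics import mean, median, stdev

# Hypothetical measurements, with and without a single extreme value.
clean = [10, 11, 12, 13, 14]
with_outlier = clean + [100]

clean_mean, clean_sd = mean(clean), stdev(clean)
outlier_mean, outlier_sd = mean(with_outlier), stdev(with_outlier)

print(f"mean:   {clean_mean:.2f} -> {outlier_mean:.2f}")
print(f"stdev:  {clean_sd:.2f} -> {outlier_sd:.2f}")
print(f"median: {median(clean)} -> {median(with_outlier)}")
```

A single extreme value more than doubles the mean and inflates the standard deviation many times over, while the median barely moves.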

Under what conditions might a model-based method be preferred over other imputation methods?

  • When a known and well-fitting model can be assumed for the data
  • When the amount of missing data is negligible
  • When the data is missing completely at random
  • When the data is not missing at random
A model-based method might be preferred over other imputation methods when a known and well-fitting model can be assumed for the data. It is a principled way of handling missing data under the assumption that the data follows a specific statistical model, such as a linear or logistic regression.
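
As a minimal sketch of model-based imputation, the example below assumes a simple linear relationship between two variables and fills a missing value from a least-squares line fitted on the complete pairs. The data and the helper name `fit_line` are hypothetical, and a real analysis would first check that the assumed model actually fits.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical (x, y) pairs; None marks a missing y value.
pairs = [(1, 2.0), (2, 4.0), (3, None), (4, 8.0)]

# Fit the assumed model on the fully observed pairs only.
observed = [(x, y) for x, y in pairs if y is not None]
slope, intercept = fit_line([x for x, _ in observed], [y for _, y in observed])

# Replace each missing y with the model's prediction for its x.
imputed = [(x, y if y is not None else slope * x + intercept) for x, y in pairs]

print(imputed)
```

If the linear assumption were wrong, the imputed value would inherit that error, which is why the method is preferred only when a well-fitting model can be assumed.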

When the outlier is a result of a data entry error, the best approach would often be ________.

  • Binning
  • Imputation
  • Removal
  • Transformation
When outliers are due to data entry errors, they do not provide meaningful information about the underlying data, so removing them is usually the most appropriate approach.
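
A minimal sketch of removing entry errors follows, assuming that errors can be recognised by a plausibility range. The age values and the bounds of 0 and 120 are hypothetical choices for the example.

```python
# Hypothetical ages; 290 and -3 are taken to be data entry errors.
ages = [25, 31, 290, 42, -3]

# Assumed plausibility bounds for a human age.
MIN_AGE, MAX_AGE = 0, 120

valid_ages = [a for a in ages if MIN_AGE <= a <= MAX_AGE]
removed = [a for a in ages if not (MIN_AGE <= a <= MAX_AGE)]

print(f"kept: {valid_ages}, removed: {removed}")
```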