Can the Binomial Distribution be used to model the number of successes in a fixed number of Bernoulli trials?

  • No
  • Only for large sample sizes
  • Only for small sample sizes
  • Yes
Yes, the Binomial Distribution is used exactly for this purpose. It models the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success.
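As a minimal sketch (assuming SciPy is installed), `scipy.stats.binom` can be used to compute these probabilities directly:

```python
from scipy.stats import binom

n, p = 10, 0.3               # 10 Bernoulli trials, success probability 0.3

print(binom.pmf(4, n, p))    # P(exactly 4 successes)
print(binom.cdf(4, n, p))    # P(at most 4 successes)
print(binom.mean(n, p))      # expected number of successes: n * p = 3.0
```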

How does the role of data visualization differ in EDA, CDA, and Predictive Modeling?

  • Data visualization is not essential in any of these processes.
  • Data visualization is only used in EDA.
  • Data visualization is used in EDA to explore, in CDA to confirm, and in Predictive Modeling to represent the final model.
  • Data visualization plays the same role in EDA, CDA, and Predictive Modeling.
Data visualization plays different roles in each of these processes. In EDA, it is used to explore data and identify initial patterns or anomalies. In CDA, it can be used to represent statistical tests and confirm hypotheses. In Predictive Modeling, it is often used to represent the final model or visualize prediction results.

You've received feedback that your box plots are not providing a clear visual of the distribution of your dataset. What alternative plot could you use and why?

  • Bar graph
  • Line graph
  • Scatter plot
  • Violin plot
If box plots are not providing a clear visualization, a violin plot is a good alternative. Violin plots are similar to box plots but also show the probability density of the data at different values, giving a more detailed depiction of the dataset's distribution.
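A minimal sketch, assuming Seaborn is installed and using its built-in `tips` dataset as a stand-in for your own data:

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

# A violin plot shows the same quartile summary as a box plot,
# plus a kernel density estimate of the distribution's shape
sns.violinplot(data=tips, x="day", y="total_bill")
plt.show()
```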

How would the mean change if an additional number far away from the current mean were added to the dataset?

  • It would always decrease
  • It would always increase
  • It would increase or decrease depending on the value
  • It would not change
Adding a number far from the current mean would either increase or decrease the mean, depending on its value: if the added number is greater than the current mean, the mean increases; if it is smaller, the mean decreases. This illustrates how sensitive the mean is to outliers and extreme values.
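A quick numerical illustration with NumPy (the values are hypothetical):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 9])     # mean = 11.0
print(np.mean(data))                      # 11.0

# A value far above the current mean pulls the mean up...
print(np.mean(np.append(data, 100)))      # ~25.8

# ...while a value far below pulls it down
print(np.mean(np.append(data, -100)))     # -7.5
```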

Consider you have a regression model that is underfitting. On investigation, you discover missing data was dropped instead of imputed. What might be the reason for underfitting in this context?

  • The model didn't have enough data to learn from.
  • The model was over-regularized.
  • The model's complexity was too low.
  • The model's hyperparameters were not optimized.
Dropping missing data can significantly reduce the size of the training set. If much of the data is discarded, the model may not have enough data to learn the underlying patterns, leading to underfitting.
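A minimal sketch of the contrast, assuming pandas and scikit-learn are available (the tiny DataFrame is a hypothetical stand-in for real training data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"x1": [1.0, 2.0, np.nan, 4.0, 5.0],
                   "x2": [np.nan, 1.5, 2.5, np.nan, 4.5]})

# Listwise deletion: only the 2 fully observed rows survive
print(len(df.dropna()), "rows after dropping")

# Mean imputation: all 5 rows are retained for training
imputed = SimpleImputer(strategy="mean").fit_transform(df)
print(imputed.shape[0], "rows after imputation")
```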

Why is listwise deletion not recommended when the data missingness is systematic or 'not at random'?

  • It can cause overfitting
  • It can introduce bias
  • It can introduce random noise
  • It can lead to underfitting
Listwise deletion is not recommended when the missingness is systematic or 'not at random' because it can introduce bias. If the missing values are related to an underlying, unobserved mechanism, deleting those rows systematically excludes certain types of observations and can produce biased or misleading results.
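A hedged illustration of that bias: in the simulated data below, high incomes are (by construction) more likely to be missing, so listwise deletion systematically understates the mean:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=0.5, size=10_000)

# Missing not at random: the top 20% of earners do not report income
observed = pd.Series(np.where(income > np.quantile(income, 0.8),
                              np.nan, income))

print(income.mean())             # true mean
print(observed.dropna().mean())  # biased (lower) mean after deletion
```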

In a machine learning project, your data is not normally distributed, which is causing problems in your model. What are some strategies you could use to address this issue?

  • All of the above
  • Change the type of machine learning model to one that does not assume a normal distribution
  • Use data transformation techniques like logarithmic or square root transformations
  • Use non-parametric statistical methods
Several strategies can be used to address non-normal data in a machine learning project: data can be transformed using methods like logarithmic or square root transformations; non-parametric statistical methods that do not assume a normal distribution can be used; or a different type of machine learning model that does not assume a normal distribution can be chosen.
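A minimal sketch of the transformation option, applied to hypothetical right-skewed data with NumPy and SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
skewed = rng.exponential(scale=2.0, size=10_000)

log_t = np.log1p(skewed)      # log(1 + x), safe when zeros are present
sqrt_t = np.sqrt(skewed)

print(stats.skew(skewed))     # strongly positive
print(stats.skew(log_t))      # much closer to 0
print(stats.skew(sqrt_t))     # also reduced
```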

You're examining a dataset on company revenues and discover a significant jump in revenue for one quarter, which is not consistent with the rest of the data. What could this jump in revenue be considered in the context of your analysis?

  • A random fluctuation
  • A seasonal effect
  • A trend
  • An outlier
This significant jump in revenue could be considered an outlier in the context of your analysis, as it deviates significantly from the other data points.
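One common way to flag such a point is the 1.5 × IQR rule; the quarterly figures below are hypothetical:

```python
import numpy as np

revenue = np.array([102, 98, 105, 101, 250, 99, 103])  # one suspicious quarter

q1, q3 = np.percentile(revenue, [25, 75])
iqr = q3 - q1
outliers = revenue[(revenue < q1 - 1.5 * iqr) | (revenue > q3 + 1.5 * iqr)]
print(outliers)  # [250]
```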

You are analyzing a dataset where the variable 'income' has a skewed distribution due to a few high-income individuals. What method would you recommend to handle these outliers?

  • Binning
  • Removal
  • Transformation
In this case, a transformation such as a log transformation would be the best fit. It helps reduce the skewness of the data by compressing the high values.
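A minimal sketch on simulated income data (the log-normal sample is a hypothetical stand-in for the real variable):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
income = rng.lognormal(mean=10.5, sigma=1.0, size=5_000)  # heavy right tail

print(stats.skew(income))            # strongly right-skewed
print(stats.skew(np.log1p(income)))  # near-symmetric; log1p also handles zeros
```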

How does the data handling in Seaborn differ from that in Matplotlib?

  • Matplotlib supports larger datasets
  • Seaborn can't handle missing values
  • Seaborn integrates better with pandas DataFrames
  • Seaborn requires arrays
Seaborn integrates better with pandas DataFrames. In Seaborn, you can pass a DataFrame and refer to columns by name for the axes and other arguments, whereas Matplotlib primarily works with raw arrays.
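A minimal sketch of the difference, using Seaborn's built-in `tips` dataset for illustration:

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

# Seaborn: pass the DataFrame and refer to columns by name
sns.scatterplot(data=tips, x="total_bill", y="tip")

# Matplotlib: extract the underlying arrays yourself
plt.figure()
plt.scatter(tips["total_bill"].to_numpy(), tips["tip"].to_numpy())
plt.show()
```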