Which type of data is usually represented in categories?

  • Categorical data
  • Continuous data
  • Ordinal data
  • Quantitative data
Categorical data is usually represented in categories. It is a type of qualitative data whose values are group labels; the labels can be counted and compared for equality, but they carry no numerical meaning.
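
As a minimal sketch of what "no numerical meaning" implies in practice, the snippet below counts how often each label occurs, which is the kind of summary that makes sense for categories. The `eye_colors` list is made-up illustration data, not part of the quiz.

```python
from collections import Counter

# Hypothetical survey responses; the labels have no numeric meaning.
eye_colors = ["brown", "blue", "brown", "green", "blue", "brown"]

# Counting per category is meaningful; averaging the labels is not.
counts = Counter(eye_colors)
most_common_label, most_common_count = counts.most_common(1)[0]

print(counts)
print(most_common_label, most_common_count)
```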

A _____ plot can give us a detailed view of the data distribution including its quartiles and outliers.

  • Bar
  • Box
  • Line
  • Scatter
A box plot provides a detailed view of the distribution of a dataset, showing the median (second quartile), first quartile, third quartile, and potential outliers.
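
The quantities a box plot draws can be computed directly. The sketch below assumes the common 1.5 × IQR convention for flagging outliers and the "inclusive" quartile method from Python's `statistics` module; plotting libraries may use slightly different quartile conventions, and the `data` list is invented for illustration.

```python
import statistics

# Hypothetical sample; the value 40 is deliberately extreme.
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 40]

# First quartile, median (second quartile), and third quartile.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# Common box-plot convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, median, q3)
print(outliers)
```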

The __________ of a graph refers to its overall visual appeal, including aspects such as color, layout, and style.

  • Aesthetics
  • Functionality
  • Interactivity
  • Readability
Aesthetics of a graph refers to its visual appeal, including aspects such as color, layout, and style. Good aesthetics can make data easier to interpret and enhance the audience's engagement and comprehension.

What measure of central tendency is often used in skewed distributions to best represent a "typical" value?

  • Mean
  • Median
  • Mode
In skewed distributions, the "Median" is often used as the best representation of a "typical" value. The median is less affected by outliers or extreme values, which makes it a more robust measure when dealing with skewed data.

Imagine a dataset representing ages of people in a certain city. The ages range from 0 to 100 with most people in their mid-40s. How would the choice of central tendency measure differ if the distribution is symmetrical versus if it is skewed to the right?

  • Mean for both distributions
  • Mean for symmetrical, median for skewed
  • Median for symmetrical, mean for skewed
  • The measure wouldn't differ
If the distribution is symmetrical, the mean is a suitable measure of central tendency: the mean and median coincide, so the mean accurately represents the center. If the distribution is skewed to the right, the median is the better choice, because the long right tail pulls the mean upward while the median is far less affected by skewness or outliers.
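
A small numeric sketch of this difference, using two invented age samples (they are not taken from any real city): in the symmetric sample the mean and median agree, while a single large value on the right moves the mean but leaves the median where it was.

```python
import statistics

# Hypothetical ages, symmetric around 45.
symmetric_ages = [35, 40, 45, 50, 55]
# Same sample with the largest value replaced by 100 (right skew).
right_skewed_ages = [35, 40, 45, 50, 100]

sym_mean = statistics.mean(symmetric_ages)
sym_median = statistics.median(symmetric_ages)
skew_mean = statistics.mean(right_skewed_ages)
skew_median = statistics.median(right_skewed_ages)

print(sym_mean, sym_median)    # mean and median agree
print(skew_mean, skew_median)  # mean is pulled toward the tail
```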

You are tasked with preparing a dataset for use in a machine learning algorithm that does not assume any specific distribution of the data. Which scaling method might be most appropriate?

  • Min-Max scaling because it scales all values between 0 and 1
  • Robust scaling because it is not affected by outliers
  • The choice of scaling method does not depend on the distribution of the data
  • Z-score standardization because it creates a normal distribution
Because the algorithm makes no assumption about the distribution, the distribution by itself does not determine the scaling method. The choice depends instead on other properties of the data and on the requirements of the specific algorithm, such as the presence of outliers or the need for a bounded range. Any of the scaling methods could be appropriate depending on those factors.
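
To make the trade-offs concrete, here is a minimal pure-Python sketch of the three scaling methods named in the options. The helper names are hypothetical, the input is assumed to be non-constant (otherwise the divisors are zero), and robust scaling is written in its usual form of subtracting the median and dividing by the interquartile range. Note that z-score standardization only shifts and rescales the values; it does not change the shape of the distribution.

```python
import statistics

def min_max_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score_scale(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def robust_scale(values):
    """Subtract the median and divide by the interquartile range."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - med) / (q3 - q1) for v in values]

# Hypothetical feature with one extreme value.
values = [1, 2, 3, 4, 100]

print(min_max_scale(values))  # the outlier squeezes the other points near 0
print(z_score_scale(values))
print(robust_scale(values))   # the bulk of the data keeps a usable spread
```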

In a skewed distribution, a good method to handle outliers might be to use a ________ transformation.

  • Box-Cox
  • Inverse
  • Logarithmic
  • Square root
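Logarithmic transformations are often used on right-skewed distributions to handle outliers. The logarithm compresses large values far more than small ones, which pulls in the long tail and reduces the skewness of the data. The transformation requires strictly positive values.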
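
The effect can be seen on a tiny example. This sketch assumes strictly positive values (the logarithm is undefined otherwise) and uses an invented `raw` list; the gap between mean and median serves here as a rough, informal indicator of skew.

```python
import math
import statistics

# Hypothetical right-skewed values spanning several orders of magnitude.
raw = [1, 10, 100, 1000, 100000]

# Base-10 logarithm of each value; large values are compressed the most.
logged = [math.log10(v) for v in raw]

gap_before = statistics.mean(raw) - statistics.median(raw)
gap_after = statistics.mean(logged) - statistics.median(logged)

print(logged)
print(gap_before, gap_after)  # the mean-median gap shrinks after the transform
```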

How can outliers significantly impact the Pearson's correlation coefficient value?

  • Outliers can decrease the Pearson's correlation coefficient value
  • Outliers can distort the Pearson's correlation coefficient value
  • Outliers can increase the Pearson's correlation coefficient value
  • Outliers do not impact the Pearson's correlation coefficient value
Outliers can distort the Pearson's correlation coefficient value. Because Pearson's correlation measures the strength of the linear relationship between two variables, it is sensitive to outliers. Depending on where it falls, a single outlier can inflate the coefficient, shrink it, or even flip its sign, giving a misleading view of the relationship between the variables.
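
A short sketch of this sensitivity, computing Pearson's r from its definition (covariance divided by the product of the standard deviations). The function name `pearson_r` and the data are illustrative assumptions: five points lie exactly on a line, and one added extreme point is enough to turn the coefficient negative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data lying exactly on the line y = 2x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
r_clean = pearson_r(x, y)

# The same data plus a single extreme point.
r_outlier = pearson_r(x + [6], y + [-20])

print(r_clean)    # perfect positive linear relationship
print(r_outlier)  # negative, despite five of six points being on the line
```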

Seaborn simplifies data visualization in Python by providing a high-level interface for creating stylish, informative statistical graphics based on ___________.

  • Bokeh
  • Matplotlib
  • Pandas
  • Plotly
Seaborn is built on top of Matplotlib and it integrates well with pandas DataFrames. It provides a high-level interface to Matplotlib, allowing for the creation of more visually appealing plots.

What is the Variance Inflation Factor (VIF) and how does it help in identifying Multicollinearity?

  • A mathematical formula to measure the correlation between variables.
  • A measure that estimates how much the variance of a coefficient is increased due to multicollinearity.
  • A statistical method to calculate the variance of a dataset.
  • A technique to visualize the relationship between multiple variables.
The Variance Inflation Factor (VIF) is a measure that estimates how much the variance of an estimated regression coefficient is increased due to multicollinearity. For a given predictor it is computed as 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. As a common rule of thumb, a VIF above 5 is taken to indicate high multicollinearity; some practitioners use 10 as the cutoff.
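
The standard formula is VIF = 1 / (1 - R²), where R² is obtained by regressing one predictor on the remaining predictors. The sketch below covers only the simplest case of two predictors, where that R² equals the squared Pearson correlation between them; the function names and data are hypothetical, and a real analysis with more predictors would need a full multiple regression per predictor.

```python
def r_squared(xs, ys):
    """R^2 of a simple linear regression of ys on xs (the squared Pearson r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF for either predictor when the model has exactly two predictors."""
    return 1.0 / (1.0 - r_squared(x1, x2))

# Hypothetical predictors.
x1 = [1, 2, 3, 4, 5, 6]
x2_collinear = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]  # nearly a copy of x1
x2_unrelated = [2, 5, 1, 6, 3, 4]              # weakly related to x1

vif_high = vif_two_predictors(x1, x2_collinear)
vif_low = vif_two_predictors(x1, x2_unrelated)

print(vif_high)  # far above the rule-of-thumb threshold of 5
print(vif_low)   # close to 1, i.e. little variance inflation
```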