How is Multicollinearity typically detected in a dataset?

  • By calculating the Variance Inflation Factor (VIF).
  • By performing a simple linear regression.
  • By performing a t-test.
  • By visually inspecting the data.
Multicollinearity is typically detected by calculating the Variance Inflation Factor (VIF). A high VIF indicates a high degree of multicollinearity between the independent variables.

After exploring and interpreting your data, you would '______' your findings in the EDA process.

  • communicate
  • conclude
  • question
  • wrangle
After exploring and interpreting your data, you would 'conclude' your findings in the EDA process. This is where you draw actionable insights from the data that you have analyzed and explored.

Which type of graph would be most suitable for showing the relationship between two variables?

  • Bar graph
  • Histogram
  • Pie chart
  • Scatter plot
A scatter plot is most suitable for showing the relationship between two variables. Each point on the plot corresponds to two data values, with the position along the X and Y-axis representing the values of the two variables. This allows patterns and relationships to be identified visually.

You are required to create a complex statistical plot to identify and present possible correlations between multiple variables in your dataset. Which Python library would be the most appropriate for this task?

  • Bokeh
  • Matplotlib
  • Plotly
  • Seaborn
Seaborn is best suited for creating complex statistical plots. It provides high-level, attractive statistical plots and integrates well with pandas DataFrames, allowing direct use of column names for the axes and other arguments.

How does kurtosis impact the interpretation of data distribution?

  • It affects how we perceive the outliers and tail risks.
  • It affects the reliability of the mean.
  • It changes the standard deviation of the dataset.
  • It influences the choice of graph to use.
Kurtosis impacts the interpretation of data distribution by affecting how we perceive the outliers and tail risks. High kurtosis indicates a high probability of extreme outcomes, whereas low kurtosis suggests a lower chance of extreme outcomes.

A company surveyed its customers for their satisfaction scores, ranging from 1-10. The scores were heavily skewed to the right with a few customers giving a score of 1 or 2. Which measure of central tendency should the company use to present a typical customer experience?

  • All are equally valid
  • Mean
  • Median
  • Mode
The "Median" would be the best measure of central tendency in this scenario. Since the scores are heavily skewed to the right, the median would provide a more accurate representation of a typical customer's experience than the mean, which would be dragged down by the low scores.

The process of 'binning' to handle outliers involves grouping data into ________.

  • Bins
  • Deciles
  • Percentiles
  • Quartiles
In the process of binning, the data is grouped into 'bins', and the outliers are replaced with summary statistics like mean, median, or mode.

When is it more appropriate to use a correlation matrix instead of a pairplot?

  • When the dataset is very large
  • When the dataset is very small
  • When the variables are not numeric
  • When there are only two variables
When dealing with a large number of variables, a correlation matrix can be a more appropriate choice than a pairplot. This is because pairplots can become too complex and unreadable when the number of variables increases.

What is variance in the context of a data set?

  • The average deviation from the mean
  • The average squared deviation from the mean
  • The range of the data
  • The square root of the average deviation from the mean
"Variance" in the context of a data set is the "Average squared deviation from the mean". It gives a measure of how data points vary from the mean and is used to calculate the standard deviation.

What does Min-Max scaling do to the dataset?

  • It reduces the dimensionality of the dataset
  • It removes the mean and scales the data to unit variance
  • It scales the data based on median and interquartile range
  • It scales the dataset so that all feature values are in the range 0 to 1
Min-Max scaling, also known as normalization, transforms features by scaling each feature to a specific range, typically 0 to 1. This is done using the values of the minimum and maximum feature in the dataset.