What does MAR signify in data analysis related to missing data?
- Missed At Random
- Missing And Regular
- Missing At Random
- Missing At Range
In data analysis, MAR signifies Missing At Random. Despite the name, the missingness is not purely random: the probability that a value is missing may depend on the observed data, but not on the unobserved (missing) values themselves.
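The MAR idea can be illustrated with a small simulation. This is only a sketch under invented assumptions: the `age` and `income` columns, the age cutoff of 50, and the missingness rates of 0.5 and 0.1 are all hypothetical. The point is that the chance of `income` being missing is driven by the always-observed `age`, never by the income value itself.

```python
import random


def make_mar_data(n=1000, seed=0):
    """Simulate (age, income) rows where income goes missing at a rate
    that depends only on the observed age, never on income itself."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.randint(20, 70)            # always observed
        income = rng.gauss(50_000, 10_000)   # hypothetical values
        # Assumed mechanism: older respondents skip the question more often.
        p_missing = 0.5 if age >= 50 else 0.1
        if rng.random() < p_missing:
            income = None
        rows.append((age, income))
    return rows


def missing_rate(rows):
    """Fraction of rows whose income is missing."""
    return sum(income is None for _, income in rows) / len(rows)


rows = make_mar_data()
older = [r for r in rows if r[0] >= 50]
younger = [r for r in rows if r[0] < 50]
print(f"missing rate, age >= 50: {missing_rate(older):.2f}")
print(f"missing rate, age < 50:  {missing_rate(younger):.2f}")
```

Comparing the two missing rates shows that missingness is predictable from an observed column, which is what distinguishes MAR from data that are missing completely at random.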
A __________ graph would be most suitable for visualizing a dataset with two numerical variables.
- Bar chart
- Line chart
- Pie chart
- Scatter plot
A scatter plot would be most suitable for visualizing a dataset with two numerical variables. It provides a graphical view of the correlation, or relationship, between the two variables.
In the EDA process, where does the 'communication' step typically occur?
- After concluding
- After exploring
- Before questioning
- Before wrangling
In the EDA process, the 'communication' step typically occurs after concluding. It involves effectively conveying the findings, insights, or conclusions drawn from the data to relevant stakeholders.
Given a boxplot of a data set, how can you determine the IQR, and what does it tell you about the data?
- Add the value of the lower quartile to the upper quartile
- Divide the range by 2
- Subtract the value of the lower quartile from the upper quartile
- Take the square root of the range
From a boxplot, you can determine the interquartile range (IQR) by subtracting the value of the lower quartile (Q1, the bottom edge of the box) from the upper quartile (Q3, the top edge of the box). The IQR measures the range of the middle 50% of the data, which gives you a sense of the spread of the central data.
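As a small worked example, the subtraction can be done with Python's standard `statistics.quantiles`. The sample values are made up, and quartile conventions differ between tools, so a boxplot drawn by another library may show slightly different box edges for the same data.

```python
from statistics import quantiles


def iqr(values):
    """Return Q3 - Q1 using the default ('exclusive') quartile method."""
    q1, _median, q3 = quantiles(values, n=4)
    return q3 - q1


data = [1, 3, 5, 7, 9, 11, 13]  # hypothetical sample
print(iqr(data))  # Q1 = 3.0, Q3 = 11.0, so the IQR is 8.0
```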
Suppose you are dealing with time series data with some missing values and you decided to use regression imputation. What potential issues might arise and how could you address them?
- May lead to overfitting; Address by adding more data
- May violate independence assumption; Address by considering time dependence
- May violate uniform distribution; Address by transforming data
- No issues might arise
In time series data, observations are usually dependent on time, so the independence assumption of regression imputation may be violated. This issue can be addressed by considering time dependence in the regression model used for imputation, for example by including lagged variables.
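One way to take time dependence into account is to regress each value on its immediate predecessor (a lag-1 term) and use that fit to fill gaps. The following is a minimal sketch, not a production imputer: the helper name `impute_with_lag`, the toy series, and the choice of a single lag are all assumptions made for illustration.

```python
def impute_with_lag(series):
    """Fill None gaps by regressing each value on the previous one (lag 1).

    The line y_t = intercept + slope * y_{t-1} is fitted by ordinary least
    squares on consecutive pairs where both values are observed.
    """
    pairs = [(prev, cur) for prev, cur in zip(series, series[1:])
             if prev is not None and cur is not None]
    xs = [p for p, _ in pairs]
    ys = [c for _, c in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    filled = list(series)
    # Walk forward in time so an imputed value can feed the next prediction.
    for t in range(1, len(filled)):
        if filled[t] is None and filled[t - 1] is not None:
            filled[t] = intercept + slope * filled[t - 1]
    return filled


series = [10.0, 12.0, 14.0, None, 18.0, 20.0, None, 24.0]  # toy data
print(impute_with_lag(series))
```

A gap at the very start of the series has no predecessor and is left as `None` in this sketch. A fuller treatment might add more lags, a trend term, or seasonal indicators.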
How is Multicollinearity typically detected in a dataset?
- By calculating the Variance Inflation Factor (VIF).
- By performing a simple linear regression.
- By performing a t-test.
- By visually inspecting the data.
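Multicollinearity is typically detected by calculating the Variance Inflation Factor (VIF) for each independent variable. A high VIF indicates a high degree of multicollinearity among the independent variables; values above roughly 5 or 10 are common rule-of-thumb warning thresholds.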
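The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below covers only the simplest case of exactly two predictors, where R² reduces to the squared Pearson correlation between them. The helper names and the data are hypothetical, and a real analysis with more predictors would normally rely on a statistics library instead.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r_squared = pearson_r(x1, x2) ** 2
    return 1.0 / (1.0 - r_squared)


x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]  # roughly 2 * x1: nearly collinear
x3 = [5, 1, 4, 2, 6, 3]                # weakly related to x1

print(f"VIF with near-duplicate predictor: {vif_two_predictors(x1, x2):.1f}")
print(f"VIF with weakly related predictor: {vif_two_predictors(x1, x3):.2f}")
```

The nearly collinear pair produces a very large VIF, while the weakly related pair stays close to 1, the value for uncorrelated predictors.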
After exploring and interpreting your data, you would '______' your findings in the EDA process.
- communicate
- conclude
- question
- wrangle
After exploring and interpreting your data, you would 'conclude' your findings in the EDA process. This is where you draw actionable insights from the data that you have analyzed and explored.
Which type of graph would be most suitable for showing the relationship between two variables?
- Bar graph
- Histogram
- Pie chart
- Scatter plot
A scatter plot is most suitable for showing the relationship between two variables. Each point on the plot corresponds to a pair of data values, with its position along the x-axis and y-axis representing the values of the two variables. This allows patterns and relationships to be identified visually.
You are required to create a complex statistical plot to identify and present possible correlations between multiple variables in your dataset. Which Python library would be the most appropriate for this task?
- Bokeh
- Matplotlib
- Plotly
- Seaborn
Seaborn is best suited for creating complex statistical plots. It provides high-level, attractive statistical plots and integrates well with pandas DataFrames, allowing direct use of column names for the axes and other arguments.
How does kurtosis impact the interpretation of data distribution?
- It affects how we perceive the outliers and tail risks.
- It affects the reliability of the mean.
- It changes the standard deviation of the dataset.
- It influences the choice of graph to use.
Kurtosis impacts the interpretation of data distribution by affecting how we perceive the outliers and tail risks. High kurtosis indicates heavier tails than a normal distribution, so extreme outcomes are more likely, whereas low kurtosis indicates lighter tails and a lower chance of extreme outcomes.
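To make the tail interpretation concrete, here is a minimal sketch that computes excess kurtosis (the fourth standardized moment minus 3, so a normal distribution scores about 0) for two made-up samples. It uses the population form of the formula; library implementations may apply a bias correction and give slightly different numbers.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0


flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]            # evenly spread, no extremes
heavy_tail = [5, 5, 5, 5, 5, 5, 5, 5, 50]     # one extreme outlier

print(f"flat sample:       {excess_kurtosis(flat):.2f}")
print(f"heavy-tail sample: {excess_kurtosis(heavy_tail):.2f}")
```

The evenly spread sample yields a negative value (lighter tails than normal), while the sample with a single extreme point yields a clearly positive one.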