The process of converting an actual range of values in a numeric feature column into a standard range of values is known as _____.
- Binning
- Data Encoding
- Data Integration
- Data Scaling
The process of converting an actual range of values in a numeric feature column into a standard range of values is known as Data Scaling. It is a fundamental step in data preprocessing, particularly for machine learning algorithms that are sensitive to feature magnitude, such as distance-based methods (for example, k-nearest neighbors) and models trained with gradient descent.
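As a rough illustration, two common scaling schemes can be sketched with the standard library alone. The function names `min_max_scale` and `standardize` and the `ages` column are hypothetical, chosen only for this example; in practice a library such as scikit-learn would usually handle this.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Center values on 0 and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

ages = [18, 25, 40, 60, 90]  # hypothetical numeric feature column

scaled = min_max_scale(ages)   # every value now lies in [0, 1]
z_scores = standardize(ages)   # mean 0, standard deviation 1

print(scaled)
print(z_scores)
```

Min-max scaling keeps the shape of the distribution but is itself sensitive to extreme values, while standardization does not bound the result to a fixed interval.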
In the presence of outliers, the ________ correlation coefficient can provide misleading results.
- Covariance
- Kendall's Tau
- Pearson's
- Spearman's
In the presence of outliers, Pearson's correlation coefficient can provide misleading results. Because it is computed from the raw values through their means and standard deviations, a single extreme point can change it substantially.
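The effect can be seen in a small sketch that compares Pearson's coefficient with a rank-based (Spearman) coefficient on made-up data. The helper functions below are written for illustration only; the simplified `ranks` helper assumes there are no tied values.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (assumes non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Rank values from 1 to n. Simplification: assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = list(range(1, 11))
y_clean = list(range(1, 11))       # perfectly linear relationship
y_outlier = y_clean[:-1] + [1000]  # same ordering, but one extreme value

r_clean = pearson(x, y_clean)
r_outlier = pearson(x, y_outlier)
rho_outlier = spearman(x, y_outlier)

print(f"Pearson, clean data:   {r_clean:.3f}")
print(f"Pearson, with outlier: {r_outlier:.3f}")
print(f"Spearman, with outlier: {rho_outlier:.3f}")
```

The ordering of the points is identical in both datasets, so the rank-based coefficient is unchanged, while Pearson's coefficient drops sharply because of the single extreme value.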
Imagine you're dealing with a classification model. The dataset has a significant amount of missing data that was replaced with the mean. How could this decision have impacted the model's performance?
- It could distort the feature's statistical properties.
- It could increase the model's accuracy.
- It could lead to overfitting.
- It could lead to underfitting.
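Replacing missing data with the mean can distort the feature's statistical properties: the mean is preserved, but the variance shrinks because many values are now identical, and relationships with other features can be weakened. This can affect the model's learning and prediction capability.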
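A minimal sketch of this distortion, using a made-up column in which half of the values are missing (represented here as `None`):

```python
from statistics import mean, pvariance

observed = [10.0, 12.0, 15.0, 20.0, 43.0]  # hypothetical non-missing values
column = observed + [None] * 5             # half of the column is missing

fill_value = mean(observed)
imputed = [fill_value if v is None else v for v in column]

var_before = pvariance(observed)  # variance of the values actually observed
var_after = pvariance(imputed)    # variance after mean imputation

print(f"mean before: {mean(observed):.1f}, after: {mean(imputed):.1f}")
print(f"variance before: {var_before:.1f}, after: {var_after:.1f}")
```

The mean is unchanged, but the variance is cut roughly in half here, because every imputed value sits exactly at the center of the distribution.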
A data point that lies outside the overall distribution of the dataset is called a(n) _______.
- Anomaly
- Error
- Inlier
- Outlier
A data point that lies outside the overall distribution of the dataset is called an outlier. These are unusual observations that differ significantly from the other data points.
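One widely used convention for flagging such points is the 1.5 × IQR rule. The sketch below assumes that rule and uses made-up data; the text above does not prescribe a particular detection method, and other choices (such as z-scores) are equally common.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]. k=1.5 is a convention, not a law."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

readings = [12, 13, 14, 15, 15, 16, 17, 18, 95]  # hypothetical measurements

outliers = iqr_outliers(readings)
print(outliers)
```

Whether a flagged point is an error or a genuine but unusual observation still has to be judged from context.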
What does the term "Multicollinearity" refer to in the context of Exploratory Data Analysis?
- A condition where the independent variables in a regression model are highly correlated
- A statistical method to determine the correlation between variables
- Correlation among three or more variables
- Correlation between two variables
Multicollinearity refers to a situation where two or more independent variables in a multiple regression model are highly correlated. If these variables are closely correlated, it can be hard for the model to determine the effect of each variable independently, which may lead to unstable estimates.
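A common diagnostic is the variance inflation factor (VIF). As a simplified sketch, in a model with exactly two predictors the VIF reduces to 1 / (1 - r²), where r is the correlation between them. The feature names and values below are hypothetical, and the two-predictor shortcut does not generalize directly to larger models.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (assumes non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def vif_two_predictors(x1, x2):
    """VIF of either predictor when the model has exactly two predictors."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical housing features
sqft = [800, 950, 1100, 1300, 1500, 1750, 2000]
rooms = [2, 3, 3, 4, 4, 5, 6]      # closely tracks sqft
age = [30, 5, 22, 12, 40, 8, 18]   # largely unrelated to sqft

vif_correlated = vif_two_predictors(sqft, rooms)
vif_unrelated = vif_two_predictors(sqft, age)

print(f"VIF (sqft, rooms): {vif_correlated:.1f}")
print(f"VIF (sqft, age):   {vif_unrelated:.2f}")
```

A VIF close to 1 indicates little shared information between predictors, while large values signal that coefficient estimates are likely to be unstable.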
What factors should be considered when assessing the aesthetics of a data visualization?
- The balance, simplicity, clarity, and color scheme
- The designer's personal taste
- The latest trends in data visualization
- The time it took to create the visualization
Aesthetics in data visualization involve multiple factors, including balance (giving each part of the figure appropriate visual weight), simplicity (avoiding unnecessary complexity), clarity (being easy to understand), and the color scheme (which can direct attention, represent categories, or express quantities). Good aesthetics make the data easy to understand and the message memorable.
Which method of data imputation is generally most appropriate for MCAR data?
- Mean/Median imputation
- Prediction model
- Random Sample Imputation
For MCAR data, Random Sample Imputation is a good choice because the method assumes that the data are missing completely at random. It works by drawing random observations from the feature's observed values and using them to replace the missing values, which tends to preserve the feature's original distribution.
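A minimal sketch of the idea, assuming missing entries are represented as `None`. The function name `random_sample_impute` and the fixed `seed` parameter are choices made for this example only.

```python
import random

def random_sample_impute(values, seed=0):
    """Replace each None with a value drawn at random from the observed entries."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values to sample from")
    return [rng.choice(observed) if v is None else v for v in values]

column = [3.1, None, 4.7, 5.0, None, 2.8, None, 4.1]  # hypothetical feature

filled = random_sample_impute(column)
print(filled)
```

Unlike mean imputation, the filled-in values are spread across the observed range rather than concentrated at a single point.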
When the data is skewed to the right, the _____ will usually be greater than the median.
- Mean
- Median
- Mode
- Range
When data is skewed to the right, a relatively small number of observations with large values pull the mean upward, making it greater than the median, which depends only on the middle of the ordered data.
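A quick check on a made-up right-skewed sample (hypothetical incomes, in thousands):

```python
from statistics import mean, median

# Most values cluster at the low end; two large values form the right tail.
incomes = [28, 31, 33, 35, 38, 40, 42, 45, 120, 250]

sample_mean = mean(incomes)
sample_median = median(incomes)

print(f"mean:   {sample_mean}")
print(f"median: {sample_median}")
```

The two large values raise the mean well above the median, which is unaffected by how extreme the tail values are.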
Given that you need to create a publication-quality figure, which Python library provides the best control over every aspect of the figure properties?
- Bokeh
- Matplotlib
- Plotly
- Seaborn
Matplotlib provides a low-level, object-oriented API for embedding plots into applications and gives the most control over every aspect of the figure properties. This makes it suitable for creating publication-quality figures.
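The sketch below hints at this level of control using Matplotlib's object-oriented interface. The data, figure size, font sizes, and styling choices are assumptions made for illustration and are not requirements of any particular journal.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4, 5]
y = [0.0, 0.8, 0.9, 0.1, -0.8, -1.0]  # hypothetical measurements

# Figure size (in inches) and resolution are set explicitly.
fig, ax = plt.subplots(figsize=(3.5, 2.5), dpi=300)

# Line, marker, and label properties are all individually adjustable.
ax.plot(x, y, color="black", linewidth=1.0, marker="o", markersize=3, label="signal")

ax.set_xlabel("Time (s)", fontsize=9)
ax.set_ylabel("Amplitude", fontsize=9)
ax.tick_params(direction="in", labelsize=8)

# Individual frame edges (spines) can be hidden for a cleaner look.
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

ax.legend(frameon=False, fontsize=8)
fig.tight_layout()
```

Calling `fig.savefig("figure.pdf")` afterwards would write the figure as a vector file, which is the form many publications prefer.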
A team member has used a histogram to represent a dataset but the representation seems biased. What could be the potential issue?
- Improper choice of bin width
- Poor color choice
- The data was not cleaned properly
- The scale of the axes is wrong
One of the most common reasons a histogram appears biased is an improper choice of bin width, which strongly affects the resulting shape and apparent patterns. If the bins are too wide, important features such as multiple peaks may be hidden. If they are too narrow, the representation may appear cluttered or noisy.
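The effect of bin width can be sketched by counting the same made-up sample with three different widths. The helper `histogram_counts` is hypothetical and assumes every value is at least `start`.

```python
def histogram_counts(values, bin_width, start=0):
    """Count values in consecutive bins [start + k*w, start + (k+1)*w).

    Assumes every value is >= start.
    """
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // bin_width)] += 1
    return counts

# Hypothetical bimodal sample: one cluster near 2, another near 8.
data = [1, 2, 2, 2, 3, 3, 7, 8, 8, 8, 9, 9]

wide = histogram_counts(data, bin_width=10)     # one bin hides both clusters
medium = histogram_counts(data, bin_width=2)    # two peaks are visible
narrow = histogram_counts(data, bin_width=0.5)  # mostly empty bins, noisy shape

print("wide:  ", wide)
print("medium:", medium)
print("narrow:", narrow)
```

With a width of 10 the two clusters collapse into a single bar, with a width of 2 both peaks are visible, and with a width of 0.5 most bins are empty and the overall shape is harder to read.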