What is the primary cause of outliers in normally distributed data?
- All of these
- Data entry errors
- Data processing errors
- Measurement errors
Outliers in normally distributed data can result from various factors such as data entry errors, measurement errors, or data processing errors, which is why "All of these" is the correct choice.
In Plotly, the ________ object is the top-level container for all plot attributes.
- Diagram
- Figure
- Graph
- Plot
In Plotly, the 'Figure' object is the top-level container in which all plot-related attributes such as data and layout are stored.
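A minimal sketch of this in code, using plotly.graph_objects (the trace values and titles here are arbitrary):

```python
import plotly.graph_objects as go

# The Figure object is the top-level container holding the traces (data) and the layout.
fig = go.Figure(data=[go.Scatter(x=[1, 2, 3], y=[4, 1, 7], mode="lines+markers")])
fig.update_layout(title="Example figure", xaxis_title="x", yaxis_title="y")
fig.show()
```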
How does EDA help in understanding the underlying structure of data?
- By cleaning data
- By modelling data
- By summarizing data
- By visualizing data
EDA, particularly data visualization, plays a crucial role in understanding the underlying structure of data. Visual techniques such as histograms, scatter plots, and box plots can uncover patterns, trends, relationships, and outliers that would remain hidden in raw numerical data. Visual exploration can guide statistical analysis and predictive modeling by revealing the underlying structure and suggesting hypotheses.
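As a rough illustration, the sketch below uses Seaborn's bundled "tips" dataset to place a histogram, a scatter plot, and a box plot side by side:

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")  # example dataset bundled with Seaborn

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
sns.histplot(tips["total_bill"], ax=axes[0])                      # distribution shape
sns.scatterplot(data=tips, x="total_bill", y="tip", ax=axes[1])   # relationship between variables
sns.boxplot(x=tips["total_bill"], ax=axes[2])                     # spread and outliers
plt.tight_layout()
plt.show()
```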
What are the disadvantages of using backward elimination in feature selection?
- It assumes a linear relationship
- It can be computationally expensive
- It can result in overfitting
- It's sensitive to outliers
Backward elimination in feature selection starts with all variables and removes the least significant one at a time, refitting the model after each removal. Because the model must be refit at every step, the process can be computationally expensive, especially when the dataset has a large number of features.
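One way to run backward elimination is scikit-learn's SequentialFeatureSelector with direction="backward"; the sketch below (dataset and number of retained features chosen arbitrarily for illustration) shows the repeated refitting that makes the procedure costly on wide datasets:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Start from all features and drop the least useful one at each step;
# every step involves refitting the model under cross-validation.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="backward", cv=5
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the retained features
```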
The 'style' and 'context' functions in Seaborn are used to set the ___________ of the plots.
- aesthetic and context
- axis labels
- layout and structure
- size and color
The 'style' function in Seaborn is used to set the overall aesthetic look of the plot, including background color, grids, and spines. The 'context' function allows you to set the context parameters, which adjust the scale of the plot elements based on the context in which the plot will be presented (e.g., paper, notebook, talk, poster).
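A short sketch of both functions (the specific style and context names are just illustrative choices):

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style("whitegrid")   # aesthetic: background color, grid lines, spines
sns.set_context("talk")      # context: scales fonts and line widths for a presentation

tips = sns.load_dataset("tips")
sns.boxplot(data=tips, x="day", y="total_bill")
plt.show()
```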
When applying regression imputation, what factors need to be taken into consideration?
- Both dependent and independent variables
- None of the variables
- Only the dependent variable
- Only the independent variables
When applying regression imputation, both dependent and independent variables need to be taken into consideration. A regression model is built using the complete cases and then this model is used to predict the missing values in the incomplete cases. Therefore, it is important to carefully consider which variables to include in the regression model.
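A minimal regression-imputation sketch, assuming hypothetical "age", "education_years", and "income" columns where "income" has missing values:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: 'income' has missing values; 'age' and 'education_years' are predictors.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "education_years": [12, 16, 14, 18, 16, 12],
    "income": [30000, 52000, np.nan, 78000, np.nan, 34000],
})

complete = df.dropna(subset=["income"])   # complete cases used to fit the model
missing = df[df["income"].isna()]         # incomplete cases to be imputed

model = LinearRegression().fit(complete[["age", "education_years"]], complete["income"])
df.loc[df["income"].isna(), "income"] = model.predict(missing[["age", "education_years"]])
print(df)
```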
When would it be appropriate to use 'transformation' as an outlier handling method?
- When the outliers are a result of data duplication
- When the outliers are errors in data collection
- When the outliers are extreme but legitimate data points
- When the outliers do not significantly impact the data analysis
Transformation is an appropriate outlier handling method when the outliers are extreme but legitimate data points that carry valuable information. Transformations such as the log or square root compress the scale of the data, reducing the influence of extreme values without discarding them.
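For example, a log transform (shown here with made-up income figures) keeps the extreme but legitimate value in the data while shrinking its influence:

```python
import numpy as np

# Right-skewed data with one legitimate extreme value (e.g., incomes).
values = np.array([30_000, 42_000, 55_000, 61_000, 75_000, 1_200_000])

# log1p compresses the scale, so the extreme point stays in the data
# but no longer dominates the analysis.
log_values = np.log1p(values)
print(log_values)
```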
Suppose you are comparing the dispersion of two different data sets. One has a higher range, but a lower IQR than the other. What might this tell you about each data set?
- The one with the higher range has more outliers
- The one with the higher range has more variability
- The one with the lower IQR has more variability
- The one with the lower IQR is more skewed
If one dataset has a higher range but a lower IQR than the other, this suggests that the one with the higher range has more outliers. The range is sensitive to extreme values, while the IQR measures the spread of the middle 50% of the data and is largely unaffected by outliers.
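A small numeric illustration (values made up): dataset A contains one extreme point, giving it the larger range but the smaller IQR, while dataset B is evenly spread:

```python
import numpy as np

a = np.array([10, 11, 12, 13, 14, 15, 16, 90])  # one extreme value
b = np.array([5, 12, 20, 28, 36, 44, 52, 60])   # evenly spread values

for name, data in (("A", a), ("B", b)):
    q1, q3 = np.percentile(data, [25, 75])
    print(name, "range:", data.max() - data.min(), "IQR:", q3 - q1)
# A has the larger range but the smaller IQR.
```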
What is a primary assumption when using regression imputation?
- All data is normally distributed
- Missing data is missing completely at random (MCAR)
- Missing values are negligible
- The relationship between variables is linear
A primary assumption when using regression imputation is that the relationship between variables is linear. This is because regression imputation uses a regression model to predict missing values, and the basic form of regression models assumes a linear relationship between predictor and response variables.
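One quick, informal check of this assumption (using hypothetical columns) is to inspect the correlation or a scatter plot of the predictor and the variable being imputed, computed on the complete cases only:

```python
import numpy as np
import pandas as pd

# Hypothetical data: before imputing 'income' from 'age' with a linear model,
# verify the relationship looks roughly linear on the complete cases.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 44],
    "income": [30000, 52000, np.nan, 78000, 56000, 34000, 69000],
})

complete = df.dropna(subset=["income"])
print(complete["age"].corr(complete["income"]))  # Pearson r near +/-1 suggests a strong linear relationship
```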
You are working on a dataset and found that the model performance is poor. On further inspection, you found some data points that are far from the rest. What could be a possible reason for the poor performance of your model?
- Outliers
- Overfitting
- Underfitting
The poor performance of the model is likely due to outliers in the dataset. Data points that lie far from the rest can pull the fitted parameters toward them (for example, in models trained by minimizing squared error) and thereby degrade predictive performance.
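A small sketch with synthetic data showing how a few extreme points can shift a fitted linear model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.0 * X.ravel() + rng.normal(0, 1, size=50)   # roughly linear data, true slope ~2

# Inject a few extreme points far from the rest.
X_out = np.vstack([X, [[1.0], [2.0], [3.0]]])
y_out = np.concatenate([y, [80.0, 90.0, 100.0]])

clean = LinearRegression().fit(X, y)
dirty = LinearRegression().fit(X_out, y_out)
print("slope without outliers:", clean.coef_[0])
print("slope with outliers:   ", dirty.coef_[0])   # noticeably distorted by the outliers
```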