In the context of outlier detection, what is the commonly used plot to visually detect outliers in a single variable?
- Box Plot
- Scatter Plot
- Histogram
- Line Chart
A Box Plot is a commonly used visualization for detecting outliers in a single variable. It summarizes the distribution through the median and the quartiles, and by convention draws whiskers out to the most extreme data points that lie within 1.5 times the interquartile range (IQR) of the box. Data points beyond the whiskers are plotted individually and are often considered potential outliers.
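The whisker rule behind a box plot can be sketched in a few lines. This is a minimal illustration, assuming the common 1.5 × IQR fence convention and the "inclusive" quartile method from Python's `statistics` module. The function name `iqr_outliers` and the sample data are hypothetical, and plotting libraries may compute quartiles slightly differently.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the box-plot whisker fences."""
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr  # assumption: conventional 1.5 * IQR fences
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical sample: eight typical readings and one extreme value.
data = [12, 13, 14, 15, 15, 16, 17, 18, 95]
outliers = iqr_outliers(data)
print(outliers)  # [95]
```

Here Q1 is 14 and Q3 is 17, so the fences sit at 9.5 and 21.5 and only the value 95 is flagged.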
Which step in the Data Science Life Cycle is concerned with cleaning the data and handling missing values?
- Data Exploration
- Data Collection
- Data Preprocessing
- Data Visualization
Data Preprocessing is the step in the Data Science Life Cycle that involves cleaning the data, handling missing values, and preparing it for analysis. This step is crucial for ensuring the quality and reliability of the data used in subsequent analysis.
What is the most common measure of central tendency, which calculates the average value of a dataset?
- Median
- Mode
- Mean
- Standard Deviation
The mean, also known as the average, is a common measure of central tendency. It's calculated by adding up all the values in the dataset and then dividing by the number of data points. The mean provides a sense of the "typical" value in the dataset.
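As a small worked example of the calculation described above, the sketch below computes the mean by hand and with the standard library. The sample values are hypothetical.

```python
import statistics

# Hypothetical sample values.
values = [4, 8, 15, 16, 23, 42]

# Add up all the values, then divide by the number of data points.
mean_manual = sum(values) / len(values)

# The standard library gives the same result.
mean_stdlib = statistics.mean(values)

print(mean_manual)  # 18.0
```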
In the context of binary classification, which metric calculates the ratio of true positives to the sum of true positives and false negatives?
- Precision-Recall Curve
- F1 Score
- True Positive Rate (Sensitivity)
- Specificity
The True Positive Rate, also known as Sensitivity or Recall, calculates the ratio of true positives to the sum of true positives and false negatives. It measures the model's ability to correctly identify positive cases. It is an important metric in binary classification evaluation.
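The ratio TP / (TP + FN) can be computed directly from label lists. This is a minimal sketch: the function name `true_positive_rate`, the label encoding (1 = positive) and the sample predictions are assumptions made for illustration.

```python
def true_positive_rate(y_true, y_pred, positive=1):
    """Recall / sensitivity: TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp + fn == 0:
        # Assumption: report 0.0 when there are no actual positives.
        return 0.0
    return tp / (tp + fn)


# Hypothetical labels: 4 actual positives, 3 of them found by the model.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
tpr = true_positive_rate(y_true, y_pred)
print(tpr)  # 0.75
```

The false positive in the fifth position does not affect the result, because the True Positive Rate only looks at the actual positive cases.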
Which method for handling missing data involves using algorithms like k-NN to find similar records to impute the missing value?
- Mean imputation
- Median imputation
- k-NN imputation
- Mode imputation
k-NN imputation is a technique that uses the similarity of data points to impute missing values. It finds the records most similar to the one with missing data, based on the features that are present, and fills the gap with a value derived from those nearest neighbors, such as the mean of their values for a numeric feature. Mean, median, and mode imputation are simpler methods that fill every gap in a column with the same summary statistic.
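The idea can be sketched in plain Python. This is a simplified illustration, not a reference implementation: it assumes numeric features, Euclidean distance over the columns the incomplete record has, and the mean of the neighbors' values as the imputed value. The function name `knn_impute` and the sample records are hypothetical.

```python
import math


def knn_impute(rows, target_index, missing_col, k=2):
    """Estimate rows[target_index][missing_col] from the k nearest rows."""
    target = rows[target_index]
    # Compare records only on the columns the target actually has.
    shared_cols = [
        c for c, v in enumerate(target) if v is not None and c != missing_col
    ]
    candidates = []
    for i, row in enumerate(rows):
        # Skip the target itself and rows that cannot donate a value.
        if i == target_index or row[missing_col] is None:
            continue
        if any(row[c] is None for c in shared_cols):
            continue
        distance = math.sqrt(
            sum((row[c] - target[c]) ** 2 for c in shared_cols)
        )
        candidates.append((distance, row[missing_col]))
    if not candidates:
        raise ValueError("no complete neighbors available")
    candidates.sort(key=lambda pair: pair[0])
    neighbors = candidates[:k]
    # Assumption: impute with the unweighted mean of the neighbors' values.
    return sum(value for _, value in neighbors) / len(neighbors)


# Hypothetical records; the last one is missing its third feature.
records = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [8.0, 9.0, 50.0],
    [1.05, 2.05, None],
]
imputed = knn_impute(records, target_index=3, missing_col=2, k=2)
print(imputed)  # 11.0
```

The distant third record is ignored, so the imputed value reflects only the two similar records. In practice features are usually scaled first so that no single column dominates the distance, and libraries such as scikit-learn provide a ready-made `KNNImputer`.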
In recurrent neural networks (RNNs), which variant is designed specifically to handle long-term dependencies by maintaining a cell state?
- LSTM (Long Short-Term Memory)
- GRU (Gated Recurrent Unit)
- SRU (Simple Recurrent Unit)
- ESN (Echo State Network)
Long Short-Term Memory (LSTM) is a variant of RNN designed to handle long-term dependencies by maintaining a cell state that can capture information over long sequences. LSTM's ability to store and retrieve information over extended time steps makes it well-suited for tasks involving long-term dependencies in data sequences.
Which metric provides a single score that balances the trade-off between precision and recall?
- F1 Score
- Accuracy
- ROC AUC
- Log Loss
The F1 Score is a metric that balances the trade-off between precision and recall. It is especially useful with imbalanced datasets, where accuracy can be misleading, or when you want a balance between making positive predictions that are correct (precision) and capturing all actual positive cases (recall). The F1 Score is the harmonic mean of precision and recall, so it is high only when both are high. It is a suitable choice for evaluating models when both precision and recall matter.
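A short worked example shows how the harmonic mean behaves. This is a minimal sketch computed from confusion-matrix counts. The function name `f1_score`, the zero-division handling and the sample counts are assumptions for illustration.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion-matrix counts."""
    # Assumption: treat an undefined precision or recall as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: precision = 8/10 = 0.8, recall = 8/16 = 0.5.
score = f1_score(tp=8, fp=2, fn=8)
print(round(score, 3))  # 0.615
```

The arithmetic mean of 0.8 and 0.5 would be 0.65. The harmonic mean is lower because it is pulled toward the weaker of the two values.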
An AI startup with limited computational resources is building an image classifier. They don't have the capability to train a deep neural network from scratch. What approach can they use to leverage the capabilities of deep learning without the extensive training time?
- Transfer learning
- Reinforcement learning
- Genetic algorithms
- Random forest classifier
Transfer learning allows the startup to use a pre-trained deep neural network (e.g., a pre-trained CNN) as a starting point and adapt it to their own image classes, typically by retraining only the final layers or fine-tuning the network on their data. This approach significantly reduces training time and computational cost, while still benefiting from the capabilities of deep learning.
When evaluating models for a multi-class classification problem, which method computes the average metric score for each class, considering the other classes as the negative class?
- Micro-averaging
- Macro-averaging
- Weighted averaging
- Mini-batch averaging
Macro-averaging computes the metric separately for each class, treating all other classes as the "negative" class, and then takes the unweighted average of the per-class scores. It gives equal weight to each class, regardless of class size, so poor performance on a rare class is not hidden by good performance on a frequent one. This makes macro-averaging particularly useful in imbalanced multi-class classification problems.
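The sketch below illustrates macro-averaging with recall as the per-class metric. It is a minimal example under stated assumptions: the function names and the three-class sample labels are hypothetical, and the same one-vs-rest pattern applies to precision or F1.

```python
def per_class_recall(y_true, y_pred, label):
    """One-vs-rest recall: `label` is positive, every other class negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp / (tp + fn) if (tp + fn) else 0.0


def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    labels = sorted(set(y_true))
    scores = [per_class_recall(y_true, y_pred, label) for label in labels]
    return sum(scores) / len(scores)


# Hypothetical imbalanced data: class "a" dominates.
y_true = ["a"] * 6 + ["b", "b"] + ["c", "c"]
y_pred = ["a"] * 6 + ["b", "a"] + ["c", "c"]

macro = macro_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(macro, 3), accuracy)  # 0.833 0.9
```

The per-class recalls are 1.0, 0.5 and 1.0, so the macro average is about 0.833, while overall accuracy is 0.9. The single mistake on the small class "b" weighs more heavily in the macro average than in a score pooled over all samples.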
Which technique considers the spread of data points around the median to identify outliers?
- Box Plot
- Z-Score (Standardization)
- One-Hot Encoding
- K-Means Clustering
The Box Plot, also known as a box-and-whisker plot, considers the spread of data points around the median and helps identify outliers based on the interquartile range (IQR). Outliers are data points that fall outside the whiskers of the box plot. A Z-Score can also flag unusual values, but it measures distance from the mean in units of standard deviation rather than spread around the median. One-Hot Encoding is used to encode categorical variables, and K-Means Clustering is a clustering technique, not a median-based method for identifying outliers.