When evaluating models for a multi-class classification problem, which method computes the average metric score for each class, considering the other classes as the negative class?
- Micro-averaging
- Macro-averaging
- Weighted averaging
- Mini-batch averaging
Macro-averaging computes the metric score for each class separately, treating all other classes as the "negative" class, and then takes the unweighted mean of those per-class scores. Because every class contributes equally regardless of its size, macro-averaging is particularly useful in imbalanced multi-class classification problems where you want small classes to count as much as large ones.
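As a quick illustration, here is a minimal scikit-learn sketch (with made-up toy labels) contrasting macro- and micro-averaged F1 on an imbalanced label set:

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2]   # imbalanced: class 0 dominates
y_pred = [0, 0, 0, 1, 1, 1, 0, 2]

# Macro: compute F1 per class (one-vs-rest), then take the unweighted mean,
# so the small classes 1 and 2 count as much as the large class 0.
print(f1_score(y_true, y_pred, average="macro"))

# Micro: pool all individual decisions first, so the large class dominates.
print(f1_score(y_true, y_pred, average="micro"))
```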
Which technique considers the spread of data points around the median to identify outliers?
- Box Plot
- Z-Score (Standardization)
- One-Hot Encoding
- K-Means Clustering
The Box Plot, also known as a box-and-whisker plot, considers the spread of data points around the median and identifies outliers based on the interquartile range (IQR): points that fall outside the whiskers are flagged as outliers. The Z-Score measures deviation from the mean rather than the median, One-Hot Encoding encodes categorical variables, and K-Means Clustering is a clustering technique, not an outlier-detection method.
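The snippet below is a minimal numpy sketch of the 1.5 × IQR rule that box-plot whiskers conventionally use, applied to toy data:

```python
import numpy as np

data = np.array([2, 3, 4, 5, 5, 6, 7, 8, 30])  # 30 looks like an outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Whiskers conventionally extend 1.5 * IQR beyond the quartiles;
# anything outside that range is flagged as an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])  # [30]
```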
In Big Data processing, _______ operations filter and sort data, while _______ operations perform aggregations and transformations.
- Map, Reduce
- Filter, Join
- Shuffle, Merge
- Merge, Filter
In Big Data processing (the MapReduce model), the first blank should be filled with "Map" and the second with "Reduce." Map operations perform filtering and sorting of the input data, while Reduce operations perform aggregations and transformations, such as summing or counting the mapped output.
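A minimal pure-Python sketch of the two phases, using the canonical word-count example (plain dictionaries stand in for Hadoop's distributed machinery):

```python
records = ["apple", "banana", "apple", "cherry", "banana", "apple"]

# Map phase: emit (key, value) pairs and sort them by key --
# the "filter and sort" step.
mapped = sorted((word, 1) for word in records)

# Reduce phase: aggregate the values for each key -- the "aggregation" step.
counts = {}
for word, one in mapped:
    counts[word] = counts.get(word, 0) + one
print(counts)  # {'apple': 3, 'banana': 2, 'cherry': 1}
```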
Which activation function can alleviate the vanishing gradient problem to some extent?
- Sigmoid
- ReLU (Rectified Linear Unit)
- Tanh (Hyperbolic Tangent)
- Leaky ReLU
The ReLU activation function is known for mitigating the vanishing gradient problem, a common issue when training deep networks. Sigmoid and Tanh saturate for large-magnitude inputs, producing near-zero gradients, whereas ReLU has a constant gradient of 1 for positive inputs, allowing gradients to flow more freely during backpropagation. (Leaky ReLU shares this property and additionally keeps a small gradient for negative inputs, avoiding "dead" units; ReLU is the conventional answer here.)
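A small numpy sketch (with sigmoid written out by hand) that compares the gradients directly:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 1.0, 10.0])

# Sigmoid's gradient s(x) * (1 - s(x)) collapses toward 0 for large |x|.
print(sigmoid(x) * (1 - sigmoid(x)))  # ~[4.5e-05, 0.197, 0.197, 4.5e-05]

# ReLU's gradient is exactly 1 wherever the unit is active.
print((x > 0).astype(float))          # [0. 0. 1. 1.]

# Leaky ReLU keeps a small slope (here 0.01) even for negative inputs.
print(np.where(x > 0, 1.0, 0.01))     # [0.01 0.01 1.   1.  ]
```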
In Tableau, you can connect to various data sources and create a unified view known as a _______.
- Dashboard
- Workbook
- Storyboard
- Data source
In Tableau, a "Workbook" is where you can connect to various data sources, design visualizations, and create a unified view of your data. It serves as a container for creating and organizing your data visualizations and analyses.
In L2 regularization, the penalty is proportional to the _______ of the magnitude of the coefficients.
- Square
- Absolute
- Exponential
- Logarithmic
In L2 regularization (Ridge), the penalty is proportional to the square of the magnitude of the coefficients. This regularization technique adds a penalty term to the loss function based on the sum of squared coefficients, which helps prevent overfitting by discouraging large coefficients.
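For concreteness, here is a minimal numpy sketch of the ridge objective on toy data (the function name and the λ value are illustrative):

```python
import numpy as np

def ridge_loss(w, X, y, lam):
    """Squared-error loss plus the L2 penalty: lam * sum of squared coefficients."""
    residuals = X @ w - y
    return np.sum(residuals ** 2) + lam * np.sum(w ** 2)

# Toy data: the penalty grows with the *square* of each coefficient,
# so doubling a coefficient quadruples its contribution to the penalty.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
print(ridge_loss(w, X, y, lam=1.0))
```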
Which statistical concept measures how much individual data points vary from the mean of the dataset?
- Standard Deviation
- Median Absolute Deviation (MAD)
- Mean Deviation
- Z-Score
Standard Deviation is a measure of the spread or variability of data points around the mean. It quantifies how much individual data points deviate from the average, making it a crucial concept in understanding data variability and distribution.
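A minimal numpy sketch computing it by hand on a toy dataset, alongside numpy's built-in:

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# By hand: the square root of the mean squared deviation from the mean.
mean = data.mean()
std_manual = np.sqrt(np.mean((data - mean) ** 2))

print(std_manual)  # 2.0
print(data.std())  # 2.0 -- numpy's (population) standard deviation agrees
```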
What is the main function of Hadoop's MapReduce?
- Data storage and retrieval
- Data visualization
- Data cleaning and preparation
- Distributed data processing
The main function of Hadoop's MapReduce is "Distributed data processing." MapReduce is a programming model and processing technique used to process and analyze large datasets in a distributed and parallel manner.
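As a toy simulation of the model, the sketch below parallelizes the map phase with local worker processes (real Hadoop distributes these tasks across cluster nodes; the function names are illustrative):

```python
from collections import Counter
from multiprocessing import Pool

def map_chunk(lines):
    """Map task: emit word counts for one input split."""
    return Counter(word for line in lines for word in line.split())

def reduce_counts(partials):
    """Reduce task: merge the partial counts from all mappers."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    splits = [["big data big"], ["data processing"], ["big processing"]]
    with Pool(3) as pool:
        partials = pool.map(map_chunk, splits)
    print(reduce_counts(partials))
    # Counter({'big': 3, 'data': 2, 'processing': 2})
```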
Which ensemble method adjusts weights for misclassified instances in iterative training?
- Bagging
- Gradient Boosting
- Random Forest
- K-Means Clustering
Boosting methods train models sequentially, with each new model concentrating on the instances the previous models got wrong. Strictly speaking, explicit re-weighting of misclassified instances is how AdaBoost works; Gradient Boosting generalizes the idea by fitting each new model to the residual errors (negative gradients) of the current ensemble. Among the listed options, Gradient Boosting is the boosting method: Bagging and Random Forest train their models independently, and K-Means Clustering is not an ensemble method at all.
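A minimal scikit-learn sketch on synthetic data (all parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each of the 100 shallow trees is fit to the residual errors of the
# ensemble so far, so later trees focus on the still-misfit examples.
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                   max_depth=2, random_state=0)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```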
You are a data engineer tasked with setting up a real-time data processing system for a large e-commerce platform. The goal is to analyze user behavior in real-time to provide instant recommendations. Which technology would be most appropriate for this task?
- Apache Hadoop
- Apache Kafka
- Apache Spark
- MySQL
Apache Spark is the most suitable choice for real-time data processing and analytics. Its in-memory processing and streaming APIs allow for low-latency analysis, making it well suited to generating instant recommendations from user behavior. Apache Kafka is a distributed event-streaming platform that ingests and transports the data; it is commonly paired with Spark rather than performing the analytics itself. Hadoop MapReduce is batch-oriented and MySQL is a relational database, so neither is optimized for real-time processing.
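A minimal PySpark Structured Streaming sketch, adapted from the standard word-count pattern in the Spark documentation (the host, port, and app name are placeholders; a real pipeline would typically read from Kafka and feed a recommender):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("ClickstreamCounts").getOrCreate()

# Read a live text stream of user events, one event per line.
events = (spark.readStream.format("socket")
          .option("host", "localhost").option("port", 9999).load())

# Aggregate events per item as they arrive, entirely in memory.
counts = (events.select(explode(split(events.value, " ")).alias("item"))
          .groupBy("item").count())

# Continuously print the updated counts to the console.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```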