Which type of filtering is often used to reduce the amount of noise in an image?
- Median Filtering
- Edge Detection
- Histogram Equalization
- Convolutional Filtering
Median filtering is commonly used to reduce noise in an image. It replaces each pixel value with the median of the values in a local neighborhood, making it particularly effective at removing salt-and-pepper noise while preserving edges and fine features in the image.
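For illustration, a minimal sketch using SciPy's `median_filter` (one common implementation; OpenCV's `cv2.medianBlur` behaves similarly). The toy image and noise levels are assumptions made purely for the example.

```python
import numpy as np
from scipy.ndimage import median_filter

# Toy grayscale image with artificial salt-and-pepper noise.
rng = np.random.default_rng(0)
image = np.full((64, 64), 128, dtype=np.uint8)
noise_mask = rng.random(image.shape) < 0.05
image[noise_mask] = rng.choice([0, 255], size=noise_mask.sum())

# Replace each pixel with the median of its 3x3 neighborhood.
denoised = median_filter(image, size=3)
```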
The process of ________ involves extracting vast amounts of data from different sources and converting it into a format suitable for analysis.
- Data Visualization
- Data Aggregation
- Data Preprocessing
- Data Ingestion
Data Ingestion is the process of extracting vast amounts of data from various sources and converting it into a format suitable for analysis. It is a crucial step in preparing data for analysis and reporting.
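As a rough sketch of what ingestion can look like in practice, the snippet below pulls data from two hypothetical sources (a CSV export and a JSON dump; the file and column names are illustrative only) into one analysis-ready table with pandas.

```python
import pandas as pd

# Hypothetical source files; names and columns are illustrative only.
orders = pd.read_csv("orders_export.csv", parse_dates=["order_date"])
customers = pd.read_json("customers_api_dump.json")

# Normalize column names and combine into one analysis-ready table.
orders.columns = orders.columns.str.lower()
customers.columns = customers.columns.str.lower()
analysis_table = orders.merge(customers, on="customer_id", how="left")
```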
What is the primary challenge in real-time data processing as compared to batch processing?
- Scalability
- Latency
- Data Accuracy
- Complexity
The primary challenge in real-time data processing, as opposed to batch processing, is latency. Real-time processing requires low-latency data handling: data must be processed and made available for analysis almost immediately after it is generated. This is a significant challenge, especially when large volumes of data arrive continuously.
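The toy sketch below contrasts the two styles without any real streaming framework; the event structure and `handle` function are assumptions made for the example.

```python
import time

def handle(event):
    # Placeholder for per-event work (parse, enrich, write downstream).
    return event["value"] * 2

# Real-time style: process each event as soon as it arrives, so the result
# is available within moments of the event being generated.
def process_stream(events):
    for event in events:
        result = handle(event)
        latency = time.time() - event["created_at"]
        print(f"result={result}, latency={latency:.3f}s")

# Batch style: accumulate events and process them together later, so
# latency is dominated by the length of the batch window.
def process_batch(events):
    return [handle(e) for e in events]

process_stream([{"value": 3, "created_at": time.time()}])
```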
Which EDA technique involves understanding the relationships between different variables in a dataset through scatter plots, correlation metrics, etc.?
- Data Wrangling
- Data Visualization
- Data Modeling
- Data Preprocessing
Data Visualization is the technique used to understand the relationships between variables in a dataset. This involves creating scatter plots, correlation matrices, and other visual representations to identify patterns and correlations in the data, which is an essential part of Exploratory Data Analysis (EDA).
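A minimal pandas/matplotlib sketch of this kind of EDA, using a made-up two-variable dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy dataset with two related variables (values are illustrative).
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50, 60],
    "sales":    [15, 28, 33, 45, 52, 61],
})

# Scatter plot to eyeball the relationship.
df.plot.scatter(x="ad_spend", y="sales")
plt.show()

# Correlation matrix to quantify it.
print(df.corr())
```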
A financial institution is looking to build a data warehouse to analyze historical transaction data over the last decade. They need a solution that allows complex analytical queries. Which type of schema would be most suitable for this use case?
- Star Schema
- Snowflake Schema
- Factless Fact Table
- NoSQL Database
A Star Schema is the best choice for a data warehouse designed for complex analytical queries: it provides a denormalized structure that optimizes query performance. A Snowflake Schema is similar but more normalized, which adds join overhead. A Factless Fact Table is used for scenarios without numeric measures, and NoSQL databases are not typically used for traditional data warehousing.
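To make the star shape concrete, here is a small pandas sketch: a fact table of transactions keyed to two dimension tables, joined and aggregated for an analytical question. Table and column names are invented for the example.

```python
import pandas as pd

# Dimension tables: descriptive attributes.
dim_date = pd.DataFrame({"date_key": [1, 2], "year": [2022, 2023]})
dim_account = pd.DataFrame({"account_key": [10, 11],
                            "segment": ["retail", "corporate"]})

# Fact table: one row per transaction, foreign keys plus measures.
fact_txn = pd.DataFrame({
    "date_key": [1, 1, 2],
    "account_key": [10, 11, 10],
    "amount": [100.0, 2500.0, 75.0],
})

# Analytical query: total amount by year and segment (star join + aggregate).
report = (fact_txn
          .merge(dim_date, on="date_key")
          .merge(dim_account, on="account_key")
          .groupby(["year", "segment"])["amount"].sum())
print(report)
```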
In MongoDB, the _______ operator can be used to test a regular expression against a string.
- $search
- $match
- $regex
- $find
In MongoDB, the $regex operator is used to test a regular expression against a string. It allows you to perform pattern matching on string fields in your documents. This is useful for querying and filtering data based on specific patterns or text matching requirements.
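A minimal PyMongo sketch of a $regex query; the connection string, database, collection, and field names are placeholders.

```python
from pymongo import MongoClient

# Connection details are illustrative; adjust to your deployment.
client = MongoClient("mongodb://localhost:27017")
reviews = client["shop"]["reviews"]

# Find documents whose "comment" field matches a case-insensitive pattern.
cursor = reviews.find({"comment": {"$regex": "refund", "$options": "i"}})
for doc in cursor:
    print(doc["comment"])
```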
Which algorithm is used to split data into subsets while at the same time an associated decision tree is incrementally developed?
- K-Means Clustering
- Random Forest
- AdaBoost
- Gradient Boosting
Among the listed options, Random Forest best fits this description. It is an ensemble learning method that builds many decision trees, each grown by recursively splitting a bootstrap sample of the data into subsets, and then aggregates their predictions by majority vote (classification) or averaging (regression).
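A minimal scikit-learn sketch, using the built-in Iris dataset purely as an example:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees recursively splits its bootstrap sample into subsets;
# predictions are aggregated by majority vote.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```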
You are analyzing customer reviews for a product and want to automatically categorize each review as positive, negative, or neutral. Which NLP task would be most relevant for this purpose?
- Named Entity Recognition (NER)
- Text Summarization
- Sentiment Analysis
- Machine Translation
Sentiment Analysis is the NLP task most relevant for categorizing customer reviews as positive, negative, or neutral. It involves assessing the sentiment expressed in the text and assigning it to one of these categories based on the sentiment polarity. NER, Text Summarization, and Machine Translation serve different purposes and are not suitable for sentiment categorization.
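One simple way to sketch sentiment analysis is a bag-of-words classifier; the tiny hand-labeled review set below is invented for illustration, and a real system would need far more training data (or a pretrained sentiment model).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled reviews; purely illustrative.
reviews = ["great product, works perfectly", "terrible, broke after a day",
           "it is okay, nothing special", "love it, highly recommend",
           "waste of money, very disappointed", "does the job, average quality"]
labels = ["positive", "negative", "neutral", "positive", "negative", "neutral"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(reviews, labels)
print(model.predict(["the battery died quickly, very unhappy"]))
```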
The AUC-ROC curve is a performance measurement for classification problems at various _______ levels.
- Confidence
- Sensitivity
- Specificity
- Threshold
The AUC-ROC curve measures classification performance at various threshold levels. It plots the trade-off between the true positive rate (sensitivity) and the false positive rate (1 − specificity) as the decision threshold varies, and the area under the curve (AUC) summarizes performance across all thresholds.
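A short scikit-learn sketch with made-up labels and scores, showing how the curve is swept over thresholds and then summarized by the AUC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy ground-truth labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])

# roc_curve sweeps the decision threshold and reports TPR/FPR at each setting.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(thresholds, tpr, fpr)))

# The AUC summarizes performance across all thresholds.
print(roc_auc_score(y_true, y_score))
```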
Which CNN architecture is known for its residual connections and improved training performance?
- LeNet
- VGGNet
- AlexNet
- ResNet
Residual Networks (ResNets) are known for their residual connections, which allow for easier training of very deep networks. ResNets have become a standard in deep learning due to their ability to mitigate the vanishing gradient problem, enabling the training of much deeper architectures.
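The core idea can be sketched as a block whose output is F(x) + x. The PyTorch snippet below is a simplified version that omits batch normalization and downsampling for brevity; it is not the exact ResNet block from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Simplified residual block: output = F(x) + x (identity shortcut)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = self.conv2(out)
        return F.relu(out + x)  # skip connection eases gradient flow

x = torch.randn(1, 16, 32, 32)
print(ResidualBlock(16)(x).shape)  # torch.Size([1, 16, 32, 32])
```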
Which layer type in a neural network is primarily responsible for feature extraction and spatial hierarchy?
- Input Layer
- Convolutional Layer
- Fully Connected Layer
- Recurrent Layer
Convolutional Layers in neural networks are responsible for feature extraction and for learning spatial hierarchies, making them crucial in tasks such as image recognition. They apply learned filters to the input, capturing features such as edges and textures in early layers and more abstract patterns in deeper layers.
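To show what "applying a filter" means, here is a tiny NumPy/SciPy sketch with a hand-crafted vertical-edge kernel; a convolutional layer would learn such kernels from data rather than having them specified by hand.

```python
import numpy as np
from scipy.signal import convolve2d

# Toy image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A hand-crafted vertical-edge filter; a convolutional layer learns
# filters like this automatically during training.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])

feature_map = convolve2d(image, kernel, mode="valid")
print(feature_map)  # largest magnitudes along the vertical edge
```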
In which type of data do you often encounter a mix of structured tables and unstructured text?
- Structured Data
- Semi-Structured Data
- Unstructured Data
- Multivariate Data
Semi-structured data often contains a mix of structured tables and unstructured text. It's a flexible data format that can combine organized data elements with more free-form content, making it suitable for a wide range of data types and use cases, such as web data and NoSQL databases.
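A JSON document is a typical example: the record below (contents invented for illustration) mixes table-like fields with free-form text in one object.

```python
import json

# A single JSON record mixing structured fields with free-form text.
record = json.loads("""
{
  "order_id": 1042,
  "items": [{"sku": "A12", "qty": 2}, {"sku": "B07", "qty": 1}],
  "customer_note": "Please leave the package at the back door if no one answers."
}
""")

print(record["items"][0]["sku"])   # structured, table-like part
print(record["customer_note"])     # unstructured free text
```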