Big Data technologies are primarily designed to handle data that exceeds the processing capability of _______ systems.
- Mainframe
- Personal computer
- Supercomputer
- Mobile device
Big Data technologies are specifically designed for data that exceeds the processing capabilities of traditional systems such as mainframes, personal computers, and mobile devices. These systems are not equipped to efficiently store, process, and analyze massive datasets, which is exactly the gap Big Data technologies are built to fill.
Data that has some organizational properties, but not as strict as tables in relational databases, is termed _______ data.
- Unstructured Data
- Semi-Structured Data
- Raw Data
- Big Data
Data that has some organization but doesn't adhere to a strict tabular structure is known as "Semi-Structured Data." It covers formats such as JSON and XML, which carry structure through tags or key-value pairs without enforcing a fixed schema.
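As a quick illustration (the record contents are invented for this sketch), the snippet below loads a small JSON document in Python; each record carries its own keys, but the fields are not required to match a fixed table layout:

```python
import json

# Illustrative semi-structured records: each document is self-describing,
# but the set of fields varies from record to record.
raw = """
[
  {"id": 1, "name": "Alice", "email": "alice@example.com"},
  {"id": 2, "name": "Bob", "phones": ["555-0101", "555-0102"], "address": {"city": "Austin"}}
]
"""

records = json.loads(raw)
for rec in records:
    # Fields vary per record, so access optional ones defensively.
    print(rec["id"], rec["name"], rec.get("email", "<no email>"))
```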
While preparing data for a machine learning model, you realize that the 'Height' column has some missing values. Upon closer inspection, you find that these missing values often correspond to records where the 'Age' column has values less than 1 year. What might be a reasonable way to handle these missing values?
- Impute missing values with the mean height
- Impute missing values with 0
- Leave missing values as they are
- Impute missing values based on 'Age'
Because the missing heights occur precisely where 'Age' is less than 1 year, the missingness is related to 'Age', so imputing based on 'Age' is the most reasonable option. Infants have very different height characteristics from adults, so filling in values conditioned on age (for example, typical heights for children under 1) preserves that relationship. Imputing with the overall mean height or with 0 would introduce obvious bias, and simply leaving the values missing discards this information and is not supported by many models.
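A minimal pandas sketch of age-conditioned imputation, assuming a toy 'Age'/'Height' table and a reference infant height of 65 cm chosen purely for illustration:

```python
import pandas as pd

# Toy dataset: heights (cm) are missing only for infants (Age < 1).
df = pd.DataFrame({
    "Age":    [0.5, 0.8, 25, 32, 0.3, 41],
    "Height": [None, None, 170.0, 165.0, None, 180.0],
})

# Impute conditioned on 'Age': a typical infant height for Age < 1,
# and the mean of the observed adult heights otherwise.
infant_mask = df["Age"] < 1
typical_infant_height = 65.0  # assumed reference value for illustration
adult_mean = df.loc[~infant_mask, "Height"].mean()

df.loc[infant_mask, "Height"] = df.loc[infant_mask, "Height"].fillna(typical_infant_height)
df.loc[~infant_mask, "Height"] = df.loc[~infant_mask, "Height"].fillna(adult_mean)

print(df)
```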
In Gradient Boosting, what is adjusted at each step to minimize the residual errors?
- Learning rate
- Number of trees
- Feature importance
- Maximum depth of trees
In Gradient Boosting, the learning rate (step size) is what is adjusted at each step to minimize the residual errors: every stage fits a new tree to the current residuals, and that tree's contribution is scaled by a step size chosen to reduce the remaining loss. A smaller learning rate makes the model learn more slowly and often leads to better generalization, reducing the risk of overfitting.
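For a rough illustration of the learning rate's effect, the sketch below trains scikit-learn's GradientBoostingRegressor on synthetic data with two different learning_rate values (the dataset and settings are arbitrary, not tuned):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data for illustration only.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each boosting stage fits a tree to the current residuals;
# learning_rate scales how much of that tree's prediction is added.
for lr in (1.0, 0.1):
    model = GradientBoostingRegressor(n_estimators=200, learning_rate=lr, random_state=0)
    model.fit(X_train, y_train)
    print(f"learning_rate={lr}: test R^2 = {model.score(X_test, y_test):.3f}")
```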
The gradient explosion problem in deep learning can be mitigated using the _______ technique, which clips the gradients if they exceed a certain value.
- Data Augmentation
- Learning Rate Decay
- Gradient Clipping
- Early Stopping
Gradient clipping is a technique used to mitigate the gradient explosion problem in deep learning. It limits the magnitude of gradients during training, preventing them from becoming too large and causing instability.
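As a minimal sketch (assuming PyTorch and a throwaway linear model), gradient clipping can be applied to the global gradient norm just before the optimizer step:

```python
import torch
import torch.nn as nn

# Toy model, optimizer, and data for illustration.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.randn(32, 10)
y = torch.randn(32, 1)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()

# Clip the global gradient norm to 1.0 before the update, so exploding
# gradients cannot destabilize training.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```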
The process of adjusting the contrast or brightness of an image is termed _______ in image processing.
- Segmentation
- Normalization
- Histogram Equalization
- Enhancement
In image processing, adjusting the contrast or brightness of an image is termed "Enhancement." Image enhancement techniques improve the visual quality of an image by strengthening specific properties such as brightness and contrast.
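For illustration, assuming Pillow and a hypothetical input file photo.jpg, brightness and contrast can be adjusted with ImageEnhance (factors above 1.0 increase the property, below 1.0 decrease it):

```python
from PIL import Image, ImageEnhance

# Hypothetical input image.
img = Image.open("photo.jpg")

brighter = ImageEnhance.Brightness(img).enhance(1.3)        # +30% brightness
higher_contrast = ImageEnhance.Contrast(img).enhance(1.5)   # +50% contrast

brighter.save("photo_bright.jpg")
higher_contrast.save("photo_contrast.jpg")
```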
In MongoDB, the _______ operator can be used to test a regular expression against a string.
- $search
- $match
- $regex
- $find
In MongoDB, the $regex operator is used to test a regular expression against a string. It allows you to perform pattern matching on string fields in your documents. This is useful for querying and filtering data based on specific patterns or text matching requirements.
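A small pymongo sketch (the connection string and the 'shop.products' collection are assumptions for illustration) that matches documents whose name field starts with "data", case-insensitively:

```python
from pymongo import MongoClient

# Assumes a local MongoDB instance and a hypothetical 'shop.products' collection.
client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]

# $regex tests a regular expression against the string field; $options "i"
# makes the match case-insensitive.
cursor = products.find({"name": {"$regex": "^data", "$options": "i"}})
for doc in cursor:
    print(doc["name"])
```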
A financial institution is looking to build a data warehouse to analyze historical transaction data over the last decade. They need a solution that allows complex analytical queries. Which type of schema would be most suitable for this use case?
- Star Schema
- Snowflake Schema
- Factless Fact Table
- NoSQL Database
A Star Schema is the best choice for a data warehouse designed for complex analytical queries: its denormalized dimension tables keep joins to a minimum, which optimizes query performance. A Snowflake Schema is similar but more normalized, so queries need additional joins. A Factless Fact Table is used for scenarios without numeric measures, and NoSQL databases are not typically used for traditional data warehousing.
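To make the layout concrete, here is a toy, pandas-based sketch of a star-schema arrangement (all table and column names are invented): a transaction fact table joined to small dimension tables for a typical aggregate query:

```python
import pandas as pd

# Dimension tables describe dates and customers; the fact table records
# transactions and references the dimensions by key.
dim_date = pd.DataFrame({"date_key": [1, 2], "year": [2022, 2023], "quarter": ["Q4", "Q1"]})
dim_customer = pd.DataFrame({"customer_key": [10, 20], "segment": ["retail", "corporate"]})
fact_txn = pd.DataFrame({
    "date_key":     [1, 1, 2, 2],
    "customer_key": [10, 20, 10, 20],
    "amount":       [120.0, 5000.0, 80.0, 7500.0],
})

# Typical analytical query: join the fact table to its dimensions, then aggregate.
report = (fact_txn
          .merge(dim_date, on="date_key")
          .merge(dim_customer, on="customer_key")
          .groupby(["year", "segment"])["amount"].sum())
print(report)
```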
Which EDA technique involves understanding the relationships between different variables in a dataset through scatter plots, correlation metrics, etc.?
- Data Wrangling
- Data Visualization
- Data Modeling
- Data Preprocessing
Data Visualization is the technique used to understand the relationships between variables in a dataset. This involves creating scatter plots, correlation matrices, and other visual representations to identify patterns and correlations in the data, which is an essential part of Exploratory Data Analysis (EDA).
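As a brief illustration with an invented two-column dataset, the snippet below computes a correlation metric and draws a scatter plot with pandas and matplotlib:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset with two numeric variables to relate.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "exam_score":    [52, 55, 61, 64, 70, 74, 79, 85],
})

# Correlation metric: Pearson correlation matrix between the columns.
print(df.corr())

# Scatter plot of the relationship.
df.plot.scatter(x="hours_studied", y="exam_score", title="Score vs. hours studied")
plt.show()
```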
What is the primary challenge in real-time data processing as compared to batch processing?
- Scalability
- Latency
- Data Accuracy
- Complexity
The primary challenge in real-time data processing, as opposed to batch processing, is latency. Real-time processing requires low-latency data handling, meaning that data must be processed and made available for analysis almost immediately after it's generated. This can be a significant challenge, especially when dealing with large volumes of data and ensuring near-instantaneous processing and analysis.