For graph processing in a distributed environment, Apache Spark provides the _______ library.
- GraphX
- HBase
- Pig
- Storm
Apache Spark provides the GraphX library for graph processing in a distributed environment. GraphX is part of the Spark ecosystem and is built on top of Spark's RDDs. It is used for graph analytics and computation, and ships with common graph algorithms such as PageRank and connected components.
In computer vision, what process involves converting an image into an array of pixel values?
- Segmentation
- Feature Extraction
- Pre-processing
- Quantization
Pre-processing in computer vision typically includes steps like resizing, filtering, and transforming an image. It's during this phase that an image is converted into an array of pixel values, making it ready for subsequent analysis and feature extraction.
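To make "an array of pixel values" concrete, here is a minimal sketch that parses a tiny grayscale image in the plain-text PGM (P2) format into a nested list and then rescales it to the 0 to 1 range. The sample image and the function names `pgm_to_pixels` and `normalize` are made up for this illustration; real pipelines usually rely on an imaging library rather than hand-written parsing.

```python
# A hypothetical 3x2 grayscale image in plain PGM (P2) format.
SAMPLE_PGM = """P2
# width height
3 2
255
0 128 255
64 32 16
"""


def pgm_to_pixels(text):
    """Parse a plain (P2) PGM string into a list of rows of integer pixel values."""
    # Drop whole-line comments, then split the rest into whitespace-separated tokens.
    lines = [ln for ln in text.splitlines() if not ln.strip().startswith("#")]
    tokens = " ".join(lines).split()
    if tokens[0] != "P2":
        raise ValueError("only plain PGM (P2) is handled in this sketch")
    width, height, max_value = int(tokens[1]), int(tokens[2]), int(tokens[3])
    values = [int(t) for t in tokens[4:4 + width * height]]
    # Cut the flat list of values into rows of `width` pixels each.
    rows = [values[r * width:(r + 1) * width] for r in range(height)]
    return rows, max_value


def normalize(pixels, max_value):
    """Scale pixel values to the range [0, 1], a common pre-processing step."""
    return [[p / max_value for p in row] for row in pixels]


pixels, max_value = pgm_to_pixels(SAMPLE_PGM)
scaled = normalize(pixels, max_value)
print(pixels)  # [[0, 128, 255], [64, 32, 16]]
```

Once the image is in this form, later stages such as filtering or feature extraction can treat it as ordinary numeric data.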
Which of the following is not typically a layer in a CNN?
- Convolutional Layer
- Fully Connected Layer
- Recurrent Layer
- Pooling Layer
Recurrent Layers are not typically used in Convolutional Neural Networks. They are more common in Recurrent Neural Networks (RNNs) and are used for sequential data processing, unlike CNNs, which are designed for grid-like data.
The operation in CNNs that combines the outputs of neuron clusters and produces a single output for the cluster is known as _______.
- Activation Function
- Pooling
- Convolutions
- Fully Connected
In CNNs, the operation that combines the outputs of neuron clusters and produces a single output for the cluster is called "Pooling." Pooling reduces the spatial dimensions of the feature maps, making them smaller and more computationally efficient while retaining important features.
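As an illustration, max pooling (one common form of pooling) can be sketched in plain Python as follows. The function name `max_pool_2d`, the 2x2 window, and the stride of 2 are assumptions chosen for the example; deep learning frameworks provide their own optimized pooling layers.

```python
def max_pool_2d(feature_map, size=2, stride=2):
    """Max pooling: each size x size window collapses to its largest value."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            # Collect the values in the current window and keep only the maximum.
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# A hypothetical 4x4 feature map.
feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool_2d(feature_map)
print(pooled)  # [[6, 8], [14, 16]]
```

The 4x4 input shrinks to 2x2, so each output value summarizes one cluster of four neighbouring activations.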
A healthcare organization stores patient records in a database. Each record contains structured fields like name, age, and diagnosis. Additionally, there are scanned documents and notes from doctors. Which term best describes the type of data in this healthcare database?
- Structured data
- Semi-structured data
- Unstructured data
- Big data
The healthcare database mixes structured data (name, age, diagnosis) with scanned documents and doctors' notes. The scans and free-text notes have no fixed schema of their own, but they can be tagged or indexed for better retrieval, which gives them partial structure. Data with this kind of partial structure, rather than a fully fixed schema, is described as semi-structured.
When a model performs well on training data but poorly on unseen data, what issue might it be facing?
- Overfitting
- Underfitting
- Data leakage
- Bias-variance tradeoff
The model is likely facing the issue of overfitting. Overfitting occurs when the model learns the training data too well, including noise, resulting in excellent performance on the training set but poor generalization to unseen data. It's an example of a high-variance problem in the bias-variance tradeoff. To address overfitting, techniques like regularization and more data are often used.
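The gap between training and unseen performance can be reproduced on a toy problem. The sketch below is an illustration under stated assumptions, not a benchmark: the dataset is hand-built, four training labels are flipped to stand in for noise, a 1-nearest-neighbour "memorizer" plays the role of the overfitted model, and the simple rule is assumed to have found the right threshold.

```python
def true_label(x):
    """The underlying pattern the models are supposed to learn."""
    return x >= 0.5


# Training set: 20 evenly spaced points, four labels flipped by hand to mimic noise.
train_x = [i / 20 for i in range(20)]
noisy_indices = {3, 8, 14, 17}
train_y = [(not true_label(x)) if i in noisy_indices else true_label(x)
           for i, x in enumerate(train_x)]

# Unseen data: points that fall between the training points, with clean labels.
test_x = [(i + 0.4) / 20 for i in range(20)]
test_y = [true_label(x) for x in test_x]


def memorizer(x):
    """1-nearest-neighbour: copy the label of the closest training point, noise included."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def simple_rule(x):
    """A single threshold: too simple to memorize the flipped labels."""
    return x >= 0.5


def accuracy(model, xs, ys):
    """Fraction of points where the model's prediction matches the label."""
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)


for name, model in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    print(name,
          "train:", accuracy(model, train_x, train_y),
          "unseen:", accuracy(model, test_x, test_y))
```

The memorizer scores perfectly on the training set but loses accuracy on the unseen points because it has learned the flipped labels, while the simpler rule does worse on the noisy training labels and better on unseen data.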
Which type of database is ideal for handling hierarchical data and provides better scalability, MongoDB or MySQL?
- MongoDB
- MySQL
- Both MongoDB and MySQL
- Neither MongoDB nor MySQL
MongoDB is a document-oriented NoSQL database that is well suited to hierarchical data, because related fields can be nested inside a single document instead of being split across tables. It stores documents in BSON (Binary JSON) format and can scale horizontally through sharding, which makes it a good choice for applications that require flexibility and scalability in dealing with complex data structures.
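To see why nesting suits hierarchical data, compare one nested document with the flat rows a relational layout would need. This is only a sketch: a plain Python dict stands in for a MongoDB document, no database driver is involved, and the customer data and the `flatten_to_rows` helper are hypothetical.

```python
import json

# A plain dict standing in for a MongoDB document (no database driver is used here).
customer_doc = {
    "name": "Ada",
    "orders": [
        {"order_id": 1, "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
        {"order_id": 2, "items": [{"sku": "C3", "qty": 5}]},
    ],
}


def flatten_to_rows(doc):
    """Flatten the nested document into relational-style rows, one per order item."""
    rows = []
    for order in doc["orders"]:
        for item in order["items"]:
            rows.append((doc["name"], order["order_id"], item["sku"], item["qty"]))
    return rows


rows = flatten_to_rows(customer_doc)

# The hierarchy is visible directly in the document form...
print(json.dumps(customer_doc, indent=2))
# ...whereas the flat form repeats the parent keys on every row.
for row in rows:
    print(row)
```

In the document form the customer, orders, and items travel together. In the flat form the same hierarchy has to be rebuilt from repeated keys, which is what joins do in a relational database.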
A company uses an AI model for recruitment, and it's observed that the model is selecting more male candidates than female candidates for a tech role, even when both genders have similar qualifications. What ethical concern does this scenario highlight?
- Data bias in AI
- Lack of transparency in AI
- Data security and privacy issues in AI
- Ethical AI governance and accountability
This scenario highlights the ethical concern of "Data bias in AI." The AI model's biased selection towards male candidates indicates that the training data may be biased, leading to unfair and discriminatory outcomes. Addressing data bias is essential to ensure fairness and diversity in AI-driven recruitment.
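One simple way to surface this kind of skew is to compare selection rates between groups. The sketch below uses made-up screening outcomes and a hypothetical `selection_rates` helper; it only measures the disparity and does not by itself explain its cause.

```python
# Hypothetical screening outcomes as (group, was_selected) pairs.
outcomes = (
    [("male", True)] * 6 + [("male", False)] * 4
    + [("female", True)] * 3 + [("female", False)] * 7
)


def selection_rates(records):
    """Return the fraction of candidates selected within each group."""
    totals, selected = {}, {}
    for group, was_selected in records:
        totals[group] = totals.get(group, 0) + 1
        selected[group] = selected.get(group, 0) + int(was_selected)
    return {group: selected[group] / totals[group] for group in totals}


rates = selection_rates(outcomes)
# Ratio of the two selection rates; 1.0 would mean both groups are selected equally often.
ratio = rates["female"] / rates["male"]
print(rates, ratio)
```

Here the female selection rate is half the male rate. With similarly qualified candidates, a gap like this is a signal to audit the training data and the model.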
In EDA, which method can help in understanding how a single variable is distributed across various categories or groups?
- Histogram
- Box Plot
- Scatter Plot
- Bar Plot
A bar plot visualizes how a single variable is distributed across different categories or groups. Each category gets a rectangular bar whose length represents its value (for example, a count or an average), making it easy to compare the categories. Bar plots are commonly used in Exploratory Data Analysis (EDA).
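The numbers behind a bar plot are just one value per category. As a minimal sketch with made-up observations and a hypothetical `text_bar_chart` helper, the following counts each category and draws the bars as text; a plotting library would render the same counts graphically.

```python
from collections import Counter

# Hypothetical categorical variable: the product category of each sale.
observations = ["books", "toys", "books", "games", "books", "toys"]
counts = Counter(observations)


def text_bar_chart(category_counts):
    """Render one text bar per category, with the largest count first."""
    ordered = sorted(category_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{category:<6} {'#' * count} {count}" for category, count in ordered]


chart = text_bar_chart(counts)
for line in chart:
    print(line)
```

Each bar's length is the count for that category, so the comparison between groups is visible at a glance.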
You're working with a dataset containing sales data from various regions. You want to identify sales patterns, seasonal trends, and anomalies. Which EDA techniques and visualization tools would be best suited for this?
- Scatter plots and t-SNE
- Box plots and bar charts
- Time series plots and heatmaps
- Histograms and parallel coordinates
For exploring sales patterns and seasonal trends, time series plots and heatmaps are excellent choices. Time series plots reveal trends and recurring seasonal cycles over time, and heatmaps (for example, region by month) let you compare regions side by side, so unusual values stand out as anomalies against the overall pattern.
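The data preparation behind both views can be sketched as follows. The sales figures, the `pivot` and `find_anomalies` helpers, and the 30% tolerance are all assumptions made for this illustration. The grid is the region-by-month table a heatmap would colour, each row is one region's time series, and the anomaly check is a deliberately simple median-based rule rather than a recommended method.

```python
from statistics import median

# Hypothetical monthly sales records as (region, month, amount).
records = [
    ("North", 1, 100), ("North", 2, 102), ("North", 3, 98),
    ("North", 4, 101), ("North", 5, 99), ("North", 6, 180),
    ("South", 1, 80), ("South", 2, 82), ("South", 3, 79),
    ("South", 4, 81), ("South", 5, 80), ("South", 6, 78),
]


def pivot(sales_records):
    """Build region -> amounts ordered by month (the rows of a region-by-month heatmap)."""
    grid = {}
    for region, _month, amount in sorted(sales_records, key=lambda r: (r[0], r[1])):
        grid.setdefault(region, []).append(amount)
    return grid


def find_anomalies(grid, tolerance=0.3):
    """Flag values that differ from their region's median by more than `tolerance`."""
    flagged = []
    for region, amounts in grid.items():
        typical = median(amounts)
        # Assumes months run 1..n with no gaps, as in the sample records.
        for month, amount in enumerate(amounts, start=1):
            if abs(amount - typical) > tolerance * typical:
                flagged.append((region, month, amount))
    return flagged


grid = pivot(records)
anomalies = find_anomalies(grid)
print(grid)
print(anomalies)  # [('North', 6, 180)]
```

Plotting each row of `grid` against the month gives the time series view, and colouring the whole grid gives the heatmap. The flagged North value in month 6 is the kind of point both views would make visible.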