Apache Spark offers an optimized engine that supports _______ computations, enabling faster data analytics.
- Batch
- Single-threaded
- Real-time
- Static
Apache Spark offers an optimized engine that supports real-time computations. This capability enables faster data analytics by allowing Spark to process data as it arrives, making it suitable for real-time data processing and analytics tasks. This is a key advantage of Spark over traditional batch processing systems.
Which statistical measure represents the middle value in a dataset when it's ordered from least to greatest?
- Mean
- Mode
- Median
- Range
The median is the middle value in a dataset when it's ordered. It's a measure of central tendency that's not affected by extreme values (outliers). To find the median, you arrange the data in ascending order, and if there's an even number of values, it's the average of the two middle values.
Hybrid recommender systems combine the features of both _______ and _______ methods.
- Collaborative, Clustering
- Content-Based, Matrix Factorization
- Dimensionality Reduction, Anomaly Detection
- Neural Networks, Regression
Hybrid recommender systems leverage both collaborative filtering (user-user/item-item) and content-based methods to provide more accurate recommendations. Collaborative filtering focuses on user behavior, while content-based filtering considers item attributes.
Which statistical test is used to determine if there's a significant difference between the means of two independent groups?
- Chi-squared test
- T-test (independent samples)
- ANOVA (Analysis of Variance)
- Correlation test
The T-test for independent samples is used to determine if there is a significant difference between the means of two independent groups. It is commonly employed in hypothesis testing to compare means. The chi-squared test is used for testing the independence of categorical variables, ANOVA for comparing more than two group means, and the correlation test for measuring the strength and direction of a linear relationship.
In CNNs, the _______ layer is used to detect local features such as edges and textures.
- Convolutional
- Pooling
- Recurrent
- Fully Connected
The Convolutional layer in Convolutional Neural Networks (CNNs) is responsible for detecting local features in the input data, such as edges and textures. It does this by applying convolution operations across the input data, which allows the network to recognize spatial patterns in images or other structured data.
Which approach in recommender systems involves recommending items by finding users who are similar to the target user?
- Collaborative Filtering
- Content-Based Filtering
- Hybrid Filtering
- Matrix Factorization
Collaborative Filtering is a recommendation approach that identifies users similar to the target user based on their interactions and recommends items liked by those similar users. It relies on user-user similarity for recommendations.
Which algorithm is commonly used for predicting a continuous target variable?
- Decision Trees
- K-Means Clustering
- Linear Regression
- Naive Bayes Classification
Linear Regression is a commonly used algorithm for predicting continuous target variables. It establishes a linear relationship between the input features and the target variable, making it suitable for tasks like price prediction or trend analysis in Data Science.
In a data warehouse, the _________ table is used to store aggregated data at multiple levels of granularity.
- Fact
- Dimension
- Staging
- Aggregate
In a data warehouse, the "Fact" table is used to store aggregated data at various levels of granularity. These tables contain measures or metrics, which are essential for analytical queries and business intelligence reporting.
In a Hadoop ecosystem, which tool is primarily used for data ingestion from various sources?
- HBase
- Hive
- Flume
- Pig
Apache Flume is primarily used in the Hadoop ecosystem for data ingestion from various sources. It is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of data to Hadoop's storage or other processing components. Flume is essential for handling data ingestion pipelines in Hadoop environments.
In which scenario would Min-Max normalization be a less ideal choice for data scaling?
- When outliers are present
- When the data has a normal distribution
- When the data will be used for regression analysis
- When interpretability of features is crucial
Min-Max normalization can be sensitive to outliers. If outliers are present in the data, this scaling method can compress the majority of data points into a narrow range, making it less suitable for preserving the information in the presence of outliers. In scenarios where outliers are a concern, alternative scaling methods like Robust Scaling may be preferred.