An e-commerce platform wants to store user activities and interactions in real time. The data is unstructured, and the schema might evolve. Which database is best suited to this scenario?

  • Relational Database
  • Document Database
  • Event-Driven Database
  • Time-Series Database
A document database, such as MongoDB, is suitable for capturing and storing real-time user activities and interactions, especially when the data is unstructured and the schema might evolve over time. Document stores are schema-flexible, so records with different shapes can live in the same collection. (Note that "event-driven database" is not a standard database category; Apache Kafka, often mentioned in this context, is an event streaming platform rather than a database.)
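For illustration, a minimal sketch using the pymongo driver against a local MongoDB instance (the database, collection, and field names here are hypothetical):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["shop"]["user_events"]  # hypothetical database/collection names

# Two events with different shapes coexist in the same collection --
# no schema migration is needed when new fields appear.
events.insert_one({"user_id": 1, "action": "view", "product": "laptop"})
events.insert_one({"user_id": 2, "action": "click", "campaign": "summer-sale",
                   "device": {"os": "ios", "app_version": "3.2"}})
```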

What is the primary goal of Exploratory Data Analysis (EDA)?

  • Predict future trends and insights
  • Summarize and explore data
  • Build machine learning models
  • Develop data infrastructure
The primary goal of EDA is to summarize and explore data. It involves visualizing and understanding the dataset's main characteristics and relationships before diving into more advanced tasks, such as model building or predictions. EDA helps identify patterns and anomalies in the data.
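As a concrete example, a typical first pass at EDA with pandas might look like the sketch below (the file path and column name are placeholders):

```python
import pandas as pd

df = pd.read_csv("transactions.csv")       # hypothetical dataset

df.info()                                  # column types and missing values
print(df.describe())                       # summary statistics for numeric columns
print(df["amount"].value_counts().head())  # distribution of one column
print(df.corr(numeric_only=True))          # pairwise correlations of numeric columns
```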

What is the primary characteristic that differentiates Big Data from traditional datasets?

  • Volume
  • Velocity
  • Variety
  • Veracity
The primary characteristic that differentiates Big Data from traditional datasets is "Variety." Big Data often includes a wide range of data types, including structured, unstructured, and semi-structured data, making it more diverse than traditional datasets.

In the context of Data Science, what does the concept of "data-driven decision-making" primarily emphasize?

  • Making decisions based on intuition
  • Using data to inform decisions
  • Speeding up decision-making processes
  • Ignoring data when making decisions
"Data-driven decision-making" underscores the significance of using data to inform decisions. It implies that decisions should be backed by data and analysis rather than relying solely on intuition. This approach enhances the quality and reliability of decision-making.

Which metric is especially useful when the classes in a dataset are imbalanced?

  • Accuracy
  • Precision
  • Recall
  • F1 Score
Recall is particularly useful when dealing with imbalanced datasets because it measures the ability of a model to identify all relevant instances of a class. In such scenarios, accuracy can be misleading: a model that mostly predicts the majority class can score high accuracy while performing poorly on the minority class. Recall, also known as the true positive rate, focuses on capturing as many true positives as possible.
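A small scikit-learn sketch with synthetic labels shows how accuracy can look healthy while recall exposes the problem:

```python
from sklearn.metrics import accuracy_score, recall_score

# 10 samples, only 2 positives (the minority class)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A lazy model that almost always predicts the majority class
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print(accuracy_score(y_true, y_pred))  # 0.9 -- looks good
print(recall_score(y_true, y_pred))    # 0.5 -- half the positives were missed
```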

In time series forecasting, which method involves using past observations as inputs for predicting future values?

  • Regression Analysis
  • ARIMA (AutoRegressive Integrated Moving Average)
  • Principal Component Analysis (PCA)
  • k-Nearest Neighbors (k-NN)
ARIMA is a time series forecasting method that uses past observations to predict future values. It combines autoregressive (AR) terms, differencing (I), and moving average (MA) terms, making it well suited to time series data. The other options are not designed specifically for time series forecasting and do not model past observations in the same way.
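A minimal statsmodels sketch on a synthetic series (the (1, 1, 1) order is an arbitrary choice for illustration, not a recommendation):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic trending series standing in for real time series data
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(0.5, 1.0, size=100))

model = ARIMA(series, order=(1, 1, 1))  # (p, d, q): AR lags, differencing, MA lags
fit = model.fit()
print(fit.forecast(steps=5))            # predictions for the next five steps
```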

The process of organizing data to minimize redundancy and avoid undesirable characteristics like insertion, update, and deletion anomalies is called _______.

  • Data Duplication
  • Data Cleaning
  • Data Normalization
  • Data Validation
The process described is Data Normalization. It involves organizing data into tables and minimizing redundancy to ensure data integrity and prevent insertion, update, and deletion anomalies. It is a fundamental concept in relational database design that helps maintain data consistency and storage efficiency.
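To make this concrete, here is a small sketch using Python's built-in sqlite3 module, with made-up table names: customer details are stored once and referenced by key, so updating an email touches one row instead of every order.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Normalized design: no customer data is repeated on order rows,
# which avoids update anomalies.
con.executescript("""
CREATE TABLE customers (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    email TEXT NOT NULL
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    total       REAL NOT NULL
);
""")
```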

Regularization techniques add a _______ to the loss function to constrain the magnitude of the model parameters.

  • Weight penalty
  • Bias term
  • Learning rate
  • Activation function
Regularization techniques add a weight penalty term to the loss function to constrain the magnitude of the model parameters, preventing them from becoming excessively large. This helps prevent overfitting and improves the model's generalization. Common examples are the L1 (lasso) and L2 (ridge) penalties.
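A plain NumPy sketch of an L2 (ridge) penalty added to a mean squared error loss; the data, weights, and lam value are arbitrary:

```python
import numpy as np

def ridge_loss(w, X, y, lam=0.1):
    """MSE plus an L2 weight penalty: larger weights raise the loss."""
    residuals = X @ w - y
    mse = np.mean(residuals ** 2)
    penalty = lam * np.sum(w ** 2)  # the weight penalty term
    return mse + penalty

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
print(ridge_loss(np.array([0.5, -0.5]), X, y))  # 4.25 (MSE) + 0.05 (penalty)
```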

Which variant of RNN is specifically designed to combat the problem of vanishing and exploding gradients?

  • LSTM (Long Short-Term Memory)
  • GRU (Gated Recurrent Unit)
  • Bidirectional RNN
  • Simple RNN (Recurrent Neural Network)
Long Short-Term Memory (LSTM) is a variant of RNN designed to address the vanishing and exploding gradient problem. LSTMs use specialized gating mechanisms (input, forget, and output gates) that regulate information flow through the cell state, allowing the network to capture long-range dependencies that a simple RNN cannot.
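A minimal PyTorch sketch of an LSTM layer processing a batch of sequences (all dimensions here are arbitrary):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(4, 20, 8)      # batch of 4 sequences, 20 time steps, 8 features
output, (h_n, c_n) = lstm(x)   # the gated cell state carries long-range signal
print(output.shape)            # torch.Size([4, 20, 16])
```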

You are working on a fraud detection system where false negatives have a higher cost than false positives. Which metric would be most crucial to optimize?

  • Precision
  • Recall
  • F1 Score
  • Accuracy
In this scenario, minimizing false negatives is critical, as failing to detect fraud has a higher cost. Recall focuses on minimizing false negatives, making it the most crucial metric to optimize in this context. Precision matters, but the emphasis here is on avoiding missed fraud. The F1 Score balances precision and recall, so it does not specifically prioritize false negatives, and accuracy is unreliable when fraud cases are rare.
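In practice, one way to favor recall is to lower the decision threshold on the model's predicted fraud probability, accepting more false positives in exchange for fewer missed frauds. A small sketch with synthetic scores (the 0.3 threshold is an arbitrary example):

```python
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])  # 1 = fraud
scores = np.array([0.10, 0.40, 0.20, 0.45, 0.90, 0.30, 0.35, 0.05])

default = (scores >= 0.5).astype(int)  # standard threshold
lowered = (scores >= 0.3).astype(int)  # favor catching fraud

print(recall_score(y_true, default))   # 0.33 -- two frauds slip through
print(recall_score(y_true, lowered))   # 1.0  -- all frauds caught, more false alarms
```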