Which method for handling missing data involves using algorithms like k-NN to find similar records to impute the missing value?

Mean imputation
Median imputation
k-NN imputation
Mode imputation

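k-NN imputation fills a missing value by looking at the records most similar to the incomplete one. It finds the k nearest neighbors using the features that are present, then replaces the missing value with an aggregate of the neighbors' values, typically the mean for numeric features or the most frequent value for categorical ones. The other options fill every gap in a column with a single column-wide statistic and ignore similarity between records.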
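The idea can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function name `knn_impute`, the toy data, and the choice of Euclidean distance over the observed columns are all assumptions made for the example, and it assumes at least one complete row exists.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the column mean over the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    filled = []
    for row in rows:
        if None not in row:
            filled.append(list(row))
            continue
        # Only columns observed in this row can be used to measure similarity.
        observed = [j for j, v in enumerate(row) if v is not None]

        def dist(other):
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in observed))

        neighbors = sorted(complete, key=dist)[:k]
        filled.append([
            v if v is not None else sum(n[j] for n in neighbors) / len(neighbors)
            for j, v in enumerate(row)
        ])
    return filled

data = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [9.0, 9.0, 50.0],
    [1.05, 2.05, None],   # third value is missing
]
result = knn_impute(data, k=2)
print(result[3])
```

The incomplete row sits close to the first two rows and far from the third, so its missing value is filled from the first two only. Mean imputation would instead use all three rows and be pulled toward 50.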


In recurrent neural networks (RNNs), which variant is designed specifically to handle long-term dependencies by maintaining a cell state?

LSTM (Long Short-Term Memory)
GRU (Gated Recurrent Unit)
SRU (Simple Recurrent Unit)
ESN (Echo State Network)

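Long Short-Term Memory (LSTM) is an RNN variant designed to handle long-term dependencies by maintaining a cell state, a separate memory path whose contents are controlled by forget, input, and output gates. Because the cell state can carry information across many time steps with little change, LSTMs are well suited to tasks involving long sequences. GRUs also use gating to address long-term dependencies, but they fold memory into the hidden state rather than keeping a separate cell state.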
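The role of the cell state can be seen in a single-unit LSTM step written in plain Python. This is a sketch under simplifying assumptions: scalar weights instead of matrices, and hand-picked (not learned) parameters in the hypothetical `params` dictionary, chosen so that the forget gate stays near 1 and the input gate near 0.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM; p maps gate name -> (w_x, w_h, bias)."""
    def pre(gate):
        w_x, w_h, b = p[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much of the new candidate to write
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # candidate value to write
    c = f * c_prev + i * g           # cell state update
    h = o * math.tanh(c)             # hidden state
    return h, c

# Hand-picked parameters: forget gate close to 1, input gate close to 0.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = 0.0, 1.0
for x in [0.5] * 50:
    h, c = lstm_step(x, h, c, params)
print(c)
```

With these gate settings the cell state that started at 1.0 is still close to 1.0 after 50 steps, which is the behavior that lets an LSTM retain information over long sequences.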


Which metric provides a single score that balances the trade-off between precision and recall?

F1 Score
Accuracy
ROC AUC
Log Loss

The F1 Score is the harmonic mean of precision and recall, so it gives a single number that balances the two. Precision is the fraction of predicted positives that are actually positive, and recall is the fraction of actual positives that the model finds. Because the harmonic mean is dominated by the smaller of the two values, a model scores well only when both are reasonably high. This makes F1 especially useful on imbalanced datasets, where accuracy can look good even if the positive class is mostly missed.
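A short worked example, computed from raw counts. The function name `f1_score` and the counts are made up for illustration; the convention of returning 0.0 when a denominator is zero is an assumption, not a universal rule.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# 8 true positives, 2 false positives, 8 false negatives:
# precision = 8/10 = 0.8, recall = 8/16 = 0.5
f1 = f1_score(tp=8, fp=2, fn=8)
print(round(f1, 3))  # 0.615
```

The arithmetic mean of 0.8 and 0.5 would be 0.65; the harmonic mean is lower because it is pulled toward the weaker of the two values.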


An AI startup with limited computational resources is building an image classifier. They don't have the capability to train a deep neural network from scratch. What approach can they use to leverage the capabilities of deep learning without the extensive training time?

Transfer learning
Reinforcement learning
Genetic algorithms
Random forest classifier

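Transfer learning allows the startup to use a pre-trained deep neural network (e.g., a CNN pre-trained on a large image dataset) as a starting point. Typically most of the pre-trained layers are kept frozen and only a new classification head, or the last few layers, is trained on the startup's own images. This significantly reduces training time and computational cost while still benefiting from the features a deep network has already learned.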


A common architecture for real-time data processing involves using ________ to ingest and process streaming data.

Hadoop
Spark
Batch Processing
Data Lakes

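Of the options listed, Apache Spark is the one commonly used to ingest and process streaming data. Through Spark Streaming and Structured Streaming it processes incoming data in small micro-batches with low latency. Hadoop MapReduce is batch-oriented, while batch processing and data lakes describe a processing style and a storage pattern rather than a streaming engine.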


In a skewed distribution, which measure of central tendency is most resistant to the effects of outliers?

Mean
Median
Mode
Geometric Mean

The median is the most resistant measure of central tendency in a skewed distribution. Because it is the middle value of the ordered data, a few extreme values barely move it. The mean and the geometric mean are both computed from every value, so outliers pull them toward the tail. The mode is not shifted by outliers either, but it only reports the most frequent value and often does not reflect where the center of a skewed distribution lies.
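A quick check with Python's standard `statistics` module shows the difference. The numbers are invented for illustration.

```python
import statistics

incomes = [30, 32, 35, 38, 40]
with_outlier = incomes + [1000]   # one extreme value added

mean_before = statistics.mean(incomes)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(incomes)
median_after = statistics.median(with_outlier)

print(mean_before, round(mean_after, 2))   # 35 195.83
print(median_before, median_after)         # 35 36.5
```

A single outlier moves the mean from 35 to about 196, while the median moves only from 35 to 36.5.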


What is a common technique to prevent overfitting in linear regression models?

Increasing the model complexity
Reducing the number of features
Regularization
Using a smaller training dataset

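Regularization is a common technique used to prevent overfitting in linear regression models. It adds a penalty on the size of the coefficients to the cost function, which discourages overly complex models. L1 (Lasso) regularization penalizes the sum of the absolute values of the coefficients and can drive some of them to exactly zero; L2 (Ridge) regularization penalizes the sum of their squares and shrinks them toward zero. Increasing model complexity or using a smaller training dataset would make overfitting worse, not better.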
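The shrinking effect of an L2 penalty can be seen in the simplest possible case: one feature and no intercept. Minimizing sum((y - w*x)^2) + lam * w^2 gives the closed form w = sum(x*y) / (sum(x^2) + lam). The function name `ridge_slope` and the data below are assumptions made for this sketch.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge solution for a one-feature model y = w * x (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

w_ols = ridge_slope(xs, ys, lam=0)      # lam = 0 is ordinary least squares
w_ridge = ridge_slope(xs, ys, lam=10)   # a larger penalty shrinks the slope
print(round(w_ols, 4), round(w_ridge, 4))
```

With `lam=0` the slope is the ordinary least-squares estimate; raising `lam` enlarges the denominator and pulls the coefficient toward zero.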


In which type of data do you often encounter a mix of structured tables and unstructured text?

Structured Data
Semi-Structured Data
Unstructured Data
Multivariate Data

Semi-structured data often contains a mix of organized, table-like elements and unstructured text. It does not follow a rigid relational schema, but it carries some structure of its own, usually through tags or keys, and can combine organized data elements with more free-form content. This makes it suitable for a wide range of use cases, such as web data (JSON, XML, HTML) and documents stored in NoSQL databases.


Which technique considers the spread of data points around the median to identify outliers?

Box Plot
Z-Score (Standardization)
One-Hot Encoding
K-Means Clustering

The Box Plot, also known as a box-and-whisker plot, shows the spread of data points around the median and helps identify outliers based on the interquartile range (IQR). By the usual convention, the whiskers extend to the most extreme points within 1.5 × IQR of the quartiles, and points beyond them are flagged as outliers. The Z-Score can also be used to flag outliers, but it measures distance from the mean in units of standard deviation, not spread around the median. One-Hot Encoding is used to encode categorical variables, and K-Means Clustering is a clustering technique rather than a method built around the median.
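The rule a box plot applies can be sketched with the standard `statistics` module. The helper name `iqr_outliers`, the 1.5 multiplier (the common convention), and the sample data are assumptions for this example; different quartile methods can give slightly different fences.

```python
import statistics

def iqr_outliers(values, factor=1.5):
    """Return the values lying beyond factor * IQR from the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]

data = [10, 12, 12, 13, 14, 15, 16, 18, 95]
outliers = iqr_outliers(data)
print(outliers)  # [95]
```

Only 95 falls outside the fences; the remaining values lie within the whiskers.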


In Big Data processing, ________ operations filter and sort data, while ________ operations perform aggregations and transformations.

Map, Reduce
Filter, Join
Shuffle, Merge
Merge, Filter

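In Big Data processing, the first blank should be filled with "Map," and the second blank with "Reduce." In the MapReduce model, the map phase processes each input record, filtering it and emitting key-value pairs that are then sorted and grouped by key. The reduce phase takes each group and performs the aggregation, such as a count or a sum, to produce the final result. Filter, join, shuffle, and merge are individual operations that can occur within such a pipeline, but none of the other pairs matches this division of work.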
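For reference, the MapReduce pattern, in which a map phase emits and sorts key-value pairs and a reduce phase aggregates them, can be imitated on a single machine in plain Python. This is a toy word count, not how a framework such as Hadoop is actually invoked; the names `map_phase` and `counts` are made up for the sketch.

```python
from functools import reduce
from itertools import groupby

lines = ["spark hadoop spark", "hadoop hive", "spark"]

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

# Apply the map function to every input record.
mapped = [pair for line in lines for pair in map_phase(line)]

# Shuffle/sort: bring pairs with the same key next to each other.
mapped.sort(key=lambda kv: kv[0])

# Reduce: aggregate the values of each key into a single count.
counts = {
    key: reduce(lambda acc, kv: acc + kv[1], group, 0)
    for key, group in groupby(mapped, key=lambda kv: kv[0])
}
print(counts)  # {'hadoop': 2, 'hive': 1, 'spark': 3}
```

In a real cluster the map and reduce functions run in parallel on many machines and the framework handles the sorting and grouping between them.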

