When handling outliers in a dataset with skewed distributions, which measure of central tendency is preferred for imputation?

  • Mean
  • Median
  • Mode
  • Geometric Mean
For skewed distributions, the median is the preferred measure for imputation. Unlike the mean, it is robust to extreme values, so a few outliers will not drag the imputed value toward the tail of the distribution. Imputing with the median therefore preserves the central tendency of the data even in the presence of outliers.
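As a minimal sketch (assuming pandas is available; the `income` column and its values are hypothetical), median imputation is a one-liner with `fillna`:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed column with missing values and one extreme outlier
df = pd.DataFrame({"income": [32_000, 35_000, 38_000, np.nan, 41_000, 1_200_000, np.nan]})

median_income = df["income"].median()   # 38000.0 -- unaffected by the 1,200,000 outlier
mean_income = df["income"].mean()       # pulled far toward the tail, which is why it is avoided

df["income"] = df["income"].fillna(median_income)
print(median_income, mean_income)
```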

Which role in Data Science primarily focuses on collecting, storing, and processing large datasets efficiently?

  • Data Scientist
  • Data Engineer
  • Data Analyst
  • Machine Learning Engineer
Data Engineers are responsible for the efficient collection, storage, and processing of data. They create the infrastructure necessary for Data Scientists and Analysts to work with data effectively.

When a dataset has values ranging from 0 to 1000 in one column and 0 to 1 in another column, which transformation can be used to scale them to a similar range?

  • Normalization
  • Log Transformation
  • Standardization
  • Min-Max Scaling
Min-Max Scaling rescales each feature to a fixed range (typically 0 to 1), so the 0–1000 column and the 0–1 column end up on a comparable scale. This keeps variables with large numeric ranges from dominating the analysis simply because of their units.
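A short sketch with scikit-learn (assumed to be installed), using two hypothetical columns that mimic the ranges in the question:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales: roughly 0-1000 and 0-1
X = np.array([[   0.0, 0.01],
              [ 250.0, 0.25],
              [ 500.0, 0.50],
              [1000.0, 0.99]])

# Min-Max Scaling: x' = (x - min) / (max - min), applied per column
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)

print(X_scaled)  # both columns now lie in [0, 1]
```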

For datasets with multiple features, EDA often involves dimensionality reduction techniques like PCA to visualize data in two or three _______.

  • Planes
  • Points
  • Dimensions
  • Directions
Exploratory Data Analysis (EDA) often employs dimensionality reduction techniques such as Principal Component Analysis (PCA) to project high-dimensional data into a lower-dimensional space, typically two or three dimensions, where it can be plotted and inspected directly.
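A minimal sketch with scikit-learn's PCA on synthetic data (the data and shapes here are assumptions for illustration only):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # 200 samples, 10 features

# Project the 10-dimensional data down to 2 dimensions for plotting
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                        # (200, 2)
print(pca.explained_variance_ratio_)     # variance captured by each component
```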

What is the primary goal of tokenization in NLP?

  • Removing stop words
  • Splitting text into words
  • Extracting named entities
  • Translating text to other languages
The primary goal of tokenization in NLP is to split text into words or tokens, the basic units that downstream models operate on. This step underpins most NLP tasks, including text analysis, language modeling, and information retrieval.
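A minimal illustration of word-level tokenization using a plain regular expression (production NLP libraries such as NLTK or spaCy ship more careful tokenizers; this is only a sketch):

```python
import re

text = "Tokenization splits text into words, punctuation aside."

# Naive word tokenizer: keep runs of letters, digits, and apostrophes
tokens = re.findall(r"[A-Za-z0-9']+", text)

print(tokens)
# ['Tokenization', 'splits', 'text', 'into', 'words', 'punctuation', 'aside']
```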

For models with a large number of layers, which technique helps reduce internal covariate shift and accelerates training?

  • Stochastic Gradient Descent (SGD) with a small learning rate
  • Batch Normalization
  • L1 Regularization
  • DropConnect
Batch Normalization is a technique used to improve the training of deep neural networks. It addresses the internal covariate shift problem by normalizing the activations of each layer. This helps in accelerating training and allows for the use of higher learning rates without the risk of divergence. It also aids in better gradient flow.
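To make the mechanics concrete, here is a NumPy sketch of what a batch-normalization layer computes at training time (framework layers such as PyTorch's `nn.BatchNorm1d` additionally learn `gamma` and `beta` and track running statistics for inference; the values below are assumptions):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize activations over the batch dimension, then scale and shift."""
    mean = x.mean(axis=0)                   # per-feature mean over the batch
    var = x.var(axis=0)                     # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Activations for a batch of 4 samples with 3 features each
x = np.array([[1.0, 50.0, 0.1],
              [2.0, 60.0, 0.2],
              [3.0, 55.0, 0.3],
              [4.0, 65.0, 0.4]])

gamma = np.ones(3)   # learnable scale, initialized to 1
beta = np.zeros(3)   # learnable shift, initialized to 0

out = batch_norm(x, gamma, beta)
print(out.mean(axis=0))  # ~0 per feature
print(out.std(axis=0))   # ~1 per feature
```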

Which type of database is ideal for handling hierarchical data and provides better scalability, MongoDB or MySQL?

  • MongoDB
  • MySQL
  • Both MongoDB and MySQL
  • Neither MongoDB nor MySQL
MongoDB is a NoSQL document database that handles hierarchical (nested) data naturally and scales more easily than MySQL for unstructured workloads. Documents are stored in BSON (Binary JSON), so complex nested structures can be kept in a single document rather than flattened across multiple relational tables, which makes MongoDB a good choice when flexibility and scalability matter.
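As an illustrative sketch (assuming a local MongoDB instance and the `pymongo` driver; the database, collection, and field names are hypothetical), a nested employee record can be stored as one document instead of being split across several relational tables:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local instance
db = client["company"]

# Hierarchical data stored directly as a single nested document
employee = {
    "name": "Asha Rao",
    "role": "Data Engineer",
    "skills": ["Python", "Spark", "Airflow"],
    "address": {"city": "Pune", "country": "India"},
    "projects": [
        {"name": "ETL pipeline", "year": 2023},
        {"name": "Data lake migration", "year": 2024},
    ],
}

db["employees"].insert_one(employee)
print(db["employees"].find_one({"name": "Asha Rao"}))
```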

A company uses an AI model for recruitment, and it's observed that the model is selecting more male candidates than female candidates for a tech role, even when both genders have similar qualifications. What ethical concern does this scenario highlight?

  • Data bias in AI
  • Lack of transparency in AI
  • Data security and privacy issues in AI
  • Ethical AI governance and accountability
This scenario highlights the ethical concern of "Data bias in AI." The AI model's biased selection towards male candidates indicates that the training data may be biased, leading to unfair and discriminatory outcomes. Addressing data bias is essential to ensure fairness and diversity in AI-driven recruitment.

Which algorithm would you use when you have a mix of input features (both categorical and continuous) and you need to ensure interpretability of the model?

  • Random Forest
  • Support Vector Machines (SVM)
  • Neural Networks
  • Naive Bayes Classifier
Among the listed options, Random Forest is the most suitable for a mix of categorical and continuous features. It is an ensemble of decision trees and exposes feature-importance scores, which provide a degree of interpretability (and a basis for feature selection) that SVMs and neural networks do not offer out of the box.
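A brief sketch with pandas and scikit-learn on a tiny hypothetical dataset: the categorical column is one-hot encoded, and the fitted forest's `feature_importances_` give a simple interpretability signal:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data with one categorical and one continuous feature
df = pd.DataFrame({
    "department": ["sales", "tech", "tech", "sales", "hr", "tech"],
    "experience_years": [1.0, 5.0, 7.0, 2.0, 3.0, 6.0],
    "promoted": [0, 1, 1, 0, 0, 1],
})

# One-hot encode the categorical feature so the forest can use it
X = pd.get_dummies(df[["department", "experience_years"]], columns=["department"])
y = df["promoted"]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Feature importances offer a coarse view of what drives the predictions
for name, importance in zip(X.columns, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```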

In a relational database, what is used to ensure data integrity across multiple tables?

  • Primary Key
  • Foreign Key
  • Index
  • Trigger
A Foreign Key is used in a relational database to ensure data integrity across tables by creating a link between them. It enforces referential integrity: values in the referencing table must match existing values in the referenced table. A Primary Key uniquely identifies records within a single table; Indexes speed up lookups and Triggers automate actions, but neither enforces cross-table integrity.
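A small, self-contained sketch using Python's built-in `sqlite3` module (the table and column names are made up for illustration); the foreign key stops an order from referencing a customer that does not exist:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when enabled

conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        FOREIGN KEY (customer_id) REFERENCES customers(id)
    )
""")

conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Asha')")
conn.execute("INSERT INTO orders (id, customer_id) VALUES (10, 1)")   # valid reference

try:
    # Fails: there is no customer with id = 99
    conn.execute("INSERT INTO orders (id, customer_id) VALUES (11, 99)")
except sqlite3.IntegrityError as exc:
    print("Rejected by foreign key constraint:", exc)
```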