Data that has some organizational properties, but not a schema as strict as that of tables in a relational database, is termed _______.
- Unstructured Data
- Semi-Structured Data
- Raw Data
- Big Data
Data that has some organization but doesn't adhere to a strict tabular structure is known as "Semi-Structured Data." It includes formats such as JSON and XML, which use keys or tags to describe the data without enforcing a fixed schema on every record.
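As a small illustration of that idea, the sketch below parses two JSON records whose fields differ. The records and field names are made up for the example.

```python
import json

# Hypothetical records: both are valid JSON objects, but they do not share
# a fixed set of columns the way rows in a relational table would.
raw = '''
[
  {"id": 1, "name": "Ada", "email": "ada@example.com"},
  {"id": 2, "name": "Lin", "phones": ["555-0100", "555-0101"]}
]
'''

records = json.loads(raw)

# Keys give each record some structure, yet the set of keys varies per record.
field_sets = [sorted(record.keys()) for record in records]
print(field_sets)
```

Each record is self-describing through its keys, which is what separates it from unstructured text, but nothing forces the two records to have the same fields.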
While preparing data for a machine learning model, you realize that the 'Height' column has some missing values. Upon closer inspection, you find that these missing values often correspond to records where the 'Age' column has values less than 1 year. What might be a reasonable way to handle these missing values?
- Impute missing values with the mean height
- Impute missing values with 0
- Leave missing values as they are
- Impute missing values based on 'Age'
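In this case, it is reasonable to leave the missing values as they are. The values are not missing at random: they are concentrated in records where 'Age' is below 1 year. Imputing the mean height, which is dominated by older individuals, or imputing 0 would assign infants implausible heights and introduce bias. Imputing based on 'Age' is the main alternative, but it must be done carefully, because infant heights differ sharply from those of older children and adults, and any fill value is only as good as the infant heights actually observed. Depending on the context and dataset size, leaving the missing values untouched may be the best choice.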
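To make the trade-off concrete, here is a minimal sketch with made-up numbers. It compares the overall mean with per-age-group means, and shows one way to keep the gaps while flagging them. The rows, the `group` helper, and the two-way age split are assumptions for illustration only.

```python
from statistics import mean

# Hypothetical rows: (age in years, height in cm); None marks a missing height.
rows = [
    (0.5, None), (0.8, 68.0), (0.3, 60.0),
    (30.0, 170.0), (45.0, 180.0), (28.0, None),
]

def group(age):
    """Assumed two-way split matching the pattern described in the question."""
    return "infant" if age < 1 else "other"

# Mean over every observed height, regardless of age.
observed = [h for _, h in rows if h is not None]
overall_mean = mean(observed)

# Mean of the observed heights within each age group.
group_means = {}
for name in ("infant", "other"):
    values = [h for a, h in rows if h is not None and group(a) == name]
    group_means[name] = mean(values)

# Option 1: keep the gap but record it, so a model can use the missingness.
flagged = [(a, h, h is None) for a, h in rows]

# Option 2: fill each gap from its own age group instead of the overall mean.
imputed = [(a, h if h is not None else group_means[group(a)]) for a, h in rows]

print(overall_mean, group_means)
```

With these numbers the overall mean sits far from both group means, which is why filling an infant's height with it would distort the column.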
In Gradient Boosting, what is adjusted at each step to minimize the residual errors?
- Learning rate
- Number of trees
- Feature importance
- Maximum depth of trees
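Of the options listed, the intended answer is the learning rate (the first option). Strictly speaking, the learning rate is usually a hyperparameter fixed before training rather than something re-tuned at every step: at each step a new tree is fit to the residual errors of the current ensemble, and the learning rate scales how much that tree contributes to the prediction. A smaller learning rate makes the model learn more slowly, usually requires more trees, and often leads to better generalization by reducing the risk of overfitting.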
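The role of the learning rate can be sketched with a toy booster for squared error on one-dimensional data. This is a simplified illustration, not a production implementation: `fit_stump` and `boost` are hypothetical helpers, the weak learner is a single-split stump, and the data is made up.

```python
def fit_stump(xs, residuals):
    """Least-squares stump on 1-D inputs: best threshold plus a mean per side."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_trees=20, learning_rate=0.3):
    """Fit each stump to the current residuals and add a scaled copy of it."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    history = []                      # training MSE after each added stump
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return preds, history

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 5.2, 5.0, 5.1]
preds, history = boost(xs, ys)
print(history[0], history[-1])
```

Note that `learning_rate` stays constant throughout; what changes from step to step is the residual vector that the next stump is fit to.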
The exploding gradient problem in deep learning can be mitigated using the _______ technique, which clips the gradients if they exceed a certain value.
- Data Augmentation
- Learning Rate Decay
- Gradient Clipping
- Early Stopping
Gradient clipping is a technique used to mitigate the exploding gradient problem in deep learning. It caps the magnitude of the gradients during training, either element by element or by rescaling the whole gradient vector when its norm exceeds a threshold, so that a single oversized update cannot destabilize training.
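Two common variants can be sketched in a few lines of plain Python. The function names `clip_by_value` and `clip_by_norm` are chosen for this example and the gradients are plain lists; deep learning frameworks provide their own equivalents.

```python
import math

def clip_by_value(grads, limit):
    """Clamp each gradient component to the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]

def clip_by_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm; direction is kept."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)            # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

by_value = clip_by_value([3.0, -4.0, 0.5], limit=1.0)
by_norm = clip_by_norm([3.0, 4.0], max_norm=1.0)      # original norm is 5
untouched = clip_by_norm([0.3, 0.4], max_norm=1.0)    # original norm is 0.5
print(by_value, by_norm, untouched)
```

Clipping by norm preserves the direction of the update, while clipping by value can change it because each component is clamped independently.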
The process of adjusting the contrast or brightness of an image is termed _______ in image processing.
- Segmentation
- Normalization
- Histogram Equalization
- Enhancement
In image processing, adjusting the contrast or brightness of an image is termed "Enhancement." Image enhancement techniques improve the visual quality of an image by adjusting properties such as brightness and contrast. Histogram equalization is one specific enhancement technique, not the name for the general process.
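A simple form of enhancement is a linear point operation, out = contrast * pixel + brightness, clamped to the valid range. The sketch below assumes an 8-bit grayscale image stored as nested lists; the `adjust` helper and the sample pixel values are made up for illustration.

```python
def adjust(pixels, contrast=1.0, brightness=0):
    """Apply out = contrast * pixel + brightness, clamped to the 8-bit range."""
    return [
        [min(255, max(0, round(contrast * p + brightness))) for p in row]
        for row in pixels
    ]

image = [[0, 64], [128, 250]]

brighter = adjust(image, brightness=20)    # shift every pixel up by 20
stretched = adjust(image, contrast=1.5)    # scale every pixel by 1.5
print(brighter, stretched)
```

Values that would leave the 0 to 255 range are clamped, so very bright pixels saturate at 255.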
For machine learning model deployment in a production environment, which tool or language is often integrated due to its performance and scalability?
- Python
- R
- Java
- Kubernetes
Java is often integrated into production environments for machine learning model deployment because of its performance and scalability. It is known for its speed, robustness, and suitability for large-scale applications, and it is commonly used to build the APIs and services that serve models in real-time production systems. Python and R are more often used for model development. Kubernetes is a container orchestration tool rather than a language for implementing the serving logic.
The AUC-ROC curve is a performance measurement for classification problems at various _______ levels.
- Confidence
- Sensitivity
- Specificity
- Threshold
The AUC-ROC curve measures classification performance at various threshold levels. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity), with one point per classification threshold. Changing the threshold changes which predictions count as positive, and the AUC (area under the curve) summarizes performance across all thresholds in a single number between 0 and 1.
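The threshold sweep can be written out directly. In this sketch, `roc_points` and `auc` are hypothetical helper names, the labels and scores are made up, and the area is computed with the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per threshold taken from the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )

labels = [1, 1, 0, 1, 0, 0]              # 1 = positive class
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # model scores, higher = more positive

points = roc_points(labels, scores)
area = auc(points)
print(points, area)
```

For this data the area is 8/9: of the nine positive-negative pairs, eight have the positive example scored higher, which is the ranking interpretation of AUC.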
You are analyzing customer reviews for a product and want to automatically categorize each review as positive, negative, or neutral. Which NLP task would be most relevant for this purpose?
- Named Entity Recognition (NER)
- Text Summarization
- Sentiment Analysis
- Machine Translation
Sentiment Analysis is the NLP task most relevant for categorizing customer reviews as positive, negative, or neutral. It involves assessing the sentiment expressed in the text and assigning it to one of these categories based on the sentiment polarity. NER, Text Summarization, and Machine Translation serve different purposes and are not suitable for sentiment categorization.
Which algorithm splits the data into subsets while incrementally building an associated decision tree?
- K-Means Clustering
- Random Forest
- AdaBoost
- Gradient Boosting
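Of the options listed, the intended answer is Random Forest. It is an ensemble learning method that builds multiple decision trees and aggregates their results. Each tree is grown incrementally by recursively splitting its training data into smaller subsets, which is the decision-tree induction process the question describes. K-Means Clustering, by contrast, groups data into clusters and builds no tree at all.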
In MongoDB, the _______ operator can be used to test a regular expression against a string.
- $search
- $match
- $regex
- $find
In MongoDB, the $regex operator is used to test a regular expression against a string. It allows you to perform pattern matching on string fields in your documents. This is useful for querying and filtering data based on specific patterns or text matching requirements.
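As a rough sketch, a `$regex` filter document can be written as a Python dictionary. The field name `name`, the sample documents, and the `matches` helper are assumptions for illustration. The helper only mimics the match locally with Python's `re` module, whereas MongoDB evaluates `$regex` on the server with its own regular expression engine.

```python
import re

# Filter document as it would be passed to a MongoDB query, for example
# collection.find(query) in PyMongo. The field name "name" is hypothetical.
# "$options": "i" requests case-insensitive matching.
query = {"name": {"$regex": "^jo", "$options": "i"}}

# Local stand-in for illustration only, so the example runs without a server.
docs = [{"name": "John"}, {"name": "Major"}, {"name": "joanna"}]

def matches(doc, query):
    """Emulate a $regex filter on an in-memory dict (simplified)."""
    for field, cond in query.items():
        flags = re.IGNORECASE if "i" in cond.get("$options", "") else 0
        if not re.search(cond["$regex"], doc.get(field, ""), flags):
            return False
    return True

found = [d["name"] for d in docs if matches(d, query)]
print(found)
```

Here "Major" is excluded because the pattern is anchored to the start of the string with `^`.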