What is the process of transforming raw data into a format that makes it suitable for modeling called?
- Data Visualization
- Data Collection
- Data Preprocessing
- Data Analysis
Data Preprocessing is the process of cleaning, transforming, and organizing raw data to prepare it for modeling. It includes tasks such as handling missing values, feature scaling, and encoding categorical variables. This step is crucial in Data Science to ensure the quality of data used for analysis and modeling.
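Two of the preprocessing tasks mentioned above, feature scaling and categorical encoding, can be sketched in a few lines of plain Python (the values and categories here are illustrative; libraries such as scikit-learn provide production-ready versions):

```python
def min_max_scale(values):
    """Rescale numeric values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(categories):
    """Encode each category as a 0/1 indicator vector."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]

ages = [20, 30, 40]
print(min_max_scale(ages))                # [0.0, 0.5, 1.0]
print(one_hot(["red", "blue", "red"]))    # [[0, 1], [1, 0], [0, 1]]
```

Scaling keeps features on comparable ranges, and one-hot encoding turns category labels into numbers a model can consume.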
In Data Science, when dealing with large datasets that do not fit into memory, the Python library _______ can be a useful tool for efficient computations.
- NumPy
- Pandas
- Dask
- SciPy
When working with large datasets that do not fit into memory, the Python library "Dask" is a useful tool for efficient computations. Dask provides parallel and distributed computing capabilities, enabling data scientists to handle larger-than-memory datasets using familiar Python tools.
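The core idea behind Dask's out-of-core model can be illustrated in plain Python: process the data chunk by chunk so the full dataset never sits in memory at once. This is a toy stand-in for what `dask.dataframe` and `dask.array` do with real parallelism and task scheduling:

```python
def chunked_mean(chunks):
    """Compute a mean across an iterable of chunks without
    materializing the full dataset in memory."""
    total, count = 0.0, 0
    for chunk in chunks:            # each chunk fits in memory on its own
        total += sum(chunk)
        count += len(chunk)
    return total / count

# Simulate a dataset that arrives in three chunks.
stream = ([1, 2, 3], [4, 5], [6])
print(chunked_mean(iter(stream)))   # 3.5
```

Dask automates this pattern: for example, `dask.dataframe.read_csv` splits a large file into partitions and computes aggregations like means across them in parallel.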
Which layer type in a neural network is primarily responsible for feature extraction and spatial hierarchy?
- Input Layer
- Convolutional Layer
- Fully Connected Layer
- Recurrent Layer
Convolutional Layers in neural networks are responsible for feature extraction and learning spatial hierarchies, making them crucial in tasks such as image recognition. They slide small filters over the input data, capturing local patterns such as edges and textures in early layers and more abstract features in deeper ones.
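The filter operation a convolutional layer performs can be sketched as a plain-Python "valid" 2D cross-correlation. The tiny image and edge-detecting kernel below are illustrative:

```python
def conv2d(image, kernel):
    """'Valid' 2D cross-correlation: slide the kernel over the image,
    summing elementwise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = sum(image[i + a][j + b] * kernel[a][b]
                    for a in range(kh) for b in range(kw))
            row.append(s)
        out.append(row)
    return out

# A vertical-edge detector applied to an image with a dark/bright boundary.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[-1, 1],
          [-1, 1]]
print(conv2d(image, kernel))   # [[0, 2, 0], [0, 2, 0]]
```

The output responds strongly only where the dark-to-bright edge sits, which is exactly the "feature detection" role the layer plays; in a real network, the kernel weights are learned rather than hand-set.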
In time-series data, creating lag features involves using previous time steps as new _______.
- Predictors
- Observations
- Predictions
- Variables
In time-series analysis, creating lag features means using values from previous time steps as new predictors (features). This incorporates historical information into your model, which can be valuable for forecasting future values in time series data.
Which CNN architecture is known for its residual connections and improved training performance?
- LeNet
- VGGNet
- AlexNet
- ResNet
Residual Networks (ResNets) are known for their residual connections, which allow for easier training of very deep networks. ResNets have become a standard in deep learning due to their ability to mitigate the vanishing gradient problem, enabling the training of much deeper architectures.
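The residual connection can be sketched as a toy block in NumPy: the block computes a residual branch F(x) and adds the input back, so with small (or zero) weights it defaults to an identity mapping. This is a simplified illustration, not the full ResNet architecture (which uses convolutions and batch normalization):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Toy residual block: two linear layers plus a skip connection.
    Output is relu(x + F(x)), so the block only has to learn a residual."""
    fx = relu(x @ W1) @ W2        # the residual branch F(x)
    return relu(x + fx)           # skip connection adds the input back

# With zero weights the residual branch vanishes and the block passes
# its input straight through -- the easy-to-learn identity mapping that
# lets gradients flow through very deep stacks.
x = np.array([1.0, 2.0])
W_zero = np.zeros((2, 2))
print(residual_block(x, W_zero, W_zero))   # [1. 2.]
```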
In the context of outlier detection, what is the commonly used plot to visually detect outliers in a single variable?
- Box Plot
- Scatter Plot
- Histogram
- Line Chart
A Box Plot is a commonly used visualization for detecting outliers in a single variable. It displays the distribution of the data, and points falling outside the whiskers — typically more than 1.5 times the interquartile range (IQR) beyond the first or third quartile — are flagged as potential outliers.
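The whisker rule a box plot applies can be computed directly. The sketch below flags points outside Q1 − 1.5·IQR and Q3 + 1.5·IQR, using a simple linear-interpolation quantile (library implementations may use slightly different quantile conventions; the data values are illustrative):

```python
def iqr_outliers(values):
    """Flag points outside the box-plot whiskers:
    below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    s = sorted(values)

    def quantile(q):
        # simple linear-interpolation quantile
        pos = q * (len(s) - 1)
        lo = int(pos)
        frac = pos - lo
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + frac * (s[hi] - s[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo_fence or v > hi_fence]

data = [10, 11, 12, 12, 13, 14, 15, 95]
print(iqr_outliers(data))   # [95]
```

Here the bulk of the data sits between 10 and 15, so 95 lands well beyond the upper whisker and is flagged.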
Which step in the Data Science Life Cycle is concerned with cleaning the data and handling missing values?
- Data Exploration
- Data Collection
- Data Preprocessing
- Data Visualization
Data Preprocessing is the step in the Data Science Life Cycle that involves cleaning the data, handling missing values, and preparing it for analysis. This step is crucial for ensuring the quality and reliability of the data used in subsequent analysis.
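One of the missing-value strategies this step covers, mean imputation, can be sketched in a few lines (the `readings` values are illustrative):

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values --
    one of the simplest missing-value strategies."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

readings = [4.0, None, 6.0, 8.0]
print(fill_missing_with_mean(readings))   # [4.0, 6.0, 6.0, 8.0]
```

In pandas the equivalent is `series.fillna(series.mean())`; more sophisticated options (median, model-based imputation) follow the same pattern of filling gaps before analysis.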
What is the most common measure of central tendency, which calculates the average value of a dataset?
- Median
- Mode
- Mean
- Standard Deviation
The mean, also known as the average, is a common measure of central tendency. It's calculated by adding up all the values in the dataset and then dividing by the number of data points. The mean provides a sense of the "typical" value in the dataset.
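The calculation is direct — sum the values, divide by the count (the scores below are illustrative):

```python
def mean(values):
    """Sum the values and divide by how many there are."""
    return sum(values) / len(values)

scores = [4, 8, 6, 10]
print(mean(scores))   # (4 + 8 + 6 + 10) / 4 = 7.0
```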
In the context of binary classification, which metric calculates the ratio of true positives to the sum of true positives and false negatives?
- Precision-Recall Curve
- F1 Score
- True Positive Rate (Sensitivity)
- Specificity
The True Positive Rate, also known as Sensitivity or Recall, calculates the ratio of true positives to the sum of true positives and false negatives. It measures the model's ability to correctly identify positive cases. It is an important metric in binary classification evaluation.
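The ratio TP / (TP + FN) can be computed directly from label vectors (the example labels are illustrative):

```python
def true_positive_rate(y_true, y_pred):
    """TPR (recall) = TP / (TP + FN): the share of actual positives
    the model correctly identifies."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

y_true = [1, 1, 1, 0, 0, 1]   # 4 actual positives
y_pred = [1, 0, 1, 0, 1, 1]   # model catches 3 of them
print(true_positive_rate(y_true, y_pred))   # 3 / (3 + 1) = 0.75
```

Note that the false positive at position 5 does not affect the TPR — it would show up in precision or specificity instead.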
Which method for handling missing data involves using algorithms like k-NN to find similar records to impute the missing value?
- Mean imputation
- Median imputation
- k-NN imputation
- Mode imputation
k-NN imputation is a technique that uses the similarity of data points to fill in missing values. It finds the k records most similar to the one with missing data (based on their complete features) and replaces the missing value with one derived from those nearest neighbors, typically their mean. The other options are simpler imputation methods that ignore similarity between records.
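A toy sketch of the idea in plain Python — find the nearest rows on the complete columns and average their values in the missing column (the data is illustrative; scikit-learn's `KNNImputer` implements this robustly):

```python
def knn_impute(rows, target_idx, missing_col, k=2):
    """Impute rows[target_idx][missing_col] from the k rows closest
    on the remaining (complete) columns, averaging their values in
    the missing column."""
    target = rows[target_idx]

    def dist(row):
        # Euclidean distance over every column except the missing one.
        return sum((row[i] - target[i]) ** 2
                   for i in range(len(row)) if i != missing_col) ** 0.5

    neighbors = sorted(
        (r for j, r in enumerate(rows) if j != target_idx),
        key=dist)[:k]
    return sum(r[missing_col] for r in neighbors) / k

# Row 0 is missing column 1 (the 0.0 is just a placeholder).
rows = [[1.0, 0.0],
        [1.1, 10.0],
        [0.9, 12.0],
        [5.0, 50.0]]
print(knn_impute(rows, target_idx=0, missing_col=1))   # (10 + 12) / 2 = 11.0
```

The distant row `[5.0, 50.0]` is ignored, so the imputed value reflects only genuinely similar records — the property that distinguishes k-NN imputation from plain mean imputation.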