When classifying text data, the ________ method can be used to convert text into numerical format for analysis.

  • Bag-of-Words
  • Clustering
  • Normalization
  • Principal Component Analysis
The Bag-of-Words (BoW) method represents text as a numerical vector where each element corresponds to the frequency or presence of a word in the document. It is commonly used in text classification tasks.
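
Below is a minimal sketch of Bag-of-Words vectorization. It assumes scikit-learn, which the quiz does not name; its CountVectorizer is a standard BoW implementation, and the documents are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents (hypothetical example data)
docs = ["the cat sat on the mat", "the dog ate my homework"]

# Learn the vocabulary and turn each document into a word-count vector
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus
print(X.toarray())                         # one frequency vector per document
```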

How does the curse of dimensionality impact the K-Nearest Neighbors algorithm, and what are some ways to address this issue?

  • Enhances speed, addressed by increasing data size
  • Improves accuracy, addressed by adding more dimensions
  • Makes distance measures less meaningful, addressed by dimension reduction
  • Reduces accuracy, addressed by increasing K
The curse of dimensionality makes distance measures less meaningful in KNN: as the number of dimensions grows, distances between points become nearly uniform, so the "nearest" neighbors are barely closer than any other points. Dimensionality reduction techniques like PCA address this by projecting the data onto a smaller number of informative directions.
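
As an illustration of the fix, the sketch below (assuming scikit-learn and synthetic data) compares KNN on 100 raw dimensions against KNN run after projecting onto 10 PCA components.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Synthetic data: 100 features, only 10 of them informative
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=10, random_state=0)

# KNN on the raw 100-dimensional data
raw_knn = KNeighborsClassifier(n_neighbors=5)
print("raw KNN:  ", cross_val_score(raw_knn, X, y, cv=5).mean())

# Project onto 10 principal components before measuring distances
pca_knn = make_pipeline(PCA(n_components=10),
                        KNeighborsClassifier(n_neighbors=5))
print("PCA + KNN:", cross_val_score(pca_knn, X, y, cv=5).mean())
```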

In a situation where the assumption of linearity in Simple Linear Regression is violated, how would you proceed?

  • Continue Without Changes
  • Increase Sample Size
  • Remove Outliers
  • Use a Nonlinear Transformation
If linearity is violated, applying a nonlinear transformation (for example, a log or square-root transform) to the independent or dependent variable can help capture the underlying relationship while keeping the model linear in its parameters.
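
A small sketch, assuming scikit-learn and a made-up logarithmic relationship, showing how a log transform of the predictor can rescue a linear fit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(1, 100, size=200)
y = 3 * np.log(x) + rng.normal(scale=0.3, size=200)  # nonlinear ground truth

# Fit on the raw predictor vs. a log-transformed predictor
raw = LinearRegression().fit(x.reshape(-1, 1), y)
logged = LinearRegression().fit(np.log(x).reshape(-1, 1), y)

print("R^2 with raw x: ", raw.score(x.reshape(-1, 1), y))
print("R^2 with log(x):", logged.score(np.log(x).reshape(-1, 1), y))
```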

How does the use of the Gini Index compare to entropy in terms of computational efficiency in building a Decision Tree?

  • Both are equally efficient
  • Entropy is more computationally efficient
  • Gini Index is more computationally efficient
  • Neither is efficient
Gini Index is more computationally efficient because it does not involve calculating logarithms like entropy does. Although they often produce similar results, the Gini Index is generally preferred when computational resources are limited.
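
The difference is easy to see in code. A minimal NumPy sketch of both impurity measures (the class proportions are hypothetical):

```python
import numpy as np

def gini(p):
    # Gini impurity: 1 - sum(p_i^2); no logarithms needed
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # Shannon entropy: -sum(p_i * log2(p_i)); requires log calls
    p = p[p > 0]  # skip empty classes to avoid log(0)
    return -np.sum(p * np.log2(p))

p = np.array([0.7, 0.2, 0.1])  # class proportions at a candidate split
print("Gini:   ", gini(p))
print("Entropy:", entropy(p))
```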

Which type of regression would be suitable for predicting a continuous output?

  • Cluster Regression
  • K-Nearest Neighbors
  • Linear Regression
  • Logistic Regression
Linear Regression is suitable for predicting a continuous output, as it models the relationship between dependent and independent variables through a linear equation.
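
A minimal sketch, assuming scikit-learn and hypothetical housing data, of fitting a linear regression to a continuous target:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: square footage -> sale price (a continuous target)
sqft = np.array([[800], [1200], [1500], [2000], [2400]])
price = np.array([150_000, 210_000, 255_000, 330_000, 390_000])

model = LinearRegression().fit(sqft, price)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("predicted price for 1800 sqft:", model.predict([[1800]])[0])
```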

In Logistic Regression, what function is used to model the probability of the dependent variable?

  • Exponential function
  • Linear function
  • Polynomial function
  • Sigmoid function
Logistic Regression uses the Sigmoid function to model the probability of the dependent variable. It maps any real-valued input to a value between 0 and 1, which is ideal for binary classification.
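
A short NumPy sketch of the Sigmoid function and how it squashes real inputs into (0, 1):

```python
import numpy as np

def sigmoid(z):
    # Maps any real-valued input to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs approach 0; large positive inputs approach 1
for z in (-6, -2, 0, 2, 6):
    print(f"sigmoid({z:+d}) = {sigmoid(z):.4f}")
```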

What are the criteria for a point to be considered a core point in DBSCAN?

  • Being isolated from other clusters
  • Being the central point of a cluster
  • Being within Epsilon of at least MinPts other points
  • Having the minimum distance to all other points in a cluster
A point is considered a core point in DBSCAN if it has at least MinPts other points within its Epsilon neighborhood radius. This means it's part of a dense region and is central to the formation of a cluster, connecting other core or border points.
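
The sketch below, assuming scikit-learn, fits DBSCAN on toy data and reads back which points were classified as core points. Note that scikit-learn's min_samples counts the point itself, a slight variation on the "other points" phrasing above.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps plays the role of Epsilon; min_samples plays the role of MinPts
# (scikit-learn counts the point itself toward min_samples)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

print("core points: ", len(db.core_sample_indices_))
print("noise points:", np.sum(db.labels_ == -1))
```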

In a case where sparsity is important and you have highly correlated variables, which regularization technique might be most appropriate?

  • ElasticNet
  • Lasso
  • Ridge
ElasticNet combines the L1 penalty of Lasso, which drives some coefficients to exactly zero (sparsity), with the L2 penalty of Ridge, which handles correlated variables gracefully. Because pure Lasso tends to arbitrarily keep one variable from a correlated group and drop the rest, ElasticNet is the better fit when both sparsity and multicollinearity matter.
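
A minimal sketch, assuming scikit-learn and synthetic correlated data, of how ElasticNet's l1_ratio blends the two penalties while still zeroing out coefficients:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Synthetic data with correlated features (low effective rank)
# and a sparse ground truth: only 5 informative features out of 50
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       effective_rank=10, noise=5.0, random_state=0)

# l1_ratio blends the penalties: 1.0 is pure Lasso, 0.0 is pure Ridge
model = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
print("nonzero coefficients:", np.sum(model.coef_ != 0))
```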

You want to apply clustering to reduce the dimensionality of a dataset, but you also need to interpret the clusters easily. What approaches would you consider?

  • All of the Above
  • Hierarchical Clustering
  • K-Means
  • PCA with Clustering
Applying PCA (Principal Component Analysis) before clustering reduces dimensionality while keeping the clusters interpretable, since the principal components are the directions of maximum variance in the data, giving each reduced dimension a concrete meaning.
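
A minimal sketch of the PCA-then-cluster approach, assuming scikit-learn and its built-in Iris data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Project onto the top 2 directions of maximum variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_reduced)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("cluster sizes:", np.bincount(kmeans.labels_))
```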

How does Random Forest differ from a single decision tree?

  • Random Forest always performs worse
  • Random Forest focuses on one feature
  • Random Forest uses multiple trees and averages their predictions
  • Random Forest uses only one tree
Random Forest is an ensemble method that builds multiple decision trees and averages their predictions. Unlike a single decision tree, it typically offers higher accuracy and robustness by reducing overfitting through the combination of multiple trees' predictions.
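
A small sketch, assuming scikit-learn and its built-in breast-cancer dataset, comparing a single tree against a 100-tree forest by cross-validation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# The forest aggregates many decorrelated trees, which typically
# reduces variance compared with one deep tree
print("single tree:  ", cross_val_score(tree, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
```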