________ is a NoSQL database that is designed for horizontal scalability and distributed architecture.

  • Cassandra
  • Couchbase
  • MongoDB
  • Redis
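Cassandra is a NoSQL wide-column store designed for horizontal scalability and a distributed, masterless architecture: every node can serve reads and writes, so capacity grows by adding commodity servers. It is well suited to handling large amounts of data across many such servers. MongoDB, Redis, and Couchbase are also NoSQL databases, but they are built around different data models and design priorities (for example, MongoDB and Couchbase are document stores, and Redis is primarily an in-memory key-value store).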

Which data type would be most appropriate for storing a person's phone number in a database?

  • Boolean
  • Float
  • Integer
  • String
Storing a person's phone number typically requires a text data type such as String, since phone numbers can include characters like plus signs, dashes, and parentheses, and may begin with leading zeros that a numeric type would drop. Integer is used for whole numbers, Float for decimal numbers, and Boolean for true/false values.
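
As a small illustration of why a numeric type is a poor fit, the sketch below converts a phone number to an integer and back. The number itself is a made-up example, and stripping everything except digits is only an assumed "naive" storage approach, not a recommendation.

```python
# Hypothetical phone number, used only for illustration.
phone_text = "020 7946 0958"

# Keep only the digits, as a naive "store it as a number" approach would.
digits = "".join(ch for ch in phone_text if ch.isdigit())
phone_as_int = int(digits)

# Converting back to text shows that the leading zero has been lost.
round_tripped = str(phone_as_int)

print(digits)         # 02079460958
print(round_tripped)  # 2079460958
```

Keeping the value as a string preserves the leading zero and any formatting characters exactly as entered.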

The dplyr function _______ is used for filtering rows based on a condition.

  • filter
  • mutate
  • select
  • summarize
In the dplyr package, the filter function is used for filtering rows based on a specified condition. It allows you to subset the data based on logical conditions, making it a powerful tool for data manipulation.

In Git, the command to view the history of changes along with their details is 'git _______'.

  • details
  • history
  • log
  • show
The correct command is 'git log'. It displays the commit history, including commit hashes, authors, dates, and commit messages. By contrast, 'git show' displays the details of a single commit or other object rather than the full history, and neither 'git history' nor 'git details' is the command for viewing the commit log.

What advanced technique is often used for automated feature selection in large datasets?

  • K-Means Clustering
  • Mean Imputation
  • One-Hot Encoding
  • Recursive Feature Elimination (RFE)
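Recursive Feature Elimination (RFE) is an advanced technique used for automated feature selection in large datasets. It fits a model, ranks the features by importance (for example, by coefficients or feature importances), removes the least important ones, and repeats the process on the remaining features until the desired number is left.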

What is the purpose of YARN in the Hadoop ecosystem?

  • YARN (Yet Another Resource Negotiator) manages resources and schedules tasks for applications in the Hadoop ecosystem.
  • YARN is a storage layer in Hadoop for storing large datasets.
  • YARN is an alternative to HDFS for data storage.
  • YARN is primarily used for data visualization in Hadoop.
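YARN (Yet Another Resource Negotiator) is the resource management layer of the Hadoop ecosystem. It allocates cluster resources and schedules application tasks, decoupling resource management from the data processing engines so that multiple applications (such as MapReduce and Spark) can share a cluster efficiently. It is not a storage layer; storage is handled by HDFS.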

Which algorithm is typically used for sorting smaller lists due to its simplicity and ease of understanding?

  • Binary Search
  • Bubble Sort
  • Merge Sort
  • Quick Sort
Bubble Sort is often used for sorting small lists because it is simple and easy to understand. However, its O(n^2) average and worst-case running time makes it inefficient for larger datasets, where Quick Sort and Merge Sort perform better (O(n log n) on average). Binary Search is not a sorting algorithm at all; it finds an item in a list that is already sorted.
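
A minimal sketch of bubble sort follows, to show how little code the algorithm needs. The function name `bubble_sort` and the early-exit `swapped` flag are illustrative choices, not part of the question.

```python
def bubble_sort(items):
    """Return a new list sorted in ascending order using bubble sort."""
    result = list(items)  # copy so the caller's list is not modified
    n = len(result)
    # After each pass, the largest remaining value has "bubbled" to index `end`.
    for end in range(n - 1, 0, -1):
        swapped = False
        for i in range(end):
            if result[i] > result[i + 1]:
                # Swap adjacent items that are out of order.
                result[i], result[i + 1] = result[i + 1], result[i]
                swapped = True
        if not swapped:
            break  # no swaps in a full pass means the list is sorted
    return result


print(bubble_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```

The nested loops are what make the running time grow quadratically with the length of the list.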

Which Big Data technology is best suited for distributed storage and processing of large data sets across clusters of computers?

  • Apache Flink
  • Apache Hive
  • Apache Kafka
  • Hadoop Distributed File System (HDFS)
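Hadoop Distributed File System (HDFS) is specifically designed for distributed storage of large datasets across clusters of computers, and processing frameworks such as MapReduce run on top of it, close to where the data is stored. This makes it the best fit among the options: Kafka is an event streaming platform, Hive is a SQL query layer that runs on top of Hadoop storage, and Flink is a processing engine that relies on external storage.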

In a situation where data consistency is critical, what feature of a DBMS should be prioritized?

  • ACID Compliance
  • Indexing
  • Query Performance
  • Sharding
Data consistency is ensured by ACID (Atomicity, Consistency, Isolation, Durability) compliance. ACID compliance guarantees that database transactions are processed reliably and consistently, which is crucial in scenarios where data consistency is a top priority.
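
The atomicity and consistency parts of ACID can be sketched with Python's built-in `sqlite3` module. The `accounts` table, the account names, and the `transfer` helper below are hypothetical and exist only for this illustration; the `CHECK` constraint stands in for a business rule that balances must never go negative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


# The debit would violate the CHECK constraint, so the credit is undone too.
try:
    transfer(conn, "alice", "bob", 500)
except sqlite3.IntegrityError:
    pass
after_failed = balances(conn)
print(after_failed)  # {'alice': 100, 'bob': 50}

# A valid transfer is applied in full.
transfer(conn, "alice", "bob", 30)
after_ok = balances(conn)
print(after_ok)  # {'alice': 70, 'bob': 80}
```

Without the transaction, the failed transfer would have left bob credited with money that was never debited from alice.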

In SQL, how do you select all columns from a table named 'Customers'?

  • SELECT * FROM Customers
  • SELECT ALL FROM Customers
  • SELECT COLUMNS FROM Customers
  • SELECT DATA FROM Customers
To select all columns from a table named 'Customers' in SQL, you use the syntax: SELECT * FROM Customers. The asterisk (*) is a wildcard character that represents all columns.
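
The query can be tried against an in-memory SQLite database using Python's built-in `sqlite3` module. The column names and sample rows below are assumptions made up for this sketch; only the table name 'Customers' comes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data, for illustration only.
conn.execute(
    "CREATE TABLE Customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO Customers (name, city) VALUES (?, ?)",
    [("Ada", "London"), ("Linus", "Helsinki")],
)

# The asterisk expands to every column of the table, in table order.
cursor = conn.execute("SELECT * FROM Customers")
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(columns)  # ['id', 'name', 'city']
print(rows)
```

Each returned row contains one value per column, without the query having to name the columns explicitly.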

In regression analysis, the _______ measures the strength and direction of a linear relationship between two variables.

  • Correlation Coefficient
  • Intercept
  • R-squared
  • Slope
In regression analysis, the correlation coefficient measures the strength and direction of a linear relationship between two variables. It ranges from -1 to 1, where 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.
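
The three cases described above can be checked with a short sketch of the Pearson correlation coefficient. The helper name `pearson_r` and the sample data are illustrative assumptions.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


xs = [-2, -1, 0, 1, 2]
r_positive = pearson_r(xs, [2 * x + 1 for x in xs])   # perfect positive line
r_negative = pearson_r(xs, [-3 * x + 4 for x in xs])  # perfect negative line
r_none = pearson_r(xs, [x * x for x in xs])           # symmetric curve
```

Here `r_positive` is 1 and `r_negative` is -1 (up to floating-point rounding), while `r_none` is 0 even though y depends entirely on x, which shows that the coefficient only captures linear relationships.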

How does a percentile differ from a quartile in statistical terms?

  • A percentile divides the data set into 100 equal parts, while a quartile divides it into four parts
  • A percentile is the middle value of the data set, while a quartile is the average of the first and third quartiles
  • A percentile is the range between the maximum and minimum values, while a quartile is the range between the first and third quartiles
  • A percentile represents the median of the data set, while a quartile represents the mean
Percentiles divide the ordered data set into 100 equal parts, while quartiles divide it into four. Percentiles are therefore more granular and give a more detailed view of the data distribution; the three quartiles correspond to the 25th, 50th, and 75th percentiles.
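
The relationship between the two can be sketched with `statistics.quantiles` from the Python standard library, which returns the cut points that split the data into `n` parts. The sample data is an arbitrary assumption, and the function's default interpolation method is used.

```python
from statistics import quantiles

data = list(range(1, 12))  # the values 1 through 11

quartiles = quantiles(data, n=4)      # 3 cut points -> 4 parts
percentiles = quantiles(data, n=100)  # 99 cut points -> 100 parts

print(len(quartiles), len(percentiles))  # 3 99
print(quartiles)                         # [3.0, 6.0, 9.0]

# The 25th, 50th, and 75th percentiles sit at indexes 24, 49, and 74.
matching = [percentiles[24], percentiles[49], percentiles[74]]
print(matching)                          # [3.0, 6.0, 9.0]
```

The quartiles are simply three of the 99 percentile cut points, which is why percentiles give the finer-grained view.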