A common measure of performance in the multi-armed bandit problem is the cumulative ________ over time.
- Rewards
- Q-values
- States
- Actions
Cumulative reward over time is a common measure of performance in the multi-armed bandit problem, because the agent's goal is to maximize the total reward collected across all of its arm pulls.
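To make the answer concrete, here is a minimal sketch of how cumulative reward might be tracked while playing a bandit. It assumes Bernoulli arms and an epsilon-greedy strategy; neither is specified by the quiz. The function name `run_epsilon_greedy`, the arm probabilities, and the parameter values are hypothetical choices for illustration only.

```python
import random


def run_epsilon_greedy(true_means, steps=1000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit and return the running cumulative reward.

    Illustrative assumptions: each arm pays 1 with probability
    true_means[arm] and 0 otherwise, and arms are chosen epsilon-greedily.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # number of pulls per arm
    estimates = [0.0] * n_arms   # sample-mean reward estimate per arm
    cumulative = []              # cumulative reward after each step
    total = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate.
            arm = max(range(n_arms), key=lambda a: estimates[a])

        reward = 1.0 if rng.random() < true_means[arm] else 0.0

        # Incremental update of the sample mean for the pulled arm.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

        total += reward
        cumulative.append(total)

    return cumulative


history = run_epsilon_greedy([0.2, 0.5, 0.8])
print("Cumulative reward after", len(history), "steps:", history[-1])
```

The last entry of `history` is the total reward collected, which is the quantity the question refers to. Plotting the whole list against the step number is one common way to compare strategies.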