Q-Learning in Letter World
This guide demonstrates how to train an agent using Q-learning with RM/CRMs in the Letter World environment.
Environment Overview
The Letter World environment consists of a 3×7 grid where:
- The agent starts at position (1,3)
- Letter ‘A’ appears at position (1,1)
- Letter ‘C’ appears at position (1,5)
- When the agent visits ‘A’, the letter may change to ‘B’ with 50% probability
- The goal is to first visit ‘A’ (incrementing a counter), then visit ‘C’ to receive a positive reward (this progress-tracking structure is sketched after the list)
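This sequential, memory-dependent structure is the kind of thing an RM/CRM encodes: a small machine whose state (here a counter) is updated by labelled events and determines when reward is given. A minimal stand-alone sketch, with purely hypothetical names, might look like this:

```python
class CountingTaskMachine:
    """Hypothetical sketch: count visits to 'A', reward reaching 'C' afterwards."""

    def __init__(self):
        self.counter = 0

    def step(self, event):
        # 'event' is the label emitted by the environment: 'A', 'C', or None.
        if event == 'A':
            self.counter += 1           # remember that 'A' has been visited
            return 0.0, False           # no reward yet, task not finished
        if event == 'C' and self.counter > 0:
            return 1.0, True            # positive reward once 'A' came first
        return 0.0, False               # anything else: no reward, keep going
```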
Setting Up the Environment
First, let’s create the environment components:
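The exact constructors depend on the package you are using, so the code below is a self-contained stand-in that reproduces the dynamics described above; the class name `LetterWorld`, the action encoding, and the small step penalty are assumptions made for illustration.

```python
import random

class LetterWorld:
    """Minimal sketch of the 3x7 Letter World; not the official implementation."""

    ROWS, COLS = 3, 7
    START, A_POS, C_POS = (1, 3), (1, 1), (1, 5)

    def reset(self):
        self.agent = self.START
        self.a_is_b = False          # whether 'A' has already flipped to 'B'
        self.counter = 0             # task memory: number of visits to 'A'
        return self._obs()

    def step(self, action):
        # Actions: 0=up, 1=down, 2=left, 3=right; moves are clipped to the grid.
        dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
        r = min(max(self.agent[0] + dr, 0), self.ROWS - 1)
        c = min(max(self.agent[1] + dc, 0), self.COLS - 1)
        self.agent = (r, c)

        reward, done = -0.01, False                  # small per-step penalty
        if self.agent == self.A_POS and not self.a_is_b:
            self.counter += 1                        # event 'A' observed
            if random.random() < 0.5:
                self.a_is_b = True                   # 'A' may turn into 'B'
        elif self.agent == self.C_POS and self.counter > 0:
            reward, done = 1.0, True                 # goal reached: 'A' then 'C'
        return self._obs(), reward, done

    def _obs(self):
        # The task memory is part of the observation so the Q-table can tell
        # "before visiting 'A'" apart from "after visiting 'A'".
        return (self.agent, min(self.counter, 1))
```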
Q-Learning Implementation
The Q-learning algorithm maintains a table of state-action values and updates them based on the rewards received:
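A compact tabular training loop over the sketch above might look like the following; the `train` function, its hyperparameter defaults, and the flat `(state, action)` keying of the Q-table are illustrative choices rather than a prescribed implementation.

```python
import random
from collections import defaultdict

def train(env, episodes=2000, alpha=0.1, gamma=0.99, epsilon=0.2, max_steps=200):
    q = defaultdict(float)        # Q[(state, action)] -> estimated value
    actions = range(4)
    returns = []                  # per-episode return, kept for plotting later

    for _ in range(episodes):
        state, total = env.reset(), 0.0
        for _ in range(max_steps):
            # Epsilon-greedy: explore with probability epsilon, else exploit.
            if random.random() < epsilon:
                action = random.randrange(4)
            else:
                action = max(actions, key=lambda a: q[(state, a)])

            next_state, reward, done = env.step(action)
            total += reward

            # One-step Q-learning update towards the bootstrapped target.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

            state = next_state
            if done:
                break
        returns.append(total)
    return q, returns
```

For example, `q_table, returns = train(LetterWorld())` runs 2,000 episodes with the defaults above.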
Training Progress Visualization
After training, we can visualize the agent’s learning progress:
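Assuming the per-episode returns were collected as in the training sketch, a simple matplotlib plot of the raw and smoothed returns shows the learning curve:

```python
import matplotlib.pyplot as plt

def plot_returns(returns, window=50):
    # Smooth the noisy per-episode returns with a simple moving average.
    smoothed = [
        sum(returns[max(0, i - window + 1):i + 1]) / (i - max(0, i - window + 1) + 1)
        for i in range(len(returns))
    ]
    plt.plot(returns, alpha=0.3, label="episode return")
    plt.plot(smoothed, label=f"{window}-episode moving average")
    plt.xlabel("Episode")
    plt.ylabel("Return")
    plt.title("Q-learning in Letter World")
    plt.legend()
    plt.show()
```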
Demonstrating the Learned Policy
After training, we can observe how the agent behaves with its learned policy:
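One way to do this is a purely greedy rollout (no exploration) that prints each step; the `demonstrate` helper below is a sketch built on the hypothetical environment and Q-table from the earlier examples:

```python
def demonstrate(env, q, max_steps=50):
    # Replay one episode, always taking the action with the highest Q-value.
    state = env.reset()
    for step in range(max_steps):
        action = max(range(4), key=lambda a: q[(state, a)])
        state, reward, done = env.step(action)
        print(f"step {step:2d}  pos={state[0]}  counter={state[1]}  reward={reward:+.2f}")
        if done:
            print("Reached 'C' after visiting 'A' -- episode finished.")
            break
```

Calling `demonstrate(LetterWorld(), q_table)` replays one episode with the table returned by `train`.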
Sample Output
When running the demonstration, you’ll see output similar to this:
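With the sketches above, the printed trace might resemble the following; the exact positions, step count, and rewards are illustrative only:

```
step  0  pos=(1, 2)  counter=0  reward=-0.01
step  1  pos=(1, 1)  counter=1  reward=-0.01
step  2  pos=(1, 2)  counter=1  reward=-0.01
step  3  pos=(1, 3)  counter=1  reward=-0.01
step  4  pos=(1, 4)  counter=1  reward=-0.01
step  5  pos=(1, 5)  counter=1  reward=+1.00
Reached 'C' after visiting 'A' -- episode finished.
```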
Understanding the Algorithm
The Q-learning algorithm:
- Explores vs. Exploits: Uses an epsilon-greedy strategy to balance exploration (random actions) and exploitation (best known actions)
- Updates Q-values: After each action, updates the Q-value based on the reward received and the maximum future Q-value (the exact update rule is shown after this list)
- Learning Rate: Controls how quickly the agent updates its Q-values with new information
- Discount Factor: Determines the importance of future rewards compared to immediate rewards
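Written out in standard notation (not specific to any particular package), the update performed after each step is:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

where $s$ is the current state (including the RM/counter component), $a$ the chosen action, $r$ the reward, $s'$ the next state, $\alpha$ the learning rate, and $\gamma$ the discount factor.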
Performance Analysis
In the Letter World environment, the agent needs to learn that:
- Visiting position ‘A’ is necessary to increment a counter
- After visiting ‘A’, the agent must navigate to position ‘C’ to receive a positive reward
- Navigation should be efficient to minimize negative step penalties
During training, you will typically observe that:
- Initial performance is poor as the agent explores randomly
- Performance improves rapidly as it discovers the optimal sequence
- Fluctuations occur as the agent balances exploration and exploitation
- Eventually, the agent converges toward an optimal policy
Conclusion
This example demonstrates how Q-learning can be used with RM/CRMs to train an agent to complete specific sequential tasks. The Letter World environment illustrates how CRMs can effectively model tasks that require remembering past events. For more complex environments, you might need to adjust hyperparameters or use more sophisticated reinforcement learning algorithms, but the same CRM framework can be applied.
Next Steps
- Explore Counterfactual Q-Learning for more efficient learning