Letter World Cross-Product Example
This example demonstrates how the cross-product MDP behaves like a standard Gymnasium environment while adding the power of reward machines (RMs) and counting reward machines (CRMs).
The Letter World Environment
The Letter World is a simple grid environment in which an agent navigates to find specific letters:
- Letter ‘A’ has a 50% chance of turning into letter ‘B’ when visited
- Letter ‘C’ gives a reward when visited after seeing letter ‘B’
- The agent must learn to visit ‘A’, hope it turns into ‘B’, and then visit ‘C’
In the rendered grid:
- A represents letter ‘A’ (or ‘B’ after transformation)
- C represents letter ‘C’
- x represents the agent
Components
To create our cross-product environment, we need several components:
- Ground Environment: the basic grid world (LetterWorld)
- Labelling Function: maps transitions to symbols (LetterWorldLabellingFunction)
- Counting Reward Machine: defines rewards based on symbol history (LetterWorldCountingRewardMachine)
- Cross-Product: combines all of the above (LetterWorldCrossProduct)
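To make these roles concrete, here is a self-contained toy sketch of the two less familiar pieces, a labelling function and a small counting reward machine. It is illustrative only: the function and class below are not the library's actual base classes or signatures, and the counter behaviour (counting ‘B’ sightings) is a toy assumption.

```python
# Toy sketch of Letter World components (NOT the library's real API).
A_SEEN, B_SEEN, C_SEEN, EMPTY = "A", "B", "C", ""

def labelling_function(agent_pos, letter_at, a_transformed):
    """Map a ground transition to a symbol (toy version)."""
    letter = letter_at.get(agent_pos, "")
    if letter == "A":
        # 'A' is reported as 'B' once it has transformed
        return B_SEEN if a_transformed else A_SEEN
    if letter == "C":
        return C_SEEN
    return EMPTY

class ToyCountingRewardMachine:
    """Two-state machine: state 0 = waiting for 'B', state 1 = 'B' seen.
    Visiting 'C' in state 1 yields reward 1 and terminates."""

    def __init__(self):
        self.state = 0
        self.counter = 0  # a CRM also tracks counter values

    def transition(self, symbol):
        reward, done = 0.0, False
        if self.state == 0 and symbol == B_SEEN:
            self.state = 1
            self.counter += 1  # toy choice: count 'B' sightings
        elif self.state == 1 and symbol == C_SEEN:
            reward, done = 1.0, True
        return reward, done

# Example: the agent stands on 'A' after it transformed, then visits 'C'.
letters = {(0, 0): "A", (2, 2): "C"}
m = ToyCountingRewardMachine()
r1, d1 = m.transition(labelling_function((0, 0), letters, a_transformed=True))
r2, d2 = m.transition(labelling_function((2, 2), letters, a_transformed=True))
```

Note how the reward for visiting ‘C’ depends on the machine state, i.e. on the history of symbols, not just on the current grid position.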
Setting Up the Environment
First, let’s import the necessary components and create our environment.
Using the Environment Like a Standard Gym Environment
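The real imports and constructor arguments depend on where this example lives in the repository, so the commented lines below only show the assumed shape. The runnable stand-in class underneath mimics the cross-product's Gymnasium-style surface so the interaction pattern is concrete.

```python
import random

# Assumed shape of the real setup (import paths and argument names are guesses):
#
#   env = LetterWorldCrossProduct(
#       ground_env=LetterWorld(),
#       crm=LetterWorldCountingRewardMachine(),
#       lf=LetterWorldLabellingFunction(),
#   )
#
# Toy stand-in with the same reset/step surface:
class ToyCrossProduct:
    N = 3  # 3x3 grid

    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self.pos = [1, 1]  # agent row, col
        self.u = 0         # machine state
        self.c = 0         # counter value
        return self._obs(), {}

    def step(self, action):
        # actions: 0=up, 1=down, 2=left, 3=right (toy encoding)
        dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
        self.pos[0] = min(max(self.pos[0] + dr, 0), self.N - 1)
        self.pos[1] = min(max(self.pos[1] + dc, 0), self.N - 1)
        return self._obs(), 0.0, False, False, {}

    def _obs(self):
        # (symbol_seen, agent_row, agent_col) + (machine state, counter)
        return (0, self.pos[0], self.pos[1], self.u, self.c)

env = ToyCrossProduct()
obs, info = env.reset(seed=42)
obs, reward, terminated, truncated, info = env.step(3)  # move right
symbol_seen, row, col, machine_state, counter = obs
```

The five-tuple returned by `step` follows the standard Gymnasium convention: observation, reward, terminated, truncated, info.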
The cross-product environment works just like any other Gymnasium environment. Each observation has two parts:
- First part: the ground observation (symbol_seen, agent_row, agent_col)
- Last part: the machine configuration (in this example, the machine state and counter values)
The CrossProduct class provides a default _get_obs that concatenates the ground observation with a one-hot encoded machine state and the raw counter values. This Letter World example overrides it with a custom tabular encoding; you can override _get_obs to define the cross-product state representation however you wish.
Running an Episode
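Below is a sketch of a full episode loop with random actions. The stand-in environment terminates after a fixed horizon purely so the loop is runnable; with the real cross-product environment you would construct `env` as shown earlier and sample actions from its action space.

```python
import random

class ToyEpisodeEnv:
    """Stand-in with the Gymnasium reset/step surface; terminates after
    a fixed horizon with a final reward of 1 (toy behaviour only)."""
    HORIZON = 20

    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self.t = 0
        return (0, 1, 1, 0, 0), {}

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.HORIZON
        reward = 1.0 if terminated else 0.0
        return (0, 1, 1, 0, 0), reward, terminated, False, {}

env = ToyEpisodeEnv()
obs, info = env.reset(seed=0)
total_reward, terminated, truncated = 0.0, False, False
while not (terminated or truncated):
    action = random.randrange(4)  # real env: env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
```

The loop is the standard Gymnasium episode pattern: nothing about the cross-product changes how an agent interacts with it.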
Running a full episode like this lets the agent interact with the cross-product environment until it terminates or truncates.
Using a Specific Action Sequence
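Replaying a hand-written action sequence looks like the sketch below. The action encoding and the stand-in environment are toy assumptions; with the real environment you would pass its own action indices.

```python
class ToySeqEnv:
    """Stand-in with the Gymnasium reset/step surface (toy behaviour:
    unbounded grid, no rewards, never terminates)."""

    def reset(self, seed=None):
        self.pos = [1, 1]
        return (0, 1, 1, 0, 0), {}

    def step(self, action):
        # actions: 0=up, 1=down, 2=left, 3=right (toy encoding)
        dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
        self.pos = [self.pos[0] + dr, self.pos[1] + dc]
        return (0, *self.pos, 0, 0), 0.0, False, False, {}

ACTIONS = [3, 3, 1, 1]  # right, right, down, down

env = ToySeqEnv()
obs, info = env.reset()
trajectory = [obs]
for action in ACTIONS:
    obs, reward, terminated, truncated, info = env.step(action)
    trajectory.append(obs)
    if terminated or truncated:
        break
```

Recording the trajectory this way is useful when checking that a known-good action sequence drives the machine through the expected states.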
Executing a specific, predetermined sequence of actions is handy for testing the labelling function and reward machine.
What Makes It Special?
The cross-product environment extends a standard Gym environment with:
- Symbol Tracking: It tracks which symbols have been seen
- Counter Values: It maintains counters as defined by the CRM if one is being used
- State Memory: The reward can depend on the history of previously seen symbols
- Reward Shaping: Complex reward signals based on achieving specific goals
Conclusion
The cross-product environment combines the simplicity of standard Gym environments with the power of RM/CRMs. This allows you to:
- Use it with any RL algorithm designed for Gymnasium environments
- Define complex reward structures based on symbol history
- Track progress toward multi-step goals
- Shape rewards to guide exploration and learning
- Benefit from the sample efficiency of counterfactual experiences
Next Steps
- Learn about Q-Learning with RM/CRMs
- Explore Counterfactual Q-Learning for more efficient learning