Letter World Cross-Product Example
This example demonstrates how the cross-product MDP behaves like a standard Gymnasium environment while adding the power of reward machines and counting reward machines (RM/CRMs).
The Letter World Environment
The Letter World is a simple grid environment where an agent navigates to find specific letters:
- Letter ‘A’ has a 50% chance of turning into letter ‘B’ when visited
- Letter ‘C’ gives a reward when visited after seeing letter ‘B’
- The agent must learn to visit ‘A’, hope it turns into ‘B’, and then visit ‘C’
Here’s what the environment looks like:
+---------------+
|. . . . . . . .|
|. A . x . . C .|
|. . . . . . . .|
+---------------+
Where:
- A represents letter ‘A’ (or ‘B’ after transformation)
- C represents letter ‘C’
- x represents the agent
Components
To create our cross-product environment, we need several components:
- Ground Environment: The basic grid world (LetterWorld)
- Labelling Function: Maps transitions to symbols (LetterWorldLabellingFunction); a conceptual sketch follows this list
- Counting Reward Machine: Defines rewards based on symbol history (LetterWorldCountingRewardMachine)
- Cross-Product: Combines all of the above (LetterWorldCrossProduct)
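These classes live in the example package, but the idea behind the labelling step is simply a mapping from ground transitions to event symbols. Below is a minimal, hypothetical sketch of that idea; the Symbol names, letter positions, and helper function are illustrative assumptions, not the example's actual LetterWorldLabellingFunction (which also handles the ‘A’-to-‘B’ transformation):

from enum import IntEnum

class Symbol(IntEnum):
    """Illustrative event symbols (names and values are assumptions)."""
    A = 0
    B = 1
    C = 2

# Hypothetical letter positions (row, col), roughly matching the renders below.
LETTER_POSITIONS = {(1, 0): "A", (1, 5): "C"}

def label_transition(agent_pos: tuple[int, int]) -> list[Symbol]:
    """Map the agent's position after a step to the symbols it observes."""
    letter = LETTER_POSITIONS.get(agent_pos)
    return [Symbol[letter]] if letter else []

# Example: stepping onto (1, 5) would emit the C symbol.
assert label_transition((1, 5)) == [Symbol.C]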
Setting Up the Environment
First, let’s import the necessary components and create our environment:
from examples.introduction.core.crossproduct import LetterWorldCrossProduct
from examples.introduction.core.ground import LetterWorld
from examples.introduction.core.label import LetterWorldLabellingFunction
from examples.introduction.core.machine import LetterWorldCountingRewardMachine
# Create the ground environment
ground_env = LetterWorld()
# Create labelling function and counting reward machine
lf = LetterWorldLabellingFunction()
crm = LetterWorldCountingRewardMachine()
# Create the cross-product environment
cross_product = LetterWorldCrossProduct(
    ground_env=ground_env,
    crm=crm,
    lf=lf,
    max_steps=100,  # Set maximum number of steps
)
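With the environment constructed, you can sanity-check the combined spaces it exposes (standard Gymnasium attributes; the exact shapes depend on the example's implementation):

# Inspect the combined observation and action spaces
print(cross_product.observation_space)
print(cross_product.action_space)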
Using the Environment Like a Standard Gym Environment
The cross-product environment works just like any other Gymnasium environment:
# Reset the environment
obs, info = cross_product.reset(seed=42)
print(f"Initial observation: {obs}")
# Sample a random action
action = cross_product.action_space.sample()
# Take a step in the environment
next_obs, reward, terminated, truncated, info = cross_product.step(action)
print(f"Action: {action}")
print(f"Observation: {next_obs}")
print(f"Reward: {reward}")
Output:
Initial observation: [0 1 3 0 0]
Action: 3
Observation: [0 2 3 0 0]
Reward: -0.1
The observation is structured as:
- First part: the ground observation (symbol_seen, agent_row, agent_col)
- Last part: the machine configuration (in this example, the machine state and counter values)
The machine configuration is defined by the _get_obs method of the CrossProduct interface, so users can define the cross-product state representation however they wish.
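For instance, if you preferred a different layout, you could subclass the cross-product and override _get_obs. The sketch below is hypothetical: the signature is assumed to receive the ground observation, machine state, and counter values, which may differ from the actual interface:

import numpy as np

from examples.introduction.core.crossproduct import LetterWorldCrossProduct

class CustomObsCrossProduct(LetterWorldCrossProduct):
    """Illustrative subclass that rearranges the cross-product observation."""

    def _get_obs(self, ground_obs, u, c):
        # Place the (assumed) machine state u and counters c before the
        # ground observation instead of after it; purely a layout change.
        return np.concatenate(([u], np.asarray(c), np.asarray(ground_obs)))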
Running an Episode
Let’s run a full episode with the cross-product environment:
# Reset and run an episode
obs, info = cross_product.reset(seed=0)
total_reward = 0
step_count = 0
print("Initial environment state:")
ground_env.render()
# Run for several steps
for _ in range(10):
    action = cross_product.action_space.sample()
    next_obs, reward, terminated, truncated, info = cross_product.step(action)
    total_reward += reward
    step_count += 1
    print(f"\nStep {step_count}:")
    print(f" Action: {action}")
    print(f" Observation: {next_obs}")
    print(f" Reward: {reward}")
    # Render the environment
    ground_env.render()
    if terminated or truncated:
        print(f"Episode ended after {step_count} steps")
        break
Sample output:
Initial environment state:
+-------------+
|. . . . . . .|
|A . x . . C .|
|. . . . . . .|
+-------------+
Step 1:
Action: 0
Observation: [0 1 4 0 0]
Reward: -0.1
+-------------+
|. . . . . . .|
|A . . x . C .|
|. . . . . . .|
+-------------+
Step 2:
Action: 3
Observation: [0 2 4 0 0]
Reward: -0.1
+-------------+
|. . . . . . .|
|A . . . . C .|
|. . . x . . .|
+-------------+
Using a Specific Action Sequence
You can also execute a specific sequence of actions:
# Reset the environment
obs, info = cross_product.reset(seed=0)
# Define a specific action sequence to test
# (0=RIGHT, 1=LEFT, 2=UP, 3=DOWN)
actions = [1, 1, 1, 2] # Move to letter A
for i, action in enumerate(actions):
    next_obs, reward, terminated, truncated, info = cross_product.step(action)
    print(f"\nStep {i+1} with action {action}:")
    print(f" Observation: {next_obs}")
    print(f" Reward: {reward}")
    ground_env.render()
Sample output for this sequence:
Step 1 with action 1:
Observation: [0 1 2 0 0]
Reward: -0.1
+-------------+
|. . . . . . .|
|A x . . . C .|
|. . . . . . .|
+-------------+
Step 2 with action 1:
Observation: [0 1 1 0 0]
Reward: -0.1
+-------------+
|. . . . . . .|
|x A . . . C .|
|. . . . . . .|
+-------------+
Step 3 with action 1:
Observation: [0 1 0 0 0]
Reward: -0.1
+-------------+
|x . . . . . .|
|A . . . . C .|
|. . . . . . .|
+-------------+
Step 4 with action 2:
Observation: [0 0 0 0 0]
Reward: -0.1
+-------------+
|x . . . . . .|
|A . . . . C .|
|. . . . . . .|
+-------------+
What Makes It Special?
The cross-product environment extends a standard Gym environment with:
- Symbol Tracking: It tracks which symbols have been seen during the episode
- Counter Values: It maintains counter values as defined by the CRM, if one is being used
- State Memory: Rewards can depend on the history of previously seen symbols
- Reward Shaping: Complex reward signals can be built around achieving specific goals
Conclusion
The cross-product environment combines the simplicity of standard Gym environments with the power of RM/CRMs. This allows you to:
- Use it with any RL algorithm designed for Gymnasium environments
- Define complex reward structures based on symbol history
- Track progress toward multi-step goals
- Shape rewards to guide exploration and learning
- Benefit from the sample efficiency of counterfactual experiences
This example demonstrates that using RM/CRMs doesn’t require changing your existing RL algorithms; it simply gives you more expressive power in defining rewards!
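To make the first point concrete, here is a minimal tabular Q-learning sketch run directly on the cross_product environment created above. The hyperparameters and episode count are illustrative, and the sketch uses only the standard Gymnasium interface (it does not make use of the library's counterfactual-experience machinery):

from collections import defaultdict

import numpy as np

# Tabular Q-learning over the cross-product environment (illustrative values).
q_table = defaultdict(lambda: np.zeros(cross_product.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(500):
    obs, _ = cross_product.reset()
    state = tuple(obs)  # convert the observation vector to a hashable key
    done = False
    while not done:
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = cross_product.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))
        next_obs, reward, terminated, truncated, _ = cross_product.step(action)
        next_state = tuple(next_obs)
        # Standard one-step Q-learning update; bootstrap only if not terminated
        target = reward + gamma * np.max(q_table[next_state]) * (not terminated)
        q_table[state][action] += alpha * (target - q_table[state][action])
        state = next_state
        done = terminated or truncated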
Next Steps