Letter World Cross-Product Example

This example demonstrates how the cross-product MDP behaves like a standard Gymnasium environment while adding the power of reward machines (RMs) and counting reward machines (CRMs).

The Letter World Environment

The Letter World is a simple grid environment where an agent navigates to find specific letters:
  • Letter ‘A’ has a 50% chance of turning into letter ‘B’ when visited
  • Letter ‘C’ gives a reward when visited after seeing letter ‘B’
  • The agent must learn to visit ‘A’, hope it turns into ‘B’, and then visit ‘C’ (see the code sketch after the grid below)
Here’s what the environment looks like:
+-------------+
|. . . . . . .|
|A . x . . C .|
|. . . . . . .|
+-------------+
Where:
  • A represents letter ‘A’ (or ‘B’ after transformation)
  • C represents letter ‘C’
  • x represents the agent
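To make the reward logic concrete, here is a plain-Python sketch of the rules above. This is an illustration only, not the library’s CRM implementation: the +1.0 success reward is a placeholder, while the -0.1 step penalty matches the sample outputs later in this example:
import random

def simulate(visits):
    """Sketch of the task rules: visiting 'A' flips it to 'B' with
    probability 0.5, and visiting 'C' after 'B' has appeared earns
    a positive reward and ends the episode."""
    a_is_now_b = False
    total = 0.0
    for cell in visits:
        total -= 0.1  # per-step penalty, as in the sample outputs
        if cell == "A" and not a_is_now_b:
            a_is_now_b = random.random() < 0.5  # 'A' may turn into 'B'
        elif cell == "C" and a_is_now_b:
            total += 1.0  # placeholder success reward
            break
    return total

print(simulate(["A", "A", "C"]))  # e.g. 0.7 if 'A' flipped on the first visit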

Components

To create our cross-product environment, we need several components:
  1. Ground Environment: The basic grid world (LetterWorld)
  2. Labelling Function: Maps transitions to symbols (LetterWorldLabellingFunction)
  3. Counting Reward Machine: Defines rewards based on symbol history (LetterWorldCountingRewardMachine)
  4. Cross-Product: Combines all the above (LetterWorldCrossProduct)
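To give a feel for the second component, here is a minimal conceptual sketch of a labelling function. It is not the actual LetterWorldLabellingFunction interface, and the letter coordinates are assumptions read off the renders below:
A_POSITION = (1, 0)  # assumed (row, col) of letter 'A'
C_POSITION = (1, 5)  # assumed (row, col) of letter 'C'

def label(obs, action, next_obs):
    """Map a ground transition to the set of symbols it triggers."""
    agent_pos = (next_obs[1], next_obs[2])  # assumes (symbol, row, col) layout
    symbols = set()
    if agent_pos == A_POSITION:
        symbols.add("A")  # appears as 'B' after the stochastic swap
    if agent_pos == C_POSITION:
        symbols.add("C")
    return symbols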

Setting Up the Environment

First, let’s import the necessary components and create our environment:
from examples.introduction.core.crossproduct import LetterWorldCrossProduct
from examples.introduction.core.ground import LetterWorld
from examples.introduction.core.label import LetterWorldLabellingFunction
from examples.introduction.core.machine import LetterWorldCountingRewardMachine

# Create the ground environment
ground_env = LetterWorld()

# Create labelling function and counting reward machine
lf = LetterWorldLabellingFunction()
crm = LetterWorldCountingRewardMachine()

# Create the cross-product environment
cross_product = LetterWorldCrossProduct(
    ground_env=ground_env,
    crm=crm,
    lf=lf,
    max_steps=100,  # Maximum steps per episode before truncation
)
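Because the cross-product is a Gymnasium environment, it should expose the standard action_space and observation_space attributes; a quick sanity check:
# Inspect the combined spaces (standard Gymnasium attributes)
print(cross_product.action_space)       # the ground environment's actions
print(cross_product.observation_space)  # ground observation + machine configuration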

Using the Environment Like a Standard Gym Environment

The cross-product environment works just like any other Gymnasium environment:
# Reset the environment
obs, info = cross_product.reset(seed=42)
print(f"Initial observation: {obs}")

# Sample a random action
action = cross_product.action_space.sample()

# Take a step in the environment
next_obs, reward, terminated, truncated, info = cross_product.step(action)
print(f"Action: {action}")
print(f"Observation: {next_obs}")
print(f"Reward: {reward}")
Output:
Initial observation: [0 1 3 0 0]
Action: 3
Observation: [0 2 3 0 0]
Reward: -0.1
The observation is structured as:
  • First three entries: the ground observation (symbol_seen, agent_row, agent_col)
  • Last two entries: the machine configuration (in this example, the machine state and a single counter value)
This layout is assembled by the _get_obs method of the CrossProduct interface, so users are free to define the cross-product state representation however suits their task.
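For downstream code it is often convenient to split the flat observation back into these parts. The indices below assume the five-element layout shown above (three ground values, then the machine state and one counter); adjust them if your configuration differs:
# Split the flat observation, following the layout described above
symbol_seen, agent_row, agent_col = obs[:3]
machine_state = obs[3]
counters = obs[4:]
print(f"Ground: row={agent_row}, col={agent_col}, symbol_seen={symbol_seen}")
print(f"Machine: state={machine_state}, counters={counters}")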

Running an Episode

Let’s run a full episode with the cross-product environment:
# Reset and run an episode
obs, info = cross_product.reset(seed=0)
total_reward = 0
step_count = 0

print("Initial environment state:")
ground_env.render()

# Run for several steps
for _ in range(10):
    action = cross_product.action_space.sample()
    next_obs, reward, terminated, truncated, info = cross_product.step(action)
    
    total_reward += reward
    step_count += 1
    
    print(f"\nStep {step_count}:")
    print(f"  Action: {action}")
    print(f"  Observation: {next_obs}")
    print(f"  Reward: {reward}")
    
    # Render the environment
    ground_env.render()
    
    if terminated or truncated:
        print(f"Episode ended after {step_count} steps")
        break
Sample output:
Initial environment state:
+-------------+
|. . . . . . .|
|A . x . . C .|
|. . . . . . .|
+-------------+

Step 1:
  Action: 0
  Observation: [0 1 4 0 0]
  Reward: -0.1

+-------------+
|. . . . . . .|
|A . . x . C .|
|. . . . . . .|
+-------------+

Step 2:
  Action: 3
  Observation: [0 2 4 0 0]
  Reward: -0.1

+-------------+
|. . . . . . .|
|A . . . . C .|
|. . . x . . .|
+-------------+

Using a Specific Action Sequence

You can also execute a specific sequence of actions:
# Reset the environment
obs, info = cross_product.reset(seed=0)

# Define a specific action sequence to test
# (0=RIGHT, 1=LEFT, 2=UP, 3=DOWN)
actions = [1, 1, 1, 2]  # Move to letter A

for i, action in enumerate(actions):
    next_obs, reward, terminated, truncated, info = cross_product.step(action)
    print(f"\nStep {i+1} with action {action}:")
    print(f"  Observation: {next_obs}")
    print(f"  Reward: {reward}")
    
    ground_env.render()
Sample output for this sequence:
Step 1 with action 1:
  Observation: [0 1 2 0 0]
  Reward: -0.1

+-------------+
|. . . . . . .|
|A x . . . C .|
|. . . . . . .|
+-------------+

Step 2 with action 1:
  Observation: [0 1 1 0 0]
  Reward: -0.1

+-------------+
|. . . . . . .|
|x A . . . C .|
|. . . . . . .|
+-------------+

Step 3 with action 1:
  Observation: [0 1 0 0 0]
  Reward: -0.1

+-------------+
|x . . . . . .|
|A . . . . C .|
|. . . . . . .|
+-------------+

Step 4 with action 2:
  Observation: [0 0 0 0 0]
  Reward: -0.1

+-------------+
|x . . . . . .|
|A . . . . C .|
|. . . . . . .|
+-------------+

What Makes It Special?

The cross-product environment extends a standard Gym environment with:
  1. Symbol Tracking: It tracks which symbols have been seen
  2. Counter Values: It maintains counter values as defined by the CRM, when one is used
  3. State Memory: The reward can depend on the history of previously seen symbols
  4. Reward Shaping: Complex reward signals based on achieving specific goals
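A simple way to see points 1–3 in action is to watch the machine part of the observation evolve during a rollout. The indices again assume the five-element observation layout described earlier:
# Monitor the machine configuration during a random rollout
obs, info = cross_product.reset(seed=0)
for step in range(20):
    action = cross_product.action_space.sample()
    obs, reward, terminated, truncated, info = cross_product.step(action)
    print(f"step {step + 1}: machine state={obs[3]}, counter={obs[4]}, reward={reward}")
    if terminated or truncated:
        break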

Conclusion

The cross-product environment combines the simplicity of standard Gym environments with the power of RM/CRMs. This allows you to:
  1. Use it with any RL algorithm designed for Gymnasium environments
  2. Define complex reward structures based on symbol history
  3. Track progress toward multi-step goals
  4. Shape rewards to guide exploration and learning
  5. Benefit from the sample efficiency of counterfactual experiences (synthetic experiences generated for machine configurations other than the one actually visited)
This example demonstrates that using RM/CRMs doesn’t require changing your existing RL algorithms; it simply gives you more expressive power in defining rewards!

Next Steps