Letter World Cross-Product Example

This example demonstrates how the cross-product MDP behaves like a standard Gymnasium environment while adding the power of reward machines (RMs) and counting reward machines (CRMs).

The Letter World Environment

The Letter World is a simple grid environment where an agent navigates to find specific letters:
  • Letter ‘A’ has a 50% chance of turning into letter ‘B’ when visited
  • Letter ‘C’ gives a reward when visited after seeing letter ‘B’
  • The agent must learn to visit ‘A’, hope it turns into ‘B’, and then visit ‘C’ (see the code sketch after the grid below)
Here’s what the environment looks like:
+-------------+
|. . . . . . .|
|A . x . . C .|
|. . . . . . .|
+-------------+
Where:
  • A represents letter ‘A’ (or ‘B’ after transformation)
  • C represents letter ‘C’
  • x represents the agent
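To make the reward logic concrete, here is a plain-Python sketch of the rules above. This is an illustration only, not the library’s CRM implementation: the +1.0 success reward is a placeholder, while the -0.1 step penalty matches the sample outputs later in this example:
import random

def simulate(visits):
    """Sketch of the task rules: visiting 'A' flips it to 'B' with
    probability 0.5, and visiting 'C' after 'B' has appeared earns
    a positive reward and ends the episode."""
    a_is_now_b = False
    total = 0.0
    for cell in visits:
        total -= 0.1  # per-step penalty, as in the sample outputs
        if cell == "A" and not a_is_now_b:
            a_is_now_b = random.random() < 0.5  # 'A' may turn into 'B'
        elif cell == "C" and a_is_now_b:
            total += 1.0  # placeholder success reward
            break
    return total

print(simulate(["A", "A", "C"]))  # e.g. 0.7 if 'A' flipped on the first visit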

Components

To create our cross-product environment, we need several components:
  1. Ground Environment: The basic grid world (LetterWorld)
  2. Labelling Function: Maps transitions to symbols (LetterWorldLabellingFunction)
  3. Counting Reward Machine: Defines rewards based on symbol history (LetterWorldCountingRewardMachine)
  4. Cross-Product: Combines all the above (LetterWorldCrossProduct)
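To give a feel for the second component, here is a minimal conceptual sketch of a labelling function. It is not the actual LetterWorldLabellingFunction interface, and the letter coordinates are assumptions read off the renders below:
A_POSITION = (1, 0)  # assumed (row, col) of letter 'A'
C_POSITION = (1, 5)  # assumed (row, col) of letter 'C'

def label(obs, action, next_obs):
    """Map a ground transition to the set of symbols it triggers."""
    agent_pos = (next_obs[1], next_obs[2])  # assumes (symbol, row, col) layout
    symbols = set()
    if agent_pos == A_POSITION:
        symbols.add("A")  # appears as 'B' after the stochastic swap
    if agent_pos == C_POSITION:
        symbols.add("C")
    return symbols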

Setting Up the Environment

First, let’s import the necessary components and create our environment:
from examples.introduction.core.crossproduct import LetterWorldCrossProduct
from examples.introduction.core.ground import LetterWorld
from examples.introduction.core.label import LetterWorldLabellingFunction
from examples.introduction.core.machine import LetterWorldCountingRewardMachine

# Create the ground environment
ground_env = LetterWorld()

# Create labelling function and counting reward machine
lf = LetterWorldLabellingFunction()
crm = LetterWorldCountingRewardMachine()

# Create the cross-product environment
cross_product = LetterWorldCrossProduct(
    ground_env=ground_env,
    crm=crm,
    lf=lf,
    max_steps=100,  # Maximum steps per episode before truncation
)
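Because the cross-product is a Gymnasium environment, it should expose the standard action_space and observation_space attributes; a quick sanity check:
# Inspect the combined spaces (standard Gymnasium attributes)
print(cross_product.action_space)       # the ground environment's actions
print(cross_product.observation_space)  # ground observation + machine configuration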

Using the Environment Like a Standard Gym Environment

The cross-product environment works just like any other Gymnasium environment:
# Reset the environment
obs, info = cross_product.reset(seed=42)
print(f"Initial observation: {obs}")

# Sample a random action
action = cross_product.action_space.sample()

# Take a step in the environment
next_obs, reward, terminated, truncated, info = cross_product.step(action)
print(f"Action: {action}")
print(f"Observation: {next_obs}")
print(f"Reward: {reward}")
Output:
Initial observation: [0 1 3 0 0]
Action: 3
Observation: [0 2 3 0 0]
Reward: -0.1
The observation is structured as:
  • First three entries: the ground observation (symbol_seen, agent_row, agent_col)
  • Last two entries: the machine configuration (in this example, the machine state and a single counter value)
This layout is assembled by the _get_obs method of the CrossProduct interface, so users are free to define the cross-product state representation however suits their task.
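For downstream code it is often convenient to split the flat observation back into these parts. The indices below assume the five-element layout shown above (three ground values, then the machine state and one counter); adjust them if your configuration differs:
# Split the flat observation, following the layout described above
symbol_seen, agent_row, agent_col = obs[:3]
machine_state = obs[3]
counters = obs[4:]
print(f"Ground: row={agent_row}, col={agent_col}, symbol_seen={symbol_seen}")
print(f"Machine: state={machine_state}, counters={counters}")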

Running an Episode

Let’s run a full episode with the cross-product environment:
# Reset and run an episode
obs, info = cross_product.reset(seed=0)
total_reward = 0
step_count = 0

print("Initial environment state:")
ground_env.render()

# Run for several steps
for _ in range(10):
    action = cross_product.action_space.sample()
    next_obs, reward, terminated, truncated, info = cross_product.step(action)
    
    total_reward += reward
    step_count += 1
    
    print(f"\nStep {step_count}:")
    print(f"  Action: {action}")
    print(f"  Observation: {next_obs}")
    print(f"  Reward: {reward}")
    
    # Render the environment
    ground_env.render()
    
    if terminated or truncated:
        print(f"Episode ended after {step_count} steps")
        break
Sample output:
Initial environment state:
+-------------+
|. . . . . . .|
|A . x . . C .|
|. . . . . . .|
+-------------+

Step 1:
  Action: 0
  Observation: [0 1 4 0 0]
  Reward: -0.1

+-------------+
|. . . . . . .|
|A . . x . C .|
|. . . . . . .|
+-------------+

Step 2:
  Action: 3
  Observation: [0 2 4 0 0]
  Reward: -0.1

+-------------+
|. . . . . . .|
|A . . . . C .|
|. . . x . . .|
+-------------+

Using a Specific Action Sequence

You can also execute a specific sequence of actions:
# Reset the environment
obs, info = cross_product.reset(seed=0)

# Define a specific action sequence to test
# (0=RIGHT, 1=LEFT, 2=UP, 3=DOWN)
actions = [1, 1, 1, 2]  # Move to letter A

for i, action in enumerate(actions):
    next_obs, reward, terminated, truncated, info = cross_product.step(action)
    print(f"\nStep {i+1} with action {action}:")
    print(f"  Observation: {next_obs}")
    print(f"  Reward: {reward}")
    
    ground_env.render()
Sample output for this sequence:
Step 1 with action 1:
  Observation: [0 1 2 0 0]
  Reward: -0.1

+-------------+
|. . . . . . .|
|A x . . . C .|
|. . . . . . .|
+-------------+

Step 2 with action 1:
  Observation: [0 1 1 0 0]
  Reward: -0.1

+-------------+
|. . . . . . .|
|x A . . . C .|
|. . . . . . .|
+-------------+

Step 3 with action 1:
  Observation: [0 1 0 0 0]
  Reward: -0.1

+-------------+
|x . . . . . .|
|A . . . . C .|
|. . . . . . .|
+-------------+

Step 4 with action 2:
  Observation: [0 0 0 0 0]
  Reward: -0.1

+-------------+
|x . . . . . .|
|A . . . . C .|
|. . . . . . .|
+-------------+

What Makes It Special?

The cross-product environment extends a standard Gym environment with:
  1. Symbol Tracking: It tracks which symbols have been seen
  2. Counter Values: It maintains counter values as defined by the CRM, when one is used
  3. State Memory: The reward can depend on the history of previously seen symbols
  4. Reward Shaping: Complex reward signals based on achieving specific goals
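A simple way to see points 1–3 in action is to watch the machine part of the observation evolve during a rollout. The indices again assume the five-element observation layout described earlier:
# Monitor the machine configuration during a random rollout
obs, info = cross_product.reset(seed=0)
for step in range(20):
    action = cross_product.action_space.sample()
    obs, reward, terminated, truncated, info = cross_product.step(action)
    print(f"step {step + 1}: machine state={obs[3]}, counter={obs[4]}, reward={reward}")
    if terminated or truncated:
        break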

Conclusion

The cross-product environment combines the simplicity of standard Gym environments with the power of RM/CRMs. This allows you to:
  1. Use it with any RL algorithm designed for Gymnasium environments
  2. Define complex reward structures based on symbol history
  3. Track progress toward multi-step goals
  4. Shape rewards to guide exploration and learning
  5. Benefit from the sample efficiency of counterfactual experiences (synthetic experiences generated for machine configurations other than the one actually visited)
This example demonstrates that using RM/CRMs doesn’t require changing your existing RL algorithms; it simply gives you more expressive power in defining rewards!

Next Steps