Quick Start Guide

This guide will help you get up and running with Reward Machines (RMs) and Counting Reward Machines (CRMs) in just a few minutes.

Basic Example

We’ll use the Letter World environment, where an agent must visit letters (specific goal locations) in a specific order.

from examples.introduction.core.ground import LetterWorld
from examples.introduction.core.label import LetterWorldLabellingFunction
from examples.introduction.core.machine import LetterWorldRewardMachine
from examples.introduction.core.crossproduct import LetterWorldCrossProduct

# 1. Create the ground environment
ground_env = LetterWorld()

# 2. Create the labelling function
lf = LetterWorldLabellingFunction()

# 3. Create the Reward Machine
rm = LetterWorldRewardMachine()

# 4. Create the cross-product MDP
env = LetterWorldCrossProduct(
    ground_env=ground_env,
    machine=rm,
    lf=lf,
    max_steps=100,
)

# Use like a standard Gymnasium environment
obs, _ = env.reset()
action = env.action_space.sample()
next_obs, reward, terminated, truncated, info = env.step(action)
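
Since the cross product follows the standard Gymnasium API, a complete random-policy episode is just the usual reset/step loop:

obs, _ = env.reset()
done = False
total_reward = 0.0

while not done:
    action = env.action_space.sample()        # random policy
    obs, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated            # episode ends on either signal

print(f"Episode return: {total_reward}")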

What’s Happening?

  1. Ground Environment (LetterWorld): A simple grid world that subclasses gymnasium.Env.
  2. Labelling Function (LetterWorldLabellingFunction): Maps low-level environment transitions to high-level events (propositions); see the standalone sketch after this list.
  3. Reward Machine (RM) (LetterWorldRewardMachine): Specifies rewards based on event sequences.
  4. Cross-Product MDP (LetterWorldCrossProduct): Combines the environment, labelling function, and RM into a single Gymnasium-compatible environment.
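
To make step 2 concrete, here is a minimal standalone sketch of what a labelling function does, written as a plain function independent of the library’s own classes; the letter position used is hypothetical:

# Illustrative only: a labelling function inspects a low-level transition and
# returns the set of high-level events (propositions) that occurred during it.
A_POSITION = (1, 3)  # hypothetical grid position of the letter "A"

def label_transition(agent_position, next_agent_position):
    events = set()
    if tuple(next_agent_position) == A_POSITION:  # agent stepped onto "A"
        events.add("A")
    return events
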
To model tasks requiring counting or extended memory, swap in a CountingRewardMachine instead of a standard RM. The workflow is identical.
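
For instance, assuming the introduction example also provides a counting-machine variant (the class name below is illustrative, not taken from the code above), the swap looks like this:

# Illustrative: LetterWorldCountingRewardMachine is an assumed class name.
from examples.introduction.core.machine import LetterWorldCountingRewardMachine

crm = LetterWorldCountingRewardMachine()

env = LetterWorldCrossProduct(
    ground_env=ground_env,
    machine=crm,  # the CRM drops in where the RM was used
    lf=lf,
    max_steps=100,
)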

Training a Simple Agent

Here’s a basic tabular Q-learning loop:

import numpy as np
from collections import defaultdict

# Hyperparameters
ALPHA = 0.1    # learning rate
GAMMA = 0.99   # discount factor
EPSILON = 0.1  # exploration rate

q_table = defaultdict(lambda: np.zeros(env.action_space.n))

for episode in range(100):
    obs, _ = env.reset()
    # Observations are assumed to be NumPy arrays, so convert them to tuples
    # before using them as dictionary keys.
    state = tuple(obs)
    done = False

    while not done:
        # Epsilon-greedy action selection
        if np.random.rand() < EPSILON:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))

        next_obs, reward, terminated, truncated, _ = env.step(action)
        next_state = tuple(next_obs)
        done = terminated or truncated

        # Q-learning update (no bootstrapping from terminal states)
        target = reward + GAMMA * np.max(q_table[next_state]) * (not terminated)
        q_table[state][action] += ALPHA * (target - q_table[state][action])

        state = next_state
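
After training, the greedy policy can be evaluated with a short rollout (this reuses the tuple-key convention from the loop above):

obs, _ = env.reset()
state = tuple(obs)
done = False
episode_return = 0.0

while not done:
    # Act greedily with respect to the learned Q-values
    action = int(np.argmax(q_table[state]))
    obs, reward, terminated, truncated, _ = env.step(action)
    state = tuple(obs)
    episode_return += reward
    done = terminated or truncated

print(f"Greedy episode return: {episode_return}")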

Next Steps

Worked Examples

Core Concepts