Quick Start Guide

This guide will help you get up and running with Reward Machines (RMs) and Counting Reward Machines (CRMs) in just a few minutes.

Basic Example

We’ll use the Letter World environment, where an agent must visit letters (specific goal locations) in a specific order.

from examples.introduction.core.ground import LetterWorld
from examples.introduction.core.label import LetterWorldLabellingFunction
from examples.introduction.core.machine import LetterWorldRewardMachine
from examples.introduction.core.crossproduct import LetterWorldCrossProduct

# 1. Create the ground environment
ground_env = LetterWorld()

# 2. Create the labelling function
lf = LetterWorldLabellingFunction()

# 3. Create the Reward Machine
rm = LetterWorldRewardMachine()

# 4. Create the cross-product MDP
env = LetterWorldCrossProduct(
    ground_env=ground_env,
    machine=rm,
    lf=lf,
    max_steps=100,
)

# Use like a standard Gymnasium environment
obs, _ = env.reset()
action = env.action_space.sample()
next_obs, reward, terminated, truncated, info = env.step(action)
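
Since the cross product follows the standard Gymnasium API, a complete random-policy episode is just the usual reset/step loop:

obs, _ = env.reset()
done = False
total_reward = 0.0

while not done:
    action = env.action_space.sample()        # random policy
    obs, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated            # episode ends on either signal

print(f"Episode return: {total_reward}")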

What’s Happening?

  1. Ground Environment (LetterWorld): A simple grid world that subclasses gymnasium.Env.
  2. Labelling Function (LetterWorldLabellingFunction): Maps low-level environment transitions to high-level events (propositions); see the standalone sketch after this list.
  3. Reward Machine (RM) (LetterWorldRewardMachine): Specifies rewards based on event sequences.
  4. Cross-Product MDP (LetterWorldCrossProduct): Combines the environment, labelling function, and RM into a single Gymnasium-compatible environment.
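
To make step 2 concrete, here is a minimal standalone sketch of what a labelling function does, written as a plain function independent of the library’s own classes; the letter position used is hypothetical:

# Illustrative only: a labelling function inspects a low-level transition and
# returns the set of high-level events (propositions) that occurred during it.
A_POSITION = (1, 3)  # hypothetical grid position of the letter "A"

def label_transition(agent_position, next_agent_position):
    events = set()
    if tuple(next_agent_position) == A_POSITION:  # agent stepped onto "A"
        events.add("A")
    return events
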
To model tasks requiring counting or extended memory, swap in a CountingRewardMachine instead of a standard RM. The workflow is identical.
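
For instance, assuming the introduction example also provides a counting-machine variant (the class name below is illustrative, not taken from the code above), the swap looks like this:

# Illustrative: LetterWorldCountingRewardMachine is an assumed class name.
from examples.introduction.core.machine import LetterWorldCountingRewardMachine

crm = LetterWorldCountingRewardMachine()

env = LetterWorldCrossProduct(
    ground_env=ground_env,
    machine=crm,  # the CRM drops in where the RM was used
    lf=lf,
    max_steps=100,
)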

Training a Simple Agent

Here’s a basic tabular Q-learning loop:

import numpy as np
from collections import defaultdict

# Hyperparameters
ALPHA = 0.1    # learning rate
GAMMA = 0.99   # discount factor
EPSILON = 0.1  # exploration rate

q_table = defaultdict(lambda: np.zeros(env.action_space.n))

for episode in range(100):
    obs, _ = env.reset()
    # Observations are assumed to be NumPy arrays, so convert them to tuples
    # before using them as dictionary keys.
    state = tuple(obs)
    done = False

    while not done:
        # Epsilon-greedy action selection
        if np.random.rand() < EPSILON:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))

        next_obs, reward, terminated, truncated, _ = env.step(action)
        next_state = tuple(next_obs)
        done = terminated or truncated

        # Q-learning update (no bootstrapping from terminal states)
        target = reward + GAMMA * np.max(q_table[next_state]) * (not terminated)
        q_table[state][action] += ALPHA * (target - q_table[state][action])

        state = next_state
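
After training, the greedy policy can be evaluated with a short rollout (this reuses the tuple-key convention from the loop above):

obs, _ = env.reset()
state = tuple(obs)
done = False
episode_return = 0.0

while not done:
    # Act greedily with respect to the learned Q-values
    action = int(np.argmax(q_table[state]))
    obs, reward, terminated, truncated, _ = env.step(action)
    state = tuple(obs)
    episode_return += reward
    done = terminated or truncated

print(f"Greedy episode return: {episode_return}")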

Next Steps

Worked Examples

Core Concepts