Quick Start Guide
This guide will help you get up and running with Reward Machines (RMs) and Counting Reward Machines (CRMs) in just a few minutes.
Basic Example
We’ll use the Letter World environment, where an agent must visit letters (specific goal locations) in a prescribed order.
from examples.introduction.core.ground import LetterWorld
from examples.introduction.core.label import LetterWorldLabellingFunction
from examples.introduction.core.machine import LetterWorldRewardMachine
from examples.introduction.core.crossproduct import LetterWorldCrossProduct
# 1. Create the ground environment
ground_env = LetterWorld()
# 2. Create the labelling function
lf = LetterWorldLabellingFunction()
# 3. Create the Reward Machine
rm = LetterWorldRewardMachine()
# 4. Create the cross-product MDP
env = LetterWorldCrossProduct(
    ground_env=ground_env,
    machine=rm,
    lf=lf,
    max_steps=100,
)
# Use like a standard Gymnasium environment
obs, _ = env.reset()
action = env.action_space.sample()
next_obs, reward, terminated, truncated, info = env.step(action)
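Because the cross product behaves like any other Gymnasium environment, a quick sanity check is to run a short random rollout and accumulate the reward emitted by the machine. This snippet is illustrative only, not part of the library:
# Short random rollout: step with random actions and accumulate the reward
# produced by the reward machine.
obs, _ = env.reset()
episode_return = 0.0
for _ in range(20):
    obs, reward, terminated, truncated, _ = env.step(env.action_space.sample())
    episode_return += reward
    if terminated or truncated:
        break
print("Episode return:", episode_return)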
What’s Happening?
- Ground Environment (LetterWorld): a simple grid world, implemented as a subclass of gymnasium.Env.
- Labelling Function (LetterWorldLabellingFunction): maps low-level environment transitions to high-level events (propositions).
- Reward Machine (RM) (LetterWorldRewardMachine): specifies rewards based on event sequences (a conceptual sketch follows this list).
- Cross-Product MDP (LetterWorldCrossProduct): combines the environment, labelling function, and RM into a single Gymnasium-compatible environment.
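To make the machine abstraction concrete, here is a conceptual sketch of a two-state reward machine written as plain dictionaries. It is not the library’s API (the guide uses LetterWorldRewardMachine for that), and the event names "A" and "B" are illustrative only:
# Conceptual sketch only, not the library's API. The machine tracks an
# internal state u; each high-level event updates u and emits a reward.
delta_u = {(0, "A"): 1, (1, "B"): 2}      # state-transition function
delta_r = {(0, "A"): 0.0, (1, "B"): 1.0}  # reward function

def machine_step(u, event):
    # Unlisted (state, event) pairs leave the state unchanged with zero reward.
    return delta_u.get((u, event), u), delta_r.get((u, event), 0.0)

u = 0
for event in ["B", "A", "B"]:  # only the A-then-B sequence is rewarded
    u, reward = machine_step(u, event)
    print(u, reward)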
To model tasks requiring counting or extended memory, swap in a CountingRewardMachine
instead of a standard RM. The workflow is identical.
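For example, if the examples package exposes a counting machine for Letter World (the import path and class name below are assumed for illustration and may differ), the only change is the machine passed to the cross product:
# Hypothetical class name - check the examples package for the actual one.
from examples.introduction.core.machine import LetterWorldCountingRewardMachine

crm = LetterWorldCountingRewardMachine()
env = LetterWorldCrossProduct(
    ground_env=ground_env,
    machine=crm,
    lf=lf,
    max_steps=100,
)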
Training a Simple Agent
Here’s a basic tabular Q-learning loop:
import numpy as np
from collections import defaultdict
q_table = defaultdict(lambda: np.zeros(env.action_space.n))

def to_key(o):
    # Observations from the cross product may be numpy arrays, which are not
    # hashable; flatten them into tuples so they can index the Q-table.
    return tuple(np.asarray(o).flatten().tolist())

for episode in range(100):
    obs, _ = env.reset()
    state = to_key(obs)
    done = False
    while not done:
        # Epsilon-greedy action selection
        if np.random.rand() < 0.1:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))
        next_obs, reward, terminated, truncated, _ = env.step(action)
        next_state = to_key(next_obs)
        done = terminated or truncated
        # Q-learning update (learning rate 0.1, discount 0.99)
        q_table[state][action] += 0.1 * (
            reward + 0.99 * np.max(q_table[next_state]) - q_table[state][action]
        )
        state = next_state
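After training, a quick way to inspect the learned behaviour is to roll out the greedy policy from the Q-table (reusing the to_key helper above). This evaluation snippet is a sketch rather than part of the library:
# Greedy rollout: always pick the highest-valued action and report the return.
obs, _ = env.reset()
state = to_key(obs)
done, episode_return = False, 0.0
while not done:
    action = int(np.argmax(q_table[state]))
    obs, reward, terminated, truncated, _ = env.step(action)
    state = to_key(obs)
    episode_return += reward
    done = terminated or truncated
print("Greedy episode return:", episode_return)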
Next Steps
- Worked Examples
- Core Concepts