Letter World: The Reward Machine
Overview
The Reward Machine is the core automaton that defines what constitutes success in a task.Instead of embedding complex logic directly in a reward function, the RM tracks high-level events (e.g. visiting specific letters) and provides rewards when certain sequences occur.
The RM Concept
An RM consists of:- States: Different phases of the task
- Transitions: Rules for moving between states based on events
- Rewards: Values assigned to transitions
A, then B, then C.
Implementation
The Letter World RM is implemented as a subclass of theRewardMachine base class:
Method Explanations
u_0: Returns the initial state (0) where the agent begins the task_get_state_transition_function: Defines state transitions based on observed events:- From state 0, observing “A” moves to state 1
- From state 1, observing “B” moves to state 2
- From state 2, observing “C” moves to terminal state -1
_get_reward_transition_function: Defines rewards for each transition:- Small rewards (0.1) for intermediate steps (A→1, B→2)
- Large reward (1.0) for completing the sequence (C→terminal)
How it Works
- The agent starts in state 0.
- Seeing
Amoves to state 1, with a small reward. - Seeing
Bmoves to state 2, with a small reward. - Seeing
Ccompletes the sequence, reaching the terminal state with a larger reward.
A → B → C.
What if we use a Counting Reward Machine?
Key Differences from the RM
- Counters: CRMs introduce integer variables that track how many times events occur.
- Transition Conditions: CRM transitions depend not only on events, but also on counter values (e.g. “if counter = 0, go to success state”).
- Expressiveness: While RMs capture tasks definable by regular languages, CRMs are Turing-complete and can express much richer tasks.
Example: Balanced Parentheses Task
In Letter World, a CRM could enforce:- The agent must visit
An times, - Then
B, - Then
Cn times, - Only when the number of
As andCs match does the agent receive a reward.
As and Cs — something a plain RM cannot do.
Method Explanations
u_0: Returns the initial state (0) where the agent begins the taskc_0: Returns the initial counter values as a tuple (starts with counter = 0)encoded_configuration_size: Returns the size (2) for encoding state-counter configurations_get_state_transition_function: Defines state transitions with counter conditions:- From state 0: “A” keeps in state 0, “B” moves to state 1
- From state 1: “C” stays in state 1 if counter ≠ 0 (NZ), moves to terminal if counter = 0 (Z)
_get_counter_transition_function: Defines how counters change with events:- From state 0: “A” increments counter (+1), “B” resets counter to 0
- From state 1: “C” decrements counter (-1) if ≠ 0, keeps at 0 if = 0
_get_reward_transition_function: Defines rewards with counter conditions:- Small penalties (-0.1) for most transitions
- Large reward (1.0) only when “C” is observed in state 1 with counter = 0 (balanced task completion)
A and C.
Only when the counter reaches 0 (all As matched by Cs) does the agent receive the reward.
Summary
- Reward Machines (RMs): Great for sequential or regular tasks (e.g.
A → B → C). - Counting Reward Machines (CRMs): Extend RMs with counters, enabling tasks with memory, counting, or balance (e.g. “same number of
As andCs”). - PyCRM supports both: start with RMs for simple tasks, use CRMs when you need additional expressive power.