Letter World: The Reward Machine
A Reward Machine (RM) defines a task’s reward structure using states and transitions between events.
Unlike traditional reward functions that depend only on the current state, RMs let us specify temporal reward logic in a clear, interpretable way.
Unlike traditional reward functions that depend only on the current state, RMs let us specify temporal reward logic in a clear, interpretable way.
Overview
The Reward Machine is the core automaton that defines what constitutes success in a task.Instead of embedding complex logic directly in a reward function, the RM tracks high-level events (e.g. visiting specific letters) and provides rewards when certain sequences occur.
The RM Concept
An RM consists of:- States: Different phases of the task
- Transitions: Rules for moving between states based on events
- Rewards: Values assigned to transitions
A, then B, then C.
Implementation
The Letter World RM is implemented as a subclass of theRewardMachine base class:
Method Explanations
u_0: Returns the initial state (0) where the agent begins the task_get_state_transition_function: Defines state transitions based on observed events:- From state 0, observing “A” moves to state 1
- From state 1, observing “B” moves to state 2
- From state 2, observing “C” moves to terminal state -1
_get_reward_transition_function: Defines rewards for each transition:- Small rewards (0.1) for intermediate steps (A→1, B→2)
- Large reward (1.0) for completing the sequence (C→terminal)
How it Works
- The agent starts in state 0.
- Seeing
Amoves to state 1, with a small reward. - Seeing
Bmoves to state 2, with a small reward. - Seeing
Ccompletes the sequence, reaching the terminal state with a larger reward.
A → B → C.
What if we use a Counting Reward Machine?
A Counting Reward Machine (CRM) extends an RM by adding counters as memory.
This allows tasks that require counting or balancing, which RMs alone cannot represent.
This allows tasks that require counting or balancing, which RMs alone cannot represent.
Key Differences from the RM
- Counters: CRMs introduce integer variables that track how many times events occur.
- Transition Conditions: CRM transitions depend not only on events, but also on counter values (e.g. “if counter = 0, go to success state”).
- Expressiveness: While RMs capture tasks definable by regular languages, CRMs are Turing-complete and can express much richer tasks.
Example: Balanced Parentheses Task
In Letter World, a CRM could enforce:- The agent must visit
An times, - Then
B, - Then
Cn times, - Only when the number of
As andCs match does the agent receive a reward.
As and Cs — something a plain RM cannot do.
Method Explanations
u_0: Returns the initial state (0) where the agent begins the taskc_0: Returns the initial counter values as a tuple (starts with counter = 0)encoded_configuration_size: Returns the size (2) for encoding state-counter configurations_get_state_transition_function: Defines state transitions with counter conditions:- From state 0: “A” keeps in state 0, “B” moves to state 1
- From state 1: “C” stays in state 1 if counter ≠ 0 (NZ), moves to terminal if counter = 0 (Z)
_get_counter_transition_function: Defines how counters change with events:- From state 0: “A” increments counter (+1), “B” resets counter to 0
- From state 1: “C” decrements counter (-1) if ≠ 0, keeps at 0 if = 0
_get_reward_transition_function: Defines rewards with counter conditions:- Small penalties (-0.1) for most transitions
- Large reward (1.0) only when “C” is observed in state 1 with counter = 0 (balanced task completion)
A and C.
Only when the counter reaches 0 (all As matched by Cs) does the agent receive the reward.
Summary
- Reward Machines (RMs): Great for sequential or regular tasks (e.g.
A → B → C). - Counting Reward Machines (CRMs): Extend RMs with counters, enabling tasks with memory, counting, or balance (e.g. “same number of
As andCs”). - PyCRM supports both: start with RMs for simple tasks, use CRMs when you need additional expressive power.