Letter World: The Reward Machine

Overview

The Reward Machine is the core automaton that defines what constitutes success in a task.
Instead of embedding complex logic directly in a reward function, the RM tracks high-level events (e.g. visiting specific letters) and provides rewards when certain sequences occur.

The RM Concept

An RM consists of:

States: Different phases of the task
Transitions: Rules for moving between states based on events
Rewards: Values assigned to transitions

For the Letter World environment, we define an RM that rewards the agent for visiting letters in a specific order: first A, then B, then C.

Implementation

The Letter World RM is implemented as a subclass of the RewardMachine base class:

from pycrm.automaton import RewardMachine
from examples.introduction.core.label import Symbol


class LetterWorldRewardMachine(RewardMachine):
    """Reward machine for the Letter World environment."""

    @property
    def u_0(self) -> int:
        """Return the initial state."""
        return 0

    def _get_state_transition_function(self) -> dict:
        return {
            0: {"A": 1},
            1: {"B": 2},
            2: {"C": -1},  # terminal
        }

    def _get_reward_transition_function(self) -> dict:
        return {
            0: {"A": 0.1},
            1: {"B": 0.1},
            2: {"C": 1.0},  # success reward
        }

Method Explanations

u_0: Returns the initial state (0) where the agent begins the task
_get_state_transition_function: Defines state transitions based on observed events:
- From state 0, observing “A” moves to state 1
- From state 1, observing “B” moves to state 2
- From state 2, observing “C” moves to terminal state -1
_get_reward_transition_function: Defines rewards for each transition:
- Small rewards (0.1) for intermediate steps (A→1, B→2)
- Large reward (1.0) for completing the sequence (C→terminal)

How it Works

The agent starts in state 0.
Seeing A moves to state 1, with a small reward.
Seeing B moves to state 2, with a small reward.
Seeing C completes the sequence, reaching the terminal state with a larger reward.

This RM defines a sequential task: A → B → C.

What if we use a Counting Reward Machine?

Key Differences from the RM

Counters: CRMs introduce integer variables that track how many times events occur.
Transition Conditions: CRM transitions depend not only on events, but also on counter values (e.g. “if counter = 0, go to success state”).
Expressiveness: While RMs capture tasks definable by regular languages, CRMs are Turing-complete and can express much richer tasks.

Example: Balanced Parentheses Task

In Letter World, a CRM could enforce:

The agent must visit A n times,
Then B,
Then C n times,
Only when the number of As and Cs match does the agent receive a reward.

This requires a counter to track the difference between As and Cs — something a plain RM cannot do.

from pycrm.automaton import CountingRewardMachine
from examples.introduction.core.label import Symbol


class LetterWorldCountingRewardMachine(CountingRewardMachine):
    """CRM for the balanced A-B-C task in Letter World."""

    def __init__(self) -> None:
        super().__init__(env_prop_enum=Symbol)

    @property
    def u_0(self) -> int:
        return 0

    @property
    def c_0(self) -> tuple[int, ...]:
        return (0,)

    @property
    def encoded_configuration_size(self) -> int:
        return 2

    def _get_state_transition_function(self) -> dict:
        return {
            0: {"A / (-)": 0, "B / (-)": 1},
            1: {"C / (NZ)": 1, "C / (Z)": -1},
        }

    def _get_counter_transition_function(self) -> dict:
        return {
            0: {"A / (-)": (1,), "B / (-)": (0,)},
            1: {"C / (NZ)": (-1,), "C / (Z)": (0,)},
        }

    def _get_reward_transition_function(self) -> dict:
        return {
            0: {"A / (-)": -0.1, "B / (-)": -0.1},
            1: {"C / (NZ)": -0.1, "C / (Z)": 1.0},
        }

Method Explanations

u_0: Returns the initial state (0) where the agent begins the task
c_0: Returns the initial counter values as a tuple (starts with counter = 0)
encoded_configuration_size: Returns the size (2) for encoding state-counter configurations
_get_state_transition_function: Defines state transitions with counter conditions:
- From state 0: “A” keeps in state 0, “B” moves to state 1
- From state 1: “C” stays in state 1 if counter ≠ 0 (NZ), moves to terminal if counter = 0 (Z)
_get_counter_transition_function: Defines how counters change with events:
- From state 0: “A” increments counter (+1), “B” resets counter to 0
- From state 1: “C” decrements counter (-1) if ≠ 0, keeps at 0 if = 0
_get_reward_transition_function: Defines rewards with counter conditions:
- Small penalties (-0.1) for most transitions
- Large reward (1.0) only when “C” is observed in state 1 with counter = 0 (balanced task completion)

Here, the CRM enforces balanced counts of A and C. Only when the counter reaches 0 (all As matched by Cs) does the agent receive the reward.

Summary

Reward Machines (RMs): Great for sequential or regular tasks (e.g. A → B → C).
Counting Reward Machines (CRMs): Extend RMs with counters, enabling tasks with memory, counting, or balance (e.g. “same number of As and Cs”).
PyCRM supports both: start with RMs for simple tasks, use CRMs when you need additional expressive power.

Get Started

Example

Code Concepts

3 - Reward Machine

Letter World: The Reward Machine

Overview

The RM Concept

Implementation

Method Explanations

How it Works

What if we use a Counting Reward Machine?

Key Differences from the RM

Example: Balanced Parentheses Task

Method Explanations

Summary

Get Started

Example

Code Concepts

​Letter World: The Reward Machine

​Overview

​The RM Concept

​Implementation

​Method Explanations

​How it Works

​What if we use a Counting Reward Machine?

​Key Differences from the RM

​Example: Balanced Parentheses Task

​Method Explanations

​Summary

Letter World: The Reward Machine

Overview

The RM Concept

Implementation

Method Explanations

How it Works

What if we use a Counting Reward Machine?

Key Differences from the RM

Example: Balanced Parentheses Task

Method Explanations

Summary