Letter World: The Reward Machine

Overview

The Reward Machine is the core automaton that defines what constitutes success in a task.
Instead of embedding complex logic directly in a reward function, the RM tracks high-level events (e.g. visiting specific letters) and provides rewards when certain sequences occur.

The RM Concept

An RM consists of:
  • States: Different phases of the task
  • Transitions: Rules for moving between states based on events
  • Rewards: Values assigned to transitions
For the Letter World environment, we define an RM that rewards the agent for visiting letters in a specific order: first A, then B, then C.

Implementation

The Letter World RM is implemented as a subclass of the RewardMachine base class:
from pycrm.automaton import RewardMachine
from examples.introduction.core.label import Symbol


class LetterWorldRewardMachine(RewardMachine):
    """Reward machine for the Letter World environment."""

    def __init__(self) -> None:
        # Register the environment's event alphabet (mirrors the CRM
        # constructor shown later in this document).
        super().__init__(env_prop_enum=Symbol)

    @property
    def u_0(self) -> int:
        """Return the initial state."""
        return 0

    def _get_state_transition_function(self) -> dict:
        return {
            0: {"A": 1},
            1: {"B": 2},
            2: {"C": -1},  # terminal
        }

    def _get_reward_transition_function(self) -> dict:
        return {
            0: {"A": 0.1},
            1: {"B": 0.1},
            2: {"C": 1.0},  # success reward
        }

Method Explanations

  • u_0: Returns the initial state (0) where the agent begins the task
  • _get_state_transition_function: Defines state transitions based on observed events:
    • From state 0, observing “A” moves to state 1
    • From state 1, observing “B” moves to state 2
    • From state 2, observing “C” moves to terminal state -1
  • _get_reward_transition_function: Defines rewards for each transition:
    • Small rewards (0.1) for intermediate steps (A→1, B→2)
    • Large reward (1.0) for completing the sequence (C→terminal)

How it Works

  1. The agent starts in state 0.
  2. Seeing A moves the RM to state 1, with a small reward (0.1).
  3. Seeing B moves the RM to state 2, with a small reward (0.1).
  4. Seeing C completes the sequence, reaching the terminal state (-1) with the large reward (1.0).
This RM defines a sequential task: A → B → C.
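The walkthrough above can be reproduced with a plain-Python sketch of the two transition tables (a standalone simulation for illustration, not pycrm's actual API; `run_rm` is a hypothetical helper):

```python
# Plain-Python sketch of the sequential A -> B -> C reward machine.
STATE_TRANSITIONS = {
    0: {"A": 1},
    1: {"B": 2},
    2: {"C": -1},  # -1 is the terminal state
}
REWARDS = {
    0: {"A": 0.1},
    1: {"B": 0.1},
    2: {"C": 1.0},
}


def run_rm(events: list[str]) -> tuple[int, float]:
    """Step through a sequence of events; return (final state, total reward)."""
    state, total = 0, 0.0
    for event in events:
        if state == -1:
            break  # terminal state: no further transitions
        if event in STATE_TRANSITIONS[state]:
            total += REWARDS[state][event]
            state = STATE_TRANSITIONS[state][event]
    return state, total


print(run_rm(["A", "B", "C"]))  # final state -1, total reward 1.2
```

Events with no outgoing transition in the current state are simply ignored here (one common convention), so out-of-order letters neither reward nor penalise the agent.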

What if we use a Counting Reward Machine?

Key Differences from the RM

  • Counters: CRMs introduce integer variables that track how many times events occur.
  • Transition Conditions: CRM transitions depend not only on events, but also on counter values (e.g. “if counter = 0, go to success state”).
  • Expressiveness: While RMs capture tasks definable by regular languages, CRMs are Turing-complete and can express much richer tasks.

Example: Balanced Parentheses Task

In Letter World, a CRM could enforce:
  • The agent must visit A n times,
  • Then B,
  • Then C n times,
  • Only when the numbers of As and Cs match does the agent receive a reward.
This requires a counter to track the difference between the As and Cs, something a plain RM cannot do.
from pycrm.automaton import CountingRewardMachine
from examples.introduction.core.label import Symbol


class LetterWorldCountingRewardMachine(CountingRewardMachine):
    """CRM for the balanced A-B-C task in Letter World."""

    def __init__(self) -> None:
        super().__init__(env_prop_enum=Symbol)

    @property
    def u_0(self) -> int:
        return 0

    @property
    def c_0(self) -> tuple[int, ...]:
        return (0,)

    @property
    def encoded_configuration_size(self) -> int:
        return 2

    def _get_state_transition_function(self) -> dict:
        # Keys are "event / (counter condition)": (-) means no condition,
        # (Z) means the counter is zero, (NZ) means it is non-zero.
        return {
            0: {"A / (-)": 0, "B / (-)": 1},
            1: {"C / (NZ)": 1, "C / (Z)": -1},  # -1 is the terminal state
        }

    def _get_counter_transition_function(self) -> dict:
        # Values are deltas added to the counter on each transition.
        return {
            0: {"A / (-)": (1,), "B / (-)": (0,)},
            1: {"C / (NZ)": (-1,), "C / (Z)": (0,)},
        }

    def _get_reward_transition_function(self) -> dict:
        return {
            0: {"A / (-)": -0.1, "B / (-)": -0.1},
            1: {"C / (NZ)": -0.1, "C / (Z)": 1.0},
        }

Method Explanations

  • u_0: Returns the initial state (0) where the agent begins the task
  • c_0: Returns the initial counter values as a tuple (starts with counter = 0)
  • encoded_configuration_size: Returns the size (2) of an encoded state-counter configuration (here, one machine state plus one counter)
  • _get_state_transition_function: Defines state transitions with counter conditions:
    • From state 0: “A” keeps the machine in state 0, “B” moves it to state 1
    • From state 1: “C” stays in state 1 if the counter ≠ 0 (NZ), and moves to the terminal state if the counter = 0 (Z)
  • _get_counter_transition_function: Defines how the counter changes with events (each value is a delta added to the counter):
    • From state 0: “A” increments the counter (+1), “B” leaves it unchanged (delta 0), so the count of As is preserved when moving to state 1
    • From state 1: “C” decrements the counter (-1) if it is non-zero, and leaves it at 0 if it is zero
  • _get_reward_transition_function: Defines rewards with counter conditions:
    • Small penalties (-0.1) for most transitions
    • Large reward (1.0) only when “C” is observed in state 1 with counter = 0 (balanced task completion)
Here, the CRM enforces balanced counts of A and C. Only when the counter reaches 0 (all As matched by Cs) does the agent receive the reward.
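The counter semantics can be sketched the same way (a standalone simulation of the tables above, not pycrm's actual API; `run_crm` is a hypothetical helper):

```python
# Plain-Python sketch of the balanced-count CRM: guards on the counter
# decide which "C" transition fires.

def run_crm(events: list[str]) -> tuple[int, int, float]:
    """Return (final state, counter, total reward) after a sequence of events."""
    state, counter, total = 0, 0, 0.0
    for event in events:
        if state == -1:
            break  # terminal state reached
        if state == 0:
            if event == "A":      # "A / (-)": stay in 0, counter += 1
                counter += 1
                total += -0.1
            elif event == "B":    # "B / (-)": move to 1, counter unchanged
                state = 1
                total += -0.1
        elif state == 1 and event == "C":
            if counter != 0:      # "C / (NZ)": stay in 1, counter -= 1
                counter -= 1
                total += -0.1
            else:                 # "C / (Z)": counter is zero -> terminal, +1.0
                state = -1
                total += 1.0
    return state, counter, total


print(run_crm(["A", "A", "B", "C", "C", "C"]))  # ends terminal: state -1, counter 0
```

Note how the last C fires the Z-guarded transition: only once the counter has been decremented back to zero does observing C yield the success reward and end the episode.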

Summary

  • Reward Machines (RMs): Great for sequential or regular tasks (e.g. A → B → C).
  • Counting Reward Machines (CRMs): Extend RMs with counters, enabling tasks with memory, counting, or balance (e.g. “same number of As and Cs”).
  • PyCRM supports both: start with RMs for simple tasks, use CRMs when you need additional expressive power.