Letter World: The Ground Environment
The Letter World environment is a simple grid world where an agent navigates between labeled positions. It serves as the foundation for our Counting Reward Machine examples.
Overview
The ground environment in our framework refers to the base environment that defines the world dynamics, independent of any reward structure. For Letter World, this is a grid-based environment implemented as a standard Gymnasium environment. See our paper for more details.

Environment Description
Letter World consists of:
- A 3×7 grid where the agent can move in four directions
- Special positions labeled with letters ‘A’, ‘B’, and ‘C’
- Stochastic behavior: when the agent visits position ‘A’, the letter may randomly change to ‘B’ (with 50% probability)
- A simple observation space containing the agent’s position and a flag indicating if A has changed to B
In the rendered grid:
- `x` represents the agent (starting at position [1,3])
- `A` represents the position of letter A (which may change to B)
- `C` represents the position of letter C
- `.` represents empty spaces
Implementation Details
The Letter World environment is implemented as a subclass of `gymnasium.Env`, following the standard Gym interface.
Action Space
The environment supports four actions for movement:
- `0`: Move right
- `1`: Move left
- `2`: Move up
- `3`: Move down
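For illustration, this corresponds to a `Discrete(4)` action space in Gymnasium; a small sketch with hypothetical constant names:

```python
from gymnasium import spaces

# Hypothetical constants mirroring the documented action encoding.
RIGHT, LEFT, UP, DOWN = 0, 1, 2, 3

action_space = spaces.Discrete(4)
assert action_space.contains(RIGHT)
```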
Observation Space
The observation is a 3-dimensional vector:
- First element: Binary flag (0/1) indicating if A has changed to B
- Second element: Row position of the agent
- Third element: Column position of the agent
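For example, the initial observation, before A has changed and with the agent at its starting position [1,3], would look like this (the array dtype is an assumption):

```python
import numpy as np

# symbol_seen = 0 (A has not changed to B), agent at row 1, column 3
obs = np.array([0, 1, 3])
symbol_seen, row, col = obs
```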
Movement Mechanics
When the agent attempts to move outside the grid boundaries, it remains in the same position. Here is a minimal sketch of that boundary-handling logic (the function and constant names are illustrative assumptions, not the actual implementation):
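```python
# Action encoding: 0 = right, 1 = left, 2 = up, 3 = down.
DELTAS = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0)}
N_ROWS, N_COLS = 3, 7

def move(position, action):
    """Apply a movement action, keeping the agent inside the grid.

    Clamping to the grid bounds means a move off the edge leaves
    the agent in the same position.
    """
    d_row, d_col = DELTAS[action]
    row = min(max(position[0] + d_row, 0), N_ROWS - 1)
    col = min(max(position[1] + d_col, 0), N_COLS - 1)
    return (row, col)
```

Stochastic Behavior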
The environment includes a stochastic element: when the agent visits position A, there’s a 50% chance that the symbol changes from ‘A’ to ‘B’. This change is exposed through the `symbol_seen` flag in the observation.
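A self-contained sketch of this mechanic, using a seeded NumPy generator in place of the environment’s internal RNG (the function name is hypothetical):

```python
import numpy as np

def maybe_change_symbol(symbol: str, rng: np.random.Generator) -> str:
    """On a visit to 'A', change it to 'B' with probability 0.5."""
    if symbol == "A" and rng.random() < 0.5:
        return "B"
    return symbol

rng = np.random.default_rng(seed=0)
print(maybe_change_symbol("A", rng))  # 'A' or 'B', each with probability 0.5
```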
Using the Environment
The environment follows the standard Gym interface, making it easy to use with existing RL algorithms. Below is a minimal random-agent loop as an illustration (the `LetterWorld` class name and import path are assumptions):
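```python
from letter_world import LetterWorld  # hypothetical import path

env = LetterWorld()
obs, info = env.reset(seed=0)

for _ in range(100):
    action = env.action_space.sample()  # random policy for demonstration
    obs, reward, terminated, truncated, info = env.step(action)
    # The ground environment is reward-free; rewards come from the RM/CRM.
    env.render()  # prints the colored grid (see below)
    if terminated or truncated:
        obs, info = env.reset()
```

Environment Visualization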
When you call `render()`, the environment prints a colored grid representation to the console. Here’s what it looks like at different states:
(Example renders: the initial state; after moving to position A; after the symbol change from A to B.)
Key Points
- The base environment is reward-free: it only defines the dynamics of the world
- Movement is deterministic, but the symbol change at position A is stochastic
- The environment follows the standard Gym interface for compatibility with RL algorithms
- The agent’s task will be defined by a reward machine (RM) or counting reward machine (CRM), which assigns rewards based on the sequence of visited letters
Next Steps
- Learn about the Labelling Function that maps environment states to symbolic events
- Explore how the Counting Reward Machine defines rewards based on these events
- See how everything comes together in the Cross-Product Environment