Letter World: The Ground Environment
Overview
The ground environment in our framework refers to the base environment that defines the world dynamics, independent of any reward structure. For Letter World, this is a grid-based environment implemented as a standard Gymnasium environment.
See our paper for more details.
Environment Description
Letter World consists of:
- A 3×7 grid where the agent can move in four directions
- Special positions labeled with letters ‘A’, ‘B’, and ‘C’
- Stochastic behavior: when the agent visits position ‘A’, the letter may change to ‘B’ (with 50% probability per visit)
- A simple observation space containing the agent’s position and a flag indicating if A has changed to B.
Here’s a visualization of the environment:
+-------------+
|. . . . . . .|
|. A . x . C .|
|. . . . . . .|
+-------------+
Where:
x represents the agent (starting at position [1,3])
A represents the position of letter A (which may change to B)
C represents the position of letter C
. represents empty spaces
Implementation Details
The Letter World environment is implemented as a subclass of gymnasium.Env, following the standard Gym interface.
import gymnasium as gym
import numpy as np
class LetterWorld(gym.Env):
"""The Letter World environment."""
RIGHT = 0
LEFT = 1
UP = 2
DOWN = 3
A_POSITION = np.array([1, 1])
C_POSITION = np.array([1, 5])
N_ROWS = 3
N_COLS = 7
def __init__(self) -> None:
"""Initialize the environment."""
self.action_space = gym.spaces.Discrete(4) # Four movement directions
self.observation_space = gym.spaces.Box(
low=0, high=100, shape=(3,), dtype=np.int32
) # [symbol_seen, row, col]
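The excerpt above omits the reset logic and the observation helper used later in step(). A minimal sketch of how these might look, assuming the agent starts at position [1, 3] with the symbol flag cleared (the actual implementation may differ):
    def reset(self, seed=None, options=None):
        """Reset the agent to the starting position and clear the symbol flag."""
        super().reset(seed=seed)
        self.agent_position = np.array([1, 3])  # starting position [row, col]
        self.symbol_seen = False                # becomes True once A changes to B
        return self._get_obs(), {}

    def _get_obs(self) -> np.ndarray:
        """Return the observation vector [symbol_seen, row, col]."""
        row, col = self.agent_position
        return np.array([int(self.symbol_seen), row, col], dtype=np.int32)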
Action Space
The environment supports four actions for movement:
- 0: Move right
- 1: Move left
- 2: Move up
- 3: Move down
These actions are defined as constants in the LetterWorld class for better code readability.
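For example, a step can be taken with the named constant instead of the raw integer (a small usage sketch):
env = LetterWorld()
env.reset()
# Equivalent to env.step(2): move the agent up by one row.
obs, reward, terminated, truncated, info = env.step(LetterWorld.UP)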
Observation Space
The observation is a 3-dimensional vector:
- First element: Binary flag (0/1) indicating if A has changed to B
- Second element: Row position of the agent
- Third element: Column position of the agent
This is implemented as a Box space with shape (3,) in the code.
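For example, the initial observation (assuming the starting position [1, 3] shown earlier) looks like this:
obs, info = env.reset()
print(obs)  # [0 1 3]: A has not changed, agent at row 1, column 3
symbol_seen, row, col = obs  # unpack the three components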
Movement Mechanics
When the agent attempts to move outside the grid boundaries, it remains in the same position. Here’s the code that handles movement:
    def _update_agent_position(self, action: int) -> None:
        """Moves that take agent out of the grid leave it in the same position."""
        row, col = self.agent_position
        if action == self.RIGHT:
            n_row = row
            n_col = col + 1 if col < self.N_COLS - 1 else col
        elif action == self.LEFT:
            n_row = row
            n_col = col - 1 if col > 0 else col
        elif action == self.UP:
            n_col = col
            n_row = row - 1 if row > 0 else row
        elif action == self.DOWN:
            n_col = col
            n_row = row + 1 if row < self.N_ROWS - 1 else row
        else:
            raise ValueError(f"Invalid action {action}.")
        self.agent_position = (n_row, n_col)
The method includes boundary checks for each direction to prevent the agent from moving outside the grid.
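As a quick sanity check of this clamping behavior (a small sketch, assuming the starting position [1, 3]), moving down repeatedly pins the agent against the bottom row:
env = LetterWorld()
obs, _ = env.reset()
for _ in range(5):
    obs, _, _, _, _ = env.step(LetterWorld.DOWN)
print(obs)  # [0 2 3]: the row is clamped at N_ROWS - 1 = 2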
Stochastic Behavior
The environment includes a stochastic element: when the agent visits position A, there’s a 50% chance that the symbol changes from ‘A’ to ‘B’.
    def step(self, action: int) -> tuple[np.ndarray, float, bool, bool, dict]:
        """Take a step in the environment."""
        self._update_agent_position(action)
        if not self.symbol_seen and np.array_equal(
            self.agent_position, self.A_POSITION
        ):
            self.symbol_seen = np.random.random() < 0.5
        return self._get_obs(), 0, False, False, {}
This stochastic behavior creates an interesting challenge for learning algorithms: a single visit to A is not guaranteed to change the symbol, so the agent may need to visit it several times. The change is tracked by the symbol_seen flag, which appears as the first element of the observation.
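As a concrete illustration (a small sketch, again assuming the starting position [1, 3]):
env = LetterWorld()
obs, _ = env.reset()
# From [1, 3], two moves left reach position A at [1, 1].
env.step(LetterWorld.LEFT)
obs, _, _, _, _ = env.step(LetterWorld.LEFT)
print(obs)  # [1 1 1] about half the time (symbol changed), otherwise [0 1 1]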
Using the Environment
The environment follows the standard Gym interface, making it easy to use with existing RL algorithms:
# Create the environment
env = LetterWorld()

# Reset to get initial observation
observation, info = env.reset()

# Take random actions
for _ in range(10):
    action = env.action_space.sample()  # Random action
    observation, reward, terminated, truncated, info = env.step(action)

    # Visualize the state
    env.render()

    if terminated or truncated:
        observation, info = env.reset()
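For reproducible runs, the usual Gymnasium seeding applies. Note that the 50% symbol change in step() uses np.random directly, so you may also want to seed NumPy's global generator (a suggestion, not shown in the environment code above):
np.random.seed(0)          # the stochastic A -> B change uses np.random directly
env.action_space.seed(0)   # makes action_space.sample() reproducible
observation, info = env.reset(seed=0)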
Environment Visualization
When you call render(), the environment prints a colored grid representation to the console. Here’s what it looks like at different states:
Initial State
+-------------+
|. . . . . . .|
|. A . x . C .|
|. . . . . . .|
+-------------+
After Moving to Position A
+-------------+
|. . . . . . .|
|. x . . . C .|
|. . . . . . .|
+-------------+
After Symbol Change (A to B)
+-------------+
|. . . . . . .|
|. B . . . C .|
|. . . x . . .|
+-------------+
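The render() implementation itself is not shown above; a minimal sketch that would reproduce ASCII grids like these (the real method may also add colour, so treat this as an assumption) is:
    def render(self) -> None:
        """Print an ASCII view of the grid: letters, the agent as 'x', '.' elsewhere."""
        grid = [["." for _ in range(self.N_COLS)] for _ in range(self.N_ROWS)]
        a_row, a_col = self.A_POSITION
        c_row, c_col = self.C_POSITION
        grid[a_row][a_col] = "B" if self.symbol_seen else "A"
        grid[c_row][c_col] = "C"
        row, col = self.agent_position
        grid[row][col] = "x"
        border = "+" + "-" * (2 * self.N_COLS - 1) + "+"
        print(border)
        for r in grid:
            print("|" + " ".join(r) + "|")
        print(border)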
Note that the environment itself doesn’t provide rewards - the reward structure will be defined by the RM/CRM we’ll connect to this environment.
Key Points
- The base environment is reward-free - it only defines the dynamics of the world
- Navigation is deterministic, but the state transition at position A is stochastic
- The environment follows the standard Gym interface for compatibility with RL algorithms
- The agent’s task will be defined by an RM/CRM, which will assign rewards based on the sequence of visited letters
Next Steps