Cross-Product Environments
Cross-product environments combine a ground environment with a Reward Machine or Counting Reward Machine to create a new environment where rewards are determined by the machine’s response to symbolic events.
Introduction to Cross-Products
In the RM/CRM framework, a cross-product environment combines three key components:
- Ground Environment: The base environment that defines the world dynamics
- Labelling Function: Translates environment observations to symbolic events
- Reward Machine: Specifies rewards based on sequences of events
The resulting cross-product environment:
- Preserves the original environment dynamics
- Adds reward structure based on high-level task specifications
- Tracks machine states and counters as part of the observation
- Automatically manages the interaction between components
Cross-Product Architecture
The cross-product environment works by:
- Taking actions in the ground environment
- Converting observations to symbolic events via the labelling function
- Updating the RM/CRM state based on events
- Determining rewards according to the RM/CRM’s transition function
- Augmenting the observation with RM/CRM state information (to satisfy the Markov assumption)
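The cycle above can be illustrated with a small, self-contained sketch. It does not use the library's CrossProduct class: `CorridorEnv`, `label`, `RM_DELTA`, `TERMINAL_STATES` and `ToyCrossProduct` are all hypothetical names invented for this illustration, and the reward machine is reduced to a plain lookup table.

```python
class CorridorEnv:
    """Hypothetical ground environment: positions 0..3 on a line."""

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Action 1 moves right; any other action stays in place.
        self.pos = min(self.pos + (1 if action == 1 else 0), 3)
        return self.pos


def label(next_ground_obs):
    """Hypothetical labelling function: ground observation -> set of events."""
    return {"goal"} if next_ground_obs == 3 else set()


# Toy reward machine: (machine state, event) -> (next machine state, reward).
RM_DELTA = {(0, "goal"): (1, 1.0)}
TERMINAL_STATES = {1}


class ToyCrossProduct:
    """Illustrative stand-in for a cross-product environment."""

    def __init__(self, ground_env):
        self.ground_env = ground_env
        self.u = 0

    def reset(self):
        self.u = 0
        return (self.ground_env.reset(), self.u)

    def step(self, action):
        next_ground_obs = self.ground_env.step(action)  # act in the ground env
        events = label(next_ground_obs)                 # observations -> events
        reward = 0.0
        for event in sorted(events):                    # update machine, read reward
            if (self.u, event) in RM_DELTA:
                self.u, reward = RM_DELTA[(self.u, event)]
        obs = (next_ground_obs, self.u)                 # augment with machine state
        return obs, reward, self.u in TERMINAL_STATES


env = ToyCrossProduct(CorridorEnv())
trajectory = [env.reset()]
total_reward, done = 0.0, False
while not done:
    obs, reward, done = env.step(1)
    trajectory.append(obs)
    total_reward += reward
```

In this toy run the machine state stays at 0 until the agent reaches position 3, at which point the "goal" event moves the machine to its terminal state and yields the only non-zero reward.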
Creating a Cross-Product Environment
To create a custom cross-product environment, you need to subclass the CrossProduct base class.
Observation Methods
The CrossProduct class provides default implementations of _get_obs and to_ground_obs that work well for continuous and deep RL settings.
_get_obs
This method combines ground environment observations with CRM state information to satisfy the Markov assumption. The default implementation concatenates the ground observation with a one-hot encoding of the machine state u and the raw counter values c.
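As a rough illustration of that concatenation, the sketch below builds the augmented observation from plain Python lists. The function name `get_obs_sketch` and its parameters (including `num_states`) are assumptions made for this example; the library's actual method signature is not shown here and may differ, and a real implementation would likely operate on arrays.

```python
def get_obs_sketch(ground_obs, u, counters, num_states):
    """Concatenate ground obs, one-hot machine state, and raw counter values."""
    one_hot_u = [0.0] * num_states
    one_hot_u[u] = 1.0  # mark the active machine state
    return (
        [float(x) for x in ground_obs]
        + one_hot_u
        + [float(c) for c in counters]
    )


obs = get_obs_sketch(ground_obs=[0.5, -1.0], u=1, counters=(3,), num_states=3)
# obs == [0.5, -1.0, 0.0, 1.0, 0.0, 3.0]
```

The layout is fixed: ground features first, then `num_states` one-hot entries, then one entry per counter.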
to_ground_obs
This method extracts the ground observation from a cross-product observation. The default implementation strips the one-hot encoded machine state and counter values from the end of the observation.
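The stripping step might look like the following sketch, assuming the observation is a flat sequence laid out as ground features, then the one-hot machine state, then the counters. `to_ground_obs_sketch`, `num_states` and `num_counters` are hypothetical names for this example, not the library's signature.

```python
def to_ground_obs_sketch(obs, num_states, num_counters):
    """Drop the trailing one-hot machine state and counter entries."""
    suffix_len = num_states + num_counters
    # Slice by absolute index so that suffix_len == 0 returns the full observation.
    return obs[: len(obs) - suffix_len]


cross_product_obs = [0.5, -1.0, 0.0, 1.0, 0.0, 3.0]
ground_obs = to_ground_obs_sketch(cross_product_obs, num_states=3, num_counters=1)
# ground_obs == [0.5, -1.0]
```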
Overriding the Defaults
You can override these methods for custom observation encodings. For example, in tabular settings you may prefer to use the raw integer machine state instead of a one-hot encoding.
Example: Letter World Cross-Product
The Letter World cross-product overrides the default observation methods to use a tabular encoding (raw integer u rather than one-hot).
This cross-product:
- Extends the Letter World environment with RM/CRM state information
- Augments the observation with the current machine state (u) and counter value (c[0])
- Preserves the original action space
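A tabular encoding along these lines could be sketched as below. This is not the Letter World source code: it assumes the ground observation is a tuple of integers and that there is a single counter, and the function names are hypothetical.

```python
def get_obs_tabular(ground_obs, u, counters):
    """Append the raw integer machine state and first counter to the ground obs."""
    return tuple(ground_obs) + (u, counters[0])


def to_ground_obs_tabular(obs):
    """Drop the trailing (u, c[0]) pair to recover the ground observation."""
    return obs[:-2]


obs = get_obs_tabular(ground_obs=(0, 1, 2), u=1, counters=(4,))
# The observation is a hashable tuple, so it can index a Q-table directly.
q_table = {obs: 0.0}
recovered = to_ground_obs_tabular(obs)
```

Keeping the machine state as a single integer keeps the observation compact and hashable, which is usually what tabular methods need.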
Using Cross-Product Environments
Once created, a cross-product environment can be used like any standard Gym environment.
Interpreting Observations
The observation from a cross-product environment contains both ground environment information and RM/CRM state.
Counterfactual Experience Generation
One powerful feature of cross-product environments is their ability to generate counterfactual experiences. The generation step:
- Takes a ground environment transition (obs, action, next_obs)
- Generates experiences for all possible RM/CRM states and counter configurations (up to an upper bound)
- Returns batches of experience tuples that can be used for more efficient learning
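The idea can be sketched in a few lines: a single ground transition is replayed against every machine state and every counter value up to a bound. Everything here is an assumption made for illustration (the toy `machine_step` transition rules, the `label` function, and the experience tuple layout); the library's own method may use different names and return types.

```python
from itertools import product


def label(ground_obs, action, next_ground_obs):
    """Hypothetical labelling function: the event is the letter stepped onto."""
    return next_ground_obs if next_ground_obs in {"A", "B"} else None


def machine_step(u, c, event):
    """Toy counting machine: returns (next_u, next_c, reward)."""
    if u == 0 and event == "A":
        return 0, c + 1, 0.0                          # count each A
    if u == 0 and event == "B":
        return 1, c, 0.0                              # switch to counting down
    if u == 1 and event == "B" and c > 0:
        return 1, c - 1, 1.0 if c == 1 else 0.0       # reward when counter hits 0
    return u, c, 0.0                                  # no matching transition


def generate_counterfactuals(ground_obs, action, next_ground_obs,
                             states=(0, 1), max_counter=2):
    """Replay one ground transition from every (u, c) configuration."""
    event = label(ground_obs, action, next_ground_obs)
    experiences = []
    for u, c in product(states, range(max_counter + 1)):
        next_u, next_c, reward = machine_step(u, c, event)
        experiences.append(
            ((ground_obs, u, c), action, reward, (next_ground_obs, next_u, next_c))
        )
    return experiences


batch = generate_counterfactuals(".", "right", "B")
```

With two machine states and counter values 0 to 2, one real transition yields six experience tuples, which is where the learning speed-up comes from.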
Behind the Scenes: How Cross-Products Work
The cross-product implements the following key methods:
reset()
Initializes both the ground environment and the RM/CRM.
step(action)
Handles the full interaction cycle.
Type Parameters
The CrossProduct class uses generic type parameters for flexibility:
- GroundObsType: Type of ground environment observations
- ObsType: Type of cross-product environment observations
- ActType: Type of actions
- RenderFrame: Type returned by the render method
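To show how four type parameters like these are typically bound in Python, here is a sketch using `typing.Generic`. `CrossProductSketch` and `TupleCrossProduct` are hypothetical stand-ins, not the library's classes, and the parameter order is an assumption. The built-in `tuple[...]` syntax requires Python 3.9 or later.

```python
from typing import Generic, TypeVar

GroundObsType = TypeVar("GroundObsType")
ObsType = TypeVar("ObsType")
ActType = TypeVar("ActType")
RenderFrame = TypeVar("RenderFrame")


class CrossProductSketch(Generic[GroundObsType, ObsType, ActType, RenderFrame]):
    """Hypothetical stand-in parameterised by the four type variables."""

    def to_ground_obs(self, obs: ObsType) -> GroundObsType:
        raise NotImplementedError


class TupleCrossProduct(
    CrossProductSketch[tuple[int, int], tuple[int, int, int], int, str]
):
    """Ground obs is (row, col); cross-product obs appends the machine state."""

    def to_ground_obs(self, obs: tuple[int, int, int]) -> tuple[int, int]:
        return obs[:2]


ground = TupleCrossProduct().to_ground_obs((3, 4, 1))
```

Binding the parameters in the subclass lets a static type checker flag mismatches between the ground and cross-product observation types.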
Best Practices
When creating cross-product environments:
- State Augmentation: The default _get_obs (one-hot u + raw counters) works well for most cases. Override it only when you need a custom encoding (e.g. tabular settings)
- Consistent Types: Ensure observation and action spaces are compatible with RL algorithms
- Reasonable Max Steps: Set an appropriate max_steps value for your task
- Use Counterfactual Learning: Take advantage of counterfactual experience generation for faster learning
- Type Annotations: Use appropriate type parameters for better code safety
Summary
Cross-product environments are the glue that binds together ground environments, labelling functions, and RM/CRMs. They:
- Create a seamless interface between environment dynamics and task specifications
- Preserve the Gym environment API for compatibility with standard RL algorithms
- Augment observations with RM/CRM state information
- Provide counterfactual experience generation for accelerated learning