Reinforcement Learning Agents

The PyCRM framework provides agent implementations specifically designed to learn from counterfactual experiences generated by Reward Machines and Counting Reward Machines.

Introduction

The pycrm.agents module provides reinforcement learning algorithms that integrate with Reward Machines and Counting Reward Machines to efficiently learn task policies. These agents are designed to take advantage of the counterfactual experience generation capabilities provided by the pycrm framework. The framework includes two main types of agent implementations:
  1. Tabular agents for discrete state and action spaces
  2. Deep RL agents based on Stable Baselines 3 for continuous domains

Tabular Agents

Tabular agents are suitable for environments with discrete state and action spaces. The framework provides:

Q-Learning (QL)

The standard Q-Learning algorithm is implemented in pycrm.agents.tabular.ql. This serves as a baseline implementation and uses the standard Q-learning update rule:
Q(s,a) ← Q(s,a) + α[r + γ·max_a'Q(s',a') - Q(s,a)]
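As an illustration, this update can be written as a simple tabular operation (a minimal sketch only; the function and variable names below are illustrative and not part of the pycrm API):
import numpy as np

def q_learning_update(q_table, state, action, reward, next_state, alpha=0.01, gamma=0.99):
    # TD target: immediate reward plus the discounted value of the best next action
    td_target = reward + gamma * np.max(q_table[next_state])
    # Move Q(state, action) a step of size alpha towards the TD target
    q_table[state, action] += alpha * (td_target - q_table[state, action])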

Counterfactual Q-Learning (CQL)

The pycrm.agents.tabular.cql module implements Counterfactual Q-Learning, which extends standard Q-Learning to take advantage of the counterfactual experience generation capabilities of Reward Machines and Counting Reward Machines.
from pycrm.agents.tabular.cql import CounterfactualQLearningAgent

# Create the agent
agent = CounterfactualQLearningAgent(
    env=cross_product_env,  # Must be a CrossProduct environment
    epsilon=0.1,            # Exploration rate
    learning_rate=0.01,     # Learning rate
    discount_factor=0.99    # Discount factor
)

# Train the agent
returns = agent.learn(total_episodes=1000)
The key enhancement is in the learning process, which:
  1. Takes a real step in the environment
  2. Generates counterfactual experiences using the CrossProduct environment
  3. Updates Q-values for all valid counterfactual experiences
This allows the agent to learn from many possible state configurations in a single environment step, effectively “imagining” how the reward machine would behave in different states, and it significantly accelerates learning compared to standard Q-Learning; a conceptual sketch follows.
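The sketch below assumes the Q-table and the hyperparameters used above (learning rate 0.01, discount factor 0.99). The cross_product_env.generate_counterfactual_experience(...) method and the epsilon_greedy(...) helper are hypothetical names used purely for illustration; the agent handles this internally and the actual pycrm interface may differ:
import numpy as np

# 1. Take a real step in the environment (epsilon-greedy action selection assumed)
action = epsilon_greedy(q_table, obs, epsilon=0.1)
next_obs, reward, terminated, truncated, _ = cross_product_env.step(action)

# 2. Generate counterfactual experiences via the CrossProduct environment
#    (hypothetical method name, for illustration only)
experiences = cross_product_env.generate_counterfactual_experience(obs, action, next_obs)

# 3. Update Q-values for every valid counterfactual experience
for cf_obs, cf_reward, cf_next_obs, cf_done in experiences:
    target = cf_reward + (0.0 if cf_done else 0.99 * np.max(q_table[cf_next_obs]))
    q_table[cf_obs, action] += 0.01 * (target - q_table[cf_obs, action])

obs = next_obs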

Deep RL Agents

For environments with continuous state or action spaces, the framework provides integrations with Stable Baselines 3. The framework currently supports counterfactual versions of the DQN, SAC, TD3, and DDPG algorithms.

Counterfactual DQN (C-DQN)

The pycrm.agents.sb3.dqn.cdqn module implements Counterfactual Deep Q-Network, extending the DQN algorithm from Stable Baselines 3 to learn from counterfactual experiences.
from pycrm.agents.sb3.dqn import CounterfactualDQN

# Create the agent
agent = CounterfactualDQN(
    policy="MlpPolicy",
    env=cross_product_env,  # Must be a CrossProduct environment
    learning_rate=1e-4,
    buffer_size=1_000_000,
    batch_size=32,
    exploration_fraction=0.5,
    exploration_final_eps=0.1
)

# Train the agent
agent.learn(total_timesteps=1_000_000)

Counterfactual SAC (C-SAC)

The pycrm.agents.sb3.sac.csac module implements Counterfactual Soft Actor-Critic (C-SAC), extending the SAC algorithm from Stable Baselines 3 to learn from counterfactual experiences.
from pycrm.agents.sb3.sac import CounterfactualSAC

# Create the agent
agent = CounterfactualSAC(
    policy="MlpPolicy",
    env=cross_product_env,  # Must be a CrossProduct environment
    learning_rate=3e-4,
    buffer_size=1_000_000,
    batch_size=256
)

# Train the agent
agent.learn(total_timesteps=100_000)

Counterfactual TD3 (C-TD3)

The pycrm.agents.sb3.td3.ctd3 module implements Counterfactual Twin Delayed Deep Deterministic Policy Gradient (C-TD3), extending the TD3 algorithm from Stable Baselines 3.
from pycrm.agents.sb3.td3 import CounterfactualTD3

# Create the agent
agent = CounterfactualTD3(
    policy="MlpPolicy",
    env=cross_product_env,  # Must be a CrossProduct environment
    learning_rate=1e-3,
    buffer_size=1_000_000,
    batch_size=256
)

# Train the agent
agent.learn(total_timesteps=100_000)

Counterfactual DDPG (C-DDPG)

The pycrm.agents.sb3.ddpg.cddpg module implements Counterfactual Deep Deterministic Policy Gradient (C-DDPG), extending the DDPG algorithm from Stable Baselines 3.
from pycrm.agents.sb3.ddpg import CounterfactualDDPG

# Create the agent
agent = CounterfactualDDPG(
    policy="MlpPolicy",
    env=cross_product_env,  # Must be a CrossProduct environment
    learning_rate=1e-3,
    buffer_size=1_000_000,
    batch_size=256
)

# Train the agent
agent.learn(total_timesteps=100_000)
All counterfactual deep RL agents enhance their respective base algorithms by:
  1. Collecting transitions from the environment
  2. Generating counterfactual experiences for each transition
  3. Adding these experiences to the replay buffer
  4. Training the policy network using both real and counterfactual experiences
This approach is particularly effective for complex continuous control tasks and environments with sparse rewards.
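As a rough illustration, this interaction with the replay buffer could be sketched as follows (reusing the hypothetical generate_counterfactual_experience helper from the tabular sketch above; the counterfactual agents perform this bookkeeping internally):
from collections import deque

replay_buffer = deque(maxlen=1_000_000)

# 1. Collect a transition from the environment
next_obs, reward, terminated, truncated, _ = cross_product_env.step(action)

# 2. Generate counterfactual experiences for this transition
#    (hypothetical method name, for illustration only)
experiences = cross_product_env.generate_counterfactual_experience(obs, action, next_obs)

# 3. Add the real and counterfactual transitions to the replay buffer
replay_buffer.append((obs, action, reward, next_obs, terminated))
for cf_obs, cf_reward, cf_next_obs, cf_done in experiences:
    replay_buffer.append((cf_obs, action, cf_reward, cf_next_obs, cf_done))

# 4. The policy and value networks are then trained on minibatches sampled
#    from this mixed buffer of real and counterfactual transitions.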

Vectorised Environment Support

All counterfactual deep RL agents provide specialised support for vectorised environments through the pycrm.agents.sb3.wrapper module, which includes:
  • DispatchSubprocVecEnv: An extension of Stable Baselines 3’s SubprocVecEnv that enables efficient parallel generation of counterfactual experiences
This implementation is designed to maintain performance when working with multiple parallel environments:
from pycrm.agents.sb3.wrapper import DispatchSubprocVecEnv

# Create vectorised environment
envs = DispatchSubprocVecEnv([
    lambda: create_cross_product_env() for _ in range(8)
])

# Create C-SAC agent with vectorised environment
agent = CounterfactualSAC(
    policy="MlpPolicy",
    env=envs,
    verbose=1
)
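As with the single-environment agents, training then proceeds as usual:
agent.learn(total_timesteps=1_000_000)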

Performance Benefits

Agents that leverage counterfactual experiences show several advantages:
  1. Faster Convergence: Learning from counterfactual experiences often reduces the number of episodes needed to learn optimal policies by orders of magnitude.
  2. Better Sample Efficiency: By extracting more information from each environment interaction, these agents make better use of collected experiences.
  3. More Robust Policies: Since the agent explores the reward machine state space more completely, the resulting policies tend to be more robust.

Requesting Custom Agent Implementations

Need support for a different RL algorithm? We’re happy to add it!
The PyCRM framework is designed to be extensible, and we’re committed to supporting a wide range of reinforcement learning algorithms. If you require an implementation that is not currently available, such as:
  • Integration with additional Stable Baselines 3 algorithms (PPO, A2C, etc.)
  • Support for other deep RL frameworks (RLlib, PyTorch, etc.)
  • Custom agent architectures or learning algorithms
  • Specialised handling for your environment type
Please open an issue on our GitHub repository with the following information:
  1. The algorithm or implementation you need
  2. Your use case or environment
  3. Any specific requirements or constraints
We actively monitor issues and will prioritise implementing requested features based on community needs. We believe in making the PyCRM framework as versatile and useful as possible for all users.

Example Usage

Here’s a complete example showing how to use the Counterfactual Q-Learning agent with the discrete Letter World environment:
from examples.introduction.core.ground import LetterWorld
from examples.introduction.core.label import LetterWorldLabellingFunction
from examples.introduction.core.machine import LetterWorldCountingRewardMachine
from examples.introduction.core.crossproduct import LetterWorldCrossProduct
from pycrm.agents.tabular.cql import CounterfactualQLearningAgent

# Create environment components
ground_env = LetterWorld()
lf = LetterWorldLabellingFunction()
crm = LetterWorldCountingRewardMachine()

# Create cross-product environment
cross_product = LetterWorldCrossProduct(
    ground_env=ground_env,
    crm=crm,
    lf=lf,
    max_steps=1000
)

# Create and train the agent
agent = CounterfactualQLearningAgent(
    env=cross_product,
    epsilon=0.1,
    learning_rate=0.01,
    discount_factor=0.99
)

# Train the agent
returns = agent.learn(total_episodes=1000)

# Evaluate the learned policy
obs, _ = cross_product.reset()
done = False
total_reward = 0

while not done:
    action = agent.get_action(obs)
    obs, reward, terminated, truncated, _ = cross_product.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Evaluation reward: {total_reward}")
For continuous environments, you can use any of the counterfactual deep RL agents:
from examples.introduction.core.ground import LetterWorld
from examples.introduction.core.label import LetterWorldLabellingFunction
from examples.introduction.core.machine import LetterWorldCountingRewardMachine
from examples.introduction.core.crossproduct import LetterWorldCrossProduct
from pycrm.agents.sb3.sac import CounterfactualSAC

# Create environment components
ground_env = LetterWorld()
lf = LetterWorldLabellingFunction()
crm = LetterWorldCountingRewardMachine()

# Create cross-product environment
cross_product = LetterWorldCrossProduct(
    ground_env=ground_env,
    crm=crm,
    lf=lf,
    max_steps=100
)

# Create and train the agent
agent = CounterfactualSAC(
    policy="MlpPolicy",
    env=cross_product,
    learning_rate=3e-4,
    buffer_size=1_000_000,
    batch_size=256
)

# Train the agent
agent.learn(total_timesteps=50_000)

Summary

The reinforcement learning agents in the PyCRM framework are specifically designed to take advantage of the counterfactual experience generation capabilities of Reward Machines and Counting Reward Machines. This approach significantly improves learning efficiency and policy quality compared to standard reinforcement learning algorithms. By providing both tabular and deep RL implementations, the framework supports a wide range of environments and task specifications, from simple discrete environments to complex continuous control problems.