OpenEnv Ecosystem · April 2026

OpenRA-RL: An Open Platform for AI Agents in Real-Time Strategy Games

Xiaochuang Yuan  ·  Hui Xu  ·  Yiyu Tian  ·  Siqi Zhang  ·  Ruiyue Wang  ·  Kaiser Sun

April 2026

Abstract

Real-time strategy (RTS) games have driven landmark AI achievements, yet prior systems like AlphaStar and OpenAI Five relied on highly specialized architectures and training pipelines that do not generalize beyond their target domains. Meanwhile, LLM-based agents have emerged as a promising paradigm for general-purpose reasoning, and there is growing interest in applying them to RTS games. However, no existing RTS platform supports their distinct requirements of high-level action interfaces, asynchronous interaction, and tolerance for variable inference latencies, leaving current efforts difficult and ad hoc.

We introduce OpenRA-RL, an open-source platform enabling LLM-based agents to play the classic RTS game Red Alert. Our platform provides a Gymnasium-style Python API, integrates with the Model Context Protocol (MCP) to expose 50 game actions as tool calls, and implements an asynchronous architecture for agents operating slower than real-time. To validate the platform, we run a Qwen3 32B agent for five episodes; the platform's eight-dimensional reward vector reveals a specific strategic weakness, with the agent scoring 0.58–0.80 on economic management but zero on combat execution, demonstrating that even frontier LLMs require substantial learning to master RTS games.

50 MCP game actions exposed as tool calls
64 concurrent game sessions in a single process
40× reset latency improvement (v1→v2)
8-dimensional reward vector

01 Introduction

Real-time strategy (RTS) games have long served as a compelling testbed for artificial intelligence research, demanding capabilities that span rapid decision-making, long-horizon planning, resource management, and adversarial reasoning under partial observability. Landmark efforts in this domain include DeepMind's AlphaStar for StarCraft II, OpenAI Five for Dota 2, and earlier work on StarCraft. These systems achieved impressive performance, demonstrating that machine learning can master complex strategic environments.

However, these prior approaches relied on highly specialized architectures and training pipelines that do not generalize beyond their target games. AlphaStar, for instance, required a bespoke neural network architecture, imitation learning from human replays, and distributed reinforcement learning across thousands of TPUs. Such methods yield domain-specific experts but offer little reusable infrastructure for other tasks.

Meanwhile, LLM-based agents have emerged as a promising paradigm for general-purpose reasoning, leveraging pretrained world knowledge, natural language reasoning, and high-level semantic actions to tackle diverse tasks, including web navigation, code generation, and tool use.

The infrastructure gap: No existing RTS platform is designed to support LLM-based agents. Current platforms assume agents that operate at millisecond timescales with low-level action spaces, whereas LLM agents require high-level interfaces, asynchronous interaction patterns, and tolerance for variable inference latencies.

We address this gap with OpenRA-RL, an open-source platform that enables LLM agents to play the classic RTS game Red Alert. Our platform provides: (1) a Gymnasium-style Python API compatible with standard RL tooling; (2) integration with the Model Context Protocol (MCP), exposing 50 game actions as tool calls for seamless LLM agent deployment; and (3) an asynchronous architecture that gracefully handles agents operating slower than real-time.

RTS games exercise capabilities with broad real-world relevance. The skills required—strategic planning under uncertainty, resource allocation, multi-objective optimization, and adaptive response to adversarial dynamics—closely mirror those demanded in domains such as supply chain management, autonomous logistics coordination, and multi-robot systems.

To our knowledge, OpenRA-RL is the first platform specifically designed to support LLM-based agents in RTS games. Our contributions are twofold: (1) we provide the infrastructure necessary for systematic evaluation of LLM agents in a complex, real-time strategic environment; and (2) we enable direct comparison between LLM-based approaches and traditional methods, facilitating research into the strengths and limitations of general-purpose reasoning for strategic decision-making.

02 Architecture & Engineering Design

OpenRA-RL is built upon a modular architecture designed to accommodate diverse agent paradigms while maintaining real-time game execution. At its core, the platform consists of three interconnected layers: a modified OpenRA game engine written in C# that executes the game logic at approximately 25 ticks per second; a gRPC bridge that exposes game state and accepts agent commands; and a Python wrapper providing a standardized Gymnasium-style interface via FastAPI.

This architecture decouples agent computation from game execution, allowing agents of varying speeds—from fast scripted bots to slower LLM-based reasoners—to interact with the same environment without disrupting game flow.

The platform supports three primary agent modalities. Scripted bots implement hand-crafted strategies through state machines, serving as baselines. Reinforcement learning agents consume the gRPC bridge through a Gymnasium-compatible adapter. Most notably, LLM agents interact through natural language reasoning, with the platform exposing 50 game actions as tools via the Model Context Protocol (MCP).

System Architecture of OpenRA-RL
Figure 1. System architecture of OpenRA-RL. The platform consists of three interconnected layers: LLM agents interact through the Model Context Protocol (MCP) server, which communicates with a Python backend via FastAPI, and the Python backend connects to the C# game engine through a gRPC bridge. This modular design decouples agent computation from game execution, supporting diverse agent paradigms from scripted bots to LLM-based reasoners.

2.1 Asynchronous Agent–Environment Decoupling

A fundamental challenge in applying LLM-based agents to real-time strategy games is the mismatch between game speed and agent inference time: the game engine ticks at approximately 25 Hz, while a single LLM decision may take over 2 seconds. To decouple these two timescales, the platform implements a dual-channel architecture using .NET System.Threading.Channels with bounded, non-blocking semantics.

Async Event Queue Design
Figure 2. Asynchronous event queue design with bounded, non-blocking channels. The observation channel (capacity=1, DropOldest) ensures agents always receive the most recent game state, while the action channel (capacity=16) buffers agent commands. This asymmetric design decouples game ticks from agent inference time, allowing both fast scripted bots (~40ms) and slow LLM agents (~2s) to interact with the same 25 Hz game engine.

Observation channel (game → agent). Each tick, the game engine serializes the current world state into a GameObservation protobuf message and writes it to a BoundedChannel<GameObservation> configured with capacity 1 and a DropOldest full-mode policy. The agent therefore always receives the most recent game snapshot, regardless of how many ticks elapsed during its thinking time.

Action channel (agent → game). When the agent completes a decision, it may issue multiple commands in a single batch. These are written to a BoundedChannel<AgentAction> with capacity 16, providing sufficient buffer for command batches without blocking the agent.

Non-blocking guarantee. Both channels use DropOldest semantics and all writes are non-blocking (TryWrite). The game thread never waits for the agent—if no action has arrived by the time a tick executes, the engine proceeds with a no-op. This ensures game progression is entirely independent of agent latency, a critical property for fair benchmarking.
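
The following minimal Python sketch illustrates these channel semantics, using collections.deque(maxlen=n), which evicts the oldest element when full, as a stand-in for .NET's BoundedChannelFullMode.DropOldest. The actual bridge implements this in C# with System.Threading.Channels; the helper names below are illustrative only.

from collections import deque

# Observation channel: capacity 1, so the agent only ever sees the
# newest snapshot. Action channel: capacity 16, buffering one batch
# of agent commands. deque(maxlen=n) drops the oldest item when full,
# mirroring DropOldest semantics.
obs_channel = deque(maxlen=1)
act_channel = deque(maxlen=16)

def on_game_tick(world_state):
    # Non-blocking write (TryWrite analogue): the game thread never waits.
    obs_channel.append(world_state)
    # If no action has arrived by this tick, proceed with a no-op.
    return act_channel.popleft() if act_channel else None

def on_agent_decision(commands):
    # The agent may issue several commands in a single batch.
    act_channel.extend(commands)
    # It then reads the most recent observation, however many ticks
    # elapsed during its thinking time.
    return obs_channel[-1] if obs_channel else None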

2.2 Multi-Session Architecture

Multi-Session Worker Pool Architecture
Figure 3. Multi-session worker pool architecture (v2). Up to 64 concurrent game sessions run within a single .NET process, sharing immutable ModData to eliminate redundant memory usage. Game ticks are processed by a dedicated worker thread pool separate from gRPC request handlers, preventing resource contention. Each session is protected by a per-session semaphore for safe parallel execution.
Legacy Architecture v1
Figure 4. Legacy architecture (v1) comparison. The original design spawned 64 separate .NET processes, each with its own JIT compilation and ModData loading. This consumed ~40 GB RAM and incurred 5–15 s reset latency. The v2 architecture provides ~40× speedup and ~7× memory reduction.

The initial design (v1) spawned a separate .NET process for each game, incurring repeated JIT compilation and mod data loading per instance. At 64 concurrent sessions, this consumed approximately 40 GB of memory and required 5–15 seconds per reset cycle.

The current design (v2) hosts up to 64 concurrent game sessions within a single .NET process. The key insight is that game static data—unit statistics, building attributes, tech trees, map rules—is immutable after initialization. This ModData is loaded once and shared across all sessions without locks, eliminating approximately 35 GB of redundant memory.
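
A minimal sketch of this sharing pattern follows; the names (load_mod_data, GameSession) are illustrative, not the platform's actual identifiers.

from functools import lru_cache

@lru_cache(maxsize=1)
def load_mod_data() -> dict:
    # Unit statistics, building attributes, tech trees, map rules:
    # immutable after initialization, hence safe to share lock-free.
    return {"unit_stats": {}, "buildings": {}, "tech_tree": {}, "map_rules": {}}

class GameSession:
    def __init__(self, session_id: int):
        self.session_id = session_id
        self.mod_data = load_mod_data()  # one shared copy for all sessions
        self.world_state = {}            # per-session mutable state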

Table 1. Multi-session architecture performance comparison.
Metric              | Legacy (v1) | Multi-Session (v2) | Improvement
Reset latency       | 5–15 s      | 256 ms             | ~40×
RSS (64 sessions)   | ~40 GB      | ~6 GB              | ~7×
JIT compilations    | 64×         | 1×                 | 64×
Active threads      | ~200        | ~20                | ~10×
Aggregate ticks/sec | ~8K         | ~15K               | ~2×

Game ticks are processed by a fixed-size worker pool with N threads equal to the CPU core count, backed by a BlockingCollection<WorkItem> queue. This pool is deliberately separate from the .NET ThreadPool used by gRPC request handlers—a critical design choice motivated by a failure observed during testing: when game tick tasks saturated the shared ThreadPool, gRPC handlers starved and could not accept new requests, causing 0 of 16 sessions to complete.
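
The structure of this design can be sketched in Python as follows; the real implementation is C# (BlockingCollection<WorkItem>), and the names below are illustrative.

import os
import queue
import threading

work_queue = queue.Queue()  # analogue of BlockingCollection<WorkItem>

def tick_worker():
    while True:
        session, lock = work_queue.get()
        try:
            with lock:            # per-session semaphore (see Figure 3)
                session.tick()    # advance this session by one game tick
        finally:
            work_queue.task_done()

# Fixed-size pool sized to the CPU core count, deliberately distinct
# from whatever pool serves RPC requests, so tick work can never
# starve the request handlers.
for _ in range(os.cpu_count() or 1):
    threading.Thread(target=tick_worker, daemon=True).start()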

2.3 Communication and Lifecycle Management

Game State Machine Lifecycle
Figure 5. Game state machine lifecycle. The environment transitions through eight states from IDLE to CLEANUP. The PLAYING state maintains a bidirectional gRPC stream for real-time observation and action exchange. Two error paths (TIMEOUT and CONN_LOST) ensure clean resource cleanup on failure.

Communication between the Python environment server and the C# game engine is mediated by gRPC, defined in a shared rl_bridge.proto schema. The service exposes two communication patterns. For the real-time game loop, a bidirectional streaming RPC (GameSession) is used. For discrete operations—creating a session, destroying a session, querying game state, or advancing a fixed number of ticks—the service provides unary RPCs that accept a session_id parameter to route requests to the correct game instance.

The environment lifecycle is managed as an explicit state machine with eight states: IDLE, LAUNCHING, LOADING, CONNECTING, STREAMING, PLAYING, GAME_OVER, and CLEANUP. Two explicit error transitions handle failure: a TIMEOUT path and a CONN_LOST path, both of which trigger an immediate abort and cleanup cycle.
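
The lifecycle can be summarized with the following Python sketch; the state names come from the platform, while the helper structure is illustrative.

from enum import Enum, auto

class EnvState(Enum):
    IDLE = auto()
    LAUNCHING = auto()
    LOADING = auto()
    CONNECTING = auto()
    STREAMING = auto()
    PLAYING = auto()
    GAME_OVER = auto()
    CLEANUP = auto()

# Nominal forward path through the eight states.
NOMINAL_PATH = [
    EnvState.IDLE, EnvState.LAUNCHING, EnvState.LOADING,
    EnvState.CONNECTING, EnvState.STREAMING, EnvState.PLAYING,
    EnvState.GAME_OVER, EnvState.CLEANUP,
]

def on_error(current: EnvState, error: str) -> EnvState:
    # Both error paths abort straight to an immediate cleanup cycle,
    # regardless of where in the lifecycle the failure occurred.
    assert error in ("TIMEOUT", "CONN_LOST")
    return EnvState.CLEANUP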

2.4 Replay and Observability

OpenRA-RL records every game session as a deterministic .orarep replay file. Each replay captures the complete sequence of orders and random seed, enabling exact tick-by-tick reproduction of the game via a ReplayConnection packet reader. Replay files also embed the Docker image version of the engine that produced them, ensuring playback fidelity even after engine upgrades.

Viewing is handled through a browser-based noVNC interface running inside the Docker container (openra-rl replay watch), requiring no local game installation or graphics drivers—making replay analysis accessible from headless cloud training instances.

2.5 Integration with OpenEnv

OpenRA-RL is built as a first-class environment for OpenEnv, the emerging PyTorch-native standard for reinforcement learning environment creation and interoperability. OpenEnv defines a minimal, typed environment contract—reset, step, and structured observation/action spaces—together with a distribution and discovery layer on the Hugging Face Hub.

Because OpenRA-RL is an OpenEnv environment, it is immediately consumable by the PyTorch-native post-training stack—TRL, torchforge, and Unsloth—without any environment-specific adapters on the trainer side. A researcher who wants to run GRPO, PPO, or any other policy-optimization algorithm against OpenRA-RL can do so by pointing their existing training script at the environment's OpenEnv identifier.

Most existing OpenEnv environments target narrow, short-horizon tasks: code execution, single-turn tool use, and small-scale games. OpenRA-RL extends OpenEnv's applicability to a long-horizon, adversarial, real-time regime with combinatorial action spaces and variable inference latencies.

03 Demonstration

We have developed a Python bot, comparable to the game's Normal AI, that can beat OpenRA's built-in Easy AI. The environment code is available on GitHub (OpenRA-RL) and deployed as a live Hugging Face Space; the training and challenge code is at GitHub (openra-rl-challenge). A replay can be watched on YouTube.

To exercise the platform end-to-end through the OpenEnv interface, we deploy Qwen3 32B served locally via Ollama as an LLM agent playing the Allied faction against the built-in Beginner AI on a 128×128 map. The agent receives structured game-state observations as tool responses and issues actions through the MCP tools. We run five episodes under two timing regimes: Games 1–2 use a 30-minute time limit, while Games 3–5 use a 5-minute limit.

3.1 End-of-Episode Scorecard

All five games ended without combat engagement (zero kills, zero casualties): the agent successfully establishes a base economy but does not translate that economy into offensive force before the time limit expires.

Table 2. End-of-episode scorecard for five Qwen3 32B episodes against the Beginner AI. †Games 1–2 used a 30-minute time limit; Games 3–5 used a 5-minute limit.
Game    | Duration | Ticks | Assets | Buildings | Army   | Explored | Tool calls
Game 1† | 30:23    | 1621  | $6,600 | 5         | $2,920 | 3.7%     | 62
Game 2† | 30:15    | 1477  | $4,000 | 3         | $2,340 | 2.7%     | 81
Game 3  | 5:01     | 540   | $2,800 | 3         | $640   | 2.7%     | 18
Game 4  | 5:19     | 509   | $2,300 | 2         | $540   | 2.2%     | 19
Game 5  | 5:17     | 621   | $2,800 | 3         | $740   | 2.7%     | 21

3.2 Multi-Dimensional Reward Profile

Eight-dimensional reward vector
Figure 6. Eight-dimensional reward vector across the five episodes. Left: per-dimension scores for all five games. Right: radar plot of Game 1's skill profile (capped at 1.0). The agent registers non-trivial scores on economy, infrastructure, and tempo, but zero on combat and disruption—exactly the kind of multi-objective asymmetry that a single win/loss scalar would collapse into a single data point.

Decomposing performance this way exposes an asymmetric agent: strong on economic and infrastructural sub-skills, absent on combat and disruption. This is the practical motivation for exposing reward as a vector rather than a scalar: it identifies which strategic dimensions need improvement rather than only that the game was lost.

3.3 Build-Order Timelines

Build-order timelines
Figure 7. Build-order timelines for the five episodes, reconstructed from each game's deterministic .orarep replay file. The x-axes are not shared: Games 1–2 span ~1500 ticks (30 minutes), Games 3–5 span ~600 ticks (5 minutes).

Across all five games, the agent constructs a Power Plant, Barracks, and (in some episodes) a Refinery; unit production, however, lags significantly, and no offensive units reach the field before the time limit expires in any episode.

3.4 Tool-Call Distribution

Tool call distribution
Figure 8. Distribution of MCP tool calls across all five episodes (n=201 total calls). Left: by category. Right: top ten individual tools. The dominance of advance (~57% of all calls) reflects the asynchronous architecture: a slow LLM reasoner explicitly compresses idle game time by skipping ticks, decoupling its ~2-second decision latency from the engine's 25 Hz tick rate.

The most frequent call by a wide margin is advance, which time-skips the game forward by a specified number of ticks—the primary mechanism by which an LLM agent operating at multi-second decision latencies can interact with a 25 Hz game engine without leaving the game idle.
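
A hypothetical driver loop showing this pattern appears below; llm_decide and call_tool are assumed helpers, not platform APIs, and only the tool names (get_faction_briefing, build_and_place, advance) come from the platform.

def run_episode(llm_decide, call_tool, max_turns: int = 100):
    obs = call_tool("get_faction_briefing")
    for _ in range(max_turns):
        decision = llm_decide(obs)              # ~2 s of LLM inference
        for name, args in decision.tool_calls:  # e.g. build_and_place(...)
            obs = call_tool(name, **args)
        # Compress idle game time: jump ahead to when the next decision
        # is likely needed (e.g., a building completing) rather than
        # observing the 25 Hz engine tick by tick.
        obs = call_tool("advance", ticks=decision.wait_ticks)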

3.5 Macro-Management Trends

Economy trends
Figure 9. Macro-management indicators across the five episodes. Left: total asset value at end-of-game. Centre: per-tick-normalized economy reward score. Right: percentage of map explored. The drop in raw asset value from Game 1 ($6,600) to Games 3–5 (~$2,500) is an artifact of the shortened time limit, not agent regression; the per-tick-normalized economy score remains within 0.58–0.80 across both regimes.

3.6 Sample Agent Reasoning Trace

The trace below reproduces the first ten turns of Game 5 verbatim from the structured run log, illustrating the three phases of an episode: an initial intelligence and planning phase, construction of the early-game economy, and explicit time-skipping via advance to bridge the gap between the agent's multi-second decision latency and the 25 Hz game tick rate.

Listing 1 — First 10 turns of Game 5 (Qwen3 32B), as recorded in the run log.
Turn  1 | get_faction_briefing()
Turn  2 | end_planning_phase(strategy="Deploy MCV, build Power
       Plant (300 ticks), then Barracks/War Factory (500/750),
       train E1 + 1tnk, attack AI base at (64,64).")
       -> tick=2,   cash=$5000, units=1, buildings=0
Turn  3 | deploy_unit(unit_id=120)
       -> tick=4,   cash=$5000, units=1, buildings=0
Turn  4 | build_and_place(building_type="apwr")
Turn  5 | advance(ticks=100)
       -> tick=55,  cash=$5000, units=0, buildings=1
Turn  6 | build_and_place(building_type="apwr")
Turn  7 | build_and_place(building_type="powr")
       -> tick=57,  cash=$5000, units=0, buildings=1
Turn  8 | advance(ticks=60)
       -> tick=108, cash=$4923, units=0, buildings=1
Turn  9 | advance(ticks=130)
       -> tick=159, cash=$4838, units=0, buildings=1
Turn 10 | advance(ticks=80)
       -> tick=210, cash=$4753, units=0, buildings=1

3.7 API Usage

The minimal Python code below instantiates an OpenRA-RL environment, resets it into a scenario, steps it with a structured action, and closes it to finalize the replay—following the standard OpenEnv reset/step/close contract.

Listing 2 — Minimal end-to-end example of driving an OpenRA-RL environment from Python.
from openra_env.config import load_config
from openra_env.server.openra_environment import OpenRAEnvironment
from openra_env.models import ActionType, CommandModel, OpenRAAction

# 1. Configure and instantiate the environment.
config = load_config(game={
    "grpc_port": 8000,
    "map_name": "tank-duel-basic",
    "headless": True,
})
env = OpenRAEnvironment(config=config)

# 2. Reset into a scenario; obs is a structured observation
#    (economy, military, unit/building lists, 9-channel spatial map).
obs = env.reset(seed=0)

# 3. Issue a structured action. OpenRAAction wraps one or more
#    CommandModel entries drawn from 21 ActionType values
#    (MOVE, ATTACK, BUILD, TRAIN, DEPLOY, ...).
action = OpenRAAction(commands=[
    CommandModel(action=ActionType.BUILD, item_type="powr"),
])
obs = env.step(action)

# 4. Close the environment; this finalizes the .orarep replay file.
env.close()

3.8 Key Takeaways

The environment is strategically deep. Five episodes of a frontier LLM (Qwen3 32B) playing the simplest built-in opponent produced zero combat engagement, five draws, and no units ever reaching the enemy base. The gap between current LLM agents and a tutorial-level opponent is exactly the kind of headroom a long-horizon RL research testbed needs.
The multi-dimensional reward vector localizes weakness. A scalar win/loss metric would collapse all five episodes into a single data point ("draw"). The eight-dimensional reward vector instead reveals a precise failure mode: the agent achieves 0.58–0.80 on economy and infrastructure but zero on combat and disruption.
The asynchronous architecture is load-bearing, not decorative. 57% of the agent's 201 tool calls are advance calls that explicitly time-skip the game engine. Without the observation-drop / action-bounded channel design, an LLM agent operating at ~2-second decision latencies could not meaningfully play a 25 Hz real-time strategy game at all.
Cross-episode reflection works, but is not enough. After each game, the agent generates a reflection and extracts lessons injected into the next episode's system prompt. By Episode 4, the agent's pre-game plan explicitly opens with a Power Plant—but in-context learning does not close the combat gap, suggesting OpenRA-RL is exactly the environment where weight-updating RL training should measurably matter.

04 Conclusion

We have presented OpenRA-RL, the first platform specifically designed to support LLM-based agents in real-time strategy games. Unlike prior RTS AI systems that relied on specialized architectures limited to a single game, OpenRA-RL provides general-purpose infrastructure built on the OpenEnv standard, enabling systematic research on long-horizon planning and strategic reasoning with diverse agent paradigms.

The platform addresses a critical infrastructure gap through three key contributions. First, a modular three-layer architecture decouples agent computation from game execution via a gRPC bridge, Gymnasium-style Python API, and Model Context Protocol integration that exposes 50 game actions as tool calls compatible with frontier LLM systems. Second, an asynchronous dual-channel design gracefully handles agents operating slower than real-time, with bounded observation and action buffers that prevent game progression from blocking on agent latency. Third, a multi-session architecture hosts 64 concurrent game sessions in a single process, reducing reset latency from 5–15 seconds to 256 ms and memory consumption from approximately 40 GB to approximately 6 GB.

Our demonstration with a Qwen3 32B agent validates both the platform's technical capabilities and its utility as a research testbed. The agent's performance—achieving 0.58–0.80 scores on economic management but zero on combat execution across five episodes—reveals that even frontier LLMs require substantial learning to master RTS games, confirming strategic depth that is neither trivially solved by prompt engineering nor reducible to short-horizon reasoning.

As a first-class OpenEnv environment distributed via the Hugging Face Hub, OpenRA-RL is immediately consumable by PyTorch-native training frameworks including TRL, torchforge, and Unsloth without environment-specific adapters. We release the platform as open-source software and invite the research community to build upon this foundation for advancing strategic reasoning in AI agents.