Real-time strategy (RTS) games have driven landmark AI achievements, yet prior systems like AlphaStar and OpenAI Five relied on highly specialized architectures and training pipelines that do not generalize beyond their target domains. Meanwhile, LLM-based agents have emerged as a promising paradigm for general-purpose reasoning, and there is growing interest in applying them to RTS games. However, no existing RTS platform supports their distinct requirements (high-level action interfaces, asynchronous interaction, and tolerance for variable inference latency), making current efforts difficult and ad hoc.
We introduce OpenRA-RL, an open-source platform enabling LLM-based agents to play the classic RTS game Red Alert. Our platform provides a Gymnasium-style Python API, integrates with the Model Context Protocol (MCP) to expose 50 game actions as tool calls, and implements an asynchronous architecture for agents operating slower than real time. To validate the platform, we run a Qwen3 32B agent for five episodes; the platform's eight-dimensional reward vector reveals specific strategic weaknesses: the agent scores 0.58–0.80 on economic management but zero on combat execution, showing that even frontier LLMs require substantial learning to master RTS games.
1 Introduction
Real-time strategy (RTS) games have long served as a compelling testbed for artificial intelligence research, demanding capabilities that span rapid decision-making, long-horizon planning, resource management, and adversarial reasoning under partial observability. Landmark efforts in this domain include DeepMind's AlphaStar for StarCraft II, OpenAI Five for Dota 2, and earlier work on StarCraft. These systems achieved impressive performance, demonstrating that machine learning can master complex strategic environments.
However, these prior approaches relied on highly specialized architectures and training pipelines that do not generalize beyond their target games. AlphaStar, for instance, required a bespoke neural network architecture, imitation learning from human replays, and distributed reinforcement learning across thousands of TPUs. Such methods yield domain-specific experts but offer little reusable infrastructure for other tasks.
Meanwhile, LLM-based agents have emerged as a promising paradigm for general-purpose reasoning, leveraging pretrained world knowledge, natural language reasoning, and high-level semantic actions to tackle diverse tasks, including web navigation, code generation, and tool use.
However, no existing RTS platform supports the distinct requirements of LLM-based agents: high-level action interfaces, asynchronous interaction, and tolerance for variable inference latency. We address this gap with OpenRA-RL, an open-source platform that enables LLM agents to play the classic RTS game Red Alert. Our platform provides: (1) a Gymnasium-style Python API compatible with standard RL tooling; (2) integration with the Model Context Protocol (MCP), exposing 50 game actions as tool calls for seamless LLM agent deployment; and (3) an asynchronous architecture that gracefully handles agents operating slower than real time.
RTS games exercise capabilities with broad real-world relevance. The skills required—strategic planning under uncertainty, resource allocation, multi-objective optimization, and adaptive response to adversarial dynamics—closely mirror those demanded in domains such as supply chain management, autonomous logistics coordination, and multi-robot systems.
To our knowledge, OpenRA-RL is the first platform specifically designed to support LLM-based agents in RTS games. Our contributions are twofold: (1) we provide the infrastructure necessary for systematic evaluation of LLM agents in a complex, real-time strategic environment; and (2) we enable direct comparison between LLM-based approaches and traditional methods, facilitating research into the strengths and limitations of general-purpose reasoning for strategic decision-making.
2 Architecture & Engineering Design
OpenRA-RL is built upon a modular architecture designed to accommodate diverse agent paradigms while maintaining real-time game execution. At its core, the platform consists of three interconnected layers: a modified OpenRA game engine written in C# that executes the game logic at approximately 25 ticks per second; a gRPC bridge that exposes game state and accepts agent commands; and a Python wrapper providing a standardized Gymnasium-style interface via FastAPI.
This architecture decouples agent computation from game execution, allowing agents of varying speeds—from fast scripted bots to slower LLM-based reasoners—to interact with the same environment without disrupting game flow.
The platform supports three primary agent modalities. Scripted bots implement hand-crafted strategies through state machines, serving as baselines. Reinforcement learning agents consume the gRPC bridge through a Gymnasium-compatible adapter. Most notably, LLM agents interact through natural language reasoning, with the platform exposing 50 game actions as tools via the Model Context Protocol (MCP).
2.1 Asynchronous Agent–Environment Decoupling
A fundamental challenge in applying LLM-based agents to real-time strategy games is the mismatch between game speed and agent inference time: the game engine ticks at approximately 25 Hz, while a single LLM decision may take over 2 seconds. To decouple these two timescales, the platform implements a dual-channel architecture using .NET System.Threading.Channels with bounded, non-blocking semantics.
Observation channel (game → agent). Each tick, the game engine serializes the current world state into a GameObservation protobuf message and writes it to a BoundedChannel<GameObservation> configured with capacity 1 and a DropOldest full-mode policy. The agent therefore always receives the most recent game snapshot, regardless of how many ticks elapsed during its thinking time.
Action channel (agent → game). When the agent completes a decision, it may issue multiple commands in a single batch. These are written to a BoundedChannel<AgentAction> with capacity 16, providing sufficient buffer for command batches without blocking the agent.
Non-blocking guarantee. Both channels use DropOldest semantics and all writes are non-blocking (TryWrite). The game thread never waits for the agent—if no action has arrived by the time a tick executes, the engine proceeds with a no-op. This ensures game progression is entirely independent of agent latency, a critical property for fair benchmarking.
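The engine implements these channels in C# with System.Threading.Channels; the Python sketch below mirrors the same drop-oldest, non-blocking semantics for illustration only (the class and method names are not the platform's API):

```python
import queue
import threading

class DropOldestChannel:
    """Bounded channel mirroring .NET's BoundedChannelFullMode.DropOldest."""

    def __init__(self, capacity: int):
        self._q = queue.Queue(maxsize=capacity)
        self._write_lock = threading.Lock()

    def try_write(self, item) -> bool:
        """Non-blocking write: evicts the oldest entry when full."""
        with self._write_lock:
            if self._q.full():
                try:
                    self._q.get_nowait()  # drop the stale snapshot
                except queue.Empty:
                    pass
            self._q.put_nowait(item)
        return True

    def try_read(self):
        """Non-blocking read: returns None when empty."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None

# Game -> agent: capacity 1, so the agent always reads the freshest state.
observations = DropOldestChannel(capacity=1)
# Agent -> game: capacity 16, buffering one batch of commands per decision.
actions = DropOldestChannel(capacity=16)

def on_tick(world_snapshot, apply_command):
    observations.try_write(world_snapshot)  # never blocks the game thread
    command = actions.try_read()
    if command is None:
        return                              # no-op tick: engine never waits
    apply_command(command)
```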
2.2 Multi-Session Architecture
The initial design (v1) spawned a separate .NET process for each game, incurring repeated JIT compilation and mod data loading per instance. At 64 concurrent sessions, this consumed approximately 40 GB of memory and required 5–15 seconds per reset cycle.
The current design (v2) hosts up to 64 concurrent game sessions within a single .NET process. The key insight is that game static data—unit statistics, building attributes, tech trees, map rules—is immutable after initialization. This ModData is loaded once and shared across all sessions without locks, eliminating approximately 35 GB of redundant memory.
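The sharing pattern itself is compact; the sketch below is an illustrative Python analog (the engine is C#, and none of these names are the platform's API):

```python
from dataclasses import dataclass
from functools import lru_cache

@dataclass(frozen=True)
class ModData:
    """Immutable static data: unit stats, building attributes, tech trees."""
    rules: tuple = ()

@lru_cache(maxsize=1)
def load_mod_data() -> ModData:
    # Expensive one-time parse of mod rules: runs once per process in v2,
    # but once per game process under the legacy v1 design.
    return ModData()

class GameSession:
    def __init__(self, session_id: str):
        self.session_id = session_id
        self.mod_data = load_mod_data()  # shared read-only reference, no locks
        self.tick = 0                    # mutable state stays per-session
```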
| Metric | Legacy (v1) | Multi-Session (v2) | Improvement |
|---|---|---|---|
| Reset latency | 5–15 s | 256 ms | ~40× |
| RSS (64 sessions) | ~40 GB | ~6 GB | ~7× |
| JIT compilations | 64× | 1× | 64× |
| Active threads | ~200 | ~20 | ~10× |
| Aggregate ticks/sec | ~8K | ~15K | ~2× |
Game ticks are processed by a fixed-size worker pool with N threads equal to the CPU core count, backed by a BlockingCollection<WorkItem> queue. This pool is deliberately separate from the .NET ThreadPool used by gRPC request handlers—a critical design choice motivated by a failure observed during testing: when game tick tasks saturated the shared ThreadPool, gRPC handlers starved and could not accept new requests, causing 0 of 16 sessions to complete.
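In Python terms the separation looks roughly as follows; this is a sketch of the pattern, not the engine's C# implementation (which is backed by BlockingCollection<WorkItem>):

```python
import os
import queue
import threading

tick_queue: queue.Queue = queue.Queue()  # analog of the WorkItem queue

def tick_worker() -> None:
    while True:
        work_item = tick_queue.get()   # blocks until a session needs a tick
        if work_item is None:          # shutdown sentinel
            break
        work_item()                    # run one game tick
        tick_queue.task_done()

# A fixed-size pool sized to the CPU core count, deliberately distinct from
# the pool serving gRPC handlers so tick load cannot starve RPC acceptance.
pool = [threading.Thread(target=tick_worker, daemon=True)
        for _ in range(os.cpu_count() or 1)]
for t in pool:
    t.start()
```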
2.3 Communication and Lifecycle Management
Communication between the Python environment server and the C# game engine is mediated by gRPC, defined in a shared rl_bridge.proto schema. The service exposes two communication patterns. For the real-time game loop, a bidirectional streaming RPC (GameSession) is used. For discrete operations—creating a session, destroying a session, querying game state, or advancing a fixed number of ticks—the service provides unary RPCs that accept a session_id parameter to route requests to the correct game instance.
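A hedged sketch of the unary path from Python, assuming stubs generated from rl_bridge.proto: apart from GameSession and the session_id parameter, the service, RPC, and message names below are assumptions for illustration.

```python
import grpc

# Modules generated by protoc from rl_bridge.proto (module names assumed).
import rl_bridge_pb2 as pb
import rl_bridge_pb2_grpc as pb_grpc

channel = grpc.insecure_channel("localhost:8000")
stub = pb_grpc.RLBridgeStub(channel)  # stub class name assumed

# Every unary RPC carries a session_id so the server can route the request
# to the correct game instance inside the multi-session process.
created = stub.CreateSession(pb.CreateSessionRequest(map_name="tank-duel-basic"))
state = stub.GetGameState(pb.GameStateRequest(session_id=created.session_id))
stub.AdvanceTicks(pb.AdvanceTicksRequest(session_id=created.session_id, ticks=100))
stub.DestroySession(pb.DestroySessionRequest(session_id=created.session_id))
```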
The environment lifecycle is managed as an explicit state machine with eight states: IDLE, LAUNCHING, LOADING, CONNECTING, STREAMING, PLAYING, GAME_OVER, and CLEANUP. Two explicit error transitions handle failure: a TIMEOUT path and a CONN_LOST path, both of which trigger an immediate abort and cleanup cycle.
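The lifecycle can be summarized as a transition table; in the Python sketch below the state names are taken from the text, while the exact edge set is an assumption inferred from the description above.

```python
from enum import Enum, auto

class EnvState(Enum):
    IDLE = auto()
    LAUNCHING = auto()
    LOADING = auto()
    CONNECTING = auto()
    STREAMING = auto()
    PLAYING = auto()
    GAME_OVER = auto()
    CLEANUP = auto()

# Nominal forward path; the TIMEOUT and CONN_LOST error transitions both
# short-circuit to CLEANUP, which aborts the session and returns to IDLE.
TRANSITIONS = {
    EnvState.IDLE:       {EnvState.LAUNCHING},
    EnvState.LAUNCHING:  {EnvState.LOADING, EnvState.CLEANUP},    # TIMEOUT
    EnvState.LOADING:    {EnvState.CONNECTING, EnvState.CLEANUP}, # TIMEOUT
    EnvState.CONNECTING: {EnvState.STREAMING, EnvState.CLEANUP},  # TIMEOUT
    EnvState.STREAMING:  {EnvState.PLAYING, EnvState.CLEANUP},    # CONN_LOST
    EnvState.PLAYING:    {EnvState.GAME_OVER, EnvState.CLEANUP},  # CONN_LOST
    EnvState.GAME_OVER:  {EnvState.CLEANUP},
    EnvState.CLEANUP:    {EnvState.IDLE},
}

def transition(current: EnvState, target: EnvState) -> EnvState:
    assert target in TRANSITIONS[current], f"illegal {current} -> {target}"
    return target
```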
2.4 Replay and Observability
OpenRA-RL records every game session as a deterministic .orarep replay file. Each replay captures the complete sequence of orders and random seed, enabling exact tick-by-tick reproduction of the game via a ReplayConnection packet reader. Replay files also embed the Docker image version of the engine that produced them, ensuring playback fidelity even after engine upgrades.
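The determinism claim reduces to a simple invariant: the same seed plus the same ordered command log reproduces the same trajectory. The toy Python sketch below illustrates that invariant only; it is not the .orarep format.

```python
import random

def run(seed: int, orders: list[tuple[str, int]]) -> dict[str, int]:
    rng = random.Random(seed)             # the seed embedded in the replay
    positions: dict[str, int] = {}
    for unit, dx in orders:               # the recorded order sequence
        jitter = rng.choice([-1, 0, 1])   # stand-in for in-engine randomness
        positions[unit] = positions.get(unit, 0) + dx + jitter
    return positions

orders = [("e1", 2), ("e1", 3), ("mcv", 1)]
assert run(42, orders) == run(42, orders)  # exact tick-by-tick reproduction
```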
Viewing is handled through a browser-based noVNC interface running inside the Docker container (openra-rl replay watch), requiring no local game installation or graphics drivers—making replay analysis accessible from headless cloud training instances.
2.5 Integration with OpenEnv
OpenRA-RL is built as a first-class environment for OpenEnv, the emerging PyTorch-native standard for reinforcement learning environment creation and interoperability. OpenEnv defines a minimal, typed environment contract—reset, step, and structured observation/action spaces—together with a distribution and discovery layer on the Hugging Face Hub.
Because OpenRA-RL is an OpenEnv environment, it is immediately consumable by the PyTorch-native post-training stack—TRL, torchforge, and Unsloth—without any environment-specific adapters on the trainer side. A researcher who wants to run GRPO, PPO, or any other policy-optimization algorithm against OpenRA-RL can do so by pointing their existing training script at the environment's OpenEnv identifier.
Most existing OpenEnv environments target narrow, short-horizon tasks: code execution, single-turn tool use, and small-scale games. OpenRA-RL extends OpenEnv's applicability to a long-horizon, adversarial, real-time regime with combinatorial action spaces and variable inference latencies.
3 Demonstration
We have developed a Normal-difficulty Python AI bot that can beat the OpenRA Easy AI bot. The environment code is available on GitHub (OpenRA-RL) and is also deployed as a live Hugging Face Space; the training/challenge code is on GitHub (openra-rl-challenge). A replay can be watched on YouTube.
To exercise the platform end-to-end through the OpenEnv interface, we deploy Qwen3 32B served locally via Ollama as an LLM agent playing the Allied faction against the built-in Beginner AI on a 128×128 map. The agent receives structured game-state observations as tool responses and issues actions through the MCP tools. We run five episodes under two timing regimes: Games 1–2 use a 30-minute time limit, while Games 3–5 use a 5-minute limit.
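One decision cycle has the shape sketched below; call_tool stands in for an MCP client invocation (for example, ClientSession.call_tool in the reference MCP Python SDK), the llm object is a placeholder, and the tool names are those that appear in the trace of Section 3.6.

```python
def decision_cycle(llm, call_tool):
    # 1. Read the freshest game snapshot as a structured tool response.
    briefing = call_tool("get_faction_briefing", {})

    # 2. One LLM inference (on the order of seconds) picks a command batch,
    #    e.g. [("build_and_place", {"building_type": "powr"})].
    plan = llm.decide(briefing)

    # 3. Issue the batch through MCP tools.
    for tool_name, arguments in plan:
        call_tool(tool_name, arguments)

    # 4. Explicitly compress idle game time until the next decision.
    call_tool("advance", {"ticks": 100})
```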
3.1 End-of-Episode Scorecard
All five games ended without combat engagement (zero kills, zero casualties): the agent successfully established a base economy in each episode but never translated that economy into offensive force before the time limit expired.
| Game | Duration | Ticks | Assets | Buildings | Army value | Explored | Tool calls |
|---|---|---|---|---|---|---|---|
| Game 1† | 30:23 | 1621 | $6,600 | 5 | $2,920 | 3.7% | 62 |
| Game 2† | 30:15 | 1477 | $4,000 | 3 | $2,340 | 2.7% | 81 |
| Game 3 | 5:01 | 540 | $2,800 | 3 | $640 | 2.7% | 18 |
| Game 4 | 5:19 | 509 | $2,300 | 2 | $540 | 2.2% | 19 |
| Game 5 | 5:17 | 621 | $2,800 | 3 | $740 | 2.7% | 21 |

† Games 1–2 ran under the 30-minute time limit; Games 3–5 under the 5-minute limit.
3.2 Multi-Dimensional Reward Profile
Decomposing performance across the eight reward dimensions exposes an asymmetric agent: strong on economic and infrastructural sub-skills, absent on combat and disruption. This is the practical motivation for exposing reward as a vector rather than a scalar: the vector identifies which strategic dimensions need improvement, rather than reporting only that the game was lost.
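As a purely illustrative sketch of the difference (the dimension names and values below are examples consistent with the scores reported in this paper, not the platform's exact schema):

```python
# Vector reward preserves the per-dimension diagnosis that a scalar destroys.
reward_vector = {
    "economy": 0.80,  # strong: base economy established (illustrative value)
    "combat":  0.00,  # absent: no offensive force fielded
    # ... six further dimensions omitted in this sketch
}
scalar = sum(reward_vector.values()) / len(reward_vector)
# Averaging the two shown gives 0.40, i.e. "mediocre"; the vector instead
# says exactly which strategic dimension needs improvement.
```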
3.3 Build-Order Timelines
Across all five games, the agent constructs a Power Plant, Barracks, and (in some episodes) a Refinery; unit production, however, lags significantly, and no offensive units reach the field before the time limit expires in any episode.
3.4 Tool-Call Distribution
The most frequent call by a wide margin is advance (~57% of all calls), which time-skips the game forward by a specified number of ticks. This is the primary mechanism by which an LLM agent operating at multi-second decision latencies can interact with a 25 Hz game engine without leaving the game idle: a slow reasoner explicitly compresses idle game time by skipping ticks, decoupling its ~2-second decision latency from the engine's tick rate.
3.5 Macro-Management Trends

(Figure: macro-management trends; not reproduced here.)
3.6 Sample Agent Reasoning Trace
The trace below reproduces the first ten turns of Game 5 verbatim from the structured run log, illustrating the three phases of an episode: an initial intelligence and planning phase, construction of the early-game economy, and explicit time-skipping via advance to bridge the gap between the agent's multi-second decision latency and the 25 Hz game tick rate.
```
Turn 1  | get_faction_briefing()
Turn 2  | end_planning_phase(strategy="Deploy MCV, build Power
          Plant (300 ticks), then Barracks/War Factory (500/750),
          train E1 + 1tnk, attack AI base at (64,64).")
        -> tick=2, cash=$5000, units=1, buildings=0
Turn 3  | deploy_unit(unit_id=120)
        -> tick=4, cash=$5000, units=1, buildings=0
Turn 4  | build_and_place(building_type="apwr")
Turn 5  | advance(ticks=100)
        -> tick=55, cash=$5000, units=0, buildings=1
Turn 6  | build_and_place(building_type="apwr")
Turn 7  | build_and_place(building_type="powr")
        -> tick=57, cash=$5000, units=0, buildings=1
Turn 8  | advance(ticks=60)
        -> tick=108, cash=$4923, units=0, buildings=1
Turn 9  | advance(ticks=130)
        -> tick=159, cash=$4838, units=0, buildings=1
Turn 10 | advance(ticks=80)
        -> tick=210, cash=$4753, units=0, buildings=1
```
3.7 API Usage
The minimal Python code below instantiates an OpenRA-RL environment, resets it into a scenario, steps it with a structured action, and closes it to finalize the replay—following the standard OpenEnv reset/step/close contract.
```python
from openra_env.config import load_config
from openra_env.server.openra_environment import OpenRAEnvironment
from openra_env.models import ActionType, CommandModel, OpenRAAction

# 1. Configure and instantiate the environment.
config = load_config(game={
    "grpc_port": 8000,
    "map_name": "tank-duel-basic",
    "headless": True,
})
env = OpenRAEnvironment(config=config)

# 2. Reset into a scenario; obs is a structured observation
#    (economy, military, unit/building lists, 9-channel spatial map).
obs = env.reset(seed=0)

# 3. Issue a structured action. OpenRAAction wraps one or more
#    CommandModel entries drawn from 21 ActionType values
#    (MOVE, ATTACK, BUILD, TRAIN, DEPLOY, ...).
action = OpenRAAction(commands=[
    CommandModel(action=ActionType.BUILD, item_type="powr"),
])
obs = env.step(action)

# 4. Close the environment; this finalizes the .orarep replay file.
env.close()
```
3.8 Key Takeaways
The agent's turn budget is dominated by advance calls that explicitly time-skip the game engine (Section 3.4). Without the observation-drop / action-bounded channel design, an LLM agent operating at ~2-second decision latencies could not meaningfully play a 25 Hz real-time strategy game at all.
4 Conclusion
We have presented OpenRA-RL, the first platform specifically designed to support LLM-based agents in real-time strategy games. Unlike prior RTS AI systems that relied on specialized architectures limited to a single game, OpenRA-RL provides general-purpose infrastructure built on the OpenEnv standard, enabling systematic research on long-horizon planning and strategic reasoning with diverse agent paradigms.
The platform addresses a critical infrastructure gap through three key contributions. First, a modular three-layer architecture decouples agent computation from game execution via a gRPC bridge, Gymnasium-style Python API, and Model Context Protocol integration that exposes 50 game actions as tool calls compatible with frontier LLM systems. Second, an asynchronous dual-channel design gracefully handles agents operating slower than real time, with bounded observation and action buffers that prevent game progression from blocking on agent latency. Third, a multi-session architecture hosts 64 concurrent game sessions in a single process, reducing reset latency from 5–15 seconds to 256 ms and memory consumption from approximately 40 GB to approximately 6 GB.
Our demonstration with a Qwen3 32B agent validates both the platform's technical capabilities and its utility as a research testbed. The agent's performance—achieving 0.58–0.80 scores on economic management but zero on combat execution across five episodes—reveals that even frontier LLMs require substantial learning to master RTS games, confirming strategic depth that is neither trivially solved by prompt engineering nor reducible to short-horizon reasoning.
As a first-class OpenEnv environment distributed via the Hugging Face Hub, OpenRA-RL is immediately consumable by PyTorch-native training frameworks including TRL, torchforge, and Unsloth without environment-specific adapters. We release the platform as open-source software and invite the research community to build upon this foundation for advancing strategic reasoning in AI agents.