OpenAI Five

Our team of five neural networks, OpenAI Five, has started to defeat amateur human teams at Dota 2. While today we play with restrictions, we aim to beat a team of top professionals at The International in August subject only to a limited set of heroes. We may not succeed: Dota 2 is one of the most popular and complex esports games in the world, with creative and motivated professionals who train year-round to earn part of Dota’s annual $40M prize pool (the largest of any esports game).

OpenAI Five plays 180 years’ worth of games against itself every day, learning via self-play. It trains using a scaled-up version of Proximal Policy Optimization running on 256 GPUs and 128,000 CPU cores, a larger-scale version of the system we built to play the much simpler solo variant of the game last year. Using a separate LSTM for each hero and no human data, it learns recognizable strategies. This indicates that reinforcement learning can yield long-term planning with large but achievable scale and without fundamental advances, contrary to our own expectations upon starting the project.

To benchmark our progress, we’ll host a match versus top players on August 5th. Follow us on Twitch to view the live broadcast, or request an invite to attend in person!

One AI milestone is to exceed human capabilities in a complex video game like StarCraft or Dota. Relative to previous AI milestones like Chess or Go, complex video games start to capture the messiness and continuous nature of the real world. The hope is that systems which solve complex video games will be highly general, with applications outside of games.

Dota 2 is a real-time strategy game played between two teams of five players, with each player controlling a character called a “hero”. A Dota-playing AI must master the following:

The Dota rules are also very complex — the game has been actively developed for over a decade, with game logic implemented in hundreds of thousands of lines of code. This logic takes milliseconds per tick to execute, versus nanoseconds for Chess or Go engines. The game also gets an update about once every two weeks, constantly changing the environment semantics.

Our system learns using a massively scaled version of Proximal Policy Optimization. Both OpenAI Five and our earlier 1v1 bot learn entirely from self-play. They start with random parameters and do not use search or bootstrap from human replays.

| | OpenAI 1v1 bot | OpenAI Five |
| --- | --- | --- |
| CPUs | 60,000 CPU cores on Azure | 128,000 preemptible CPU cores on GCP |
| GPUs | 256 K80 GPUs on Azure | 256 P100 GPUs on GCP |
| Experience collected | ~300 years per day | ~180 years per day (~900 years per day counting each hero separately) |
| Size of observation | ~3.3 kB | ~36.8 kB |
| Observations per second of gameplay | 10 | 7.5 |
| Batch size | 8,388,608 observations | 1,048,576 observations |
| Batches per minute | ~20 | ~60 |

RL researchers (including ourselves) have generally believed that long time horizons would require fundamentally new advances, such as hierarchical reinforcement learning. Our results suggest that we haven’t been giving today’s algorithms enough credit, at least when they’re run at sufficient scale and with a reasonable way of exploring.

Our agent is trained to maximize the sum of future rewards, weighted by an exponential decay factor called `γ`. During the latest training run of OpenAI Five, we annealed `γ` from `0.998` (valuing future rewards with a half-life of 46 seconds) to `0.9997` (valuing future rewards with a half-life of five minutes). For comparison, the longest horizon in the PPO paper was a half-life of 0.5 seconds, the longest in the Rainbow paper was a half-life of 4.4 seconds, and the Observe and Look Further paper used a half-life of 46 seconds.
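The half-lives quoted above can be reproduced with a short calculation. The sketch below assumes one discount step per observation and uses the 7.5 observations per second of gameplay listed for OpenAI Five earlier in the post; the function names are illustrative and not part of our training code.

```python
import math

# Assumption: the discount is applied once per observation, at the
# 7.5 observations per second of gameplay reported for OpenAI Five.
OBSERVATIONS_PER_SECOND = 7.5


def half_life_seconds(gamma: float, steps_per_second: float = OBSERVATIONS_PER_SECOND) -> float:
    """Seconds of gameplay after which a future reward is weighted by 0.5."""
    steps = math.log(0.5) / math.log(gamma)  # solve gamma ** steps == 0.5
    return steps / steps_per_second


def gamma_for_half_life(seconds: float, steps_per_second: float = OBSERVATIONS_PER_SECOND) -> float:
    """Inverse mapping: the per-step discount that gives the requested half-life."""
    return 0.5 ** (1.0 / (seconds * steps_per_second))


if __name__ == "__main__":
    for gamma in (0.998, 0.9997):
        print(f"gamma={gamma}: half-life of about {half_life_seconds(gamma):.0f} s")
```

Under these assumptions `0.998` comes out at roughly 46 seconds and `0.9997` at roughly 308 seconds, which is about five minutes.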

While the current version of OpenAI Five is weak at last-hitting (observing our test matches, the professional Dota commentator Blitz estimated it around median for Dota players), its objective prioritization matches a common professional strategy. Gaining long-term rewards such as strategic map control often requires sacrificing short-term rewards such as gold gained from farming, since grouping up to attack towers takes time. This observation reinforces our belief that the system is truly optimizing over a long horizon.

Each of OpenAI Five’s networks contains a single-layer, 1024-unit LSTM that sees the current game state (extracted from Valve’s Bot API) and emits actions through several possible action heads. Each head has semantic meaning, for example, the number of ticks to delay this action, which action to select, the X or Y coordinate of this action in a grid around the unit, etc. Action heads are computed independently.
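As a rough illustration of what independently computed heads mean at sampling time, the sketch below draws one index per head from its own softmax distribution, without conditioning on the other heads. The head names, head sizes, and logits are hypothetical placeholders; the real network's heads and their inputs are not specified here.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(head_logits, rng):
    """Sample one index per head; each head is sampled independently of the rest."""
    action = {}
    for name, logits in head_logits.items():
        probs = softmax(logits)
        action[name] = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action


if __name__ == "__main__":
    # Hypothetical heads loosely following the examples in the text.
    heads = {
        "delay_ticks": [0.0, 0.0, 0.0, 0.0],
        "action_type": [2.0, 0.5, 0.1],
        "offset_x": [0.0] * 9,
        "offset_y": [0.0] * 9,
    }
    print(sample_action(heads, random.Random(0)))
```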

_Interactive demonstration of the observation space and action space used by OpenAI Five. OpenAI Five views the world as a list of 20,000 numbers, and takes an action by emitting a list of 8 enumeration values. Select different actions and targets to understand how OpenAI Five encodes each action, and how it observes the world. The image shows the scene as a human would see it._


OpenAI Five can react to missing pieces of state that correlate with what it does see. For example, until recently OpenAI Five’s observations did not include shrapnel zones (areas where projectiles rain down on enemies), which humans see on screen. However, we observed OpenAI Five learning to walk out of (though not avoid entering) active shrapnel zones, since it could see its health decreasing.

Given a learning algorithm capable of handling long horizons, we still need to explore the environment. Even with our restrictions, there are hundreds of items, dozens of buildings, spells, and unit types, and a long tail of game mechanics to learn about, many of which yield powerful combinations. It’s not easy to explore this combinatorially vast space efficiently.

OpenAI Five learns from self-play (starting from random weights), which provides a natural curriculum for exploring the environment. To avoid “strategy collapse”, the agent trains 80% of its games against itself and the other 20% against its past selves. In the first games, the heroes walk aimlessly around the map. After several hours of training, concepts such as laning, farming, or fighting over mid emerge. After several days, they consistently adopt basic human strategies: attempt to steal Bounty runes from their opponents, walk to their tier one towers to farm, and rotate heroes around the map to gain lane advantage. And with further training, they become proficient at high-level strategies like 5-hero push.
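The 80/20 opponent split can be sketched as a small sampling routine. This is a minimal illustration: the uniform choice among past versions and all names here are assumptions, since the text only states the overall proportions.

```python
import random


def pick_opponent(current, past_versions, rng, self_play_fraction=0.8):
    """Return the current agent with probability 0.8, otherwise a past version.

    Assumption: past versions are sampled uniformly at random.
    """
    if not past_versions or rng.random() < self_play_fraction:
        return current
    return rng.choice(past_versions)


if __name__ == "__main__":
    rng = random.Random(0)
    past = ["checkpoint-1", "checkpoint-2", "checkpoint-3"]  # hypothetical names
    picks = [pick_opponent("latest", past, rng) for _ in range(10_000)]
    print("share of games against itself:", picks.count("latest") / len(picks))
```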

In March 2017, our first agent defeated bots but got confused against humans. To force exploration in strategy space, during training (and only during training) we randomized the properties (health, speed, start level, etc.) of the units, and it began beating humans. Later on, when a test player was consistently beating our 1v1 bot, we increased our training randomizations and the test player started to lose. (Our robotics team concurrently applied similar randomization techniques to physical robots to transfer from simulation to the real world.)
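A training-only randomization of unit properties might look like the sketch below. The property set follows the examples in the text (health, speed, start level), but the ranges, the `UnitProperties` type, and the function name are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class UnitProperties:
    health: float
    speed: float
    start_level: int


def randomize(base, rng, training, scale=0.2, max_level_bonus=2):
    """Perturb unit properties during training; leave them untouched otherwise.

    Assumption: multiplicative noise of +/- `scale` on health and speed, and a
    small random bonus on the starting level. The actual ranges are not given.
    """
    if not training:
        return base
    return replace(
        base,
        health=base.health * rng.uniform(1 - scale, 1 + scale),
        speed=base.speed * rng.uniform(1 - scale, 1 + scale),
        start_level=base.start_level + rng.randint(0, max_level_bonus),
    )


if __name__ == "__main__":
    hero = UnitProperties(health=600.0, speed=300.0, start_level=1)
    print(randomize(hero, random.Random(0), training=True))
    print(randomize(hero, random.Random(0), training=False))
```

Widening `scale` corresponds to the "increased our training randomizations" step described above.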

OpenAI Five uses the randomizations we wrote for our 1v1 bot. It also uses a new “lane assignment” one. At the beginning of each training game, we randomly “assign” each hero to some subset of lanes and penalize it for straying from those lanes until a randomly chosen time in the game.

Exploration is also helped by a good reward. Our reward consists mostly of metrics humans track to decide how they’re doing in the game: net worth, kills, deaths, assists, last hits, and the like. We postprocess each agent’s reward by subtracting the other team’s average reward to prevent the agents from finding positive-sum situations.
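The postprocessing step can be sketched as follows, assuming per-hero scalar rewards for a single timestep. The function name and example numbers are illustrative only. With equally sized teams, the adjusted rewards sum to zero across both teams, which is what rules out positive-sum outcomes.

```python
def subtract_opponent_mean(team_rewards, enemy_rewards):
    """Subtract the other team's average reward from each hero's reward."""
    team_mean = sum(team_rewards) / len(team_rewards)
    enemy_mean = sum(enemy_rewards) / len(enemy_rewards)
    adjusted_team = [r - enemy_mean for r in team_rewards]
    adjusted_enemy = [r - team_mean for r in enemy_rewards]
    return adjusted_team, adjusted_enemy


if __name__ == "__main__":
    radiant = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical per-hero rewards
    dire = [0.0, 1.0, 1.0, 1.0, 2.0]
    adjusted_radiant, adjusted_dire = subtract_opponent_mean(radiant, dire)
    print(adjusted_radiant, adjusted_dire)
    print("total:", sum(adjusted_radiant) + sum(adjusted_dire))
```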

We hardcode item and skill builds (originally written for our scripted baseline), and choose which of the builds to use at random. Courier management is also imported from the scripted baseline.

OpenAI Five does not contain an explicit communication channel between the heroes’ neural networks. Teamwork is controlled by a hyperparameter we dubbed “team spirit”. Team spirit ranges from 0 to 1, putting a weight on how much each of OpenAI Five’s heroes should care about its individual reward function versus the average of the team’s reward functions. We anneal its value from 0 to 1 over training.
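One way to read team spirit is as a linear interpolation between a hero's own reward and the team average. The sketch below assumes exactly that form, which the text does not state explicitly, and the names are illustrative.

```python
def apply_team_spirit(rewards, team_spirit):
    """Blend each hero's reward with the team mean.

    Assumption: linear interpolation, so team_spirit = 0 keeps individual
    rewards and team_spirit = 1 gives every hero the team average.
    """
    if not 0.0 <= team_spirit <= 1.0:
        raise ValueError("team_spirit must lie in [0, 1]")
    team_mean = sum(rewards) / len(rewards)
    return [(1.0 - team_spirit) * r + team_spirit * team_mean for r in rewards]


if __name__ == "__main__":
    rewards = [1.0, 3.0]  # hypothetical per-hero rewards
    for spirit in (0.0, 0.5, 1.0):
        print(spirit, apply_team_spirit(rewards, spirit))
```

Under this form the team's total reward is unchanged by the blend; only its distribution across heroes shifts as the value is annealed toward 1.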

Our system is implemented as a general-purpose RL training system called Rapid, which can be applied to any Gym environment. We’ve used Rapid to solve other problems at OpenAI, including Competitive Self-Play.

The training system is separated into _rollout_ workers, which run a copy of the game and an agent gathering experience, and _optimizer_ nodes, which perform synchronous gradient descent across a fleet of GPUs. The rollout workers sync their experience through Redis to the optimizers. Each experiment also contains workers evaluating the trained agent versus reference agents, as well as monitoring software such as TensorBoard, Sentry, and Grafana.

During synchronous gradient descent, each GPU computes a gradient on its part of the batch, and then the gradients are globally averaged. We originally used MPI’s allreduce for averaging, but now use our own NCCL2 wrappers that parallelize GPU computations and network data transfer. The latencies for synchronizing 58MB of data (size of OpenAI Five’s parameters) across different numbers of GPUs are shown on the right. The latency is low enough to be largely masked by GPU computation which runs in parallel with it.
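The averaging step can be illustrated with a toy, single-process simulation. This sketch uses a scalar linear model and plain Python lists in place of GPUs; `allreduce_mean` is a stand-in for a real allreduce and does not reflect the MPI or NCCL2 APIs.

```python
def shard_gradient(weight, shard):
    """Gradient of the mean of 0.5 * (weight * x - y) ** 2 over one worker's shard."""
    return sum((weight * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values):
    """Stand-in for allreduce: every worker receives the global average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def synchronous_step(weight, shards, learning_rate):
    """One synchronous update: local gradients, global average, identical step."""
    local_gradients = [shard_gradient(weight, shard) for shard in shards]
    averaged = allreduce_mean(local_gradients)
    # Every worker applies the same averaged gradient, so replicas stay in sync.
    return weight - learning_rate * averaged[0]


if __name__ == "__main__":
    batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2x
    shards = [batch[:2], batch[2:]]  # two simulated workers
    weight = 0.0
    for _ in range(50):
        weight = synchronous_step(weight, shards, learning_rate=0.1)
    print("learned weight:", weight)
```

With equally sized shards, the average of the per-worker gradients equals the gradient over the full batch, which is why the split across GPUs does not change the update.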

We’ve implemented Kubernetes, Azure, and GCP backends for Rapid.

Thus far OpenAI Five has played (with our restrictions) versus each of these teams:

1. Best OpenAI employee team: 2.5k MMR (46th percentile).
2. Best audience players watching OpenAI employee match (including Blitz, who commentated the first OpenAI employee match): 4–6k MMR (90th-99th percentile), though they’d never played as a team.
3. Valve employee team: 2.5–4k MMR (46th-90th percentile).
4. Amateur team: 4.2k MMR (93rd percentile), trains as a team.
5. Semi-pro team: 5.5k MMR (99th percentile), trains as a team.

The April 23rd version of OpenAI Five was the first to beat our scripted baseline. The May 15th version of OpenAI Five was evenly matched versus team 1, winning one game and losing another. The June 6th version of OpenAI Five decisively won all its games versus teams 1–3. We set up informal scrims with teams 4 & 5, expecting to lose soundly, but OpenAI Five won two of its first three games versus both.

> “The teamwork aspect of the bot was just overwhelming. It feels like five selfless players that know a good general strategy.”

We observed that OpenAI Five:

Trophies awarded after the match between the best players at OpenAI and our bot team. One trophy for the humans, one trophy for the bots (represented by Susan Zhang from our team!)

## Differences versus humans

OpenAI Five is given access to the same information as humans, but instantly sees data like positions, healths, and item inventories that humans have to check manually. Our method isn’t fundamentally tied to observing state, but just rendering pixels from the game would require thousands of GPUs.

OpenAI Five averages around 150-170 actions per minute (and has a theoretical maximum of 450 due to observing every 4th frame). Frame-perfect timing, while possible for skilled players, is trivial for OpenAI Five. OpenAI Five has an average reaction time of 80ms, which is faster than humans.
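The 450 figure follows from simple arithmetic. The sketch below assumes the game runs at 30 frames per second (consistent with the 7.5 observations per second of gameplay listed for OpenAI Five earlier in the post) and at most one action per observation; the constant and function names are illustrative.

```python
FRAMES_PER_SECOND = 30  # assumption: game frame rate
FRAME_SKIP = 4          # the agent observes every 4th frame


def max_actions_per_minute(fps=FRAMES_PER_SECOND, frame_skip=FRAME_SKIP):
    """Upper bound on actions per minute, assuming one action per observation."""
    observations_per_second = fps / frame_skip
    return observations_per_second * 60


if __name__ == "__main__":
    print(max_actions_per_minute())
```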

These differences matter most in 1v1 (where our bot had a reaction time of 67ms), but the playing field is relatively equitable as we’ve seen humans learn from and adapt to the bot. Dozens of professionals used our 1v1 bot for training in the months after last year’s TI. According to Blitz, the 1v1 bot has changed the way people think about 1v1s (the bot adopted a fast-paced playstyle, and everyone has now adapted to keep up).

## Surprising findings

A subset of the OpenAI Dota team, holding the laptop that defeated the world’s top professionals at Dota 1v1 at The International last year.

Our team is focused on making our August goal. We don’t know if it will be achievable, but we believe that with hard work (and some luck) we have a real shot.

This post described a snapshot of our system as of June 6th. We’ll release updates along the way to surpassing human performance and write a report on our final system once we complete the project. Please join us on August 5th virtually or in person, when we’ll play a team of top players!

Our underlying motivation reaches beyond Dota. Real-world AI deployments will need to deal with the challenges raised by Dota which are not reflected in Chess, Go, Atari games, or Mujoco benchmark tasks. Ultimately, we will measure the success of our Dota system in its application to real-world tasks. If you’d like to be part of what comes next, we’re hiring!

Greg Brockman, Christy Dennison, Susan Zhang, Jakub Pachocki, Michael Petrov, Henrique Pondé, Przemysław Dębiak, David Farhi, Filip Wolski, Jonathan Raiman, Jie Tang, Szymon Sidor, Brooke Chan

Quirin Fischer, Christopher Hesse, Shariq Hashme, Ilya Sutskever, Alec Radford, Scott Gray, Jack Clark, Paul Christiano, David Luan, Christopher Berner, Eric Sigler, Jonas Schneider, Larissa Schiavo, Diane Yoon, John Schulman

## Current set of restrictions

* Mirror match of Necrophos, Sniper, Viper, Crystal Maiden, and Lich

* No warding

* No Roshan

* No invisibility (consumables and relevant items)

* No summons/illusions

* No Divine Rapier, Bottle, Quelling Blade, Boots of Travel, Tome of Knowledge, Infused Raindrop

* 5 invulnerable couriers, no exploiting them by scouting or tanking

* No Scan

The hero set restriction makes the game very different from how Dota is played at world-elite level (i.e. Captains Mode drafting from all 100+ heroes). However, the difference from regular “public” games (All Pick / Random Draft) is smaller.

Most of the restrictions come from remaining aspects of the game we haven’t integrated yet. Some restrictions, in particular wards and Roshan, are central components of professional-level play. We’re working to add these as soon as possible.

Thanks to the following for feedback on drafts of this post: Alexander Lavin, Andrew Gibiansky, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, David Dohan, David Ha, Denny Britz, Erich Elsen, James Bradbury, John Miller, Luke Metz, Maddie Hall, Miles Brundage, Nelson Elhage, Ofir Nachum, Pieter Abbeel, Rumen Hristov, Shubho Sengupta, Solomon Boulos, Stephen Merity, Tom Brown, Zak Stone
