We show that, for the task of simulated robot wrestling, a meta-learning agent can learn to quickly defeat a stronger non-meta-learning agent, and that the same agent can adapt to physical malfunction.
We’ve extended the Model-Agnostic Meta-Learning (MAML) algorithm by basing its objective on optimizing against pairs of environments, rather than single ones as in stock MAML. MAML initializes the policies of our agents so that, after only a small number of parameter updates during execution in a new environment (or task), the agents learn to do better in that environment. The policy parameters are updated at execution time via gradient-ascent steps on the reward collected during the few episodes of initial interaction with the new environment. By training on pairs, we’re able to create policies that quickly adapt to previously unseen environments, as long as those environments don’t diverge too wildly from previous ones.
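To make the structure of this objective concrete, here is a minimal sketch in JAX. It is not the implementation used in this work: the `expected_reward` function (a quadratic stand-in for a policy-gradient surrogate of the expected reward) and the environment parameterization by target vectors are our assumptions for illustration. The inner loop performs a few gradient-ascent steps on one environment; the outer (meta) gradient is taken through that adaptation with respect to reward in the paired environment.

```python
import jax
import jax.numpy as jnp

# Toy stand-in for expected reward: a quadratic bowl around an environment-
# specific target. A real implementation would use a policy-gradient
# surrogate estimated from sampled episodes; this stand-in is ours.
def expected_reward(params, env_target):
    return -jnp.sum((params - env_target) ** 2)

def inner_adapt(params, env_target, lr=0.1, steps=3):
    # Execution-time adaptation: a few gradient-ascent steps on the reward
    # collected in this environment, as described above.
    for _ in range(steps):
        params = params + lr * jax.grad(expected_reward)(params, env_target)
    return params

def meta_objective(params, env_pair):
    # Pair-based objective: adapt on the first environment of the pair,
    # then score the adapted policy on the second, so the meta-gradient
    # favors initializations that transfer between related environments.
    env_a, env_b = env_pair
    return expected_reward(inner_adapt(params, env_a), env_b)

params = jnp.zeros(3)
pair = (jnp.array([1.0, 0.0, -1.0]), jnp.array([1.2, 0.1, -0.9]))
for _ in range(100):
    # Outer (meta) gradient ascent through the inner adaptation steps.
    params = params + 0.01 * jax.grad(meta_objective)(params, pair)
```

Because JAX differentiates through the inner gradient-ascent steps, the outer update shapes the initialization itself, rather than the policy's behavior in any single environment.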
To test our continuous adaptation approach, we designed three types of agents: Ant (4 legs), Bug (6 legs), and Spider (8 legs). We set up a multi-round game in which each agent played several matches against the same opponent, adapting its policy parameters between rounds to better counter the opponent’s policy. In these tests, we found that agents that could adapt their tactics were much better competitors than agents with fixed policies. After training over a hundred agents, some of which learned fixed policies while others learned to adapt, we evaluated the fitness of each agent.
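For intuition, here is a toy sketch of that multi-round protocol. The scalar "skill" values and the random win model are our stand-ins for MuJoCo wrestling episodes, and the between-round update is a stand-in for the gradient-ascent steps described above, not the actual training setup.

```python
import random

def play_round(skill_a, skill_b):
    # Toy win model: agent A wins with probability proportional to its skill.
    return random.random() < skill_a / (skill_a + skill_b)

def play_match(adapting_skill, fixed_skill, n_rounds=10, step=0.1):
    wins = 0
    for _ in range(n_rounds):
        wins += play_round(adapting_skill, fixed_skill)
        # Between-round update: the adapting agent improves against this
        # particular opponent (stand-in for a few policy-gradient steps);
        # the fixed-policy opponent never changes.
        adapting_skill += step
    return wins

print(play_match(adapting_skill=0.8, fixed_skill=1.0))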
Learning on the fly can also let agents deal with unusual changes in their own bodies, such as some of their limbs losing functionality over time. This suggests we can use techniques like this to develop agents that can handle changes both in their external environment and in their own bodies or internal states.
We’re exploring meta-learning as part of our work on large-scale multi-agent research. Additionally, we’re releasing the MuJoCo environments and trained policies used in this work so that others can experiment with these systems.
Maruan Al-Shedivat, Trapit Bansal, Ilya Sutskever, Yura Burda, Igor Mordatch, Pieter Abbeel