
The Allen Institute for AI (Ai2) recently unveiled what it described as its most capable model family so far, Olmo 3. Since that launch, the team has continued iterating, extending its reinforcement learning (RL) runs to produce an upgraded release: Olmo 3.1.
The Olmo 3.1 lineup is aimed at enterprises, emphasizing efficiency, transparency, and fine-grained control.
Ai2 refreshed two of the three Olmo 3 variants: Olmo 3.1 Think 32B, its flagship model geared toward advanced research, and Olmo 3.1 Instruct 32B, tuned for instruction-following, multi-turn conversations, and tool integration.
The third model in the family, Olmo 3 Base, targets programming, comprehension, and math tasks, and is also well-suited for continued fine-tuning.
According to Ai2, upgrading Olmo 3 Think 32B to Olmo 3.1 involved extending its strongest RL run with a longer training schedule.
“After the original Olmo 3 launch, we resumed our RL training run for Olmo 3 32B Think, training for an additional 21 days on 224 GPUs with extra epochs over our Dolci-Think-RL dataset,” Ai2 explained in a blog post. “This yielded Olmo 3.1 32B Think, which brings substantial gains across math, reasoning, and instruction-following benchmarks: improvements of 5+ points on AIME, 4+ points on ZebraLogic, 4+ points on IFEval, and 20+ points on IFBench, alongside stronger performance on coding and complex multi-step tasks.”
To produce Olmo 3.1 Instruct, Ai2 said its researchers scaled up the approach used for the smaller 7B Instruct model and applied it to the 32B version.
Olmo 3.1 Instruct 32B is “optimized for chat, tool use, & multi-turn dialogue—making it a much more performant sibling of Olmo 3 Instruct 7B and ready for real-world applications,” Ai2 noted in a post on X.
For the moment, the new checkpoints can be accessed via the Ai2 Playground or Hugging Face, with API access planned soon.
Better performance on benchmarks
On standard benchmarks, the Olmo 3.1 models show clear gains, consistently surpassing their Olmo 3 predecessors.
Olmo 3.1 Think outperformed the Qwen 3 32B models on the AIME 2025 benchmark and came close to matching Gemma 27B.
Olmo 3.1 Instruct also performed strongly against other open-source models, even outperforming systems like Gemma 3 on the MATH benchmark.
“As for Olmo 3.1 32B Instruct, it’s a larger-scale instruction-tuned model built for chat, tool use, and multi-turn dialogue. Olmo 3.1 32B Instruct is our most capable fully open chat model to date and — in our evaluations — the strongest fully open 32B-scale instruct model,” the company said.
Ai2 additionally refreshed its RL-Zero 7B models focused on math and coding, saying on X that both benefited from longer, more stable training runs.
Commitment to transparency and open source
Ai2 previously told VentureBeat that the Olmo 3 series was built to give enterprises and research institutions deeper control over, and visibility into, the data and training processes behind the models.
Organizations can augment the training corpus with their own data and retrain the models so they learn from those additions.
This aligns with Ai2’s longstanding focus on openness, which includes a tool called OlmoTrace that helps trace how LLM outputs relate to their training data.
“Together, Olmo 3.1 Think 32B and Olmo 3.1 Instruct 32B show that openness and performance can advance together. By extending the same model flow, we continue to improve capabilities while retaining end-to-end transparency over data, code, and training decisions,” Ai2 said.