
In a new paper on tool use in large language model (LLM) agents, researchers from Google and UC Santa Barbara propose two techniques that help agents manage tool and compute budgets more efficiently: a lightweight “Budget Tracker” plug-in and a broader framework called “Budget Aware Test-time Scaling” (BATS). Both make agents explicitly aware of how much reasoning and tool-use capacity they have left.
As AI agents increasingly depend on tool calls to operate in real-world settings, test-time scaling is shifting from purely improving model intelligence to carefully managing cost and latency.
For enterprise leaders and developers, budget-aware scaling offers a concrete way to deploy capable AI agents while avoiding volatile costs and diminishing returns on compute spending.
The challenge of scaling tool use
Conventional test-time scaling mainly emphasizes letting models “think” for more steps. But for agentic workflows such as web browsing, the number of tool calls directly governs how deeply and widely the agent can explore.
This creates substantial operational overhead for organizations. “Tool calls such as webpage browsing result in more token consumption, increase the context length and introduce additional time latency,” co-authors Zifeng Wang and Tengxiao Liu told VentureBeat. “Tool calls themselves introduce additional API costs.”
The team observed that simply giving agents more test-time resources does not reliably improve outcomes. “In a deep research task, if the agent has no sense of budget, it often goes down blindly,” Wang and Liu said. “It finds one somewhat related lead, then spends 10 or 20 tool calls digging into it, only to realize that the entire path was a dead end.”
Optimizing resources with Budget Tracker
To explore how to better allocate tool-use budgets, the researchers first evaluated a minimal intervention called “Budget Tracker.” This module functions as a plug-in that continuously signals resource availability to the agent, enabling budget-aware tool usage.
Their hypothesis was that “providing explicit budget signals enables the model to internalize resource constraints and adapt its strategy without requiring additional training.”
Budget Tracker works entirely at the prompt level, which keeps integration straightforward. (The paper includes the full prompt templates used for Budget Tracker, further simplifying adoption.)
In Google’s setup, the tracker supplies a short policy description outlining budget regimes and corresponding tool-use recommendations. At every step of the reasoning process, Budget Tracker informs the agent of its current resource consumption and remaining budget, allowing subsequent reasoning to be conditioned on the updated resource state.
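To make the idea concrete, here is a minimal sketch of what prompt-level budget signaling could look like. The class name, thresholds, and wording below are illustrative assumptions, not the paper’s actual templates:

```python
# Illustrative sketch of a prompt-level budget signal (not the paper's exact template).
# Field names, thresholds, and guidance wording are assumptions for demonstration.
from dataclasses import dataclass

@dataclass
class BudgetTracker:
    max_tool_calls: int          # total tool-call budget for the task
    used_tool_calls: int = 0     # calls consumed so far

    def record_call(self) -> None:
        """Increment usage after each search or browse call."""
        self.used_tool_calls += 1

    def status_prompt(self) -> str:
        """Render a short status block injected into the agent's context each step."""
        remaining = self.max_tool_calls - self.used_tool_calls
        ratio = remaining / self.max_tool_calls
        if ratio > 0.5:
            guidance = "ample budget: explore several leads in parallel"
        elif ratio > 0.2:
            guidance = "limited budget: focus on the most promising lead"
        else:
            guidance = "low budget: verify and finalize an answer"
        return (
            f"[Budget Tracker] Used {self.used_tool_calls}/{self.max_tool_calls} tool calls; "
            f"{remaining} remaining. Guidance: {guidance}."
        )

tracker = BudgetTracker(max_tool_calls=20)
tracker.record_call()
print(tracker.status_prompt())
```

Because the signal is just text appended to the context at each step, it can be bolted onto an existing agent loop without retraining or changes to the model itself.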
To evaluate this, the researchers examined two scaling paradigms: sequential scaling, where the model iteratively refines a single trajectory, and parallel scaling, where multiple independent runs are executed and then aggregated. They tested search agents equipped with search and browse tools in a ReAct-style loop. ReAct (Reasoning + Acting) is a widely used approach in which the model alternates between internal deliberation and external actions. To capture a realistic cost–performance curve, they defined a unified cost metric that combines internal token usage with external tool-call expenses.
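A unified cost metric of this kind can be expressed as a dollar-denominated sum of token spend and per-call tool fees. The function below is a minimal sketch with placeholder prices, not the paper’s exact formulation:

```python
# Minimal sketch of a unified cost metric: token cost plus tool-call cost, in dollars.
# All prices below are illustrative placeholders, not the paper's values.
def unified_cost(
    input_tokens: int,
    output_tokens: int,
    search_calls: int,
    browse_calls: int,
    price_per_m_input: float = 1.25,    # $ per million input tokens (placeholder)
    price_per_m_output: float = 10.0,   # $ per million output tokens (placeholder)
    price_per_search: float = 0.005,    # $ per search API call (placeholder)
    price_per_browse: float = 0.001,    # $ per page fetch (placeholder)
) -> float:
    token_cost = (input_tokens * price_per_m_input + output_tokens * price_per_m_output) / 1e6
    tool_cost = search_calls * price_per_search + browse_calls * price_per_browse
    return token_cost + tool_cost

# Example: a trajectory with 200k input tokens, 15k output tokens, 12 searches, 30 browses.
print(f"${unified_cost(200_000, 15_000, 12, 30):.3f}")
```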
Budget Tracker was evaluated on three information-seeking QA benchmarks that require external search, including BrowseComp and HLE-Search, using models such as Gemini 2.5 Pro, Gemini 2.5 Flash, and Claude Sonnet 4. Results show that this simple plug-in improves performance across a range of budget settings.
“Adding Budget Tracker achieves comparable accuracy using 40.4% fewer search calls, 19.9% fewer browse calls, and reducing overall cost … by 31.3%,” the authors told VentureBeat. Moreover, Budget Tracker continued to yield gains as the budget grew, while vanilla ReAct performance flattened beyond a certain point.
BATS: A comprehensive framework for budget-aware scaling
To push tool-use optimization further, the researchers introduced Budget Aware Test-time Scaling (BATS), a more comprehensive framework aimed at maximizing agent performance under any fixed budget. BATS maintains a live estimate of remaining resources and uses it to dynamically adjust the agent’s behavior as it constructs its answer.
BATS coordinates several modules to govern the agent’s actions. A planning module tunes the effort spent at each step to align with the current budget, while a verification module decides whether to “dig deeper” into a promising direction or “pivot” to new leads based on remaining resources.
Given an information-seeking query and a tool-call budget, BATS starts by using the planning module to design a structured action plan and select which tools to call. When tools are invoked, their outputs are appended to the reasoning trace, enriching the context with fresh evidence. Once the agent proposes a candidate answer, the verification module evaluates it and chooses either to continue the current trajectory or to launch a new attempt with the leftover budget.
This iterative loop continues until the allocated resources are exhausted. At that point, an LLM-as-a-judge compares all verified answers and selects the best one. Throughout the process, Budget Tracker keeps updating both resource usage and remaining budget at every iteration.
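Putting the pieces together, the overall loop can be sketched roughly as follows. The helper stubs (plan, execute_tool, propose_answer, verify, judge_best) are hypothetical stand-ins for the paper’s planning, verification, and LLM-as-a-judge components, the control flow is an assumption based on the description above, and the code reuses the BudgetTracker sketch from earlier:

```python
# Rough sketch of the BATS control loop as described above. All helpers are stubs
# standing in for the planning, verification, and LLM-as-a-judge components.
from typing import List

def plan(context: List[str], budget_status: str) -> List[str]:
    """Stub planner: picks tool actions, scaling effort to the budget status."""
    return ["search: <query terms>"]

def execute_tool(action: str) -> str:
    """Stub tool call: returns fetched evidence for the action."""
    return f"evidence for {action}"

def propose_answer(context: List[str]) -> str:
    """Stub: drafts a candidate answer from the current trajectory."""
    return "candidate answer"

def verify(answer: str, context: List[str], budget_status: str) -> str:
    """Stub verifier: returns 'accept' to bank the answer or 'continue' to dig deeper."""
    return "accept"

def judge_best(query: str, answers: List[str]) -> str:
    """Stub LLM-as-a-judge: selects the best verified answer."""
    return answers[-1] if answers else ""

def bats(query: str, tracker: "BudgetTracker") -> str:
    verified_answers: List[str] = []
    context = [query]
    while tracker.used_tool_calls < tracker.max_tool_calls:
        # Planning module: choose tools and effort based on the remaining budget.
        for action in plan(context, tracker.status_prompt()):
            context.append(execute_tool(action))
            tracker.record_call()
        candidate = propose_answer(context)
        # Verification module: keep digging on this trajectory, or pivot to a new attempt.
        if verify(candidate, context, tracker.status_prompt()) == "accept":
            verified_answers.append(candidate)
            context = [query]  # pivot: spend the leftover budget on a fresh attempt
    # Budget exhausted: an LLM-as-a-judge picks among the verified answers.
    return judge_best(query, verified_answers)

# Assumes the BudgetTracker class from the earlier sketch is defined.
print(bats("example information-seeking query", BudgetTracker(max_tool_calls=10)))
```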
The team benchmarked BATS on BrowseComp, BrowseComp-ZH, and HLE-Search, comparing it with baselines such as standard ReAct and several training-based agents. Their experiments indicate that BATS delivers higher accuracy while using fewer tool calls and incurring lower total cost than competing approaches. With Gemini 2.5 Pro as the base model, BATS reached 24.6% accuracy on BrowseComp versus 12.6% for standard ReAct, and 27.0% on HLE-Search versus 20.5% for ReAct.
BATS not only improves performance under strict budgets but also offers superior cost–performance trade-offs. On BrowseComp, for instance, BATS achieved higher accuracy at a cost of about 23 cents, whereas a parallel scaling baseline needed more than 50 cents to reach a similar level.
According to the authors, this efficiency makes previously cost-prohibitive workflows more realistic. “This unlocks a range of long-horizon, data-intensive enterprise applications… such as complex codebase maintenance, due-diligence investigations, competitive landscape research, compliance audits, and multi-step document analysis,” they said.
As enterprises roll out agents that autonomously manage their own resources, designing systems that balance accuracy against cost will become essential.
“We believe the relationship between reasoning and economics will become inseparable,” Wang and Liu said. “In the future, [models] must reason about value.”