
While many enterprises were still nowhere near considering agentic behaviors or infrastructures, Booking.com had already “stumbled” into them through its internally built conversational recommendation system.
This early hands-on work let the company pause and avoid getting swept up in the current AI agent frenzy. Instead, it’s pursuing a disciplined, layered, modular strategy for model development: compact, travel-focused models for low-cost, fast inference; heavyweight large language models (LLMs) for deeper reasoning and comprehension; and tightly domain-tuned, in-house evaluations where accuracy really matters.
With this blended approach — plus selective collaboration with OpenAI — Booking.com has doubled accuracy across core retrieval, ranking and customer-facing interaction tasks.
As Pranav Pathak, Booking.com’s AI product development lead, told VentureBeat in a recent podcast: “Do you build it very, very specialized and bespoke and then have an army of a hundred agents? Or do you keep it general enough and have five agents that are good at generalized tasks, but then you have to orchestrate a lot around them? That's a balance that I think we're still trying to figure out, as is the rest of the industry.”
Listen to the new Beyond the Pilot podcast here, then read on for key takeaways.
From guesswork to deep personalization — without crossing the ‘creepy’ line
Recommendation engines sit at the heart of Booking.com’s customer experiences, but traditional recommenders have historically been closer to guesswork than true personalization, Pathak admitted. From day one, he and his team set out to avoid generic tooling: in his words, both price and recommendations should be grounded in customer context.
Booking.com’s first pre-generative-AI system for intent and topic detection was a small language model, roughly “the scale and size of BERT,” as Pathak described it. The model consumed customer problem descriptions to decide whether an issue could be resolved via self-service or needed to be escalated to a human agent.
“We started with an architecture of ‘you have to call a tool if this is the intent you detect and this is how you've parsed the structure,’” Pathak said. “That was very, very similar to the first few agentic architectures that came out in terms of reason and defining a tool call.”
Since then, his team has expanded that design to include an LLM-based orchestrator that classifies queries, kicks off retrieval-augmented generation (RAG) and invokes APIs or smaller, specialized language models. “We've been able to scale that system quite well because it was so close in architecture that, with a few tweaks, we now have a full agentic stack,” Pathak said.
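The routing pattern Pathak describes — classify a query’s intent, then dispatch to a RAG pipeline, a tool/API call, or a human agent — can be sketched roughly as follows. Everything here is hypothetical and illustrative: in production the classifier would be an LLM call and the handlers would hit real RAG pipelines and internal APIs; Booking.com has not published its implementation.

```python
from typing import Callable, Dict

def classify_intent(query: str) -> str:
    """Toy intent classifier; a real orchestrator would use an LLM here."""
    q = query.lower()
    if "cancel" in q or "refund" in q:
        return "booking_change"      # handled by a tool/API call
    if "policy" in q or "allowed" in q:
        return "knowledge_lookup"    # handled by RAG over help content
    return "other"                   # escalate to a human agent

def call_booking_api(query: str) -> str:
    return "self-service: booking-change flow started"

def answer_with_rag(query: str) -> str:
    return "self-service: answer retrieved from help-center documents"

def escalate_to_human(query: str) -> str:
    return "escalated: routed to a human agent"

HANDLERS: Dict[str, Callable[[str], str]] = {
    "booking_change": call_booking_api,
    "knowledge_lookup": answer_with_rag,
    "other": escalate_to_human,
}

def orchestrate(query: str) -> str:
    """Classify the query, then dispatch to the matching handler."""
    return HANDLERS[classify_intent(query)](query)

print(orchestrate("I need to cancel my reservation"))
print(orchestrate("Are pets allowed at this property?"))
```

In this framing, automating more topics means shrinking the “other” bucket: each new intent the classifier learns to recognize moves traffic from human escalation into a self-service handler.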
The impact: Booking.com is seeing a 2X improvement in topic detection, which in turn is increasing human agent capacity by roughly 1.5 to 1.7X. More topics — including complex ones that previously fell into an ‘other’ bucket and required escalation — are now automated.
This enables more effective self-service and lets human agents concentrate on customers with highly specific, unusual issues that don’t map to an existing tool flow — for example, a family locked out of their hotel room at 2 a.m. when the front desk is closed.
That shift “really starts to compound,” Pathak said, and it has a direct, long-term effect on customer retention. “One of the things we've seen is, the better we are at customer service, the more loyal our customers are.”
Another recent launch is personalized filtering. Booking.com offers between 200 and 250 search filters on its site — far too many for any user to reasonably navigate, Pathak noted. To address this, his team added a free-text input box where users can describe what they want and instantly receive tailored filters.
“That becomes such an important cue for personalization in terms of what you're looking for in your own words rather than a clickstream,” he said.
This, in turn, reveals what customers truly care about. One example: hot tubs. When filter personalization first went live, jacuzzis quickly emerged as one of the most requested features. It hadn’t even been on the radar before; there was no dedicated filter. Now, that filter exists.
“I had no idea,” Pathak admitted. “I had never searched for a hot tub in my room honestly.”
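A minimal sketch of the free-text-to-filters idea, with invented filter names and a deliberately naive keyword table — a production system would use an LLM or embedding model to match phrasing to filters:

```python
# Toy mapping from free-text phrases to structured search filters.
# Filter IDs and keywords are illustrative assumptions, not
# Booking.com's actual taxonomy.
FILTER_KEYWORDS = {
    "hot tub": "jacuzzi",
    "jacuzzi": "jacuzzi",
    "pool": "swimming_pool",
    "breakfast": "breakfast_included",
    "wheelchair": "wheelchair_accessible",
}

def suggest_filters(free_text: str) -> list[str]:
    """Return structured filter IDs matching the user's own words."""
    text = free_text.lower()
    return sorted({f for phrase, f in FILTER_KEYWORDS.items() if phrase in text})

print(suggest_filters("A quiet place with a hot tub and free breakfast"))
# → ['breakfast_included', 'jacuzzi']
```

The matched terms double as a personalization signal: aggregated over many searches, they surface demand (like jacuzzis) that clickstream data alone never reveals.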
Still, personalization has limits; memory is particularly tricky, Pathak stressed. While it’s valuable to maintain long-term context and evolving conversations — such as typical budgets, preferred star ratings or accessibility needs — this must happen on the customer’s terms and with strong privacy protections.
Booking.com is extremely cautious with memory, explicitly seeking consent to avoid being “creepy” when storing or reusing customer data.
“Managing memory is much harder than actually building memory,” Pathak said. “The tech is out there, we have the technical chops to build it. We want to make sure we don't launch a memory object that doesn't respect customer consent, that doesn't feel very natural.”
Striking the right build-versus-buy balance
As agents become more capable, Booking.com is wrestling with a core industry-wide question: How narrow or broad should agents be?
Rather than fully committing to either a large swarm of ultra-specialized agents or a small set of general-purpose ones, the company focuses on reversible choices and avoids “one-way doors” that would lock its architecture into rigid, expensive directions. Pathak’s guiding principle: generalize where it makes sense, specialize where it’s essential, and keep agent design adaptable to preserve resilience.
He and his team stay “very mindful” of each use case, deciding when to invest in more reusable, generalized agents versus tightly scoped, task-specific ones. Their goal is always to deploy the smallest model that still delivers the required accuracy and output quality. Anything that can be generalized, is.
Latency is another major factor. When factual correctness and minimizing hallucinations are critical, the team will opt for a larger, slower model. But for search and recommendations, user expectations demand speed. (As Pathak put it: “No one’s patient.”)
“We would, for example, never use something as heavy as GPT-5 for just topic detection or for entity extraction,” he said.
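That cost-versus-latency trade-off can be expressed as a simple routing table. The model names, tasks and latency budgets below are illustrative assumptions, not Booking.com’s actual configuration:

```python
# Illustrative task→model routing: cheap, fast models for high-volume
# classification; heavier models only where reasoning depth matters.
MODEL_ROUTES = {
    "topic_detection":   {"model": "small-travel-lm", "max_latency_ms": 100},
    "entity_extraction": {"model": "small-travel-lm", "max_latency_ms": 100},
    "policy_reasoning":  {"model": "frontier-llm",    "max_latency_ms": 5000},
}

def pick_model(task: str) -> str:
    """Fall back to the heavy model only for unknown, open-ended tasks."""
    return MODEL_ROUTES.get(task, {"model": "frontier-llm"})["model"]

print(pick_model("topic_detection"))   # fast, compact model
print(pick_model("policy_reasoning"))  # larger model where accuracy matters
```

The design choice mirrors Pathak’s principle: the default is the smallest model that meets the accuracy bar, and the expensive model is the exception, not the rule.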
Booking.com applies a similarly flexible philosophy to monitoring and evaluation. If it’s broad, general-purpose monitoring that a third party can do better and at scale, they’ll buy it. But when it comes to enforcing brand guidelines or other company-specific constraints, they build their own evaluation systems.
Overall, Booking.com has committed to being “super anticipatory,” nimble and adaptable. “At this point with everything that's happening with AI, we are a little bit averse to walking through one-way doors,” Pathak said. “We want as many of our decisions to be reversible as possible. We don't want to get locked into a decision that we cannot reverse two years from now.”
Lessons other builders can draw from Booking.com’s AI evolution
Booking.com’s AI path offers a useful reference for other organizations.
Reflecting on the journey, Pathak acknowledged that they began with a “pretty complicated” tech stack. They’ve since stabilized it, “but we probably could have started something much simpler and seen how customers interacted with it.”
With that in mind, he shared this guidance: if you’re just getting started with LLMs or agents, off-the-shelf APIs are usually more than enough. “There's enough customization with APIs that you can already get a lot of leverage before you decide you want to go do more.”
Conversely, if a use case demands capabilities that standard APIs can’t provide, that’s when it makes sense to invest in in-house tooling.
Still, he cautioned: don’t begin with the most complex projects. Start with “the simplest, most painful problem you can find and the simplest, most obvious solution to that.”
First validate product-market fit, then explore the surrounding ecosystems, he advised — but don’t rip out existing infrastructure just because a new use case seems to require a specific platform (for example, shifting an entire cloud strategy from AWS to Azure solely to access the OpenAI endpoint).
In the end: “Don't lock yourself in too early,” Pathak said. “Don't make decisions that are one-way doors until you are very confident that that's the solution that you want to go with.”