Elasticsearch isn’t your AI search problem. You are.

The gap between a working demo and a working deployment has nothing to do with the platform.

Because nobody owns relevance — and AI search won’t fix that for you.

I co-founded a company in 2005 for one reason: organizations were investing in search platforms and not getting value from them. The technology worked fine. The practice of implementing it well didn’t exist. Twenty years later, I’m watching the same pattern play out with AI search — at considerably higher cost.

A VP or CTO evaluates RAG and hybrid search. They choose a platform — or they already have one. They stand up a pilot. It works in the demo. It struggles in production. And now they’re asking whether the problem is the model, the embeddings, the chunking strategy, or the platform itself.

Almost every time, the answer is none of those things. The problem is that they bought a technology and skipped the practice that makes it work.

We’re focusing on Elasticsearch throughout this piece — one of the most widely deployed enterprise search platforms in the world. The argument applies to any platform. The specifics are Elasticsearch.

The demo works. Your data is different.

Every AI search demo — from Elastic, from any vendor — runs on clean, structured, single-domain data. It performs beautifully. Your data is the opposite of all of that. The distance between what demos on Friday and what survives in production is where most projects go wrong.

The two gaps that cause the most damage are ones nobody talks about in the sales cycle.

Data governance comes before retrieval.  The demo assumes one canonical version of every document. Your environment has three versions of the same policy document: two outdated, one partially OCR’d from a scan, none marked as authoritative. A RAG system retrieves all three and generates a confident answer that blends them. Atlan’s 2026 RAG evaluation research calls this the “context trustworthiness gap” — you can score 0.95 faithfulness and still return business-wrong answers because the index is stale or contradictory. That’s not a retrieval problem. It’s a data governance problem. No platform solves it for you.

Evaluation has to come before optimization.  This is the one most teams skip entirely. Skylar Payne, who has audited dozens of enterprise RAG implementations, found that roughly 90% of teams add architectural complexity — rerankers, multi-step retrieval, routing — without measuring whether any of it actually improves results. Our CTO Phil Lewis has a simpler version of the same point: “measure, optimize, measure, optimize.” He and our senior architect Matt Willsmore walk through exactly how to do that — retrieval metrics, generation metrics, end-to-end task success — if you want a practical framework. If you can’t measure relevance against a representative query set, every architectural decision is a guess.
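To make "measure against a representative query set" concrete, here is a minimal sketch of scoring one retrieval run with NDCG@k. The golden set, document ids, and grades are hypothetical, and the gain is the plain grade (some teams use 2^grade - 1 instead), so treat the numbers as illustrative rather than as a standard.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded results."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, judgments, k=10):
    """NDCG@k for one query.

    ranked_ids: document ids in the order the system returned them.
    judgments:  dict of doc id -> graded relevance (unjudged docs count as 0).
    """
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(judgments.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def mean_ndcg(run, golden_set, k=10):
    """Average NDCG@k across every query in the golden set."""
    scores = [ndcg_at_k(run.get(q, []), judged, k) for q, judged in golden_set.items()]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical golden set: query -> {doc id: grade from 0 to 3}
golden_set = {
    "parental leave policy": {"hr-2024-v3": 3, "hr-2022-v1": 1},
    "error code E1042": {"kb-1042": 3},
}
# Hypothetical system output for the same queries
run = {
    "parental leave policy": ["hr-2022-v1", "hr-2024-v3", "hr-misc"],
    "error code E1042": ["kb-1042", "kb-2001"],
}

print(round(mean_ndcg(run, golden_set, k=10), 3))
```

Run this before and after every change to chunking, reranking, or query construction. If the number does not move, the added complexity has not earned its place.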

Three more things the demo won’t show you:

Permission-aware retrieval.  Most pilots are built without document-level access controls wired through the retrieval pipeline. Elasticsearch has the right primitives — field-level security, document-level filtering, role-based access. But connecting those controls correctly at query time is non-trivial work that gets deferred when teams are racing to ship. The result is a system that works in a sandbox and leaks sensitive content the moment it meets real users with real permission boundaries.
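One way to picture "wired through the retrieval pipeline" is a query builder that refuses to run without the caller's permissions. This is a sketch under assumptions: the `body` and `allowed_groups` field names are hypothetical, and it shows an application-level filter, not Elasticsearch's own role-based document-level security, which is configured on roles rather than in the query.

```python
def permission_filtered_query(user_query, user_groups,
                              text_field="body", acl_field="allowed_groups"):
    """Build a search body that only matches documents the user may see.

    Assumes every indexed document carries a keyword field (acl_field,
    a hypothetical name) listing the groups allowed to read it.
    """
    if not user_groups:
        # Fail closed: no groups means no search, never an unfiltered one.
        raise ValueError("user_groups must not be empty")
    return {
        "query": {
            "bool": {
                "must": [{"match": {text_field: user_query}}],
                # Filter context: restricts the result set without affecting scores.
                "filter": [{"terms": {acl_field: sorted(user_groups)}}],
            }
        }
    }

body = permission_filtered_query("severance terms", {"hr", "legal"})
print(body)
```

The design choice that matters is the fail-closed branch. A pipeline that quietly drops the filter when permissions are missing is how a sandbox pilot turns into a leak.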

Hybrid search, configured carefully.  Elasticsearch’s hybrid surface — BM25 lexical retrieval combined with dense vector kNN — is powerful but genuinely expert-grade. Elastic calls it that themselves. Doug Turnbull’s 2025 production analysis documents how naïve configurations silently degrade: pre-filter vs. post-filter behavior in kNN queries, the three-way tension between similarity thresholds, candidate pool size, and filter application. His benchmarks show a naïve hybrid setup scores around 0.71 NDCG. Getting meaningfully above that requires deliberate engineering. The platform doesn’t configure itself.
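The knobs named above can be seen in one request body. The sketch below builds a hybrid request as a plain Python dict. The parameter names follow the Elasticsearch 8.x kNN search option as I understand it (`k`, `num_candidates`, `filter`, `similarity`), so check them against the documentation for your version. The `body` and `body_vector` field names are hypothetical and the values are illustrative, not recommendations.

```python
def hybrid_search_body(text, query_vector, acl_filter,
                       k=10, num_candidates=100, min_similarity=None):
    """Build a lexical + kNN request body with the same filter on both sides."""
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    knn = {
        "field": "body_vector",            # hypothetical dense_vector field
        "query_vector": list(query_vector),
        "k": k,                            # nearest neighbours to return
        "num_candidates": num_candidates,  # candidate pool examined per shard
        "filter": acl_filter,              # applied during the kNN search (pre-filter)
    }
    if min_similarity is not None:
        knn["similarity"] = min_similarity  # drop neighbours below this threshold
    return {
        "query": {
            "bool": {
                "must": [{"match": {"body": text}}],
                "filter": [acl_filter],    # same restriction on the lexical side
            }
        },
        "knn": knn,
        "size": k,
    }

body = hybrid_search_body(
    "data retention clause",
    [0.1, 0.2, 0.3],
    {"term": {"department": "legal"}},
    k=5, num_candidates=50, min_similarity=0.7,
)
print(body["knn"])
```

Putting the filter inside `knn` is the point. Filtering only after the vector search can leave fewer than `k` hits when most neighbours fail the filter. A tight similarity threshold combined with a small candidate pool can do the same thing quietly.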

Exact-match requirements.  Vector search is significantly weaker at exact matches than BM25. In most enterprise environments — legal, financial, regulatory, technical publishing — users query by clause number, case reference, SKU, or error code. A pure semantic approach confidently returns adjacent results and misses the precise document the user needed. Invisible in demos. Costly in production. BM25 handles it natively. Don’t throw it away.
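Keeping BM25 in the mix can be as simple as fusing the two result lists. Below is a minimal client-side sketch of reciprocal rank fusion, one common fusion method. The SKU ids are hypothetical, and `rank_constant=60` is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(ranked_lists, rank_constant=60):
    """Fuse several ranked lists of doc ids into one.

    Each document scores 1 / (rank_constant + rank) for every list it
    appears in, so a document ranked well by both retrievers rises to the top.
    """
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results for the query "SKU 8841"
lexical = ["sku-8841", "sku-8814"]               # BM25 finds the exact SKU first
semantic = ["sku-8840", "sku-8841", "sku-9000"]  # vectors prefer a near neighbour

fused = reciprocal_rank_fusion([lexical, semantic])
print(fused)
```

In this made-up example the exact SKU ends up first because both retrievers return it, while the semantically adjacent item is backed by only one list.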

FROM THE FIELD:  When EMARKETER first came to us, around 40% of their searches were resulting in no clicks. Not because the platform was broken — but because clients were asking complex questions that went beyond what keyword search could handle. The content existed. Users just couldn’t find it. That gap between “we have the content” and “our users can find it” is exactly where AI search earns its keep. Watch Why EMARKETER implemented AI Search →

The LLM gets the budget. The index gets the blame.

There’s a consistent inversion in how organizations allocate attention to AI search. The LLM — GPT-4o, Claude, Gemini, whichever — gets the most scrutiny and the most spend conversation. It’s also the cheapest problem to fix. Swapping models is a one-line change. Fixing the index, the chunking strategy, the evaluation framework, and the feedback loops is months of work.

Most teams optimize only the final step — generation — while every preceding step in the pipeline is a more likely point of failure. Bad chunking. Stale data. No reranking. No learning from user behavior. The LLM at the end of that chain can only work with what retrieval hands it. If retrieval is broken, generation is confidently wrong.

The numbers tell the story. Gartner forecast that at least 30% of GenAI projects would be abandoned after proof of concept by end of 2025, citing poor data quality, escalating costs, and unclear business value. Anthropic’s contextual retrieval research showed a 49% reduction in retrieval failures when chunks were pre-enriched with document-level context and retrieved with embeddings plus BM25, and 67% when a reranker was added on top. These aren’t marginal gains. They’re the difference between a system that works and one that doesn’t. The reranker is not exotic: Cohere, Elastic, and Databricks all offer one. And yet in most enterprise implementations I review, it’s missing entirely.

30%+ of GenAI projects forecast to be abandoned after proof of concept (Gartner, 2025)

90% of teams add RAG complexity without measuring whether it actually helps (Skylar Payne)

67% reduction in retrieval failures combining contextual chunking, hybrid search, and reranking (Anthropic, 2024)

Nobody owns relevance. That’s the real problem.

Everything above — the data quality issues, the missing evaluation, the unused reranker, the click signals nobody reads — is a symptom of one structural failure: in most organizations, search quality belongs to no one.

Engineering owns the cluster. Data science owns the embedding model. Product owns the UX. Compliance owns the access controls. Nobody owns the quality of results. There is no quarterly OKR that reads “NDCG improvement” or “zero-result rate reduction” or “click-through rate on result position one.” Search quality is nobody’s KPI — so it improves nobody’s performance review — so it degrades slowly and silently — so users work around it — so the business concludes search is a solved problem. It isn’t.
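None of those metrics is hard to compute once someone is asked to. Here is a minimal sketch of two of them from a search log. The `SearchEvent` shape is hypothetical, and real logs need session handling and bot filtering that this leaves out.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SearchEvent:
    query: str
    result_count: int
    clicked_position: Optional[int] = None  # 1-based rank of the click, None if no click

def zero_result_rate(events):
    """Share of searches that returned nothing at all."""
    if not events:
        return 0.0
    return sum(e.result_count == 0 for e in events) / len(events)

def ctr_at_position_one(events):
    """Share of searches with results where the user clicked the top hit."""
    with_results = [e for e in events if e.result_count > 0]
    if not with_results:
        return 0.0
    return sum(e.clicked_position == 1 for e in with_results) / len(with_results)

# Hypothetical log sample
events = [
    SearchEvent("clause 14.2 indemnity", 0),
    SearchEvent("q3 ad spend forecast", 25, clicked_position=1),
    SearchEvent("retail media outlook", 40, clicked_position=3),
    SearchEvent("error E1042", 12),
]
print(zero_result_rate(events), round(ctr_at_position_one(events), 2))
```

The code is the easy part. What is usually missing is a person whose review depends on these two numbers moving.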

CMSWire identified the “relevance engineer” as an emerging profession in 2017. Nine years later the role still doesn’t exist at most large organizations. Search has been treated as infrastructure: you provision it, you maintain it, you don’t tend it. AI search doesn’t change that assumption automatically. It makes the cost of not changing it much higher.

This is where the “easy toy” framing falls apart. The easy toy version of AI search — bolt an LLM on top of whatever you have, watch the demo, ship it — is genuinely easy to start. It’s also genuinely hard to sustain, because sustaining it requires someone whose job is to measure it, tune it, and keep it honest. That person usually doesn’t exist. And when the results degrade, everyone looks at the platform.

Here’s what organizations that are actually getting value from AI search on Elasticsearch have in common. It has nothing to do with which version of the software they run.

They have someone who owns relevance — by name, with a metric on their performance review. They have a golden query set and they run scored evaluations against it regularly. They use click and interaction data to improve ranking — Elasticsearch’s learning-to-rank capability, native in 8.13+, is sitting unused on most enterprise clusters. They did the data governance work before indexing, not after. And they treat hybrid search as a configuration discipline that requires ongoing attention, not a feature you switch on once.
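A scored evaluation only changes behavior if it gates releases. The sketch below compares per-query scores (NDCG, graded accuracy, whatever the team uses) for a baseline and a candidate configuration over the same golden query set. The thresholds and the scores are hypothetical; the right values depend on how noisy your grading is.

```python
def compare_runs(baseline, candidate):
    """Summarize per-query score changes between two configurations.

    baseline, candidate: dict of query -> score on the same metric.
    Only queries present in both runs are compared.
    """
    shared = sorted(set(baseline) & set(candidate))
    deltas = {q: candidate[q] - baseline[q] for q in shared}
    return {
        "wins": sum(d > 0 for d in deltas.values()),
        "losses": sum(d < 0 for d in deltas.values()),
        "ties": sum(d == 0 for d in deltas.values()),
        "mean_delta": sum(deltas.values()) / len(deltas) if deltas else 0.0,
        # Queries that got worse, so a human can look at them before launch.
        "regressions": [q for q in shared if deltas[q] < 0],
    }

def passes_gate(summary, min_mean_delta=0.02, max_losses=1):
    """Hypothetical release gate: better on average, few regressions."""
    return summary["mean_delta"] >= min_mean_delta and summary["losses"] <= max_losses

# Hypothetical per-query scores for three golden queries
baseline = {"q1": 0.6, "q2": 0.9, "q3": 0.5}
candidate = {"q1": 0.8, "q2": 0.85, "q3": 0.5}

summary = compare_runs(baseline, candidate)
print(summary, passes_gate(summary))
```

The list of regressed queries is often more useful than the average, because it tells the relevance owner exactly which results to inspect before signing off.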

WHAT GOOD LOOKS LIKE: EMARKETER started with 30 representative queries. They ran them again and again. Graded the results — accuracy, completeness, relevance, quality — and passed feedback to our team. Ricardo Leon and Adriana Morales turned changes around fast: prompt engineering, chunking adjustments, asset type boosts. Three evaluation gates over three to four months, then a full launch. Almost universally positive feedback. That’s not a story about great technology. It’s a story about a disciplined process. Watch Dan Van Dyke, EMARKETER’s VP of AI, walk through it →

We’ve written about what search modernization actually looks like for organizations managing complex, long-standing implementations. The short version: respect what’s already been built. Add the practice layer on top. Don’t confuse a platform decision with a relevance strategy.

Three questions that tell you where you stand

If you’re planning a build, or trying to diagnose why your current implementation is underperforming, the right question isn’t “which model should we switch to.” It’s these:

Who in your organization owns search relevance — by name, with a specific metric on their performance review?

What is your golden query set, and when did you last run a scored evaluation against it?

When did you last run a controlled experiment on a ranking or retrieval change and measure the outcome?

If you can’t answer all three, you don’t have an AI search problem. You have an organizational problem that AI search will make more visible and more expensive.

Elasticsearch will host the answer beautifully. It won’t produce the answer for you. That’s what we’re here for.

– Kamran
