
Guest Blog: What is OpenSearch?

What is OpenSearch? A Pureinsights Guest Blog by Stavros Macrakis, Senior Product Manager and Lead PM for Search, OpenSearch

What is OpenSearch?

OpenSearch is an open-source Apache 2.0 licensed search and analytics suite for many use cases, including website and e-commerce search, application monitoring, and log analytics. It can be run from a personal laptop or scaled to a corporate data center and multiple cloud solutions, including Oracle, Aiven-Azure, and AWS. Each of these cloud solutions supports a managed version of OpenSearch.

Beyond classic text search techniques, OpenSearch incorporates AI/ML techniques such as semantic vector search. Furthermore, OpenSearch Dashboards offers customizable, integrated visualizations that make it easy to explore large volumes of data.

This post will focus on what OpenSearch can provide in search use cases and does not cover analytics use cases.

How does OpenSearch work?

At the core of OpenSearch is the Apache Lucene search library, which provides highly efficient data structures and algorithms for ingesting, indexing, searching, and aggregating data. Although Lucene was initially designed specifically for natural language searching, its basic algorithms are the same as those used in modern columnar databases. This makes OpenSearch very efficient at fast column retrieval, which is needed both for natural language search and analytical operations.

OpenSearch manages and coordinates multiple Lucene instances, distributing the computational load among multiple nodes. This horizontal scaling enables OpenSearch to handle massive data volumes efficiently. OpenSearch also supports replication and geographic distribution, which provide high reliability.

To make managing data easier, OpenSearch Dashboards offers many visualizations and additional plugins to increase the number of ways you can index and manage your data.

The Community

OpenSearch is a fully open-source community project hosted in multiple repositories on GitHub. The community’s home page is opensearch.org.

OpenSearch is distributed under the Apache 2.0 license, giving its users enormous flexibility. Development is driven by the user community and conducted in the open. OpenSearch does not require contributors to assign their rights with a contributor license agreement, allowing contributors the flexibility to use their contributions as they wish. OpenSearch has hundreds of contributors and multiple community projects around it. There is never vendor lock-in.

Proposals for new features and functionality are generally first published as Requests for Comments. Regular user meetings allow users to give feedback, request features, and discuss other issues.

OpenSearch has a rich partner ecosystem. Partners may offer consulting, training, and hosting and often contribute to the code base.

Applications

The core OpenSearch engine is a high-performance database system, well suited for application monitoring, log analytics, and data observability. Since this post focuses on search, we won't go into detail on these applications.

Search applications break down into three big categories:

  • Document search, which works primarily on unstructured free text.
  • E-commerce search, which works on a mix of structured and unstructured data.
  • Query offloading, which operates mainly from structured data.

Document search

The best-known application of search engines in general, and of OpenSearch in particular, is document search. A document can be a web page, a technical report, a customer support knowledge item, a newspaper article, an email, or any other natural-language text, even programming language code.

OpenSearch users have document collections, sometimes called corpuses or corpora, ranging from thousands to millions of items. Many document collections are relatively static, but others, like newspaper or finance collections, change rapidly. We can consider these examples of unstructured data: information that, unlike a phone number or a credit card number, is not easily mapped to a relational structure.

In document search, users typically search the main text, or body, of a document, which can be as short as a paragraph or as long as thousands of pages. Documents also usually include a variety of other fields:

  • Unstructured text fields, like the title and summary
  • Semistructured fields, like the author of the body text
  • Structured fields about the body text, or metadata:
    • Publication date
    • Originating group
    • Category

Many document search systems support faceting on metadata — typically presented as categories along the left side of the search result page.
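
As a conceptual illustration of faceting, the sketch below tallies metadata values over a small result list and then narrows the list by one value. The documents and field names are made up for the example, and OpenSearch computes such counts on the server rather than in client code like this.

```python
from collections import Counter

# Hypothetical search hits with metadata fields (illustrative data only).
hits = [
    {"title": "Quarterly outlook", "category": "finance", "group": "research"},
    {"title": "Rate decision recap", "category": "finance", "group": "newsroom"},
    {"title": "Router reset guide", "category": "support", "group": "helpdesk"},
]

def facet_counts(results, field):
    """Count how many results carry each value of a metadata field."""
    return Counter(doc[field] for doc in results if field in doc)

def apply_facet(results, field, value):
    """Keep only the results whose metadata field equals the chosen value."""
    return [doc for doc in results if doc.get(field) == value]

counts = facet_counts(hits, "category")
finance_only = apply_facet(hits, "category", "finance")
print(dict(counts))
print([doc["title"] for doc in finance_only])
```

The counts are what a result page would show next to each category, and selecting a category corresponds to the filtering step.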

E-commerce search

Many organizations use OpenSearch for searching e-commerce catalogs. E-commerce has several vital differences from classic document search.

For one, there tends to be much less free text content. Product descriptions are usually only a few sentences or paragraphs long. Some catalogs are very technical, and users know what they want, for example, a 4mm Allen wrench.

But most catalogs need to support broader searches. Besides showing relevant results (which we'll discuss in more detail below), e-commerce sites typically want to feature particular products based on reviews, profit margins, inventory, etc. Inventory is critical and changes frequently, so e-commerce search typically has a high volume of updates, unlike document search.

Some types of e-commerce search depend heavily on personalization and recommendations. If a shopper asks to see dresses, ideally, the search engine should find dresses that this particular customer might be interested in, even though the query is very open-ended. Similarity metrics like kNN are beneficial to personalization.

Finally, almost all e-commerce searches support faceting, allowing users to narrow their search results with multiple filters.

Query offloading

Transactional database systems are generally the system of record for changing business data, but they are not very efficient for reporting and analysis. Moreover, running analytical workloads on the same server as transactional workloads can hurt transaction times. For these use cases, OpenSearch supports query offloading.

OpenSearch, by contrast, is not well suited for transactional applications but is optimized to run analytical workloads efficiently. Thus, many users mirror their transactional data onto an OpenSearch system for analysis and often use OpenSearch Dashboards for visualizing the results.

OpenSearch’s implementation of SQL is widely used for query offloading, and users have often seen up to 10x improvements in performance at up to 10x lower cost than running traditional Database Management Systems (DBMS).

In the query offloading case, queries use only structured data fields (although they may return some unstructured data) and do not require relevance ranking.

Relevance scoring

The key to effective document and e-commerce search is relevance — do the search results meet the searcher’s needs? Search engines attempt to put the best results on top using various techniques called relevance ranking.

However, user needs and document collections vary tremendously, so there is no universal relevance ranking method. For example, a doctor researching a disease complication in an array of research articles has very different needs from a social media user looking for a picture of themselves in Paris or a customer service agent troubleshooting a problem on a customer’s cellphone using a knowledge base.

OpenSearch ranks results by calculating a score intended to approximate a human user's judgment of the documents' relevance. The most basic method treats queries and documents as collections of words ("bags of words"), ignoring the order. If a query term appears more times in one document than in another, the first document is more likely to be relevant.

If multiple query terms exist, the rarer ones are given more weight than the common ones. We call this model the “term frequency-inverse document frequency” (TF-IDF) ranking model. A refinement of TF-IDF called BM25 also considers document length.
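
To make the scoring idea concrete, here is a minimal sketch of a textbook BM25 formula in plain Python. The function name, the toy documents, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration; the exact formula and defaults inside OpenSearch and Lucene may differ in detail.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rarer terms receive a larger inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Repeated terms saturate, and longer documents are discounted.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "hot dog bun recipe",
    "dog training tips for a new dog owner",
    "weather report for a hot summer",
]
scores = bm25_scores("hot dog bun", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

The `b` parameter controls how strongly document length is taken into account, which is the refinement BM25 adds over plain TF-IDF.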

BM25 is the foundation of standard search rankings. Basic BM25 on a bag of words works surprisingly well even though it completely ignores word order. The simplest way of taking word order into account is to look for not just words but word pairs (“shingles”); if the query is [hot dog bun], the search system looks for [hot] and [dog] and [bun], as before, but gives additional weight to documents including the pairs hot-dog and dog-bun. Usually, it isn’t necessary to explicitly weigh longer phrases.
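
The word-pair idea can be sketched in a few lines. The helper names and the `pair_bonus` weight below are hypothetical, and the score is a simple hit count rather than BM25, so that the effect of the pairs is easy to see.

```python
def shingles(text, size=2):
    """Return the consecutive word groups (shingles) of the given size."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

def score_with_shingles(query, doc, pair_bonus=2.0):
    """Count matching words, then add a bonus for each matching word pair."""
    doc_words = set(doc.lower().split())
    doc_pairs = set(shingles(doc))
    word_hits = sum(1 for word in query.lower().split() if word in doc_words)
    pair_hits = sum(1 for pair in shingles(query) if pair in doc_pairs)
    return word_hits + pair_bonus * pair_hits

print(shingles("hot dog bun"))
print(score_with_shingles("hot dog bun", "toasted hot dog bun"))
print(score_with_shingles("hot dog bun", "the dog ate a bun on a hot day"))
```

Both documents contain all three query words, but only the first contains the pairs, so it scores higher.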

The next problem is how to account for matches in the different fields. Text matches in metadata are often more significant than matches in body text, so relevance algorithms generally weigh them more heavily. For example, a match in the title field usually says more about a document than a match in the body field, so OpenSearch provides for weighting different fields by different amounts.
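
One way to express field weights is a `multi_match` query in the OpenSearch query DSL, where a field name followed by `^` and a number boosts that field. The sketch below only builds the request body as a Python dictionary; the field names and weights are assumptions for illustration, and nothing is sent to a server.

```python
import json

def weighted_field_query(text, field_weights):
    """Build a multi_match request body with a per-field boost."""
    fields = [f"{name}^{weight}" for name, weight in field_weights.items()]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}

# Hypothetical fields: title matches count three times as much as body matches.
body = weighted_field_query("cozy sofas", {"title": 3, "summary": 2, "body": 1})
print(json.dumps(body, indent=2))
```

Choosing good weights is itself a tuning problem, since the right balance depends on the collection and its users.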

Weight is part of what makes relevance scoring complex. The relevance score calculation has to include parameters specifying the relative weight of word pairs, words in different fields, and so on. How can we adjust these parameters for the best results? We’ll return to this problem later.

Towards semantic matching

So far, the algorithm has only looked for exact word matches. It’s often helpful to broaden the query for matches that aren’t exact, but are close enough in meaning (semantics) that they are relevant to the query.

For example, if the query is [cozy sofas], a document touting “a cozier sofa for the holidays” is probably relevant even though it is not an exact word match. So, it’s usually helpful to reduce words to their base forms: treat cozier as a synonym for cozy, and sofas as a synonym for sofa. Search developers call this practice stemming or lemmatization.

OpenSearch's language analyzers handle stemming or lemmatization. But like many of the techniques we're talking about, it's a heuristic and sometimes gets things wrong. For example, a dress and a dresser are not at all similar, even though dresser superficially relates to dress the way cozier relates to cozy.
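
The following toy suffix stripper shows both the benefit and the failure mode. It is a deliberately naive sketch with made-up rules, not the algorithm used by any OpenSearch analyzer.

```python
def naive_stem(word):
    """Reduce a word to a rough base form with three simple suffix rules."""
    word = word.lower()
    if word.endswith("ier"):
        return word[:-3] + "y"   # cozier -> cozy
    if word.endswith("er") and len(word) > 4:
        return word[:-2]         # dresser -> dress (an unwanted match)
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]         # sofas -> sofa
    return word

for word in ["cozy", "cozier", "sofa", "sofas", "dress", "dresser"]:
    print(word, "->", naive_stem(word))
```

The same rule that usefully maps a comparative to its base form also merges dresser into dress, which is why stemming remains a heuristic.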

OpenSearch also supports synonyms, so that [cozy sofas] can match “comfortable couches”; multi-word synonyms let “hot dog” be a synonym for “frankfurter”. Synonyms are often specific to a domain: in clothing, a tee is a T-shirt or teeshirt, but in plumbing, a tee is a T-joint or tee-joint. Generating useful synonym lists for a domain is time-consuming. Using synonyms can lead to unwanted matches since many words have multiple meanings: a river bank is not a kind of financial institution!
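
A minimal sketch of domain-specific synonym expansion follows. The synonym tables and the function name are hypothetical, and real deployments typically configure synonyms in the analyzer rather than in application code.

```python
# Hypothetical per-domain synonym tables, using the examples from the text.
SYNONYMS = {
    "clothing": {"tee": ["t-shirt", "teeshirt"]},
    "plumbing": {"tee": ["t-joint", "tee-joint"]},
}

def expand_query(query, domain):
    """Return, for each query term, the term plus its synonyms in the domain."""
    table = SYNONYMS.get(domain, {})
    return [[term] + table.get(term, []) for term in query.lower().split()]

print(expand_query("blue tee", "clothing"))
print(expand_query("brass tee", "plumbing"))
```

Keeping the tables separate per domain is what prevents a clothing search from matching pipe fittings.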

Vector embeddings

Given all these limitations of traditional search techniques depending on word-by-word matching, there have been decades of research on more flexible approaches. In the past decade, a radically different class of search techniques has emerged: vector embeddings, also called semantic or neural search.

The idea of vector search is to encode queries and documents as mathematical objects called vectors, such that a query and a document with related meanings end up with vectors that are "nearby." With the stunning advances in machine learning (ML) techniques and the availability of vast corpora of textual data, it is now possible to train language models that can encode text usefully; the encoding is called vectorization. Vectorization uses models that look not just at individual words but at patterns of words. These patterns capture much more semantic context than synonyms do.

OpenSearch supports semantic vector embedding search through its vector nearest-neighbor module (kNN). kNN can load documents along with their vectorizations. When searching, queries are vectorized, and the search returns the closest document vectors. Performing an exact vector search on a large corpus is computationally expensive, so OpenSearch also supports approximate nearest-neighbor calculations, which can be very fast even for billions of vectors.

With vector search, a document doesn’t have to have any words or even synonyms in common with a query to be considered relevant. For example, a query on `[bicycle maintenance]` could match a document about “derailleur lubrication.” The ML algorithm has observed that “derailleur lubrication” often appears close to discussions of bicycles and their maintenance.
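
The sketch below shows exact nearest-neighbor search by cosine similarity over a handful of vectors. The three-dimensional vectors are invented by hand to stand in for model output, and a real system would use embeddings with hundreds of dimensions and, at scale, an approximate index instead of this exhaustive comparison.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query_vec, doc_vecs, k=2):
    """Compare the query with every document vector and return the top k ids."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Hand-made stand-ins for embeddings (not produced by any real model).
doc_vecs = {
    "derailleur lubrication": [0.9, 0.1, 0.0],
    "brake pad replacement": [0.8, 0.3, 0.1],
    "sourdough starter care": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # stands in for the query "bicycle maintenance"
top = nearest_neighbors(query_vec, doc_vecs, k=2)
print(top)
```

Note that the query and the top documents share no words at all; the match comes entirely from the closeness of the vectors.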

Combining BM25 and embeddings

BM25 and vector embeddings have complementary strengths and weaknesses. BM25 is strong when searching for a specific, narrow term, such as a part number or an unusual molecule; vector search is almost useless in those cases. On the other hand, vector search is powerful when searching for a concept with no standard name or when looking for related information; BM25 does poorly there. What's more, BM25 and kNN scores are not directly comparable: they have different ranges and distributions.

An area of active development in OpenSearch is intelligently combining BM25 and vector results. Through a combination of range normalization, score combination, and rank combination, developers can draw on the distinct strengths of the two technologies, giving better results than either one alone.
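
As one possible illustration of range normalization followed by score combination, the sketch below rescales each score set to the range 0 to 1 and then takes a weighted average. The function names, the sample scores, and the equal weighting are assumptions for the example; this is not a description of how OpenSearch implements hybrid ranking.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping so its values span 0.0 to 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def combine(bm25, knn, bm25_weight=0.5):
    """Rank documents by a weighted average of their normalized scores."""
    bm25_n, knn_n = min_max_normalize(bm25), min_max_normalize(knn)
    combined = {
        doc_id: bm25_weight * bm25_n.get(doc_id, 0.0)
        + (1 - bm25_weight) * knn_n.get(doc_id, 0.0)
        for doc_id in set(bm25_n) | set(knn_n)
    }
    return sorted(combined, key=combined.get, reverse=True)

# Made-up raw scores on very different scales.
bm25 = {"a": 12.0, "b": 7.5, "c": 3.0}
knn = {"b": 0.92, "c": 0.90, "d": 0.60}
ranking = combine(bm25, knn)
print(ranking)
```

Document "b" is not the top result of the BM25 list, yet it ranks first overall because it scores well in both lists.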

Query languages

OpenSearch supports three query languages: DSL, SQL, and PPL.

Rather than being designed specifically for text searching, as in some other search engines, DSL is a framework that hosts all the usual text search queries — including phrase search, prefix matching, wildcards, field search, and so on. DSL supports combining these queries in various ways, not just with Boolean operators but also explicit calculations on relevance ranking scores. It also supports complete filtering and aggregation. DSL queries are written in JSON and provide a highly structured query language.
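
To give a feel for the JSON structure, the sketch below assembles a DSL request body that combines a text match, an optional phrase match that raises the score, a date filter, and an aggregation. The field names (`body`, `title`, `publication_date`, `category`) are hypothetical, and the code only builds and prints the JSON without contacting a cluster.

```python
import json

def build_search_body(text, published_since, size=10):
    """Assemble a bool query with scoring clauses, a filter, and a facet."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Documents must match the text in the body field.
                "must": [{"match": {"body": text}}],
                # An exact phrase in the title raises the score when present.
                "should": [{"match_phrase": {"title": text}}],
                # Filters restrict results without affecting the score.
                "filter": [
                    {"range": {"publication_date": {"gte": published_since}}}
                ],
            }
        },
        # Count matching documents per category, as used for faceting.
        "aggs": {"by_category": {"terms": {"field": "category"}}},
    }

body = build_search_body("disease complication", "2020-01-01")
print(json.dumps(body, indent=2))
```

The `must` and `should` clauses contribute to the relevance score, while the `filter` clause only includes or excludes documents.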

OpenSearch also offers SQL querying. SQL is a standard language familiar to many analysts, and OpenSearch has now enriched it with the relevance ranking features of DSL.

Finally, OpenSearch supports PPL, the Piped Processing Language, which follows a model of multiple processing stages, each piped (connected) to the next. Most developers use PPL for analytics work.

Evaluation and tuning

We've discussed that OpenSearch offers many options and parameters for tuning search result relevance. But using these options effectively is difficult because it is hard to evaluate the relative quality of different option configurations.

Fortunately, many open-source tools evaluate search result quality and help tune relevance, including OpenSource Connections' Quepid and Sease's RRE. The OpenSearch team and outside contributors are continuing to extend these systems. Over time, we plan to integrate these and similar tools into OpenSearch to make them easier to use.

How can you get involved?

This post has reviewed some of the core relevance ranking features of OpenSearch. Still, it only scratches the surface of OpenSearch's capabilities, which also include faceting, running ML models within OpenSearch (ML Commons), geographic search, typeahead and predictive search, prospective search (monitoring new content for search matches), and so on.

We will continue developing OpenSearch in these and many other areas and invite you to participate in this exciting open-source project by contributing your ideas, code, and even bug reports!

Editor's Note: Pureinsights is ever so thankful to Stavros Macrakis, Senior Product Manager and Lead PM for Search, OpenSearch, for his Guest Blog post.
