Comparing Vector Search Solutions 2023

Vector Search: How do Different Solutions Compare?

Those of you who keep an eye on search will know that the last few years have seen an earthquake take place. After 50 years of lexical search being dominant, and only a couple of ranking algorithms (step forward and take a bow TF-IDF and BM25), we now have a new kid on the block: Vector Search.

The popular search library Lucene, which is the foundation of Elasticsearch, Solr, MongoDB Atlas Search and OpenSearch, introduced dense vector fields back in late 2021, and these four search engines have each subsequently implemented vector search slightly differently. Additionally, a whole new technology, the vector database, has arisen to take advantage of this new way of searching. In fact, we have recently seen Elasticsearch calling their technology a vector database, which just muddies the picture further.

So, we thought it might be useful to just lay out the differences between how vector search is implemented in the current main market contenders in 2023. To keep this round up manageable we’ll consider the search engines mentioned above, plus four prominent vector databases, namely Pinecone, Chroma, Weaviate and Milvus.

Hang about. What’s vector search again?

Five years ago, Google released what might be considered the first Large Language Model (LLM), called BERT. It transforms blocks of text into an array of floating-point numbers (a vector) which encodes the meaning of the text without reference to the original keywords themselves. So the words ‘car’ and ‘automobile’ would have very similar (if not identical) vector representations. The fact that the vector is numerical means that new algorithms can be used to calculate similarity between vectors (and therefore texts), and hey presto, we have vector search. Since then, the quality of the vectors (or embeddings, as they are often called) has kept improving as new and better LLMs arrive (as they do literally every week), and new ways of storing vectors and implementing vector search have emerged.
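To make the similarity idea concrete, here is a minimal sketch of how cosine similarity scores two embeddings. Note this uses toy 4-dimensional vectors invented for illustration; real BERT embeddings typically have 768 dimensions:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity = dot(a, b) / (|a| * |b|); ranges from -1 to 1,
    # where 1 means the vectors (and hence the texts) point the same way.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: 'car' and 'automobile' are close; 'banana' is not.
car        = [0.9, 0.1, 0.4, 0.0]
automobile = [0.85, 0.15, 0.38, 0.05]
banana     = [0.1, 0.9, 0.0, 0.6]

print(cosine_similarity(car, automobile))  # close to 1.0
print(cosine_similarity(car, banana))      # much lower
```

The same calculation underlies most of the distance measures discussed below; Euclidean distance and dot product are simple variations on the same theme.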

And what are all those different systems you were going on about?

Right, yes. Here’s a quick overview. First, the search systems that have added vector search on top of what they already have:

Solr: Solr was the original open source search engine, becoming an Apache project in 2006. Solr has a large and active community, is still widely used, and is a popular choice for enterprise search and content management applications.

Elasticsearch: Elasticsearch is a distributed search and analytics engine that supports various types of data, including vectors. It offers a variety of features, including machine learning, log analytics and alerting. It offers both Open Source and licensed versions.

OpenSearch: OpenSearch was forked from Elasticsearch in 2021. OpenSearch is committed to being an open and inclusive project, and it is governed by a community board. OpenSearch is a good choice for organizations that are looking for a fully open-source alternative to Elasticsearch.

MongoDB Atlas Search: MongoDB Atlas Search is a fully managed embedded search engine that enables developers to build powerful search experiences into their applications as part of a document database without having to manage a separate search system. It is built on Apache Lucene too.

And then the dedicated Vector databases:

Pinecone: Pinecone is a cloud-native vector database that is designed to be easy to use and scale. It is fully managed and offers a variety of features such as search, filtering and a range of integrations.

Milvus: Milvus is an open-source vector database that is designed for high performance and scalability. It offers a high degree of flexibility in the number of vector field types and distance algorithms including for specialized domains such as chemical structures.

Chroma: Chroma is an AI-native open-source embedding database. It is designed to be fast and easy to use, and it supports a variety of embedding models.

Weaviate: Weaviate is an open-source vector database that allows you to store data objects and vector embeddings from your favorite ML-models. It has additional features such as aggregations, sharding and BM25 search which make it good for hybrid applications.

The first three (Solr, Elasticsearch and OpenSearch) are traditional Lucene-based search engines which have all recently added vectors as an additional search capability. MongoDB Atlas is a hugely popular multi-model database which has added full-text and now vector search over the last few years, again based on Lucene. The final four are dedicated vector databases, a new breed of technology for storing and managing vectors and providing dedicated vector search.

How should we compare them?

Here’s a few criteria we will consider:

  • Vector search implementation: how flexible is it? Which algorithms are supported? Are there limitations?
  • Features: What other features does the implementation offer aside from vector search?
  • Scalability: How many vectors can the database support?
  • Licensing: Is the database open source or commercial?

So how do the vector search capabilities compare?

Please note that this field is evolving fast. All evaluations are based on the latest versions at the time of writing, but if you think we’ve got anything wrong, please let us know!

Below is a simple summary focusing on the vector search capabilities. The columns are:

  • ANN implementation – which underlying K/ANN implementations are available?
  • Distance measures – which distance measures are available?
  • Filter support – can filters be applied – how flexible are they?
  • Keyword search – is keyword search also available, how flexible is it?
  • Hybrid search – can vector and keyword search be combined, are there any limitations?
  • Max Dimensions – how many dimensions do the vector fields support?
Solr (v9.3)
  • ANN implementation: Lucene
  • Distance measures: Euclidean, dot product, cosine similarity
  • Filter support: Yes – can be combined with or used as a filter query
  • Keyword search: Yes – BM25
  • Hybrid search: Can be used as a re-ranking query or filter query
  • Max dimensions: No maximum enforced, but the Lucene limit is 2048

Elasticsearch (v8.10)
  • ANN implementation: Lucene. Can also do exact KNN
  • Distance measures: Euclidean, dot product, cosine similarity
  • Filter support: Yes – can be combined with a filter query
  • Keyword search: Yes – BM25
  • Hybrid search: Yes – can combine keyword and vector results with normalized scores
  • Max dimensions: 2048

OpenSearch (v2.9)
  • ANN implementation: Lucene, Faiss, nmslib. Can also do exact KNN
  • Distance measures: Euclidean, inner product, cosine similarity, L∞ (linf)
  • Filter support: Yes – can be combined with or used as a filter during or after search
  • Keyword search: Yes – BM25
  • Hybrid search: No – but should be available in v2.10 as part of the search pipeline framework
  • Max dimensions: 16,000

MongoDB Atlas (v7.0)
  • ANN implementation: Lucene
  • Distance measures: Euclidean, cosine, dot product
  • Filter support: Yes
  • Keyword search: Yes – BM25
  • Hybrid search: No, but expected in a future version
  • Max dimensions: 2048

Pinecone
  • ANN implementation: Proprietary (based on Faiss)
  • Distance measures: Euclidean, dot product, cosine
  • Filter support: Yes – can be combined with or used as a filter query
  • Keyword search: Sort of – can encode text as a sparse-dense vector
  • Hybrid search: Sort of – can combine keyword and vector embeddings as a combined vector
  • Max dimensions: 20,000

Milvus (v2.3)
  • ANN implementation: Proprietary (Knowhere, based on Faiss)
  • Distance measures: Approximate KNN – Euclidean, inner product, cosine similarity, Jaccard, Hamming, Tanimoto, Superstructure and Substructure
  • Filter support: Yes
  • Keyword search: Sort of – Boolean search via scalar fields
  • Hybrid search: Not really – can be combined with other search expressions on scalar fields, but these are filters
  • Max dimensions: 32,768

Chroma
  • ANN implementation: Proprietary
  • Distance measures: Approximate KNN – Euclidean, cosine similarity, inner product
  • Filter support: Yes
  • Keyword search: Not really – can specify a where or contains clause
  • Hybrid search: No
  • Max dimensions: Aligned with supported models

Weaviate (v1.21.3)
  • ANN implementation: Proprietary
  • Distance measures: Approximate KNN – squared Euclidean, dot product, cosine similarity, Manhattan, Hamming
  • Filter support: Yes
  • Keyword search: Yes – BM25
  • Hybrid search: Yes – can specify a fusion method
  • Max dimensions: 65,535
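As an illustration of how the Lucene-based engines expose this, here is roughly what an approximate kNN search request looks like in Elasticsearch 8.x. The index name, field name and filter are made up for the example; consult the Elasticsearch documentation for the authoritative syntax:

```json
POST /products/_search
{
  "knn": {
    "field": "title_embedding",
    "query_vector": [0.12, -0.53, 0.97],
    "k": 10,
    "num_candidates": 100,
    "filter": { "term": { "in_stock": true } }
  }
}
```

The other Lucene-based engines offer broadly equivalent request shapes, while the dedicated vector databases expose similar parameters through their own client SDKs.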

What about when you consider other features?

Vector search is being used in a variety of ways to provide novel search experiences. The most common are:

  1. Semantic Search – because vectors don’t rely on keywords, they can find content based on the underlying meaning, and are therefore in some ways better than (or at least different from) keyword search. They do prioritise recall over precision, however, which is not always what you want.
  2. Featured Snippets – this is where a semantic search is used to find a candidate block of text containing an answer to a question. Another model is then used to pluck out (or highlight) the answer itself.
  3. Retrieval Augmented Generation (RAG) – this is where a model like GPT is asked to provide an answer based on the results from semantic search rather than its broader training data. This reduces hallucinations and allows an organization’s own data to be the basis for the answer.

Importantly, methods 2 and 3 are completely dependent on the quality of the semantic search (RAG in particular is causing a lot of excitement at the moment). If the top few results generated by semantic search which form the context for the answer don’t contain what you want, or perhaps instead include an incorrect, old or outdated version of the information, then incorrect answers will be output by the generative AI part.
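The retrieve-then-prompt shape of RAG can be sketched as follows. This is a toy illustration: the “embeddings” are simple word-count vectors rather than LLM output, and the final generation step is stubbed out, but the overall flow matches the real pattern:

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = [
    "Solr became an Apache project in 2006.",
    "OpenSearch was forked from Elasticsearch in 2021.",
    "Weaviate supports BM25 keyword search alongside vectors.",
]

def retrieve(question, k=2):
    # Semantic-style retrieval: rank documents by similarity to the question.
    q = embed(question)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_rag_prompt(question):
    # Ground the model in the retrieved context instead of its training data.
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_rag_prompt("When was OpenSearch forked from Elasticsearch?")
# In a real system, the prompt would now be sent to a generative model.
print(prompt)
```

If the retrieval step surfaces the wrong documents, the generated answer inherits that mistake, which is exactly the dependency described above.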

Right now, for all but very small data sets, a hybrid search containing a mix of keyword-based and vector-based results is likely to produce the best results, as it combines the precision of keyword-based search with the recall of vector-based search. Traditional search engines have an advantage in hybrid search because they already have all the tools needed for good keyword search, such as language-specific synonyms, stemming, spellchecking, stopwords, facets and field weightings. Pure vector databases currently offer only relatively rudimentary keyword search capabilities and, apart from Weaviate, have some limitations in their hybrid implementations.
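One common way to implement this kind of hybrid combination is to normalize each result list’s scores onto a shared scale and blend them with a weight. The sketch below uses min-max normalization and a hypothetical alpha parameter; real engines offer their own variants, such as Reciprocal Rank Fusion:

```python
def min_max(scores):
    # Rescale raw scores to [0, 1] so BM25 and vector scores are comparable.
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_fuse(bm25_scores, vector_scores, alpha=0.5):
    # alpha weights the vector side: 0 = pure keyword, 1 = pure vector.
    bm25_n = min_max(bm25_scores)
    vec_n = min_max(vector_scores)
    docs = set(bm25_n) | set(vec_n)
    fused = {d: (1 - alpha) * bm25_n.get(d, 0.0) + alpha * vec_n.get(d, 0.0)
             for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# BM25 favours doc A, the vector search favours doc C, doc B does well in both;
# fusion rewards the document that both methods agree on.
bm25 = {"A": 12.0, "B": 8.0, "C": 1.0}
vectors = {"C": 0.95, "B": 0.80, "A": 0.20}
print(hybrid_fuse(bm25, vectors, alpha=0.5))
```

Because raw BM25 scores and vector similarities live on completely different scales, skipping the normalization step is a classic source of lopsided hybrid rankings.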

This means that if you have a workload that needs a hybrid search or a mixture of vector and keyword search for different purposes, the traditional Lucene based search engines are likely to be good candidates. It’s been relatively easy for them to add vector search to their existing capabilities. It will be much harder for the pure vector databases to add all the keyword search capabilities which have been added and refined over decades.

Added to this is the fact that the older systems have interesting capabilities beyond search to offer too. MongoDB Atlas for example is a full featured document database with a large variety of capabilities from graph DB, to stream processing and even edge computing capabilities. It also benefits from the close coupling of search with data platform, meaning no synchronization headaches. OpenSearch, Elasticsearch and Solr also have many other capabilities such as integrated Machine Learning, dashboards and much more.

On the other hand, the dedicated vector databases are focused on vector search and nothing else, so some include support for very high-dimension fields, additional distance measures and even multiple vector index types. So if you only care about vector search, have specific requirements, or need that extra flexibility, they may be the better choice because that’s their sole focus.

What about Scalability & Performance?

You can find a few benchmarks on the internet, but very little in terms of systematic comparisons, so we won’t repeat them here, although for those interested there is this ANN benchmark which covers some of the implementations in this article. What we can do is look at their scaling models. All of them apart from Chroma can scale vertically (bigger servers/pods) or horizontally (more servers/replicas). Chroma is designed to be used more like an embedded DB (think SQLite), although that may change in the future. The horizontal scalability of the others is based on more replicas of the index, which adds query capacity and availability. They also scale index capacity by sharding their indexes and distributing them across hardware, which means their potential vector index capacity is very high.

And Licensing?

Apart from Pinecone & MongoDB Atlas, all of the systems above have an Open Source offering, although they all have commercial offerings too, either as a premium version or via a Managed Service / Hosted version.

So when would I choose a pure vector database?

This is a good question. It basically boils down to three things:

Do you also need good keyword search?

If you are working in the context of an existing search implementation and/or know that traditional keyword search and all the associated capabilities such as synonyms, facets and so on will be needed in some way, then a pure vector database is probably not the best choice. If, on the other hand, you are working in a data science context where storage and management of the vectors is the main objective, then a pure vector database is a good choice, as these systems are more targeted to that workload.

Do you need very high dimensionality vectors?

There are also some considerations around vector dimensionality. The maximum number of dimensions varies for each system, but in general the vector databases support models with higher numbers of output dimensions. To put this in context, OpenAI’s text-embedding-ada-002 model outputs 1536 dimensions, and Elasticsearch and MongoDB support up to 2048, while vector databases such as Milvus and Weaviate support up to 32,768 and 65,535 respectively. The ability to handle very high dimensionality is therefore likely to be the key consideration and differentiator.

Do you need a wide variety of vector field types and distance measures?

Milvus also supports multiple vector field types, whereas the others typically rely on HNSW, so if you have more exotic or specialist indexing requirements, then a specialist vector database is likely to be preferable. Finally, both Milvus and Weaviate have a larger number of distance measures available, so again if you have specialist requirements or need the additional options, then a vector database is for you.

Of course, there are some other considerations such as market presence and maturity, supportability, manageability, resilience and so on but perhaps those are topics for another post.

Wrapping up

Now that vector search is available in both traditional search engines and vector databases, it can be difficult to know which one to choose. If you are new to search implementation and want to take advantage of vector capabilities, it can be daunting and difficult to know where to start.

We are here to help. With decades of experience in search and our Pureinsights Discovery Platform™, which takes advantage of the latest advances in vector search, Pureinsights is perfectly positioned to support you.

So, as always, feel free to CONTACT US with any comments or questions or to request a complimentary consultation to discuss your ongoing search and AI projects.

Cheers,

Matt
