The rise of the vector database

I’ve been writing about the use of word vectors in geoscience since 2015, but recently some exciting developments have emerged. A vector is an array of numbers that can be used to represent a word based on complex patterns of word co-occurrence.

Taking the cosine similarity between vectors enables us to measure the similarity between words, phrases, sentences and documents. The closer the cosine similarity is to 1, the more similar they are.
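As a minimal sketch of the idea, cosine similarity can be computed with nothing more than the Python standard library. The three-dimensional vectors below are invented purely for illustration; real word or sentence vectors typically have hundreds of dimensions and come from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, made up for this example (not output from any real model).
sandstone = [0.9, 0.1, 0.3]
siltstone = [0.8, 0.2, 0.4]
invoice = [0.1, 0.9, 0.0]

print(cosine_similarity(sandstone, siltstone))  # close to 1: similar
print(cosine_similarity(sandstone, invoice))    # much lower: dissimilar
```

Because the measure depends only on the angle between the vectors, two vectors pointing in the same direction score 1 regardless of their lengths.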

With the emergence of Large Language Models (LLMs) like ChatGPT and LLaMA, a key task has been prompt engineering on up-to-date, trusted (not Internet) content. In this way, vast organisational information repositories can be queried in natural language. It is, in my opinion, the biggest change in enterprise search for 20 years.

Take a question such as “What is the bottom hole temperature of the Bluejohn-1 well?”. To find answers in databases and documents behind the firewall within organisations, we need vector databases. This is because LLMs have restrictive limits on how much external text they can process in a single prompt, so they need to be given the most relevant subset of text, the part most likely to contain the answer.

Existing databases such as Elastic and PostgreSQL, new ones like Weaviate, Chroma and Pinecone, and similarity search libraries such as Faiss can store vectors derived from unstructured text. The vectors can be created at the sentence level, or the text can be split into chunks using libraries like LangChain.
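To make the chunking step concrete, here is a rough sketch in plain Python. It is not LangChain's API; `chunk_words`, `chunk_size` and `overlap` are hypothetical names chosen for this example, and splitting on whitespace is a simplifying assumption (production splitters usually respect sentence or paragraph boundaries and count tokens rather than words).

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of `chunk_size` words, each sharing
    `overlap` words with the previous chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# Tiny synthetic document: twelve placeholder words.
document = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_words(document, chunk_size=5, overlap=2):
    print(chunk)
```

The overlap is there so that a sentence straddling a chunk boundary still appears whole in at least one chunk. Each chunk would then be converted to a vector and stored alongside its text.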

The final step is to compare the vector of the question text with the vectors of the chunks in the vector database and pass the most similar chunks to the LLM. In reality, this takes just a few lines of Python code.
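The retrieval step can be sketched as follows, under some loud assumptions. The `embed` function here is a stand-in that simply counts words; a real system would call an embedding model and would look up precomputed chunk vectors in the vector database rather than re-embedding every chunk per query. The chunk texts are invented for illustration, and `top_k` and `build_prompt` are hypothetical helper names, not part of any library.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding (assumption): a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Assemble the text that would be sent to the LLM."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Invented chunks standing in for the contents of a vector database.
chunks = [
    "Bluejohn-1 well: bottom hole temperature was recorded during the final logging run.",
    "The canteen menu changes every Monday.",
    "Seismic interpretation of the northern block is scheduled for next quarter.",
]
question = "What is the bottom hole temperature of the Bluejohn-1 well?"

best = top_k(question, chunks, k=1)
print(build_prompt(question, best))
```

Only the best-matching chunks go into the prompt, which is how the approach stays within the LLM's limit on external text.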

This technique can have flaws, which I described in a previous post, so some additional processing can enhance results further.

Combined with LLMs now being accessible in Python from spaCy, the democratisation of deep learning capability continues at pace.

https://paulhcleverley.com/2019/06/03/word-embeddings-and-language-models/

https://python.langchain.com/docs/get_started/introduction.html

https://thenewstack.io/how-large-language-models-fuel-the-rise-of-vector-databases/

https://github.com/explosion/spacy-llm

https://paulhcleverley.com/2023/06/15/large-language-models-semantic-search-vectors-and-petroleum-systems/