
Towards more meaningful Generative AI: there is more to understanding language than just statistics. More meaningful answers and summaries can be generated by automatically pre-labelling document text chunks with entities and thematic topics before the vector cosine similarity step used to feed organisational content to Large Language Models (LLMs).
As with Google-like searches, when there are vast amounts of organisational content, what is most statistically popular tends to make it to the first results page. If a simple answer is sought, or the person is a novice in the discipline, that can be “good enough”. However, for exploratory search goals, we sometimes learn the most when we stray from the well-beaten path.
Using just the statistical similarity between the prompt and text chunks in a vector database means the most relevant ‘interesting’ organisational information may not always make the cut to be passed into an LLM. We will get a plausible answer or summary, but it may be ‘shallow’, and we may be obliviously satisfied, not knowing what was missed.
Take the prompt “What might be potential geological seals in basin X?”.
This is important for Carbon Capture and Storage (CCS) and Oil & Gas exploration, as well as for potential green infrastructure such as geological hydrogen storage.
In a large corpus of documents that mention ‘seals’, the vector database is unlikely to return any chunk as LLM input that does not already talk explicitly about ‘seals’. Summaries will be biased towards what is already well known.
Clues in text about geological formations that might contain, for example, laterally extensive impermeable shales, flooding mudstones, or the presence of helium in thick porous sandstones will most likely be left out of the Gen AI summary. These passages of geology text are unlikely to be deemed similar enough to the prompt “What might be potential geological seals in basin X?” to make the token cutoff of what can be sent to an LLM. They will be out-competed by their more obvious cousins. We will likely miss these clues and insights, and won’t get to deeper, more thoughtful Generative AI summaries. We probably won’t even know that we missed them.
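To make the failure mode concrete, here is a minimal, illustrative sketch using toy bag-of-words vectors in place of real dense embeddings. The two chunk texts are invented for illustration: one mentions ‘seals’ explicitly, the other describes seal-like geology (impermeable shales over porous sandstones) without using the word. Pure cosine similarity ranks the explicit chunk first and scores the implicit one at zero.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def bow(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(text.lower().split())

prompt = bow("potential geological seals in basin x")

# Invented chunk texts for illustration only.
chunks = {
    "explicit": bow("known seals in basin x include regional seals above the reservoir"),
    "implicit": bow("laterally extensive impermeable shales overlie thick porous sandstones"),
}

ranked = sorted(chunks, key=lambda k: cosine(prompt, chunks[k]), reverse=True)
print(ranked)  # the chunk that literally says 'seals' out-competes the shale description
```

The ‘implicit’ chunk shares no surface terms with the prompt, so its score is exactly zero: it would never make the token cutoff, however geologically relevant it is.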
“Once I was a scuba diver in a sea of words. Now I zip along the surface like someone on a Jet Ski.” — Nicholas Carr
Perhaps the key to unlocking deeper Gen AI answers and summaries from organisational content is to combine vector similarity with more Natural Language Processing (NLP) and Knowledge Engineering techniques.
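One way the pre-labelling idea could work in miniature: tag each chunk (and the prompt) with a thematic topic label whenever a domain lexicon matches, and inject that label into the vector before the cosine step. The lexicon, label name, and chunk text below are all invented assumptions for illustration; a real pipeline would use trained entity recognisers and topic models rather than a keyword list.

```python
from collections import Counter
import math

# Hypothetical domain lexicon: surface terms that signal the 'seal' concept.
SEAL_LEXICON = {"seals", "seal", "shales", "mudstones", "impermeable", "caprock"}

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def tag(text: str) -> Counter:
    """Bag-of-words plus an injected topic label when seal-related terms occur."""
    tokens = text.lower().split()
    vec = Counter(tokens)
    if any(t in SEAL_LEXICON for t in tokens):
        vec["topic_seal"] += 2  # extra weight for the injected thematic label
    return vec

prompt = tag("what might be potential geological seals in basin x")
chunk = tag("laterally extensive impermeable shales overlie porous sandstones")

# The shale chunk now shares the 'topic_seal' label with the prompt,
# so its similarity is no longer zero and it can compete for the token cutoff.
print(cosine(prompt, chunk) > 0)
```

The design choice is that the label acts as a bridge term: chunks that describe seal-like geology without the word ‘seals’ still overlap with the prompt through the shared concept, so the ‘interesting’ passages are no longer invisible to the retrieval step.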
Sometimes on the surface there is not much to see. We need to go deeper to get the real insights.
#artificialintelligence #largelanguagemodels #chatgpt #enterprisesearch #search #naturallanguageprocessing #knowledgeengineering #promptengineering #generativeai #digitaltransformation #digital #analytics #bigdata #knowledgemanagement #geosciences #earthscience #subsurface