What are we missing? The effect of semantics on search results.


Semantics is about meaning. We often use different terms to describe the same thing. This may be because the words mean much the same thing (synonyms), for example, or because we choose different levels of granularity when we describe something (hypernyms). See Table 1 for a more complete list.

Table 1 – Some different types of lexical semantics

I published an academic paper back in 2012 looking at the impact of semantics when searching geoscience information in petroleum exploration.

The 2012 study was based on 50 subject-based queries randomly sampled from the search logs of an oil & gas exploration company. These queries were run against the company’s corporate technical document library using the in-house search engine, which did not use semantics. The same queries were then run again incorporating a rich set of domain semantics (synonyms, acronyms, hypernyms etc.). Query expansion was used for the study, but there are other ways to incorporate semantics in a search. The number of relevant results was counted.
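To make the idea of query expansion concrete, here is a minimal sketch in Python. It is not the search engine or thesaurus used in the study: the thesaurus entries, the document titles, and the function names `expand_query` and `search` are all hypothetical, and matching is a simple case-insensitive substring test.

```python
# Hypothetical mini-thesaurus: illustrative entries only, not the study's thesaurus.
THESAURUS = {
    "carbonates": ["limestone", "dolomite"],
}

# Hypothetical document titles standing in for a document library.
DOCUMENTS = [
    "Porosity trends in carbonates of the study area",
    "Limestone facies and reservoir quality",
    "Dolomite diagenesis in a shallow platform",
    "Sandstone channel geometry",
]


def expand_query(term, thesaurus):
    """Return the original term followed by any related terms in the thesaurus."""
    return [term] + thesaurus.get(term, [])


def search(terms, documents):
    """Return indices of documents that contain at least one of the terms."""
    hits = []
    for index, text in enumerate(documents):
        lowered = text.lower()
        if any(term.lower() in lowered for term in terms):
            hits.append(index)
    return hits


# Plain keyword search versus the same search after query expansion.
keyword_hits = search(["carbonates"], DOCUMENTS)
semantic_hits = search(expand_query("carbonates", THESAURUS), DOCUMENTS)

print(keyword_hits)
print(semantic_hits)
```

In this toy library the keyword search finds only the document that literally mentions ‘carbonates’, while the expanded query also picks up the limestone and dolomite documents.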

A delta was calculated (#results found using semantics – #results found without the use of semantics). From this, an average was calculated of how much relevant information is typically missed in a single subject-based geoscience search query if the search engine does not use semantics.
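The calculation can be sketched in a few lines of Python. Two things here are my assumptions rather than details from the paper: the share missed is expressed as the delta divided by the number of relevant results found using semantics, and the counts are made-up numbers purely for illustration, not data from the study. The function name `missed_fraction` is also hypothetical.

```python
def missed_fraction(relevant_with, relevant_without):
    """Share of relevant results missed when semantics are not used.

    delta = relevant_with - relevant_without, expressed as a fraction of
    the relevant results found with semantics (an assumed normalization).
    """
    if relevant_with == 0:
        return 0.0  # nothing relevant was found either way
    delta = relevant_with - relevant_without
    return delta / relevant_with


# Made-up (with semantics, without semantics) counts for three queries.
counts = [(10, 5), (8, 8), (20, 15)]

fractions = [missed_fraction(w, wo) for w, wo in counts]
average_missed = sum(fractions) / len(fractions)

print(fractions)
print(average_missed)
```

Averaging the per-query fractions, rather than pooling all results first, gives every query equal weight regardless of how many results it returns. That is a design choice for this sketch only.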

From over 12,500 results, it was found that on average up to 43% of relevant results were missed in a single subject-based query if semantics were not used as part of the search. Furthermore, I discovered that the more words entered into a single subject-based query, the greater the potential for missed relevant results (Figure 1).

Figure 1 – Effects on percentage of relevant results returned and #words entered

The x-axis shows the number of words in a single subject-based query and the y-axis shows the percentage of relevant results found. The blue line is a typical ‘keyword’ search and the red line is a search using semantics. As the number of terms increases, the amount ‘missed’ increases if semantics are not used, due to combinatorial effects. I termed this phenomenon ‘Compounding of Semantic Field Ambiguity’ (CoSFA).

For example, a query on ‘carbonates’ may miss information items on ‘limestone’ or ‘dolomite’ which don’t mention ‘carbonates’ (limestone is a carbonate rock). A query on ‘carbonate buildups’ may miss items on ‘reefs’ or ‘mounds’ (synonyms). A combinatorial effect therefore takes place across all combinations of the synonyms, hypernyms and so on. This is not always the case, of course: sometimes a ‘multi-word’ subject-based query is more specific and helps disambiguate, so these are tendencies, not absolute laws.
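The combinatorial effect can be illustrated with a short Python sketch. It makes a simplifying assumption: each word in the query is given its own list of alternatives, whereas in practice ‘reefs’ and ‘mounds’ relate to the whole phrase ‘carbonate buildups’. The `VARIANTS` table and the function name `query_variants` are hypothetical.

```python
from itertools import product

# Hypothetical per-word alternatives (each list includes the word itself).
VARIANTS = {
    "carbonate": ["carbonate", "limestone", "dolomite"],
    "buildups": ["buildups", "reefs", "mounds"],
}


def query_variants(query, variants):
    """Return every phrasing obtained by swapping each word for its alternatives."""
    options = [variants.get(word, [word]) for word in query.split()]
    return [" ".join(combo) for combo in product(*options)]


one_word = query_variants("carbonate", VARIANTS)
two_words = query_variants("carbonate buildups", VARIANTS)

print(len(one_word), one_word)
print(len(two_words), two_words)
```

With three alternatives per word, a one-word query has 3 possible phrasings and a two-word query has 9. A keyword search matches only one of them, which is why the potential for missed results tends to grow with query length.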

Good searchers understand these issues, adapting and issuing many ‘similar’ queries into search engines to compensate. However, getting ‘no results’ is almost unheard of – there are even competitions to find what is termed a Googlewhack. This may lead to many geoscientists being “obliviously satisfied” with lots of results for their query, yet they may miss other relevant results because of semantics, and those results could potentially be ‘more relevant’ if they tell the user something they don’t already know. People cannot complain, however, about what they fail to find if they don’t know it exists.

After some recent work, both inside organizations and in researching and testing some well-known scholarly literature search engines on the Internet, it appears the findings may be as true today as they were 7 years ago. Improved search deployments and use of company thesauri (this is not necessarily a technology issue), together with improved search literacy, are two remedies to improve outcomes.

Quantifying the problem is of course the first step.




