Search query term recommendations for Scholarly and Enterprise Search
Google Scholar recently introduced recommended search terms for scholars based on the specificity of the query, a topic I published on in 2014. I have reviewed the ‘search serendipity’ literature in past posts, so I won’t go over that again. Instead, I will focus on a few methods in general terms.
Crowdsourcing previous queries
In the Google Scholar model, general queries receive general recommendations and specific queries receive specific recommendations. For example, a query on ‘corrosion’ may lead to general query recommendations such as ‘corrosion inhibitors’. A more specific query on ‘corrosion inhibitors’ may lead to more specific suggestions such as ‘corrosion inhibitors for zinc’. This caters for both generalists and specialists.
These techniques typically use statistics on crowdsourced queries, so in a way they look at past search behaviour through the rear-view mirror. For example, of the people who made a query on ‘corrosion’, the most popular compound term was ‘inhibitors’, so query terms are suggested by popularity. Search terms that are rarely used often have few or even no recommendations (e.g. try a query on ‘injectites’ in Google Scholar). When there are many past searches, typically only the top 10 are ever shown, a trait I have also seen in enterprise search technologies from large vendors.
People may still be limited in what they discover by their own knowledge of keywords, or by other searchers’ knowledge of keywords to use as search terms. There are variants that do not include the initial query term, but instead suggest queries based on what people ‘searched on next’, analogous to the Amazon.com ‘people who bought this also bought’ feature, but with queries rather than financial transactions. It is still discovery through the rear-view mirror: in this model it is impossible to be recommended a search query that nobody has made before. In Google consumer search the statistical volumes are so vast that we rarely consider this issue. In scholarly and enterprise search it could be a real problem, because search activity in a narrow domain will never be vast.
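As a rough illustration of popularity-based suggestion, the sketch below counts past queries that contain the entered term and returns the most frequent ones. It is a minimal sketch, not how Google Scholar or any vendor implements it: the function name `suggest_by_popularity` and the query log are invented for this example.

```python
from collections import Counter

def suggest_by_popularity(query_log, term, top_n=10):
    """Return the most frequent past queries that contain `term` as whole words.

    Hypothetical helper for illustration; `query_log` is a list of raw query strings.
    """
    term = " ".join(term.lower().split())
    counts = Counter()
    for raw in query_log:
        query = " ".join(raw.lower().split())
        # Pad with spaces so that only whole-word matches count.
        if query != term and f" {term} " in f" {query} ":
            counts[query] += 1
    return [query for query, _ in counts.most_common(top_n)]

# Invented query log, for illustration only.
log = [
    "corrosion inhibitors",
    "corrosion inhibitors",
    "corrosion inhibitors for zinc",
    "corrosion fatigue",
    "corrosion",
    "injectites",
]

print(suggest_by_popularity(log, "corrosion"))
print(suggest_by_popularity(log, "corrosion inhibitors"))
print(suggest_by_popularity(log, "injectites"))  # rare term: nothing to suggest
```

The last call shows the cold-start problem described above: a rarely searched term has no past queries to draw on, so the suggestion list is empty.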
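The ‘searched on next’ variant can be sketched in the same spirit. The example below assumes query logs are already grouped into per-user sessions (an assumption, since the post does not describe the log format) and counts which query followed which. The function names and session data are hypothetical.

```python
from collections import Counter, defaultdict

def build_next_query_index(sessions):
    """Map each query to a Counter of the queries that directly followed it."""
    index = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            if current != following:  # ignore immediate repeats of the same query
                index[current][following] += 1
    return index

def suggest_next(index, query, top_n=10):
    """Return the queries most often issued straight after `query`."""
    return [q for q, _ in index.get(query, Counter()).most_common(top_n)]

# Invented sessions, for illustration only.
sessions = [
    ["stuck-pipe", "lost circulation", "hole instability"],
    ["stuck-pipe", "lost circulation"],
    ["stuck-pipe", "differential sticking"],
]

index = build_next_query_index(sessions)
print(suggest_next(index, "stuck-pipe"))
print(suggest_next(index, "injectites"))  # never searched before: no suggestions
```

Suggestions here need not contain the original term, but they can still only ever be queries that somebody has already typed.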
A third method uses unsupervised word co-occurrence, exploiting the body text around the search term as it appears in documents.
Bigrams that occur close to the search term in text (but which, critically, do not include the search term itself) can be used, for a focus on semantic relatedness rather than semantic similarity. These can also surface content that does not contain the original search query term, which could be significant. In addition, ranking word co-occurrences by the discriminative Pointwise Mutual Information (PMI) measure, rather than just by what is most popular, seems to have a greater propensity for serendipity (unexpected, insightful and valuable information encounters). For example, a query on ‘stuck-pipe’ would yield bigram recommendations (without stuck-pipe in the name) such as ‘hole instability’ and ‘lost circulation’, while unigrams ranked by PMI would yield ‘caving’ and ‘sloughing’.
Research with geoscientists and engineers showed that descriptive multi-word recommendations were favoured over single words. When the search terms they initially entered were narrow, they preferred suggested queries ranked by PMI, whereas when their initial search terms were broad and general, they preferred single words ranked by popularity. The latter may reflect someone new to a topic; the former is more likely to be a subject matter expert. The research also showed that when presented with 30 suggestions, scientists clicked as many times outside the top 10 ranked suggestions as within them. Diversity can stimulate.
Topic modelling (Blei 2002) also helped suggest interesting phrases. However, its juxtaposed terms imposed significant cognitive load, so bigrams and trigrams were preferred as search filters.
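To make the difference between popularity and PMI ranking concrete, here is a minimal sketch. It treats each sentence (or text window) as one context and estimates PMI(term, w) = log( p(term, w) / (p(term) p(w)) ) from context counts. The function name, the choice of sentence-level contexts and the toy corpus are all assumptions for illustration, not the method used in the paper.

```python
import math
from collections import Counter

def rank_cooccurring_terms(contexts, term):
    """Rank words co-occurring with `term`, once by frequency and once by PMI.

    `contexts` is a list of token lists (e.g. sentences). Hypothetical helper.
    """
    n = len(contexts)
    word_df = Counter()   # number of contexts containing each word
    joint_df = Counter()  # number of contexts containing both the word and `term`
    for tokens in contexts:
        vocab = set(tokens)
        word_df.update(vocab)
        if term in vocab:
            joint_df.update(vocab - {term})

    term_df = word_df[term]
    # PMI = log( (joint/n) / ((term_df/n) * (word_df/n)) )
    pmi = {
        w: math.log((joint * n) / (term_df * word_df[w]))
        for w, joint in joint_df.items()
    }
    by_frequency = [w for w, _ in joint_df.most_common()]
    by_pmi = sorted(pmi, key=pmi.get, reverse=True)
    return by_frequency, by_pmi

# Invented toy corpus, for illustration only.
contexts = [
    ["stuck-pipe", "caving", "in", "the", "well"],
    ["stuck-pipe", "sloughing", "in", "the", "well"],
    ["the", "well", "was", "drilled", "in", "shale"],
    ["production", "from", "the", "well", "increased"],
]

by_frequency, by_pmi = rank_cooccurring_terms(contexts, "stuck-pipe")
print("by frequency:", by_frequency[:3])
print("by PMI:", by_pmi[:2])
```

In this toy corpus the frequency ranking is led by words that appear everywhere (‘the’, ‘well’, ‘in’), while PMI promotes words that appear mostly alongside the search term (‘caving’, ‘sloughing’). On real text, PMI tends to over-reward very rare words, so a minimum count threshold is usually sensible.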
Table 1 below shows for a query ‘stuck-pipe’ (a common oil & gas well drilling problem), the Top 30 search query suggestions using literature sources (70,000 papers) for:
– Algorithm A (single words occurring near stuck-pipe in text, ranked by frequency)
– Algorithm B (bigrams, i.e. two adjacent words, occurring near stuck-pipe in text but not including the search query term)
– Algorithm C (single words occurring near stuck-pipe in text, ranked by PMI)
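The bigram approach of Algorithm B could be sketched roughly as follows. The window size, the simple whitespace tokenisation and the function name `nearby_bigrams` are assumptions made for this example; the paper's actual parameters are not given here.

```python
from collections import Counter

def nearby_bigrams(tokens, term, window=5):
    """Count adjacent word pairs within `window` tokens of `term`,
    skipping any pair that contains `term` itself. Hypothetical helper."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != term:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        span = tokens[lo:hi]
        for first, second in zip(span, span[1:]):
            if term not in (first, second):
                counts[(first, second)] += 1
    return counts

# Invented sentence, for illustration only.
text = ("hole instability and lost circulation often precede stuck-pipe "
        "while lost circulation continues")
counts = nearby_bigrams(text.split(), "stuck-pipe", window=5)
print(counts.most_common(3))
```

A practical version would also filter stop words (so that pairs such as ‘and lost’ drop out) and avoid double counting where windows around repeated mentions overlap.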
To read the full paper click here.
These methods are not without their issues. The associative network needed to support a performant user interface is significantly larger than one that simply stores users’ search queries. There are methods to pre-compute it, and perhaps the prize is worth pursuing.
It could be argued that search term recommendations, as typically implemented in mass search tools, tend to target ‘lookup/known item’ search queries, where someone knows what they want. Popularity-based suggestions are often used to support light serendipity. However, more exploratory search modes may be less well catered for by current search term recommendations: cases where a question is not fully formed in the person’s mind, and where the hidden need may be ‘show me something I don’t already know’.
This could present an opportunity for improving scholarly search, enterprise search or in fact any search!