Author: phcleverley

Using Search Term Word Co-Occurrence for Browsing Search Results: What is most popular is not necessarily the most interesting.

I conducted some exploratory search research with geoscientists and engineers in numerous oil & gas companies back in 2014-2015, which I have recently revisited. Unlike lookup/known-item search, where a user is seeking something specific that they know exists and there is a ‘right answer’, exploratory search tasks are more loosely defined. The question is perhaps not fully formed in the searcher’s mind, learning and serendipitous discovery are invited, and there is no ‘right answer’. In this setting there are unlikely to be absolute laws, but certain algorithms may tend to give a greater propensity for valuable information discoveries.

Although studies show these types of search tasks are smaller in number than the ‘lookup/known item’ search tasks in organizations, they can be of potentially higher value as they can lead to new knowledge discovery.

One area of research investigated search term word co-occurrence – the terms that occur ‘around’ a user’s query terms in the body text of articles (Fig 1). Data courtesy of the Society of Economic Geologists (SEG) via GeoscienceWorld.


Fig 1 – The words that occur around a search query of ‘precambrian’ in the search results from thousands of geoscience articles. The words themselves are clustered by their similarity to one another (in clouds and list form). For example, clicking on the word ‘iron’ shows the user all the paragraphs where Precambrian and iron occur together.
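As a rough illustration of the idea behind Fig 1, the sketch below counts the words that fall within a few tokens of a query term and then retrieves the paragraphs where the query and a clicked term occur together. It is a minimal sketch and not the method used in the research: the window size, the tiny stopword list, the function names and the sample paragraphs are all assumptions made for illustration.

```python
import re
from collections import Counter

# Small hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are", "with", "by", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cooccurrence_counts(paragraphs, query, window=5):
    """Count words appearing within `window` tokens of each query occurrence."""
    query = query.lower()
    counts = Counter()
    for paragraph in paragraphs:
        tokens = tokenize(paragraph)
        for i, token in enumerate(tokens):
            if token != query:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                neighbour = tokens[j]
                if j != i and neighbour != query and neighbour not in STOPWORDS:
                    counts[neighbour] += 1
    return counts


def paragraphs_with(paragraphs, query, term):
    """Return the paragraphs in which both the query and the clicked term occur."""
    wanted = {query.lower(), term.lower()}
    return [p for p in paragraphs if wanted <= set(tokenize(p))]


# Made-up sample paragraphs, for illustration only.
paragraphs = [
    "Precambrian iron formations host large ore deposits.",
    "The Precambrian shield contains iron and gold.",
    "Jurassic sandstones form the main reservoir.",
]
counts = cooccurrence_counts(paragraphs, "precambrian")
iron_paragraphs = paragraphs_with(paragraphs, "precambrian", "iron")
```

Here `counts` acts as the pool of candidate filter terms, and `iron_paragraphs` mimics what a user would see after clicking on ‘iron’.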

In one experiment these were presented back to geoscientists and engineers as ‘filters’, and data were captured on which filters seemed to be of most interest to people. One interesting finding was that users clicked on as many terms outside the top 10 most frequently occurring (around their search terms in the body text of search results) as they did within the top 10 (Fig 2). There appeared to be a latent need to ‘show me something I don’t already know’.


Fig 2 – Exactly the same content as Fig 1 but with the top 30 most frequently co-occurring terms for the search term removed. Due to the power-law nature of statistical word frequency, what is most popular can hide more interesting and unusual associations for subject matter experts. Many people thought this yielded more interesting terms to filter on.
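The difference between the two views can be sketched as a simple rank cut-off: rank the co-occurring terms by frequency, then either show the head of the list or skip it. This is only a minimal sketch; the function name `filter_terms`, its parameters and the counts below are hypothetical and not taken from the study data.

```python
from collections import Counter


def filter_terms(counts, skip_top=0, limit=10):
    """Rank terms by frequency, drop the `skip_top` most frequent, keep `limit`."""
    ranked = [term for term, _ in counts.most_common()]
    return ranked[skip_top:skip_top + limit]


# Made-up co-occurrence counts with a steep, power-law-like drop-off.
counts = Counter(
    {"rocks": 50, "age": 40, "basin": 30, "iron": 6, "komatiite": 2, "stromatolite": 1}
)

popular = filter_terms(counts, skip_top=0, limit=3)  # the usual 'most popular' view
unusual = filter_terms(counts, skip_top=3, limit=3)  # the same data with the head removed
```

With a real corpus, `skip_top=30` would correspond to the view in Fig 2.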

Another finding showed evidence that the specificity of the search terms entered by users may be a predictor of which algorithm is best for presenting co-occurrence filters that match intent. For example, a broad term (such as ‘Geology’) may be better suited to showing the very common words that occur around it in text as filters. For a very specific search term such as ‘injectites’, filtering out the most common words around the term may tend to be more beneficial. An area for further research.

Published Research Articles

Journal of Information Science here

Journal of Information and Knowledge Management here

Journal of Knowledge Organization here

Semantic Word Cloud here


Applying Natural Language Processing (NLP) to Scholarly Geoscience Literature.

I am presenting research on Artificial Intelligence (AI) in academic publishing on 26-27 Sep in Washington DC, along with Google, Web of Science, SAGE, Taylor & Francis etc, at the Silverchair technology platform conference.

In my talk I’ll be covering unsupervised machine learning, supervised machine learning and rule-based methods (and hybrids), with actual examples of how each technique has been applied to Geoscience scholarly literature to yield new insights.

I will be representing both Robert Gordon University (RGU) and GeoscienceWorld.

RGU, in Aberdeen, Scotland, has roots back to 1729, when Robert Gordon had a vision to provide accessible education and enhanced opportunities across society. It conducts world class research and is a top modern university.

GeoscienceWorld is a not-for-profit cooperative of independent scholarly publishers in Geoscience. Founders include the Geological Society of America (GSA), Geological Society of London (GSL) and the American Geosciences Institute (AGI).

Automatically Summarizing Petroleum Exploration Texts by Events and Dates.

One form of text summarization is by a timeline of some sort. In academic literature, this can help follow a discourse through time using bibliographic reference dates in the body of text.

In business literature, this may be more related to events and dates of some activity. In Petroleum Exploration, for example, it may refer to the opening up of acreage, license rounds, seismic surveys, well drilling, dry holes or hydrocarbon discoveries, farm-ins, field development, relinquishments and so forth.

It is relatively easy, using Named Entity Recognition (NER) techniques, to detect many patterns in text, including People, Places and Locations (for example with the Stanford and GATE toolkits). That said, language understanding is hard, so nothing is perfect (but then humans make mistakes as well).

Dates are also straightforward, although the range of possible ways to express times and dates can be vast in certain contexts. Python has several libraries for this, and there is also research from Facebook (Duckling).
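To give a feel for the simplest end of date detection, the sketch below uses only Python’s standard `re` module to pick out single years, month-plus-year expressions, decades and year ranges. It is a minimal sketch under stated assumptions: the pattern only covers years from 1900 to 2099 and a handful of formats, the names `DATE_PATTERN` and `extract_dates` are hypothetical, and dedicated libraries handle far more variation than this.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Alternatives are tried in order: ranges first, then decades, then single years.
DATE_PATTERN = re.compile(
    rf"\b(?:"
    rf"(?P<start>(?:19|20)\d{{2}})\s*(?:-|–|to)\s*(?P<end>(?:19|20)\d{{2}})"  # 1985-1990
    rf"|(?P<decade>(?:19|20)\d0)['’]?s"  # 1980s or 1980's
    rf"|(?:(?P<month>{MONTHS})\s+)?(?P<year>(?:19|20)\d{{2}})"  # June 1994 or 1994
    rf")\b"
)


def extract_dates(text):
    """Return (kind, start_year, end_year, matched_text) tuples found in text."""
    results = []
    for m in DATE_PATTERN.finditer(text):
        if m.group("start"):
            results.append(("range", int(m.group("start")), int(m.group("end")), m.group(0)))
        elif m.group("decade"):
            start = int(m.group("decade"))
            results.append(("range", start, start + 9, m.group(0)))
        else:
            year = int(m.group("year"))
            results.append(("point", year, year, m.group(0)))
    return results


# Made-up sentence, for illustration only.
text = (
    "Exploration began in the 1980s. Licences were awarded 1985-1990, "
    "and the first discovery came in June 1994."
)
dates = extract_dates(text)
```

Each result is tagged as a point or a range so that it could later be placed on a timeline.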

In my opinion, a particularly useful web tool that illustrates what can be done by applying these techniques is TimeLineCurator, by the University of British Columbia InfoVis Group. A nice overview diagram of Visual Analytics is here.

For example, the image below (Figure 1) shows events automatically detected in text discussing the exploration history of the Norwegian Sea.


Figure 1 – Automatic Summary of Exploration History in a Basin

On the far left in the top half of the screen, exploration begins (1980s), moving towards the present day on the right. The highlighted black circle allows the user to interrogate key events (in this case the first Permian discovery by Statoil in 1994). Some dates are points (circles), others are ranges (lines). Here the different colours represent different information collections (e.g. NPD vs Oil & Gas Journal). The panels in the bottom half of the screen show the text fragments/sentences on interrogation.
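One way to think about the data behind such a timeline is as a list of events, each with a start year, an end year, the collection it came from and the sentence it was found in. The sketch below is purely illustrative and is not how TimeLineCurator is implemented: the `TimelineEvent` class, its fields, the assignment of events to collections and the sample sentences are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimelineEvent:
    start_year: int
    end_year: int
    collection: str  # source collection, shown as a colour in the visual
    sentence: str    # text fragment shown when the event is interrogated

    @property
    def is_point(self):
        """Points are drawn as circles, ranges as lines."""
        return self.start_year == self.end_year


def build_timeline(events):
    """Order events chronologically, oldest on the left."""
    return sorted(events, key=lambda e: (e.start_year, e.end_year))


# Illustrative events; the collection labels are assigned arbitrarily here.
events = [
    TimelineEvent(1994, 1994, "NPD", "First Permian discovery by Statoil."),
    TimelineEvent(1980, 1989, "Oil & Gas Journal", "Exploration begins."),
]
timeline = build_timeline(events)
```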

These interactive visuals may be particularly useful to interrogate a body of text that is simply too large (in this age of big data) for a human to read, given some time constraint.

Our cognitive processing limitations.

We know that around 95% of the time we do not look beyond page 1 in Google. In these cases, to paraphrase Nicholas Carr, instead of being a scuba diver in a sea of words, we zip along the surface on a Jet Ski.

So these techniques may be of some use in surfacing events of interest that we may otherwise have missed (or simply don’t know we missed). An area for a deeper dive.


TimeLineCurator: Interactive Authoring of Visual Timelines from Unstructured Text. IEEE Transactions on Visualization and Computer Graphics (TVCG), Proceedings of the IEEE Conference on Visual Analytics Science and Technology (VAST), Chicago, USA, 2015.

Extracting Knowledge from Text using AI


Thoroughly enjoyed two days of workshops with the Oil and Gas Technology Centre (OGTC) this week in Aberdeen. The OGTC’s goal is to maximise economic recovery from the UK Continental Shelf, supported by Government.

As well as participating in workshops, I also shared some results of my research on predictive geoscience sentiment analysis and its role in stimulating new insights. Thanks to all the staff for coordinating the event and to the Operators, Service Companies and Academia for some great participation.

An exciting time to be involved in Geoscience and Data Science!

Artificially Intelligent Sub-Surface

Delighted to be invited as a keynote speaker for the Oil and Gas Technology Centre (OGTC) workshop on artificially intelligent sub-surface this month, 19-20 June in Aberdeen, representing Robert Gordon University.


Artificially intelligent sub-surface is one of the six themes the OGTC are working on for Digital Transformation in the oil and gas industry. More here:


Review of Enterprise Search: Journal of Information Science Paper

Martin White (Visiting Professor at the University of Sheffield and Managing Director of IntranetFocus) has written a review of a recent academic paper on enterprise search that I authored with Professor Simon Burnett:

“Dr Paul Cleverley and Professor Simon Burnett (Robert Gordon University) have published in the Journal of Information Science what is without doubt a landmark research paper on the factors that influence user satisfaction with enterprise search applications”

“No matter how small or large your organization, if you have responsibility for search management you should be taking this remarkable paper, marking it up para by para, and then using it to benchmark your approach to achieving the levels of search satisfaction that your employees expect”

“This research will change the way that the enterprise search community (and that includes software vendors) consider the opportunities and challenges of effective enterprise search management”