Search is on the move…


Today, there are more search queries made on mobile devices than on desktop computers, which has probably been one of the major changes within information search during the past decade, along with the rise of voice search.

Over the summer I was invited to review the book ‘Mobile Search Behavior: An In-depth Analysis Based on Contexts, APPs, and Devices’ by Dan Wu and Shaobo Liang. The review has been published this week in the Journal of Librarianship and Information Science (JOLIS), an academic journal published by SAGE.

It is a fascinating book based on empirical research into how social context, time of day and location influence needs, behaviours and motives. Executing transactions, curiosity, time-killing and learning are some of the motivations identified.

The increasing use of search through apps, rather than just through the traditional web browser, is also an extremely interesting trend and one particularly relevant to technology designers.

Like the majority of search behaviour research in general, most academic research on mobile search studies university students. This presents an opportunity for further research into search behaviour in the workplace.

More here:


From Unstructured Text to a Graph Visualization in 60 minutes

From a collection of documents to a graph visualization in 60 minutes using open source tools. As a quick exploratory view of a collection of documents too vast to read, these techniques may be useful.

Using a tool like Python, create n-grams from your text (bi-grams and tri-grams are quickest). Terms are nodes and the spaces between adjacent terms are edges; frequency of occurrence can be visualised through the thickness or colour of the edges (the lines connecting nodes). Load these into any graph database or structure and visualise, storing URIs so you can drill down to the documents once you find an interesting association.
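As a rough illustration of the n-gram step, here is a minimal sketch using bi-grams only. The function name `build_graph`, the `doc://` URIs and the two toy sentences are made up for the example; this is not the code behind the 2017 example.

```python
import re
from collections import Counter, defaultdict

def build_graph(documents):
    """Turn {uri: text} into weighted bi-gram edges plus the URIs behind each edge."""
    weights = Counter()          # edge -> frequency (drives line thickness or colour)
    sources = defaultdict(set)   # edge -> URIs of documents containing that edge
    for uri, text in documents.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        # Each pair of adjacent terms is one edge between two nodes.
        for edge in zip(tokens, tokens[1:]):
            weights[edge] += 1
            sources[edge].add(uri)
    return weights, sources

# Toy documents (hypothetical), keyed by a URI to drill down to later.
docs = {
    "doc://a": "Gold deposits occur in quartz veins",
    "doc://b": "Quartz veins host gold deposits",
}
weights, sources = build_graph(docs)
for (left, right), count in weights.most_common(3):
    print(left, "->", right, count, sorted(sources[(left, right)]))
```

The `weights` and `sources` dictionaries can then be loaded into whichever graph database or plotting library you prefer.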

Use semantic techniques to enhance accuracy and/or algorithms like Pointwise Mutual Information (PMI) for discriminatory analysis. Example below from Feb 2017 when I used 6,000 PDF articles from SEG Journal (courtesy GSW).
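For reference, PMI for a pair of terms x and y is log2( p(x,y) / (p(x) p(y)) ): it is high when two terms occur together more often than their individual frequencies would suggest. A minimal sketch over adjacent word pairs follows; the function name `pmi_scores` and the toy sentence are assumptions for illustration only.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """PMI for each adjacent word pair: log2(p(x, y) / (p(x) * p(y)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy text (hypothetical): 'the' is frequent, so pairs involving it score lower.
tokens = "the gold in the vein and the gold in the quartz vein".split()
scores = pmi_scores(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

On real corpora PMI tends to over-reward very rare pairs, so a minimum frequency threshold is usually worth adding.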

Very quick and very simple. When the context is limited (visualising the entire graph structure is normally too dense), this may be capable of highlighting surprising connections and facilitating learning.

YouTube Video

#bigdata #analytics #geology #visualization #graph

Geohealth: Using Text Analytics to Test Hypotheses

Geohealth: Text analytics can be useful for testing hypotheses and for exploratory data analysis. I’ve taken the cosine similarity between the word vectors of each US State and ‘Cadmium’ in 100 years of economic geology literature. States with above-average similarity are shown in orange, those below average in blue. This is combined with health data from national databases per US state. On the x-axis is cancer deaths, on the y-axis is Alzheimer’s. Obviously it’s a complex area and there is nothing statistically significant in this chart, but it is useful for experimenting and for identifying areas of interest for further research.

Text: Society of Economic Geologists (via GeoscienceWorld)

Data: CDC, Alzheimer’s Association
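The similarity and colouring step can be sketched as follows. The three-dimensional vectors and the `state_a` style names are invented placeholders, not values from the SEG corpus; real word vectors would come from a trained embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Made-up toy vectors (assumption): stand-ins for learned word embeddings.
vectors = {
    "cadmium": [1.0, 0.0, 1.0],
    "state_a": [1.0, 0.0, 1.0],
    "state_b": [0.0, 1.0, 0.0],
    "state_c": [1.0, 1.0, 1.0],
}

target = vectors["cadmium"]
sims = {name: cosine(vec, target) for name, vec in vectors.items() if name != "cadmium"}
mean_sim = sum(sims.values()) / len(sims)

# Above-average similarity is coloured orange, the rest blue.
colours = {name: "orange" if s > mean_sim else "blue" for name, s in sims.items()}
print(colours)
```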

Chemical Element Entity Extraction from Economic Geology Journals: Combining word frequency in text with measured numerical data

Finding trends and patterns in unstructured text can be possible without combining with other data sources. However, combining derived structured data from text analytics with measured real world data can also lead to differentiating insights.

Figure 1 shows the frequency of occurrence of chemical elements in the text of 100 years of the Society of Economic Geologists (SEG) literature, courtesy of GeoScienceWorld, on the x-axis, and the natural abundance of each element in the crust, taken from existing databases, on the y-axis.

Fig 1 – Combining derived data from text analytics with measured data

The plot shows all data, but it could quite easily be animated through time to surface trends and hot topics. As an analogy, Professor Hans Rosling has created some of the best displays I have seen, on human population (video here).

Journals depict social history: what scholars are writing about. The motivations and drivers for this behaviour can be numerous and overlapping. Clearly new discoveries, economic forces and commodity prices will be among them; health issues (Geo-health) may be another.

From Figure 3, we can see Yttrium has the highest frequency in the articles of the Rare Earth Elements. Of the base metals, Hafnium appears to be of least interest in the literature, with iron, copper and lead the most popular and the precious metals gold and silver close behind. In terms of word frequency, there are many elements rarer than radium!

These are relatively trivial displays to produce. I make no claims or predictions based on them; I just find them interesting to visualize. A petroleum equivalent may be to count mentions of source rocks around the world by geological age in the literature and compare these to actual data showing producing fields and proven reserves. Maybe gaps can be surfaced as areas to exploit, driven by potential described in vast amounts of literature that has yet to be realized.



Unsupervised Machine Learning: Clustering Geoscience Text Using Co-occurrence windows and Principal Component Analysis (PCA).

Unsupervised machine learning techniques exploit latent patterns in text (in layman’s terms, normally some form of complex word co-occurrence) rather than rules driven by human-labelled data. As this is essentially an inductive technique, it can be useful for stimulating ideas and questions that the information professional or geoscientist may not have thought of a priori, without the help of the algorithm.

A typical output is a cross-plot used to represent the most discriminatory features found in the text through various statistical methods. The goal of clustering these data is to sub-divide a set of items so that similar items fall into the same cluster and dissimilar items into different clusters. Common consensus is that there is no one-size-fits-all solution to clustering, or even agreement on what ‘good’ clustering should look like. Some clustering algorithms force you to choose the number of clusters that are computed; others identify hierarchical relationships between clusters. Ultimately, any algorithm imposes its own set of biases on the clusters it constructs, so a rule of thumb for reducing the influence of any single bias may be to use several clustering algorithms.

The example clustering below (Fig 1) uses the Society of Economic Geologists (SEG) corpus of 100 years of research articles (courtesy of GeoScienceWorld). Unlike normal clustering applied to the whole ‘document’, this works off text co-occurrence windows: in this case, the sentences that mention the term ‘Precambrian’, reduced to two dimensions using PCA.
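A minimal sketch of that pipeline, assuming NumPy is available: select the sentences that mention the term, build a simple word-count matrix, and project it onto its first two principal components. The helper names and the toy text are hypothetical, and the original work may well have used different features and tooling.

```python
import re
import numpy as np

def sentences_with(term, text):
    """Return the sentences in `text` that mention `term` (case-insensitive)."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if term.lower() in s.lower()]

def pca_2d(sentences):
    """Project bag-of-words sentence vectors onto their first two principal components."""
    tokenised = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    vocab = sorted({w for words in tokenised for w in words})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(sentences), len(vocab)))
    for row, words in enumerate(tokenised):
        for w in words:
            X[row, index[w]] += 1
    X -= X.mean(axis=0)                      # centre each word column
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :2] * S[:2]                  # sentence coordinates for the cross-plot

# Toy text (assumption), standing in for the corpus.
text = ("Precambrian iron formations are banded. Gold occurs in Precambrian shields. "
        "Younger rocks overlie them. Precambrian greenstone belts host gold.")
hits = sentences_with("Precambrian", text)
coords = pca_2d(hits)
print(coords.shape)
```

Each row of `coords` is one sentence; plotting the two columns against each other gives the kind of cross-plot on which clusters can then be identified.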

Text Clustering

Fig 1 – Manual clustering applied to the SEG corpus search term of Precambrian

Four main clusters were identified, relating to ‘Iron Formations’, ‘Shields’, ‘Gold Deposits’ and ‘Igneous Mineralization’.

These could be seen to represent a high-level ‘summary’ of the main topics related to the Precambrian in this corpus: a form of high-level subject summarization. There are of course numerous other methods, including LDA Topic Modelling. I conducted some research a few years ago using Topic Modelling (Blei) with petroleum geoscientists and engineers, finding evidence that these techniques were capable of surfacing new knowledge to experienced oil & gas professionals, driven by text corpora.

At the recent Oil & Gas Technology Centre (OGTC) event I was asked which techniques geoscientists should use and which is best for text: supervised or unsupervised machine learning. This may be akin to the knowledge organization question of which is best for organizing documents: ‘folders or metadata tags’? There will be evangelists on both sides, but when you scrutinize the evidence and the pros and cons, it is likely the answer will be ‘use both as a strategy’, as it depends on the situation! I find it’s always useful to examine text using some form of clustering before we impose all our biases on it!

Using Search Term Word Co-Occurrence for Browsing Search Results: What is most popular is not necessarily the most interesting.

I conducted some exploratory search research with geoscientists and engineers in numerous oil & gas companies back in 2014-2015, which I have recently revisited. In lookup/known item search a user is seeking something specific: an existing need, something they know exists, where there is a ‘right answer’. Exploratory search tasks are more loosely defined. The question is perhaps not fully formed in the mind, learning and serendipitous discovery are invited by the searcher, and there is no ‘right answer’. In this case, there are unlikely to be absolute laws, but certain algorithms may have a greater propensity to lead to valuable information discoveries.

Although studies show these types of search tasks are smaller in number than the ‘lookup/known item’ search tasks in organizations, they can be of potentially higher value as they can lead to new knowledge discovery.

One area of the research investigated search term word co-occurrence: the terms that occur ‘around’ the user’s query terms in the body text of articles (Fig 1). Data courtesy of the Society of Economic Geologists (SEG) via GeoscienceWorld.


Fig 1 – The words that occur around a search query of ‘precambrian’ in the search results from thousands of geoscience articles. The words themselves are clustered by their similarity to one another (in clouds and list form). For example, clicking on the word ‘iron’ shows the user all the paragraphs where Precambrian and iron occur together.

In one experiment these were presented back to geoscientists and engineers as ‘filters’, and data was captured on which filters seemed to be of most interest to people. One interesting finding was that users clicked on as many terms outside the top 10 most frequently occurring (around their search terms in the body text of search results) as they did within the top 10 (Fig 2). There appeared to be a latent need to ‘show me something I don’t already know’.


Fig 2 – Exactly the same content as Fig 1 but with the top 30 terms most frequently co-occurring with the search term removed. Due to the power law nature of statistical word frequency, what is most popular can in some cases hide more interesting and unusual associations for subject matter experts. Many people thought this yielded more interesting terms to filter on.

Another finding showed evidence that the specificity of the search terms entered by users may be a predictor of which algorithm is best for presenting co-occurrence filters to match intent. For example, a broad term (such as ‘Geology’) may be better suited to showing the very common words that occur around it in text as filters to match intent. For a very specific search term such as ‘injectites’, filtering out the most common words around the term may tend to be more beneficial. An area for further research.
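The idea behind Fig 1 and Fig 2 can be sketched in a few lines: count the words within a window around the query term, then optionally skip the most frequent ones before offering the rest as filters. The function names, the window size and the toy paragraphs are assumptions for illustration and not the system used in the study; no stop-word removal is applied.

```python
import re
from collections import Counter

def cooccurring_terms(query, paragraphs, window=5):
    """Count words appearing within `window` tokens either side of `query`."""
    counts = Counter()
    q = query.lower()
    for paragraph in paragraphs:
        tokens = re.findall(r"[a-z]+", paragraph.lower())
        for i, token in enumerate(tokens):
            if token == q:
                neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                counts.update(t for t in neighbours if t != q)
    return counts

def filters(counts, skip_top=0, limit=10):
    """Rank terms by frequency, dropping the `skip_top` most frequent first."""
    ranked = [term for term, _ in counts.most_common()]
    return ranked[skip_top:skip_top + limit]

# Toy paragraphs (hypothetical), standing in for search results.
paragraphs = [
    "Precambrian iron formations",
    "Precambrian iron ore",
    "Precambrian iron gold deposits",
    "Precambrian stromatolites",
]
counts = cooccurring_terms("precambrian", paragraphs)
print(filters(counts))              # most popular first, as in Fig 1
print(filters(counts, skip_top=1))  # most popular removed, as in Fig 2
```

With `skip_top=30` on a real corpus, the filter list would correspond to the Fig 2 view.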

Published Research Articles

Journal of Information Science here

Journal of Information and Knowledge Management here

Journal of Knowledge Organization here

Semantic Word Cloud here

Applying Natural Language Processing (NLP) to Scholarly Geoscience Literature.

I am presenting research on Artificial Intelligence (AI) in academic publishing on 26-27 Sep in Washington DC, along with Google, Web of Science, SAGE, Taylor & Francis and others, at the Silverchair technology platform conference.

In my talk I’ll be covering unsupervised machine learning, supervised machine learning, rule based methods (and hybrids) with actual examples of how each technique has been applied to Geoscience scholarly literature to yield new insights.

I will be representing both Robert Gordon University (RGU) and GeoscienceWorld.

RGU in Aberdeen, Scotland, has roots going back to 1729, when Robert Gordon had a vision to provide accessible education and enhanced opportunities across society. It conducts world class research and is a top modern university.

GeoscienceWorld is a not-for-profit cooperative of independent scholarly publishers in Geoscience. Founders include the Geological Society of America (GSA), Geological Society of London (GSL) and the American Geosciences Institute (AGI).