Beyond Google


I gave a lecture this week on search & analytics to students on the online Petroleum Data Management course at Robert Gordon University. It was a thoroughly enjoyable session, with excellent discussion, debate and questions from knowledgeable students around the world, most of them in full-time employment.

My topic was ‘Beyond the Search Box and Ten Blue Links’: how a new generation of search tools is emerging in the workplace and challenging the cosy search ‘habitus’ we are used to. These tools are taking search from simply remembering (retrieving information) up Bloom’s taxonomy to higher levels of cognitive ‘human-like’ tasks such as comparing, contrasting, summarizing and predicting. I also presented some practical examples of work-task-specific search tools and use cases in the Oil and Gas Industry.

Internet search engines like Google have been (and remain) tremendously successful, a social phenomenon. For some tasks they have arguably become an epistemology, ‘how we come to know things’, as covered in my previous posts last year. However, ranking tends to be popularity based, so some knowledge may be hidden by its obscurity; and although people find and discover relevant, useful information, they may be limited by their own knowledge of which keywords to enter into the search box. Research I conducted with geoscientists, published in peer-reviewed journals in 2014, highlighted some issues with Internet searching using tools like Google:

“Problem I have with Google is choosing right selection of words to find something”

“With analogues you don’t know what terms to query on because you don’t know what they are”

“I use Google as an exploratory tool. Some things difficult to find in Google”

“How do I know I have really found the most relevant?”

With exploratory search task goals there is no single correct answer (unlike lookup/known-item search task goals): the user is learning as they search, and what they find changes their information needs, which are dynamic. That means a user can often find plenty of relevant, useful material in the first few pages of search results and be ‘obliviously satisfied’, unaware of the critical (perhaps even game-changing) information they failed to find.

Some of my previously published research supported this, showing no relationship between ‘user satisfaction’ and how well users actually performed exploratory search tasks using standard ‘Google-like’ search tools.

Any user interface is effectively a theory: what we think is the best way to present search results. The Google-like ‘search box and ten blue links’ seems efficient for certain search task goals. However, for exploratory searching in the workplace, and for meeting the needs of subject matter experts performing specific tasks, we may need to think ‘outside the box’. There has been significant and ongoing research into exploratory search user interfaces.

As one research participant (a subject matter expert in geoscience) commented about search results ranked by traditional frequency/popularity methods: “relevant but not interesting”. Perhaps for some tasks subject matter experts really want to be shown something they don’t already know. Whilst it is arguably impossible for a technology system to ‘know’ what someone will find surprising, some techniques, such as content-driven algorithmic prompts, have proven more likely than others to surface it.

So perhaps in some workplace situations we need to move beyond what is relevant to what is useful. According to John Maeda in The Laws of Simplicity, ‘simplicity is about subtracting the obvious and adding the meaningful’.

Finally, moving beyond the ‘search box and ten blue links’ means more than user interface design and technology. The literacy of the searcher is likely to be key for exploratory search: not just query formulation, but the metacognitive processes of planning, monitoring and reflecting while searching. Some commentators suggest that if you get your content organized and your technology right, ‘search will take care of itself’. Perhaps that holds for very simple lookup/known-item search goals, but it is highly unlikely for exploratory search; taken to its conclusion, it is a philosophy that denies human agency in the information search process. The evidence points to search literacy as key for exploratory search tasks in the workplace.


Big Data in the Geosciences: Geoscience Aware Sentiment Analyzer


A geoscience-aware text sentiment algorithm improves on out-of-the-box sentiment tools such as IBM Watson, Google, Microsoft and Amazon by over 30% for geoscience sentiment in text.

Presented early research findings today at the Janet Watson ‘Big Data in the Geosciences’ conference at the Geological Society of London.

Google opened proceedings with a talk on Satellite Imagery and the Earth Engine; subsequent talks ranged from using Twitter for early warnings of earthquakes, Virtual Reality and Digital Analogues, through to applying deep learning to detect volcano deformation. Some fascinating insights.

My latest research addressed sentiment/tone, the context around mentions of petroleum system elements (such as source rock, migration, reservoir and trapping) in literature, company reports and presentations. The hypothesis is that stacking somewhat independent opinion/tone in text, the averages, the outliers, the contradictions, may show geoscientists what they don’t know and challenge what they think they do know.
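To make the ‘stacking’ idea concrete, here is a minimal sketch (with made-up mention data and column names) of how independent sentiment scores around petroleum system elements could be aggregated with pandas, so that averages surface consensus and spread surfaces contradiction:

```python
import pandas as pd

# Hypothetical extraction output: one row per sentiment-bearing mention of a
# petroleum system element, with a classifier score in [-1, 1].
mentions = pd.DataFrame({
    "element": ["source rock", "source rock", "reservoir", "trap", "reservoir"],
    "document": ["rpt_01", "rpt_02", "rpt_01", "rpt_03", "rpt_02"],
    "score": [0.8, -0.6, 0.4, 0.1, 0.5],
})

# Stack the independent opinions: the mean shows consensus, the standard
# deviation flags elements on which the literature disagrees.
summary = mentions.groupby("element")["score"].agg(["mean", "std", "count"])
contested = summary[summary["std"] > 0.5]

print(summary)
print(contested)  # the contradictions worth a geoscientist's closer look
```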

The research question was to assess whether a geoscience-aware algorithm could improve on the existing APIs/algorithms used for sentiment analysis, and how useful the resulting visualizations might be.

Using a held-back set of 750 labelled examples for testing, the Geoscience Aware text sentiment analyZER (GAZER) algorithm achieved 90.4% accuracy for two classes (positive and negative) and 84.26% accuracy for three classes (positive, negative and neutral). This compared favourably with out-of-the-box Paragraph Vector and Naïve Bayes approaches. It also compares favourably with the out-of-the-box sentiment cloud APIs from IBM Watson, Microsoft, Amazon and Google, which averaged approximately 50% accuracy for the three classes.
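As an illustration of the kind of out-of-the-box generic baseline involved, below is a minimal sketch of a Naïve Bayes sentiment classifier evaluated on a held-back test set using scikit-learn; the sentences and labels are toy stand-ins, not the actual research data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the labelled geoscience sentences; in practice the
# training set and the 750 held-back test sentences would be loaded from file.
train_sentences = [
    "excellent oil-prone source rock with high TOC",
    "reservoir quality is severely degraded by cementation",
    "the seal is breached and the trap has leaked",
    "thick porous sandstone with good permeability",
]
train_labels = ["positive", "negative", "negative", "positive"]
test_sentences = ["good quality source rock", "reservoir is tight and cemented"]
test_labels = ["positive", "negative"]

# Generative Naive Bayes baseline over unigram and bigram counts.
baseline = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
baseline.fit(train_sentences, train_labels)

predicted = baseline.predict(test_sentences)
print(f"accuracy on held-back set: {accuracy_score(test_labels, predicted):.1%}")
```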

This supports findings in other areas showing the need to customize sentiment analysis for specific domains, and the criticality of training data specific to the work task in hand. The findings also support existing literature suggesting that generative probabilistic machine learning algorithms may perform better than discriminative ones when classifying snippets of information, such as sentences and bullets in PowerPoint presentations.

Early evidence suggested resulting visualizations such as streamgraphs of the sentiment data could be used to challenge individual biases and organizational dogma, potentially generating new knowledge – presenting an area for further research.

Presentation available on SlideShare: Click Here

The 750 labelled sentences (the test set) and a simple Python extraction script are on GitHub.


Big Data in the Geosciences

Will be presenting some of my research on detecting “geological sentiment” in text at the Janet Watson Geological Society of London meeting on 27th Feb.

This will include showing how the Geological-sentiment AnalyZER (GAZER) algorithm I developed in Python compares to the sentiment API classifiers from Google Cloud, IBM Watson, Microsoft Azure, Amazon Comprehend and Lexalytics Semantria.

The test set used to compare accuracy is over 1,000 sentences relating to petroleum systems, labelled by retired geologists as either ‘positive’, ‘negative’ or ‘neutral’.
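For anyone wanting to reproduce this kind of comparison, a single sentence can be scored against one of these cloud classifiers in a few lines. The sketch below uses the Google Cloud Natural Language Python client (assuming credentials are already configured); the ±0.25 cut-offs for ‘neutral’ are my own illustrative assumption, not a vendor recommendation:

```python
from google.cloud import language_v1

def classify_sentiment(text: str) -> str:
    """Label a sentence positive/negative/neutral via the Cloud NL API."""
    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    response = client.analyze_sentiment(request={"document": document})
    score = response.document_sentiment.score  # in [-1.0, 1.0]
    # Assumed cut-offs for mapping the continuous score to three labels.
    if score > 0.25:
        return "positive"
    if score < -0.25:
        return "negative"
    return "neutral"

print(classify_sentiment("The source rock is mature and oil-prone."))
```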

Some interesting findings!

2018 Runner-Up, International iSchools Doctoral Dissertation Award

Delighted to receive the 2018 Runner-Up Prize in the International iSchools Doctoral Dissertation Award, the first time a dissertation from a Scottish university has been recognized. The awards recognize outstanding work in the information field, specifically the relationship between information, people and technology. My dissertation topic was enterprise search and discovery.

Nominations are solicited from all members of the iSchools organization, now more than 80 universities worldwide, and judged by an award committee drawn from leading international schools. Congratulations to the winner, Galen Panger from the University of California Berkeley (and Researcher at Google).

A very big thank you to my supervisory team, Professor Simon Burnett and Dr Laura Muir, and colleagues at Robert Gordon University. I’d also like to thank everyone who has helped and encouraged (and still helps and encourages) my research. It is very much appreciated.

Look forward to many more exciting collaborations as I research social informatics and how advanced analytics & machine learning can be blended with search techniques to augment human-computer interaction in the workplace.

Happy New Year!

Press release here and more details here

Using Streamgraphs to visualize results from geological text analytics


Figure 1 – Frequency of geological concept ‘mentions’ in text co-occurring with petroleum system elements, by geological time. Streamgraph (stacked area chart) with Sankey curves; three visualizations shown for the same data: silhouette [left], expanded [centre], zero-offset [right]. Extracted from 40 public domain articles using Python/RawGraphs.

The ‘dark’ layers are associated with source rock, organic lithologies or anoxic environments. The ‘yellow/orange’ layers are associated with reservoirs, ‘purple’ layers with seals and traps, and ‘red’ with volcanics. ‘Light blue’ is carbonate lithology; ‘darker blue’ is lacustrine, river and marine depositional environments. The key is shown in Figure 2.

Figure 2 – Silhouette streamgraph and key. Source Rock (SR), in black, can be clearly seen.

Streamgraphs can be an ’emotionally’ engaging and useful way to show the ebb and flow (hence the river metaphor) of large numbers of topic changes over time. Stacked area charts have two main purposes: to show the trend of a specific category, and to show trends of aggregated categories. Early work dates back to ThemeRiver in 1999, which identified the ordering of the layers, the baseline, scaling and colours as the four key areas. Byron and Wattenberg (2008) emphasize the criticality of ordering and colouring in streamgraphs.
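For readers who want to experiment, the three layouts in Figure 1 map neatly onto the baseline options of matplotlib’s stackplot. A minimal sketch with toy concept counts (the real data came from the text extractions):

```python
import matplotlib.pyplot as plt
import numpy as np

# Toy concept-mention counts per geological period (rows are concepts).
periods = ["Triassic", "Jurassic", "Cretaceous", "Paleogene", "Neogene"]
labels = ["source rock", "reservoir", "seal/trap"]
counts = np.array([
    [4, 12, 9, 2, 1],   # source rock
    [2, 8, 14, 6, 3],   # reservoir
    [1, 5, 7, 4, 2],    # seal/trap
])
x = range(len(periods))

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
# Silhouette: layers stacked symmetrically about zero (the ThemeRiver look).
axes[0].stackplot(x, counts, baseline="sym", labels=labels)
# Expanded: normalize each period to proportions so the band fills the axis.
axes[1].stackplot(x, counts / counts.sum(axis=0), baseline="zero", labels=labels)
# Zero-offset: a conventional stacked area chart.
axes[2].stackplot(x, counts, baseline="zero", labels=labels)

for ax, title in zip(axes, ["silhouette", "expanded", "zero-offset"]):
    ax.set_title(title)
    ax.set_xticks(list(x), periods, rotation=45)
axes[0].legend(loc="upper left", fontsize="small")
plt.tight_layout()
plt.show()
```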

Quotes from users of systems deploying streamgraphs include “helps see big picture”, “quickly led me to investigate the topics that had unique or extreme temporal qualities”, and observations on how events could have triggered such peaks (Bradley et al. 2013).

There are no known published examples applying the technique to geological text extractions by time. Figure 1 may be the first published example. This is interesting as the ‘layering’ effect over time has certain geological connotations. Figure 3 shows the typical horizontal display (by time).


Figure 3 – Horizontal ‘Traditional’ Display

The expanded display forces each topic to be proportionally represented vertically (Figure 4). This can be misleading ‘horizontally’, but it can surface interesting trends that might otherwise remain hidden. For example, in Figure 4 we can see that ‘marl’ in the Paleogene is relatively ‘thicker’ (in relative-frequency terms) than other concepts, although mention of source rocks is absent for that time period. That may warrant closer inspection and could perhaps lead to a new insight.


Figure 4 – Expanded display

There may be value in using these visualizations as an interactive interface, allowing geoscientists to characterize a geological basin and drill down to the documents and sentences where the concepts co-occur. Because of their nature, these visualizations may resonate with geoscientists more than other displays for conveying large amounts of data from text analytics and machine-learned topics. This presents an area for further research.



Geological Expressions: Clustering the results of text analytics for exploratory data analysis.

In previous articles I have discussed how concepts can be detected and extracted from text. The patterns of these concepts (such as their proportions with respect to other concepts) provide a signature or ‘expression’ that can be compared. That could be at a ‘micro’ scale, such as a Geological Formation, or a ‘macro’ scale, such as a Geological Basin.
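As a concrete (if simplified) illustration of how such an ‘expression’ might be built, the sketch below counts concept mentions with a small hand-made dictionary of regular expressions; a production extractor would use a far richer vocabulary and disambiguation:

```python
import re
from collections import Counter

# Hypothetical mini concept dictionary; a real one holds many more variants.
CONCEPTS = {
    "source rock": r"\bsource\s+rocks?\b",
    "reservoir": r"\breservoirs?\b",
    "seal/trap": r"\bseals?\b|\btraps?\b",
    "carbonate": r"\bcarbonates?\b|\blimestones?\b",
}

def expression(text: str) -> Counter:
    """Count concept mentions in a text, giving its 'expression' signature."""
    return Counter({
        concept: len(re.findall(pattern, text, flags=re.IGNORECASE))
        for concept, pattern in CONCEPTS.items()
    })

report = ("The Jurassic source rock charges a fractured carbonate reservoir "
          "below a thick evaporite seal.")
print(expression(report))
```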

These multivariate data can be clustered by similarity in a number of ways. Typical Euclidean methods focus on the magnitude of the concept frequencies, whilst Pearson correlation focuses on the relative proportions of concepts with respect to one another (the profile). Because of the sampling of text reports, and the likelihood that you will not have the same proportion of mentions or documents for every entity you want to compare (e.g. a basin), correlation methods may be better suited to the results of text analytics.
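The difference between the two distance measures is easy to demonstrate. In the sketch below, two basins share the same concept profile but one has ten times the report coverage; Euclidean distance calls them far apart, while correlation distance calls them nearly identical:

```python
from scipy.spatial.distance import correlation, euclidean

basin_a = [10, 20, 30, 40]      # sparsely reported basin
basin_b = [100, 200, 300, 400]  # same profile, ten times the mentions

print(euclidean(basin_a, basin_b))    # large: dominated by magnitude
print(correlation(basin_a, basin_b))  # ~0.0: identical relative proportions
```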

The toy model in the figure below illustrates how concepts (along the x-axis) extracted from text for Geological Basins (along the y-axis) can be clustered using correlation. The heatmap shows where the major parameter (concept) variations (outliers) are located: dark red cells in the matrix are above the mean, dark blue below it, and whiter colours around it. For example, concept parameter P21 for Basin #2 literally sticks out like a sore thumb! Is this an artefact or something more interesting? This is what exploratory data analysis is all about.

Figure – Clustered heatmap of basins and concepts, with dendrograms.

The dendrograms cluster the concepts (parameters) along the x-axis and the basins along the y-axis. As you move up a dendrogram, items get further away from one another.

Basins are grouped by Klemme type. So in this example, all terrestrial rift valleys (depicted in orange on the left-hand side) group nicely together. Forearc basins (in green on the left-hand side) can also be seen to cluster together; however, Basin #42 (in red, a backarc basin) is clustered in the middle of these. This outlier (based on the data generated from the text analytics) may indicate something unusual about the basin, or perhaps its type has been misinterpreted. It may provide a stimulus for closer inspection.
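A clustered heatmap with dendrograms and a Klemme-type colour strip of this kind can be produced with seaborn’s clustermap; the sketch below uses randomly generated toy data and hypothetical type assignments purely to show the mechanics:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
# Toy concept-frequency matrix: 10 basins by 8 concept parameters.
data = pd.DataFrame(
    rng.poisson(5, size=(10, 8)),
    index=[f"Basin #{i}" for i in range(1, 11)],
    columns=[f"P{j}" for j in range(1, 9)],
)
# Hypothetical Klemme type per basin, mapped to a row colour strip.
klemme = ["rift"] * 5 + ["forearc"] * 3 + ["backarc"] * 2
palette = {"rift": "orange", "forearc": "green", "backarc": "red"}

# Correlation distance clusters on profile shape rather than magnitude;
# z_score=0 standardises each row, so cells show deviation from the row mean.
sns.clustermap(data, metric="correlation", z_score=0, cmap="RdBu_r",
               row_colors=[palette[k] for k in klemme])
plt.show()
```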

These techniques differ from one-step unsupervised Latent Semantic Analysis (LSA) or neural probabilistic methods of word co-occurrence (such as Word2vec). They are effectively a two-step process: first a semi-supervised extraction of concepts, followed by a second step of unsupervised clustering. This has the benefit of specifically targeting the clustering on appropriate geological characteristics, rather than on ‘all of the text’, which may unduly bias clustering towards non-geological characteristics; for example, basins being deemed similar simply because they contain similar geographical location words and phrases. This presents an area for further research.

Applying Deep Learning to Geoscience Image Type Classification in Literature: Some Early Research Findings.


Before I delve into this topic, I’ll start with a story that led me here. This year I went on a fossil hunting expedition with my family to the Dorset coast in the UK. We spent several hours scanning the beach, performing our usual ‘pattern recognition’ to look for ‘flying saucer’ shaped pebbles of a certain colour. I was lucky enough to find a nodule containing a Jurassic Marine Fish (Dapedium). The nodule is shown below: on the left you can see the back of the skull and the thick scales, next to a coin for scale, alongside an artist’s impression of the fish.


So what has this to do with Deep Learning?

Well, I tested some photographs I had taken on the beach with the Google and Microsoft image APIs available on the web (there are others as well, of course, such as open-source TensorFlow libraries that can be used in Python). I took a screen ‘snip’ of the photographs and saved a JPEG to ensure the image carried no locational metadata. When I dragged and dropped the image into these APIs, I was stunned to find that one of them had geolocated (positioned on a map) the photograph on the very beach where I had stood and taken the photo! Others may not think this ‘landmark’ detection anything special, but I still marvel at it. It started me thinking about what else we could do with image classification in the Geosciences. There has been significant and ongoing research using deep learning on high-resolution images in the geosciences (powerful microscope images of microfossils, SEM, seismic and remote sensing data). But what about the relatively poor quality material (in terms of resolution) typically added to the average document, article and report?

Geoscience literature and reports contain numerous images (such as charts, maps, sections and plots). Whilst general open-source Optical Character Recognition (OCR) will extract the explicit text on any image, there are other opportunities to extract implicit information about (and from) these image objects.
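The explicit-text part is straightforward with open-source tooling; for example, a minimal sketch using pytesseract (assuming the Tesseract binary is installed, and with a hypothetical file name):

```python
from PIL import Image
import pytesseract

# Pull the explicit text (axis labels, captions, keys) off a figure image.
image = Image.open("stratigraphic_chart.png")  # hypothetical example image
print(pytesseract.image_to_string(image))
```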

Classifying the ‘type’ of image is probably at the lower-value end, but may be useful, especially as some images may not be associated with text or captions/figure labelling.

I will be conducting some Human Computer Interaction (HCI) user studies in academia with geoscientists from different industries and roles to ascertain what is most important and why.


By looking through a representative sample of public domain Petroleum Systems reports, seven high-level common classes of geological image type were identified: Seismic Sections, Maps, Photographs, Cross Plots, Stratigraphic Charts, Logs and Cross Sections.

A Deep Learning Convolutional Neural Network (CNN) with transfer learning was applied to balanced training sets of approximately 200 public domain images per class. Of these, approximately 80% were used for training and 20% for testing. Transfer learning ‘piggybacks’ off pre-built models trained on hundreds of thousands of images, reusing those existing ‘generic’ layers and supplementing them with ‘domain-specific’ ones. This is useful because, for many subject domain classes and features, it is likely that only a small number of training images is easily available.

The pre-trained model used here includes weights from VGG16, a deep (16-layer) convolutional net trained on 1.3 million images across 1,000 general image classes that generalises well to other datasets. There are models, such as ResNet, that are much deeper (they can have hundreds of layers), and the ImageNet research initiative contains over 14 million images linked to WordNet.
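A minimal sketch of this kind of transfer-learning setup in Keras is shown below, with a frozen VGG16 base and a small trainable head; the folder layout, head sizes and hyperparameters are illustrative assumptions rather than the actual research configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

IMG_SIZE = (224, 224)
NUM_CLASSES = 7  # the seven geological image classes identified above

# Frozen VGG16 convolutional base supplies the 'generic' layers; only the
# small domain-specific head on top is trained.
base = tf.keras.applications.VGG16(
    weights="imagenet", include_top=False, input_shape=IMG_SIZE + (3,)
)
base.trainable = False

model = models.Sequential([
    layers.Input(shape=IMG_SIZE + (3,)),
    layers.Lambda(tf.keras.applications.vgg16.preprocess_input),
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Hypothetical folder layout images/<class_name>/*.jpg, split roughly 80/20.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "images", validation_split=0.2, subset="training",
    seed=42, image_size=IMG_SIZE)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "images", validation_split=0.2, subset="validation",
    seed=42, image_size=IMG_SIZE)

model.fit(train_ds, validation_data=val_ds, epochs=10)
```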


Testing on the geoscience images and classes gave an accuracy of 92.7%. This is the likelihood that an image belonging to one of the seven classes will be assigned the correct class, measured on the examples (the 20%) not used during training. Cut-offs or other techniques can be used to ignore images that do not belong to any of the pre-defined classes, as sketched below.
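One simple cut-off approach is to reject any image whose top softmax probability falls below a confidence floor; the 0.6 threshold here is an assumption to be tuned on validation data:

```python
import numpy as np

CLASS_NAMES = ["seismic section", "map", "photograph", "cross plot",
               "stratigraphic chart", "log", "cross section"]
THRESHOLD = 0.6  # assumed confidence floor; tune on a validation set

def classify_with_cutoff(model, image_batch):
    """Return a class name per image, or 'unknown' if nothing is confident."""
    probs = model.predict(image_batch)
    labels = []
    for p in probs:
        best = int(np.argmax(p))
        labels.append(CLASS_NAMES[best] if p[best] >= THRESHOLD else "unknown")
    return labels
```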


You can try the resulting classifier yourself via the API using Google Chrome < Click Here >. Simply drag and drop a sample image and it will return the classification. For the example below, the classifier is 99.9% certain it is a seismic section. Correct!



Moving down one level from these classes to sub-classes, I experimented with some map types. The training data available for this short experiment was a little sparse and more uneven, and the differences between the classes were more subtle. This led to an overall accuracy of 76.9%.

This could probably be improved with further iterations, merging some classes and adding more training data. Reviewing the results (see image below), ‘Paleogeographical Maps’, ‘Tectonic Element Maps’ and ‘Seismic Basemaps’ had high accuracy; the subtle differences between other types of maps led to poorer results, given the limitations of this experiment described above. This presents an area for further research.



In addition to information on what an image is, it is also possible to train a classifier to detect what the image contains. This could range from geological photographs showing depositional, diagenetic and structural (e.g. faults or folds) features, and seismic sections showing extensional or compressional features, through to geological cross sections showing rollover anticlines and lithostratigraphic charts showing symbols for petroleum system elements such as source rock.

The example below shows the latter, with an estimated 90% accuracy. Petroleum Systems Elements (PSE) typically cover source rock, reservoir, migration, seal and trap. Whilst lithostratigraphic charts sometimes have columns labelled with text in a variety of ways (e.g. Source Rock, source, SR, Sr. Charge), the labels are not always present; sometimes a legend is used at the base, and sometimes labelling is absent completely. Detecting the presence of these symbols (sometimes black or coloured circles, ticks, diamonds etc.), and where they occur on the image, without relying on OCR could be useful.


By providing examples of each, deep learning can detect patterns enabling classifiers to pick up such nuances. These features may not be described in text, so these techniques may surface information that traditional ‘enterprise search’ approaches miss every time… you may even catch a “big fish” 🙂