Author: phcleverley

Geological Expressions: Clustering the results of text analytics for exploratory data analysis.

In previous articles I have discussed how concepts can be detected and extracted from text. The patterns of these concepts (such as their proportions with respect to other concepts) provide a signature or ‘expression’ that can be compared. That could be at a ‘micro’ scale such as a Geological Formation or a ‘macro’ scale such as a Geological Basin.

These multivariate data can be clustered for similarity in a number of ways. Typical Euclidean methods focus on the magnitude of the concept frequencies, whilst Pearson correlation focuses on their relative proportions with respect to one another (the profile). Due to the sampling of text reports and the likelihood that you will not have the same proportion of mentions or documents for every entity you want to compare (e.g. basin), correlation methods may be better suited to the results of text analytics.
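To make that distinction concrete, here is a minimal sketch (using SciPy, with made-up concept-frequency vectors) of how Euclidean distance and correlation distance treat the same profiles differently:

```python
import numpy as np
from scipy.spatial.distance import euclidean, correlation

# Hypothetical concept-frequency vectors. Basin B has the same relative
# proportions (profile) as Basin A, but twice the magnitude, as if twice
# as many documents mentioned it. Basin C has a reversed profile.
basin_a = np.array([10.0, 20.0, 30.0, 40.0])
basin_b = basin_a * 2
basin_c = np.array([40.0, 30.0, 20.0, 10.0])

# Euclidean distance is driven by magnitude, so A and B look far apart...
print(euclidean(basin_a, basin_b))    # large (≈54.8)
# ...while correlation distance (1 - Pearson r) sees identical profiles.
print(correlation(basin_a, basin_b))  # ≈0.0 (same profile)
print(correlation(basin_a, basin_c))  # ≈2.0 (perfectly anti-correlated)
```

This is why a well-sampled basin and a sparsely-sampled basin with the same geological ‘expression’ can still cluster together under correlation.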

The toy model in the figure below illustrates how some concepts (along the x-axis) extracted for Geological Basins (along the y-axis) in text can be clustered using correlation. The Heatmap shows where the major parameter (concept) variations (outliers) are located: dark red cells in the matrix are above the mean, dark blue below it, and whiter colours around it. For example, concept parameter P21 for Basin #2 literally sticks out like a sore thumb! Is this an artefact or something more interesting? This is what exploratory data analysis is all about.


The Dendrograms cluster the concepts (parameters) along the x-axis and the basins along the y-axis. As you move up a Dendrogram, items get further away from one another.

Basins are grouped by Klemme Types. So in this example, all terrestrial rift valleys (depicted in orange on the left hand side) are grouped nicely together. Forearc basins (in green on the left hand side) can be seen to cluster together; however, Basin #42 (in red, a Backarc basin) is clustered in the middle of these. This outlier (based on the data generated from the text analytics) may indicate something unusual about this basin, or perhaps its type has been misinterpreted. It may provide a stimulus for a closer inspection.

These techniques differ from a one-step unsupervised Latent Semantic Analysis (LSA) or neural probabilistic methods of word co-occurrence (such as word2vec). They are effectively a two-step process: first a semi-supervised extraction of concepts, followed by a second step of unsupervised clustering. This has the benefit of specifically targeting the clustering on appropriate geological characteristics, rather than ‘all of the text’, which may unduly bias clustering towards non-geological characteristics; for example, basins being deemed similar simply because they contain similar geographical location words and phrases. This presents an area for further research.


Applying Deep Learning to Geoscience Image Type Classification in Literature: Some Early Research Findings.


Before I delve into this topic, I’ll start with a story that led me here. This year I went on a fossil hunting expedition with my family to the Dorset coast in the UK. We spent several hours scanning the beach, performing our usual ‘pattern recognition’ to look for ‘flying saucer’ shaped pebbles of a certain colour. I was lucky enough to find a nodule containing a Jurassic marine fish (Dapedium). The nodule is shown below: on the left you can see the back of the skull and the thick scales, next to a coin for scale and an artist’s impression of the fish.


So what has this to do with Deep Learning?

Well, I tested some photographs I had taken on the beach with the Google and Microsoft Image APIs available on the web (there are others as well of course, such as open-source TensorFlow libraries that can be used in Python). I took a screen ‘snip’ of the photographs and saved a JPEG to ensure the image had no locational metadata. When I dragged and dropped the image into these APIs, I was stunned to find that one of them had geo-located (positioned on a map) the photograph on the very beach where I had stood and taken the photo! Others may not think this ‘landmark’ detection anything special, but I still marvel at it. It started me thinking about what else we could do with image classification in the Geosciences. There has been significant and ongoing research using deep learning on high resolution images in the geosciences (using powerful microscope images of microfossils, SEM, seismic and remote sensing data). But what about the relatively poor quality material (in terms of resolution) typically added to the average document, article and report?

Geoscience literature and reports contain numerous images (such as charts, maps, sections, plots etc.). Whilst general open-source Optical Character Recognition (OCR) will extract explicit text from any image, there are other opportunities to extract implicit information about (and from) these image objects.

Classifying the ‘type’ of image is probably at the lower value end, but may be useful, especially as some images may not be associated with text, captions or figure labelling.

I will be conducting some Human Computer Interaction (HCI) user studies in academia with Geoscientists from different industries and roles to ascertain what is most important and why.


By looking through a representative sample of public domain Petroleum Systems reports, seven common high-level classes of geological image type were identified: Seismic Sections, Maps, Photographs, Cross Plots, Stratigraphic Charts, Logs and Cross Sections.

A Deep Learning Convolutional Neural Network (CNN) with transfer learning was applied to balanced training sets of approximately 200 public domain images per class. Of these, approximately 80% were used for training and 20% for testing. Transfer learning ‘piggy-backs’ off pre-built models that have been trained on hundreds of thousands of images, reusing those existing ‘generic’ layers and supplementing them with ‘domain specific’ ones. This is useful because, for many subject domain classes and features, it is likely that only a small number of training images are easily available.
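The model above was built in a deep learning framework, but the feature-extraction flavour of transfer learning can be sketched in a library-agnostic way: treat the frozen pre-trained layers as a fixed feature extractor, and train only a small ‘head’ on top. In this sketch the pre-trained CNN activations are simulated with random, class-separated vectors purely to show the 80/20 split and the new head being trained; the numbers and the scikit-learn classifier are illustrative assumptions, not the actual model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_classes, per_class, dim = 7, 200, 64

# Stand-in for the frozen pre-trained layers: in a real pipeline these
# vectors would be CNN activations (e.g. VGG16 with its top removed),
# computed once per training image.
centres = rng.normal(size=(n_classes, dim))
X = np.vstack([c + 0.3 * rng.normal(size=(per_class, dim)) for c in centres])
y = np.repeat(np.arange(n_classes), per_class)

# ~80% for training, ~20% held out for testing, balanced per class.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The 'domain specific' part: a small classifier head trained on top of
# the fixed generic features.
head = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"held-out accuracy: {head.score(X_te, y_te):.3f}")
```

The appeal of this design is that only the small head needs the ~200 domain images per class; the expensive generic layers were already paid for by someone else's millions of images.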

Part of the pre-trained models used includes weights from the VGG16 model, a deep (16-layer) convolutional net trained on 1.3 million images across 1,000 general image classes, which generalises well to other datasets. There are models, such as ResNet, that are much deeper (they can have hundreds of layers), and the ImageNet research initiative contains over 14 million images linked to WordNet.


Testing on the geoscience images and classes gave a projected accuracy of 92.7%. This is the likelihood that an image belonging to one of the seven classes will be classified into the correct class, using examples (the 20%) not used during training (machine learning). Cut-offs or other techniques can be used to ignore images not related to the pre-defined classes if they are encountered.
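The cut-off idea can be sketched in a few lines (pure NumPy; the 0.5 threshold and the probability vectors are assumptions for illustration): reject a prediction when the classifier's best probability is too low.

```python
import numpy as np

CLASSES = ["seismic section", "map", "photograph", "cross plot",
           "stratigraphic chart", "log", "cross section"]

def classify_with_cutoff(probs, threshold=0.5):
    """Return the predicted class, or None when the classifier's best
    softmax probability falls below the cut-off, i.e. the image probably
    belongs to none of the pre-defined classes."""
    probs = np.asarray(probs)
    best = int(np.argmax(probs))
    return CLASSES[best] if probs[best] >= threshold else None

# A confident prediction is accepted...
print(classify_with_cutoff([0.94, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]))
# ...while a flat distribution (an unfamiliar image) is rejected.
print(classify_with_cutoff([0.20, 0.15, 0.15, 0.14, 0.13, 0.12, 0.11]))
```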

Image Classification

You can try the resulting classifier yourself via the API, using Google Chrome < Click Here >. Simply drag and drop a sample image and it will return the classification. For the example below, the classifier is 99.9% certain it is a seismic section. Correct!



Moving down one level from these classes to sub-classes, I experimented with some Map Types. The training data available for this short experiment was a little sparse and more uneven, and the differences between the classes were more subtle. This led to an overall accuracy of 76.9%.

This could probably be improved with further iterations, merging some classes and adding more training data. Reviewing the results (see image below), ‘Paleogeographical Maps’, ‘Tectonic Element Maps’ and ‘Seismic Basemaps’ had high accuracy. The subtle differences between other types of maps led to poorer results, given the limitations of this experiment described above. This presents an area for further research.



In addition to information on what an image is, it is also possible to train a classifier to detect what the image contains. This could range from geological photographs showing depositional, diagenetic and structural (e.g. faults or folds) features; seismic sections showing extensional or compressional features, through to geological cross sections showing roll over anticlines and Lithostratigraphy charts showing symbols for petroleum system elements such as source rock.

The example below shows the latter, with an estimated 90% accuracy. Petroleum Systems Elements (PSE) typically cover source rock, reservoir, migration, seal and trap. Whilst there are sometimes columns on lithostratigraphic charts labelled with text in a variety of ways (e.g. Source Rock, source, SR, Sr. Charge), they are not always present; sometimes a legend is used at the base and sometimes labelling is absent completely. Detecting the presence of these symbols (sometimes black or coloured circles, ticks, diamonds etc.), and where they occur on the image, without relying on OCR could be useful.


By providing examples of each, deep learning can detect patterns, enabling classifiers to detect such nuances. These features may not be described using text, so these techniques may surface information that traditional ‘enterprise search’ approaches miss every time… you may even catch a “big fish” 🙂

Short enterprise search queries: Are users really to blame?


Some practitioners state that users of an enterprise search deployment enter far fewer words per search query (1.5 on average) than users of Internet search (3.0 on average), and infer that this is one of the causes of poor outcomes. This short article presents an argument that this enterprise search user behaviour, rather than being a cause, is actually a symptom of factors related to the enterprise environment, including corpus sizes and search query parsing algorithms. User search behaviour (agency) may develop as a result of corpus size and query parsing algorithms (structure), explaining some of the differences in search query length between Internet search engines, site-search and enterprise search deployments. These may act as a constraining effect in many enterprises, where user behaviour adapts to these structures. This shift in thinking may enable more effective interventions and solution design.

Download article here in SlideShare: Click Here


The Contradiction & Emergence Engine

This is a general discussion of some ideas I have been formulating for some time, going back to the work I did in 2014 on serendipitous information discovery.

It is becoming commonplace to extract occurrences of entities in document/literature text, their association with other entities and numerical values. This can generate a wealth of structured information (from unstructured text). But what does it mean? How do you determine what is very important and what is not?

Whilst it may be possible to generate new insights directly from the structured information extracted from unstructured text, it is not a given. If it does not tell a person or organization anything they did not already know, then it won’t support the generation of new insights. It may not be completely pointless, as it may simply be another piece of evidence ‘confirming’ what is already known.

In terms of comparing what has been generated to what is already explicitly known (written down) in corporate databases, a suite of ‘contradiction’ and ‘discovery’ algorithms may be needed. These algorithms could scan the newly created structured information (generated from unstructured text) to identify contradictions with the ‘prevailing view’ already stored in structured databases; a form of exploratory data analysis. Alternatively, they could compare structured information generated from unstructured text in company documentation (the prevailing view) to structured information generated from external literature.

A simple example could be highlighting a new ‘data point’ in x,y space on a map. A more complex example could be highlighting a much more ‘positive’ sentiment towards a possibility for action, than the currently prevailing view. 

Furthermore, new associations may be formed by ‘joining’ these information sources together; the whole may be greater than the sum of its parts, leading to the emergence of new information and the construction of new knowledge by people. For example, Swanson’s ‘ABC method’ of literature-based discovery led to the discovery of the link between ‘magnesium deficiency’ and ‘migraines’, which was subsequently proved experimentally. It was only by combining information (it was not present in one source) that the related concepts emerged.
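A toy sketch of the ABC idea (the term pairs below are illustrative stand-ins, not real literature extractions): if one literature links A to B terms, and a disjoint literature links those B terms to C terms, candidate hidden A–C connections can be ranked by how many B terms bridge them.

```python
from collections import Counter

# Illustrative A-B and B-C co-occurrence pairs from two 'literatures'.
a_to_b = {("migraine", "vascular spasm"),
          ("migraine", "platelet aggregation")}
b_to_c = {("vascular spasm", "magnesium deficiency"),
          ("platelet aggregation", "magnesium deficiency"),
          ("vascular spasm", "dehydration")}
known_a_c = set()  # no direct A-C co-occurrence in either literature

def abc_candidates(a_term, ab, bc, known):
    """Rank candidate C terms by how many B terms bridge A to them,
    skipping any A-C link that is already explicitly known."""
    bridges = Counter()
    b_terms = {b for a, b in ab if a == a_term}
    for b, c in bc:
        if b in b_terms and (a_term, c) not in known:
            bridges[c] += 1
    return bridges.most_common()

print(abc_candidates("migraine", a_to_b, b_to_c, known_a_c))
# 'magnesium deficiency' is bridged by two B terms, 'dehydration' by one
```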

These are likely to be seen as ‘surprising’ by individuals or organizations; surprise could be described as the response given when information is presented that contradicts the existing ‘mental model’ held towards a state of affairs. Ultimately these could be the sparks for data driven learning.

Well known research methods and techniques such as Mixed Methods, Activity Theory and Triangulation have an inherent sensitivity to integrating diverse ‘data’ and identifying tensions, breakdowns, dissonance and contradictions. They attack a problem from a number of different conceptual levels and angles. I have been doing some research comparing different ‘views’ in the literature towards the same subject and how best to visualize these data. The findings will be presented in a future post/article.

Algorithms that ‘sit on top of databases’ that hold both ‘born structured’ data, as well as ‘derived structured’ data (generated from unstructured text), could be useful assistants to surface these contradictions from a sea of data. Valuable discoveries may also emerge.

Put simply:


IF EV = CV THEN Confirmation
IF EV <> CV THEN Contradiction / Emergence

CV = Current View
EDT = Extracted Data from Text (and/or text external to CV)
EV = Enhanced View
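A minimal sketch of such a comparison in code (the field names and values are hypothetical; a real implementation would need entity resolution, units and tolerances):

```python
def compare(extracted, current_view):
    """Compare data extracted from text (EDT) against the current view
    (CV) held in a structured database, field by field. Matching values
    confirm the current view; mismatches flag contradictions worth a
    closer look; fields absent from the current view are emergent."""
    report = {}
    for field, value in extracted.items():
        if field not in current_view:
            report[field] = "emergence"
        elif current_view[field] == value:
            report[field] = "confirmation"
        else:
            report[field] = "contradiction"
    return report

# Hypothetical basin record: the database view vs. what text analytics found.
cv = {"source_rock_age": "Jurassic", "trap_type": "structural"}
edt = {"source_rock_age": "Jurassic", "trap_type": "stratigraphic",
       "seal": "evaporite"}
print(compare(edt, cv))
```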


Cognitive Computing Assistants for Geoscientists: Automatic sentiment-tone analysis and similarity prediction.

In previous posts I have shown how ‘sentiment/tone/opinion’ can be automatically extracted from text to stimulate serendipity during search and to analyse differences between Geological Formations. One of the findings was that words/sentences deemed ‘negative’ by out-of-the-box algorithms are not necessarily ‘negative’ in a geological sense (such as ‘hard’, ‘spill’, ‘old’, ‘fault’ and ‘thick’). Another finding was that typical entity extraction, or even the entity-entity association extractions and ontologies described in the geoscience literature, tends to leave behind valuable sentiment/tone context. Some of my recent academic research (to be presented at the Geological Society of America (GSA) this month) has focused on using machine learning from training sets to assess ‘tone’ in reports relating to working petroleum system elements in space and time. Effectively, this targets a generic question that could apply to many domains in Geoscience and beyond:

“Do the right conditions exist for…?”

Fig 1 shows sentiment geographically and Fig 2 by geological age.


Figure 1 – Map plot. Comparison of positive v negative tone for various elements in different geographical locations

For example, the sentence “…well northeast of Boomerang Hills has tested this pre-Andean trap concept successfully” will count as a ‘positive’ for the petroleum system element ‘Trap’. The sentence “downward migration from Upper Cretaceous-sourced oils seems unlikely” will count as a ‘negative’ for the element Migration (SR Charge). Simply ‘counting’ entities is not enough; consider the mention “The reservoir was absent”! Without contextual sentiment, it is likely that misleading data will be presented.

This could provide another ‘opinion’ for scientists which could challenge preconceptions about what they may already believe. It could also stimulate learning events (clicking on the sentiment to view the mentions within sentences) that may prove academically and/or commercially valuable.

I have been experimenting with a custom ensemble (skip-gram, lexicon and Bayesian) algorithm, taking into account word order, to detect this ‘positive’ and ‘negative’ sentiment (opinion and tone) around entities. Deep learning text embeddings would probably improve the ensemble accuracy (Araque et al 2017), but I have not used them here as I am testing a very small dataset. See previous posts where I have used these techniques for different purposes on larger datasets.

The proportion of negative v positive instances can then be used to show relative trends (pie-charts in Fig 1) for each element and rolled up to higher level constructs. Figure 2 (multi-series bubble plot) shows the same data (from a very small sample of USGS reports, again to illustrate the concept) focusing on the Source Rock/Charge element. The data matrix is plotted wherever the algorithms have picked up both a Geological Age and a location ‘mention’ in the text.
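A rough sketch of how such a data matrix could be assembled with pandas (the mentions below are invented for illustration; mention counts drive bubble size, and the positive share drives the tone displayed):

```python
import pandas as pd

# Hypothetical sentiment mentions extracted from reports: each row is one
# mention of the Source Rock/Charge element, tagged with the basin and
# geological age the algorithms picked up in the surrounding text.
mentions = pd.DataFrame({
    "basin": ["Sirte", "Sirte", "Sirte", "Tamara", "Tamara", "Maracaibo"],
    "age": ["Cretaceous", "Cretaceous", "Jurassic",
            "Cretaceous", "Jurassic", "Cretaceous"],
    "tone": ["positive", "positive", "negative",
             "positive", "negative", "negative"],
})

# Mention counts per (basin, age) cell: these set the bubble sizes.
counts = mentions.pivot_table(index="basin", columns="age",
                              values="tone", aggfunc="count", fill_value=0)

# Positive share per cell: this sets the positive v negative split.
pos_share = (mentions.assign(pos=mentions["tone"].eq("positive"))
             .groupby(["basin", "age"])["pos"].mean().unstack(fill_value=0))
print(counts)
print(pos_share)
```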


Figure 2 – Geological Time charts plots for sentiment/tone mentions in context to a geographical area/basin. The larger the bubble, the greater the number of mentions.

The categorization is very coarse (Basin level). Ideally it would be more useful to extract specific intra-Basin features and/or geographical areas. Geological age of source rocks and charge/migration events are also conflated somewhat in this simplified picture, although they could easily be split out. Also, given enough document volumes, it should be possible to animate Figures 1 and 2 through time; for example, to show how sentiment/tone has changed each year from 1990 through to 2017.

By machine reading documents, papers and texts (too many in number to be realistically read by a person, and harbouring patterns too subtle to be picked up in any single document), a perspective can be obtained which may challenge individual biases and/or organizational dogma.

Public domain reports from the United States Geological Survey (USGS) were downloaded to test. Python & TextBlob scripts were used to convert the reports to text, identify mentions of Petroleum System Elements in the text and whether the context was ‘positive’ or ‘negative’ sentiment. Geo-referencing can be achieved through the centroid of the country, basin or geographical point of interest in question that is associated with that mention.

The algorithm addresses areas such as negation and avoids some of the problems with context free Bag of Words (BoW) models. For example “Source Rock maturity was not a problem” is a ‘positive’ context, despite having individual ‘negative’ words such as ‘not‘ and ‘problem‘. This is where traditional lexicon/taxonomy approaches (even using multiple word concepts) are likely to perform poorly.
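To illustrate why negation handling matters, here is a deliberately tiny rule-based scorer; the lexicon, window size and flip rule are assumptions for the sketch, not the ensemble described above:

```python
# A toy negation-aware polarity scorer, showing why bag-of-words lexicons
# fail on sentences like "Source Rock maturity was not a problem".
LEXICON = {"problem": -1, "unlikely": -1, "absent": -1,
           "successfully": +1, "excellent": +1, "mature": +1}
NEGATORS = {"not", "no", "never", "without"}

def polarity(sentence, window=3):
    tokens = sentence.lower().replace(".", "").split()
    score = 0
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            weight = LEXICON[tok]
            # Flip polarity if a negator occurs shortly before the word.
            if any(t in NEGATORS for t in tokens[max(0, i - window):i]):
                weight = -weight
            score += weight
    return score

print(polarity("Source Rock maturity was not a problem"))  # positive
print(polarity("The reservoir was absent"))                # negative
```

A pure bag-of-words model would score the first sentence negative (it contains ‘not’ and ‘problem’); word order and a negation window recover the intended ‘positive’ context.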

Further work involves ascertaining precision, recall and F1 accuracy scores; I am currently working on a test set of over 2,000 examples of positive, negative and neutral sentiment about these entities, extracted from public domain sources. Differentiating tone into various dimensions may also be useful. These may be promising techniques to augment geoscientists’ cognition, supporting higher-level thinking processes rather than just the retrieval (remembering) of documents in traditional search applications.

Although all Geological Basins are unique, from Figure 2 it is obvious that some Basins/Areas may share common aspects. Utilising positive and negative tone by geological age, clustering techniques can be used on the data matrix to suggest analogues (including intra-basin) just from the latent structure in text. No prior studies have been found which address this area and ascertain its usefulness. Fig 3 shows one such technique applied to positive/negative tone for the Source Rock/Charge element, with correlations and hierarchical clustering shown in a sequential coloured Heatmap (Metsalu and Vilo 2015). Rows and columns have been automatically re-ordered through clustering; the colours displayed are the values in the data matrix.


Figure 3 – Clustering (Correlation Clustering) Basins/Area and Geological Time for Source Rock/Charge by sentiment.

From Figure 3 it can be seen (Dendrogram on left) that Sirte & Tamara are the two most similar (with the caveat that we are using extremely limited data to illustrate the concept). It is relatively straightforward to see how, in theory, this could be applied to a vast amount of sentiment data (more dimensions, and Lithostratigraphy perhaps), potentially making more non-obvious connections where similar conditions exist, especially if numerical (integer/float) data are extracted from text and/or brought in from additional sources.

These techniques ‘mimic’ some simple human thought processes, hence the term ‘cognitive’. However, machines in my opinion do not read text “like people do”, despite technology marketing slogans. The Geoscientist may, however, benefit from using some of these techniques, which are freely available. After all, why wouldn’t you want to seek the opinion of a crowd of somewhat independent scientists who have authored hundreds of thousands of reports? If it confirms your existing mental model, then it is good confirmatory supporting evidence. If it challenges it, that does not mean you are wrong, but it may stimulate a little more reflection and investigation. Subsequently, you may stick with what you thought. On the other hand, it may radically change it.

Keywords: Sentiment Analysis , Enterprise Search , Big Data , Text Analytics , Machine Learning , Cognitive Search , Insight Engines , Artificial Intelligence (AI) , Geology , Petroleum Systems , Oil and Gas , Geoinformatics


PhD Judged “Top 5” Internationally for Information Science.

Surprised and delighted to be informed that my PhD has been judged in the “Top 5” internationally in 2017 for Information Science in the ProQuest Doctoral Dissertation Award.

My thesis topic was Re-examining and re-conceptualising enterprise search and discovery. The Association for Information Science and Technology (ASIS&T) scope includes any PhD related to, “the production, discovery, recording, storage, representation, retrieval, presentation, manipulation, dissemination, use, and evaluation of information and on the tools and techniques associated with these processes.”

The judges’ comments included: “As far as I know, this is the first comprehensive and holistic work studying enterprise search; this is a pretty relevant theme and the contributions of the thesis are sizeable” and “Findings from this thesis have direct implications for the theories and practices in information science”.

A big thanks to my supervisory team of Professor Simon Burnett (Robert Gordon University) and Dr Laura Muir (Edinburgh Napier) along with everyone who has helped and encouraged me. It further motivates me to continue academic research in this area and to make further contributions to the discipline in what is a tremendously exciting time.