Cognitive Computing Assistants for Geoscientists: Automatic sentiment-tone analysis and similarity prediction.

In previous posts I have shown how ‘sentiment/tone/opinion’ can be automatically extracted from text to stimulate serendipity during search and to analyse differences between Geological Formations. One of the findings was that words/sentences deemed ‘negative’ by out-of-the-box algorithms are not necessarily ‘negative’ in a geological sense (such as ‘hard’, ‘spill’, ‘old’, ‘fault’ and ‘thick’). Another finding was that typical entity extraction, or even the entity-entity association extractions and ontologies described in the geoscience literature, tends to leave behind valuable sentiment/tone context. Some of my recent academic research (to be presented at the Geological Society of America (GSA) this month) has focused on using machine learning from training sets to assess ‘tone’ in reports relating to working petroleum system elements in space and time. This effectively targets a generic question that could apply to many domains in Geoscience and beyond:

“Do the right conditions exist for…?”

Fig 1 shows sentiment geographically and Fig 2 by geological age.


Figure 1 – Map plot. Comparison of positive v negative tone for various elements in different geographical locations

For example, the sentence, “…well northeast of Boomerang Hills has tested this pre-Andean trap concept successfully” will count as a ‘positive’ for the petroleum system element ‘Trap’. The sentence, “downward migration from Upper Cretaceous-sourced oils seems unlikely” will count as a ‘negative’ for the element Migration (SR Charge). Simply ‘counting’ entities is not enough. Consider the mention “The reservoir was absent”: a plain entity count would register it as one more mention of the Reservoir element, even though the sentence reports that the reservoir is missing. Without contextual sentiment it is likely that misleading data will be presented.

This could provide another ‘opinion’ for scientists which could challenge preconceptions about what they may already believe. It could also stimulate learning events (clicking on the sentiment to view the mentions within sentences) that may prove academically and/or commercially valuable.

I have been experimenting with a custom ensemble algorithm (skip-gram, lexicon and Bayesian) that takes word order into account to detect this ‘positive’ and ‘negative’ sentiment (opinion and tone) around entities. Deep learning text embeddings would probably improve the ensemble accuracy (Araque et al 2017), but I have not used them here as I am testing a very small dataset. See previous posts where I have used these techniques for different purposes on larger datasets.

The proportion of negative v positive instances can then be used to show relative trends (pie-charts in Fig 1) for each element and rolled up to higher level constructs. Figure 2 (multi-series bubble plot) shows the same data (again from a very small sample of USGS reports, to illustrate the concept) focusing on the Source Rock/Charge element. A cell of the data matrix can be plotted wherever the algorithms have picked up both a Geological Age and a location ‘mention’ in the text.
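The tallying behind the pie-charts can be sketched in a few lines. This is a minimal illustration rather than the scripts used for the figures: the `mentions` list, the function names and the label strings are all made up for the example, and the pairs are not real results.

```python
from collections import Counter, defaultdict

# Hypothetical (element, label) pairs, as a sentiment classifier might emit them.
# Illustrative values only, not taken from the USGS reports.
mentions = [
    ("Trap", "positive"),
    ("Trap", "positive"),
    ("Trap", "negative"),
    ("Migration", "negative"),
    ("Reservoir", "negative"),
    ("Reservoir", "positive"),
]


def tone_proportions(mentions):
    """Return {element: {"positive": share, "negative": share}}."""
    counts = defaultdict(Counter)
    for element, label in mentions:
        counts[element][label] += 1

    shares = {}
    for element, c in counts.items():
        total = c["positive"] + c["negative"]
        if total == 0:
            continue  # element only has neutral mentions; nothing to plot
        shares[element] = {
            "positive": c["positive"] / total,
            "negative": c["negative"] / total,
        }
    return shares


def roll_up(mentions):
    """Pool every element into one overall positive share."""
    labels = Counter(label for _, label in mentions)
    total = labels["positive"] + labels["negative"]
    return labels["positive"] / total if total else 0.0


if __name__ == "__main__":
    for element, share in tone_proportions(mentions).items():
        print(element, share)
    print("overall positive share:", roll_up(mentions))
```

Each inner dictionary corresponds to one pie-chart, and `roll_up` is the simplest possible aggregation to a higher level construct; a weighted roll-up would be an equally reasonable choice.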


Figure 2 – Geological Time chart of sentiment/tone mentions in the context of a geographical area/basin. The larger the bubble, the greater the number of mentions.

The categorization is very coarse (Basin level). Ideally it would be more useful to extract specific Intra-Basin features and/or geographical areas. Geological age of source rocks and charge/migration events are also conflated somewhat in this simplified picture, although they could easily be split out. Also, given enough document volumes, it should be possible to animate Figures 1 and 2 through time, for example to show how sentiment/tone has changed each year from 1990 through to 2017.

By machine reading documents, papers and texts (too many in number to be realistically read by a person, and harbouring patterns too subtle to be picked up in any single document), a perspective can be obtained which may challenge individual biases and/or organizational dogma.

Public domain reports from the United States Geological Survey (USGS) were downloaded to test the approach. Python & TextBlob scripts were used to convert the reports to text, identify mentions of Petroleum System Elements in the text and determine whether the context was ‘positive’ or ‘negative’ sentiment. Geo-referencing can be achieved through the centroid of the country, basin or geographical point of interest associated with that mention.

The algorithm addresses areas such as negation and avoids some of the problems with context-free Bag of Words (BoW) models. For example, “Source Rock maturity was not a problem” is a ‘positive’ context, despite having individual ‘negative’ words such as ‘not’ and ‘problem’. This is where traditional lexicon/taxonomy approaches (even using multiple word concepts) are likely to perform poorly.
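To make the negation point concrete, here is a minimal sketch of a lexicon scorer that flips polarity when a negator appears shortly before a polar word. It is a deliberately simple stand-in and not the ensemble described above: the tiny `POLARITY` lexicon, the `NEGATORS` set and the three-token `window` are assumptions chosen for the example.

```python
import re

# Illustrative mini-lexicon (an assumption for this sketch); a real one
# would be tuned to geological usage.
POLARITY = {
    "problem": -1,
    "unlikely": -1,
    "absent": -1,
    "successfully": 1,
    "good": 1,
}
NEGATORS = {"not", "no", "never", "without"}


def tone(sentence, window=3):
    """Label a sentence by summing word polarities, flipping any polar word
    that has a negator within `window` tokens before it.

    With window=0 no negator is ever seen, so the function behaves like a
    context-free Bag of Words scorer.
    """
    tokens = re.findall(r"[a-z']+", sentence.lower())
    score = 0
    for i, token in enumerate(tokens):
        polarity = POLARITY.get(token, 0)
        if polarity == 0:
            continue
        preceding = tokens[max(0, i - window):i]
        if any(t in NEGATORS for t in preceding):
            polarity = -polarity
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    sentence = "Source Rock maturity was not a problem"
    print(tone(sentence))            # negation-aware
    print(tone(sentence, window=0))  # context-free Bag of Words behaviour
```

Setting `window=0` shows the Bag of Words failure directly: the same sentence is then scored on ‘problem’ alone. A fixed window is crude (it ignores clause boundaries and double negatives), which is part of the reason for combining several methods.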

Further work involves ascertaining precision, recall and F1 scores, and I’m currently working on a test set of over 2,000 examples of positive, negative and neutral sentiment about these entities extracted from public domain sources. Differentiating tone into various dimensions may also be useful. These may be promising techniques to augment geoscientists’ cognition, supporting higher level thinking processes rather than just retrieval (remembering) of documents in traditional search applications.
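For reference, the three scores can be computed per class from paired gold and predicted labels, as in the sketch below. The `gold` and `predicted` lists are invented to show the arithmetic and are not results from the test set; the function names are also just for illustration.

```python
def per_class_scores(gold, predicted, label):
    """Return (precision, recall, F1) for one label."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)  # true positives
    fp = sum(g != label and p == label for g, p in pairs)  # false positives
    fn = sum(g == label and p != label for g, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def macro_f1(gold, predicted, labels):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_scores(gold, predicted, l)[2] for l in labels) / len(labels)


# Invented labels, purely to illustrate the calculation.
gold = ["positive", "positive", "negative", "neutral", "negative", "positive"]
predicted = ["positive", "negative", "negative", "neutral", "positive", "positive"]
LABELS = ["positive", "negative", "neutral"]

if __name__ == "__main__":
    for label in LABELS:
        print(label, per_class_scores(gold, predicted, label))
    print("macro F1:", macro_f1(gold, predicted, LABELS))
```

Macro averaging treats the three classes equally, which seems a sensible default if neutral sentences dominate the corpus; a weighted average is the obvious alternative.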

Although all Geological Basins are unique, Figure 2 suggests that some Basins/Areas may share common aspects. Utilising positive and negative tone by geological age, clustering techniques can be used on the data matrix to suggest analogues (including Intra-basin) just from the latent structure in text. No prior studies have been found which address this area and ascertain its usefulness. Fig 3 shows one such technique applied to positive/negative tone for the Source Rock/Charge element, with correlations and hierarchical clustering shown in a sequential coloured Heatmap (Metsalu and Vilo 2015). Rows and columns have been automatically re-ordered through clustering; the colours displayed are the values in the data matrix.


Figure 3 – Clustering (Correlation Clustering) Basins/Area and Geological Time for Source Rock/Charge by sentiment.

From Figure 3 it can be seen (Dendrogram on left) that Sirte & Tamara are the two most similar (with the caveat that we are using extremely limited data to illustrate the concept). It is relatively straightforward to see how, in theory, this could be applied to a vast amount of sentiment data (more dimensions and Lithostratigraphy perhaps), potentially making more non-obvious connections where similar conditions exist, especially if numerical (integer/float) data is extracted from text and/or brought in from additional sources.
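The idea behind the dendrogram can be sketched with nothing more than Pearson correlation and a small agglomerative loop. This is an assumption-laden stand-in for the heatmap tool used for Figure 3, not a reproduction of it: the placeholder basin names, the `NET_TONE` values (positive minus negative mentions per age) and the choice of average linkage are all made up for the example.

```python
from itertools import combinations
from math import sqrt

# Hypothetical matrix: rows are areas, columns are geological ages, values
# are net tone (positive minus negative mentions). Not real data.
AGES = ["Jurassic", "Cretaceous", "Paleogene", "Neogene"]
NET_TONE = {
    "Basin A": [4, 2, -1, 0],
    "Basin B": [5, 3, -2, 1],
    "Basin C": [-1, 0, 3, 4],
    "Basin D": [0, -2, 2, 5],
}


def pearson(x, y):
    """Pearson correlation of two equal-length rows (rows must not be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def distance(a, b):
    """Correlation distance: 0 for identical profiles, 2 for opposite ones."""
    return 1 - pearson(NET_TONE[a], NET_TONE[b])


def cluster_distance(c1, c2):
    """Average linkage: mean distance over all cross-cluster pairs."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def agglomerate(names):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


if __name__ == "__main__":
    for left, right in agglomerate(list(NET_TONE)):
        print(left, "+", right)
```

The first merge printed is the most similar pair, which is what the lowest join in a dendrogram shows. With more rows (or Lithostratigraphy as extra columns) the same loop applies, although a library implementation would be far faster than this quadratic sketch.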

These techniques ‘mimic’ some simple human thought processes, hence the term ‘cognitive’. However, machines in my opinion do not read text “like people do”, despite technology marketing slogans. The Geoscientist may, however, benefit from using some of these techniques, which are freely available. After all, why wouldn’t you want to seek opinion from a crowd of somewhat independent scientists who have authored hundreds of thousands of reports? If it confirms your existing mental model then it’s good supporting evidence. If it challenges it, that does not mean you are wrong, but it may stimulate a little more reflection and investigation. Subsequently, you may stick with what you thought. On the other hand, it may radically change it.

Keywords: Sentiment Analysis, Enterprise Search, Big Data, Text Analytics, Machine Learning, Cognitive Search, Insight Engines, Artificial Intelligence (AI), Geology, Petroleum Systems, Oil and Gas, Geoinformatics


