Month: October 2017

The Contradiction & Emergence Engine

This is a general discussion of some ideas I have been formulating for some time, going back to the work I did in 2014 on serendipitous information discovery.

It is becoming commonplace to extract occurrences of entities from document/literature text, along with their associations with other entities and numerical values. This can generate a wealth of structured information (from unstructured text). But what does it mean? How do you determine what is important and what is not?

Whilst it may be possible to generate new insights directly from the structured information extracted from unstructured text, it is not a given. If it does not tell a person or organization anything they did not already know, then it won’t support the generation of new insights. It may not be completely pointless, as it may simply be another piece of evidence ‘confirming’ what is already known.

To compare what has been generated with what is already explicitly known (written down) in corporate databases, a suite of ‘contradiction’ & ‘discovery’ algorithms may be needed. These algorithms could scan the newly created structured information (generated from unstructured text) to identify contradictions with the ‘prevailing view’ already stored in structured databases, as a form of exploratory data analysis. Alternatively, they could compare structured information generated from company documentation (the prevailing view) with structured information generated from external literature.

A simple example could be highlighting a new ‘data point’ in x,y space on a map. A more complex example could be highlighting a much more ‘positive’ sentiment towards a possibility for action, than the currently prevailing view. 

Furthermore, new associations may be formed by ‘joining’ these information sources together; the whole may be greater than the sum of its parts, leading to the emergence of new information and construction of new knowledge by people. An example is Swanson’s ‘ABC method’ of literature-based discovery, which led to the discovery of the link between ‘magnesium deficiency’ and ‘migraines’ that was subsequently proved experimentally. It was only by combining information (it was not present in one source) that the related concepts emerged.
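The ‘joining’ idea can be illustrated with a minimal sketch of an ABC-style search: two terms (A and C) that never appear together in any single source, but which share an intermediate term (B), are surfaced as candidate links. The function name abc_links and the toy term sets are hypothetical and chosen only for illustration; they are not Swanson's data.

```python
from collections import defaultdict
from itertools import combinations

def abc_links(documents):
    """Return {(a, c): {b, ...}} for term pairs never seen together
    in one document but connected through shared intermediate terms."""
    neighbours = defaultdict(set)
    for terms in documents:
        for x, y in combinations(sorted(set(terms)), 2):
            neighbours[x].add(y)
            neighbours[y].add(x)

    hidden = {}
    for a, c in combinations(sorted(neighbours), 2):
        if c in neighbours[a]:
            continue  # already linked directly in at least one source
        shared = neighbours[a] & neighbours[c]
        if shared:
            hidden[(a, c)] = shared
    return hidden

# Toy input (assumed): each set holds the terms found in one source.
docs = [
    {"migraine", "vascular spasm"},
    {"magnesium deficiency", "vascular spasm"},
    {"migraine", "stress"},
]
links = abc_links(docs)
for pair, via in sorted(links.items()):
    print(pair, "via", sorted(via))
```

In practice the candidate list grows quickly, so some ranking (for example by the number of shared intermediate terms) would probably be needed before showing results to a person.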

These are likely to be seen as ‘surprising’ by individuals or organizations; surprise could be described as the response given when information is presented that contradicts the existing ‘mental model’ held towards a state of affairs. Ultimately these could be the sparks for data driven learning.

Well known research methods and techniques such as Mixed Methods, Activity Theory and Triangulation have an inherent sensitivity to integrating diverse ‘data’ and identifying tensions, breakdowns, dissonance and contradictions. They attack a problem from a number of different conceptual levels and angles. I have been doing some research comparing different ‘views’ in the literature towards the same subject and how best to visualize these data. The findings will be presented in a future post/article.

Algorithms that ‘sit on top of databases’ that hold both ‘born structured’ data, as well as ‘derived structured’ data (generated from unstructured text), could be useful assistants to surface these contradictions from a sea of data. Valuable discoveries may also emerge.

Put simply:

CV + EDT = EV

IF EV = CV THEN Confirmation

ELSE IF EV <> CV THEN Contradiction / Emergence

Where:

CV = Current View

EDT = Extracted Data from Text (and/or text external to CV)

EV = Enhanced View
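As a rough sketch of the comparison above, the Current View and the Extracted Data from Text can be treated as simple key/value stores and compared fact by fact. The function name compare_views, the keys and the values are all hypothetical. Splitting ‘Contradiction / Emergence’ into two separate labels (a conflicting value versus a fact absent from the Current View) is my assumption, made for illustration.

```python
def compare_views(current_view, extracted):
    """Label each extracted fact relative to the current view (CV)."""
    report = {}
    for key, value in extracted.items():
        if key not in current_view:
            report[key] = "emergence"      # not present in CV at all
        elif current_view[key] == value:
            report[key] = "confirmation"   # EV agrees with CV
        else:
            report[key] = "contradiction"  # EV differs from CV
    return report

# Hypothetical facts keyed by (area, element).
cv = {("Basin A", "Reservoir"): "present", ("Basin A", "Seal"): "present"}
edt = {
    ("Basin A", "Reservoir"): "present",
    ("Basin A", "Seal"): "absent",
    ("Basin A", "Source Rock"): "present",
}
result = compare_views(cv, edt)
for key, label in result.items():
    print(key, label)
```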

 

Cognitive Computing Assistants for Geoscientists: Automatic sentiment-tone analysis and similarity prediction.

In previous posts I have shown how ‘sentiment/tone/opinion’ can be automatically extracted from text to stimulate serendipity during search and to analyse differences between Geological Formations. One of the findings was that words/sentences deemed ‘negative’ by out-of-the-box algorithms are not necessarily ‘negative’ in a geological sense (such as ‘hard’, ‘spill’, ‘old’, ‘fault’ and ‘thick’). Another finding was that typical entity extraction, and even the entity-entity association extractions and ontologies described in the geoscience literature, tend to leave behind valuable sentiment/tone context. Some of my recent academic research (to be presented at the Geological Society of America (GSA) this month) has focused on using machine learning from training sets to assess ‘tone’ in reports relating to working petroleum system elements in space and time. This effectively targets a generic question that could apply to many domains in Geoscience and beyond:

“Do the right conditions exist for…...?”

Fig 1 shows sentiment geographically and Fig 2 by geological age.

Figure 1 – Map plot. Comparison of positive v negative tone for various elements in different geographical locations

For example, the sentence, “…well northeast of Boomerang Hills has tested this pre-Andean trap concept successfully” will count as a ‘positive’ for the petroleum system element ‘Trap’. The sentence, “downward migration from Upper Cretaceous-sourced oils seems unlikely” will count as a ‘negative’ for the element Migration (SR Charge). Simply ‘counting’ entities is not enough. Consider the mention “The reservoir was absent”: a simple count would register a mention of ‘reservoir’ even though the sentence reports that there is none. Without contextual sentiment it is likely that misleading data will be presented.

This could provide another ‘opinion’ for scientists which could challenge preconceptions about what they may already believe. It could also stimulate learning events (clicking on the sentiment to view the mentions within sentences) that may prove academically and/or commercially valuable.

I have been experimenting with a custom ensemble (skip-gram, lexicon and Bayesian) algorithm that takes word order into account to detect this ‘positive’ and ‘negative’ sentiment (opinion and tone) around entities. Deep learning text embeddings would probably improve the ensemble accuracy (Araque et al. 2017), but I have not used them here as I am testing a very small dataset. See previous posts where I have used these techniques for different purposes on larger datasets.

The proportion of negative v positive instances can then be used to show relative trends (pie-charts in Fig 1) for each element and rolled up to higher level constructs. Figure 2 (multi-series bubble plot) shows the same data (again from a very small sample of USGS reports, to illustrate the concept), focusing on the Source Rock/Charge element. The data matrix can be plotted wherever the algorithms have picked up both a Geological Age and a location ‘mention’ in the text.
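The roll-up from individual mentions to positive v negative proportions can be sketched as follows. This is a minimal illustration rather than the scripts behind the figures; the function name tone_proportions and the sample tuples are hypothetical.

```python
from collections import defaultdict

def tone_proportions(mentions):
    """Return {(area, element): share of positive mentions}
    from (area, element, tone) tuples."""
    counts = defaultdict(lambda: {"positive": 0, "negative": 0})
    for area, element, tone in mentions:
        if tone in ("positive", "negative"):  # neutral is ignored in this sketch
            counts[(area, element)][tone] += 1
    return {
        key: c["positive"] / (c["positive"] + c["negative"])
        for key, c in counts.items()
    }

# Hypothetical mentions: (area, petroleum system element, tone).
sample = [
    ("Basin A", "Trap", "positive"),
    ("Basin A", "Trap", "positive"),
    ("Basin A", "Trap", "negative"),
    ("Basin A", "Migration", "negative"),
    ("Basin A", "Migration", "neutral"),
]
shares = tone_proportions(sample)
for key, share in shares.items():
    print(key, round(share, 2))
```

Each share could drive one pie-chart slice, and the raw counts (rather than the shares) would set the bubble sizes in a plot like Figure 2.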

Figure 2 – Geological time chart plots of sentiment/tone mentions in the context of a geographical area/basin. The larger the bubble, the greater the number of mentions.

The categorization is very coarse (Basin level). Ideally it would be more useful to extract specific Intra-Basin features and/or geographical areas. Geological age of source rocks and charge/migration events are also conflated somewhat in this simplified picture, although they could easily be split out. Also, given enough document volumes, it should be possible to animate Figures 1 and 2 through time, for example to show how sentiment/tone has changed each year from 1990 through to 2017.

By machine reading documents, papers and texts (too many in number to be realistically read by a person & harbouring patterns too subtle to be picked up in any single document) a perspective can be obtained which may challenge individual biases and/or organizational dogma.

Public domain reports from the United States Geological Survey (USGS) were downloaded for testing. Python & TextBlob scripts were used to convert the reports to text, identify mentions of Petroleum System Elements in the text and determine whether the context was ‘positive’ or ‘negative’ sentiment. Geo-referencing can be achieved through the centroid of the country, basin or geographical point of interest associated with that mention.

The algorithm addresses areas such as negation and avoids some of the problems with context-free Bag of Words (BoW) models. For example “Source Rock maturity was not a problem” is a ‘positive’ context, despite having individual ‘negative’ words such as ‘not’ and ‘problem’. This is where traditional lexicon/taxonomy approaches (even using multiple word concepts) are likely to perform poorly.
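To make the negation point concrete, here is a minimal sketch that contrasts a context-free lexicon count with a simple negation-aware variant. This is not the ensemble described in this post: it swaps in a basic rule that flips a word's polarity when a negator appears in the few tokens before it. The tiny lexicon, the negator list and the window size are all assumptions made for illustration.

```python
import re

# Assumed toy lexicon and negators, for illustration only.
LEXICON = {"problem": -1, "unlikely": -1, "absent": -1, "successfully": 1}
NEGATORS = {"not", "no", "never", "without"}

def tokenize(sentence):
    return re.findall(r"[a-z']+", sentence.lower())

def bow_score(sentence):
    """Context-free score: sum of word polarities, word order ignored."""
    return sum(LEXICON.get(tok, 0) for tok in tokenize(sentence))

def negation_aware_score(sentence, window=3):
    """Flip a word's polarity if a negator occurs within `window` tokens before it."""
    tokens = tokenize(sentence)
    score = 0
    for i, tok in enumerate(tokens):
        polarity = LEXICON.get(tok, 0)
        if polarity and NEGATORS & set(tokens[max(0, i - window):i]):
            polarity = -polarity
        score += polarity
    return score

sentence = "Source Rock maturity was not a problem"
print(bow_score(sentence), negation_aware_score(sentence))
```

A fixed window is crude (it will misfire on long or nested clauses), which is one reason a trained model that takes word order into account is likely to do better.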

Further work is ascertaining precision, recall and F1 scores, and I’m currently working on a test set of over 2,000 examples of positive, negative and neutral sentiment about these entities extracted from public domain sources. Differentiating tone into various dimensions may also be useful. These may be promising techniques to augment geoscientists’ cognition, supporting higher level thinking processes rather than just retrieval (remembering) of documents in traditional search applications.
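For reference, per-class precision, recall and F1 can be computed from a labelled test set as in the minimal sketch below. The function name and the five example labels are hypothetical and are not results from the test set mentioned above.

```python
def precision_recall_f1(gold, predicted, label):
    """Precision, recall and F1 for one class, from parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical gold labels and algorithm output.
gold = ["positive", "negative", "neutral", "positive", "negative"]
pred = ["positive", "positive", "neutral", "negative", "negative"]
scores = precision_recall_f1(gold, pred, "positive")
print(scores)
```

Reporting the three scores separately for positive, negative and neutral shows whether the algorithm is weaker on one class, which a single overall accuracy figure would hide.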

Although all Geological Basins are unique, Figure 2 suggests that some Basins/Areas may share common aspects. Utilising positive and negative tone by geological age, clustering techniques can be used on the data matrix to suggest analogues (including Intra-basin) just from the latent structure in text. No prior studies have been found which address this area and ascertain its usefulness. Fig 3 shows one such technique applied to positive/negative tone for the Source Rock/Charge element, with correlations and hierarchical clustering shown in a sequential coloured Heatmap (Metsalu and Vilo 2015). Rows and columns have been automatically re-ordered through clustering; the colours displayed are the values in the data matrix.

Figure 3 – Clustering (Correlation Clustering) Basins/Area and Geological Time for Source Rock/Charge by sentiment.

From Figure 3 it can be seen (dendrogram on left) that Sirte & Tamara are the two most similar (with the caveat that we are using extremely limited data to illustrate the concept). It is relatively straightforward to see how, in theory, this could be applied to a vast amount of sentiment data (more dimensions and Lithostratigraphy perhaps), potentially making more non-obvious connections where similar conditions exist, especially if numerical (integer/float) data is extracted from text and/or brought in from additional sources.
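The idea of finding the most similar pair of areas can be sketched with a plain Pearson correlation between rows of the tone-by-age matrix; the closest pair corresponds to the first merge a correlation-based hierarchical clustering would make. This is a standard-library illustration only, not the tool used to produce Figure 3. The basin names and numbers are made up.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def most_similar_pair(matrix):
    """Return the pair of row names with the highest correlation."""
    return max(
        combinations(sorted(matrix), 2),
        key=lambda pair: pearson(matrix[pair[0]], matrix[pair[1]]),
    )

# Hypothetical net tone (positive minus negative mentions) per geological age bin.
tone_by_age = {
    "Basin A": [2, 0, -1, 3],
    "Basin B": [4, 1, -2, 5],
    "Basin C": [-1, 2, 3, 0],
}
best = most_similar_pair(tone_by_age)
print(best, round(pearson(tone_by_age[best[0]], tone_by_age[best[1]]), 3))
```

Correlation compares the shape of the tone profile across ages rather than its magnitude, so a heavily documented basin and a sparsely documented one can still be suggested as analogues.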

These techniques ‘mimic’ some simple human thought processes, hence the term ‘cognitive’. However, in my opinion machines do not read text “like people do”, despite technology marketing slogans. The Geoscientist may, however, benefit from using some of these techniques, which are freely available. After all, why wouldn’t you want to seek opinion from a crowd of somewhat independent scientists who have authored hundreds of thousands of reports? If it confirms your existing mental model then it’s good confirmatory supporting evidence. If it challenges it, that does not mean you are wrong, but it just may stimulate a little more reflection and investigation. Subsequently, you may stick with what you thought. On the other hand, it may radically change it.

Keywords: Sentiment Analysis, Enterprise Search, Big Data, Text Analytics, Machine Learning, Cognitive Search, Insight Engines, Artificial Intelligence (AI), Geology, Petroleum Systems, Oil and Gas, Geoinformatics