Big Data in the Geosciences

I will be presenting some of my research on detecting “geological sentiment” in text at the Geological Society of London’s Janet Watson meeting on 27th February.

This will include showing how the Geological-sentiment AnalyZER (GAZER) algorithm I developed in Python compares with the sentiment API classifiers from Google Cloud, IBM Watson, Microsoft Azure, Amazon Comprehend and Lexalytics Semantria.

The test set used to compare accuracy is over 1,000 sentences relating to petroleum systems, labelled by retired geologists as ‘positive’, ‘negative’ or ‘neutral’.
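As a rough sketch of how such an accuracy comparison could be scored (this is not the GAZER implementation itself, and the file and column names below are hypothetical), each classifier’s predictions can be compared against the labelled sentences:

```python
# Hypothetical scoring sketch: one row per labelled test sentence, one column of
# predictions per classifier. File and column names are illustrative only.
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report

df = pd.read_csv("petroleum_sentences_labelled.csv")  # columns: sentence, label, gazer, google, ...

for clf in ["gazer", "google", "watson", "azure", "comprehend", "semantria"]:
    print(clf, "accuracy:", round(accuracy_score(df["label"], df[clf]), 3))
    print(classification_report(df["label"], df[clf],
                                labels=["positive", "negative", "neutral"]))
```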

Some interesting findings!


2018 Runner-Up, International iSchools Doctoral Dissertation Award.

Delighted to receive the 2018 Runner-Up Prize for the International iSchools Doctoral Dissertation Award; this is the first time a Scottish university has been recognized in these awards. The awards recognize outstanding work in the information field, specifically the relationship between information, people and technology. My dissertation topic was enterprise search and discovery.

Nominations are solicited from all members of the iSchools organization, now more than 80 universities worldwide, and judged by an award committee drawn from leading international schools. Congratulations to the winner, Galen Panger from the University of California, Berkeley (and a researcher at Google).

A very big thank you to my supervisory team, Professor Simon Burnett and Dr Laura Muir, and colleagues at Robert Gordon University. I’d also like to thank everyone who has helped and encouraged, and continues to help and encourage, my research. It is very much appreciated.

I look forward to many more exciting collaborations as I research social informatics and how advanced analytics and machine learning can be blended with search techniques to augment human-computer interaction in the workplace.

Happy New Year!

Press release here and more details here

Using Streamgraphs to visualize results from geological text analytics


Figure 1 – Frequency of geological concept ‘mentions’ in text co-occurring with petroleum systems elements, by geological time. Streamgraph (stacked area chart) with Sankey-style curves; three visualizations shown for the same data: silhouette [left], expanded [centre], zero-offset [right]. Extracted from 40 public domain articles using Python/RawGraphs.

The ‘dark layers’ are associated with source rock, organic lithologies or anoxic environments. The ‘yellow/orange’ layers are associated with reservoirs, ‘purple’ layers with seals and traps, and ‘red’ with volcanics. The ‘light blue’ is carbonate lithology; the ‘darker blues’ are lacustrine, river and marine depositional environments. The key is shown in Figure 2.

Figure 2 – Silhouette streamgraph and key. Source Rock (SR) in black can be clearly seen.

Streamgraphs can be an ‘emotionally’ engaging and useful way to show the ebb and flow (hence the river metaphor) of large numbers of topic changes over time. Stacked area charts have two main purposes: to show the trend of a specific category as well as trends of aggregated categories. Early work dates back to ThemeRiver in 1999, which identified layer ordering, baseline layers, scaling and colours as the four key design areas. Byron and Wattenberg (2008) emphasize the criticality of ordering and colouring in streamgraphs.
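As a rough illustration (not the exact Python/RawGraphs workflow used for Figure 1), the three layouts can be reproduced with matplotlib’s stackplot; the concept counts below are invented for demonstration:

```python
# Illustrative sketch of the silhouette, expanded and zero-offset streamgraph layouts.
# The concept mention counts are made up.
import numpy as np
import matplotlib.pyplot as plt

periods = ["Triassic", "Jurassic", "Cretaceous", "Paleogene", "Neogene"]
x = np.arange(len(periods))
counts = {
    "source rock": [5, 12, 8, 0, 1],
    "reservoir":   [3, 9, 14, 6, 2],
    "seal":        [1, 4, 6, 3, 1],
}
y = np.array(list(counts.values()), dtype=float)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].stackplot(x, y, baseline="sym", labels=list(counts))  # silhouette (ThemeRiver-style)
axes[1].stackplot(x, y / y.sum(axis=0), baseline="zero")      # expanded (proportions per period)
axes[2].stackplot(x, y, baseline="zero")                      # zero-offset (conventional stack)
for ax, title in zip(axes, ["Silhouette", "Expanded", "Zero-offset"]):
    ax.set_title(title)
    ax.set_xticks(x)
    ax.set_xticklabels(periods, rotation=45)
axes[0].legend(loc="upper left", fontsize=8)
plt.tight_layout()
plt.show()
```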

Quotes from users of systems deploying streamgraphs include “helps see big picture”, “quickly led me to investigate the topics that had unique or extreme temporal qualities” and how events could have triggered such peaks (Bradley et al., 2013).

There are no known published examples applying the technique to geological text extractions by time. Figure 1 may be the first published example. This is interesting as the ‘layering’ effect over time has certain geological connotations. Figure 3 shows the typical horizontal display (by time).


Figure 3 – Horizontal ‘Traditional’ Display

The expanded display forces each topic to be proportionally represented vertically (Figure 4). This can be misleading ‘horizontally’, but can surface some interesting trends that may otherwise remain hidden. For example, in Figure 4 we can see that ‘marl’ in the Paleogene is relatively ‘thicker’ (in relative frequency terms) than other concepts, although mention of source rocks for that time period is absent. That may warrant closer inspection and could perhaps lead to a new insight.


Figure 4 – Expanded display

There may be value in using these visualizations as an interactive interface, allowing geoscientists to characterize a geological basin and drill down to the documents and sentences where the concepts co-occur. Because of their nature, these visualizations may resonate with geoscientists more than other displays for conveying large amounts of data from text analytics and machine-learned topics. This presents an area for further research.

 

 

Geological Expressions: Clustering the results of text analytics for exploratory data analysis.

In previous articles I have discussed how concepts can be detected and extracted from text. The patterns of these concepts (such as their proportions with respect to other concepts) provide a signature or ‘expression’ that can be compared. That could be at a ‘micro’ scale such as a Geological Formation or a ‘macro’ scale such as a Geological Basin.

These multivariate data can be clustered for similarity in a number of ways. Typical Euclidean methods focus on the magnitude of the concept frequencies, whilst Pearson correlation focuses on the relative proportions with respect to one another (the profile). Because of the sampling of text reports and the likelihood that you will not have the same proportion of mentions or documents for every entity you want to compare (e.g. a basin), correlation methods may be better suited to the results of text analytics.
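A toy illustration of the difference (values invented): two basins with identical concept profiles but very different overall mention counts are far apart under Euclidean distance, yet essentially identical under correlation distance.

```python
# Two basins with the same concept profile but different sampling volumes.
import numpy as np
from scipy.spatial.distance import pdist

basin_a = [10, 20, 30]  # concept mention counts from a well-documented basin
basin_b = [1, 2, 3]     # same proportions, far fewer reports sampled
X = np.array([basin_a, basin_b], dtype=float)

print(pdist(X, metric="euclidean"))    # large distance (magnitudes differ)
print(pdist(X, metric="correlation"))  # ~0 (identical profiles)
```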

The toy model in the figure below illustrates how some concepts (along the x-axis) extracted from text for Geological Basins (along the y-axis) can be clustered using correlation. The heatmap shows where the major parameter (concept) variations (outliers) are located: dark red cells in the matrix are above the mean, dark blue below it, and whiter colours around it. For example, concept parameter P21 for Basin #2 sticks out like a sore thumb! Is this an artefact or something more interesting? This is what exploratory data analysis is all about.

[Image: clustered heatmap with dendrograms of concepts (x-axis) against Geological Basins (y-axis)]

The dendrograms cluster the concepts (parameters) along the x-axis and the basins along the y-axis. As you move up a dendrogram, items get further away from one another.

Basins are grouped by Klemme types. So in this example, all terrestrial rift valleys (depicted in orange on the left-hand side) are grouped nicely together. Forearc basins (in green on the left-hand side) can be seen to cluster together; however, Basin #42 (in red – a backarc basin) is clustered in the middle of these. This outlier (based on the data generated from the text analytics) may indicate something unusual about this basin, or perhaps its type has been misinterpreted. It may provide a stimulus for a closer inspection.
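A minimal sketch of this kind of clustered heatmap, assuming a table of concept-mention counts per basin (the file name below is illustrative), could use seaborn’s clustermap with a correlation metric:

```python
# Sketch: correlation-based clustering of basins (rows) by concept frequencies (columns).
# The input file is assumed; z-scoring each concept column gives the above/below-mean
# colouring described in the text.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

counts = pd.read_csv("basin_concept_counts.csv", index_col="basin")

sns.clustermap(counts, metric="correlation", method="average",
               z_score=1, cmap="RdBu_r", figsize=(10, 8))
plt.show()
```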

These techniques differ from one-step unsupervised Latent Semantic Analysis (LSA) or neural probabilistic methods based on word co-occurrence (such as Word2vec). They are effectively a two-step process: first a semi-supervised extraction of concepts, followed by a second step of unsupervised clustering. This has the benefit of specifically targeting the clustering on appropriate geological characteristics, rather than ‘all of the text’, which may unduly bias clustering towards non-geological characteristics – for example, basins being deemed similar simply because they contain similar geographical location words and phrases. This presents an area for further research.

Applying Deep Learning to Geoscience Image Type Classification in Literature: Some Early Research Findings.


Before I delve into this topic, I’ll start with a story that led me here. This year I went on a fossil hunting expedition with my family to the Dorset coast in the UK. We spent several hours scanning the beach, performing our usual ‘pattern recognition’ to look for ‘flying saucer’ shaped pebbles of a certain colour. I was lucky enough to find a nodule containing a Jurassic marine fish (Dapedium). The nodule is shown below: on the left you can see the back of the skull and the thick scales, next to a coin for scale and an artist’s impression of the fish.

[Image: Dapedium nodule]

So what has this to do with Deep Learning?

Well, I tested some photographs I had taken on the beach with the Google and Microsoft Image APIs available on the web (there are others as well, of course, such as open-source TensorFlow libraries that can be used in Python). I took a screen ‘snip’ of the photographs and saved a JPEG to ensure the image had no locational metadata. When I dragged and dropped the image into these APIs, I was stunned to find that one of them had geo-located (positioned on a map) the photograph on the very beach where I had stood and taken the photo! Others may not think this ‘landmark’ detection anything special, but I still marvel at it. It started me thinking about what else we could do with image classification in the geosciences. There has been significant and ongoing research using deep learning on high-resolution images in the geosciences (powerful microscope images of microfossils, SEM, seismic and remote sensing data). But what about the relatively poor quality material (in terms of resolution) typically added to the average document, article and report?

Geoscience literature and reports contain numerous images (charts, maps, sections, plots, etc.). Whilst general open-source Optical Character Recognition (OCR) will extract explicit text from any image, there are other opportunities to extract implicit information about (and from) these image objects.
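For the explicit text, extraction with open-source OCR is straightforward; a small sketch using Tesseract via pytesseract (the file name is illustrative):

```python
# Extract any explicit text from a figure image with open-source OCR.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("lithostrat_chart.png"))
print(text)
```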

Classifying the ‘type’ of image is probably at the lower value end, but may still be useful, especially as some images may not be associated with text, captions or figure labelling.

I will be conducting some Human-Computer Interaction (HCI) user studies in academia with geoscientists from different industries and roles to ascertain what is most important and why.

Classes

Looking through a representative sample of public domain petroleum systems reports, seven common high-level classes of geological image types were identified: seismic sections, maps, photographs, cross plots, stratigraphic charts, logs and cross sections.

A deep learning Convolutional Neural Network (CNN) with transfer learning was applied to balanced training sets of approximately 200 public domain images per class. Of these, approximately 80% were used for training and 20% for testing. Transfer learning ‘piggy-backs’ off pre-built models that have been trained on hundreds of thousands of images, re-using those existing ‘generic’ layers and supplementing them with ‘domain specific’ ones. This is useful because, for many subject domain classes and features, it is likely that only a small number of training images is easily available.

The pre-trained models used include weights from the VGG16 model, a deep (16-layer) convolutional network trained on 1.3 million images across 1,000 general image classes, which generalises well to other datasets. There are models such as ResNet that are much deeper (they can be hundreds of layers), and the ImageNet research initiative contains over 14 million images linked to WordNet.
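A minimal sketch of this transfer learning setup in Keras, assuming the images are organised into per-class folders (directory names, layer sizes and hyperparameters here are illustrative, not the exact configuration used):

```python
# Sketch: freeze VGG16's ImageNet convolutional layers and add new layers for the
# seven geoscience image classes. Paths and hyperparameters are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

train_ds = tf.keras.utils.image_dataset_from_directory(
    "geo_images/train", image_size=(224, 224), batch_size=32)
test_ds = tf.keras.utils.image_dataset_from_directory(
    "geo_images/test", image_size=(224, 224), batch_size=32)
train_ds = train_ds.map(lambda x, y: (preprocess_input(x), y))
test_ds = test_ds.map(lambda x, y: (preprocess_input(x), y))

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # keep the 'generic' pre-trained layers fixed

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(7, activation="softmax"),  # the seven 'domain specific' classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=test_ds, epochs=10)
```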

Results

Testing on the geoscience images and classes gave a projected accuracy of 92.7%. This is the likelihood that an image belonging to one of the seven classes will be classified into the correct class, measured using the examples (the 20%) not used during training. Cut-offs or other techniques can be used to ignore images not related to the pre-defined classes if they are encountered.
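One simple cut-off is a confidence threshold on the top softmax probability (the threshold and probabilities below are illustrative):

```python
# Ignore predictions whose highest class probability is below a threshold,
# treating them as not belonging to any of the seven pre-defined classes.
import numpy as np

CLASSES = ["seismic section", "map", "photograph", "cross plot",
           "stratigraphic chart", "log", "cross section"]

def classify_with_cutoff(probs, threshold=0.8):
    top = int(np.argmax(probs))
    return CLASSES[top] if probs[top] >= threshold else "unclassified"

print(classify_with_cutoff(np.array([0.05, 0.10, 0.10, 0.20, 0.20, 0.20, 0.15])))
# -> 'unclassified': no class is confident enough
```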

Image Classification

You can try the resulting classifier yourself in Google Chrome using the Vize.ai API. Simply drag and drop a sample image and it will return the classification. For the example below, the classifier is 99.9% certain it is a seismic section. Correct!

[Image: seismic section correctly classified with 99.9% confidence]

Sub-Classes

Moving down one level from these classes to sub-classes, I experimented with some map types. The training data available for this short experiment was a little sparse and more uneven, and the differences between the classes were more subtle. This led to an overall accuracy of 76.9%.

This could probably be improved with further iterations, merging some classes and adding more training data. Reviewing the results (see image below), ‘Paleogeographical Maps’, ‘Tectonic Element Maps’ and ‘Seismic Basemaps’ had high accuracy. The subtle differences between the other types of maps led to poorer results, given the limitations of this experiment described above. This presents an area for further research.

[Image: map sub-class classification results]

Features/Objects

In addition to information on what an image is, it is also possible to train a classifier to detect what the image contains. This could range from geological photographs showing depositional, diagenetic and structural features (e.g. faults or folds), and seismic sections showing extensional or compressional features, through to geological cross sections showing roll-over anticlines and lithostratigraphic charts showing symbols for petroleum system elements such as source rock.

The example below shows the latter, with an estimated 90% accuracy. Petroleum Systems Elements (PSE) typically cover source rock, reservoir, migration, seal and trap. Whilst lithostratigraphic charts sometimes have columns labelled with text in a variety of ways (e.g. Source Rock, source, SR, Sr. Charge), these labels are not always present; sometimes a legend is used at the base and sometimes labelling is absent completely. Detecting the presence of these symbols (sometimes black or coloured circles, ticks, diamonds, etc.), and where they occur on the image, without relying on OCR could be useful.

[Image: lithostratigraphic chart with petroleum system element symbols detected]

By providing examples of each, deep learning can detect patterns, enabling classifiers to pick up such nuances. These features may not be described in text, so these techniques may surface information that traditional ‘enterprise search’ approaches miss every time... you may even catch a “big fish” 🙂

Short enterprise search queries: Are users really to blame?


Some practitioners state that users of enterprise search deployments enter far fewer words per search query (1.5 on average) than on the Internet (3.0 on average) and infer that this is one of the causes of poor outcomes. This short article argues that this enterprise search user behaviour, rather than being a cause, is actually a symptom of factors related to the enterprise environment, including corpus sizes and search query parsing algorithms. User search behaviour (agency) may develop as a result of corpus size and query parsing algorithms (structure), explaining some of the differences in search query length between Internet search engines, site search and enterprise search deployments. These may act as a constraining effect in many enterprises, where user behaviour adapts to these structures. This shift in thinking may enable more effective interventions and solution design.

Download the full article on SlideShare.

 

The Contradiction & Emergence Engine

This is a general discussion of some ideas I have been formulating for some time, going back to the work I did in 2014 on serendipitous information discovery.

It is becoming commonplace to extract occurrences of entities in document/literature text, their association with other entities and numerical values. This can generate a wealth of structured information (from unstructured text). But what does it mean? How do you determine what is very important and what is not?

Whilst it may be possible to generate new insights directly from the structured information extracted from unstructured text, it is not a given. If it does not tell a person or organization anything they did not already know, then it won’t support the generation of new insights. It may not be completely pointless, as it may simply be another piece of evidence ‘confirming’ what is already known.

To compare what has been generated with what is already explicitly known (written down) in corporate databases, a suite of ‘contradiction’ and ‘discovery’ algorithms may be needed. These algorithms could scan the newly created structured information (generated from unstructured text) to identify contradictions with the ‘prevailing view’ already stored in structured databases – a form of exploratory data analysis. They could also compare structured information generated from a company’s own documentation (the prevailing view) with structured information generated from external literature.

A simple example could be highlighting a new ‘data point’ in x,y space on a map. A more complex example could be highlighting a much more ‘positive’ sentiment towards a possibility for action than the currently prevailing view.

Furthermore, new associations may be formed by ‘joining’ these information sources together; the whole may be greater than the sum of its parts, leading to the emergence of new information and the construction of new knowledge by people. An example is Swanson’s ‘ABC method’ of literature-based discovery, which led to the discovery of the link between magnesium deficiency and migraines, subsequently proved experimentally. It was only by combining information (it was not present in any one source) that the related concepts emerged.

These are likely to be seen as ‘surprising’ by individuals or organizations; surprise could be described as the response given when information is presented that contradicts the existing ‘mental model’ held towards a state of affairs. Ultimately these could be the sparks for data driven learning.

Well known research methods and techniques such as Mixed Methods, Activity Theory and Triangulation have an inherent sensitivity to integrating diverse ‘data’ and identifying tensions, breakdowns, dissonance and contradictions. They attack a problem from a number of different conceptual levels and angles. I have been doing some research comparing different ‘views’ in the literature towards the same subject and how best to visualize these data. The findings will be presented in a future post/article.

Algorithms that ‘sit on top of databases’ that hold both ‘born structured’ data, as well as ‘derived structured’ data (generated from unstructured text), could be useful assistants to surface these contradictions from a sea of data. Valuable discoveries may also emerge.

Put simply:

CV + EDT = EV

IF EV = CV THEN Confirmation

ELSE IF EV <> CV THEN Contradiction / Emergence

Where:

CV = Current View

EDT = Extracted Data from Text (and/or text external to CV)

EV = Enhanced View
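
A minimal sketch of this logic in Python, assuming both the current view (CV) and the data extracted from text (EDT) are represented as simple attribute dictionaries per entity (all names and values below are invented for illustration):

```python
# Compare an extracted-from-text view (EDT) against the current view (CV) held in
# structured databases, producing an enhanced view (EV) plus per-attribute flags.
def compare_views(cv: dict, edt: dict) -> dict:
    ev, flags = dict(cv), {}
    for key, value in edt.items():
        if key not in cv:
            flags[key] = "emergence"      # new information absent from the current view
            ev[key] = value
        elif cv[key] == value:
            flags[key] = "confirmation"   # EV = CV
        else:
            flags[key] = "contradiction"  # EV <> CV
            ev[key] = (cv[key], value)    # keep both values for human review
    return {"enhanced_view": ev, "flags": flags}

# Invented example: a basin's prevailing view vs. values extracted from new literature.
cv = {"source_rock": "Kimmeridge Clay", "basin_type": "rift"}
edt = {"source_rock": "Kimmeridge Clay", "basin_type": "back-arc", "seal": "evaporite"}
print(compare_views(cv, edt)["flags"])
# {'source_rock': 'confirmation', 'basin_type': 'contradiction', 'seal': 'emergence'}
```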