Month: December 2017

Using Streamgraphs to visualize results from geological text analytics


Figure 1 – Frequency of geological concept ‘mentions’ in text co-occurring with petroleum systems elements, by geological time. Streamgraph (stacked area chart with Sankey curves); three visualizations of the same data: silhouette [left], expanded [centre], zero-offset [right]. Extracted from 40 public domain articles using Python/RawGraphs.

The ‘dark layers’ are associated with source rock, organic lithologies or anoxic environments. The ‘yellow/orange’ layers are associated with reservoirs, the ‘purple’ layers are seals and traps, and the ‘red’ layers are volcanics. ‘Light blue’ is carbonate lithology; ‘darker blue’ is lacustrine, river and marine depositional environments. The key is shown in Figure 2.

Figure 2 – Silhouette streamgraph and key. Source rock (SR), in black, can be clearly seen.

Streamgraphs can be an ‘emotionally’ engaging and useful way to show the ebb and flow (hence the river metaphor) of a large number of topics changing over time. Stacked area charts have two main purposes: to show the trend of a specific category and to show the trend of the aggregated categories. Early work dates back to ThemeRiver in 1999; the ordering of the layers, the baseline, scaling and colours are the four key design areas. Byron and Wattenberg (2008) emphasize the criticality of ordering and colouring in streamgraphs.

Quotes from users of systems deploying streamgraphs include “helps see big picture” and “quickly led me to investigate the topics that had unique or extreme temporal qualities”, along with comments on how events could have triggered such peaks (Bradley et al 2013).

There are no known published examples applying the technique to geological text extractions by time. Figure 1 may be the first published example. This is interesting as the ‘layering’ effect over time has certain geological connotations. Figure 3 shows the typical horizontal display (by time).


Figure 3 – Horizontal ‘Traditional’ Display

The expanded display forces each topic to be represented vertically in proportion to the total for that time period (Figure 4). This can be misleading ‘horizontally’, because thicknesses are no longer comparable between time periods, but it can surface some interesting trends that may otherwise remain hidden. For example, in Figure 4 we can see that ‘marl’ in the Paleogene is ‘thicker’ than other concepts (from a relative frequency perspective), although mention of source rocks for that time period is absent. That may warrant closer inspection and could perhaps lead to a new insight.
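To make the difference between the three displays concrete, here is a minimal sketch of how the layer boundaries could be computed. It is not the code behind the figures (those were drawn with RawGraphs); the function name `stream_layouts` and the counts are hypothetical, and layer ordering and curve smoothing are ignored.

```python
def stream_layouts(layers):
    """Return per-layer (lower, upper) boundaries for three stacked layouts.

    layers: equal-length lists of mention counts, one list per concept,
    one value per time bin.
    """
    n_bins = len(layers[0])
    totals = [sum(layer[t] for layer in layers) for t in range(n_bins)]

    def stack(baseline, divisor):
        # Pile each layer on top of the previous one, starting at the baseline.
        bands, lower = [], list(baseline)
        for layer in layers:
            upper = [lo + v / d for lo, v, d in zip(lower, layer, divisor)]
            bands.append((lower, upper))
            lower = upper
        return bands

    ones = [1.0] * n_bins
    return {
        # Zero-offset: an ordinary stacked area chart resting on y = 0.
        "zero": stack([0.0] * n_bins, ones),
        # Silhouette: the whole stack is centred on y = 0.
        "silhouette": stack([-t / 2 for t in totals], ones),
        # Expanded: every time bin is rescaled so the stack sums to 1.
        "expanded": stack([0.0] * n_bins, [t or 1.0 for t in totals]),
    }


# Hypothetical counts for three time bins (not the data behind the figures).
counts = {
    "source rock": [4, 0, 2],
    "reservoir": [6, 1, 2],
    "marl": [0, 3, 0],
}
layouts = stream_layouts(list(counts.values()))

marl_lower, marl_upper = layouts["expanded"][2]
marl_share = [hi - lo for lo, hi in zip(marl_lower, marl_upper)]
```

In this toy data ‘marl’ has only three mentions in the second time bin, yet it fills three quarters of the expanded display there because so little else is mentioned in that bin. That is the kind of relative ‘thickening’ the expanded view brings out, and also why it should be read alongside the absolute counts.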


Figure 4 – Expanded display

There may be value in using these visualizations as an interactive interface, allowing geoscientists to characterize a geological basin and drill down to the documents and sentences where the concepts co-occur. Due to the nature of these visualizations, they may resonate with geoscientists more than other displays when conveying large amounts of data from text analytics and machine-learned topics. This presents an area for further research.
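As a rough illustration of what the drill-down could rest on, the sketch below builds an index from a (geological period, concept) pair to the sentences that mention both. Everything in it is an assumption for illustration: the tiny `PERIODS` and `CONCEPTS` lists, the naive substring matching and sentence splitting, and the two made-up documents.

```python
import re
from collections import defaultdict

PERIODS = ["jurassic", "cretaceous", "paleogene"]  # hypothetical subset
CONCEPTS = ["source rock", "reservoir", "marl"]    # hypothetical subset


def build_index(documents):
    """Map (period, concept) to the (doc_id, sentence) pairs mentioning both."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # Naive sentence split on whitespace that follows ., ! or ?
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            lowered = sentence.lower()
            for period in PERIODS:
                if period not in lowered:
                    continue
                for concept in CONCEPTS:
                    if concept in lowered:
                        index[(period, concept)].append((doc_id, sentence))
    return index


# Made-up documents, purely for illustration.
documents = {
    "doc1": "Jurassic source rock is widespread. The Paleogene marl is thick.",
    "doc2": "A Cretaceous reservoir overlies Jurassic source rock.",
}
index = build_index(documents)

# The length of each list is the height of one streamgraph layer in one
# time bin; the list itself is what a click on that layer could display.
jurassic_source_hits = index[("jurassic", "source rock")]
```

The same index serves both purposes: counting its entries gives the frequencies to plot, and keeping the sentences gives the user somewhere to land when they click.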

 

 


Geological Expressions: Clustering the results of text analytics for exploratory data analysis.

In previous articles I have discussed how concepts can be detected and extracted from text. The patterns of these concepts (such as their proportions with respect to other concepts) provide a signature or ‘expression’ that can be compared. That could be at a ‘micro’ scale such as a Geological Formation or a ‘macro’ scale such as a Geological Basin.

These multivariate data can be clustered for similarity in a number of ways. Typical Euclidean distance methods focus on the magnitude of the concept frequencies, whilst Pearson correlation focuses on their relative proportions with respect to one another (the profile). Because text reports are sampled unevenly, you are unlikely to have the same number of mentions or documents for every entity you want to compare (e.g. basin), so correlation methods may be better suited to the results of text analytics.
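A small worked example may help here. The counts below are hypothetical: basin B has exactly the same concept profile as basin A but ten times as many mentions (say, because far more reports were available), while basin C has a similar volume of mentions to A but a different profile.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two vectors of concept counts."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    """Pearson correlation between two vectors of concept counts."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical mention counts for four concepts in three basins.
basin_a = [10, 20, 5, 40]
basin_b = [100, 200, 50, 400]  # same profile as A, ten times the mentions
basin_c = [12, 6, 30, 27]      # similar volume to A, different profile

print("Euclidean A-B:", round(euclidean(basin_a, basin_b), 1))
print("Euclidean A-C:", round(euclidean(basin_a, basin_c), 1))
print("Pearson   A-B:", round(pearson(basin_a, basin_b), 3))
print("Pearson   A-C:", round(pearson(basin_a, basin_c), 3))
```

Euclidean distance puts A closer to C than to B simply because their totals are similar, whereas correlation treats A and B as near-identical and A and C as only weakly related. Note that this simple `pearson` function would fail on a basin whose counts are all equal (zero variance).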

The toy model in the figure below illustrates how some concepts (along the x-axis) extracted from text for Geological Basins (along the y-axis) can be clustered using correlation. The heatmap shows where the major parameter (concept) variations (outliers) are located: dark red cells in the matrix are above the mean, dark blue cells are below it, and whiter colours are around it. For example, concept parameter P21 for Basin #2 sticks out like a sore thumb! Is this an artefact or something more interesting? This is what exploratory data analysis is all about.

[Figure: clustered heatmap of concepts (x-axis) against basins (y-axis), with dendrograms on both axes]

The dendrograms cluster the concepts (parameters) along the x-axis and the basins along the y-axis. The higher up the dendrogram two items are joined, the further apart they are.

Basins are grouped by Klemme type. In this example, all terrestrial rift valleys (depicted in orange on the left-hand side) are grouped nicely together. Forearc basins (in green on the left-hand side) can also be seen to cluster together; however, Basin #42 (in red, a backarc basin) is clustered in the middle of them. This outlier (based on the data generated from the text analytics) may indicate something unusual about this basin, or perhaps its type has been misinterpreted. It may provide a stimulus for closer inspection.
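For readers who want to see how such a dendrogram comes about, here is a minimal sketch of agglomerative clustering using correlation distance (1 minus the Pearson correlation) and average linkage. Both choices are assumptions on my part for illustration, the basin names and counts are hypothetical, and in practice a library routine would normally be used instead of hand-rolled code.

```python
import math


def correlation_distance(a, b):
    """1 - Pearson correlation: 0 for identical profiles, 2 for opposite ones."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1 - cov / (sd_a * sd_b)


def average_linkage(profiles):
    """Repeatedly merge the two closest clusters; return (left, right, height)."""
    clusters = [(name,) for name in profiles]
    merges = []

    def dist(c1, c2):
        # Average distance over every cross-cluster pair of basins.
        pairs = [(p, q) for p in c1 for q in c2]
        total = sum(correlation_distance(profiles[p], profiles[q]) for p, q in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        candidates = [
            (dist(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        height, i, j = min(candidates)
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical concept counts for four basins (not the toy model in the figure).
profiles = {
    "Rift A": [9, 1, 2, 8],
    "Rift B": [18, 3, 5, 15],
    "Forearc A": [1, 9, 8, 2],
    "Forearc B": [2, 7, 9, 1],
}
merges = average_linkage(profiles)
for left, right, height in merges:
    print(left, "+", right, "joined at height", round(height, 3))
```

Each merge is one junction in the dendrogram and its height is the distance at which the two branches join. With these made-up numbers the two rifts join first, then the two forearcs, and the two groups only meet near the top. A basin that joined the ‘wrong’ group low down would be the equivalent of the outlier described above.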

These techniques differ from one-step unsupervised methods such as Latent Semantic Analysis (LSA) or neural probabilistic models of word co-occurrence (such as Word2vec). They are effectively a two-step process: first a semi-supervised extraction of concepts, then an unsupervised clustering. This has the benefit of targeting the clustering on appropriate geological characteristics, rather than on ‘all of the text’, which may unduly bias clustering towards non-geological characteristics; for example, basins being deemed similar simply because they contain similar geographical location words and phrases. This presents an area for further research.
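The first of the two steps can be sketched very simply. The example below uses a plain dictionary lookup as a stand-in for the semi-supervised extraction (the real extraction is more involved); the `LEXICON`, the function name `concept_profile` and the sample text are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical lexicon: each concept is detected through a few surface terms.
LEXICON = {
    "source_rock": ["source rock", "black shale", "anoxic"],
    "reservoir": ["sandstone", "reservoir"],
    "seal": ["seal", "caprock"],
}


def concept_profile(text, lexicon=LEXICON):
    """Return each concept's share of all concept mentions found in the text."""
    lowered = text.lower()
    counts = Counter()
    for concept, terms in lexicon.items():
        for term in terms:
            # Whole-word match only; plurals and spelling variants are ignored.
            pattern = r"\b" + re.escape(term) + r"\b"
            counts[concept] += len(re.findall(pattern, lowered))
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in lexicon}


# Made-up text for one basin.
text = (
    "The basin is located offshore Brazil. Anoxic black shale forms the "
    "source rock; a sandstone reservoir lies beneath a regional seal."
)
profile = concept_profile(text)
print(profile)
```

Only words in the lexicon contribute to the profile, so ‘offshore Brazil’ has no influence on it. The resulting vectors, one per basin, are what the second (unsupervised) clustering step would then compare.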