In previous articles I have discussed how concepts can be detected and extracted from text. The patterns of these concepts (such as their proportions with respect to other concepts) provide a signature or ‘expression’ that can be compared. That could be at a ‘micro’ scale such as a Geological Formation or a ‘macro’ scale such as a Geological Basin.
These multivariate data can be clustered by similarity in a number of ways. Typical Euclidean methods focus on the magnitude of the concept frequencies, whilst Pearson correlation focuses on the relative proportions of the concepts with respect to one another (the profile). Because text reports are sampled unevenly, you are unlikely to have the same number of mentions or documents for every entity you want to compare (e.g. a basin), so correlation methods may be better suited to the results of text analytics.
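To make the contrast concrete, here is a minimal sketch in plain Python. The basin names and concept counts are invented purely for illustration, and the two helper functions are written out by hand rather than taken from any particular library. Basin B has the same profile as Basin A but ten times as many mentions, whereas Basin C has a similar number of mentions to Basin A but a different profile.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance: sensitive to the magnitude of the counts."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_correlation(a, b):
    """Pearson correlation: sensitive to the shape of the profile, not its scale."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical concept counts for three basins (made up for illustration)
basin_a = [10, 20, 5, 40]
basin_b = [100, 200, 50, 400]  # same proportions as A, ten times the mentions
basin_c = [12, 18, 30, 9]      # similar volume to A, different proportions

for name, other in (("B", basin_b), ("C", basin_c)):
    print(f"A vs {name}: "
          f"euclidean={euclidean_distance(basin_a, other):.1f}, "
          f"pearson={pearson_correlation(basin_a, other):.2f}")
```

Under these assumed counts, the Euclidean distance puts Basin A closer to Basin C simply because the two have a similar volume of mentions, while the correlation treats A and B as having the same profile.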
The toy model in the figure below illustrates how some concepts (along the x-axis) extracted from text for Geological Basins (along the y-axis) can be clustered using correlation. The Heatmap shows where the major parameter (concept) variations (outliers) are located: dark red cells in the matrix are above the mean, dark blue cells are below it, and whiter colours are around it. For example, concept parameter P21 for Basin #2 sticks out like a sore thumb! Is this an artefact or something more interesting? This is what exploratory data analysis is all about.
The Dendrograms cluster the concepts (parameters) along the x-axis and the basins along the y-axis. The further up the Dendrogram two items are joined, the more dissimilar they are.
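The merge heights in a dendrogram can be sketched with a small agglomerative clustering routine. This is a plain Python illustration under stated assumptions, not the method used to produce the figure: the basin names and counts are invented, the distance is taken to be 1 minus the Pearson correlation, and average linkage is an arbitrary choice. In practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` would normally be used instead.

```python
import math

def correlation_distance(a, b):
    """1 - Pearson correlation: near 0 for similar profiles, up to 2 for opposite ones."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

def average_linkage(profiles):
    """Repeatedly merge the two closest clusters; return (left, right, height) per merge."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between the members of the two clusters
                dists = [correlation_distance(profiles[p], profiles[q])
                         for p in clusters[i] for q in clusters[j]]
                height = sum(dists) / len(dists)
                if best is None or height < best[0]:
                    best = (height, i, j)
        height, i, j = best
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical concept counts per basin (made up for illustration)
profiles = {
    "Basin A": [10, 20, 5, 40],
    "Basin B": [95, 210, 48, 390],
    "Basin C": [30, 5, 25, 2],
    "Basin D": [60, 12, 48, 5],
}

merges = average_linkage(profiles)
for left, right, height in merges:
    print(f"{left} + {right} joined at height {height:.3f}")
```

With these assumed counts, the basins with similar profiles are joined low down, and the two resulting groups are only joined at the top, at a much greater height.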
Basins are grouped by Klemme Types. So in this example, all terrestrial rift valleys (depicted in orange on the left-hand side) are grouped nicely together. Forearc basins (in green on the left-hand side) can also be seen to cluster together; however, Basin #42 (in red, a Backarc basin) is clustered in the middle of them. This outlier (based on the data generated from the text analytics) may indicate something unusual about this basin, or perhaps its type has been misinterpreted. It may provide a stimulus for a closer inspection.
These techniques differ from one-step unsupervised methods such as Latent Semantic Analysis (LSA) or neural probabilistic models of word co-occurrence (such as Word2vec). They are effectively a two-step process: first a semi-supervised extraction of concepts, then an unsupervised clustering. This has the benefit of specifically targeting the clustering on appropriate geological characteristics, rather than on ‘all of the text’, which may unduly bias clustering towards non-geological characteristics. For example, basins might be deemed similar simply because their texts contain similar geographical location words and phrases. This presents an area for further research.
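The first step can be sketched as follows. This is only a stand-in for the semi-supervised extraction, assuming a small hand-built lexicon; the concepts, terms and sample report are all hypothetical. Only lexicon terms are counted, so geographical words such as 'North Sea' never reach the profile that the second (clustering) step works on.

```python
import re
from collections import Counter

# Hypothetical concept lexicon (concept -> surface terms), invented for illustration
CONCEPT_LEXICON = {
    "source_rock": ["source rock", "shale"],
    "reservoir": ["reservoir", "sandstone"],
    "seal": ["seal", "evaporite"],
}

def concept_counts(text):
    """Count whole-word mentions of each concept's terms, ignoring all other words."""
    text = text.lower()
    counts = Counter()
    for concept, terms in CONCEPT_LEXICON.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            counts[concept] += len(re.findall(pattern, text))
    return counts

def concept_profile(text):
    """Convert counts to proportions so that reports of different lengths are comparable."""
    counts = concept_counts(text)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in CONCEPT_LEXICON}

# Hypothetical report text (made up for illustration)
report = ("The North Sea reservoir is a sandstone with an evaporite seal. "
          "The shale is the main source rock, and the sandstone reservoir is thick.")

print(concept_profile(report))
```

Profiles built this way for each basin are what would then be passed to the unsupervised clustering step.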