I’ve clustered the labels I recently annotated for 22,528 sentences (extracted from randomly sampled public domain petroleum exploration reports). There are 73 labels; a subset is shown in the poster above. Together the labels represent 96,197 label relations (arc edges).
The hierarchical cluster heatmap (Metsalu and Vilo 2015) in the poster uses Pearson correlation rather than Euclidean distance, as it is better suited to text extractions and to clustering the ‘DNA’ profiles of geoscience elements. Red and orange colours indicate above-average association between labels; dark blue indicates the opposite. Label categories (commercial, geoscience, petroleum system, potential negativity and potential play/opportunity) are colour coded in pastel colours along the edges of the heatmap. Principal Component Analysis (PCA) and KnowledgeGraph plots are also included in the poster to hint at the richness of these annotations.
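To make the correlation-based clustering idea concrete, here is a minimal sketch in plain Python. It is not the tool used for the poster: the sentences and label sets are made-up toy data, the function names are hypothetical, and average linkage is an assumption on my part. Each label gets a co-occurrence profile, the distance between two labels is taken as one minus the Pearson correlation of their profiles, and the closest clusters are merged step by step.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy data: each sentence is the set of labels annotated on it.
sentences = [
    {"tectonics", "magmatism"},
    {"tectonics", "magmatism", "trap"},
    {"salt tectonics", "trap"},
    {"salt tectonics", "trap", "opportunity"},
    {"tectonics", "magmatism"},
    {"opportunity", "trap"},
]
labels = sorted(set().union(*sentences))


def cooccurrence_profile(label):
    """Count sentences in which `label` appears together with each label."""
    return [sum(1 for s in sentences if label in s and other in s)
            for other in labels]


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


profiles = {label: cooccurrence_profile(label) for label in labels}


def distance(a, b):
    """Correlation distance: 0 for identical profiles, up to 2 for opposite."""
    return 1.0 - pearson(profiles[a], profiles[b])


def cluster_distance(c1, c2):
    """Average linkage (an assumption): mean pairwise distance between clusters."""
    return sum(distance(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def average_linkage(items):
    """Merge the two closest clusters repeatedly; return the merge order."""
    clusters = [(item,) for item in items]
    merges = []
    while len(clusters) > 1:
        best = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        clusters = [c for c in clusters if c not in best] + [best[0] + best[1]]
        merges.append(best)
    return merges


merges = average_linkage(labels)
for left, right in merges:
    print(left, "+", right)
```

In this toy set ‘tectonics’ and ‘magmatism’ always appear together, so their profiles are identical and they are merged first. Because correlation compares the shape of a profile rather than its magnitude, a frequently used label and a rarely used one can still cluster together, which is one reason it can suit this kind of data better than Euclidean distance.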
I’ve highlighted a few areas where annotations preferentially associate and group together. For example, ‘tectonics’ with ‘magmatism’ and, at a finer scale within the ‘petroleum system’, ‘salt tectonics’ with ‘trap’.
This is all a bit of fun really, as this clustering uses only the annotations and so is pretty coarse. It is unlikely a geoscientist will discover something they don’t already know. In the coming weeks and months I will start building much finer grained supervised machine learning models using both the words in the labelled sentences (467,314 words) and the annotations. The number of word order combinations in such a set will run into many millions.
I have created labels such as ‘opportunity’ for situations I identified as potentially favourable for oil and gas plays and opportunities. Using these (combined with other labels), I will test to what extent (if at all) ‘unseen’ favourable patterns can be detected, highlighting potentially new oil and gas plays and opportunities just from text.