Unsupervised machine learning techniques exploit latent patterns in text (in layman’s terms, normally some form of complex word co-occurrence) rather than rules driven by human-labelled data. As this is essentially an inductive technique, it can be useful for stimulating ideas and questions that the information professional or geoscientist may not have thought of a priori without the help of the algorithm.
A typical output is a cross-plot representing the most discriminatory features found in the text through various statistical methods. The goal of clustering these data is to sub-divide a set of items so that similar items fall into the same cluster and dissimilar items into different clusters. The common view is that there is no one-size-fits-all solution to clustering, or even a consensus on what ‘good’ clustering should look like. Some clustering algorithms force you to choose the number of clusters to compute; others identify hierarchical relationships between clusters. Ultimately, any algorithm imposes its own set of biases on the clusters it constructs, so to reduce reliance on the biases of any single algorithm, a rule of thumb may be to use several clustering algorithms and compare the results.
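To make the contrast between the two families of algorithm concrete, here is a minimal sketch in plain Python. It assumes invented 2-D points standing in for text feature vectors, and the function names (kmeans, agglomerative) are hypothetical helpers written for this illustration, not a library API. The k-means version needs the number of clusters up front and uses a deliberately simple, evenly spaced seeding; the agglomerative version (single linkage) records every merge, which is what gives the hierarchy.

```python
import math

# Invented 2-D points standing in for feature vectors of text items.
POINTS = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm: the caller must choose k in advance."""
    n = len(points)
    # Simplified deterministic seeding: evenly spaced points (an assumption).
    centres = [points[i * n // k] for i in range(k)]
    labels = [0] * n
    for _ in range(iterations):
        # Assign each point to its nearest centre.
        labels = [min(range(k), key=lambda c: math.dist(p, centres[c]))
                  for p in points]
        # Move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def agglomerative(points, n_clusters=1):
    """Single-linkage clustering; returns final clusters and the merge history."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges


flat_labels = kmeans(POINTS, k=2)
two_clusters, _ = agglomerative(POINTS, n_clusters=2)
_, full_history = agglomerative(POINTS, n_clusters=1)

print("k-means labels:", flat_labels)
print("agglomerative clusters:", two_clusters)
for left, right, d in full_history:
    print("merged", left, "+", right, "at distance", round(d, 3))
```

On well-separated toy data like this the two approaches agree; on real text they often will not, which is the practical argument for running more than one and comparing.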
The example clustering below (Fig 1) uses the Society of Economic Geologists (SEG) corpus of 100 years of research articles (courtesy of GeoScienceWorld). Unlike normal clustering, which is applied to the whole ‘document’, this works off text co-occurrence windows: in this case, the sentences that mention the term ‘Precambrian’, analysed using principal component analysis (PCA).
Fig 1 – Manual clustering applied to the SEG corpus for the search term ‘Precambrian’
Four main clusters were identified, relating to ‘Iron Formations’, ‘Shields’, ‘Gold Deposits’ and ‘Igneous Mineralization’.
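The idea of working from sentence-level windows around a search term, rather than whole documents, can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used to produce the figure: the sentences are invented (they are not from the SEG corpus), the stopword list is a toy one, the helper names are hypothetical, and only the first principal component is computed, using power iteration in plain Python.

```python
import math
import re
from collections import Counter

# Invented sentences for illustration only; not taken from the SEG corpus.
TEXT = (
    "Precambrian banded iron formations host large iron ore deposits. "
    "Gold deposits occur in Precambrian greenstone belts. "
    "The Precambrian shield exposes ancient cratons. "
    "Tertiary basins contain thick sandstones. "
    "Iron formations of Precambrian age are rich in hematite. "
    "Gold veins cut Precambrian greenstone rocks."
)
STOPWORDS = {"the", "in", "of", "are"}  # toy list (an assumption)


def sentence_windows(text, term):
    """Return the tokenised sentences that mention `term`, minus the term itself."""
    term = term.lower()
    windows = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        if term in tokens:
            windows.append([t for t in tokens
                            if t != term and t not in STOPWORDS])
    return windows


def count_matrix(windows):
    """Build a sentence-by-word count matrix over the windows' vocabulary."""
    vocab = sorted({t for w in windows for t in w})
    rows = []
    for w in windows:
        counts = Counter(w)  # missing words count as 0
        rows.append([counts[t] for t in vocab])
    return vocab, rows


def first_component(rows, iterations=200):
    """First principal component of the centred matrix, via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    v = [float(j + 1) for j in range(d)]  # arbitrary non-uniform start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        # Multiply by X^T X without forming the covariance matrix explicitly.
        scores = [sum(c[j] * v[j] for j in range(d)) for c in centred]
        w = [sum(centred[i][j] * scores[i] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    return v, scores


windows = sentence_windows(TEXT, "Precambrian")
vocab, rows = count_matrix(windows)
component, scores = first_component(rows)
for window, score in zip(windows, scores):
    print(round(score, 2), " ".join(window))
```

Each printed score is a sentence's position along the first component; plotting two such components against each other gives the kind of cross-plot on which clusters can then be picked out by eye.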
These could be seen to represent a high-level ‘summary’ of the main topics related to the Precambrian in this corpus, that is, a form of high-level subject summarization. There are of course numerous other methods for this, including Latent Dirichlet Allocation (LDA) Topic Modelling. I conducted some research a few years ago using Topic Modelling (Blei) with petroleum geoscientists and engineers, finding evidence that these techniques, driven by text corpora, were capable of surfacing new knowledge to experienced oil & gas professionals.
At the recent Oil & Gas Technology Centre (OGTC) event I was asked which technique geoscientists should use for text: supervised or unsupervised machine learning. This may be akin to the knowledge organization question of which is best for organizing documents: ‘folders or metadata tags’? There will be evangelists on both sides, but when you scrutinize the evidence, the pros and cons, it is likely the answer will be ‘use both as a strategy’, as it depends on the situation! I find it’s always useful to examine text using some form of clustering before we impose all our biases on it!