Combining meaning and prediction in text analytics

Part of Speech (POS) tagging is an important technique in Natural Language Processing (NLP). For example, it differentiates between ‘play’ (grey) as a noun and ‘play’ (green) as a verb. Whilst most practitioners use this technique for rule-based NLP approaches, it also has its uses in unsupervised Machine Learning (ML). For example, when using vector space/text embeddings (e.g. Word2Vec, Doc2Vec, ELMo, GloVe), some simple pre-processing steps before ML, such as appending the POS tag to each word, can lead to more accurate vectors. Trask et al. (2015) published variants of this approach as sense2vec.
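As a rough illustration of the 'appending POS' step, here is a minimal sketch in Python. It assumes the text has already been POS-tagged; the hand-tagged sentences, the tag names and the `word|TAG` separator are illustrative assumptions, not the exact format used in my experiments or in Trask et al.

```python
from typing import Iterable, List, Tuple


def append_pos(tagged_tokens: Iterable[Tuple[str, str]]) -> List[str]:
    """Join each lower-cased word with its POS tag, e.g. ('play', 'NOUN') -> 'play|NOUN'."""
    return [f"{word.lower()}|{tag}" for word, tag in tagged_tokens]


# Hand-tagged toy sentences (assumption): in practice the tags would come from a POS tagger.
sentence_a = [("The", "DET"), ("play", "NOUN"), ("was", "AUX"), ("long", "ADJ")]
sentence_b = [("Children", "NOUN"), ("play", "VERB"), ("outside", "ADV")]

tokens_a = append_pos(sentence_a)
tokens_b = append_pos(sentence_b)

print(tokens_a)
print(tokens_b)
```

Because 'play|NOUN' and 'play|VERB' are now different tokens, an embedding model trained on the output would learn a separate vector for each sense rather than one blended vector.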

I have experimented extensively with using text embeddings to surface geoscience analogues, developing a pre-processing methodology to derive the best results. Sentiment analysis relies predominantly on non-noun words. Deriving analogues through unsupervised learning is different: weighting nouns is more likely to deliver optimum results in terms of the ‘analogous’ similarity output from the neural network.
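There is more than one way to weight nouns. One possible reading, sketched below, is to build a document vector as a weighted average of word vectors in which nouns count for more. The function name `weighted_doc_vector`, the weights and the two-dimensional toy vectors are all hypothetical and only there to make the idea concrete.

```python
from typing import Dict, List, Sequence, Tuple


def weighted_doc_vector(
    tagged_tokens: Sequence[Tuple[str, str]],
    vectors: Dict[str, List[float]],
    noun_weight: float = 2.0,
    other_weight: float = 1.0,
) -> List[float]:
    """Average the word vectors of a document, giving nouns a larger weight."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, tag in tagged_tokens:
        vec = vectors.get(word)
        if vec is None:
            continue  # skip words that have no vector
        weight = noun_weight if tag == "NOUN" else other_weight
        for i, value in enumerate(vec):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # no known words: return the zero vector
    return [value / weight_sum for value in total]


# Toy two-dimensional vectors (assumption): real ones would come from a trained model.
toy_vectors = {"sandstone": [1.0, 0.0], "thick": [0.0, 1.0]}
doc = [("thick", "ADJ"), ("sandstone", "NOUN")]

doc_vec = weighted_doc_vector(doc, toy_vectors, noun_weight=3.0)
print(doc_vec)
```

With a noun weight of 3, the document vector sits three times closer to 'sandstone' than to 'thick', so similarity comparisons between documents are driven mainly by the nouns they share.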

Lithostratigraphic analogues are an interesting use case, with many aspects to consider. A simple unsupervised ML approach is unlikely to give optimal results in terms of suggesting likely analogues, for a variety of reasons including ‘noise’. Conversely, simply applying a rules-based thesaurus/taxonomy to text and then using the extractions to create vectors will lose a lot of the latent meaning and structure in the text.

A middle ground is to use ‘theory guided’ machine learning: encoding certain geoscience theories in the pre-processing step prior to ML. Early results show this is a promising technique for automatically delivering meaningful and useful analogues to geoscientists from unstructured text.
