In Petroleum Geoscience, traces of hydrocarbons are referred to as a ‘show’ or ‘shows’.
Thousands of labelled example sentences were used to build a predictive machine classifier (based on word patterns). However, it performed noticeably worse on ‘show’ than it did at detecting and disambiguating other concepts such as ‘mature’ and ‘migration’.
This was probably due to the very subtle differences in language:
the well had shows in the Jurassic – (Correct)
the well shows that in the Jurassic – (False)
Simply extracting the keywords ‘show’ and ‘shows’ gives 49% false positives. Conversely, extracting only bigrams such as ‘oil show’ and ‘HC show’ led to 30% false negatives. There are other synonyms of course (e.g. oil stain), but those are easier to detect.
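To make the trade-off concrete, here is a minimal sketch of the two naive approaches in Python. The three sentences are hypothetical examples (the third is mine, not from the labelled data), and the bigram list only covers the two bigrams named above.

```python
import re

# Hypothetical example sentences, not taken from the labelled data set.
sentences = [
    "the well had shows in the Jurassic",    # target sense, no bigram
    "the well shows that in the Jurassic",   # verb use, not the target sense
    "an oil show was recorded in the well",  # target sense, with bigram
]

# Naive keyword match: any occurrence of 'show' or 'shows'.
KEYWORD = re.compile(r"\bshows?\b", re.IGNORECASE)

# Bigram match: only 'oil show(s)' or 'HC show(s)'.
BIGRAM = re.compile(r"\b(?:oil|HC)\s+shows?\b", re.IGNORECASE)

keyword_hits = [bool(KEYWORD.search(s)) for s in sentences]
bigram_hits = [bool(BIGRAM.search(s)) for s in sentences]

for sentence, kw, bg in zip(sentences, keyword_hits, bigram_hits):
    print(f"keyword={kw!s:5} bigram={bg!s:5} {sentence}")
```

The keyword match fires on all three sentences, including the verb use (a false positive), while the bigram match misses the first sentence (a false negative).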
I started looking at Part of Speech (POS) patterns as features wherever ‘show’ or ‘shows’ was mentioned. POS tagging identifies whether a word is used as a noun, verb, article, adjective, preposition, pronoun, adverb, conjunction or interjection. With variants there are many more of these POS codes; see, for example, the Penn Treebank tag set: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Whilst checking for a noun POS tag (‘NN’ or ‘NNS’) gave 100% recall, precision was only 18%. I found that it was not enough to simply identify the POS of the word itself; I also had to look at the POS of the words around it, which together form a DNA-like sequence. Figure 1 shows a fragment of such a sequence.
Fig 1 – DNA fragment – Combining POS to form sequences to identify patterns
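As a rough illustration of the idea, the sketch below joins the POS tags in a window around ‘show’/‘shows’ into a single string. The function names, the window width of two words either side, the ‘PAD’ marker and the ‘-’ separator are all my assumptions for this example. The sentences are tagged by hand with Penn Treebank codes; a real tagger may assign different tags.

```python
def pos_window(tagged, index, width=2):
    """Join the POS tags around tagged[index] into one pattern string."""
    tags = []
    for i in range(index - width, index + width + 1):
        if 0 <= i < len(tagged):
            tags.append(tagged[i][1])
        else:
            tags.append("PAD")  # position falls outside the sentence
    return "-".join(tags)


def show_patterns(tagged, width=2):
    """Return one pattern per occurrence of 'show' or 'shows'."""
    return [
        pos_window(tagged, i, width)
        for i, (word, _tag) in enumerate(tagged)
        if word.lower() in {"show", "shows"}
    ]


# Hand-tagged (word, tag) pairs; the tags are illustrative assumptions.
noun_use = [("the", "DT"), ("well", "NN"), ("had", "VBD"), ("shows", "NNS"),
            ("in", "IN"), ("the", "DT"), ("Jurassic", "NNP")]
verb_use = [("the", "DT"), ("well", "NN"), ("shows", "VBZ"), ("that", "IN"),
            ("in", "IN"), ("the", "DT"), ("Jurassic", "NNP")]

print(show_patterns(noun_use))  # ['NN-VBD-NNS-IN-DT']
print(show_patterns(verb_use))  # ['DT-NN-VBZ-IN-IN']
```

Even when a tagger gives the target word itself the same tag in both sentences, the surrounding tags still differ, which is what the sequence is meant to capture.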
I compared the POS pattern sequences from thousands of labelled Petroleum Geoscience sentences for the target sense, versus the patterns in sentences which were false positives.
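One simple way to make such a comparison is to keep the patterns that occur in target-sense sentences and never in false positives. This is only a sketch of that set-difference idea, not necessarily the procedure used here. The function name, the `min_count` threshold and the pattern strings are hypothetical.

```python
from collections import Counter


def discriminating_codes(target_patterns, false_positive_patterns, min_count=1):
    """Patterns seen at least min_count times in the target sense
    and never in the false positives."""
    target_counts = Counter(target_patterns)
    seen_in_false = set(false_positive_patterns)
    return {
        pattern
        for pattern, count in target_counts.items()
        if count >= min_count and pattern not in seen_in_false
    }


# Hypothetical pattern strings, one per labelled sentence.
target = ["NN-VBD-NNS-IN-DT", "JJ-NN-NNS-VBD-VBN",
          "NN-VBD-NNS-IN-DT", "DT-NN-NNS-IN-DT"]
false_pos = ["DT-NN-VBZ-IN-IN", "DT-NN-NNS-IN-DT"]

codes = discriminating_codes(target, false_pos)
print(sorted(codes))
```

Here ‘DT-NN-NNS-IN-DT’ is dropped because it also appears among the false positives. Raising `min_count` would additionally drop patterns that are too rare to trust.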
I found that a distinctive set of ‘codes’ could almost perfectly detect the use of the word in the target sense. This gave an F1 score of 99% when applied to 500 previously unseen geoscience test sentences containing the word.
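For readers less familiar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it from raw counts. The counts are illustrative, chosen only to reproduce the 18% precision and 100% recall of the noun-tag baseline mentioned earlier; they are not the actual test counts.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive
    and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts: 18% precision at 100% recall.
p, r, f1 = precision_recall_f1(tp=18, fp=82, fn=0)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.3f}")
```

With those counts the baseline F1 comes out at roughly 0.31, which gives a sense of how large the jump to 99% is.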
This is one of many techniques used in the OpportunityFinder™ algorithm I introduced in my last post. This takes the whole DNA-like sequencing to higher levels in order to detect hidden plays in text. https://infosciencetechnologies.com/