The Hidden Codes in Geoscience Text

In Petroleum Geoscience, traces of hydrocarbons are referred to as a ‘show’ or ‘shows’.

Thousands of labelled example sentences were used to build a predictive machine classifier based on word patterns. However, it performed noticeably worse on ‘show’ than it did when detecting and disambiguating other concepts such as ‘mature’ and ‘migration’.

This was probably due to the very subtle differences in language:

For example:

the well had shows in the Jurassic – (Correct)

the well shows that in the Jurassic – (False)

Simply extracting the keywords ‘show’ and ‘shows’ gives 49% false positives. Conversely, extracting only bigrams such as ‘oil show’ or ‘HC show’ led to 30% false negatives. There are other synonyms of course (e.g. ‘oil stain’), but those are easier to detect.
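To make the trade-off concrete, here is a minimal sketch of the two baselines. The regular expressions, the function names and the third sentence are my own illustrative assumptions, not the patterns actually used.

```python
import re

# Baseline 1: any occurrence of 'show' or 'shows' (assumed pattern).
KEYWORD = re.compile(r"\bshows?\b", re.IGNORECASE)
# Baseline 2: only bigrams such as 'oil show' or 'HC show' (assumed list of modifiers).
BIGRAM = re.compile(r"\b(?:oil|gas|hc)\s+shows?\b", re.IGNORECASE)


def keyword_match(sentence: str) -> bool:
    """True if the sentence contains the bare keyword."""
    return KEYWORD.search(sentence) is not None


def bigram_match(sentence: str) -> bool:
    """True if the sentence contains a modifier + 'show' bigram."""
    return BIGRAM.search(sentence) is not None


examples = [
    ("the well had shows in the Jurassic", True),    # target sense
    ("the well shows that in the Jurassic", False),  # verb, not the target sense
    ("minor oil shows were recorded", True),         # hypothetical sentence
]

for sentence, is_target in examples:
    print(sentence, "| target:", is_target,
          "| keyword:", keyword_match(sentence),
          "| bigram:", bigram_match(sentence))
```

The keyword baseline fires on the verb sentence (a false positive), while the bigram baseline misses the first sentence (a false negative).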

I started looking at Part of Speech (POS) patterns as features wherever ‘show’ or ‘shows’ was mentioned. POS tagging identifies whether a word is used as a noun, verb, article, adjective, preposition, pronoun, adverb, conjunction or interjection. With variants, there are many more of these POS codes. See, for example, the simple Penn Treebank tag set: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

Whilst checking for a noun POS tag (‘NN’ or ‘NNS’) gave 100% recall, precision was only 18%. I found that it was not enough to identify the POS of the word itself; I also had to look at the words around it to form a DNA-like sequence. Figure 1 shows a fragment of such a sequence.

Fig 1 – DNA fragment – Combining POS to form sequences to identify patterns
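As a rough illustration of how such a sequence could be assembled, the sketch below joins the POS tags in a small window around the target word. The window size, the `PAD` marker, the `pos_code` helper and the hand-assigned tags are all assumptions for the example; a real tagger may label these tokens differently.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]  # (token, POS tag) pairs


def pos_code(tagged: Tagged, target: int, window: int = 2) -> str:
    """Join the POS tags of the tokens within +/- `window` of `target`."""
    tags = []
    for i in range(target - window, target + window + 1):
        if 0 <= i < len(tagged):
            tags.append(tagged[i][1])
        else:
            tags.append("PAD")  # position falls outside the sentence
    return "-".join(tags)


# Hand-assigned Penn Treebank tags (an assumption, not tagger output).
noun_use = [("the", "DT"), ("well", "NN"), ("had", "VBD"), ("shows", "NNS"),
            ("in", "IN"), ("the", "DT"), ("Jurassic", "NNP")]
verb_use = [("the", "DT"), ("well", "NN"), ("shows", "VBZ"), ("that", "IN"),
            ("in", "IN"), ("the", "DT"), ("Jurassic", "NNP")]

print(pos_code(noun_use, target=3))  # NN-VBD-NNS-IN-DT
print(pos_code(verb_use, target=2))  # DT-NN-VBZ-IN-IN
```

Each string plays the role of one ‘code’: the tag of the target word plus the tags of its neighbours.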

I compared the POS pattern sequences from thousands of labelled Petroleum Geoscience sentences for the target sense, versus the patterns in sentences which were false positives.
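One simple way to make this kind of comparison, sketched here under my own assumptions, is to keep only the codes seen in target-sense sentences and never in false positives. The `discriminative_codes` helper and the toy codes are hypothetical, and the actual selection procedure may well be more involved.

```python
from typing import Iterable, Set


def discriminative_codes(positive: Iterable[str], negative: Iterable[str]) -> Set[str]:
    """Codes that occur in target-sense sentences but in no false positive."""
    return set(positive) - set(negative)


# Toy POS codes (hypothetical, for illustration only).
target_sense_codes = ["NN-VBD-NNS-IN-DT", "JJ-NN-NNS-VBD-VBN", "NN-VBD-NNS-IN-DT"]
false_positive_codes = ["DT-NN-VBZ-IN-IN", "JJ-NN-NNS-VBD-VBN"]

codes = discriminative_codes(target_sense_codes, false_positive_codes)
print(codes)  # {'NN-VBD-NNS-IN-DT'}


def is_target_sense(code: str, learned: Set[str]) -> bool:
    """Classify a new mention by looking its code up in the learned set."""
    return code in learned
```

A code shared by both groups is dropped because it cannot separate the two senses on its own.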

It turned out that a unique set of ‘codes’ could almost perfectly detect the use of the word in the target sense. This gave an F1 score of 99% when applied to 500 geoscience test sentences containing the word, none of which had been seen before.
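For readers less familiar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it from labelled predictions; the function names and the toy labels are assumptions for illustration and are not the 500 test sentences.

```python
from typing import Sequence


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """F1 for the positive class, computed from true and predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return f1_from(precision, recall)


# Toy labels (hypothetical): one hit, one false positive, one miss.
print(f1_score([True, True, False, False], [True, False, True, False]))  # 0.5

# The noun-only check quoted earlier: 18% precision at 100% recall.
print(round(f1_from(0.18, 1.0), 3))  # 0.305
```

The second call shows why perfect recall alone is not enough: the noun-only check works out to an F1 of roughly 0.31.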

This is one of many techniques used in the OpportunityFinder™ algorithm I introduced in my last post. This takes the whole DNA-like sequencing to higher levels in order to detect hidden plays in text. https://infosciencetechnologies.com/
