Month: February 2018

Big Data in the Geosciences: Geoscience Aware Sentiment Analyzer


A geoscience-aware text sentiment algorithm improves on out-of-the-box sentiment tools such as IBM Watson, Google, Microsoft and Amazon by over 30% for geoscience sentiment in text.

Presented early research findings today at the Janet Watson ‘Big Data in the Geosciences’ conference at the Geological Society of London.

Google opened proceedings with a talk on satellite imagery and the Earth Engine. Subsequent talks ranged from using Twitter for early warnings of earthquakes, through virtual reality and digital analogues, to applying deep learning to detect volcano deformation. Some fascinating insights.

My latest research addressed the sentiment/tone (the context) around mentions of petroleum system elements (such as source rock, migration, reservoir and trapping) in literature, company reports and presentations. The hypothesis is that stacking somewhat independent opinion/tone in text (the averages, the outliers, the contradictions) may show geoscientists what they don’t know and challenge what they think they do know.

The research question was whether a geoscience-aware algorithm could improve on existing APIs/algorithms in use for sentiment analysis, and how useful the resulting visualizations might be.

Using a held-back set of 750 labelled examples to test, the Geoscience Aware text sentiment analyZER (GAZER) algorithm achieved 90.4% accuracy for two classes (positive and negative) and 84.26% accuracy for three classes (positive, negative and neutral sentiment). This compared favourably with out-of-the-box Paragraph Vector and Naïve Bayes approaches. It also compared favourably with the out-of-the-box sentiment cloud APIs from IBM Watson, Microsoft, Amazon and Google, which averaged approximately 50% accuracy for the three classes.
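As a rough illustration of how accuracy figures like these are scored, here is a minimal Python sketch. The labels are toy values, not the 750 test sentences, and the function names are hypothetical. The post does not say how the two-class figure was derived; dropping examples whose gold label is neutral is only one plausible assumption.

```python
from typing import List, Sequence, Tuple


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty test set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def two_class_subset(
    gold: Sequence[str], predicted: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Keep only examples whose gold label is positive or negative.

    Assumption for illustration: the two-class score ignores neutral examples.
    """
    pairs = [(g, p) for g, p in zip(gold, predicted) if g != "neutral"]
    return [g for g, _ in pairs], [p for _, p in pairs]


# Toy labels for illustration only; these are not the real test set.
gold = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
pred = ["positive", "negative", "positive", "positive", "neutral", "neutral"]

three_class_acc = accuracy(gold, pred)
two_gold, two_pred = two_class_subset(gold, pred)
two_class_acc = accuracy(two_gold, two_pred)

print(f"3-class accuracy: {three_class_acc:.2%}")
print(f"2-class accuracy: {two_class_acc:.2%}")
```

The same two functions could be applied to the output of any classifier or cloud API, as long as its labels are mapped onto the same three class names first.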

This supports findings in other areas showing the need to customize sentiment analysis for specific domains, and the criticality of training data specific to the work task in hand. The findings also support existing literature suggesting that generative probabilistic machine learning algorithms may perform better than discriminative ones when classifying snippets of information such as sentences and bullets in PowerPoint presentations.
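To make "generative probabilistic" concrete, here is a minimal sketch of a multinomial Naïve Bayes classifier, the kind of generic generative baseline mentioned above. It is not the GAZER algorithm. The class name `TinyNaiveBayes` and the training snippets are invented for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over lowercased tokens."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log P(class) + sum of log P(token | class), add-one smoothed
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training snippets for illustration; not from the GAZER data.
train_texts = [
    "excellent reservoir quality with good porosity",
    "mature source rock with good generation potential",
    "poor reservoir quality and tight sandstones",
    "seal failure and trap breached",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = TinyNaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("good porosity in the reservoir")
print(prediction)
```

The model learns how each class "generates" words and then picks the class most likely to have produced a snippet, which is why it can still make a call on very short text such as a single bullet point.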

Early evidence suggested that the resulting visualizations, such as streamgraphs of the sentiment data, could be used to challenge individual biases and organizational dogma, potentially generating new knowledge. This is an area for further research.
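For readers unfamiliar with streamgraphs, the sketch below shows one way labelled sentiment could be turned into the stacked bands such a chart draws. The post does not describe how the streamgraphs were built, so everything here is an assumption: the `(year, sentiment)` records are invented, `stream_layers` is a hypothetical helper, and the stack is simply centred on zero, which is one common streamgraph layout.

```python
from collections import Counter

SENTIMENTS = ("negative", "neutral", "positive")


def stream_layers(records):
    """Turn (period, sentiment) records into streamgraph bands per period.

    Returns {period: [(sentiment, lower, upper), ...]} with each stack
    centred on zero (a symmetric baseline).
    """
    counts = Counter(records)
    periods = sorted({period for period, _ in records})
    layers = {}
    for period in periods:
        heights = [counts[(period, s)] for s in SENTIMENTS]
        lower = -sum(heights) / 2  # centre the stack around zero
        bands = []
        for sentiment, height in zip(SENTIMENTS, heights):
            bands.append((sentiment, lower, lower + height))
            lower += height
        layers[period] = bands
    return layers


# Invented records for one hypothetical element, e.g. "source rock".
records = [
    (1990, "positive"), (1990, "positive"), (1990, "negative"), (1990, "neutral"),
    (2000, "negative"), (2000, "negative"), (2000, "positive"),
]
layers = stream_layers(records)

for period, bands in layers.items():
    print(period, bands)
```

Each band's lower and upper edge can be passed to any plotting library's fill-between routine; a shift in the relative thickness of the bands over time is the kind of signal the visualization is meant to surface.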

Presentation available on SlideShare: click here.

The 750 labelled sentences (the test set) and a simple Python extraction script are available on GitHub.