Presentation given at the Geological Society of America (GSA) Annual Conference in Seattle this week (Geoinformatics session). Click here for Slideshare
In previous posts I have shown how ‘sentiment/tone/opinion’ can be automatically extracted from text to stimulate serendipity during search and to analyse differences between Geological Formations. One finding was that words/sentences deemed ‘negative’ by out-of-the-box algorithms are not necessarily ‘negative’ in a geological sense (such as ‘hard’, ‘spill’, ‘old’, ‘fault’ and ‘thick’). Another was that typical entity extraction, or even the entity-entity association extractions and ontologies described in the geoscience literature, tends to leave behind valuable sentiment/tone context. Some of my recent academic research (to be presented at the Geological Society of America (GSA) this month) has focused on using machine learning from training sets to assess ‘tone’ in reports relating to working petroleum system elements in space and time. Effectively, this targets a generic question that could apply to many domains in Geoscience and beyond:
“Do the right conditions exist for…?”
Fig 1 shows sentiment geographically and Fig 2 by geological age.
Figure 1 – Map plot. Comparison of positive v negative tone for various elements in different geographical locations
For example, the sentence, “…well northeast of Boomerang Hills has tested this pre-Andean trap concept successfully” counts as a ‘positive’ for the petroleum system element ‘Trap’. The sentence, “downward migration from Upper Cretaceous-sourced oils seems unlikely” counts as a ‘negative’ for the element Migration (SR Charge). Simply ‘counting’ entities is not enough. Consider the mention, “The reservoir was absent”! Without contextual sentiment, it is likely that misleading data will be presented.
This could provide another ‘opinion’ for scientists which could challenge preconceptions about what they may already believe. It could also stimulate learning events (clicking on the sentiment to view the mentions within sentences) that may prove academically and/or commercially valuable.
I have been experimenting with a custom ensemble (skip-gram, lexicon and Bayesian) algorithm, taking into account word order, to detect this ‘positive’ and ‘negative’ sentiment (opinion and tone) around entities. Deep learning text embeddings would probably improve the ensemble accuracy (Araque et al. 2017), but I have not used them here as I am testing a very small dataset. See previous posts where I have used these techniques for different purposes on larger datasets.
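A minimal sketch of how the lexicon and Bayesian components of such an ensemble might be combined (the skip-gram/word-order component is omitted; the training sentences, lexicon entries and weighting below are invented for illustration, not the study’s actual data):

```python
from collections import Counter
import math

# Toy training set of (sentence, label) pairs -- invented examples only.
TRAIN = [
    ("trap concept tested successfully", "pos"),
    ("excellent reservoir quality encountered", "pos"),
    ("migration seems unlikely", "neg"),
    ("source rock is immature", "neg"),
]

# Domain lexicon: everyday 'negative' words like 'fault' or 'thick' are
# deliberately absent because they are neutral in a geological sense.
LEXICON = {"successfully": 1, "excellent": 1, "unlikely": -1, "immature": -1}

def lexicon_score(tokens):
    return sum(LEXICON.get(t, 0) for t in tokens)

def train_nb(examples):
    counts = {"pos": Counter(), "neg": Counter()}
    for text, label in examples:
        counts[label].update(text.split())
    return counts

def nb_score(tokens, counts):
    # Naive Bayes log-likelihood ratio with add-one smoothing.
    vocab = set(counts["pos"]) | set(counts["neg"])
    n_pos = sum(counts["pos"].values()) + len(vocab)
    n_neg = sum(counts["neg"].values()) + len(vocab)
    score = 0.0
    for t in tokens:
        score += math.log((counts["pos"][t] + 1) / n_pos)
        score -= math.log((counts["neg"][t] + 1) / n_neg)
    return score

def classify(sentence, counts):
    # Ensemble vote: simple sum of the two component scores.
    tokens = sentence.lower().split()
    total = lexicon_score(tokens) + nb_score(tokens, counts)
    return "pos" if total > 0 else "neg"

counts = train_nb(TRAIN)
```

A real implementation would weight the components, train on far more data, and include a neutral class.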
The proportion of negative v positive instances can then be used to show relative trends (pie-charts in Fig 1) for each element and rolled up to higher level constructs. Figure 2 (multi-series bubble plot) shows the same data (from a very small sample of USGS reports again to illustrate the concept) focusing on the Source Rock/Charge element. This enables the data matrix to be plotted, where a Geological Age has been picked up by the algorithms as well as a location ‘mention’ in the text.
Figure 2 – Geological Time charts plots for sentiment/tone mentions in context to a geographical area/basin. The larger the bubble, the greater the number of mentions.
The categorization is very coarse (Basin level). Ideally it would be more useful to extract specific intra-basin features and/or geographical areas. Geological age of source rocks and charge/migration events are also somewhat conflated in this simplified picture, although they could easily be split out. Also, given enough document volume, it should be possible to animate Figures 1 and 2 through time; for example, to show how sentiment/tone has changed each year from 1990 through to 2017.
By machine reading documents, papers and texts (too many in number to be realistically read by a person, and harbouring patterns too subtle to be picked up in any single document), a perspective can be obtained which may challenge individual biases and/or organizational dogma.
Public domain reports from the United States Geological Survey (USGS) were downloaded for testing. Python and TextBlob scripts were used to convert the reports to text, identify mentions of Petroleum System Elements, and determine whether the surrounding context carried ‘positive’ or ‘negative’ sentiment. Geo-referencing can be achieved through the centroid of the country, basin or geographical point of interest associated with each mention.
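A rough sketch of such a pipeline, in pure Python rather than TextBlob (the element gazetteer, report snippet and stand-in classifier below are all illustrative, not the study’s actual code):

```python
import re

# Hypothetical element gazetteer: surface form -> Petroleum System Element.
ELEMENTS = {
    "trap": "Trap", "reservoir": "Reservoir",
    "migration": "Migration", "source rock": "Source Rock/Charge",
}

def sentences(text):
    """Naive sentence splitter on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def element_mentions(text, classify):
    """Yield (element, sentence, sentiment) for each element mention."""
    out = []
    for sent in sentences(text):
        low = sent.lower()
        for key, element in ELEMENTS.items():
            if key in low:
                out.append((element, sent, classify(sent)))
    return out

report = ("The well has tested this trap concept successfully. "
          "Downward migration from Upper Cretaceous-sourced oils seems unlikely.")

# Trivial stand-in for the real sentiment classifier.
toy = lambda s: "negative" if "unlikely" in s.lower() else "positive"
mentions = element_mentions(report, toy)
```

Each mention could then be geo-referenced via a centroid lookup and aggregated into the pie-charts of Figure 1.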
The algorithm addresses areas such as negation and avoids some of the problems with context-free Bag of Words (BoW) models. For example, “Source Rock maturity was not a problem” is a ‘positive’ context, despite containing individually ‘negative’ words such as ‘not’ and ‘problem’. This is where traditional lexicon/taxonomy approaches (even using multiple-word concepts) are likely to perform poorly.
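A toy comparison showing why word order matters (the lexicons and negation window below are invented; the actual ensemble’s negation handling is more sophisticated):

```python
NEGATIVE = {"problem", "poor", "absent"}
NEGATORS = {"not", "no", "never", "without"}

def bow_score(tokens):
    # Context-free bag of words: every 'negative' word counts against.
    return -sum(t in NEGATIVE for t in tokens)

def negation_aware_score(tokens, window=3):
    # Flip the polarity of a negative word when a negator appears within
    # a few preceding tokens -- a crude stand-in for word-order handling.
    score = 0
    for i, t in enumerate(tokens):
        if t in NEGATIVE:
            negated = any(u in NEGATORS for u in tokens[max(0, i - window):i])
            score += 1 if negated else -1
    return score

tokens = "source rock maturity was not a problem".split()
```

On this sentence the BoW score is negative while the negation-aware score is positive, matching the intended reading.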
Further work involves ascertaining precision, recall and F1 accuracy scores; I’m currently working on a test set of over 2,000 examples of positive, negative and neutral sentiment about these entities, extracted from public domain sources. Differentiating tone into various dimensions may also be useful. These may be promising techniques to augment geoscientists’ cognition, supporting higher-level thinking processes rather than just retrieval (remembering) of documents in traditional search applications.
Although all Geological Basins are unique, from Figure 2 it is apparent that some Basins/Areas may share common aspects. Utilising positive and negative tone by geological age, clustering techniques can be applied to the data matrix to suggest analogues (including intra-basin ones) purely from the latent structure in text. No prior studies have been found which address this area and ascertain its usefulness. Fig 3 shows one such technique applied to positive/negative tone for the Source Rock/Charge element, with correlations and hierarchical clustering shown in a sequentially coloured heatmap (Metsalu and Vilo 2015). Rows and columns have been automatically re-ordered through clustering; the colours displayed are the values in the data matrix.
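The core step, finding the most similar pair of basins in the tone matrix (the first merge in agglomerative clustering), can be sketched as follows; the basin-by-age values below are invented purely to illustrate the idea:

```python
import math

# Hypothetical basin-by-age matrix of net tone (positive minus negative
# mentions, normalised). All values are invented for illustration.
BASINS = {
    "Sirte":    [0.8, 0.6, -0.2],
    "Tamara":   [0.7, 0.5, -0.1],
    "Putumayo": [-0.4, 0.2, 0.9],
}

def closest_pair(profiles):
    """Return the two basins with the smallest Euclidean distance,
    i.e. the pair that would merge first in agglomerative clustering."""
    names = list(profiles)
    best = None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = math.dist(profiles[a], profiles[b])
            if best is None or d < best[0]:
                best = (d, tuple(sorted((a, b))))
    return best[1]
```

Repeating the merge on merged clusters, and re-ordering rows and columns by merge order, yields the dendrogram and heatmap layout of Figure 3.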
Figure 3 – Clustering (Correlation Clustering) Basins/Area and Geological Time for Source Rock/Charge by sentiment.
From Figure 3 it can be seen (dendrogram on the left) that Sirte and Tamara are the two most similar (with the caveat that we are using extremely limited data to illustrate the concept). It is relatively straightforward to see how, in theory, this could be applied to a vast amount of sentiment data (more dimensions, and Lithostratigraphy perhaps), potentially making more non-obvious connections where similar conditions exist, especially if numerical (integer/float) data is extracted from text and/or brought in from additional sources.
These techniques ‘mimic’ some simple human thought processes, hence the term ‘cognitive’. However, machines in my opinion do not read text “like people do”, despite technology marketing slogans. The Geoscientist may, however, benefit from using some of these techniques, which are freely available. After all, why wouldn’t you want to seek opinion from a crowd of somewhat independent scientists who have authored hundreds of thousands of reports? If it confirms your existing mental model, then it’s good confirmatory supporting evidence. If it challenges it, that does not mean you are wrong, but it may stimulate a little more reflection and investigation. Subsequently, you may stick with what you thought; on the other hand, it may radically change your view.
Keywords: Sentiment Analysis, Enterprise Search, Big Data, Text Analytics, Machine Learning, Cognitive Search, Insight Engines, Artificial Intelligence (AI), Geology, Petroleum Systems, Oil and Gas, Geoinformatics
Surprised and delighted to be informed that my PhD has been judged in the “Top 5” internationally in 2017 for Information Science in the ProQuest Doctoral Dissertation Award.
My thesis topic was Re-examining and re-conceptualising enterprise search and discovery. The Association for Information Science and Technology (ASIS&T) scope includes any PhD related to, “the production, discovery, recording, storage, representation, retrieval, presentation, manipulation, dissemination, use, and evaluation of information and on the tools and techniques associated with these processes.”
The judges’ comments included: “As far as I know, this is the first comprehensive and holistic work studying enterprise search; this is a pretty relevant theme and the contributions of the thesis are sizeable” and “Findings from this thesis have direct implications for the theories and practices in information science”.
A big thanks to my supervisory team of Professor Simon Burnett (Robert Gordon University) and Dr Laura Muir (Edinburgh Napier) along with everyone who has helped and encouraged me. It further motivates me to continue academic research in this area and to make further contributions to the discipline in what is a tremendously exciting time.
Delighted to have made the front cover for Sep/Oct 2017 issue
I presented at the International Society of Knowledge Organization (ISKO) this week, sharing findings of an exploratory study. A Knowledge Organization System (KOS) was automatically applied to the annual company reports of four similar sized oil and gas companies to detect forward-looking strong and hesitant sentiment, in order to detect rhetoric, social phenomena and predict future business performance.
The “Discovery” part of “Enterprise Search & Discovery” is arguably downplayed in much of the existing academic and practitioner literature. In addition to finding what you know exists (or finding document ‘containers’ that you did not), there may be a case to embed various sentiment algorithms as standard in enterprise search & discovery technology deployments. Designing with ‘serendipity in mind’, this may move the intent of a deployment from one of pure retrieval, to one of pattern recognition. Where ‘trace fossils’ may exist in the information aggregate, not discernible from any single document.
The utilization of such algorithms to ‘compare’ and ‘contrast’ perhaps in a web part in the user interface, may move the enterprise search & discovery tool further up the Bloom’s Taxonomy pyramid, in assisting higher forms of thinking (along with delivering the surprising). It may not make sense for many queries made in general purpose ‘Google-like’ search tools deployed behind a company’s firewall, but detecting queries which do could be a useful undertaking. As described in a previous post many things can have a ‘sentiment’ which may act as a catalyst for further inquiry and potential new learnings. Whilst sentiment analysis is a useful technique when you have an a priori hypothesis in mind, it could well surface interesting phenomena even when you don’t.
Just a quick update on what I have been up to these past few hectic months as my last blog post was back in May this year. Below are some papers I have been working on over the summer and upcoming conferences I will be presenting at:
Conducted some research recently in California (more on this in later posts)
Sentiment Analysis in organizational reports
I will be presenting on the 11th September in London at the ISKO conference, in collaboration with Laura Muir (Associate Professor of Information Systems at Edinburgh Napier University). The topic will be applying automated sentiment analysis to identify forward-looking sentiment (about the future) in company reports. This provides an indicator of how confident an organization feels about the future, though it may be dosed with rhetoric. We used biologically inspired word diversity algorithms which, to our knowledge, have not been used before to assess forward-looking sentiment. We also investigated predictive links to future financial performance and organizational phenomena such as the reaction to a crisis. I hope to share the presentation and paper shortly in the public domain. I think there are some very exciting findings and opportunities for companies to develop new knowledge as well as conduct further research: http://www.iskouk.org/content/isko-uk-conference-2017-knowledge-organization-whats-story
Search Engine Bias
Information Today published an extended article I wrote on search engine bias in their Sep/Oct 2017 edition here: http://www.infotoday.com/OnlineSearcher/Issue/7398-September-October-2017.shtml . It is an extension of the blog post I made earlier this year https://paulhcleverley.com/2017/04/24/are-search-algorithms-neutral/ including links to ‘fake news’ and bias within enterprise search & discovery technology. Information Today requires a subscription for the latest issues.
Cognitive Search Assistants in the Geosciences
Delighted that my paper on Cognitive Search Assistants in the Geosciences was accepted for the Annual Meeting of the Geological Society of America (GSA) in Seattle during October 2017. This builds on and further extends existing research I published previously on this site: https://paulhcleverley.com/2017/05/28/text-analytics-meets-geoscience/ , https://paulhcleverley.com/2016/08/01/teaching-machines-about-a-subject-like-oil-and-gas/ and work I presented a few years ago in Turkey https://paulhcleverley.com/2015/05/13/creating-sparks/ . These tools and techniques move beyond traditional deductive inference, to include both an inductive and abductive inference focus. I will be sharing the presentation and paper in the public domain later in the year.
I presented some text analytics work at a recent GeoScienceWorld (GSW) meeting in New Mexico, USA. GSW is a not-for-profit cooperation of Geological Societies, Associations & Institutes to disseminate geoscience information. First, some information on the trip, then the analytics!
The Geological Field Trip was to the Santa Fe and Abiquiu areas, approximately 7,000 ft above sea level. To the west, across the Rio Grande Rift Basin, are the Jemez Mountains (a supervolcano) and the town of Los Alamos (home of the Manhattan Project). To the north is the Colorado Plateau and Ghost Ranch, where over 100 articulated skeletons of the Triassic theropod dinosaur Coelophysis have been found (the state fossil of New Mexico). These would have stood about one metre tall at the hips and up to three metres long.
The red cliffs at Chimney Rock contain Triassic deposits overlain unconformably by cross bedded Jurassic desert sandstones topped with white limestone and gypsum in places. The beautiful scenery of Chimney Rock can be seen in photo below:
The view from the top of Chimney Rock is even more breath-taking in the photo I took below.
There has been a continuing shift from just Information Retrieval (IR) systems – a search box and ten blue links, to the search for patterns, through what is increasingly called ‘insight engines’ within the cognitive computing paradigm. After all, big data is about small patterns.
All of the work below represents approximately five days’ work and shows what is possible in a short space of time using some of the techniques available today. I wrote scripts in Python and used open-source utilities; these included some new techniques not published before. For the analytic content, I used the Society of Economic Geologists (SEG) text corpus (1905-2017) as an example focused on mining, mainly of heavy metals. This consists of over 6,800 articles, 4.3 million lines of text and 35 million words. Several examples of analytics techniques are shown below, increasing in their sophistication.
Counting the frequency of terms and of their adjacent associations yields n-grams. The image below is a graph of nodes (unigrams) and edges (bigram associations) automatically generated from the SEG text corpus of journals.
Terms that have high authority (many links) can clearly be seen, along with rarer terms with few associative links. This is one way to explore text in an easy and visual way, which can be linked through to the documents or contexts in which those words or associations occur.
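The underlying counting step can be sketched as follows (the toy corpus is invented; a real pipeline would split per sentence before pairing, and would hand the node and edge weights to a graph-drawing library):

```python
import re
from collections import Counter

def bigrams(text):
    """Count adjacent word pairs (edge weights). Note: this naive version
    lets pairs span sentence boundaries; a real run would split first."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(zip(tokens, tokens[1:]))

# Toy corpus standing in for the SEG journal text.
corpus = ("Gold mineralization in hydrothermal systems. "
          "Hydrothermal alteration near gold deposits. "
          "Gold deposits in greenstone belts.")

edges = bigrams(corpus)                                  # bigram edge weights
nodes = Counter(re.findall(r"[a-z]+", corpus.lower()))   # unigram node sizes
```

Nodes with many distinct edges correspond to the high-authority terms visible in the graph.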
These displays can be complemented by word clouds, with the most frequent associations stripped down to reveal the ‘more interesting’. Previous research I performed with geoscientists indicated that more frequent associations were ‘relevant but not interesting’. So stripping away the most frequent may be desirable, hyperlinking every word so scientists can drill down into the articles and sentences in which the associations are mentioned. The example below is for the search query ‘precambrian’.
Just as the Google n-gram viewer allows someone to explore word usage through time, similar approaches can be taken with journals. The image below shows the trends of some common words in the SEG text corpus over the past twenty years. The y-axis is normalized relative word frequency (compared to total number of words used in that journal in that year).
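The normalization can be sketched as below (the per-year token lists are invented stand-ins for a year’s worth of journal text):

```python
from collections import Counter

# Hypothetical per-year token streams standing in for the SEG corpus.
corpus_by_year = {
    1997: "manganese ore manganese deposit gold vein".split(),
    2017: "gold deposit gold vein gold hydrothermal".split(),
}

def relative_frequency(term, year):
    """Term count divided by total words used in that year's journals."""
    tokens = corpus_by_year[year]
    return Counter(tokens)[term] / len(tokens)
```

Plotting `relative_frequency` per term per year gives the trend lines in the chart, comparable across years of different sizes.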
For example, from the image above it is plain to see that the popularity (frequency of occurrence) of the terms ‘gold’ and ‘hydrothermal’ has increased over the past twenty years, whilst ‘manganese’ and ‘metamorphic’ have decreased. The increase in popularity of ‘gold’ has been theorized as possibly related to the gold price, which has also been plotted on this chart!
Extracting entities from text and associating them to a spatial context and geological time period has been of increasing interest to both academia and practice. The NSF EarthCube GeoDeepDive Cyber Infrastructure is one such example with some fascinating findings related to stromatolite distribution and sea water chemistry for example driven by patterns in text.
The example below shows the frequency of mentions (histogram in green) in SEG journal articles of geological periods (including their constituent sub-divisions). A knowledge representation (taxonomy) has been applied to the text in order to surface a pattern. This would appear to support a proposition that, over more than a hundred years, the focus for mining geologists has been the Pre-Cambrian and Tertiary (Paleogene and Neogene) periods (denoted on the y-axis by the acronyms ‘PC’ and ‘T’ respectively). The Silurian period appears to have been of least interest.
Plotting the world-wide distribution of copper ore by Geological age (orange line) as a form of ‘control’, supports the theory that patterns in journal text may surface ‘real’ trends and phenomena of interest.
Another relatively common technique is to extract numerical integer and float data. The chart below shows the results of automatically extracting integer and float data associated with the mnemonic ‘ppm’ (parts per million), plotted where it can be associated to a chemical element. The ppm data is on the y-axis (logarithmic), with mentions on the x-axis (1,709 were found in total). This could be turned into a hyper-linkable user interface, taking the user to the sentence/paragraph in question for each data point. This type of extraction is quite trivial, although potentially under-used by organizations, despite many of these data not necessarily being stored in structured databases.
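A minimal sketch of the extraction (the pattern and example sentence are illustrative; a production version would need a proper element gazetteer, since a bare capitalised-word match will also catch non-element words like ‘In’ or ‘As’ used in prose, plus unit handling):

```python
import re

# Match an element-symbol-like token, then a number followed by 'ppm'
# within the same phrase, e.g. "Au values up to 12.5 ppm".
PATTERN = re.compile(r"\b([A-Z][a-z]?)\b[^.]{0,40}?(\d+(?:\.\d+)?)\s*ppm")

text = ("Au values up to 12.5 ppm were recorded, while As averaged 300 ppm "
        "in the altered zone.")

readings = [(el, float(val)) for el, val in PATTERN.findall(text)]
```

Each `(element, value)` pair can keep a pointer back to its source sentence, giving the hyper-linkable data points described above.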
Another common technique is entity-entity matrices, showing how commonly two entities occur together in the same sentence or equivalent semantic text unit. The example below shows lithology and minerals for the SEG corpus.
The associations are clustered using least squares to group similar lithology and mineral associations. You may just be able to pick out ‘Diamond’ on the middle right and its strongest association with Breccia and Conglomerate. These displays may reveal surprising associations worthy of further exploration, and are used extensively in biomedical research for tasks such as gene discovery.
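Building such a matrix can be sketched as below (the entity lists and text are invented; clustering the resulting matrix is a separate step):

```python
import re
from itertools import product
from collections import Counter

# Hypothetical gazetteers of the two entity types.
LITHOLOGY = {"breccia", "conglomerate", "limestone"}
MINERALS = {"diamond", "gold", "pyrite"}

def cooccurrence(text):
    """Count (lithology, mineral) pairs appearing in the same sentence."""
    matrix = Counter()
    for sent in re.split(r"(?<=[.!?])\s+", text.lower()):
        liths = [w for w in LITHOLOGY if w in sent]
        mins = [w for w in MINERALS if w in sent]
        matrix.update(product(liths, mins))
    return matrix

text = ("Diamond occurs in breccia pipes. "
        "Diamond is also reported from conglomerate. "
        "Pyrite is common in limestone.")
matrix = cooccurrence(text)
```

The counts form the cells of the entity-entity matrix, ready for row/column reordering by a clustering routine.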
Looking at individual geological formation names as they appear in text, it may be possible to derive ‘sentiment’ and ‘subjectivity’ of the formation. Using Part of Speech (POS) tagging, nouns that occur before the phrase ‘Formation’ or ‘Fm’ for example, can be extracted.
The cross-plot below shows some Formation names that appear around the search query term ‘leaching’. Polarity is on the x-axis, denoting how the Formation is perceived, in a negative (-1) versus a positive (+1) light. This is achieved by analysing the words (using Bayesian statistical algorithms) that co-occur with the geological Formation mentions in text. Simplifying, terms such as ‘good’, ‘surprising’ and ‘abundance’ are deemed ‘positive’, whereas terms such as ‘poor’, ‘error’ and ‘problem’ are deemed ‘negative’. On the y-axis is subjectivity, from objective (0) to subjective (1), with terms such as ‘strongly suggest’ and ‘by far’ being indicative of subjective views. Standard sentiment algorithms cannot be used with accuracy on geoscience content, as everyday terms such as ‘old’, ‘fault’ and ‘thick’, which can denote negative views in social media, are not negative in a geoscience sense!
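A highly simplified sketch of the idea (the lexicons below are invented stand-ins for the Bayesian scoring described above, the subjectivity measure is a crude token ratio, and the Formation-name pattern replaces full POS tagging):

```python
import re

# Invented lexicons for illustration only.
POSITIVE = {"good", "surprising", "abundance", "excellent"}
NEGATIVE = {"poor", "error", "problem"}
SUBJECTIVE = {"strongly", "suggest", "far", "clearly"}

# Capitalised word immediately before 'Formation' or 'Fm'.
FORMATION = re.compile(r"\b([A-Z][a-z]+)\s+(?:Formation|Fm)\b")

def formation_scores(sentences):
    """Collect (polarity, subjectivity) per Formation from co-occurring words."""
    scores = {}
    for sent in sentences:
        tokens = re.findall(r"[a-z]+", sent.lower())
        pol = (sum(t in POSITIVE for t in tokens)
               - sum(t in NEGATIVE for t in tokens))
        subj = sum(t in SUBJECTIVE for t in tokens) / max(len(tokens), 1)
        for name in FORMATION.findall(sent):
            scores.setdefault(name, []).append((pol, subj))
    return scores

sents = ["Leaching in the Citronelle Formation remains a problem.",
         "Results strongly suggest the Popovich Fm is prospective."]
scores = formation_scores(sents)
```

Averaging each Formation’s scores gives the x (polarity) and y (subjectivity) coordinates for the cross-plot.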
From these data, the Citronelle Formation appears in a negative light that may stimulate the scientist to investigate the sentences (context) which may lead to a learning event. Conversely, the high ‘subjectivity’ of the Popovich Formation may also trigger curiosity to understand the context which may lead to a re-interpretation.
Geo-referencing journal articles is not a new technique. However, in many cases it is the entire journal article (or just images within the article) that is referenced; in essence, a summary of ‘aboutness’. The map below shows ‘mentions’ of the search query term concept ‘precambrian’ in the full text (body text) of all articles in the SEG corpus where they can be automatically geo-located. Compared with techniques that only use keywords and/or abstracts of the journal article (the ‘information container’), this yields a 200% increase in geo-locations. Clicking on the locations to show the ‘mentions’ (the sentence or paragraph in which the query term concept occurs) may yield insights that simply geo-locating entire journal articles cannot.
Instead of colour coding the frequency of occurrence, bubble plots can be used, where the size of the bubble relates to the frequency of mention. This can also be combined with data external to the text corpus. An example is shown below, integrating the surface geology of the United States using US Geological Survey (USGS) WMS GIS spatial data.
More granular geo-coding is simply a case of adding in more specific lookup lists for latitude and longitude of any entity.
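The centroid lookup can be sketched as below (the gazetteer coordinates are approximate and purely illustrative):

```python
# Hypothetical centroid gazetteer: place -> (lat, lon).
# Coordinates are approximate, for illustration only.
CENTROIDS = {
    "Sirte Basin": (28.9, 18.9),
    "New Mexico": (34.5, -106.1),
}

def georeference(mentions):
    """Attach a centroid to each (place, count) pair where one is known;
    unknown places are simply dropped rather than guessed."""
    return [(place, n, CENTROIDS[place])
            for place, n in mentions if place in CENTROIDS]

points = georeference([("Sirte Basin", 12), ("Atlantis", 1)])
```

Adding more granular entries (fields, mines, intra-basin features) to the gazetteer is all that is needed for finer-grained maps.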
Another common technique to ‘summarize’ the essence of what ‘lies beneath’ in text relies on a range of methods, from complex word co-occurrence patterns and Principal Component Analysis (PCA) to eigenvalues and eigenvectors. The image below shows the clusters of topics for the search query ‘leaching’ in the SEG text corpus. These techniques can be applied at any level of granularity: an abstract, a single article, a whole corpus, or as a delta between journals or corpora. Topic modelling is typically applied longitudinally (through time) to surface changes in the intent behind text.
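A crude stand-in for such topic surfacing, using simple sentence-level co-occurrence counts rather than PCA (the stopword list and text are invented):

```python
import re
from collections import Counter

STOP = {"the", "of", "in", "and", "is", "a", "on", "by"}

def topic_terms(query, text, k=3):
    """Most frequent content words co-occurring with the query in the
    same sentence -- a rough proxy for the query's topic cluster."""
    counts = Counter()
    for sent in re.split(r"(?<=[.!?])\s+", text.lower()):
        tokens = re.findall(r"[a-z]+", sent)
        if query in tokens:
            counts.update(t for t in tokens if t != query and t not in STOP)
    return [t for t, _ in counts.most_common(k)]

text = ("Acid leaching of copper ore is common. "
        "Leaching rates depend on acid strength. "
        "Gold recovery uses cyanide.")
```

Words from sentences that never mention the query (here, ‘cyanide’) stay out of the cluster; PCA/eigen-decomposition would additionally group the surfaced terms into distinct topics.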
The word co-occurrence patterns of any entity can be converted into a mathematical vector, and the similarity of entities compared with one another. From literature reviews, these techniques have been applied sparsely, if at all, within the geoscience discipline compared to simple entity extraction and association.
The cross-plot below shows geological periods in the SEG corpus plotted by their similarity to the word vector of ‘volcanics’ on the x-axis and ‘limestone’ on the y-axis. In the bottom right, the Pre-Cambrian Archean period (2.5-4 billion years ago, largely before complex life on Earth) is very similar to ‘volcanics’ and not ‘limestone’, which is what you would expect. Conversely, the Mississippian (top middle) is very similar to ‘limestone’ and not ‘volcanics’, which is also what you would expect, as sea level was very high with warm shallow seas. So again, this supports the theory that word vectors from text can surface real-world patterns that make sense. Perhaps they can also reveal what we don’t yet know.
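The similarity computation can be sketched with cosine similarity over sparse co-occurrence vectors (the vectors below are invented; a real run would derive them from the corpus or from trained embeddings):

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors (Counter objects)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy context-word count vectors, invented for illustration.
VECTORS = {
    "archean":       Counter(volcanics=8, greenstone=5, limestone=1),
    "mississippian": Counter(limestone=9, marine=4, volcanics=1),
    "volcanics":     Counter(volcanics=10, greenstone=3),
    "limestone":     Counter(limestone=10, marine=3),
}

sim_arch_volc = cosine(VECTORS["archean"], VECTORS["volcanics"])
sim_arch_lime = cosine(VECTORS["archean"], VECTORS["limestone"])
```

Each geological period’s two similarity scores become its (x, y) position on the cross-plot.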
A variation of this technique, which it is believed may never have been tried before in the Geosciences, is combining data from a database with word vector information. In the cross-plot below, US states (e.g. Florida, Wyoming, Oregon) are plotted by their annual rainfall on the y-axis (from the National Oceanic and Atmospheric Administration (NOAA) database) and their similarity to the word vector ‘Arsenic’ in the SEG corpus on the x-axis. A weak correlation (R² = 0.26) is found, implying greater similarity to the word vector ‘Arsenic’ with decreasing rainfall. Simplistically, this could be due to more arid environments (less rainfall) leading to higher pH conditions, with Arsenic more likely to mobilise from the underlying geology into groundwater and aquifers.
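The correlation step can be sketched as follows (the rainfall and similarity values are invented and deliberately cleaner than the weak R² = 0.26 reported above):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient r between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-state values: annual rainfall (inches) from a database,
# and similarity of each state's word vector to 'Arsenic' from the corpus.
rainfall = [10, 15, 30, 45, 60]
similarity = [0.8, 0.7, 0.5, 0.35, 0.2]

r = pearson(rainfall, similarity)
r_squared = r * r
```

A negative r, as here, corresponds to the observed pattern of higher ‘Arsenic’ similarity with decreasing rainfall.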
This could point to the potential value of combining word vector similarities from text with traditional measured data stored in structured databases. The whole may be greater than the sum of the parts.
A final example integrates data from the US National Cancer Database (CNC), Alzheimer’s Association and again, text vectors from the SEG corpus. The average cancer mortality rate per US state (per 100,000 people) is plotted on the x-axis, the average Alzheimer’s mortality rate per US state is plotted on the y-axis. The similarity of US state word vectors to the heavy metal ‘Cadmium’ word vector is shown by the colour and size of marker. The more similar, the larger the circle. Those above average similarity in the sample are coloured orange, below average are coloured blue. There is no statistically significant correlation and even if there was, correlation is of course not causation. There are many demographic and socio-economic factors at play in a complex system. However, these techniques may be useful in surfacing patterns that warrant further investigation or hypothesis testing.
The final example compares the linkage between the word vector of every concept in the corpus with the word vector of every other concept, and their similarity to the word vectors of a hypothesized theme. In the example below, the theme is elements typically associated with geogenic (natural) contamination in groundwater (e.g. Aluminium, Iron, Copper, Mercury, Lead).
A new simple ratio has been developed (Cleverley 2017), combining linear regression with a scaling factor representing the individual similarity of the concept(s) to the theme, to surface potentially ‘unusual’ associations which may warrant further discovery. In the run below, over 150 million word vector combinations were tested by an automated algorithm. This took four hours on a standard laptop.
For example, ‘Argon dating’ and ‘Feldspar Chlorite’ as individual concepts, do not have high similarity to the theme. However, as an association, they have a disproportionately higher correlation than one would expect, which may warrant further exploration to identify a causal mechanism.
Just as Swanson (1988) manually identified (inferred) a link between magnesium deficiency and migraines that was not present in any single article, but only in articles sharing similar concepts, these automated techniques could highlight new associations. This could lead to new knowledge and, ultimately, new scientific discoveries hidden amongst our text in plain sight.
Knowledge is socially constructed, and different text corpora will likely lead to different word vectors for the same concepts, depending on the sub-discipline and nature of the text. These differences may also surface clues to new phenomena of interest.
Based on literature reviews, the use of word vector similarities of entities with external data is potentially under-utilized in the geosciences. Future work will most likely expand the research to apply to much larger quantities of journals and further develop automated approaches. Questions, comments and ideas are always welcome, feel free to contact me on the email below.
Paul Cleverley PhD
Robert Gordon University
A PDF of this article is available by clicking <Here>