Month: April 2018

STEPS Distinguished Lecture on Big Data

I have been invited to give the Distinguished Lecture on Big Data next month for the Science and Technology Exploration and Production (STEPS) program run by Halliburton. The program aims to foster geoscience excellence by facilitating thematic research, and offers academics the opportunity to engage with Landmark (Halliburton) and the wider exploration and production community.

The lecture title is Big Data – Small Patterns: Applying Geoscience Sentiment Analysis to Unstructured Text.

I will be sharing recent results and findings from the Geoscience aware sentiment AnalyZER (GAZER) algorithm, which I developed in Python and have applied to geological elements in public domain texts. It is designed to surface interesting associative patterns relating to concepts such as ‘source rock’, ‘reservoir’, ‘trap’ and ‘seal’. These patterns might be unknown to exploration geoscientists because they are buried in volumes of documents too large to ever be read, and are too subtle to be detected by traditional search engines.

The hypothesis is that if a geoscientist can be surprised by these patterns, and there is legitimate evidence for that ‘surprise’, it is likely to lead to a learning event and potentially a new play/model, changing what people know – or think they know.

More here:–-Small-Patterns-


Sentiment Analysis of Oil Company Annual Reports


A research paper I co-authored with Laura Muir, Associate Professor at the School of Computing, Edinburgh Napier University, has been published this week in the Journal of Knowledge Organization.

Sentiment analysis is increasingly recognized as a key part of enterprise search & discovery capability.

We applied sentiment analysis to public oil company annual reports. One company stands out for its over-positive rhetoric, the “Pollyanna Effect” towards the future, relative to its peers.

A lexicon was developed to detect edge member strong and hesitant forward looking language. Biologically inspired diversity algorithms were used to identify word patterns over time in companies, compared to subsequent revenue changes. One oil company showed a statistically significant association: their diversity of strong/hesitant language increased prior to a subsequent decrease in relative business performance.
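As a rough illustration of the kind of diversity measure borrowed from ecology, here is a minimal Python sketch using the Shannon index. This is an assumption for illustration only: the post does not say which diversity algorithm was used, and the STRONG and HESITANT word lists below are hypothetical stand-ins, not the lexicon from the paper.

```python
import math
import re
from collections import Counter

# Hypothetical mini-lexicon for illustration; not the lexicon from the paper.
STRONG = {"will", "confident", "certain"}
HESITANT = {"may", "might", "could"}


def shannon_diversity(tokens, lexicon):
    """Shannon index H = -sum(p * ln p) over the lexicon terms found in tokens."""
    counts = Counter(t for t in tokens if t in lexicon)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no lexicon terms present, so no diversity to measure
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def diversity_by_year(reports):
    """reports maps year -> report text; returns year -> diversity score."""
    lexicon = STRONG | HESITANT
    return {
        year: shannon_diversity(re.findall(r"[a-z]+", text.lower()), lexicon)
        for year, text in sorted(reports.items())
    }
```

A report that leans on a single forward-looking term scores zero, while one that spreads evenly across many strong and hesitant terms scores higher. The yearly series could then be compared with subsequent revenue changes.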

A major industrial accident was also detected in another company’s reports without a need to read them. It was manifested through a spike in the relative frequency of the topic ‘lessons’ followed by a spike in topics relating to the ‘future’. The effects of the catastrophe were still evident in word patterns several years after its occurrence. This supports the probable existence of Discourse of Renewal Theory (DRT) in practice.

The findings support the assertion that various social phenomena can be found in company reports by analysing word patterns over time – and some may have predictive properties. There may be benefits of applying sentiment algorithms (as standard) in enterprise search and discovery deployments.

Links here: Issue 2 KO and Institutional Repository


Enterprise Search: New Methods for Inferring User Satisfaction?

Measuring user satisfaction with an enterprise search tool can be difficult. Feedback mechanisms on the user interface tend to only capture a small self-selected sample that may be skewed towards negative views. Whilst surveys can capture more data, they are also self-selecting and tend to be small scale compared to actual enterprise usage. Clickthrough data is useful as a surrogate for search quality and session behaviour but does not necessarily translate into user satisfaction.

A small experiment was undertaken with a domain search tool in a large oil & gas company. Using the search log data, a random sample (n=47) of users who had used the search tool in the past 2 weeks were invited to participate in a questionnaire. They were asked to provide their level of satisfaction with the search tool based on the previous 2 week period using a 5 point Likert item. This was subsequently correlated with the existing search log data that they did not see (the number of days they had used the search tool during that 2 week period). Figure 1 shows the results.
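The step of joining questionnaire answers to usage can be sketched in Python as below. This is a minimal sketch under assumptions: the log is taken to be (user, date) pairs, the function names are hypothetical, and Spearman rank correlation is my choice for an ordinal Likert item rather than a statement of the statistic used in the experiment.

```python
import math
from collections import defaultdict


def days_used(log_rows):
    """Count the distinct calendar days on which each user ran a search.

    log_rows is an iterable of (user_id, date) pairs (assumed log layout).
    """
    days = defaultdict(set)
    for user, day in log_rows:
        days[user].add(day)
    return {user: len(seen) for user, seen in days.items()}


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / math.sqrt(sxx * syy)
```

With the per-user day counts in one list and the Likert scores in another (same user order), spearman(days, scores) gives a single measure of association between usage and stated satisfaction.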

Figure 1 – Search satisfaction against usage (number of days during a 2 week period)

There were six users who were very satisfied but only used the search tool once/twice during the 2 week period. Conversely, there were six users who were dissatisfied and also used the search tool only once/twice during the 2 week period. Gender and age were not statistically significant.

The data points outlined in the red circle are interesting. In the small sample tested, all the users who had used the search tool on over 50% of the working days (5 working days) over the prior 2 week period (10 working days), were satisfied/very satisfied. These could be considered ‘happy repeat customers’.

This could be a marker for inferring, from large volumes of search log data, one subset of users who are satisfied with a search tool. It is only a subset, as this group represented just 30.7% of all users who were satisfied (recall), but it was 100% accurate (precision).
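For anyone wanting to try this on their own logs, the precision and recall of such a usage marker can be computed as in the Python sketch below. The assumptions are mine: the function name is hypothetical, Likert scores of 4 and 5 are taken to mean satisfied/very satisfied, and the min_days cut-off is a parameter because the exact threshold depends on how "over 50% of working days" is read.

```python
def marker_precision_recall(users, min_days=5, satisfied_from=4):
    """Score the 'frequent user' marker against questionnaire answers.

    users is an iterable of (days_used, likert_score) pairs.
    A user is flagged when days_used >= min_days (assumed cut-off) and
    counts as satisfied when likert_score >= satisfied_from (assumed coding).
    """
    tp = fp = fn = 0
    for days, score in users:
        flagged = days >= min_days
        satisfied = score >= satisfied_from
        if flagged and satisfied:
            tp += 1  # flagged by usage and actually satisfied
        elif flagged:
            fp += 1  # flagged by usage but not satisfied
        elif satisfied:
            fn += 1  # satisfied but missed by the marker
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


# Made-up illustrative pairs, not the study data.
example = [(6, 5), (7, 4), (1, 5), (2, 4), (1, 2), (3, 3)]
precision, recall = marker_precision_recall(example)
```

High precision with low recall is the pattern described above: everyone the marker flags is satisfied, but many satisfied users are infrequent users and are never flagged.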

As a causal mechanism, it is postulated that a user would be unlikely to use an enterprise search tool ‘every other day’ if they were not getting some value out of it. An alternative explanation is that the user has no choice: they have to use the tool as there is no other way to locate their information, i.e. high usage does not necessarily translate into satisfaction. However, there is plenty of evidence for poor take-up of enterprise search tools (people find other ways to locate what they need), so the best explanation for the recurring behaviour is likely to be that they have some positive experience with the tool.

That is not to say that users who use a search tool less are not satisfied, of course (as these data show). This could be one marker for companies to assess user satisfaction by exploiting large usage volumes rather than self-selecting surveys. At present, there is no statistical significance to this finding and the data set is small, presenting an area for further research.