Author: phcleverley

STEPS Distinguished Lecture on Big Data

Invited to give the Distinguished Lecture on Big Data next month for the Science and Technology Exploration and Production (STEPS) program run by Halliburton. The program aims to foster geoscience excellence through the facilitation of thematic research and offers the opportunity for academics to engage with Landmark (Halliburton) and the wider exploration and production community.

The lecture title is ‘Big Data – Small Patterns: Applying Geoscience Sentiment Analysis to Unstructured Text’.

Will be sharing recent results and findings from the Geoscience Aware sentiment analyZER (GAZER) algorithm I developed in Python, which has been applied to geological elements in public-domain texts. It is designed to surface interesting associative patterns relating to concepts such as ‘source rock’, ‘reservoir’, ‘trap’ and ‘seal’ – patterns that might be unknown to exploration geoscientists because they are buried in volumes of documents too large to ever be read and too subtle to be detected by traditional search engines.
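
By way of illustration only (the GAZER implementation itself is not public), a minimal Python sketch of the underlying idea – counting words that co-occur in the same sentence as a petroleum system element mention – could look like this; the element list and toy corpus are invented for the example:

```python
# Illustrative sketch only - the actual GAZER implementation is not public.
# Counts words that co-occur in the same sentence as petroleum system
# element mentions, hinting at associative patterns worth inspecting.
import re
from collections import Counter

ELEMENTS = ("source rock", "reservoir", "trap", "seal")

def element_cooccurrence(documents):
    """Return a Counter of co-occurring words for each element."""
    counts = {e: Counter() for e in ELEMENTS}
    for doc in documents:
        # naive sentence split; a production system would use an NLP library
        for sentence in re.split(r"[.!?]+", doc.lower()):
            words = re.findall(r"[a-z]+", sentence)
            for element in ELEMENTS:
                if element in sentence:
                    counts[element].update(
                        w for w in words if w not in element.split())
    return counts

# Toy corpus purely for illustration
corpus = ["The lacustrine source rock is thermally mature.",
          "A faulted trap with an effective seal was mapped."]
print(element_cooccurrence(corpus)["source rock"].most_common(5))
```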

The hypothesis is that if a geoscientist can be surprised by these patterns, and there is legitimate evidence for that ‘surprise’, it is likely to lead to a learning event and potentially a new play/model – changing what people know, or think they know.

More here:

http://www.ienergy.community/NewsDetails/dt/Detail/ItemID/361/Big-Data-–-Small-Patterns-


Sentiment Analysis of Oil Company Annual Reports


A research paper I co-authored with Laura Muir, Associate Professor at the School of Computing, Edinburgh Napier University, was published this week in the journal Knowledge Organization.

It is increasingly being recognized that sentiment analysis is a key part of enterprise search & discovery capability.

We applied sentiment analysis to public oil company annual reports. One company stands out from its peers for over-positive rhetoric towards the future – the “Pollyanna Effect”.

A lexicon was developed to detect edge-member strong and hesitant forward-looking language. Biologically inspired diversity algorithms were then used to identify word patterns in companies’ reports over time and compare them with subsequent revenue changes. One oil company showed a statistically significant association: the diversity of its strong/hesitant language increased prior to a subsequent decrease in relative business performance.
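
As a rough illustration of one biologically inspired measure (my sketch, not the paper’s exact method): Shannon diversity, borrowed from ecology, can be computed over the frequencies of strong/hesitant lexicon terms in a report. The tiny lexicons below are stand-ins for the full lexicon developed in the study:

```python
# Sketch: Shannon diversity (an ecology measure) over counts of
# forward-looking lexicon terms. STRONG/HESITANT are tiny stand-ins
# for the full lexicon developed in the paper.
import math
from collections import Counter

STRONG = {"will", "confident", "committed", "certain"}
HESITANT = {"may", "might", "could", "possibly"}

def shannon_diversity(text):
    """Shannon diversity H over strong/hesitant term frequencies."""
    words = text.lower().split()
    counts = Counter(w for w in words if w in STRONG | HESITANT)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total)
                for n in counts.values())

print(shannon_diversity(
    "We will grow, although margins may fall and costs could rise"))
```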

A major industrial accident was also detected in another company’s reports without any need to read them: it manifested through spike increases in the relative frequency of the topic ‘lessons’, followed by a spike in topics relating to the ‘future’. The effects of the catastrophe were still evident in word patterns several years after it occurred. This supports the applicability of Discourse of Renewal Theory (DRT) in practice.
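
A simple way to detect such spikes (again my illustration, not the paper’s method) is to flag years where a topic term’s relative frequency rises well above its historical mean; the frequencies below are invented:

```python
# Sketch: flag report years where a topic word's relative frequency
# rises well above its mean. A modest 1.5-sigma threshold is used
# because a single spike also inflates the deviation in short series.
from statistics import mean, stdev

def spike_years(freq_by_year, threshold=1.5):
    """freq_by_year maps year -> relative frequency of the topic term."""
    values = list(freq_by_year.values())
    mu, sigma = mean(values), stdev(values)
    return [y for y, f in freq_by_year.items()
            if f > mu + threshold * sigma]

# Invented relative frequencies of the topic 'lessons' per annual report
lessons = {2008: 0.01, 2009: 0.01, 2010: 0.09, 2011: 0.03, 2012: 0.02}
print(spike_years(lessons))  # [2010]
```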

The findings support the assertion that various social phenomena can be detected in company reports by analysing word patterns over time – and some may have predictive properties. There may be benefits in applying sentiment algorithms as standard in enterprise search and discovery deployments.

Links here: Issue 2 KO and Institutional Repository

 

Enterprise Search: New Methods for Inferring User Satisfaction?

Measuring user satisfaction with an enterprise search tool can be difficult. Feedback mechanisms on the user interface tend to only capture a small self-selected sample that may be skewed towards negative views. Whilst surveys can capture more data, they are also self-selecting and tend to be small scale compared to actual enterprise usage. Clickthrough data is useful as a surrogate for search quality and session behaviour but does not necessarily translate into user satisfaction.

A small experiment was undertaken with a domain search tool in a large oil & gas company. Using the search log data, a random sample (n=47) of users who had used the search tool in the past two weeks was invited to complete a questionnaire. They were asked to rate their satisfaction with the search tool over the previous two-week period on a 5-point Likert item. Responses were then correlated with existing search log data that the respondents did not see (the number of days they had used the search tool during that two-week period). Figure 1 shows the results.

Figure 1 – Search satisfaction against usage (number of days during a two-week period)

There were six users who were very satisfied yet used the search tool only once or twice during the two-week period; conversely, six users who used it just as infrequently were dissatisfied. Neither gender nor age was statistically significant.

The data points outlined in the red circle are interesting. In the small sample tested, all the users who had used the search tool on more than 50% of working days (more than 5 of the 10 working days in the prior two-week period) were satisfied or very satisfied. These could be considered ‘happy repeat customers’.

This could be a marker for inferring, from large volumes of search log data, one subset of users who are satisfied with a search tool. It is only a subset: this group represented just 30.7% of all satisfied users (recall), but the marker was 100% accurate (precision).
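
For concreteness, here is a small sketch of how such a marker would be scored; the records below are hypothetical data shaped like the result above (every heavy user satisfied, so 100% precision, but most satisfied users being light users, so low recall), not the study’s data:

```python
# Sketch of the marker evaluation: 'used the tool on more than half of
# the 10 working days' predicts satisfaction; score it with precision
# and recall. The records below are hypothetical, not the study's data.
def marker_scores(records, min_days=6):
    """records: (days_used_in_two_weeks, satisfied) pairs."""
    flagged = [satisfied for days, satisfied in records if days >= min_days]
    true_positives = sum(flagged)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / sum(s for _, s in records)
    return precision, recall

data = [(8, True), (7, True), (6, True),   # heavy users, all satisfied
        (2, True), (1, True), (3, True),   # light users, satisfied
        (4, True), (2, True), (1, True),
        (2, False), (1, False)]            # light users, dissatisfied
print(marker_scores(data))  # (1.0, 0.333...)
```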

As a causal mechanism, it is postulated that a user would be unlikely to use an enterprise search tool ‘every other day’ if they were not getting some value out of it. An alternative explanation is that users have no choice: they have to use the tool because there is no other way to locate their information, i.e. high usage does not necessarily translate into satisfaction. However, there is plenty of evidence for poor take-up of enterprise search tools (people find other ways to locate what they need), so the best explanation is likely that some positive experience with the tool underlies the recurring behaviour.

That is not to say, of course, that users who use a search tool less often are not satisfied (as these data show). This could be one marker for companies to assess user satisfaction by exploiting large usage volumes rather than self-selecting surveys. At present the finding has no statistical significance and the data set is small, presenting an area for further research.

 

Transforming Digital Worlds


Along with 450 academics and practitioners, I attended the iSchools Transforming Digital Worlds conference this week at the University of Sheffield. Some fascinating presentations on information behaviour, information seeking and information retrieval.

I was particularly interested in the keynote from Dr Lynn Connaway. Many of the messages, although not new and perhaps well known to some, were put in a tone and context that really resonated with me – in a business world where we are often too quick to jump to the solution or answer:

“To identify why and how people get information we must first watch and listen”

“We need to understand motivations and expectations for using technologies”

In an interview study of 164 people from high schools and universities, some insightful gems were uncovered regarding digital literacy. This is in a landscape where critical thinking skills – the ability to examine the credibility and trustworthiness of information – are increasingly significant. Take this quote from a 17-year-old high school student, gathered during the interviews:

“I always stick with the first thing that comes up on Google because I think that’s the most popular site which means that’s the most correct.”

Connaway then made the point: “Critical thinking skills are a primary concern of university administrators and are crucial for developing an informed citizenry.” This was supported by a quote from a university provost during the interviews:

“We should be helping people learn how to think, learn how to be skeptical, learn how to use critical thinking skills, learn how to be self-reflective. I think because those things are so much harder to assess and to demonstrate we have not done as good a job telling that story.”

Although no mention was made of the business workplace, I have seen equivalent issues with digital literacy amongst seasoned professionals, especially around ‘search’ – not only in their use of their own corporate search engines, but also in their use of Internet search engines for work.

This is by no means universal. For example, I was recently asked by a geoscientist to recommend Internet search engines other than Google (e.g. DuckDuckGo) because they were concerned that Google was personalizing the results too much and blinding them to potential information discoveries. There are many cases, however, where I have observed critical geoscientific information being missed in work tasks simply because of gaps in search literacy.

Continuing to develop digital literacy capabilities in the ‘Digitalization’ workplace (not just how people use technology, but how they interact with information through technology) may be highly significant for organizations in gaining a competitive advantage.

Beyond Google


I gave a lecture this week on search & analytics to students on the online Petroleum Data Management course at Robert Gordon University. Some excellent discussions, debate, questions and a thoroughly enjoyable session with knowledgeable students mostly in full time employment from around the world.

My topic was ‘Beyond the Search Box and Ten Blue Links’ – how a new generation of search tools is emerging in the workplace and challenging the cosy search ‘habitus’ we are used to. These tools take search from simply remembering (retrieving information) up Bloom’s taxonomy towards higher-level ‘human-like’ cognitive tasks such as comparing, contrasting, summarizing and predicting. I also presented some practical examples of work-task-specific search tools and use cases in the oil and gas industry.

Internet search engines like Google have been (and are) tremendously successful – a social phenomenon. They have probably become an epistemology for some tasks – ‘how we come to know things’ – as covered in my previous posts last year. However, ranking tends to be popularity-based, so some knowledge may be hidden by its obscurity; and although people find and discover relevant and useful information, they may be limited by their own knowledge of which keywords to enter into the search box. In some research I conducted with geoscientists, published in peer-reviewed journals back in 2014, several issues with Internet searching using tools like Google were highlighted:

“Problem I have with Google is choosing right selection of words to find something”

“With analogues you don’t know what terms to query on because you don’t know what they are”

“I use Google as an exploratory tool. Some things difficult to find in Google”

“How do I know I have really found the most relevant?”

With exploratory search task goals there is not a single correct answer (unlike lookup/known-item search task goals) – the user is learning, and growing awareness of the information changes their information needs, which are dynamic. That means a user can often find plenty of useful, relevant material on the first few search results pages and be ‘obliviously satisfied’, because they are unaware of the critical (perhaps even game-changing) information they failed to find.

Some of my previously published research supported this, showing no relationship between ‘user satisfaction’ and how well users actually performed exploratory search tasks using standard ‘Google-like’ search tools.

Any user interface is effectively a theory – what we think is the best way to present search results. The Google-like ‘search box and ten blue links’ seems efficient enough for certain search task goals. However, for exploratory searching in the workplace, and for meeting the needs of subject matter experts in organizations on specific tasks, we may need to think ‘outside the box’. There has been significant and ongoing research into exploratory search user interfaces.

As one research participant (a subject matter expert in geoscience) commented about search results ranked by traditional frequency/popularity methods: ‘relevant but not interesting’. Perhaps some subject matter experts, on some tasks, really want to be shown something they don’t already know. Whilst it is arguably impossible for a technology system to ‘know’ what someone will find surprising, some techniques have been shown to be more likely than others to surface such material through content-driven algorithmic prompts.
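
One possible content-driven prompt (my illustration; the post does not specify a technique) is to rank terms co-occurring with a query concept by pointwise mutual information rather than raw frequency, so that rare but strongly associated terms surface above merely popular ones:

```python
# Illustration of a content-driven prompt: rank terms co-occurring with
# a query concept by pointwise mutual information (PMI) instead of raw
# frequency, surfacing rare but strongly associated terms.
import math
from collections import Counter

def pmi_ranking(docs, query):
    """Rank co-occurring terms by PMI with the query term."""
    term_df, joint_df = Counter(), Counter()
    n = len(docs)
    for doc in docs:
        words = set(doc.lower().split())
        term_df.update(words)
        if query in words:
            joint_df.update(words - {query})
    q_df = term_df[query]
    scores = {t: math.log(df * n / (term_df[t] * q_df))
              for t, df in joint_df.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy documents purely for illustration
docs = ["stratigraphic trap play", "structural trap anticline",
        "anticline fold belt", "stratigraphic play fairway"]
print(pmi_ranking(docs, "trap")[:3])
```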

So perhaps, in some workplace situations, we need to move beyond what is relevant to what is useful. According to John Maeda in The Laws of Simplicity, ‘simplicity is about subtracting the obvious, and adding the meaningful’.

Finally, moving beyond the ‘search box and ten blue links’ means more than user interface design and technology. The literacy of the searcher is likely to be key for exploratory search – not just query formulation, but the metacognitive processes of planning, monitoring and reflecting while searching. Some commentators suggest that if you get your content organized and your technology right, ‘search will take care of itself’. Perhaps that holds for very simple lookup/known-item search goals, but it is highly unlikely for exploratory search; it is a philosophy that, taken to its conclusion, denies human agency in the information search process. The evidence points to search literacy as key for exploratory search tasks in the workplace.

Big Data in the Geosciences: Geoscience Aware Sentiment Analyzer


The Geoscience Aware text sentiment algorithm improves on out-of-the-box sentiment tools from IBM Watson, Google, Microsoft and Amazon by over 30% for geoscience sentiment in text.

Presented early research findings today at the Janet Watson ‘Big Data in the Geosciences’ conference at the Geological Society of London.

Google opened proceedings with a talk on satellite imagery and the Earth Engine; subsequent talks ranged from using Twitter for early warnings of earthquakes, through virtual reality and digital analogues, to applying deep learning to detect volcano deformation. Some fascinating insights.

My latest research addressed the sentiment/tone – the context – around mentions of petroleum system elements (such as source rock, migration, reservoir and trapping) in literature, company reports and presentations. The hypothesis is that stacked, somewhat independent opinion/tone in text – the averages, the outliers, the contradictions – may show geoscientists what they don’t know and challenge what they think they do know.

The research question was whether a geoscience-aware algorithm could improve on the existing APIs/algorithms used for sentiment analysis, and how useful the resulting visualizations might be.

Using a held-back test set of 750 labelled examples, the Geoscience Aware text sentiment analyZER (GAZER) algorithm achieved 90.4% accuracy for two classes (positive and negative) and 84.26% accuracy for three classes (positive, negative and neutral sentiment). This compared favourably with out-of-the-box Paragraph Vector and Naïve Bayes approaches. It also compares favourably to the out-of-the-box sentiment Cloud APIs from IBM Watson, Microsoft, Amazon and Google, which averaged approximately 50% accuracy for the three classes.
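
For readers wanting to see what an out-of-the-box baseline of this kind looks like, a minimal Naïve Bayes sentiment classifier in scikit-learn is sketched below; the toy sentences and labels are invented, and the published 750-sentence test set on GitHub could be substituted:

```python
# A minimal out-of-the-box Naive Bayes baseline in scikit-learn, of the
# generic kind GAZER was compared against. Toy sentences and labels are
# invented; the published 750-sentence test set could be substituted.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["excellent source rock potential", "good reservoir quality",
         "strong hydrocarbon shows", "effective seal present",
         "poor reservoir quality", "seal breached by faulting",
         "trap integrity is doubtful", "immature source rock",
         "the well reached target depth", "cores were cut and logged",
         "the survey covered block 12", "samples were sent for analysis"]
labels = ["positive"] * 4 + ["negative"] * 4 + ["neutral"] * 4

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(X_train, y_train)
print("3-class accuracy:", accuracy_score(y_test, model.predict(X_test)))
```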

This supports findings in other areas showing the need for domain customization for sentiment analysis and the criticality of training data specific to the work task in hand. The findings also support existing literature suggesting that generative probabilistic machine learning algorithms may perform better than discriminative ones when classifying snippets of information, such as sentences and bullets in PowerPoint presentations.

Early evidence suggested that the resulting visualizations, such as streamgraphs of the sentiment data, could be used to challenge individual biases and organizational dogma, potentially generating new knowledge – presenting an area for further research.
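
As a sketch of the visualization idea (the post does not say which tool produced its streamgraphs), matplotlib’s ‘wiggle’ baseline turns a stacked area chart into a streamgraph; the counts below are invented:

```python
# Sketch of a sentiment streamgraph: matplotlib's 'wiggle' baseline
# turns a stacked area chart into a streamgraph. Counts are invented.
import matplotlib.pyplot as plt
import numpy as np

years = np.arange(2005, 2018)
rng = np.random.default_rng(0)
# invented counts of positive / neutral / negative sentences per year
flows = rng.integers(5, 40, size=(3, years.size))

fig, ax = plt.subplots()
ax.stackplot(years, flows, baseline="wiggle",
             labels=["positive", "neutral", "negative"])
ax.set_xlabel("report year")
ax.set_ylabel("sentiment-bearing sentences")
ax.legend(loc="upper left")
plt.show()
```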

Presentation available on SlideShare: Click Here

The 750 labelled sentences (the test set) and a simple Python extraction script are available on GitHub.