Driven by the information need to ‘show me something I don’t already know’, I recently conducted an exploratory study to investigate whether algorithms have the potential to suggest ‘surprising sentences’ from geoscience text.
Ten geoscientists (8 experienced exploration geoscientists and 2 support staff with geoscience backgrounds) each rated 100 test sentences, randomly extracted from public domain geoscience reports, on the likelihood that each sentence had an affordance to be surprising.
These ratings were then compared to ‘surprisingness scores’ generated by an algorithm that had already been applied to the same sentences. The algorithm predicted what geoscientists found surprising with an F1 score of 75.17%. It predicted the sentences judged most likely to be surprising 92.45% of the time.¹
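For readers unfamiliar with the F1 score, the sketch below shows how it combines precision and recall. It assumes the ratings and the algorithm's scores have each been reduced to a binary surprising / not surprising label per sentence; the study summary does not say how that was done, and the function name and toy labels are illustrative only.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels (1 = surprising)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # predicted surprising, rated surprising
    fp = sum(1 for t, p in pairs if not t and p)    # predicted surprising, rated not
    fn = sum(1 for t, p in pairs if t and not p)    # missed a surprising sentence
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels, invented for illustration (not data from the study).
rated = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(rated, predicted)
```

Because F1 balances false alarms against misses, it is not quite the same as the share of sentences classified correctly.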
There was fair² inter-rater agreement (65.78% pairwise agreement, kappa = 0.315), indicating that structure may exist within text that has a common tendency to surprise us all. This provides evidence that, while surprise is unlikely to be governed by absolute laws and will most likely be in part subjectively linked to a geoscientist’s own context and experiences, algorithms can utilise features in text to suggest geoscience sentences with tendencies to be surprising.
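As a rough illustration of how the two agreement figures relate, here is a minimal sketch of mean pairwise agreement and Fleiss' kappa. It assumes every sentence was rated by the same number of raters and that ratings fall into a fixed set of categories; the function name and the toy counts are hypothetical, not the study's data.

```python
def pairwise_agreement_and_fleiss_kappa(counts):
    """counts[i][j] = number of raters who put sentence i in category j.

    Assumes every row sums to the same number of raters (at least 2).
    Returns (mean pairwise agreement, Fleiss' kappa).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Per-sentence share of rater pairs that agree with each other.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Agreement expected by chance, from the overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(n_cats)]
    p_chance = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    if p_chance == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, kappa


# Toy example: 4 sentences, 3 raters, 2 categories (not study data).
toy_counts = [[3, 0], [0, 3], [2, 1], [1, 2]]
agreement, kappa = pairwise_agreement_and_fleiss_kappa(toy_counts)
```

Kappa is lower than raw pairwise agreement because it subtracts the agreement that raters would reach by chance alone.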
The algorithm was created using two methods. Firstly, it was trained on selected informative features derived from a Bayesian classifier, using word order, applied to several thousand previously labelled sentences from public domain petroleum geoscience literature. Labelling was undertaken using a variety of theoretical assumptions I have touched upon in previous presentations. Secondly, nouns and noun phrases were weighted in order to boost the signal.
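The two steps can be illustrated with a toy sketch. This is not the algorithm used in the study: it assumes a naive Bayes classifier, assumes that ‘word order’ can be approximated with bigram features, and assumes that noun weighting can be done by scaling the log-likelihood of a supplied set of terms. The class, parameter names and training sentences are all invented for illustration, and identifying nouns (which would need a part-of-speech tagger) is not shown.

```python
import math
from collections import Counter


def features(sentence):
    """Unigrams plus bigrams, so that some word order is captured."""
    tokens = sentence.lower().split()
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


class ToySurpriseClassifier:
    """Naive Bayes with Laplace smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.label_counts = Counter()
        self.feature_counts = {}  # label -> Counter of feature counts
        self.vocab = set()

    def fit(self, sentences, labels):
        for sentence, label in zip(sentences, labels):
            self.label_counts[label] += 1
            counts = self.feature_counts.setdefault(label, Counter())
            for f in features(sentence):
                counts[f] += 1
                self.vocab.add(f)
        return self

    def log_scores(self, sentence, boosted_terms=frozenset(), boost=2.0):
        """Log-score per label; boosted_terms is a hypothetical noun list."""
        n_sentences = sum(self.label_counts.values())
        scores = {}
        for label, n in self.label_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / n_sentences)
            for f in features(sentence):
                if f not in self.vocab:
                    continue  # ignore features never seen in training
                weight = boost if f in boosted_terms else 1.0
                score += weight * math.log((counts[f] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, sentence, **kwargs):
        scores = self.log_scores(sentence, **kwargs)
        return max(scores, key=scores.get)


# Invented training sentences and labels, purely for illustration.
train = [
    ("unexpected oil shows in basement", "surprising"),
    ("anomalously high porosity at depth", "surprising"),
    ("sandstone reservoir with good porosity", "expected"),
    ("shale acts as regional seal", "expected"),
]
model = ToySurpriseClassifier().fit(*zip(*train))
label = model.predict("unexpected porosity in basement",
                      boosted_terms={"basement"})
```

Scaling a term's log-likelihood by `boost` has the same effect as counting that term more than once in the sentence being scored, which is one simple way to let nouns carry more of the signal.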
The study asked geoscientists and support staff to make judgements on what they thought would be surprising, so a limitation of the study is its reliance on transcendental ‘as if’ judgements. Further work could include in-situ experiments with scientists working in specific areas, using specific relevant content, to identify cases of actual surprise and learning using these types of algorithms. Another limitation is the use of sentences only, rather than trends through time, novel concept associations or wider context, which present areas for further research. These have been discussed in previous posts on this blog and in presentations and articles. It is likely that many of the features are generic and transferable outside of geoscience to other disciplines, which presents another area for further research.
With information volumes and search results typically too vast for geoscientists to feasibly read, and 90% of people often not looking past the first page of search results, we may not really know what potential knowledge is hidden from us. These findings could therefore be used to accelerate learning.
Designing algorithms ‘to surprise’, and suggesting sentences (or wider contexts) as people search, may lead to scientists stumbling across unexpected, insightful and valuable information. This may lead to discoveries and lines of thought that would not have occurred without the algorithm: facilitating serendipity.
A very big thank you to all participants, without whom this research would not have been possible. An extended paper is planned for later this year.
1 Figures differ slightly from those reported on 25th March, after a small error was noticed. This has increased the accuracy by 0.45%.
2 Interpretation of Fleiss’ kappa (from Landis and Koch 1977)