This is a general discussion of some ideas I have been formulating for some time, going back to the work I did in 2014 on serendipitous information discovery.
It is becoming commonplace to extract occurrences of entities from document and literature text, along with their associations with other entities and with numerical values. This can generate a wealth of structured information from unstructured text. But what does it mean? How do you determine what is important and what is not?
Whilst it may be possible to generate new insights directly from the structured information extracted from unstructured text, it is not a given. If it does not tell a person or organization anything they did not already know, then it won’t support the generation of new insights. It may not be completely pointless, as it may simply be another piece of evidence ‘confirming’ what is already known.
To compare what has been generated with what is already explicitly known (written down) in corporate databases, a suite of ‘contradiction’ and ‘discovery’ algorithms may be needed. These algorithms could scan the newly created structured information (generated from unstructured text) to identify contradictions with the ‘prevailing view’ already stored in structured databases, as a form of exploratory data analysis. Alternatively, they could compare structured information generated from company documentation (the prevailing view) with structured information generated from external literature.
A simple example could be highlighting a new ‘data point’ in x,y space on a map. A more complex example could be highlighting a much more ‘positive’ sentiment towards a possibility for action, than the currently prevailing view.
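The simple x,y example can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `novel_points`, the `tolerance` parameter and the coordinates are all made up, and a real map would need a proper coordinate system and distance measure.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]

def novel_points(known: List[Point], extracted: List[Point], tolerance: float) -> List[Point]:
    """Return extracted points with no known point within `tolerance`."""
    return [
        p for p in extracted
        if all(math.dist(p, k) > tolerance for k in known)
    ]

# Illustrative values only: 'known' stands for the prevailing view,
# 'extracted' for points derived from unstructured text.
known = [(0.0, 0.0), (10.0, 10.0)]
extracted = [(0.1, 0.0), (5.0, 5.0)]

new_points = novel_points(known, extracted, tolerance=0.5)
print(new_points)  # [(5.0, 5.0)]
```

The point at (0.1, 0.0) sits within the tolerance of a known point and so reads as confirmation, while (5.0, 5.0) is flagged as new.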
Furthermore, new associations may be formed by ‘joining’ these information sources together; the whole may be greater than the sum of its parts, leading to the emergence of new information and the construction of new knowledge by people. One example is Swanson’s ‘ABC method’ of literature-based discovery, which led to the discovery of a link between ‘magnesium deficiency’ and ‘migraines’ that was subsequently supported experimentally. The related concepts emerged only by combining information; the link was not present in any one source.
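A minimal sketch of the ‘ABC’ idea, assuming the literature has already been reduced to pairs of co-occurring terms: starting from a term A, look for terms C that are never linked to A directly but share one or more intermediate terms B. The function name `abc_candidates` is my own, and the intermediate terms are placeholders, not the connections Swanson actually reported.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

def abc_candidates(links: Iterable[Tuple[str, str]], a: str) -> Dict[str, Set[str]]:
    """Map each candidate C term to the B terms that connect it to A."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for x, y in links:
        neighbours[x].add(y)
        neighbours[y].add(x)

    direct = neighbours[a]
    candidates: Dict[str, Set[str]] = defaultdict(set)
    for b in direct:
        for c in neighbours[b]:
            # Keep C only if it is not A itself and has no direct link to A.
            if c != a and c not in direct:
                candidates[c].add(b)
    return dict(candidates)

# Toy co-occurrence pairs; 'b_term_1' and 'b_term_2' are placeholders.
links = [
    ("migraine", "b_term_1"),
    ("b_term_1", "magnesium"),
    ("migraine", "b_term_2"),
    ("b_term_2", "magnesium"),
    ("migraine", "aura"),
]

found = abc_candidates(links, "migraine")
print(found)
```

Here ‘magnesium’ surfaces as a candidate because two intermediate terms connect it to ‘migraine’, even though no single pair mentions both.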
Such contradictions and new associations are likely to be seen as ‘surprising’ by individuals or organizations; surprise could be described as the response given when information is presented that contradicts the existing ‘mental model’ held towards a state of affairs. Ultimately these could be the sparks for data-driven learning.
Well-known research methods and techniques such as Mixed Methods, Activity Theory and Triangulation have an inherent sensitivity to integrating diverse ‘data’ and identifying tensions, breakdowns, dissonance and contradictions. They attack a problem from a number of different conceptual levels and angles. I have been doing some research comparing different ‘views’ in the literature towards the same subject and how best to visualize these data. The findings will be presented in a future post/article.
Algorithms that ‘sit on top of’ databases holding both ‘born structured’ data and ‘derived structured’ data (generated from unstructured text) could be useful assistants to surface these contradictions from a sea of data. Valuable discoveries may also emerge.
CV + EDT = EV
IF EV = CV THEN Confirmation
IF EV <> CV THEN Contradiction / Emergence
CV = Current View
EDT = Extracted Data from Text (and/or text external to CV)
EV = Enhanced View
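
One possible reading of the logic above, sketched in Python. I am assuming here that a ‘view’ can be held as a mapping from (entity, attribute) to the set of values stated for it, and that each attribute should have a single value, so a differing value counts as a contradiction and an unseen key counts as emergence. The names `enhance` and `compare` and the sample values are illustrative only.

```python
from typing import Dict, List, Set, Tuple

Key = Tuple[str, str]        # (entity, attribute), an assumed keying
View = Dict[Key, Set[str]]   # values stated for each key

def enhance(cv: View, edt: View) -> View:
    """CV + EDT = EV: merge the extracted data into a copy of the current view."""
    ev: View = {key: set(values) for key, values in cv.items()}
    for key, values in edt.items():
        ev.setdefault(key, set()).update(values)
    return ev

def compare(cv: View, edt: View) -> Dict[str, List[Key]]:
    """Label each extracted key against the current view."""
    outcome: Dict[str, List[Key]] = {"confirmation": [], "contradiction": [], "emergence": []}
    for key, values in edt.items():
        if key not in cv:
            outcome["emergence"].append(key)       # nothing known about this key
        elif values <= cv[key]:
            outcome["confirmation"].append(key)    # already in the current view
        else:
            outcome["contradiction"].append(key)   # differs from the current view
    return outcome

# Made-up values for illustration.
cv = {("entity_1", "status"): {"inactive"}, ("entity_2", "status"): {"active"}}
edt = {
    ("entity_1", "status"): {"inactive"},
    ("entity_2", "status"): {"inactive"},
    ("entity_3", "status"): {"active"},
}

ev = enhance(cv, edt)
outcome = compare(cv, edt)
print(ev == cv)   # False: EV <> CV, so there is contradiction and/or emergence
print(outcome)
```

In this sketch EV equals CV exactly when every extracted key is a confirmation, which matches the IF/THEN rules above.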