Using taxonomies and ontologies to extract knowledge from text.
Domain taxonomies can play a crucial role in many automated machine learning tasks. However, one study showed that over 34% of concepts in a taxonomy can remain undetected (false negatives) if the taxonomy is created purely manually. Augmenting the taxonomy design process with inductive, statistical vector-space techniques is likely to lead to significantly improved results.
Taking a step back, creating domain knowledge representations such as taxonomies is a crucial aspect of many machine learning tasks. They help us delineate and detect topics in text, which in turn allows us to reason about those topics, surfacing trends and outliers that can lead to new insights and knowledge. In some cases their everyday application has epistemological significance – how we come to know things.
Taxonomies are one such example: hierarchical ‘is a’ or ‘part of’ relations (semantic similarity) between things. A geoscience example would be ‘oolite’ as a ‘child’ of ‘limestone’; or, spatially, an Oil & Gas prospect occurring within a geological basin. This allows simple inference by machines reading text: if the word ‘oolite’ is encountered, then it must be true that ‘limestone’ is also being discussed in that sentence, because an oolite ‘is a’ limestone (according to the knowledge representation given). From the specific to the general.
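This kind of upward ‘is a’ inference can be sketched in a few lines of code. The terms and parent links below are illustrative assumptions, not an actual industry taxonomy:

```python
# Minimal sketch of hierarchical 'is a' inference over a hand-built
# taxonomy. Terms and parent links are illustrative, not authoritative.

# child -> parent ('is a') relations
TAXONOMY = {
    "oolite": "limestone",
    "limestone": "carbonate rock",
    "carbonate rock": "sedimentary rock",
}

def ancestors(term):
    """Return every broader concept implied when `term` is mentioned."""
    chain = []
    while term in TAXONOMY:
        term = TAXONOMY[term]
        chain.append(term)
    return chain

def implied_concepts(sentence):
    """Detect taxonomy terms in a sentence and expand them upwards."""
    found = set()
    lowered = sentence.lower()
    for term in TAXONOMY:
        if term in lowered:
            found.add(term)
            found.update(ancestors(term))
    return found

# A mention of 'oolite' implies limestone, carbonate rock, sedimentary rock
print(implied_concepts("The reservoir is an oolite of Jurassic age."))
```

The walk from child to parent is exactly the ‘specific to general’ move described above: one detected mention licenses a chain of broader concepts.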
Where the same term has more than one meaning (e.g. well, play, source), disambiguation techniques (rule-based and machine learning) can be applied and designed into the knowledge representation. This can mitigate false positives. These concepts have implications for information retrieval and text analytics.
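A hedged sketch of the rule-based variant, for the ambiguous term ‘well’ (oil well versus the common adverb). The context cues are invented for illustration; a real rule set would be derived from corpus analysis:

```python
# Toy rule-based disambiguation: a mention of 'well' is labelled as an
# oil well only if drilling-related context words co-occur.
# The cue list is an illustrative assumption, not a production rule set.
import re

DRILLING_CUES = {"drilled", "drilling", "wellbore", "casing", "exploration"}

def is_oil_well(sentence):
    """True if 'well' appears alongside drilling context cues."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return "well" in tokens and bool(tokens & DRILLING_CUES)

print(is_oil_well("The well was drilled in 1998."))  # oil well sense
print(is_oil_well("The results compare well."))      # adverb sense
```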
Broader semantic relatedness associations can also be modelled, such as ‘dunes’ <formedIn> ‘desert’. These structured relations may not always hold unless other conditions are met, so reasoning becomes more complex as we move from simple hierarchical taxonomies to network-like ontologies or associations – ‘we codify what we know’.
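Such associations are naturally stored as typed subject–relation–object triples, the building block of graph repositories. A minimal sketch, with invented facts (note that ‘formedIn’ has more than one valid object, which is why such relations cannot be treated as always-true inferences):

```python
# Minimal triple store for typed semantic associations.
# Relation names and facts are illustrative assumptions.

TRIPLES = [
    ("dunes", "formedIn", "desert"),
    ("dunes", "formedIn", "shallow marine settings"),  # aeolian vs. subaqueous
    ("oolite", "isA", "limestone"),
]

def objects(subject, relation):
    """Query all objects linked to `subject` by `relation`."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]

print(objects("dunes", "formedIn"))  # more than one possible setting
```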
These ‘rule-based’ linguistic methods can also be used as an aid for document, topic and sentence classification, and for question answering. This contrasts with supervised statistical classification methods, where typically examples of each class are given and ‘clues’ (informative features) are induced from those examples through statistical methods. In unsupervised machine learning, clusters may inductively lead to the definition of the classes themselves, as a form of exploratory data analysis. All techniques have their benefits and drawbacks depending on the task in hand.
Hybrid statistical-linguistic methods are attractive given sparse training sets, and in recognition of the significant knowledge we already hold in certain scientific models, which can improve machine learning for narrow tasks.
When designing taxonomies as an aid for specific automated machine learning tasks (rather than simply for manual, human-based library indexing), there are common mistakes evidenced in the literature. The most common is for ‘experts’ to describe terms or classes of terms without forensic consideration of how they are actually expressed within texts – their ‘everyday parlance’.
In one published study using 13,000 documents and 334 terms from an industry taxonomy, vector-space techniques showed that 34% of relevant ‘mentions’ were not being detected (false negatives) when that industry taxonomy was applied automatically to the text of documents and reports. As Korzybski remarked, “The map is not the territory”.
Vector-space techniques can be useful, exploiting complex co-occurrence patterns in text and cosine similarity to determine relations and synonyms. These can be used to build data-driven associative networks with wide coverage, the results of which can be stored in graph-like structures and repositories. However, determining universal similarity cut-offs for true synonyms and domain taxonomy creation is not straightforward and is unlikely to be possible, although it is an area of active and ongoing research. For example, in an unpublished study I conducted using 6,000 geoscience journal articles, the most similar term to ‘Jurassic’ was found to be ‘Cretaceous’. They are related terms, but not synonyms or even broader/narrower terms (in thesaurus terminology). Pure statistical methods have their own limitations.
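The ‘Jurassic’/‘Cretaceous’ effect is easy to reproduce in miniature. Below is a toy sketch of cosine similarity over co-occurrence count vectors; the counts are invented purely to illustrate why terms that share distributional contexts score as highly similar even when they are related rather than synonymous:

```python
# Toy cosine similarity over sparse co-occurrence vectors.
# Counts are invented for illustration, not drawn from a real corpus.
import math

# term -> co-occurrence counts with context words
VECTORS = {
    "Jurassic":   {"period": 9, "fauna": 4, "limestone": 3, "rift": 2},
    "Cretaceous": {"period": 8, "fauna": 5, "chalk": 4, "rift": 1},
    "oolite":     {"limestone": 7, "grainstone": 5, "shoal": 3},
}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Shared geological-time contexts make the related (non-synonymous)
# pair score far higher than the cross-category pair.
print(cosine(VECTORS["Jurassic"], VECTORS["Cretaceous"]))
print(cosine(VECTORS["Jurassic"], VECTORS["oolite"]))
```

No cut-off on that score can separate ‘related’ from ‘synonymous’, which is exactly the problem with universal similarity thresholds noted above.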
This all points to a ‘best of both worlds’ approach: manual modelling of human knowledge – taxonomies enabling automatic inference for the task in hand – combined with various data-driven statistical methods that exploit patterns within texts.
A form of ‘model based machine learning’.