When presented with large volumes of text, there are a number of techniques for applying text analytics. I developed the DMA Model as a simple conceptual way to categorize the main types. Rules-based or machine learning techniques can be used individually or together in each of these three areas:
Document Centric
This scenario occurs when the text analytics methods refer to the document (I use the term loosely as to what a document could be, from a 200-page report to a Tweet). This may mean classifying what the document is about, identifying the entities (dates, people, companies, topics etc.) that occur in it, scoring its sentiment, generating an automated summary and so on. The outputs of the process are document centric and tend to be 'linguistic' in nature even if machine learning was used to generate them. These classifications and extractions can be 'written' back into the document metadata, which is useful to support search & discovery, trend analysis, faceted browsing etc.
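A minimal sketch of this document-centric pattern might look like the following. The patterns, toy sentiment lexicon and metadata layout are my own illustrative assumptions, not a reference to any particular product: rules extract entities and a sentiment label, which are then 'written' back into the document's metadata.

```python
import re

# Illustrative rules: an ISO date pattern, a crude company-name pattern,
# and a toy sentiment lexicon. All are assumptions for the sketch.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
COMPANY_PATTERN = re.compile(r"\b[A-Z][a-zA-Z]+ (?:Inc|Ltd|Corp)\b")
POSITIVE_WORDS = {"growth", "profit", "strong"}
NEGATIVE_WORDS = {"loss", "decline", "weak"}

def enrich(document: dict) -> dict:
    """Extract entities and sentiment, then write them into the metadata."""
    text = document["text"]
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    score = (sum(t in POSITIVE_WORDS for t in tokens)
             - sum(t in NEGATIVE_WORDS for t in tokens))
    document["metadata"] = {
        "dates": DATE_PATTERN.findall(text),
        "companies": COMPANY_PATTERN.findall(text),
        "sentiment": "positive" if score > 0 else
                     "negative" if score < 0 else "neutral",
    }
    return document

doc = {"text": "Acme Corp reported strong growth on 2023-05-01."}
print(enrich(doc)["metadata"])
```

The same shape holds whether the extraction rules are hand-written (as here) or replaced by a machine-learned classifier: the output is per-document metadata ready for search and faceted browsing.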
Model Centric
This scenario occurs when the text analytics methods refer to an entire corpus of documents, generally in large volumes, where the whole is greater than the sum of the parts (to quote Smuts). This model is statistical: word patterns (mostly complex word co-occurrence) are turned into a mathematical representation. It can be unsupervised (from simple n-grams to transformer-based text embedding language models) or supervised (such as using the Stanford SQuAD question answering training set). In Information Retrieval (IR) we have used statistical models such as TF-IDF (how important a word is to a document given its relative corpus frequency) for decades. With the emergence of deep learning methods, there are many more 'types' of statistical models we can build from our text to support question answering, natural language generation, anomaly and trend detection, and other capabilities.
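The TF-IDF statistic mentioned above can be computed from scratch in a few lines; the toy corpus and the unsmoothed idf formula are illustrative assumptions (library implementations vary in their smoothing and normalisation choices).

```python
import math

# A toy corpus: the model is built from the whole collection,
# not from any single document.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [doc.split() for doc in corpus]

def tf_idf(term: str, doc_index: int) -> float:
    """Term frequency x inverse document frequency for one term in one document."""
    tokens = tokenized[doc_index]
    tf = tokens.count(term) / len(tokens)
    df = sum(term in doc for doc in tokenized)  # documents containing the term
    idf = math.log(len(tokenized) / df) if df else 0.0
    return tf * idf

# 'cat' is distinctive to document 0, so it scores higher there
# than the common word 'the', which appears in two of the three documents.
print(round(tf_idf("cat", 0), 3))   # → 0.183
print(round(tf_idf("the", 0), 3))   # → 0.135
```

Note how the score for any term depends on document frequencies across the whole corpus; this is what makes the approach model centric rather than document centric.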
Concept Association Centric
This scenario occurs where our interest lies in associations between defined concepts and other things. It differs from document-centric methods in that the whole corpus is used, as it is the concepts and their associations which are important, not the documents. It differs from model-centric approaches in that it is not purely statistical and generally cannot be unsupervised, as concepts need to be defined in some way. Concepts are modelled using either rules from a knowledge representation (taxonomy, ontology) and/or labelled examples and machine learning. Critically, it is not just the existence of these concepts or entities in text which is important, but their existence in association with other defined concepts and entities of interest. The output is typically a graph of nodes and edges (concepts and associations) which can be used to address a specific business question, or to support conversational assistants and search tools, among other things.
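The pattern above can be sketched as follows. The concept dictionary and the toy documents are my own illustrative assumptions; the key point is that concepts are defined up front (here by surface form and type), and an edge is recorded only when two defined concepts co-occur.

```python
from itertools import combinations
from collections import Counter

# Concepts defined up front, here as a toy dictionary mapping a surface
# form to a concept type (a stand-in for a taxonomy or ontology).
CONCEPTS = {
    "aspirin": "Drug",
    "headache": "Condition",
    "ibuprofen": "Drug",
    "fever": "Condition",
}

documents = [
    "aspirin is often taken for a headache",
    "ibuprofen reduces fever and headache",
    "fever can accompany many conditions",
]

# Build the graph: nodes are concepts, edges are weighted by how many
# documents the two concepts co-occur in.
edges = Counter()
for doc in documents:
    found = sorted({w for w in doc.split() if w in CONCEPTS})
    for a, b in combinations(found, 2):
        edges[(a, b)] += 1

for (a, b), weight in sorted(edges.items()):
    print(f"{a} ({CONCEPTS[a]}) -- {b} ({CONCEPTS[b]}): {weight}")
```

The third document mentions only one concept, so it contributes no edges: existence alone is not enough, it is the association that matters.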
There are obviously variants on these three conceptual models. I often get asked which is best. This is a bit like asking which you prefer: a knife, a fork or a spoon? It depends on what you are doing, and in some cases you definitely need all three!