To understand a text not merely at the surface but in its deeper meaning, it is not enough to recognize individual entities in isolation: the recognized entities must be embedded into their contexts. Word embeddings such as Word2vec and co-occurrence analyses are statistical methods of text corpus analysis that are commonly used to compute the contexts of terms and phrases automatically.
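As a brief illustration of this purely statistical approach (not part of our method), the sketch below trains a small Word2vec model with gensim on an invented toy corpus; terms that occur in similar contexts end up close together in vector space. The corpus, terms, and parameter settings are made up for the example.

```python
# Illustrative only: deriving term contexts from corpus statistics with
# gensim's Word2Vec. The toy corpus is invented for this example.
from gensim.models import Word2Vec

corpus = [
    ["aspirin", "reduces", "fever", "and", "pain"],
    ["ibuprofen", "reduces", "pain", "and", "inflammation"],
    ["paracetamol", "reduces", "fever"],
]

# Train a small skip-gram model; vector_size and window are toy settings.
model = Word2Vec(corpus, vector_size=16, window=2, min_count=1, sg=1, epochs=50)

# Terms that share contexts in the corpus end up with similar vectors.
print(model.wv.most_similar("aspirin", topn=2))
```

Crucially, such a model can only reflect regularities that are present in the corpus itself, which leads to the limitation discussed next.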
However, these methods quickly reach their limits when text classification requires more complex background knowledge that does not occur explicitly in the text corpus and therefore cannot be derived from it. Typical examples are medical texts, legal contracts, and technical documentation, which often have to be classified on the basis of if-then rules whose conditions are themselves multidimensional. In a rule-based system that is to perform this task automatically, this complexity quickly forces countless combinations of input parameters to be stored and managed, which regularly causes problems and errors during ongoing maintenance and any extension of the system and its rules.
Our method of semantic text analysis transforms all input data, including unstructured texts, into semantic knowledge graphs based on RDF. Using entity-linking techniques built on NLP and machine-learning methods, any text expressed as an RDF graph can be embedded into a larger context, a domain-specific knowledge graph. With the Shapes Constraint Language (SHACL), a World Wide Web Consortium (W3C) specification for validating graph-based data against a set of conditions, those texts can then be identified automatically that match an information need originally formulated in natural language.
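The following is a minimal sketch of the SHACL validation step, assuming rdflib and pySHACL; the ex: vocabulary (ex:Document, ex:mentions) and the shape itself are purely illustrative placeholders, not our actual domain model. A text that an entity-linking step has already expressed as RDF triples is checked against a shape that encodes the conditions of an information need.

```python
# Minimal sketch: checking a text (already expressed as RDF) against a SHACL
# shape that encodes an information need. The ex: vocabulary is illustrative.
from rdflib import Graph
from pyshacl import validate

# Triples an entity-linking step might have produced for one document.
data = Graph().parse(data="""
    @prefix ex: <http://example.org/> .

    ex:doc1 a ex:Document ;
        ex:mentions ex:Ibuprofen, ex:PepticUlcer .
""", format="turtle")

# The information need as declarative conditions: "documents that mention
# at least one entity" (a real shape would constrain classes and values).
shapes = Graph().parse(data="""
    @prefix sh: <http://www.w3.org/ns/shacl#> .
    @prefix ex: <http://example.org/> .

    ex:DocumentShape a sh:NodeShape ;
        sh:targetClass ex:Document ;
        sh:property [
            sh:path ex:mentions ;
            sh:minCount 1 ;
        ] .
""", format="turtle")

conforms, _report_graph, report_text = validate(data, shacl_graph=shapes)
print(conforms)  # True: ex:doc1 satisfies the conditions of the shape
```

Because the conditions live in declarative SHACL shapes rather than in program code, adding or changing a rule means editing data, not software.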
Essential advantages of this approach are:
- The conditions of an information need are maintained declaratively as SHACL shapes rather than as countless combinations of input parameters in program code, which simplifies maintenance and extension of the rules.
- Background knowledge that does not occur explicitly in the text corpus can be contributed by the domain-specific knowledge graph.
- The information need itself can initially be formulated in natural language.
We will discuss this approach based on several use cases, including the MATRIO Data Cleanup method.