Document Understanding redefines the interface between users and the knowledge they seek from unstructured content.
At a basic level, a search engine looks at the terms in the query, considers a set of documents and returns those that are most relevant. For the user, the most relevant document is the one that contains the information they seek. But for the search software, the query is all it knows about the user’s intent, and the documents are essentially bags of words. The most relevant document is simply the one that scores highest. The question for the engine is which bag has the best chance of containing the answer. Closing the gap between those two views of relevance is the essence of Document Understanding.
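To make the "bag of words" idea concrete, here is a minimal sketch of how a score can be computed from word counts alone. It uses a simple TF-IDF weighting chosen for illustration, not the formula of any particular engine, and the function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents):
    """Rank documents against a query with a simple TF-IDF bag-of-words score.

    Returns a list of (score, document_index) pairs, best match first.
    """
    bags = [Counter(tokenize(doc)) for doc in documents]  # word order is discarded
    n_docs = len(bags)

    def idf(term):
        # Terms that appear in fewer documents carry more weight.
        containing = sum(1 for bag in bags if term in bag)
        return math.log((1 + n_docs) / (1 + containing)) + 1

    scores = []
    for index, bag in enumerate(bags):
        score = sum(bag[term] * idf(term) for term in tokenize(query))
        scores.append((score, index))
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical sample documents, for illustration only.
documents = [
    "Invoice for consulting services rendered in March",
    "Employment contract with a termination clause and notice period",
    "Meeting notes about the office move",
]
ranking = rank("contract termination notice", documents)
best_score, best_index = ranking[0]
print(documents[best_index])
```

The engine never learns that the second document is a contract or what its clauses mean. It only sees that the query terms occur in that bag more often than in the others.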
In the past, Enterprise Search vendors argued about which of them had the best relevance algorithm, but today most commercial and open-source search engines take a very similar approach to relevance.
In fact, many widely used engines, including Elasticsearch, OpenSearch and Apache Solr, rely on the open-source Apache Lucene library as the basis for relevance ranking. The bag of words has become more sophisticated, in that the engines consider what order the words are in, which words are near each other, whether they have synonyms, and so on. But they still don’t know much about what lives inside these documents and why. To truly improve relevancy, we need to go one big step further.
Document Understanding helps your search applications "know" what the documents are actually about.
We need to get inside each document and think about what its author is trying to convey. The good news is that by coupling modern AI tools (Natural Language Processing, Machine Learning, Knowledge Graphs and cloud-based services) to a search engine, this can be achieved at a fraction of the cost and time that it would have taken a few years ago.
By applying sophisticated NLP and Machine Learning as the documents are processed and indexed, we can teach the computer to get inside the content and extract insights. Entity extraction is one example of an NLP technique that has gotten more sophisticated through the application of machine learning. Cloud-based tools such as Google Document AI and Microsoft Azure Form Recognizer make these sophisticated technologies more accessible.
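As a rough illustration of enrichment at index time, the sketch below extracts a few entity types from a document and attaches them as structured fields alongside the raw text. Hand-written regular expressions stand in here for a trained NLP model or a cloud service, and the patterns, field names and sample text are all assumptions made for the example.

```python
import re

# Hypothetical, hand-written patterns standing in for a trained NER model.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "amount": re.compile(r"[$€£]\s?\d[\d,]*(?:\.\d+)?"),
}


def extract_entities(text):
    """Return a dict mapping each entity type to the distinct matches found, in order."""
    entities = {}
    for label, pattern in PATTERNS.items():
        matches = list(dict.fromkeys(pattern.findall(text)))  # de-duplicate, keep order
        if matches:
            entities[label] = matches
    return entities


def enrich(doc_id, text):
    """Build the record that would be sent to the index: raw text plus entity fields."""
    return {"id": doc_id, "text": text, "entities": extract_entities(text)}


record = enrich(
    "doc-1",
    "Payment of $1,250.00 is due on 2024-03-31. Queries: accounts@example.com",
)
print(record["entities"])
```

With fields like these in the index, a search application can filter or boost on what a document contains, such as a due date or a contact address, instead of relying on word matches alone.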
These and other tools help establish user intent so that we can get a more precise view of what each person is seeking. As a result, we can create applications that understand and extract knowledge from content, generating greater user productivity and value from documents and unstructured content.
Contact us to learn more about how to leverage Document Understanding tools and techniques in your search applications to deliver better search results and to help users get the greatest insights from the information contained in your website or document repositories.
Below are some of the best examples that we have been involved in:
- Market Intelligence – harvesting and analysing customer and competitor data
- Matching job descriptions to CVs within the recruitment/staffing industry
- Risk Analysis – analysing legal documents to identify areas of risk, for example those arising from changes in legislation
- Identifying personal/private information
- Storage Analytics – scanning and categorising internal data with a view to reducing storage costs or assisting with cloud migrations
- Internal Threat Detection – analysing communication and event data within and across an organisation