Realizing LLM-based visual machine reading comprehension technology
Towards "tsuzumi" that can read and understand visual documents
NTT Corporation has made significant progress in the field of artificial intelligence (AI) with its LLM-based visual machine reading comprehension technology. The technology aims to enable AI systems to answer a wide range of questions based on document images, which is crucial for digital transformation (DX).
Real-world documents often contain both text and visual elements (such as icons, diagrams, etc.). However, existing AI models, including large language models (LLMs), primarily focus on understanding text information.
To address this limitation, NTT proposed Visual Machine Reading Comprehension Technology. The goal was to create an AI system that can read and understand visual documents and the information they contain, much as humans do.
Comparison of Text-based and Visual Machine Reading Comprehension.
NTT aimed to develop a visual machine reading comprehension model with high instruction-following ability, akin to LLMs.
NTT successfully developed a new visual machine reading comprehension technology that leverages the reasoning ability of LLMs.
The model visually understands documents by analyzing both text and visual information. It can answer complex questions involving diagrams, such as understanding pie charts or other visual representations.
The research results were presented at the 38th Annual AAAI Conference on Artificial Intelligence and received the Outstanding Paper Award at the 30th Annual Conference of the Association for Natural Language Processing.
Notably, this paper is the first to propose a specific methodology for LLM-based visual document understanding.
Tsuzumi
NTT's large language model, called 'Tsuzumi', plays a central role in this technology. Tsuzumi is designed to address the energy consumption challenges associated with large-scale LLMs. It aims to reduce training and inference costs while maintaining high performance.
The name "Tsuzumi" symbolizes the start of a Gagaku (ancient Japanese court music and dance) ensemble, emphasizing its role in driving industrial development.
Technology
NTT's visual machine reading comprehension technology understands documents visually by drawing on the strong reasoning ability of LLMs (figure below). To achieve this, NTT researchers (1) developed a new adapter technology that converts document images into representations an LLM can process, and (2) constructed the first large-scale visual instruction tuning datasets for diverse visual document understanding tasks. Together, these enable LLMs to understand the content of documents by combining vision and language information, and to perform arbitrary tasks without additional training.
Overview of LLM-based Visual Machine Reading Comprehension Technology.
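The adapter idea described above can be illustrated with a minimal sketch. The article does not describe NTT's actual adapter design, so everything here is an assumption made for illustration: the adapter is modeled as a single linear projection, the weights are random stand-ins for learned parameters, and the names (make_projection, adapt, build_llm_input) and dimensions are hypothetical. The sketch only shows the general pattern of mapping visual feature vectors into the LLM's embedding size and placing them in front of the embedded instruction tokens.

```python
import random


def make_projection(in_dim, out_dim, seed=0):
    """Build an in_dim x out_dim weight matrix.

    Random values stand in for learned adapter weights (assumption).
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)]
            for _ in range(in_dim)]


def adapt(visual_features, weights):
    """Project each visual feature vector into the LLM embedding dimension."""
    out_dim = len(weights[0])
    projected = []
    for feat in visual_features:
        if len(feat) != len(weights):
            raise ValueError("feature size does not match adapter input size")
        projected.append([
            sum(value * weights[i][j] for i, value in enumerate(feat))
            for j in range(out_dim)
        ])
    return projected


def build_llm_input(visual_features, instruction_embeddings, weights):
    """Prepend the adapted visual vectors to the embedded instruction tokens."""
    return adapt(visual_features, weights) + instruction_embeddings


# Hypothetical sizes: 3 image patches of dimension 4, LLM embedding size 6,
# and an instruction that is already embedded as 2 token vectors.
VISION_DIM, LLM_DIM = 4, 6
weights = make_projection(VISION_DIM, LLM_DIM)
patches = [
    [1.0, 0.0, 0.5, 0.2],
    [0.3, 0.3, 0.3, 0.3],
    [0.0, 1.0, 0.0, 1.0],
]
instruction = [[0.0] * LLM_DIM, [1.0] * LLM_DIM]

sequence = build_llm_input(patches, instruction, weights)
print(len(sequence), len(sequence[0]))  # prints: 5 6
```

In this sketch the LLM would receive one sequence of five vectors, three derived from the image and two from the instruction, all of the same size. A real system would learn the projection during visual instruction tuning rather than sampling it at random.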
In conclusion, NTT's breakthrough in LLM-based visual machine reading comprehension technology brings us closer to AI systems capable of understanding and answering questions based on visual documents, a critical step in the digital transformation journey.
This result is the outcome of joint research conducted in FY2023 with Professor Jun Suzuki of the Center for Data-driven Science and Artificial Intelligence, Tohoku University.
This technology will contribute to the development of important industrial services such as web search and question answering based on real-world visual documents. NTT aims to establish the technology needed to realize AI that creates new value by collaborating with humans, including through work automation.