Extracting key information from visually rich documents (VRDs) such as receipts, tickets, and licenses is a complex task. Traditional optical character recognition (OCR) techniques often fall short because such documents come in diverse layouts and formats. To tackle this challenge, we present GraphRevisedIE, a lightweight yet powerful approach that combines textual, visual, and layout features with graph-based methods for key information extraction.
Motivation
The core challenge in VRDs is dealing with varied document layouts that can create semantic ambiguity. For instance, two identical text segments such as "03" might represent a train number or a month, depending on their position and visual context in the ticket. GraphRevisedIE is designed to handle these challenges by embedding multimodal features into a graph structure, allowing it to capture global context and resolve such ambiguities effectively.
Model Architecture
GraphRevisedIE consists of three components: Multimodal Feature Embedding, Graph Module, and Decoding.
Multimodal Feature Embedding
This component extracts textual, visual, and layout features from the image.
- Textual Features: Recognized text segments are embedded using character-level one-hot encoding.
- Visual Features: Convolutional neural networks (CNNs) extract visual attributes such as font, color, and size.
- Layout Features: Relative positional embeddings capture spatial relationships between text segments, ensuring robustness against image distortions such as rotation.
The architectural diagram above shows the multimodal embedding process, with separate paths for textual, visual, and layout embeddings. These embeddings are summed element-wise and then encoded with a Transformer encoder.
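To make the fusion step concrete, here is a minimal PyTorch sketch of how the three embedding paths could be combined. The module names, dimensions, and the simple bounding-box projection used for the layout path are illustrative assumptions (the paper uses relative positional embeddings), not the exact implementation.

```python
import torch
import torch.nn as nn

class MultimodalEmbedding(nn.Module):
    """Sketch: fuse textual, visual, and layout features by element-wise
    addition, then encode with a Transformer encoder. Dimensions and
    sub-modules are assumptions, not the paper's exact setup."""

    def __init__(self, vocab_size=128, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # Textual path: character-level embedding (stand-in for one-hot + projection).
        self.char_embed = nn.Embedding(vocab_size, d_model)
        # Visual path: project pre-extracted CNN features into the shared space.
        self.visual_proj = nn.Linear(512, d_model)
        # Layout path: project normalized bounding boxes (x0, y0, x1, y1);
        # a simplification of the paper's relative positional embedding.
        self.layout_proj = nn.Linear(4, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

    def forward(self, char_ids, visual_feats, boxes):
        # char_ids: (batch, seq_len); visual_feats: (batch, seq_len, 512); boxes: (batch, seq_len, 4)
        fused = self.char_embed(char_ids) + self.visual_proj(visual_feats) + self.layout_proj(boxes)
        return self.encoder(fused)  # (batch, seq_len, d_model)
```

Element-wise addition keeps the fused representation the same size as each individual embedding, so the Transformer encoder operates on a single sequence of features rather than a concatenation of three.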
Graph Module
The graph module is crucial in capturing global context among document segments.
- Graph Construction: Text segments are represented as nodes, with edges denoting their relationships. The initial graph is then refined using a graph revision technique.
- Graph Convolution: An attention-based convolution propagates context across segments, enriching their embeddings with global information.
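A minimal PyTorch sketch of how graph revision and attention-based convolution might look is shown below; the revision rule (adding learned similarity scores to the initial adjacency before normalization) and all dimensions are assumptions made for exposition, not the paper's exact equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphReviseLayer(nn.Module):
    """Sketch of a graph-revision + attention-based convolution layer.
    The revision formula and hyper-parameters are illustrative assumptions."""

    def __init__(self, d_model=256):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, nodes, adj_init):
        # nodes: (batch, num_segments, d_model) segment embeddings
        # adj_init: (batch, num_segments, num_segments) initial graph (e.g. fully connected)
        q, k, v = self.query(nodes), self.key(nodes), self.value(nodes)
        # Revise the graph: similarity scores learned from node features are
        # combined with the initial adjacency, so edge weights adapt during training.
        scores = torch.matmul(q, k.transpose(-2, -1)) / (nodes.size(-1) ** 0.5)
        adj_revised = F.softmax(scores + adj_init, dim=-1)
        # Attention-based convolution: aggregate neighbour information along revised edges.
        return torch.matmul(adj_revised, v), adj_revised
```

Because the revised adjacency is computed from the node features themselves, edge weights change as training progresses, which is what distinguishes this from a fixed, pre-defined graph.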
This approach allows GraphRevisedIE to dynamically learn relationships between text segments, unlike previous methods that rely on static graph structures. The following image shows how segment-level visual embeddings are generated in the graph module.
Decoding
GraphRevisedIE uses a BiLSTM-CRF model to decode the enriched embeddings, predicting an entity tag for each character. The tagged segments are then pieced back together, restoring the extracted entities in the original document structure.
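Below is an illustrative PyTorch sketch of this decoding stage: a BiLSTM produces per-character emission scores and a learned transition matrix is used for Viterbi decoding. The tag set size, the dimensions, and the omission of the CRF training loss are simplifications, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BiLSTMCRFDecoder(nn.Module):
    """Sketch of a BiLSTM-CRF tagger: BiLSTM emission scores plus a CRF
    transition matrix, decoded with the Viterbi algorithm."""

    def __init__(self, d_model=256, num_tags=9):
        super().__init__()
        self.lstm = nn.LSTM(d_model, d_model // 2, bidirectional=True, batch_first=True)
        self.emit = nn.Linear(d_model, num_tags)
        self.transitions = nn.Parameter(torch.randn(num_tags, num_tags))  # score of tag i -> tag j

    def viterbi_decode(self, emissions):
        # emissions: (seq_len, num_tags) for a single sequence
        score = emissions[0]            # best score ending in each tag at step 0
        backpointers = []
        for t in range(1, emissions.size(0)):
            # total[i, j] = best path ending in tag i, transitioning to j, emitting j at step t
            total = score.unsqueeze(1) + self.transitions + emissions[t].unsqueeze(0)
            score, best_prev = total.max(dim=0)
            backpointers.append(best_prev)
        best_tag = int(score.argmax())
        path = [best_tag]
        for ptr in reversed(backpointers):
            best_tag = int(ptr[best_tag])
            path.append(best_tag)
        return list(reversed(path))

    def forward(self, char_embeddings):
        # char_embeddings: (batch, seq_len, d_model) graph-enriched character features
        hidden, _ = self.lstm(char_embeddings)
        emissions = self.emit(hidden)   # (batch, seq_len, num_tags)
        return [self.viterbi_decode(e) for e in emissions]
```

In the full model, the CRF layer would be trained with the sequence-level negative log-likelihood (omitted here), so that transitions between tags are learned jointly with the emission scores.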
Performance Evaluation
GraphRevisedIE has been evaluated across various datasets, including SROIE and CORD (receipts), FUNSD (forms), train tickets, and business licenses. It consistently shows strong generalization, particularly on documents with complex layouts.
| Dataset | Precision (%) | Recall (%) | F1 Score (%) |
|---|---|---|---|
| SROIE | 96.80 | 96.04 | 96.42 |
| CORD | 93.91 | 94.61 | 94.26 |
| Train Ticket | 99.07 | 98.76 | 98.91 |
| Business License | 99.37 | 99.37 | 99.37 |
Strengths of GraphRevisedIE
- Lightweight Design: GraphRevisedIE can be trained effectively in a low-resource setting within a few hours.
- Dynamic Graph Learning: Unlike methods that rely on static graphs, it refines segment relationships during training, adapting to varied layouts.
- Rich Multimodal Fusion: It combines text, visual, and spatial layout features, making it versatile across different document types.
Conclusion
By combining textual, visual, and layout information with graph-based methods, GraphRevisedIE effectively handles information extraction from visually rich documents. The model's open-source code and datasets are available on GitHub, and more details can be found in the paper published in the Pattern Recognition journal (2023).