Introduction
With the explosion of digital documents in business, finance, and beyond, extracting key information from scanned files has become a vital task. Traditional Key Information Extraction (KIE) models often depend on labor-intensive, token-level annotation and are sensitive to OCR errors, making them prone to inaccuracies. To address these issues, we developed GenKIE, a generative, multimodal KIE model that is lightweight and can be trained in low-resource settings. GenKIE's generative approach automatically corrects OCR errors while minimizing the need for complex annotations, and its versatile prompting mechanism lets it handle diverse document types, streamlining information extraction with robustness and flexibility.
Motivation
Most KIE solutions today take a discriminative approach, classifying each token into entity tags based on text and layout features, but they face two major issues:
- OCR Error Sensitivity: Errors in OCR recognition distort extracted entities like names and addresses, making the extracted data unreliable.
- Annotation Complexity: Token-level labeling in traditional models requires time-intensive, manual effort, particularly when working with complex layouts.
GenKIE, however, is built around a generative sequence-to-sequence architecture that integrates textual, visual, and layout data into a unified representation. By using prompt-based entity generation, GenKIE not only reduces annotation needs but also enables real-time error correction during the generation process.
Key Features of GenKIE
- Multimodal Generative Architecture: At its core, GenKIE uses an encoder-decoder structure based on the OFA (One-For-All) model, blending text, layout, and visual features into a single framework. This setup allows GenKIE to generalize across varied document layouts, capturing key entities with high accuracy.
- Prompt-based Information Generation: Instead of tagging every element, GenKIE utilizes prompts (e.g., "Company is?" or "Address is?") to generate the necessary information. This prompt system reduces the need for token-level tagging, saving significant annotation time and adapting to different document types and layouts with ease.
- OCR Error Resilience: GenKIE’s architecture enables it to correct OCR errors dynamically, ensuring that extracted data remains accurate even when OCR misreads parts of the text.
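The prompting workflow above can be sketched end to end: the OCR'd text and one prompt per entity type are concatenated into the encoder input, and the decoder's free-text output is parsed back into structured entities. The prompt strings and output format below are illustrative assumptions, not the exact templates from the paper.

```python
def build_input(ocr_text: str, entity_types: list[str]) -> str:
    """Concatenate the OCR'd document text with one question prompt
    per entity type to form a single encoder input.
    (Prompt wording is an assumed example, e.g. 'Company is?')"""
    prompts = " ".join(f"{etype} is?" for etype in entity_types)
    return f"{ocr_text} {prompts}"

def parse_output(generated: str, entity_types: list[str]) -> dict[str, str]:
    """Recover entity values from a decoder output assumed to look like
    'Company is ACME Corp. Address is 1 Main St'."""
    result = {}
    for etype in entity_types:
        marker = f"{etype} is "
        start = generated.find(marker)
        if start == -1:
            continue
        start += len(marker)
        # The value runs until the next entity marker (or end of string).
        ends = [generated.find(f"{e} is ", start) for e in entity_types]
        ends = [e for e in ends if e != -1]
        end = min(ends) if ends else len(generated)
        result[etype] = generated[start:end].strip().rstrip(".")
    return result
```

Because the decoder generates the value rather than copying OCR tokens verbatim, a misread character in the source text can simply be emitted correctly, which is where the OCR error correction comes from.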
GenKIE’s capabilities were evaluated against several top-performing KIE models across three datasets: SROIE, CORD, and FUNSD.
| Model | Modality | SROIE F1 Score | CORD F1 Score | FUNSD F1 Score |
|---|---|---|---|---|
| LayoutLMv3 | Text+Layout+Visual | 95.30 | 96.56 | 90.29 |
| DocFormer | Text+Layout+Visual | - | 96.33 | 83.34 |
| GenKIE | Text+Layout+Visual | 97.40 | 95.75 | 83.45 |
As the table above shows, GenKIE is competitive with state-of-the-art discriminative models, and posts the highest SROIE F1 score among them.
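The F1 scores reported here are entity-level: a prediction counts only when both the entity type and its value match the gold annotation. A minimal sketch of that metric, assuming exact string matching (the usual convention on these benchmarks):

```python
def entity_f1(pred: dict[str, str], gold: dict[str, str]) -> tuple[float, float, float]:
    """Entity-level precision/recall/F1 with exact matching: a predicted
    entity is correct only if its type and full value string both match
    the gold annotation."""
    correct = sum(1 for k, v in pred.items() if gold.get(k) == v)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```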
Ablation Study
To understand which aspects of GenKIE most contribute to its effectiveness, we conducted a detailed ablation study, testing various model configurations with and without specific features.
1. Multimodality Effectiveness
GenKIE’s performance was analyzed using different feature combinations.
- Text Only: The model’s F1 score on SROIE was lowest when only text features were used.
- Text + Layout: Adding layout features significantly improved accuracy, highlighting the importance of spatial information.
- Text + Visual + Layout: Combining all three modalities yielded the best results, with an F1 score of 97.40, emphasizing the value of visual context in addition to text and layout.
| Modality | Precision | Recall | F1 Score |
|---|---|---|---|
| Text Only | 96.24 | 96.24 | 96.24 |
| Text + Visual | 97.20 | 96.82 | 97.01 |
| Text + Layout | 96.89 | 97.39 | 97.14 |
| Text + Layout + Visual | 97.40 | 97.40 | 97.40 |
This ablation, shown in the above table, demonstrates the model’s high performance when incorporating all modalities, validating GenKIE’s design choice to leverage a multimodal approach.
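At the input level, multimodal fusion of this kind is often implemented by embedding each modality per token and combining the vectors. The sketch below uses an element-wise sum and the 0-1000 bounding-box normalization popularized by the LayoutLM family; both are assumptions about the mechanics, not GenKIE's exact fusion scheme.

```python
def normalize_bbox(bbox: tuple[int, int, int, int],
                   page_w: int, page_h: int) -> tuple[int, int, int, int]:
    """Scale pixel coordinates (x0, y0, x1, y1) into the 0-1000 range
    commonly used by layout-aware document models."""
    x0, y0, x1, y1 = bbox
    return (int(1000 * x0 / page_w), int(1000 * y0 / page_h),
            int(1000 * x1 / page_w), int(1000 * y1 / page_h))

def fuse_modalities(text_emb: list[float],
                    layout_emb: list[float],
                    visual_emb: list[float]) -> list[float]:
    """Element-wise sum of same-dimensional per-token features from the
    three modalities (an assumed fusion; learned projections would map
    each modality to a shared dimension first)."""
    assert len(text_emb) == len(layout_emb) == len(visual_emb)
    return [t + l + v for t, l, v in zip(text_emb, layout_emb, visual_emb)]
```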
2. Prompt Effectiveness
We also explored different prompt types to identify the most effective format.
- Template Prompts (e.g., "Company is [value]") provided the most accurate results for entity extraction tasks, particularly in cases where all entity types were included in a single prompt.
- Question Prompts (e.g., "What is the address?") proved more effective in dynamic scenarios requiring entity generation, like document labeling tasks.
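Both prompt styles can be generated programmatically; the exact strings below are illustrative, following the examples in the bullets above.

```python
def template_prompt(entity_types: list[str]) -> str:
    """One template prompt covering all entity types at once,
    e.g. 'Company is [value]. Address is [value].'"""
    return " ".join(f"{e} is [value]." for e in entity_types)

def question_prompt(entity_type: str) -> str:
    """A single question prompt for on-demand entity generation."""
    return f"What is the {entity_type.lower()}?"
```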
3. Prefix Beam Search Efficiency
Adding prefix constraints to beam search during decoding significantly improved both efficiency and accuracy, especially on the FUNSD dataset. By restricting the search space to sequences that follow the prompt format, the model reaches accurate outputs faster.
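One common way to implement such a constraint is an allowed-token callback in the style of HuggingFace's `prefix_allowed_tokens_fn`: while the decoder is still emitting the known prompt prefix (e.g. the tokens of "Company is"), only the next prefix token is permitted, after which the full vocabulary opens up. The standalone sketch below illustrates the idea; it is not GenKIE's actual decoding code.

```python
def make_prefix_fn(prefix_ids: list[int], vocab_size: int):
    """Build an allowed-tokens callback for prefix-constrained decoding.
    While the decoding step is inside the forced prefix, only the next
    prefix token is allowed; afterwards every vocabulary id is open."""
    def allowed(step: int, generated_ids: list[int]) -> list[int]:
        if step < len(prefix_ids):
            return [prefix_ids[step]]          # force the prompt prefix
        return list(range(vocab_size))          # free generation after it
    return allowed
```

Constraining the beam this way prunes hypotheses that could never match the expected prompt format, which is where the speed and accuracy gains come from.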
Robustness to OCR Noise
GenKIE’s resilience to OCR noise was tested under varying levels of artificial OCR errors. Even with error rates as high as 50%, GenKIE maintained a strong F1 score, while models like LayoutLMv2 dropped sharply under the same conditions. More details about this experiment can be found in the original paper, published at EMNLP 2023.
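Noise experiments like this typically inject synthetic character-level errors into the OCR text at a chosen rate. The corruption scheme below (random character substitution) is an illustrative stand-in; the paper's exact perturbation may differ.

```python
import random

def corrupt_ocr(text: str, error_rate: float, seed: int = 0) -> str:
    """Simulate OCR noise by replacing a fraction of alphanumeric
    characters with a random different lowercase letter or digit.
    (An assumed corruption scheme for illustration.)"""
    rng = random.Random(seed)
    chars = list(text)
    candidates = [i for i, c in enumerate(chars) if c.isalnum()]
    n_errors = int(len(candidates) * error_rate)
    alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
    for i in rng.sample(candidates, n_errors):
        chars[i] = rng.choice([c for c in alphabet if c != chars[i].lower()])
    return "".join(chars)
```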
Conclusion
GenKIE is setting a new standard for document information extraction by blending generative multimodal learning with innovative prompt-based design. Its resilience to OCR errors and efficient prompting framework make it a powerful tool for real-world applications across diverse document types. With GenKIE, organizations can achieve accurate, high-speed document processing, minimizing the manual effort traditionally needed for key information extraction.