Introduction
This project designs a newspaper segmentation algorithm, which utilizes a bottom-up technique to classify and merge distinct elements like text blocks, titles, images, and lines within newspaper images. The results indicate that the method performs well on diverse layouts, overcoming noise-related challenges.
The process of converting printed newspapers into digital formats involves identifying and categorizing regions accurately. However, this task is complex due to variations in newspaper layouts, irregular text shapes, and occasional connected noise artifacts, like human annotations or smudges. This study builds on a segmentation algorithm that systematically identifies and merges components of the layout, ultimately achieving a clear classification of elements within newspaper images.
Methodology
This segmentation algorithm consists of multiple steps, each designed to refine the classification of components within a newspaper image. Below, we outline the primary stages.
1. Locating Rectangles
Using a bottom-up approach, the algorithm first identifies rects, the smallest identifiable components, which are clusters of loosely connected black pixels. A rect is defined as a 3x3 region containing at least one black pixel. Working at the rect level saves processing time, because later stages do not need to scan every pixel in the image.
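As a rough illustration of this step, the sketch below marks every 3x3 cell that contains at least one black pixel. It assumes a binarized image stored as a list of rows of 0/1 values (1 = black) and rects aligned to a fixed grid. The function name locate_rects and the cell parameter are hypothetical and not taken from the project code.

```python
def locate_rects(image, cell=3):
    """Return the set of (cell_row, cell_col) grid cells holding a black pixel.

    Assumption: `image` is a list of rows, each a list of 0/1 values (1 = black),
    and rects are aligned to a fixed `cell` x `cell` grid.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    rects = set()
    for top in range(0, height, cell):
        for left in range(0, width, cell):
            rows = image[top:top + cell]
            # any() stops at the first black pixel found in this cell
            if any(px for row in rows for px in row[left:left + cell]):
                rects.add((top // cell, left // cell))
    return rects


# A 6x6 image with two black pixels yields two rects.
image = [[0] * 6 for _ in range(6)]
image[0][0] = 1
image[4][5] = 1
rects = locate_rects(image)  # {(0, 0), (1, 1)}
```

Later stages can then operate on this much smaller set of grid cells instead of on raw pixels.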
2. Formation and Classification of Patterns
Once the rects are detected, they are grouped into patterns based on their proximity. A pattern consists of multiple adjacent rects, and each pattern is classified according to predefined rules based on size, shape, and pixel density. The categories include:
- Text: Defined by pixel density and area.
- Title: Characterized by larger font height and distinct spacing.
- Graphic or Drawing: Identified by irregular shapes and pixel density.
- Lines: Horizontal or vertical patterns based on alignment.
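The grouping and rule-based classification described above might look roughly like the following sketch. It assumes rects are grid cells given as (row, col) pairs and treats "adjacent" as 8-connectivity. The function names, the rule order, and every threshold (line_ratio, title_height, graphic_density) are hypothetical placeholders, not the project's actual values, and the graphic rule here uses density only, ignoring shape irregularity.

```python
from collections import deque


def group_rects(rects):
    """Group grid cells into patterns (connected components, 8-adjacency)."""
    remaining = set(rects)
    patterns = []
    while remaining:
        seed = remaining.pop()
        pattern = {seed}
        queue = deque([seed])
        while queue:
            r, c = queue.popleft()
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    neighbor = (r + dr, c + dc)
                    if neighbor in remaining:
                        remaining.remove(neighbor)
                        pattern.add(neighbor)
                        queue.append(neighbor)
        patterns.append(pattern)
    return patterns


def classify(pattern, black_pixels, cell=3,
             line_ratio=10, title_height=30, graphic_density=0.5):
    """Label a pattern using illustrative size, shape, and density rules.

    All thresholds are assumed values chosen for this example only.
    """
    rows = [r for r, _ in pattern]
    cols = [c for _, c in pattern]
    height = (max(rows) - min(rows) + 1) * cell  # bounding-box height in pixels
    width = (max(cols) - min(cols) + 1) * cell   # bounding-box width in pixels
    density = black_pixels / (width * height)
    if width >= line_ratio * height:
        return "horizontal line"
    if height >= line_ratio * width:
        return "vertical line"
    if density >= graphic_density:
        return "graphic"
    if height >= title_height:
        return "title"
    return "text"
```

In practice the thresholds would need tuning per newspaper design.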
3. Line Extraction and Noise Removal
The algorithm addresses noise by generating virtual lines to separate components. This is done by defining box patterns that are isolated by their edges and then identifying individual line segments. Patterns within boxes are processed without noise, improving the accuracy of the overall segmentation process.
4. Region Formation
Once the patterns are classified, the algorithm groups them into regions based on spatial proximity. These regions form larger components such as paragraphs or images. Two key distance parameters, horizontal_gap and vertical_gap, help merge patterns of the same type, ensuring that they represent contiguous content within the layout. Word vectors inside patterns are summarized and classified by a neural network to improve the accuracy of the spatial classifier.
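The merging rule can be sketched as follows. This is a minimal illustration, assuming each pattern is reduced to a label plus a bounding box (left, top, right, bottom), and that two same-type patterns merge when both their horizontal and vertical gaps are within horizontal_gap and vertical_gap. The helper names and the exact merge condition are assumptions, and the neural-network step is not shown.

```python
def gaps(a, b):
    """Horizontal and vertical gap between two boxes (0 when they overlap on an axis)."""
    horizontal = max(a[0] - b[2], b[0] - a[2], 0)
    vertical = max(a[1] - b[3], b[1] - a[3], 0)
    return horizontal, vertical


def merge_regions(patterns, horizontal_gap, vertical_gap):
    """Merge same-type patterns that lie within the given gaps.

    `patterns` is a list of (label, (left, top, right, bottom)) tuples.
    Merging repeats until no pair qualifies any more.
    """
    regions = [(label, tuple(box)) for label, box in patterns]
    merged = True
    while merged:
        merged = False
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                (label_a, a), (label_b, b) = regions[i], regions[j]
                h, v = gaps(a, b)
                if label_a == label_b and h <= horizontal_gap and v <= vertical_gap:
                    union = (min(a[0], b[0]), min(a[1], b[1]),
                             max(a[2], b[2]), max(a[3], b[3]))
                    regions[i] = (label_a, union)
                    del regions[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan with the updated region list
    return regions


patterns = [
    ("text", (0, 0, 10, 5)),
    ("text", (12, 0, 20, 5)),        # 2 px to the right of the first box
    ("title", (0, 8, 20, 12)),       # close by, but a different type
    ("text", (100, 100, 110, 105)),  # same type, but far away
]
regions = merge_regions(patterns, horizontal_gap=3, vertical_gap=3)
# The first two text boxes merge into ("text", (0, 0, 20, 5)); the rest stay separate.
```

Larger gap values produce fewer, larger regions, so the two parameters trade off over-segmentation against merging unrelated content.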
The algorithm was tested on multiple newspaper images from the ProQuest database and showed promising results. It was implemented in Cython, which reduced the per-image runtime by a factor of 5 to 10 compared with the pure Python version. More experimental results can be found in the poster at the end of this post.
Discussion and Challenges
While the algorithm effectively segments and classifies newspaper components, certain challenges remain. For example, low-resolution images can result in missing black pixels, which affects rect and pattern accuracy. Additionally, parameter tuning, such as adjusting thresholds for titles, graphics, and line widths, is essential to accommodate varying newspaper designs.
Misclassifications can still occur: small images or noise are sometimes classified as text, and title blocks can be split because of large inter-title spacing. Refining these heuristics remains a primary focus for future iterations.
Conclusion
This project presents an approach to newspaper segmentation through a bottom-up algorithm. The method performs well in differentiating and merging text, titles, and graphics, despite some challenges posed by noise and document variability. The algorithm was presented at the company's internal conference with the poster below.