Tile Extractor

Parsing the Unparsable: Building a Layout-Aware Computer Vision Pipeline for 50,000+ Stone SKUs

Executive Summary

The stone and marble industry operates on visual catalogs. Manufacturers publish hundreds of pages of PDF catalogs showing marble slabs, tile patterns, texture variations, and dimension tables. For digital inventory platforms and wholesalers, extracting these products to populate databases is a massive bottleneck.

Standard OCR (Optical Character Recognition) tools fail immediately because these catalogs are highly visual, containing complex grid structures where product images are loosely aligned with text descriptions, dimensions, and SKU codes. Ananta Labs was hired to design a layout-aware computer vision and text parsing pipeline that could ingest multi-page catalogs, segment individual product tiles, extract their corresponding text details, and output clean, database-ready JSON arrays. The target was 95%+ accuracy over a database of 50,000+ unique marble and stone SKUs.

The Architecture: Segmentation-First Parsing

Traditional text extraction tools parse documents top-to-bottom, left-to-right. In a product catalog, this approach merges the text of Slab A with the dimensions of Slab B.

To prevent data mismatch, we implemented a segmentation-first approach. Instead of reading the document as text, we treat each catalog page as an image canvas, locate the individual physical grid cells (tiles), isolate them, and then run OCR within the boundaries of each isolated cell.
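The mismatch described above can be reproduced with a toy example. The sketch below is illustrative only: the word positions, cell rectangles, and function names are hypothetical, and it assigns pre-recognized words to cells, whereas the pipeline described here crops each cell before running OCR.

```python
from collections import defaultdict

# Hypothetical OCR word anchors on one page: (text, x, y) in pixels.
WORDS = [
    ("SLAB-A", 100, 400), ("SLAB-B", 600, 400),
    ("60x120", 100, 430), ("30x60", 600, 430),
]

# Hypothetical tile cells as (x, y, w, h).
CELLS = {"cell_1": (50, 50, 400, 420), "cell_2": (550, 50, 400, 420)}


def reading_order(words):
    """Naive top-to-bottom, left-to-right ordering over the whole page."""
    return [text for text, _, _ in sorted(words, key=lambda w: (w[2], w[1]))]


def group_by_cell(words, cells):
    """Attach each word to the cell whose rectangle contains its anchor point."""
    grouped = defaultdict(list)
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        for name, (cx, cy, cw, ch) in cells.items():
            if cx <= x < cx + cw and cy <= y < cy + ch:
                grouped[name].append(text)
                break
    return dict(grouped)


print(reading_order(WORDS))         # both slabs interleaved line by line
print(group_by_cell(WORDS, CELLS))  # each slab keeps its own dimensions
```

In page reading order the two SKU names end up adjacent and the two sizes follow them, so nothing ties a size to its slab. Grouping by cell keeps each name next to its own size.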

Project Metrics & Impact

Throughput: Processing a standard 100-page catalog (containing roughly 1,200 product variations) took less than 180 seconds.
Accuracy: Out of 50,000+ processed stone tiles, our layout segmentation maintained an extraction accuracy of 96.4%.
Human Verification: Reduced manual data entry time by 94%, shifting the operator's role from manual transcription to reviewing extracted records on a visual validation screen in the admin UI.

Step 1: Document Rasterization and Pre-processing

We use PyMuPDF to rasterize each incoming PDF page into a high-resolution PNG image (300 DPI) so that fine print stays legible. The document is converted page by page, and each page is rendered at a zoom factor above the PDF's native 72 units per inch (about 4.17x for 300 DPI), which gives small characters enough pixels for reliable OCR.
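A minimal sketch of this rasterization step, assuming PyMuPDF is installed and importable as `fitz`. The function names, output file naming, and directory layout are illustrative assumptions, not the project's actual code.

```python
from pathlib import Path


def zoom_for_dpi(dpi: float) -> float:
    """PDF user space is 72 units per inch, so the zoom factor is dpi / 72."""
    return dpi / 72.0


def rasterize_pdf(pdf_path: str, out_dir: str, dpi: int = 300) -> list:
    """Render every page of a PDF to a PNG file and return the file paths."""
    import fitz  # PyMuPDF, imported lazily so zoom_for_dpi works without it

    zoom = zoom_for_dpi(dpi)
    matrix = fitz.Matrix(zoom, zoom)  # uniform scale in x and y
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    paths = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            pixmap = page.get_pixmap(matrix=matrix)
            target = target_dir / f"page_{index + 1:04d}.png"  # hypothetical naming
            pixmap.save(str(target))
            paths.append(str(target))
    return paths
```

Rendering one page at a time keeps memory use bounded on long catalogs, since only a single page's pixmap is held at once.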

Step 2: Contour Detection & Grid Cell Isolation

Catalog pages usually group slab images and SKU data inside visual grid cells or boxes. We use computer vision (OpenCV) to detect these bounding boxes:

Binarization: Convert the page image to grayscale and apply adaptive thresholding to isolate boundaries.
Morphological Operations: Apply horizontal and vertical structuring kernels to detect solid horizontal and vertical grid lines, then combine the results into a clean binary mask of the catalog layout.
Contour Extraction: Find contours on the grid mask and filter out shapes that are too small (noise) or too large (page borders).
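The three steps above could be sketched with OpenCV as follows. The threshold block size, kernel lengths, and area fractions are assumed values chosen for illustration, and the function names are hypothetical; real catalogs would need tuning.

```python
def filter_cell_boxes(boxes, page_w, page_h, min_frac=0.005, max_frac=0.5):
    """Drop boxes that are too small (noise) or too large (page borders).

    min_frac and max_frac are assumed fractions of the page area.
    """
    page_area = page_w * page_h
    return [
        (x, y, w, h)
        for (x, y, w, h) in boxes
        if min_frac * page_area <= w * h <= max_frac * page_area
    ]


def detect_grid_cells(page_bgr):
    """Return candidate tile cells as (x, y, w, h) boxes for one page image."""
    import cv2  # OpenCV, imported lazily so the filter is usable without it

    # Binarization: grayscale, then inverted adaptive threshold (lines become white).
    gray = cv2.cvtColor(page_bgr, cv2.COLOR_BGR2GRAY)
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY_INV, 15, 10
    )

    # Morphological opening with long, thin kernels keeps only long straight lines.
    page_h, page_w = gray.shape
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (max(page_w // 30, 1), 1))
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, max(page_h // 30, 1)))
    horizontal = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel)
    vertical = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel)
    grid_mask = cv2.bitwise_or(horizontal, vertical)

    # RETR_LIST also returns the inner hole contours, which are the cells.
    # Indexing with [-2] works for both OpenCV 3.x and 4.x return shapes.
    contours = cv2.findContours(grid_mask, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)[-2]
    boxes = [cv2.boundingRect(contour) for contour in contours]
    return filter_cell_boxes(boxes, page_w, page_h)
```

Keeping the size filter as a separate pure function makes the noise and page-border thresholds easy to adjust per catalog without touching the image-processing code.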

Step 3: Isolated OCR and Data Normalization

Once we have the coordinates (x, y, w, h) of each tile cell, we crop the image of the stone slab from the top half of the cell, crop the text area from the bottom half, and run OCR exclusively on the cropped text area.

By running OCR on a small, isolated box rather than the whole page, we ensure that the extracted SKU, finish (polished/honed), and size parameters belong only to the stone slab image cropped from the same box.
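As a rough sketch of the per-cell crop, the helpers below split a cell into an image box and a text box and run OCR on the text box only. The source does not name its OCR engine, so the use of `pytesseract` here is an assumption, as are the function names and the fixed 50/50 split ratio parameter.

```python
def split_cell(cell, image_ratio=0.5):
    """Split a cell (x, y, w, h) into a top image box and a bottom text box."""
    x, y, w, h = cell
    split = int(h * image_ratio)
    image_box = (x, y, w, split)
    text_box = (x, y + split, w, h - split)
    return image_box, text_box


def crop(page, box):
    """Crop a NumPy-style image array (rows are y, columns are x)."""
    x, y, w, h = box
    return page[y:y + h, x:x + w]


def extract_cell(page, cell):
    """Return the slab image crop and the raw OCR text for one cell."""
    import pytesseract  # assumed OCR engine, imported lazily

    image_box, text_box = split_cell(cell)
    slab_image = crop(page, image_box)
    raw_text = pytesseract.image_to_string(crop(page, text_box))
    return slab_image, raw_text.strip()
```

Because both crops come from the same cell coordinates, the returned image and text are paired by construction rather than by a later matching step.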

Key Engineering Challenges Solved

1. The Borderless Grid Problem

Some catalogs have no visible grid lines; product images float on a white page with text underneath. When morphological grid detection returns zero cells, the pipeline switches to a clustering-based layout analyzer. We use projection profiles (scanning rows and columns for white-space gaps) to compute virtual grid lanes, then derive each tile's bounding box from where the row and column lanes intersect.
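A simplified, pure-Python sketch of the projection-profile idea, assuming a binarized page where 1 marks ink and 0 marks white space. The function names and the `min_gap` parameter are hypothetical, and the sketch assumes a regular grid where every row shares the same column lanes.

```python
def find_lanes(profile, min_gap=1):
    """Return (start, end) ranges, end exclusive, where the profile contains ink.

    Blank runs shorter than min_gap do not split a lane.
    """
    lanes = []
    start = None  # start index of the lane being built
    gap = 0       # length of the current run of blank positions
    for i, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                lanes.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        lanes.append((start, len(profile) - gap))
    return lanes


def virtual_grid(binary_rows, min_gap=1):
    """Build (x, y, w, h) cells from row and column projection profiles."""
    row_profile = [sum(row) for row in binary_rows]
    col_profile = [sum(col) for col in zip(*binary_rows)]
    row_lanes = find_lanes(row_profile, min_gap)
    col_lanes = find_lanes(col_profile, min_gap)
    return [
        (x0, y0, x1 - x0, y1 - y0)
        for (y0, y1) in row_lanes
        for (x0, x1) in col_lanes
    ]


PAGE = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
print(virtual_grid(PAGE))
```

On a real page, `min_gap` would need to be larger than the small gap between a slab image and its caption so the two stay in the same lane, yet smaller than the gutter between neighboring products.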

2. Text-to-Data Normalization

OCR outputs raw string data like "Volacas Wt (Pol) 60x120cm - SKU9087". We run the OCR output through a regex parser and a lightweight local dictionary-matching layer. The parser strips punctuation, standardizes measurements (inputs such as 600x1200mm and 60x120 are converted to consistent metric floats), and maps stone colors and finishes to database-ready enumerations (Material: Marble, Color: White, Finish: Polished).
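A minimal sketch of such a parser, using only Python's standard `re` module. The lookup tables, field names, the choice of millimetres as the target unit, and the rule that unitless sizes default to centimetres are all assumptions made for illustration; the source does not specify them.

```python
import re

# Hypothetical lookup tables; a real dictionary layer would be far larger.
MATERIALS = {"volacas": "Marble"}
COLORS = {"wt": "White"}
FINISHES = {"pol": "Polished", "hon": "Honed"}
LOOKUPS = {"material": MATERIALS, "color": COLORS, "finish": FINISHES}

SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*[x×]\s*(\d+(?:\.\d+)?)\s*(mm|cm)?", re.IGNORECASE)
SKU_RE = re.compile(r"\bSKU[-\s]?(\d+)\b", re.IGNORECASE)


def parse_label(raw, default_unit="cm"):
    """Turn one raw OCR label into a normalized record (sizes in millimetres)."""
    record = {
        "sku": None, "width_mm": None, "height_mm": None,
        "material": None, "color": None, "finish": None,
    }

    sku = SKU_RE.search(raw)
    if sku:
        record["sku"] = f"SKU{sku.group(1)}"

    size = SIZE_RE.search(raw)
    if size:
        # Assumption: sizes without a unit are treated as default_unit.
        unit = (size.group(3) or default_unit).lower()
        factor = 10.0 if unit == "cm" else 1.0
        record["width_mm"] = float(size.group(1)) * factor
        record["height_mm"] = float(size.group(2)) * factor

    # Dictionary matching on alphabetic tokens; punctuation is ignored.
    for token in re.findall(r"[a-z]+", raw.lower()):
        for field, table in LOOKUPS.items():
            if record[field] is None and token in table:
                record[field] = table[token]
    return record


print(parse_label("Volacas Wt (Pol) 60x120cm - SKU9087"))
```

Fields that cannot be resolved stay `None`, which gives a review screen an obvious signal for records that need a human look.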

Conclusion

Parsing highly visual document layouts requires moving beyond raw character recognition. By combining traditional computer vision techniques (adaptive thresholding, morphological operations, contour detection) with targeted, localized OCR, Tile Extractor transformed chaotic catalogs into clean, standardized, database-ready data for commercial APIs. Building systems that bridge the gap between unstructured visual media and structured databases is at the core of what we do at Ananta Labs.
