When the Cleanup Code Becomes the Project

Tesseract can't do handwriting. Time to spend money.

AWS Textract. Cloud service, built-in handwriting detection, pay per page. If I'm paying for it, the output should at least be usable.

Textract

import boto3
from pathlib import Path

def extract_document(image_path):
    client = boto3.client('textract')

    with open(image_path, 'rb') as f:
        image_bytes = f.read()

    response = client.detect_document_text(
        Document={'Bytes': image_bytes}
    )

    lines = []
    for block in response['Blocks']:
        if block['BlockType'] == 'LINE':
            lines.append({
                'text': block['Text'],
                'confidence': block['Confidence']
            })

    return lines

if __name__ == "__main__":
    for image_file in sorted(Path("images").glob("*")):
        print(f"\n--- {image_file.name} ---")
        for line in extract_document(image_file):
            print(f"  [{line['confidence']:5.1f}%] {line['text']}")

Confidence scores are a nice touch. Accuracy is better than Tesseract's, maybe 40-60% on a good document.
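Those scores can at least drive triage. A minimal sketch, assuming the list-of-dicts shape that extract_document returns; the 80.0 threshold and the sample values are made up for illustration, not measured from my documents.

```python
def split_by_confidence(lines, threshold=80.0):
    """Split OCR lines into (trusted, needs_review) by confidence score."""
    trusted, needs_review = [], []
    for line in lines:
        # Textract reports Confidence on a 0-100 scale.
        bucket = trusted if line['confidence'] >= threshold else needs_review
        bucket.append(line)
    return trusted, needs_review


# Hypothetical sample in the same shape extract_document produces.
sample = [
    {'text': '2 1/4 c fleur', 'confidence': 62.5},
    {'text': 'Preheat oven', 'confidence': 97.1},
]

trusted, needs_review = split_by_confidence(sample)
for line in needs_review:
    print(f"REVIEW [{line['confidence']:5.1f}%] {line['text']}")
```

A high score only means Textract is sure of itself, not that it is right, so the threshold is a tuning knob rather than a guarantee.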

But "better" isn't "good enough." "2 1/4 cups flour" comes back as "2 1/4 c fleur." "1 tsp baking soda" becomes "1 tso bokrig sado."
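One obvious patch is snapping garbled words to a known vocabulary. A rough sketch using the standard library's difflib; the VOCAB list and the 0.6 cutoff are assumptions for illustration, and a real list would have to be built by hand from the documents.

```python
import difflib

# Hypothetical vocabulary; not taken from any real document set.
VOCAB = ['flour', 'sugar', 'butter', 'baking', 'soda', 'cups', 'tsp', 'tbsp']


def correct_line(text, vocab=VOCAB, cutoff=0.6):
    """Replace each unknown word with its closest vocabulary entry, if any."""
    fixed = []
    for token in text.split():
        word = token.lower()
        # Leave numbers, fractions, very short tokens, and known words alone.
        if len(word) < 3 or not word.isalpha() or word in vocab:
            fixed.append(token)
            continue
        matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
        fixed.append(matches[0] if matches else token)
    return ' '.join(fixed)


print(correct_line('2 1/4 c fleur'))
print(correct_line('1 tso bokrig sado'))
```

Words that score under the cutoff are left as they are, and lowering the cutoff starts "correcting" words that were already right. It helps with near misses and nothing else.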

The Real Problem

Even when it gets the words right, Textract doesn't know what any of it means. Flat text. Lines in reading order. My documents have titles, ingredient lists, instruction paragraphs, notes in margins. Textract sees none of that. Just characters on a page.
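To be fair, each LINE block in the response does carry a Geometry.BoundingBox (Left, Top, Width, Height as fractions of the page). That is position, not meaning. A sketch of the kind of guesswork it allows, grouping lines into chunks by vertical whitespace; the 0.03 gap and the sample blocks are invented for illustration, and this is not something the script above does.

```python
def group_by_vertical_gap(blocks, gap=0.03):
    """Group LINE blocks into chunks wherever the vertical whitespace exceeds gap."""
    lines = [b for b in blocks if b.get('BlockType') == 'LINE']
    lines.sort(key=lambda b: b['Geometry']['BoundingBox']['Top'])

    groups = []
    prev_bottom = None
    for block in lines:
        box = block['Geometry']['BoundingBox']
        # A large jump from the previous line's bottom edge starts a new chunk.
        if prev_bottom is None or box['Top'] - prev_bottom > gap:
            groups.append([])
        groups[-1].append(block['Text'])
        prev_bottom = box['Top'] + box['Height']
    return groups


def _line(text, top, height=0.02):
    """Build a hand-made block in the shape Textract uses (illustrative values)."""
    return {
        'BlockType': 'LINE',
        'Text': text,
        'Geometry': {'BoundingBox': {'Left': 0.1, 'Top': top,
                                     'Width': 0.5, 'Height': height}},
    }


sample_blocks = [
    _line('Title', 0.05, height=0.03),
    _line('2 cups flour', 0.15),
    _line('1 tsp salt', 0.18),
    _line('Mix and bake', 0.30),
]
print(group_by_vertical_gap(sample_blocks))
```

Even when the grouping comes out clean, nothing says which chunk is the title and which is a margin note. That part is still on me.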

So now I'm writing parsers.

import boto3
import json
import re
from pathlib import Path

def extract_document(image_path):
    client = boto3.client('textract')

    with open(image_path, 'rb') as f:
        image_bytes = f.read()

    response = client.detect_document_text(
        Document={'Bytes': image_bytes}
    )

    lines = []
    for block in response['Blocks']:
        if block['BlockType'] == 'LINE':
            lines.append({
                'text': block['Text'],
                'confidence': block['Confidence']
            })

    return lines

def parse_structured_data(raw_lines):
    title = None
    items = []
    instructions = []

    # Whole number with an optional fraction ("2", "1/2", "2 1/4"), then unit, then item.
    quantity_pattern = r'^(\d+(?:\s+\d+/\d+|/\d+)?)\s*(cups?|tbsp?|tsp|oz|lbs?|g|ml|c)\s+(.+)'

    for line in raw_lines:
        text = line['text'].strip()
        match = re.match(quantity_pattern, text, re.IGNORECASE)

        if match:
            items.append({
                'quantity': match.group(1),
                'unit': match.group(2),
                'item': match.group(3)
            })
        elif not title:
            title = text
        else:
            instructions.append(text)

    return {'title': title, 'items': items, 'instructions': instructions}

if __name__ == "__main__":
    for image_file in sorted(Path("images").glob("*")):
        print(f"\n--- {image_file.name} ---")
        raw_lines = extract_document(image_file)
        result = parse_structured_data(raw_lines)
        print(json.dumps(result, indent=2))

Works on 30% of the documents. The other 70% break at least one assumption. Title not on line one. Quantities written backwards. Abbreviations I've never seen. Crossed-out text mixed into the content. Multi-line entries split apart.
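A quick way to see the misses is to list every line the quantity pattern fails to claim. A small sketch; the pattern is a variant of the one in the parser, and the sample lines are invented to mimic the failure modes above.

```python
import re

# Variant of the parser's quantity pattern: number, optional fraction, unit, item.
QUANTITY = re.compile(
    r'^(\d+(?:\s+\d+/\d+|/\d+)?)\s*(cups?|tbsp?|tsp|oz|lbs?|g|ml|c)\s+(.+)',
    re.IGNORECASE,
)


def unmatched_lines(lines, pattern=QUANTITY):
    """Return the lines the pattern does not recognize as 'quantity unit item'."""
    return [text for text in lines if not pattern.match(text.strip())]


# Hypothetical ingredient lines, two of which break the pattern's assumptions.
sample_lines = [
    '2 1/4 cups flour',
    'flour, 2 cups',        # quantity written after the item
    'a pinch of salt',      # no number at all
    '1 tsp baking soda',
]

for text in unmatched_lines(sample_lines):
    print(f"MISSED: {text}")
```

Every missed ingredient line silently lands in the title or the instructions, which is why the output looks plausible right up until someone reads it.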

Every new document, a new edge case. Every new edge case, another if, another regex.

Saturday Night

Here's where I'm at:

Pre-processing with 6 configurable parameters
200+ lines of regex and heuristics
70% of documents still need a human
Accuracy I'm being generous in calling 30%

The parser is now more work than just typing things by hand.

And every time I fix one document's output, three others break. The heuristics are fragile. Interconnected. Basically untestable because no two documents look alike.
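The standard defense is a regression check: save the expected output for each document already fixed and re-run all of them after every change. A minimal sketch, assuming the parser is any function from raw lines to a dict; toy_parser and the two fixtures are stand-ins invented for illustration, not the real parser or real documents.

```python
def run_regressions(parse_fn, cases):
    """Return the names of cases whose parsed output no longer matches expectations."""
    failures = []
    for name, (raw_lines, expected) in cases.items():
        if parse_fn(raw_lines) != expected:
            failures.append(name)
    return failures


def toy_parser(raw_lines):
    """Stand-in parser that assumes the first line is always the title."""
    texts = [line['text'] for line in raw_lines]
    return {'title': texts[0] if texts else None, 'body': texts[1:]}


# Hypothetical fixtures: raw lines paired with the hand-checked expected result.
CASES = {
    'simple': (
        [{'text': 'Banana Bread'}, {'text': 'Mix well'}],
        {'title': 'Banana Bread', 'body': ['Mix well']},
    ),
    'title_not_first': (
        [{'text': 'from Grandma'}, {'text': 'Banana Bread'}],
        {'title': 'Banana Bread', 'body': ['from Grandma']},
    ),
}

print(run_regressions(toy_parser, CASES))
```

This catches breakage, but it does not reduce it. When no two documents look alike, every fixture is its own special case, and the harness mostly reports how many there are.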

One document has a crossed-out line. Original text scratched out, correction written above. Any person glances at it and reads the correction. Half a second.

Textract returns both lines. Jumbled. My parser doesn't know what a strikethrough is. Teaching it would mean analyzing the spatial layout of ink strokes. That's not a text problem anymore. That's a computer vision problem.

I'm a full day in. The system I'm building reads worse than I do, and the code to make it slightly less bad is growing faster than the documents it's supposed to process.
