One API Call Changed Everything

Sunday morning. I'm about ready to type everything by hand and call it a weekend.

But I want to try one more thing. Instead of running OCR to extract characters and then writing code to figure out what those characters mean, what if I just send the image to a vision model and ask what the document says?

The Code

import openai
import base64
import json
from pathlib import Path

def extract_document_vision(image_path):
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")

    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """Extract handwritten documents into structured 
                JSON. Identify:
                - title: the document title or heading
                - items: array of {quantity, unit, item} for any 
                  listed items with measurements
                - instructions: array of step strings for any 
                  procedural content
                - notes: any additional annotations or side notes

                If something is crossed out, ignore it.
                If you can't read something clearly, make your 
                best interpretation and add an "uncertain": true 
                flag to that field."""
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_data}"
                        }
                    }
                ]
            }
        ],
        response_format={"type": "json_object"}
    )

    message = response.choices[0].message
    if message.content is None:
        raise ValueError(f"No content returned. Finish reason: {response.choices[0].finish_reason}. Refusal: {message.refusal}")
    return json.loads(message.content)

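if __name__ == "__main__":
    # The data URL above is hard-coded to image/jpeg, so only pick up JPEG files.
    jpegs = (p for p in Path("images").iterdir() if p.suffix.lower() in {".jpg", ".jpeg"})
    for image_file in sorted(jpegs):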
        print(f"\n--- {image_file.name} ---")
        try:
            result = extract_document_vision(image_file)
            print(json.dumps(result, indent=2))
        except ValueError as e:
            print(f"Skipped: {e}")

No pre-processing. No regex. No parser. One API call and a prompt in English.
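
One caveat with the function above: the data URL always says image/jpeg, whatever the file actually is. If the folder also holds PNG scans, a small helper along these lines could pick the MIME type from the file extension. This is a sketch, not part of the original script. The name to_data_url is made up for illustration, and it assumes the extension tells the truth about the file.

```python
import base64
import mimetypes
from pathlib import Path

def to_data_url(image_path):
    """Build a base64 data URL whose MIME type matches the file extension.

    Hypothetical helper, not from the original script.
    """
    # guess_type looks only at the file name, not the file contents.
    mime_type, _ = mimetypes.guess_type(str(image_path))
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {image_path}")
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

Inside extract_document_vision, the "url" value would then become to_data_url(image_path). Because the helper raises ValueError, which the loop at the bottom of the script already catches, any non-image file that reaches it gets reported as skipped instead of being sent to the API.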

Compare that to yesterday.

What Came Back

The document that Textract turned into "2 1/4 c fleur, 1 tso bokrig sado":

{
  "title": "Chocolate Chip Cookies",
  "items": [
    {"quantity": "2 1/4", "unit": "cups", "item": "flour"},
    {"quantity": "1", "unit": "tsp", "item": "baking soda"},
    {"quantity": "1", "unit": "tsp", "item": "salt"},
    {"quantity": "1", "unit": "cup", "item": "butter", "notes": "2 sticks"}
  ],
  "instructions": [
    "Preheat oven to 375°F",
    "Combine flour, baking soda and salt in small bowl",
    "Beat butter, granulated sugar, brown sugar and vanilla extract in large mixer bowl until creamy",
    "Add eggs, beating well",
    "Gradually beat in flour mixture",
    "Stir in chocolate chips",
    "Drop rounded tablespoon of dough onto ungreased baking sheets",
    "Bake for 9 to 11 minutes or until golden brown"
  ],
  "notes": null
}

First try. "c" became "cups." "tsp" stayed "tsp" because that's already standard. It caught "(2 sticks)" as a note on the butter and put it in the right field.
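
The quantities come back as strings like "2 1/4", which is fine for display. If you want actual numbers out of them, something like the following might do. It's a sketch under the assumption that quantities are whole numbers, simple fractions, or a mix of the two separated by a space. parse_quantity is a hypothetical name, not something from the script above.

```python
from fractions import Fraction

def parse_quantity(text):
    """Turn a quantity string such as "2 1/4" or "1" into an exact Fraction.

    Hypothetical helper. Raises ValueError for anything that is not a
    whole number, a fraction, or the two separated by whitespace.
    """
    parts = text.split()
    if not parts:
        raise ValueError("empty quantity")
    # Fraction accepts strings like "2" and "1/4" directly.
    return sum((Fraction(part) for part in parts), Fraction(0))
```

Anything it can't parse (say, a handwritten "a pinch") raises ValueError, so those entries can be left as strings.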

Why

OCR asks "what characters are in this image?" Hard problem when the characters are messy handwriting.

The vision model asks "what does this document say?" Sounds like the same question. It's not.

Think about how you read someone's handwriting. You don't decode each letter and build words from shapes. You look at the whole thing, and between context, layout, and your knowledge of language, you just know. Even when individual letters are a mess.

That's what's happening here. The model isn't a better letter-recognizer. It's skipping that problem entirely.

The Stuff That Broke OCR

The crossed-out line that killed my parser? Vision model saw the strikethrough, ignored it, read the correction. No code for that. Just worked.

Marginal notes Textract mixed into the main text? Identified as supplementary. Put in the "notes" field.

Abbreviations Tesseract turned into garbage? Interpreted from context.

The layout I spent 200 lines of regex on? Figured out on its own. Titles in "title." Items in "items." Steps in "instructions."
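
The model decides the layout, so nothing in the script guarantees the keys are actually there. A quick shape check before trusting a result could look like this. It's a minimal sketch that assumes the key names from the prompt (title, items, instructions). check_shape is a made-up name, and the rules are a guess at what's worth checking, not something the API enforces.

```python
def check_shape(result):
    """Return a list of problems found in an extraction result (empty if none).

    Hypothetical helper. Assumes `result` is the dict parsed from the
    model's JSON and that the key names match the ones in the prompt.
    """
    problems = []
    if "title" not in result:
        problems.append("title is missing")
    items = result.get("items")
    if not isinstance(items, list):
        problems.append("items is missing or not a list")
    else:
        for i, entry in enumerate(items):
            if not isinstance(entry, dict):
                problems.append(f"items[{i}] is not an object")
                continue
            for key in ("quantity", "unit", "item"):
                if key not in entry:
                    problems.append(f"items[{i}] is missing {key!r}")
    # instructions may be absent when the page has no procedural content.
    steps = result.get("instructions")
    if steps is not None and not isinstance(steps, list):
        problems.append("instructions is not a list")
    return problems
```

An empty list means the result has the expected shape. Anything else is a short, readable reason to look at that page by hand.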

Three Approaches, Same Document

Tesseract:

Chocohite Ch p Cookes
2 114 cps flcar
1 tso bokrg sado
l tsp sit
1 c (2 stcks) btter

Textract (after all the pre-processing and parsing):

Title: Chocokite Chtp Cookes (confidence: 0.67)
Items:
  - 2 1/4 c fleur
  - 1 tso bokrig sado  
  - l tsp slt
  - 1 c (2 stcks) btter
[MANUAL REVIEW REQUIRED - 4 items below confidence threshold]

Vision API:

{
  "title": "Chocolate Chip Cookies",
  "items": [
    {"quantity": "2 1/4", "unit": "cups", "item": "flour"},
    {"quantity": "1", "unit": "tsp", "item": "baking soda"},
    {"quantity": "1", "unit": "tsp", "item": "salt"},
    {"quantity": "1", "unit": "cup", "item": "butter", "notes": "2 sticks"}
  ]
}

Metric	Tesseract	Textract	Vision API
Character accuracy	30-40%	40-60%	95%+
Structure accuracy	N/A	~30%	~90%
Manual review needed	~90%	~70%	~5-10%
Pre-processing	Yes	Yes (6 params)	None
Lines of code	~50	300+	~30
Dev time	~4 hours	~40 hours	~2 hours
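
The accuracy rows are rough figures. One way a character-level number like that could be measured is to compare each tool's output with a hand-typed version of the same line. The sketch below uses difflib from the standard library. The similarity ratio is only a stand-in for character accuracy, and it is not necessarily how the numbers in the table were produced. char_similarity is a hypothetical name.

```python
from difflib import SequenceMatcher

def char_similarity(ocr_text, truth):
    """Similarity in [0, 1] between OCR output and hand-typed ground truth.

    Hypothetical helper: 1.0 means the strings are identical, lower
    values mean more characters differ.
    """
    return SequenceMatcher(None, ocr_text, truth).ratio()

# Example: one garbled line against what the card actually says.
score = char_similarity("2 114 cps flcar", "2 1/4 cups flour")
```

Averaging the score over every line of a document gives one number per tool that can be compared side by side.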

The vision model's mistakes are small. A "3" that might be an "8." An abbreviation it flagged as uncertain. Stuff you catch in seconds. Not garbled output you have to retype.
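
Since the prompt asks for an "uncertain": true flag, the review pass can start from a list of flagged fields instead of rereading everything. Here's a sketch that walks the parsed JSON and collects them. It assumes the model puts the flag on the object it is unsure about, which the prompt requests but nothing enforces. find_uncertain is a made-up name.

```python
def find_uncertain(node, path="result"):
    """Yield the path of every object that carries "uncertain": true.

    Hypothetical helper. Walks nested dicts and lists from json.loads.
    """
    if isinstance(node, dict):
        if node.get("uncertain") is True:
            yield path
        for key, value in node.items():
            yield from find_uncertain(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_uncertain(value, f"{path}[{index}]")

# Made-up sample data for illustration, not real model output.
sample = {
    "title": "Sample",
    "items": [
        {"quantity": "3", "unit": "tsp", "item": "salt", "uncertain": True},
        {"quantity": "1", "unit": "cup", "item": "butter"},
    ],
}
flagged = list(find_uncertain(sample))
```

Here flagged holds the single path "result.items[0]", which points straight at the entry to compare against the original page.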

By Sunday afternoon, everything is processed. The thing I spent all of Saturday failing to do took a couple of hours once I changed the approach.
