One API Call Changed Everything
Andrew Judd · 2026-05-25 · via DEV Community

Sunday morning. I'm about ready to type everything by hand and call it a weekend.

But I want to try one more thing. Instead of using OCR to extract characters and then writing code to figure out what those characters mean, what if I just send the image to a vision model and ask what the document says?

The Code

import openai
import base64
import json
import mimetypes
from pathlib import Path

def extract_document_vision(image_path):
    # Read the image and base64-encode it so it can be sent inline as a data URL.
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")
    # Guess the MIME type from the file extension; fall back to JPEG if unknown.
    mime_type = mimetypes.guess_type(str(image_path))[0] or "image/jpeg"

    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """Extract handwritten documents into structured 
                JSON. Identify:
                - title: the document title or heading
                - items: array of {quantity, unit, item} for any 
                  listed items with measurements
                - instructions: array of step strings for any 
                  procedural content
                - notes: any additional annotations or side notes

                If something is crossed out, ignore it.
                If you can't read something clearly, make your 
                best interpretation and add a "uncertain": true 
                flag to that field."""
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:{mime_type};base64,{image_data}"
                        }
                    }
                ]
            }
        ],
        # Ask for a JSON object so the reply can go straight into json.loads.
        response_format={"type": "json_object"}
    )

    message = response.choices[0].message
    # A refusal or an empty reply comes back with no content; surface the reason.
    if message.content is None:
        raise ValueError(f"No content returned. Finish reason: {response.choices[0].finish_reason}. Refusal: {message.refusal}")
    return json.loads(message.content)

if __name__ == "__main__":
    for image_file in sorted(Path("images").glob("*")):
        print(f"\n--- {image_file.name} ---")
        try:
            result = extract_document_vision(image_file)
            print(json.dumps(result, indent=2))
        except ValueError as e:
            print(f"Skipped: {e}")

No pre-processing. No regex. No parser. One API call and a prompt in English.

Compare that to yesterday.

What Came Back

The document that Textract turned into "2 1/4 c fleur, 1 tso bokrig sado":

{
  "title": "Chocolate Chip Cookies",
  "items": [
    {"quantity": "2 1/4", "unit": "cups", "item": "flour"},
    {"quantity": "1", "unit": "tsp", "item": "baking soda"},
    {"quantity": "1", "unit": "tsp", "item": "salt"},
    {"quantity": "1", "unit": "cup", "item": "butter", "notes": "2 sticks"}
  ],
  "instructions": [
    "Preheat oven to 375°F",
    "Combine flour, baking soda and salt in small bowl",
    "Beat butter, granulated sugar, brown sugar and vanilla extract in large mixer bowl until creamy",
    "Add eggs, beating well",
    "Gradually beat in flour mixture",
    "Stir in chocolate chips",
    "Drop rounded tablespoon of dough onto ungreased baking sheets",
    "Bake for 9 to 11 minutes or until golden brown"
  ],
  "notes": null
}

First try. "c" became "cups." "tsp" stayed "tsp" because that's already standard. It caught "(2 sticks)" as a note on the butter and put it in the right field.
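The quantities come back as strings like "2 1/4", which is fine for display but awkward if you ever want to scale a list. Here is a minimal sketch of one way to turn them into numbers. `parse_quantity` is a hypothetical helper, not part of the script above, and it assumes quantities are whole numbers, simple fractions, or a mix of the two.

```python
from fractions import Fraction

def parse_quantity(text):
    """Convert '2 1/4', '1/2', or '3' into a Fraction; return None if unparseable."""
    if not isinstance(text, str) or not text.split():
        return None
    try:
        # "2 1/4" splits into "2" and "1/4"; each part parses on its own.
        return sum(Fraction(part) for part in text.split())
    except (ValueError, ZeroDivisionError):
        return None

print(parse_quantity("2 1/4"))         # 9/4
print(float(parse_quantity("2 1/4")))  # 2.25
print(parse_quantity("a pinch"))       # None
```

Anything that doesn't parse returns `None`, so it can stay a string and get a human look instead of a wrong number.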

Why

OCR asks "what characters are in this image?" Hard problem when the characters are messy handwriting.

The vision model asks "what does this document say?" Sounds like the same question. It's not.

Think about how you read someone's handwriting. You don't decode each letter and build words from shapes. You look at the whole thing, and between context, layout, and your knowledge of the language, you just know. Even when individual letters are a mess.

That's what's happening here. The model isn't a better letter-recognizer. It's skipping that problem entirely.

The Stuff That Broke OCR

The crossed-out line that killed my parser? Vision model saw the strikethrough, ignored it, read the correction. No code for that. Just worked.

Marginal notes Textract mixed into the main text? Identified as supplementary. Put in the "notes" field.

Abbreviations Tesseract turned into garbage? Interpreted from context.

The layout I spent 200 lines of regex on? Figured out on its own. Titles in "title." Items in "items." Steps in "instructions."
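The model chooses the structure, so it's still worth a cheap check that each result has the shape the prompt asked for before anything downstream relies on it. This is a sketch only. `check_shape` is a hypothetical helper, and the rules it enforces (a string title, items with `quantity`, `unit`, and `item`, instructions as a list of strings) are my reading of the prompt, not something the API guarantees.

```python
def check_shape(doc):
    """Return a list of problems found in an extracted document; empty means OK."""
    problems = []
    if not isinstance(doc.get("title"), str):
        problems.append("title should be a string")

    items = doc.get("items")
    if not isinstance(items, list):
        problems.append("items should be a list")
    else:
        for i, item in enumerate(items):
            if not isinstance(item, dict):
                problems.append(f"items[{i}] should be an object")
                continue
            missing = [k for k in ("quantity", "unit", "item") if k not in item]
            if missing:
                problems.append(f"items[{i}] is missing {', '.join(missing)}")

    # Instructions are optional: a document may have no procedural content.
    steps = doc.get("instructions", [])
    if not isinstance(steps, list) or not all(isinstance(s, str) for s in steps):
        problems.append("instructions should be a list of strings")
    return problems

good = {
    "title": "Chocolate Chip Cookies",
    "items": [{"quantity": "1", "unit": "tsp", "item": "salt"}],
    "instructions": ["Preheat oven to 375°F"],
    "notes": None,
}
print(check_shape(good))  # []
print(check_shape({"title": None, "items": [{"quantity": "1", "item": "salt"}]}))
# ['title should be a string', 'items[0] is missing unit']
```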

Three Approaches, Same Document

Tesseract:

Chocohite Ch p Cookes
2 114 cps flcar
1 tso bokrg sado
l tsp sit
1 c (2 stcks) btter

Textract (after all the pre-processing and parsing):

Title: Chocokite Chtp Cookes (confidence: 0.67)
Items:
  - 2 1/4 c fleur
  - 1 tso bokrig sado  
  - l tsp slt
  - 1 c (2 stcks) btter
[MANUAL REVIEW REQUIRED - 4 items below confidence threshold]

Vision API:

{
  "title": "Chocolate Chip Cookies",
  "items": [
    {"quantity": "2 1/4", "unit": "cups", "item": "flour"},
    {"quantity": "1", "unit": "tsp", "item": "baking soda"},
    {"quantity": "1", "unit": "tsp", "item": "salt"},
    {"quantity": "1", "unit": "cup", "item": "butter", "notes": "2 sticks"}
  ]
}

| Metric | Tesseract | Textract | Vision API |
|---|---|---|---|
| Character accuracy | 30-40% | 40-60% | 95%+ |
| Structure accuracy | N/A | ~30% | ~90% |
| Manual review needed | ~90% | ~70% | ~5-10% |
| Pre-processing | Yes | Yes (6 params) | None |
| Lines of code | ~50 | 300+ | ~30 |
| Dev time | ~4 hours | ~40 hours | ~2 hours |
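If you want to put a number on character accuracy for your own documents, one common way is edit distance against a hand-typed ground truth. The sketch below is an assumption about how such a figure could be computed, not a record of how the numbers in the table were measured. `edit_distance` and `char_accuracy` are hypothetical helpers.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        prev = curr
    return prev[-1]

def char_accuracy(truth, ocr):
    """1 - distance / len(truth), clamped at 0 for very noisy output."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1 - edit_distance(truth, ocr) / len(truth))

print(char_accuracy("flour", "flcar"))  # 0.6 (two of five characters wrong)
```

Run it per line or per document and average the results. The ground truth is the part you still have to type by hand, so a small sample is usually enough.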

The vision model's mistakes are small. A "3" that might be an "8." An abbreviation it flagged as uncertain. Stuff you catch in seconds. Not garbled output you have to retype.
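Since the prompt asks the model to add `"uncertain": true` wherever it's guessing, the review step can be narrowed to those spots. This is a minimal sketch and it assumes the flag shows up as a key on whichever JSON object the model was unsure about. Where the model actually puts it can vary, so check a few real outputs first. `find_uncertain` and the sample data are hypothetical.

```python
def find_uncertain(node, path="$"):
    """Return JSON-style paths of every object flagged with "uncertain": true."""
    hits = []
    if isinstance(node, dict):
        if node.get("uncertain") is True:
            hits.append(path)
        for key, value in node.items():
            hits.extend(find_uncertain(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits.extend(find_uncertain(value, f"{path}[{index}]"))
    return hits

# Made-up output, shaped like the examples above, with one flagged item.
sample = {
    "title": "Chocolate Chip Cookies",
    "items": [
        {"quantity": "2 1/4", "unit": "cups", "item": "flour"},
        {"quantity": "3", "unit": "tsp", "item": "vanilla", "uncertain": True},
    ],
    "notes": None,
}

print(find_uncertain(sample))  # ['$.items[1]']
```

An empty list means nothing was flagged, which is not the same as nothing being wrong. A quick skim of the output is still worth it.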

By Sunday afternoon, everything is processed. The thing I spent all of Saturday failing to do took a couple of hours once I changed the approach.