Why I Generated Synthetic Patients to Make Identity Matching Better
Saiteja Jonnalagadda · 2026-05-31 · via DEV Community

If you have ever worked on matching or deduplicating people records at scale, you already know the quiet truth of the job. The data is imperfect, and the most challenging, most useful examples are exactly the ones you are not allowed to touch.

I work on patient identity at enterprise scale, and this problem sits at the center of my day. I want to walk through why it is so hard, along with a counterintuitive approach I described in a recent paper. The idea is to generate synthetic patients on purpose, flaws and all, to make identity-matching systems better.

What patient identity matching actually is

In a large health system, a person rarely exists as one clean record. The same individual appears across pharmacy, clinical, claims, and digital systems, each with its own spelling, formatting, and history. Patient identity matching, also called record linkage or entity resolution, is the work of deciding which of these records belong to the same person, assigning a single trusted identity, and merging duplicates without ever merging two different people.

That last point is the hard constraint. A missed match fragments a person's medical history. A wrong match merges two people's histories, which is far more dangerous.
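
In standard evaluation terms, these two failure modes are false negatives and false positives over candidate record pairs. As a short reference, using notation introduced here for illustration (TP for pairs correctly linked, FN for missed matches, FP for wrong matches):

```latex
\text{recall (sensitivity)} = \frac{TP}{TP + FN},
\qquad
\text{precision} = \frac{TP}{TP + FP}
```

A missed match lowers recall, and a wrong merge lowers precision, so a matcher has to be judged on both numbers together.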

Why getting it right matters

This is not a back-office cleanup chore. Accurate identity resolution supports:

  • Patient safety. Duplicate or fragmented records cause medication errors and missed allergies.
  • Care coordination. Pharmacists, clinicians, and care teams need one complete view at the point of care.
  • Regulatory compliance. Accurate identity is foundational to meeting healthcare data requirements.
  • Operational cost. Reconciling mismatched records by hand is slow and expensive.
  • Trustworthy analytics. Every downstream report and model inherits the quality of the identity layer.

The data problem

To build and stress-test a matching system, you want a large supply of realistic, imperfect examples, such as near-duplicates, typos, and transposed digits. The most realistic data, however, is actual patient records, and that is exactly what you cannot freely copy, share, or experiment on, because of privacy law and fragmentation across systems. As a result, teams end up testing on data far cleaner than reality, and then production surprises them.

The counterintuitive idea: make the data imperfect on purpose

The instinct with synthetic data is to make it clean. For this problem, clean synthetic data is almost useless. A model that only ever sees tidy records learns nothing about the hard cases, and the hard cases are the entire point.

So instead of flawless synthetic patients, I generated imperfect ones, on purpose. The records carry the same human errors that break matching in the real world, including phonetic misspellings, transposed and reversed digits, and the clerical noise that accumulates whenever people enter other people's information under pressure.

The flaws are not a side effect. The flaws are the test.

How I built it

The approach combines a few well-understood pieces:

  • Synthea generates a realistic synthetic patient population, with demographics and longitudinal medical histories, that mirrors the statistical shape of real clinical data without containing any real person's information.
  • Generative models, both Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), produce high-fidelity synthetic records that reflect the characteristics of real-world datasets.
  • On top of that, I added deliberate clerical errors, such as phonetic misspellings and reversed digits, to recreate the failure modes that matter for entity resolution.
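
As a minimal sketch of what such an error layer could look like, the following uses only the Python standard library. The substitution table `PHONETIC_SWAPS`, the function names, and the record fields are hypothetical illustrations, not the implementation from the paper. A real layer would be modeled on the failure patterns you actually observe.

```python
import random

# Hypothetical phonetic substitutions, chosen only for illustration.
PHONETIC_SWAPS = [("ph", "f"), ("ck", "k"), ("ie", "y"), ("ee", "ea"), ("c", "k")]


def phonetic_misspell(name: str, rng: random.Random) -> str:
    """Apply one phonetic substitution that fits the name, if any does.

    Simplification: the result is re-capitalized as a single word.
    """
    lowered = name.lower()
    candidates = [(old, new) for old, new in PHONETIC_SWAPS if old in lowered]
    if not candidates:
        return name
    old, new = rng.choice(candidates)
    return lowered.replace(old, new, 1).capitalize()


def transpose_digits(value: str, rng: random.Random) -> str:
    """Swap one pair of adjacent, differing digits (a common keying error)."""
    positions = [
        i for i in range(len(value) - 1)
        if value[i].isdigit() and value[i + 1].isdigit() and value[i] != value[i + 1]
    ]
    if not positions:
        return value
    i = rng.choice(positions)
    return value[:i] + value[i + 1] + value[i] + value[i + 2:]


def inject_errors(record: dict, error_rate: float, rng: random.Random) -> dict:
    """Return a noisy copy; each targeted field is corrupted with probability error_rate."""
    noisy = dict(record)
    if rng.random() < error_rate:
        noisy["last_name"] = phonetic_misspell(record["last_name"], rng)
    if rng.random() < error_rate:
        noisy["birth_date"] = transpose_digits(record["birth_date"], rng)
    return noisy


rng = random.Random(42)  # fixed seed so the generated noise is reproducible
clean = {"first_name": "Stephen", "last_name": "Phillips", "birth_date": "1984-03-12"}
print(inject_errors(clean, error_rate=1.0, rng=rng))
```

Because the noisy record is derived from a known clean one, the true match label comes for free, which is what makes the generated pairs usable as labeled test data.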

The stack was intentionally standard and reproducible. I used TensorFlow for the model architecture, the RecordLinkage toolkit for matching, and Pandas for data manipulation. The working set was a focused sample of 389 synthetic cases, each combining demographic attributes, longitudinal records, and intentional data-quality issues. I then trained deduplication models on this augmented data and measured accuracy and recall.
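
To illustrate the measurement step, here is a small sketch that scores candidate pairs and computes precision and recall against known ground truth. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for the RecordLinkage toolkit's comparison step. The sample records, the 0.85 threshold, and the equal weighting of name and birth date are assumptions chosen for illustration, not values from the paper.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Tiny hand-made sample. person_id is the ground truth that synthetic data
# provides for free; the matcher itself never looks at it.
records = [
    {"rec": 0, "person_id": "A", "last_name": "Phillips", "birth_date": "1984-03-12"},
    {"rec": 1, "person_id": "A", "last_name": "Fillips", "birth_date": "1984-03-12"},
    {"rec": 2, "person_id": "B", "last_name": "Nguyen", "birth_date": "1975-11-02"},
    {"rec": 3, "person_id": "B", "last_name": "Nguyen", "birth_date": "1975-11-20"},
    {"rec": 4, "person_id": "C", "last_name": "Phillip", "birth_date": "1990-07-30"},
    {"rec": 5, "person_id": "C", "last_name": "Filip", "birth_date": "1990-07-03"},
]


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(r1: dict, r2: dict, threshold: float = 0.85) -> bool:
    """Declare a match when the mean of name and birth-date similarity clears the threshold."""
    name_score = similarity(r1["last_name"], r2["last_name"])
    dob_score = similarity(r1["birth_date"], r2["birth_date"])
    return (name_score + dob_score) / 2 >= threshold


# Compare every pair; a real pipeline would use blocking to limit candidates.
pairs = list(combinations(records, 2))
truth = {(a["rec"], b["rec"]) for a, b in pairs if a["person_id"] == b["person_id"]}
predicted = {(a["rec"], b["rec"]) for a, b in pairs if is_match(a, b)}

true_positives = len(truth & predicted)
precision = true_positives / len(predicted) if predicted else 1.0
recall = true_positives / len(truth) if truth else 1.0

print(f"predicted={sorted(predicted)} precision={precision:.2f} recall={recall:.2f}")
```

On this sample the matcher links the two easier duplicates but misses the third, where both the name and the birth date are corrupted, so recall is 2/3 while precision stays at 1.0. That missed pair is the kind of near-duplicate the augmented training data is meant to surface.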

How generative AI helps, beyond simply adding more data

A few qualities make generative models a strong fit for this work:

  • They capture the shape of human error. Rather than writing rules for every typo pattern by hand, the models learn the imperfect distribution of real data.
  • They remove the privacy obstacle. Synthetic records contain no real protected health information, so you can share, experiment, and benchmark freely.
  • They provide a controllable test bed. You can raise or lower error types and rates to find exactly where a matcher breaks.
  • They scale. Once the pipeline exists, generating more edge cases is inexpensive.
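
One way to use that controllability is to sweep a single error rate and watch how different matchers respond. The sketch below is a toy under stated assumptions: the name list, the adjacent-character swap as the only error type, and the 0.8 similarity threshold are all illustrative, and `difflib.SequenceMatcher` stands in for a real comparison function.

```python
import random
from difflib import SequenceMatcher

# Hypothetical surnames used only to drive the sweep.
NAMES = ["Phillips", "Nguyen", "Schneider", "Kowalski",
         "Fernandez", "Johnson", "Catherine", "Mohammed"]


def swap_adjacent(text: str, rng: random.Random) -> str:
    """Transpose two adjacent characters at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def sweep(rates, threshold=0.8, trials=500, seed=7):
    """For each error rate, return (exact-match recall, fuzzy-match recall).

    Every trial is a true duplicate pair, so the share of pairs a matcher
    accepts is its recall at that error rate.
    """
    rng = random.Random(seed)
    results = {}
    for rate in rates:
        exact_hits = 0
        fuzzy_hits = 0
        for _ in range(trials):
            name = rng.choice(NAMES)
            # Corrupt the duplicate with probability `rate`.
            noisy = swap_adjacent(name, rng) if rng.random() < rate else name
            a, b = name.lower(), noisy.lower()
            exact_hits += a == b
            fuzzy_hits += SequenceMatcher(None, a, b).ratio() >= threshold
        results[rate] = (exact_hits / trials, fuzzy_hits / trials)
    return results


for rate, (exact, fuzzy) in sweep([0.0, 0.25, 0.5, 0.75, 1.0]).items():
    print(f"error rate {rate:.2f}: exact recall {exact:.2f}, fuzzy recall {fuzzy:.2f}")
```

The same loop extends to other error types or to a trained model in place of the threshold rule. The point is only that the error rate becomes a dial you turn rather than a property of whatever data you happened to receive.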

What I found

The result that mattered most was clear. The generative models were able to represent the patterns of human error, and training the deduplication models on that data improved their sensitivity to true matches by a significant margin, with better accuracy and recall on the near-duplicate records that usually slip through.

The approach achieved this while removing the original obstacle entirely. Because every record was synthetic, the improvement came without touching or exposing a single real patient record. The privacy constraint that created the problem in the first place simply stopped being a constraint.

Practical lessons if you try this

  • Realism and controllability pull in opposite directions. Treat error injection as its own controllable layer rather than hoping the generator produces enough natural noise.
  • Synthea gives you clean bones; you supply the bruises. The simulator produces well-formed records, and the value for entity resolution comes from the error layer you add on top, modeled on the failures you actually see.
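  • Watch recall, without losing sight of precision. Missed true matches are the errors that quietly slip through, so improvements in sensitivity are what justify the work, but a gain in recall only counts if it does not come with more wrong merges.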
  • Start small and focused. A tight, well-labeled synthetic set is worth more than a large, uncontrolled one.

This is not only a healthcare technique

Although I work in healthcare, the same idea applies anywhere you perform entity resolution on sensitive data, such as banking customers, citizens across government systems, or identities across enterprise platforms. Wherever privacy blocks access to realistic, imperfect data, synthetic data with the errors built in is a practical way around the wall.

Conclusion

Identity resolution quietly sits underneath accurate medical histories, clean master data, and trustworthy analytics. The hard part was never only the algorithm. It was obtaining realistic, imperfect, shareable data to build against. Generating synthetic data with human errors on purpose turns that constraint into something you control.

Perfect data makes for a confident demo and a fragile system. Deliberately imperfect data does the opposite.


This post summarizes my peer-reviewed paper, "Generative AI for Synthetic Patient Data Generation to Enhance Identity Matching and Deduplication Models," International Journal of Computer Applications, Vol. 187, No. 103, pp. 32–38 (2026). Full paper: https://www.ijcaonline.org/archives/volume187/number103/generative-ai-for-synthetic-patient-data-generation-to-enhance-identity-matching-and-deduplication-models/

Saiteja Jonnalagadda is a Senior Cloud Engineer working on AI-driven, large-scale healthcare data platforms.