Why I Generated Synthetic Patients to Make Identity Matching Better

If you have ever worked on matching or deduplicating people records at scale, you already know the quiet truth of the job. The data is imperfect, and the most challenging, most useful examples are exactly the ones you are not allowed to touch.

I work on patient identity at enterprise scale, and this problem sits at the center of my day. I want to walk through why it is so hard, along with a counterintuitive approach I described in a recent paper. The idea is to generate synthetic patients on purpose, flaws and all, to make identity-matching systems better.

What patient identity matching actually is

In a large health system, the same person rarely exists as one clean record. The same individual appears across pharmacy, clinical, claims, and digital systems, each with its own spelling, formatting, and history. Patient identity matching, also called record linkage or entity resolution, is the work of deciding which of these records belong to the same person, assigning a single trusted identity, and merging duplicates without ever merging two different people.

That last point is the hard constraint. A missed match fragments a person's medical history. A wrong match merges two people's histories, which is far more dangerous.

Why getting it right matters

This is not a back-office cleanup chore. Accurate identity resolution supports:

Patient safety. Duplicate or fragmented records cause medication errors and missed allergies.
Care coordination. Pharmacists, clinicians, and care teams need one complete view at the point of care.
Regulatory compliance. Accurate identity is foundational to meeting healthcare data requirements.
Operational cost. Reconciling mismatched records by hand is slow and expensive.
Trustworthy analytics. Every downstream report and model inherits the quality of the identity layer.

The data problem

To build and stress-test a matching system, you want a large supply of realistic, imperfect examples, such as near-duplicates, typos, and transposed digits. The most realistic data, however, is actual patient records, and that is exactly what you cannot freely copy, share, or experiment on, because of privacy law and fragmentation across systems. As a result, teams end up testing on data far cleaner than reality, and then production surprises them.

The counterintuitive idea, making the data imperfect on purpose

The instinct with synthetic data is to make it clean. For this problem, clean synthetic data is almost useless. A model that only ever sees tidy records learns nothing about the hard cases, and the hard cases are the entire point.

So instead of flawless synthetic patients, I generated imperfect ones, on purpose. The records carry the same human errors that break matching in the real world, including phonetic misspellings, transposed and reversed digits, and the clerical noise that accumulates whenever people enter other people's information under pressure.

The flaws are not a side effect. The flaws are the test.

How I built it

The approach combines a few well-understood pieces:

Synthea generates a realistic synthetic patient population, with demographics and longitudinal medical histories, that mirrors the statistical shape of real clinical data without containing any real person's information.
Generative models, both Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), produce high-fidelity synthetic records that reflect the characteristics of real-world datasets.
On top of that, I added deliberate clerical errors, such as phonetic misspellings and reversed digits, to recreate the failure modes that matter for entity resolution.
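
To make the error layer concrete, here is a minimal sketch of rule-based injection using only the Python standard library. It is not the code from the paper: the function names, the substitution table, and the record fields are all hypothetical, chosen only to illustrate phonetic misspellings and transposed digits.

```python
import random

# Hypothetical phonetic substitutions; a real layer would be modeled on
# the misspellings actually observed in your own data.
PHONETIC_SWAPS = [("ph", "f"), ("ck", "k"), ("ee", "ea"), ("y", "i"), ("c", "k")]


def transpose_adjacent(text, rng):
    """Swap two adjacent characters that differ, e.g. '1984' -> '1894'."""
    positions = [i for i in range(len(text) - 1) if text[i] != text[i + 1]]
    if not positions:
        return text
    i = rng.choice(positions)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def phonetic_misspell(name, rng):
    """Apply one sound-alike substitution if any pattern occurs in the name."""
    lowered = name.lower()
    candidates = [(a, b) for a, b in PHONETIC_SWAPS if a in lowered]
    if not candidates:
        return name
    a, b = rng.choice(candidates)
    return lowered.replace(a, b, 1).title()


def corrupt_record(record, error_rate, rng):
    """Return a noisy copy of a record; each field is corrupted independently."""
    noisy = dict(record)
    if rng.random() < error_rate:
        noisy["last_name"] = phonetic_misspell(record["last_name"], rng)
    if rng.random() < error_rate:
        noisy["birth_date"] = transpose_adjacent(record["birth_date"], rng)
    return noisy


rng = random.Random(42)  # fixed seed so the noisy set is reproducible
clean = {"first_name": "Stephen", "last_name": "Philips", "birth_date": "19840312"}
print(corrupt_record(clean, error_rate=1.0, rng=rng))
```

Keeping the original record untouched and returning a corrupted copy gives you labeled pairs for free: every clean record and its noisy copy are a known true match.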

The stack was intentionally standard and reproducible. I used TensorFlow for the model architecture, the RecordLinkage toolkit for matching, and Pandas for data manipulation. The working set was a focused sample of 389 synthetic cases, each combining demographic attributes, longitudinal records, and intentional data-quality issues. I then trained deduplication models on this augmented data and measured accuracy and recall.
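
For readers who have not seen a matching step before, the sketch below shows its general shape: block candidate pairs, score field similarity, and apply a threshold. It uses the standard library's difflib as a stand-in for a dedicated linkage toolkit, so it is not the pipeline from the paper. The records, field names, and the 0.85 threshold are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records; a1 and a2 are the same person with clerical noise.
records = [
    {"id": "a1", "first": "Stephen", "last": "Philips", "dob": "1984-03-12"},
    {"id": "a2", "first": "Stephen", "last": "Filips", "dob": "1984-03-21"},
    {"id": "b1", "first": "Maria", "last": "Gonzalez", "dob": "1984-07-02"},
    {"id": "c1", "first": "Marta", "last": "Gonzales", "dob": "1990-11-30"},
]


def similarity(a, b):
    """Ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_pair(r1, r2):
    """Average similarity over the name and date-of-birth fields."""
    fields = ("first", "last", "dob")
    return sum(similarity(r1[f], r2[f]) for f in fields) / len(fields)


def candidate_pairs(rows):
    """Blocking: only compare records that share a birth year."""
    for r1, r2 in combinations(rows, 2):
        if r1["dob"][:4] == r2["dob"][:4]:
            yield r1, r2


THRESHOLD = 0.85  # assumed cut-off; tune it against labeled pairs

matches = [
    (r1["id"], r2["id"], round(score_pair(r1, r2), 2))
    for r1, r2 in candidate_pairs(records)
    if score_pair(r1, r2) >= THRESHOLD
]
print(matches)
```

Note that record c1 is never compared with anything, because no other record shares its birth year. That is the trade-off of blocking: it keeps the number of comparisons manageable, but an error in the blocking key itself becomes a missed match.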

How generative AI helps, beyond simply adding more data

A few qualities make generative models a strong fit for this work:

They capture the shape of human error. Rather than writing rules for every typo pattern by hand, the models learn the imperfect distribution of real data.
They remove the privacy obstacle. Synthetic records contain no real protected health information, so you can share, experiment, and benchmark freely.
They provide a controllable test bed. You can raise or lower error types and rates to find exactly where a matcher breaks.
They scale. Once the pipeline exists, generating more edge cases is inexpensive.
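
The "controllable test bed" point is easiest to see as a sweep: raise the error rate step by step and watch where a matcher starts to miss. The sketch below does this with a simple rule-based typo injector instead of a trained generative model, and compares a hypothetical exact matcher against a hypothetical fuzzy one. The surname list, the 0.8 threshold, and all function names are assumptions.

```python
import random
from difflib import SequenceMatcher

# Hypothetical surname pool used to build known true-match pairs.
SURNAMES = ["Philips", "Gonzalez", "Nakamura", "O'Connor", "Schneider", "Kowalski"]


def add_typo(name, rng):
    """Swap two adjacent characters to imitate a keying slip."""
    i = rng.randrange(len(name) - 1)
    return name[:i] + name[i + 1] + name[i] + name[i + 2:]


def make_pairs(error_rate, n, rng):
    """Build n true-match pairs; the duplicate is corrupted at error_rate."""
    pairs = []
    for _ in range(n):
        original = rng.choice(SURNAMES)
        duplicate = add_typo(original, rng) if rng.random() < error_rate else original
        pairs.append((original, duplicate))
    return pairs


def exact_match(a, b):
    return a == b


def fuzzy_match(a, b, threshold=0.8):
    return SequenceMatcher(None, a, b).ratio() >= threshold


def recall(matcher, pairs):
    """Share of true-match pairs that the matcher accepts."""
    return sum(matcher(a, b) for a, b in pairs) / len(pairs)


rng = random.Random(7)  # fixed seed so the sweep is reproducible
for rate in (0.0, 0.2, 0.5, 0.8):
    pairs = make_pairs(rate, n=200, rng=rng)
    print(f"error_rate={rate:.1f}  "
          f"exact={recall(exact_match, pairs):.2f}  "
          f"fuzzy={recall(fuzzy_match, pairs):.2f}")
```

Because every pair here is a true match by construction, this sweep only measures recall. A fuller test bed would also include known non-matches so that false merges show up as well.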

What I found

The result that mattered most was clear. The generative models were able to represent the patterns of human error, and training the deduplication models on that data improved their sensitivity to true matches by a significant margin, with better accuracy and recall on the near-duplicate records that usually slip through.

The approach achieved this while removing the original obstacle. Because every record was synthetic, the improvement came without touching or exposing a single real patient record. The privacy constraint that created the problem in the first place stops being a constraint.

Practical lessons if you try this

Realism and controllability pull in opposite directions. Treat error injection as its own controllable layer rather than hoping the generator produces enough natural noise.
Synthea gives you clean bones; you supply the bruises. The simulator produces well-formed records, and the value for entity resolution comes from the error layer you add on top, modeled on the failures you actually see.
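Watch recall, but not at the expense of precision. The near-duplicates that slip through are the misses this approach targets, so improvements in sensitivity are what justify the work. Check that they do not arrive with more false merges, which remain the more dangerous error.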
Start small and focused. A tight, well-labeled synthetic set is worth more than a large, uncontrolled one.
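
To make the "watch recall" lesson measurable, here is a small sketch of pairwise precision and recall computed from labeled record pairs. The helper name and the record IDs are hypothetical. The only assumption is that you have a set of known true-match pairs, which a synthetic set gives you by construction.

```python
def pair_metrics(predicted, actual):
    """Precision and recall over sets of unordered record-id pairs."""
    predicted = {frozenset(p) for p in predicted}
    actual = {frozenset(p) for p in actual}
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall


# Known true matches in the synthetic set (hypothetical IDs).
actual = [("a1", "a2"), ("b1", "b2"), ("c1", "c2"), ("d1", "d2")]
# What the matcher linked: two correct pairs and one false merge.
predicted = [("a2", "a1"), ("b1", "b2"), ("c1", "d1")]

precision, recall = pair_metrics(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.67 recall=0.50
```

Recall counts the true matches the matcher missed, and precision counts the false merges it made. Reporting both keeps a gain in sensitivity from hiding a rise in wrongly merged patients.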

This is not only a healthcare technique

Although I work in healthcare, the same idea applies anywhere you perform entity resolution on sensitive data, such as banking customers, citizens across government systems, or identities across enterprise platforms. Wherever privacy blocks access to realistic, imperfect data, synthetic data with the errors built in is a practical way around the wall.

Conclusion

Identity resolution quietly sits underneath accurate medical histories, clean master data, and trustworthy analytics. The hard part was never only the algorithm. It was obtaining realistic, imperfect, shareable data to build against. Generating synthetic data with human errors on purpose turns that constraint into something you control.

Perfect data makes for a confident demo and a fragile system. Deliberately imperfect data does the opposite.

This post summarizes my peer-reviewed paper, "Generative AI for Synthetic Patient Data Generation to Enhance Identity Matching and Deduplication Models," International Journal of Computer Applications, Vol. 187, No. 103, pp. 32–38 (2026). Full paper: https://www.ijcaonline.org/archives/volume187/number103/generative-ai-for-synthetic-patient-data-generation-to-enhance-identity-matching-and-deduplication-models/

Saiteja Jonnalagadda is a Senior Cloud Engineer working on AI-driven, large-scale healthcare data platforms.
