Kyle Redelinghuys

Teaching a Transformer to Read DNA: How EabhaSeq Works
Kyle Redelinghuys · 2026-03-20 · via Kyle Redelinghuys

I've been building EabhaSeq for a while now, and there were probably five or six moments where I was genuinely convinced I'd cracked it. The model would produce something that looked right, I'd run the numbers, and some metric I hadn't been tracking closely enough would come back completely wrong. That cycle went on for months, and each iteration taught me something I hadn't understood before about what "correct" actually means for synthetic genomic data.

The project started as a synthetic cfDNA generator for NIPT research, backed by an Emergent Ventures grant, and the core idea is still the same: generate biologically accurate cell-free DNA so that labs can train and validate prenatal testing algorithms without needing large cohorts of patient samples. I've written about the broader project here and the open-source release here, so I won't repeat all of that. This post is specifically about the hardest technical problem I hit, what caused it, and how I fixed it, and then about what happened when I got my hands on real clinical data and tested everything end-to-end for the first time.

Dozens of tests, dozens of failures

The early versions of the model were statistically convincing. GC content distributions matched real cfDNA, fragment length profiles looked right, nucleotide transition patterns were plausible. I kept finding reasons to think the next version would be the one that held together. It never quite did.

The real test, the one that exposes everything, is alignment. You take your generated fragments, run bwa mem against GRCh38, and see how many of them actually map to the genome at reasonable mapping quality. For a long time I was getting rates under 10% for correct chromosome placement. The model had learned the texture of DNA without learning its geography. It was producing convincing-looking ATCG strings that were, genomically speaking, gibberish, and I spent a lot of time debugging why before I understood the root cause clearly enough to fix it properly.
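The alignment check described above can be scored straight from the SAM text that bwa mem emits. What follows is a minimal sketch, not EabhaSeq's actual evaluation code: the function name `alignment_stats` is hypothetical, and it assumes each synthetic read is named with its intended chromosome as a prefix (for example `chr21_000123`), which is an illustrative convention rather than anything stated in this post.

```python
def alignment_stats(sam_lines, min_mapq=30):
    """Summarise primary alignments from an iterable of SAM text lines.

    Assumption (hypothetical naming scheme): every read name starts with the
    chromosome it was generated for, followed by an underscore.
    """
    total = high_mapq = correct_chrom = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        qname, flag, rname, mapq = fields[0], int(fields[1]), fields[2], int(fields[4])
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        total += 1
        mapped = not (flag & 0x4)  # 0x4 means the read is unmapped
        if mapped and mapq >= min_mapq:
            high_mapq += 1
        if mapped and rname == qname.split("_", 1)[0]:
            correct_chrom += 1
    if total == 0:
        return {"reads": 0, "high_mapq_rate": 0.0, "correct_chrom_rate": 0.0}
    return {
        "reads": total,
        "high_mapq_rate": high_mapq / total,
        "correct_chrom_rate": correct_chrom / total,
    }


if __name__ == "__main__":
    import sys
    # Reads SAM text from standard input, e.g. the redirected output of bwa mem.
    print(alignment_stats(sys.stdin))
```

Counting only primary records keeps the denominator equal to the number of generated fragments, so a read that splits into several supplementary alignments is not counted more than once.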

The fix ended up being what I now call reference-conditioned generation, and once I understood the problem precisely the solution was reasonably clean to implement. After that fix, the share of fragments mapping at MAPQ ≥ 30 jumped to over 94%, correct chromosome placement also came out above 94%, and the distributional properties the model had always been good at remained intact. The full validation methodology is on the EabhaSeq site if you want the numbers in detail.

Then I got real T21 data

The alignment fix was the technical unlock, but the real validation came from getting access to actual karyotype-confirmed clinical samples. The dataset I used was from Lun et al. 2014 (PRJNA215135), 26 real clinical cfDNA samples including 6 confirmed trisomy 21 cases and 20 euploid controls. I had not touched this data at any point during training or development. It was purely a held-out test.

I trained a classifier entirely on synthetic data, zero real patient samples in training, and ran it blind on the 26 real clinical samples. The TSTR AUC (train synthetic, test real) came out at 0.87, and 0.98 when excluding one genuinely anomalous sample with a fetal fraction so low (~1-2%) that no chromosome-fraction NIPT method would detect it. Four of the six T21 cases in that dataset would fail standard NIPT by z-score, with z-scores of 2.47, 1.57, 1.29, and 0.86. The synthetic-trained model correctly ranked all four above the euploid samples (excluding the extreme anomaly). The permutation test on 1,000 random label shuffles came back at p = 0.002, so the result isn't noise.
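For readers who want to see how the three statistics in this paragraph are typically computed, here is a minimal standard-library sketch. It is not the EabhaSeq evaluation code: the function names are hypothetical, the z-score is the generic chromosome-fraction form measured against a euploid reference set, and the add-one correction in the permutation p-value is one common convention that I am assuming, not one the post specifies.

```python
import random
from statistics import mean, stdev


def chr_fraction_z(sample_fraction, reference_fractions):
    """z-score of one sample's chromosome read fraction against euploid references."""
    return (sample_fraction - mean(reference_fractions)) / stdev(reference_fractions)


def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p_value(scores, labels, n_perm=1000, seed=0):
    """Share of label shuffles whose AUC is at least the observed AUC."""
    rng = random.Random(seed)
    observed = auc(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auc(scores, shuffled) >= observed:
            hits += 1
    # Add-one correction (an assumed convention) so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

In a TSTR setup the `scores` would come from the classifier trained only on synthetic data and the `labels` from the karyotype-confirmed real samples. The pairwise AUC is quadratic in sample count, which is perfectly adequate for a cohort of 26.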

The augmentation results were equally interesting. When you add synthetic data to a training set with only a handful of real samples, which is the reality for rare aneuploidies, the improvement is substantial. With just 3 real training samples, adding synthetic data improved AUC by 19.7 points, from 0.661 to 0.858, and beat the real-only baseline in 88% of 100 random stratified splits. The variance also drops, which matters as much as the mean for clinical applications. Results from WisecondorX, the clinical-grade NIPT tool used in European laboratories, showed 100% detection of T21, T18, and T13 at 8% fetal fraction using a purely synthetic reference panel.
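The split-level summary quoted above (mean AUC gain, win rate, and spread) can be derived from paired per-split AUCs. The sketch below assumes you already have one real-only AUC and one augmented AUC for each random split; the function name `compare_splits` and the toy numbers in the usage lines are illustrative only and are not results from EabhaSeq.

```python
from statistics import mean, pstdev


def compare_splits(real_only_aucs, augmented_aucs):
    """Compare paired per-split AUCs for real-only vs real-plus-synthetic training."""
    if not real_only_aucs or len(real_only_aucs) != len(augmented_aucs):
        raise ValueError("need two non-empty AUC lists of equal length")
    # A split counts as a win only when augmentation strictly improves AUC.
    wins = sum(a > r for r, a in zip(real_only_aucs, augmented_aucs))
    return {
        "mean_real_only": mean(real_only_aucs),
        "mean_augmented": mean(augmented_aucs),
        "delta_points": 100 * (mean(augmented_aucs) - mean(real_only_aucs)),
        "win_rate": wins / len(real_only_aucs),
        "sd_real_only": pstdev(real_only_aucs),
        "sd_augmented": pstdev(augmented_aucs),
    }


# Toy values for illustration only.
summary = compare_splits([0.6, 0.7, 0.8, 0.5], [0.8, 0.9, 0.7, 0.9])
print(summary["win_rate"], round(summary["delta_points"], 1))
```

Pairing the two AUCs on the same split matters: it removes split-to-split difficulty from the comparison, so the win rate reflects the effect of the synthetic data and not the luck of the draw.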

Full results, methodology, and per-patient breakdowns are at eabhaseq.com/validation and eabhaseq.com/validation/details.

I've been building this for long enough that getting those numbers back felt genuinely surprising. The approach works, and it works at a level that's useful for real clinical algorithm development, not just as a research curiosity.


The rest of this article covers the specific technical problem that caused the alignment failures and exactly how reference-conditioned generation solves it. It's a fairly deep dive into the architecture, the dual-level conditioning mechanism, and how the generation loop works in practice.

This content is for subscribers. If you want to follow along with the technical details of how EabhaSeq works under the hood, and get future posts on the validation work, the 107 conditions the platform now covers, and what real clinical validation looks like, subscribing gets you all of that.
