How to Build an AI-Powered Medical Image De-Identification Pipeline for Clinical Research
Lakshmi Maha · 2026-05-22 · via freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More

Medical imaging is transforming healthcare. Researchers are training deep learning models to detect pneumonia from chest X-rays, estimate cardiac function from echocardiograms, and identify tumors from MRI scans. But before any of these images can be shared with researchers or used to train machine learning models, one critical challenge must be solved.

How Do We Protect Patient Privacy?

Medical images often contain sensitive information such as patient names, dates of birth, hospital identifiers, and accession numbers. Some of this information is stored in DICOM (Digital Imaging and Communications in Medicine) metadata, but much of it is also burned directly into the image pixels.

In this tutorial, you’ll learn how to build an AI-powered de-identification pipeline that removes protected health information (PHI) from both metadata and image pixels. Along the way, we’ll explore OCR (Optical Character Recognition), NER (Named Entity Recognition), and standards-based DICOM processing.

At the end, I’ll show how I combined these ideas into an open-source PyTorch project called Aegis.

What You’ll Build

In this tutorial, you’ll build a custom MONAI (PyTorch) preprocessing pipeline that automatically de-identifies medical images before they are used for clinical research or AI model training.

The pipeline will:

  • Discover DICOM studies

  • Load metadata and pixel data

  • Detect burned-in text using OCR

  • Classify text as PHI or non-PHI

  • Redact sensitive pixel regions

  • Remove PHI from DICOM metadata

  • Save privacy-safe images for downstream AI workflows

By the end, you’ll have a reusable MONAI transform that can be integrated directly into any medical imaging workflow to prepare privacy-safe datasets for research and deep learning.

Prerequisites

To follow this tutorial, you should have:

  • Intermediate Python experience

  • Basic understanding of PyTorch

  • Familiarity with medical imaging concepts

  • Python 3.10 or later

We’ll use:

  • MONAI

  • pydicom

  • EasyOCR

  • NumPy

  • Transformers

  • The Stanford de-identifier NER model (StanfordAIMI/stanford-deidentifier-base, loaded through Transformers)

Set Up the Environment

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate        # On Windows: venv\Scripts\activate

# Upgrade pip
pip install --upgrade pip

# Install the core libraries used in this tutorial
pip install \
    monai \
    pydicom \
    easyocr \
    numpy \
    transformers \
    torch 

# Download the Stanford medical de-identification model from Hugging Face
python -c "
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = 'StanfordAIMI/stanford-deidentifier-base'
AutoTokenizer.from_pretrained(model_name)
AutoModelForTokenClassification.from_pretrained(model_name)
print('Stanford NER model downloaded successfully.')
"

Why Privacy Matters in Medical Imaging

Healthcare organizations generate enormous volumes of imaging data every day. These datasets are invaluable for:

  • Clinical research

  • Multi-center collaborations

  • Regulatory submissions

  • Artificial intelligence model development

  • Educational datasets

But privacy regulations such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States require that Protected Health Information (PHI) be removed before data can be shared. This creates a significant bottleneck.

Many hospitals still rely on manual review to inspect thousands of images, searching for patient identifiers hidden in metadata and image annotations. This process is slow, expensive, and prone to human error.

Automated de-identification addresses this bottleneck by combining software engineering, computer vision, and natural language processing.

Understanding PHI, HIPAA, and DICOM

What Is PHI?

Protected Health Information (PHI) includes any information that can identify a patient, such as:

Name
Medical record number
Date of birth
Study date
Hospital ID
Accession number

What Is HIPAA?

The Health Insurance Portability and Accountability Act (HIPAA) defines rules for safeguarding patient data. One common approach is the Safe Harbor method, which requires removing 18 specified categories of identifiers (such as names, dates more specific than the year, and medical record numbers) before data is shared.

What Is DICOM?

Medical images such as Computed Tomography (CT), Magnetic Resonance Imaging (MRI), and Ultrasound (US) are commonly stored in the DICOM (Digital Imaging and Communications in Medicine) format, the international standard for storing and exchanging medical imaging data.

Unlike ordinary image formats such as JPEG or PNG, a DICOM file contains both the image itself and a rich set of structured metadata that describes the patient, the study, and the imaging procedure.

A typical DICOM file contains two main components:

  1. Pixel Data – the actual medical image, such as a CT slice, MRI volume, or ultrasound frame.

  2. Metadata – structured fields that may include:

    • Patient name and medical record number

    • Date of birth

    • Study and acquisition dates

    • Imaging modality (CT, MRI, US)

    • Scanner manufacturer and technical acquisition parameters

This combination makes DICOM far more than just an image format. It serves as a standardized container that allows imaging devices, hospital systems, and research software to exchange data reliably and consistently.

Because DICOM metadata often contains protected health information (PHI), and because identifiers may also be burned directly into the image pixels, particularly in ultrasound studies, both the metadata and the pixel data must be addressed during de-identification before images can be safely shared for clinical research or AI development.

Many tools remove PHI only from metadata. For example, deleting the PatientName tag may appear sufficient.

But in modalities such as ultrasound, fluoroscopy, and some X-ray workflows, identifying information is often burned directly into the image.

Common examples include:

NAME: JOHN DOE
DOB: 01/01/1980
MRN: 123456
HOSPITAL: ABC

If these annotations remain, privacy is still compromised. This means a complete solution must inspect both:

  • DICOM metadata

  • Image pixels

OCR and AI for Identifying PHI

To detect PHI embedded in pixels, we first need to find all visible text.

Step 1: Optical Character Recognition (OCR)

OCR converts image text into machine-readable strings.

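import easyocr

# Load the English recognition model (weights are downloaded on first use)
reader = easyocr.Reader(['en'])
# Returns a list of (bounding box, text, confidence) tuples
results = reader.readtext('ultrasound.png')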

Each OCR result typically includes:

  • Bounding box coordinates – where the text appears in the image

  • Extracted text – the recognized characters

  • Confidence score – how certain the model is about the result

Example:

[
  ([[10, 20], [120, 20], [120, 45], [10, 45]], 'JOHN DOE', 0.98)
]
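The four corner points in each result need to become a simple rectangle before they can be used to slice an image. The following is a minimal sketch, assuming the result format shown above; `TextBox` and `to_text_boxes` are illustrative names and not part of EasyOCR.

```python
import math
from typing import List, NamedTuple


class TextBox(NamedTuple):
    """Axis-aligned rectangle around one piece of detected text."""
    x1: int
    y1: int
    x2: int
    y2: int
    text: str
    confidence: float


def to_text_boxes(results) -> List[TextBox]:
    """Convert (corners, text, confidence) tuples into TextBox records."""
    boxes = []
    for corners, text, confidence in results:
        xs = [point[0] for point in corners]
        ys = [point[1] for point in corners]
        # Round outward so slightly rotated text stays fully covered
        boxes.append(
            TextBox(
                math.floor(min(xs)),
                math.floor(min(ys)),
                math.ceil(max(xs)),
                math.ceil(max(ys)),
                text.strip(),
                float(confidence),
            )
        )
    return boxes


results = [([[10, 20], [120, 20], [120, 45], [10, 45]], "JOHN DOE", 0.98)]
boxes = to_text_boxes(results)
```

The sketch keeps low-confidence detections rather than filtering them out, on the assumption that a misread name is still a name and is safer to classify than to ignore.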

Step 2: Determine Whether the Text Is PHI

Not all detected text should be removed.

Medical images also contain clinically relevant labels such as:

LEFT VENTRICLE
APICAL VIEW
B-MODE

To distinguish PHI from legitimate clinical text, we can combine:

  1. Allowlists of known clinical terms

  2. Regular-expression heuristics

  3. Named Entity Recognition (NER)

Step 3: Named Entity Recognition

NER models identify entities such as:

PERSON
DATE
LOCATION
ID
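
A simplified classifier can combine these signals. Here, looks_like_date, looks_like_identifier, and ner_model are placeholders for your own helpers:

def contains_phi(text):
    # Cheap rule-based checks run first
    if looks_like_date(text):
        return True
    if looks_like_identifier(text):
        return True
    # Otherwise defer to the NER model
    return ner_model.predict(text)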

This hybrid approach reduces both false positives and false negatives.
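The helpers `looks_like_date` and `looks_like_identifier` are not defined in the snippet above. One hypothetical way to fill them in is sketched here, with an allowlist check added in front. The patterns, the allowlist contents, and the `ner` callable (any function that takes a string and returns True for PHI) are assumptions for illustration, not the rules Aegis ships with.

```python
import re
from typing import Callable

# Hypothetical allowlist; a real one would be larger and modality-specific
CLINICAL_ALLOWLIST = {"LEFT VENTRICLE", "APICAL VIEW", "B-MODE"}

# Dates such as 01/01/1980, 1-1-80, or 1980-01-01
DATE_PATTERN = re.compile(
    r"\b(?:\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}[/-]\d{1,2}[/-]\d{1,2})\b"
)
# Labelled fields (NAME:, DOB:, MRN#, ...) or runs of five or more digits
IDENTIFIER_PATTERN = re.compile(
    r"\b(?:NAME|DOB|MRN|ID|ACC|HOSPITAL)\s*[:#]|\b\d{5,}\b"
)


def looks_like_date(text: str) -> bool:
    return DATE_PATTERN.search(text) is not None


def looks_like_identifier(text: str) -> bool:
    return IDENTIFIER_PATTERN.search(text) is not None


def contains_phi(text: str, ner: Callable[[str], bool]) -> bool:
    """Allowlist first, then regex rules, then the NER model as a fallback."""
    normalized = " ".join(text.upper().split())
    if normalized in CLINICAL_ALLOWLIST:
        return False
    if looks_like_date(normalized) or looks_like_identifier(normalized):
        return True
    return ner(normalized)
```

Ordering the checks this way means known clinical labels never reach the model, and the model only runs on text the rules could not decide.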

Pixel Redaction

Once PHI is detected, the corresponding image regions can be masked.

image[y1:y2, x1:x2] = 0

This replaces the sensitive area with black pixels.
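In practice the slice usually needs a little padding and bounds checking, since OCR boxes can hug the text tightly or extend past the image edge. A minimal sketch, assuming boxes are (x1, y1, x2, y2) tuples in pixel coordinates; `redact_boxes` and its `padding` parameter are illustrative names.

```python
import numpy as np


def redact_boxes(image: np.ndarray, boxes, padding: int = 2) -> np.ndarray:
    """Return a copy of the image with every box blacked out."""
    redacted = image.copy()
    height, width = redacted.shape[:2]
    for x1, y1, x2, y2 in boxes:
        # Grow the box slightly, then clamp it to the image bounds
        left = max(0, x1 - padding)
        top = max(0, y1 - padding)
        right = min(width, x2 + padding)
        bottom = min(height, y2 + padding)
        redacted[top:bottom, left:right] = 0
    return redacted


image = np.full((50, 130), 255, dtype=np.uint8)
clean = redact_boxes(image, [(10, 20, 120, 45)])
```

Working on a copy keeps the original array available for comparison or review, and indexing only the first two axes lets the same function handle grayscale and color frames.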

DICOM Metadata Scrubbing

Using pydicom, metadata fields can be modified or removed.

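import pydicom

ds = pydicom.dcmread('study.dcm')
# Overwrite the name so the original value does not survive
ds.PatientName = 'ANONYMIZED'
# del raises AttributeError if the element is absent, so check first
if 'PatientBirthDate' in ds:
    del ds.PatientBirthDate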

Additional steps may include:

  • Removing private tags

  • Replacing UIDs

  • Recursively processing nested sequences

Together, metadata scrubbing and pixel redaction provide comprehensive de-identification.

Building the Complete Pipeline

Figure: Step-by-step workflow for medical image de-identification.

The overall workflow looks like this:

  1. Discover medical image files

  2. Load DICOM metadata and pixel data

  3. Run OCR on annotation regions

  4. Classify text as PHI or non-PHI

  5. Redact sensitive pixel regions

  6. Remove PHI from metadata

  7. Save the de-identified output
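The first step, discovering files, is less trivial than it sounds because DICOM files often have no extension. A minimal sketch that relies on the "DICM" marker stored after the 128-byte preamble of a standard DICOM file; `is_dicom_file` and `discover_dicom_files` are illustrative names.

```python
from pathlib import Path
from typing import Iterator


def is_dicom_file(path: Path) -> bool:
    """Check for the 'DICM' marker that follows the 128-byte preamble."""
    try:
        with path.open("rb") as handle:
            handle.seek(128)
            return handle.read(4) == b"DICM"
    except OSError:
        return False


def discover_dicom_files(root) -> Iterator[Path]:
    """Yield every DICOM file under root, in a stable sorted order."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and is_dicom_file(path):
            yield path
```

Some legacy files omit the preamble and would be skipped by this check, so a production pipeline may want a fallback that attempts a full parse.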

Challenges and Lessons Learned

Building a production-ready de-identification system involves many practical challenges.

Clinical Terminology

OCR may detect legitimate labels that should not be removed.
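OCR misreads make exact allowlist lookups brittle, because a label such as LEFT VENTRICLE may come back with a digit in place of a letter. One possible mitigation is fuzzy matching with the standard library's `difflib`. This is a sketch under that assumption. The term list, the 0.85 cutoff, and `match_clinical_term` are hypothetical.

```python
import difflib
from typing import Optional

# Hypothetical allowlist of clinical labels
CLINICAL_TERMS = ["LEFT VENTRICLE", "APICAL VIEW", "B-MODE"]


def match_clinical_term(text: str, cutoff: float = 0.85) -> Optional[str]:
    """Return the closest allowlisted term, or None if nothing is similar enough."""
    normalized = " ".join(text.upper().split())
    matches = difflib.get_close_matches(normalized, CLINICAL_TERMS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


label = match_clinical_term("LEFT VENTR1CLE")  # OCR read "I" as "1"
name = match_clinical_term("JOHN DOE")
```

The cutoff is a trade-off: a looser value tolerates more OCR noise but raises the chance that a patient name resembling a clinical term slips through unredacted.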

OCR Errors

Low-contrast text and ultrasound overlays can produce inaccurate detections.
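A common mitigation is to stretch the image contrast before running OCR. The sketch below uses a percentile-based linear stretch with NumPy. It is one option among many and is not necessarily what Aegis does. `stretch_contrast` and its percentile defaults are assumptions.

```python
import numpy as np


def stretch_contrast(image: np.ndarray, low_pct: float = 2.0, high_pct: float = 98.0) -> np.ndarray:
    """Linearly rescale intensities between two percentiles to the 0-255 range."""
    pixels = image.astype(np.float32)
    low, high = np.percentile(pixels, [low_pct, high_pct])
    if high <= low:
        # A flat image has nothing to stretch (and no text to find)
        return np.zeros(image.shape, dtype=np.uint8)
    scaled = (pixels - low) / (high - low)
    return (np.clip(scaled, 0.0, 1.0) * 255).astype(np.uint8)


faint = np.array([[100, 110], [120, 130]], dtype=np.uint8)
enhanced = stretch_contrast(faint, low_pct=0.0, high_pct=100.0)
```

EasyOCR's `readtext` accepts a NumPy array as well as a file path, so the enhanced frame can be passed to it directly.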

Nested DICOM Sequences

PHI may appear in deeply nested metadata structures.

Multi-Frame Studies

Ultrasound cine loops may contain dozens or hundreds of frames.
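Running OCR on every frame can be expensive. If the overlay text is static across the loop, which is an assumption worth verifying for each scanner, one approach is to run detection on a few sampled frames and apply the union of the boxes to all frames. In this sketch, `detect_phi_boxes` is a hypothetical callable that maps a frame to a list of (x1, y1, x2, y2) boxes.

```python
import numpy as np


def sample_frame_indices(num_frames: int, max_samples: int = 5):
    """Pick up to max_samples evenly spaced frame indices, including first and last."""
    if num_frames <= max_samples:
        return list(range(num_frames))
    positions = np.linspace(0, num_frames - 1, max_samples)
    return sorted({int(round(float(p))) for p in positions})


def redact_cine_loop(frames: np.ndarray, detect_phi_boxes, max_samples: int = 5) -> np.ndarray:
    """frames is shaped (num_frames, height, width[, channels])."""
    boxes = set()
    for index in sample_frame_indices(frames.shape[0], max_samples):
        boxes.update(tuple(box) for box in detect_phi_boxes(frames[index]))
    redacted = frames.copy()
    for x1, y1, x2, y2 in boxes:
        # Mask the same region in every frame at once
        redacted[:, y1:y2, x1:x2] = 0
    return redacted
```

Text that appears only on unsampled frames would be missed, so raising `max_samples` or sampling every frame is the safer choice when overlays change during the loop.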

Deterministic Pseudonymization

Researchers often need the same patient to receive the same replacement identifier across studies.
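A keyed hash is one way to get this property: the same input always maps to the same pseudonym, but the mapping cannot be reproduced without the key. This is a minimal sketch using the standard library. `pseudonymize`, the `ANON-` prefix, and the 16-character length are illustrative choices.

```python
import hashlib
import hmac


def pseudonymize(patient_id: str, secret_key: bytes, prefix: str = "ANON-") -> str:
    """Derive a stable pseudonym from a patient identifier with HMAC-SHA256."""
    message = patient_id.strip().encode("utf-8")
    digest = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    return prefix + digest[:16].upper()


key = b"replace-with-a-project-secret"  # placeholder; store real keys securely
first = pseudonymize("123456", key)
second = pseudonymize("123456", key)
```

Using a secret key rather than a plain hash matters because medical record numbers are short, and an unkeyed hash of a short identifier can be reversed by trying every possible value.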

These challenges require thoughtful engineering rather than a single machine learning model.

How I Built Aegis

While exploring this problem, I developed an open-source project called Aegis, built on MONAI and PyTorch.

Aegis combines:

  • OCR-based text detection

  • AI-driven PHI classification

  • Pixel-level redaction

  • Standards-based DICOM de-identification

  • Batch processing for research workflows

Key Design Decisions

Standards First

I aligned metadata scrubbing with the Basic Application Level Confidentiality Profile defined in the DICOM standard (PS3.15, Annex E) to follow established healthcare standards.

Hybrid AI + Rules

Clinical allowlists, heuristics, and NER models work together to improve accuracy.

Ultrasound-Specific Optimization

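Aegis uses the DICOM SequenceOfUltrasoundRegions attribute (0018,6011), which describes where the scanned image regions sit within each frame, to focus OCR on annotation areas instead of scanning the entire image.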
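To illustrate the idea, the sketch below builds a mask of pixels that fall outside every declared ultrasound region, which is where overlay text typically lives. This is an assumption about how such a restriction could work, not a description of the Aegis code. `annotation_mask` is a hypothetical name, `ds` is expected to behave like a pydicom Dataset, and the region bounds are treated as inclusive.

```python
import numpy as np


def annotation_mask(ds, height: int, width: int) -> np.ndarray:
    """Return a boolean mask that is True outside every declared ultrasound region."""
    mask = np.ones((height, width), dtype=bool)
    regions = getattr(ds, "SequenceOfUltrasoundRegions", None) or []
    for region in regions:
        # Clamp each region to the frame; +1 because bounds are treated as inclusive
        x0 = max(0, int(region.RegionLocationMinX0))
        y0 = max(0, int(region.RegionLocationMinY0))
        x1 = min(width, int(region.RegionLocationMaxX1) + 1)
        y1 = min(height, int(region.RegionLocationMaxY1) + 1)
        mask[y0:y1, x0:x1] = False
    return mask
```

When the attribute is missing, the mask is True everywhere, so the pipeline falls back to scanning the whole frame instead of silently skipping OCR.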

Deterministic Identity Management

Consistent pseudonyms enable longitudinal research while protecting privacy.

Open Source Architecture

The project is modular, testable, and designed to integrate with research pipelines.

You can explore the full implementation in the Aegis GitHub repository:

https://github.com/lakshmi-mahabaleshwara/aegis

Future Directions

Automated de-identification continues to evolve.

Future enhancements may include:

  • Multilingual OCR

  • Handwriting recognition

  • Vision-language models

  • Human-in-the-loop review

  • Cloud-native deployment

  • Integration with AI training pipelines

As healthcare AI expands, privacy-preserving data preparation will become even more important.

Conclusion

Clinical research depends on access to high-quality medical imaging data.

But privacy regulations require that patient identifiers be removed from both DICOM metadata and image pixels.

By combining OCR, named entity recognition, pixel redaction, and standards-based DICOM processing, we can automate this task and dramatically reduce the burden of manual review.

The techniques covered in this tutorial are applicable far beyond a single project.

Whether you’re building a hospital data pipeline, preparing research datasets, or training the next generation of healthcare AI models, automated de-identification is a foundational capability.

To put these ideas into practice, I built Aegis as an open-source reference implementation.

More importantly, the underlying concepts can help developers and researchers create privacy-safe workflows that accelerate innovation while respecting patient confidentiality.
