Migrating a Long-Running WordPress Site to Payload CMS (And All The Chaos That Came With It)
Nadia Sultan · 2026-05-21 · via DEV Community

TL;DR for the skimmers:

  • Design the Payload schema before writing any migration code
  • Use a wpId field on every collection for idempotent upserts
  • Expect multiple media migration passes: the XML export won't contain everything
  • The HTML-to-Lexical parser will dominate the timeline; budget accordingly
  • Separate migration scripts per content type are fine. Don't build abstractions that are going to be deleted anyway.

If you've ever opened a WordPress XML export, scrolled for a while, and quietly wondered who let this happen, this post is for you.

Content migrations from WordPress to a headless CMS look simple on paper. Export. Transform. Load. Three words, absolutely not innocent in practice. Most of the work isn't moving data, it's figuring out what the data means and where and how it should live in the new system.

Recently I had to migrate a long-running SaaS documentation site from WordPress to Payload CMS. Over the years the site had been through a commercial page builder like Divi, an SEO plugin, a custom FAQ plugin, and at least three different image upload mechanisms.

This post is about what worked, what didn't, and the specific patterns that held up when the script ran for the 30th time at 1 AM and finally produced clean output.

1. Design the Payload Schema Before Writing Any Migration Code

The migration script is throwaway code. The Payload collection is what the editorial team lives inside for the next several years. Spending a week on schema decisions before writing a single line of migration code is what keeps the project from becoming a rewrite of WordPress with extra steps.

Three decisions in this project carried more weight than everything else combined:

Model hierarchy with a self-referencing relationship, then derive the path from it. The documentation pages, policy pages, and integration catalog all had parent-child relationships. Some three or four levels deep. Each collection has a parent field pointing at the same collection, plus a read-only fullSlug and fullUrl that get recomputed in a beforeChange hook by walking up the parent chain:

// policy-pages collection — beforeChange hook
async ({ data, req }) => {
  if (!data?.slug) return data
  const fullSlug = await generateFullSlug(req.payload, data.slug, data.parent)
  data.fullSlug = fullSlug
  data.fullUrl = `/${fullSlug}`
  return data
}

async function generateFullSlug(payload, slug, parentId) {
  if (!parentId) return `policies/${slug}`
  const parent = await payload.findByID({
    collection: 'policy-pages', id: parentId, depth: 0,
  })
  if (parent?.fullSlug) return `${parent.fullSlug}/${slug}`
  return `policies/${slug}`
}


This does two things. First, renaming a parent page doesn't permanently break its children: the next save of each child recomputes its full path from the parent. Second, the unique constraint goes on fullSlug (like docs/integrations/[provider]/setup), not on the individual slug. Multiple pages called "setup" under different parents are fine, because each resolves to a unique full path. WordPress handles this with friendly URL aliases. Payload handles it with a derived path. Both work, but the model needs to be committed to before the migration starts.
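
The derived-path idea can also be checked outside Payload, for example during a dry run. The following is a minimal sketch, assuming the pages are already loaded into an in-memory map; the PageRef shape, the buildFullSlug name, and the rootPrefix parameter are hypothetical and not part of the original scripts.

```typescript
// Hypothetical in-memory shape; the real documents live in Payload.
interface PageRef {
  id: number
  slug: string
  parent: number | null
}

// Walk up the parent chain and join the slugs from root to leaf.
// `rootPrefix` (e.g. "docs") is an assumed per-collection prefix.
function buildFullSlug(
  pageId: number,
  pages: Map<number, PageRef>,
  rootPrefix: string,
): string {
  const segments: string[] = []
  const seen = new Set<number>()
  let current = pages.get(pageId)

  while (current) {
    if (seen.has(current.id)) {
      throw new Error(`Parent cycle detected at page ${current.id}`)
    }
    seen.add(current.id)
    segments.unshift(current.slug)
    current = current.parent === null ? undefined : pages.get(current.parent)
  }

  return [rootPrefix, ...segments].join('/')
}

const pages = new Map<number, PageRef>([
  [1, { id: 1, slug: 'integrations', parent: null }],
  [2, { id: 2, slug: 'provider-a', parent: 1 }],
  [3, { id: 3, slug: 'setup', parent: 2 }],
  [4, { id: 4, slug: 'provider-b', parent: 1 }],
  [5, { id: 5, slug: 'setup', parent: 4 }],
])
```

Both pages named "setup" resolve to different full paths here, which is why the unique constraint belongs on the derived value rather than on the slug.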

Stamp every migrated document with its WordPress ID. A wpId number field, hidden in the sidebar, read-only. It exists for the migration script, not for the editors. This is what makes re-running the script safe instead of catastrophic. Section 2 covers the mechanics, but the schema decision needs to happen here: if the field doesn't exist on the collection, none of the idempotency logic works.
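
As a rough illustration, the migration-related fields might look like the sketch below. These are plain objects shaped like Payload field configs, not the project's actual schema; the index settings and the exact admin options are assumptions on my part.

```typescript
// Plain objects shaped like Payload field configs. In a real project they
// would sit in the `fields` array of the collection config.
const migrationFields = [
  {
    name: 'parent',
    type: 'relationship',
    relationTo: 'policy-pages', // self-reference: points at the same collection
  },
  {
    name: 'fullSlug',
    type: 'text',
    unique: true, // uniqueness lives on the derived path, not on `slug`
    index: true,
    admin: { readOnly: true },
  },
  {
    name: 'wpId',
    type: 'number',
    index: true, // assumption: indexed so lookups by WordPress ID stay cheap
    admin: { position: 'sidebar', readOnly: true },
  },
] as const
```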

Decide which WordPress structures become their own collections vs. blocks within a collection. The source site had landing pages, policy pages, documentation, an integration catalog, FAQs, and blog posts. These all lived in WordPress's posts and pages tables, differentiated by parent IDs, custom post types, and template names. In Payload they became separate collections because each had fundamentally different fields: integration pages have a category relationship and platform-specific metadata. Documentation pages have a breadcrumbs array and cross-linking to related articles. Policy pages have none of that. FAQs are just a question/answer pair. If these had been shoe-horned into one pages collection with conditional fields, the admin panel would have been unusable.

2. Use a wpId Field for Idempotent Migrations

The migration script runs more than once. A lot more than once. Different subsets, different flags, different content types, recovery attempts after something breaks. In this project the documentation migration script alone ran dozens of times before everything settled.

Without a way to identify "is this WordPress post already in Payload?", duplicates pile up on every run. Once is annoying. Thirty times is a cleanup project of its own.

The fix is a small in-memory map populated at the start of every run:

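// WordPress post ID → Payload document ID, rebuilt at the start of every run
const wpIdToPayloadId = new Map<number, number>()

async function loadWpIdMap(payload: any) {
  const result = await payload.find({
    collection: 'docs',
    limit: 500, // must exceed the collection size, or later documents are missed
    select: { id: true, wpId: true },
  })
  result.docs.forEach((doc: any) => {
    if (doc.wpId) wpIdToPayloadId.set(doc.wpId, Number(doc.id))
  })
}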


Then during import:

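const existingId = wpIdToPayloadId.get(wpPost.id)
if (existingId !== undefined) {
  // Already migrated on an earlier run: update in place
  await payload.update({ collection: 'docs', id: existingId, data })
} else {
  const created = await payload.create({
    collection: 'docs',
    data: { ...data, wpId: wpPost.id },
  })
  // Record the new ID so children processed later in this run can find their parent
  wpIdToPayloadId.set(wpPost.id, Number(created.id))
}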


The script is now idempotent. Same input produces the same result. This unlocks everything else: staging refreshes, retry logic, partial reruns when a parsing bug surfaces halfway through a batch.

Seven collections in this project ended up with wpId fields. Every migration script uses the same map-and-upsert pattern. It's boring, and that's the point.
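
To make the idempotency claim concrete, here is a small self-contained sketch of the same map-and-upsert pattern against an in-memory stand-in. FakeCollection and runMigration are hypothetical names for illustration; the real scripts talk to Payload's Local API instead.

```typescript
interface Doc {
  id: number
  wpId: number
  title: string
}

// In-memory stand-in for a Payload collection (hypothetical, for illustration).
class FakeCollection {
  private docs = new Map<number, Doc>()
  private nextId = 1

  create(data: Omit<Doc, 'id'>): Doc {
    const doc: Doc = { id: this.nextId++, ...data }
    this.docs.set(doc.id, doc)
    return doc
  }

  update(id: number, data: Partial<Omit<Doc, 'id'>>): Doc {
    const existing = this.docs.get(id)
    if (!existing) throw new Error(`No document with id ${id}`)
    const updated: Doc = { ...existing, ...data }
    this.docs.set(id, updated)
    return updated
  }

  all(): Doc[] {
    return Array.from(this.docs.values())
  }
}

function runMigration(
  collection: FakeCollection,
  wpPosts: { id: number; title: string }[],
): void {
  // Rebuild the wpId → id map at the start of every run.
  const wpIdToId = new Map<number, number>()
  for (const doc of collection.all()) wpIdToId.set(doc.wpId, doc.id)

  for (const post of wpPosts) {
    const existingId = wpIdToId.get(post.id)
    if (existingId !== undefined) {
      collection.update(existingId, { title: post.title })
    } else {
      const created = collection.create({ wpId: post.id, title: post.title })
      wpIdToId.set(post.id, created.id)
    }
  }
}

const collection = new FakeCollection()
const posts = [
  { id: 101, title: 'Setup' },
  { id: 102, title: 'Webhooks' },
]
runMigration(collection, posts)
runMigration(collection, posts) // second run updates instead of duplicating
```

After both runs the collection still holds two documents, one per WordPress ID.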

3. Media Migration Won't Be One Script

This is where the clean narrative "write a media resolver" falls apart.

WordPress had a complicated relationship with its own media library. The same image could appear in body HTML as a full CDN URL, a WordPress-generated size variant (screenshot-300x200.png), an artifact from a half-completed format migration (screenshot.png.webp), a protocol-less reference, or a relative path with no domain.

And that's just the images that were in the media library. This particular site also had:

  • 60+ images uploaded directly to the server via FTP years ago, never registered in WordPress's media library at all. They lived on disk but had no database entry. The XML export didn't mention them.
  • ~100 theme images referenced from the child theme's CSS, such as retina variants at 2x and 3x, icon sprites, and background patterns. These were referenced in stylesheets, not in post content, but the new site needed them in Payload's media collection so the storage adapter could serve them from S3.

That's three separate media migration scripts. The first handles the XML export. The second is a hardcoded list of FTP URLs discovered by crawling the staging server. The third is a hardcoded list of theme asset paths. None of them shared code with each other beyond the Payload create call.

The first script also had a subtle bug that took weeks to surface. Retina images like icon@2x.png and logo@3x.png were all failing silently. The WordPress XML stored their attachment URLs with %40 (URL-encoded @) in the filename: icon%402x.png. The migration script extracted the filename with path.basename(new URL(...).pathname) which doesn't decode URL encoding. Payload received icon%402x.png instead of icon@2x.png. Thirteen images affected. The fix was a one-liner (decodeURIComponent), but finding it required inspecting the XML by hand and wondering why only images with @ in the name were broken. This is the kind of bug that only shows up when you stop looking at the script and start looking at the data.
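
A minimal sketch of a filename extractor that includes the decoding step follows. It is not the project's actual helper: the ASSUMED_SITE_ORIGIN placeholder and the fallback for malformed escapes are assumptions, added so that protocol-less and relative references parse as well.

```typescript
// Base URL used to resolve protocol-less and relative references.
// "https://example.com" is a placeholder, not the real site.
const ASSUMED_SITE_ORIGIN = 'https://example.com'

function extractFilename(src: string): string {
  const { pathname } = new URL(src, ASSUMED_SITE_ORIGIN)
  const lastSegment = pathname.split('/').pop() ?? ''
  try {
    // The URL parser leaves %40 as-is, so decode it back to "@".
    return decodeURIComponent(lastSegment)
  } catch {
    // Malformed escape sequence: fall back to the raw segment.
    return lastSegment
  }
}
```

With this in place, icon%402x.png and icon@2x.png both come out as icon@2x.png, so they land on the same key in any filename lookup.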

In this project the content-body media resolver (used during HTML-to-Lexical conversion) was a lookup function, not a migration script:

function findMediaId(src: string): number | null {
  const filename = extractFilename(src)
  const baseName = getBaseFilename(filename) // strips .png.webp → base

  // 1. Exact base match
  if (mediaFilenameToId.has(baseName)) return mediaFilenameToId.get(baseName)!

  // 2. Full lowercase filename
  if (mediaFilenameToId.has(filename.toLowerCase())) return mediaFilenameToId.get(filename.toLowerCase())!

  // 3. Partial match (handles WP size variants like -300x200)
  for (const [key, id] of mediaFilenameToId) {
    if (key.startsWith(baseName + '-') || baseName.startsWith(key + '-')) return id
  }

  // 4. Substring match (catches double-extension edge cases)
  for (const [key, id] of mediaFilenameToId) {
    if (key.startsWith(baseName + '.') || key === baseName) return id
  }

  missingMediaSrcs.add(`${filename}::${currentPageSlug}`)
  return null
}


A tiered lookup: exact match, then lowercase, then strip the WordPress size suffix, then substring fallback. Anything that still misses goes into a missingMediaSrcs set that prints at the end of the run, grouped by page slug.
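
The end-of-run report can be sketched as below. It assumes the same "filename::pageSlug" entry format that findMediaId writes into missingMediaSrcs; the function names and the report layout are hypothetical.

```typescript
// Entries use the "filename::pageSlug" format written by the lookup function.
function groupMissingBySlug(entries: Iterable<string>): Map<string, string[]> {
  const bySlug = new Map<string, string[]>()
  for (const entry of Array.from(entries)) {
    const separator = entry.lastIndexOf('::')
    const filename = separator === -1 ? entry : entry.slice(0, separator)
    const slug = separator === -1 ? '(unknown)' : entry.slice(separator + 2)
    const files = bySlug.get(slug) ?? []
    files.push(filename)
    bySlug.set(slug, files)
  }
  return bySlug
}

// Render one heading line per page slug, followed by its missing files.
function formatMissingReport(bySlug: Map<string, string[]>): string {
  const lines: string[] = []
  Array.from(bySlug.keys())
    .sort()
    .forEach((slug) => {
      const files = bySlug.get(slug) ?? []
      lines.push(`${slug} (${files.length} missing)`)
      files
        .slice()
        .sort()
        .forEach((file) => lines.push(`  - ${file}`))
    })
  return lines.join('\n')
}

const missing = new Set([
  'hero.png::getting-started',
  'icon@2x.png::setup',
  'diagram.svg::getting-started',
])
const report = formatMissingReport(groupMissingBySlug(missing))
```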

An important decision: don't fail the document write when a media reference can't be resolved. Some source HTML will point at images that aren't in the export at all. If the script throws on missing media, the whole migration stops on the first batch. Logging and continuing produces a clean triage list at the end instead of a hard halt at the start.

4. The HTML-to-Lexical Body Parser Is Where Most of the Time Goes

WordPress stores rich content as HTML. Payload wants a structured Lexical JSON tree. The documentation parser alone ended up at ~900 lines. Here's why:

Page builder wrappers everywhere. The source HTML wasn't clean semantic markup. It was wrapped in multiple layers of page builder containers like section > row > column > module > text-inner > actual content. Sometimes there were additional wrapper divs with custom CSS classes. The parser had to recognize and recurse through all of them without treating them as content.

HTML entity decoding. Smart quotes, &nbsp; sequences, numeric entities, etc. What's worth noting is that two separate implementations of decodeHtmlEntities() ended up in the codebase because policy pages and documentation were migrated as separate scripts with slightly different entity handling. Copy-paste creeps in when scripts are written independently.

Page-builder shortcodes that need to become Payload blocks. A toggle module becomes a CollapsibleBlock. A styled button with a specific CSS class becomes a CustomLink block. A row of grouped tabbed links for different integration methods becomes a multi-link block. Each of these has different Lexical node shapes.

Inline images that need to become MediaBlock references. WordPress drops <img> tags anywhere: inside paragraphs, inside list items, inside blockquotes. Payload's Lexical editor represents images as block-level nodes with a relationship to a media document. The parser has to extract inline images, resolve them through the media lookup, and emit them as separate nodes.
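
Of the problems above, entity decoding is the easiest one to share between scripts. Here is a minimal single-pass sketch, assuming a small hand-picked table of named entities; it is not either of the two implementations from the project, and the table would need extending for real content.

```typescript
// Small table of named entities; extend it with whatever the source HTML uses.
const NAMED_ENTITIES = new Map<string, string>([
  ['amp', '&'], ['lt', '<'], ['gt', '>'], ['quot', '"'], ['apos', "'"],
  ['nbsp', '\u00a0'], ['ldquo', '\u201c'], ['rdquo', '\u201d'],
  ['lsquo', '\u2018'], ['rsquo', '\u2019'], ['hellip', '\u2026'],
])

function decodeHtmlEntities(input: string): string {
  // One pass over the string, so "&amp;nbsp;" becomes "&nbsp;" and is not decoded twice.
  return input.replace(
    /&(#x[0-9a-f]+|#\d+|[a-z][a-z0-9]*);/gi,
    (match: string, body: string): string => {
      if (body.charAt(0) === '#') {
        const isHex = body.charAt(1).toLowerCase() === 'x'
        const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10)
        // Leave out-of-range references untouched rather than throwing.
        return codePoint >= 0 && codePoint <= 0x10ffff
          ? String.fromCodePoint(codePoint)
          : match
      }
      // Unknown named entities are left exactly as they were.
      return NAMED_ENTITIES.get(body) ?? match
    },
  )
}
```

Keeping one function like this in a shared module is cheap even when the rest of each script stays separate.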

The pattern that works: use JSDOM to walk the source HTML, classify each node, emit the corresponding Lexical structure. Be explicit and boring. Don't try to be clever.

The central processor ended up looking like this:

function processBlockNodes(container: Element, isNested: boolean = false): any[] {
  const children: any[] = []
  let inlineBuffer: any[] = []

  function flushBuffer() {
    if (inlineBuffer.length === 0) return
    children.push(makeParagraph(inlineBuffer))
    inlineBuffer = []
  }

  function processNode(node: ChildNode) {
    // ... classify and emit
    // Headings → createHeading()
    // Paragraphs → extract images first, then processInlineNodes()
    // Lists → createList() (with nested list and trailing content handling)
    // Blockquotes → createQuote() + extract trailing images
    // Page builder wrappers → recurse
    // FAQ accordions → processFaqList()
    // CTA buttons → createCustomLinkSingle()
    // Bare text → accumulate in inlineBuffer
  }

  container.childNodes.forEach(processNode)
  flushBuffer()
  return children
}


The inlineBuffer pattern is worth calling out. WordPress HTML sometimes has bare text nodes between block elements. Lexical requires all text to live inside a paragraph node. The buffer collects loose text and flushes it into a paragraph whenever a real block element comes along.
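
Stripped of JSDOM and the real Lexical node shapes, the buffer-and-flush idea looks roughly like this. SourceNode and OutputNode are simplified stand-ins invented for the sketch, not the parser's actual types.

```typescript
// Simplified stand-ins for DOM nodes and Lexical nodes (hypothetical shapes).
type SourceNode =
  | { kind: 'text'; text: string }
  | { kind: 'block'; tag: string; text: string }

type OutputNode = { type: string; text: string }

function groupLooseText(nodes: SourceNode[]): OutputNode[] {
  const output: OutputNode[] = []
  let inlineBuffer: string[] = []

  const flushBuffer = (): void => {
    const text = inlineBuffer.join(' ').trim()
    inlineBuffer = []
    if (text.length > 0) output.push({ type: 'paragraph', text })
  }

  for (const node of nodes) {
    if (node.kind === 'text') {
      // Whitespace-only text between blocks should not become a paragraph.
      if (node.text.trim().length > 0) inlineBuffer.push(node.text.trim())
      continue
    }
    flushBuffer() // a real block ends the current run of loose text
    output.push({ type: node.tag, text: node.text })
  }

  flushBuffer() // trailing loose text after the last block
  return output
}
```

The flush has to happen in two places: before every block element, and once more at the end for text that trails the last block.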

The classification rules are specific to the source site. In this project, the documentation parser handled things the policy page parser didn't (FAQ accordions, tabbed link blocks) and vice versa (tables, back-navigation links). Each script handles its own content type.

One thing the code above doesn't convey: the parser wasn't designed correctly on the first pass, or the second, or the third. The first dry-run produced 9 media blocks across 80 pages. That was clearly wrong. The source HTML had images everywhere. The fix for images inside <p> tags bumped it to 160. Then it turned out blocks nested inside <li> elements were being silently skipped. Then support buttons inside <p> tags weren't becoming CustomLink blocks. Then querySelector was only grabbing the first page builder text block on pages that had several. Each dry-run revealed a new class of missed content.

The parser wasn't designed but discovered, one bug at a time.

One bug from this phase didn't surface until much later, which is a lesson in itself. The parser generated Lexical node IDs with Date.now() plus a short random hex suffix. In a tight loop, Date.now() returns the same millisecond for many consecutive calls, and the random suffix occasionally collided, producing duplicate node IDs within a document. The migrated content rendered fine on the front-end, so nothing looked wrong. But opening those documents in the Payload admin editor crashed it with a React duplicate-key error. A parser bug, surfacing as a CMS crash. The fix was a proper unique ID generator; finding it meant tracing a React error in the admin panel all the way back to a migration script that had already been deleted.
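
The post doesn't say which generator replaced the original, so the following is only one possible shape: a counter-based sketch in which IDs cannot repeat within a run, because the counter advances on every call regardless of the clock. The createNodeIdGenerator name and the ID format are hypothetical.

```typescript
// Returns a generator whose IDs cannot repeat within one run: the counter
// increments on every call, so the clock resolution no longer matters.
function createNodeIdGenerator(): () => string {
  const runPrefix = Date.now().toString(36)
  let counter = 0
  return () => {
    counter += 1
    return `${runPrefix}-${counter.toString(36)}`
  }
}

const nextNodeId = createNodeIdGenerator()
const ids = new Set<string>()
for (let i = 0; i < 10000; i++) ids.add(nextNodeId())
```

Two separate runs started in the same millisecond could still produce overlapping IDs with this approach. If that matters, Node's built-in crypto.randomUUID() is a reasonable alternative.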

Speaking of which.

5. Accept That You'll Have Separate Scripts Per Content Type

This is something I chose not to gloss over, and it's worth being honest about. The project didn't end up with one unified migration system. It ended up with twelve separate scripts:

  • migrate-media.ts: XML export media
  • migrate-media-ftp.ts: 60+ FTP images not in WordPress
  • migrate-theme-media.ts: ~100 theme images from the child theme
  • migrate-clients.ts
  • migrate-categories.ts
  • migrate-platforms.ts: taxonomy terms for the integration catalog
  • migrate-pages.ts: root landing pages (page builder hero detection)
  • migrate-policy-pages.ts: 5 policy pages with table support
  • migrate-docs.ts: ~65 hierarchical documentation pages (the big one)
  • migrate-integrations.ts: ~100 integration catalog pages
  • migrate-faqs.ts: FAQ custom post type from XML
  • migrate-bespoke.ts: one hand-crafted page that didn't fit the general parser

Each has its own CLI flags (--dry-run, --limit, --slug, --verbose, --force-update), its own HTML-to-Lexical converter (ranging from "strip all tags and split by newlines" for landing pages to the full 900-line parser for documentation), and its own idempotency logic.

A planned unified CLI entry point (scripts/index.ts) never got built. Running individual scripts worked fine and took less time as well.

The fragmentation wasn't a failure of planning. It was a consequence of the "design the schema first" advice. Each content type had different fields, different block types, different HTML structures, and different parent/child relationships. The documentation needed hierarchy sorting and related-articles linking. The policy pages needed back-navigation stripping and table support. The integration catalog needed category taxonomy mapping. The FAQ custom post type had to be parsed from XML, not JSON.

I didn't want to optimize for DRY. It feels like a trap!

6. Lexical Block Constraints Need a Sanitization Pass

This one caught me off guard, and it cost a day of mysterious validation errors.

Lexical blocks have rules about what they can contain. A CollapsibleBlock might allow h2 through h6, paragraphs, lists, media, and horizontal rules but not h1, blockquotes, or nested collapsibles. Real source HTML doesn't always follow those rules.

The documentation parser hits this constantly because the WordPress FAQ plugin generates accordion sections, and those sections contain whatever HTML the editor put in them including other FAQ sections. Nest an accordion inside an accordion and Payload rejects the write.

The fix is a sanitization pass that runs between parsing and writing:

function sanitizeForCollapsible(lexical: any): any {
  if (!lexical?.root?.children) return lexical

  const sanitized: any[] = []
  for (const node of lexical.root.children) {
    // h1 → h2
    if (node.type === 'heading' && node.tag === 'h1') {
      sanitized.push({ ...node, tag: 'h2' })
      continue
    }
    // blockquote → paragraphs
    if (node.type === 'quote') {
      const paras = (node.children || []).filter((c) => c.type === 'paragraph')
      sanitized.push(...(paras.length > 0 ? paras : [makeParagraph(extractText(node))]))
      continue
    }
    // nested collapsible → flatten content into parent
    if (node.type === 'block' && node.fields?.blockType === 'collapsible') {
      sanitized.push(...(node.fields?.content?.root?.children || []))
      continue
    }
    sanitized.push(node)
  }

  return { root: { ...lexical.root, children: sanitized } }
}


Without this, roughly a quarter of documents can fail validation at write time. The error messages don't indicate which nested rule was violated, which is most of what makes this expensive to debug the first time.

7. WordPress URL Aliases Need Their Own Mapping... Sometimes

This took a full afternoon to figure out and it's not obvious.

WordPress has a concept of "friendly" URL aliases. A documentation page might live at /docs/integrations/provider-a/setup-guide-full but also be reachable at /docs/provider-a/setup. The short URL is what appears in navigation links and "related articles" sections within other pages.

Payload doesn't have this concept. A document has one slug, one parent chain, one fullSlug. So when the source HTML contains a related-articles section with links like /docs/provider-a/setup, that path can't just be looked up in Payload. It doesn't exist.

The fix was a hardcoded lookup table mapping WordPress friendly URLs to the actual Payload fullSlug:

const wpFriendlyToFullSlug: Record<string, string> = {
  'docs/provider-a/setup': 'docs/integrations/provider-a/setup-guide-full',
  'docs/provider-b/webhooks': 'docs/integrations/provider-b/configuring-webhook-endpoints',
  'docs/api/v1': 'docs/api/v1/getting-started',
  'docs/api/v2': 'docs/api/v2/schema-reference',
  // ... 30 more entries
}


This table was built by manually inspecting which WordPress page each friendly URL pointed to. It's not glamorous work. But without it, related-articles links would silently fail to resolve, and the migrated site would have broken cross-linking.
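
A lookup that consults such a table might be sketched like this. The resolveFullSlug and normalizePath names, the placeholder base URL, and the choice to return null for unresolved links are all assumptions for illustration.

```typescript
// Strip origin, query, hash, and surrounding slashes: "/docs/a/?x=1" → "docs/a".
function normalizePath(href: string): string {
  const { pathname } = new URL(href, 'https://example.com') // placeholder base
  return pathname.replace(/^\/+|\/+$/g, '')
}

// Try the path as a real fullSlug first, then fall back to the alias table.
function resolveFullSlug(
  href: string,
  knownFullSlugs: Set<string>,
  aliases: Record<string, string>,
): string | null {
  const path = normalizePath(href)
  if (knownFullSlugs.has(path)) return path
  const target = Object.prototype.hasOwnProperty.call(aliases, path)
    ? aliases[path]
    : undefined
  // Only accept an alias whose target actually exists in Payload.
  return target !== undefined && knownFullSlugs.has(target) ? target : null
}

const knownFullSlugs = new Set(['docs/integrations/provider-a/setup-guide-full'])
const aliases: Record<string, string> = {
  'docs/provider-a/setup': 'docs/integrations/provider-a/setup-guide-full',
}
```

Returning null instead of throwing fits the log-and-continue approach used for missing media: unresolved links can be collected and reviewed at the end of the run.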

The SEO plugin's breadcrumb schema data was useful here: it was structured BreadcrumbList JSON with absolute URLs and titles, which helped verify that the mapping was correct.

8. Build Dry-Run Mode From Day One

The highest-leverage decision is adding proper CLI flags at the start of the project, not at the end:

tsx migrate-docs.ts --dry-run
tsx migrate-docs.ts --dry-run --limit=5
tsx migrate-docs.ts --slug=getting-started
tsx migrate-docs.ts --force-update


These take an hour to add per script. They save days of cleanup over the course of the project. Once they exist, make --dry-run the default. Running the script accidentally should be harmless.
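
One way to make dry-run the default is to require an explicit opt-in before anything is written. The sketch below hand-rolls the flag parsing; the --write flag and the parseFlags name are hypothetical and do not appear in the original scripts.

```typescript
interface MigrationFlags {
  dryRun: boolean
  limit: number | null
  slug: string | null
  verbose: boolean
  forceUpdate: boolean
}

// `--write` is a hypothetical opt-in flag: without it, the run stays dry.
function parseFlags(argv: string[]): MigrationFlags {
  const flags: MigrationFlags = {
    dryRun: true, limit: null, slug: null, verbose: false, forceUpdate: false,
  }
  let sawWrite = false
  let sawDryRun = false

  for (const arg of argv) {
    const [name, value] = arg.split('=', 2)
    switch (name) {
      case '--write': sawWrite = true; break
      case '--dry-run': sawDryRun = true; break
      case '--verbose': flags.verbose = true; break
      case '--force-update': flags.forceUpdate = true; break
      case '--slug':
        if (!value) throw new Error('--slug requires a value, e.g. --slug=getting-started')
        flags.slug = value
        break
      case '--limit': {
        const limit = Number(value)
        if (!value || !Number.isInteger(limit) || limit <= 0) {
          throw new Error('--limit requires a positive integer, e.g. --limit=5')
        }
        flags.limit = limit
        break
      }
      default:
        throw new Error(`Unknown flag: ${arg}`)
    }
  }

  // An explicit --dry-run always wins over --write, whatever the order.
  flags.dryRun = sawDryRun || !sawWrite
  return flags
}
```

In a script this would typically be called as parseFlags(process.argv.slice(2)). Rejecting unknown flags is deliberate: a typo like --dryrun should stop the script rather than silently fall through to a default.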

The --slug flag is particularly valuable when debugging a single problematic page. Running the parser against one document and inspecting the full Lexical output without touching the database saves 20 minutes of debugging.

Running the migration in dry-run mode with verbose logging also turned out to be useful beyond catching script bugs. Inspecting every document on its way through the parser surfaced things worth flagging:

  • Slug collisions across different parents, usually legitimate. Three pages named "setup" under different parent integrations is fine when the unique constraint is on fullSlug.
  • Orphaned drafts with no slug, reachable only by ?page_id=12345 URLs.
  • Media references without matching files in the export.
  • One page whose HTML was unique enough that it warranted its own bespoke migration script rather than trying to bend the general parser to handle it.

The dry-run output ended up serving as the definitive content inventory.
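
The slug-collision check from that list is easy to automate once each dry-run document is recorded. A minimal sketch, assuming a hypothetical InventoryEntry record collected per document during the dry run:

```typescript
interface InventoryEntry {
  wpId: number
  slug: string
  fullSlug: string
}

// Group entries by the chosen key and keep only values shared by 2+ documents.
function findDuplicates(
  entries: InventoryEntry[],
  key: 'slug' | 'fullSlug',
): Map<string, number[]> {
  const groups = new Map<string, number[]>()
  for (const entry of entries) {
    const ids = groups.get(entry[key]) ?? []
    ids.push(entry.wpId)
    groups.set(entry[key], ids)
  }
  const duplicates = new Map<string, number[]>()
  groups.forEach((ids, value) => {
    if (ids.length > 1) duplicates.set(value, ids)
  })
  return duplicates
}

const inventory: InventoryEntry[] = [
  { wpId: 11, slug: 'setup', fullSlug: 'docs/integrations/provider-a/setup' },
  { wpId: 12, slug: 'setup', fullSlug: 'docs/integrations/provider-b/setup' },
  { wpId: 13, slug: 'setup', fullSlug: 'docs/integrations/provider-c/setup' },
  { wpId: 14, slug: 'webhooks', fullSlug: 'docs/integrations/provider-a/webhooks' },
]
const slugCollisions = findDuplicates(inventory, 'slug') // expected and harmless
const fullSlugCollisions = findDuplicates(inventory, 'fullSlug') // would violate the constraint
```

Duplicates on slug are informational. Duplicates on fullSlug are the ones that would be rejected by the unique constraint at write time.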

9. Sort by Hierarchy Before Migrating

When content is hierarchical, parents need to be migrated before children.

The documentation pages formed a tree. The wpIdToPayloadId map needs the parent's Payload ID before setting the child's parent relationship field. Process children first and the lookup fails, leaving orphaned documents.

The fix was a topological sort that guarantees parents come before children:

function sortPagesByHierarchy(pages: WordPressPage[]): WordPressPage[] {
  const sorted: WordPressPage[] = []
  const remaining = [...pages]
  const added = new Set<number>()
  const allPageIds = new Set(pages.map((p) => p.id))

  while (remaining.length > 0) {
    const beforeLength = remaining.length

    for (let i = remaining.length - 1; i >= 0; i--) {
      const page = remaining[i]
      const parentInSet = allPageIds.has(page.parent)
      const parentMigrated = added.has(page.parent)

      if (!parentInSet || parentMigrated) {
        sorted.push(page)
        added.add(page.id)
        remaining.splice(i, 1)
      }
    }

    // Circular dependency guard
    if (remaining.length === beforeLength) {
      console.warn(`Circular dependency, force-adding ${remaining.length} pages`)
      sorted.push(...remaining)
      break
    }
  }

  return sorted
}


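The circular dependency guard is there because one page in the export referenced a parent that didn't exist in the set. In the version shown, the parentInSet check already lets such a page through as a root, so the guard remains as a safety net for pages whose parent references can never be satisfied, such as true cycles. Without it, the sort loops forever on that input.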

Closing Thoughts

The chaos isn't in the code. It's in the decisions and the modeling. Which content types become their own collections? Which URL conventions carry over? What counts as a "block" versus a "field"? The migration script just translates those decisions one at a time.

PS - I didn't write all of this code alone, either. Claude and GLM were on the team for most of it, especially through the parser layer where pattern-matching twelve variations of the same shortcode is exactly the tedious work AI is good at. The architectural calls stayed mine; the boilerplate didn't. If a deeper post on that workflow would be useful, covering what I prompted, where it failed, and where it shaved off days versus quietly added them back, say so in the comments and I'll write it up.

Keep your dry-runs greedy and your logs verbose.