TL;DR for the skimmers:
- Design the Payload schema before writing any migration code
- Use a wpId field on every collection for idempotent upserts
- Expect multiple media migration passes: the XML export won't contain everything
- The HTML-to-Lexical parser will dominate the timeline, so budget accordingly
- Separate migration scripts per content type are fine. Don't build abstractions that are going to be deleted.
If you've ever opened a WordPress XML export, scrolled for a while, and quietly wondered who let this happen, this post is for you.
Content migrations from WordPress to a headless CMS look simple on paper. Export. Transform. Load. Three words, absolutely not innocent in practice. Most of the work isn't moving data, it's figuring out what the data means and where and how it should live in the new system.
Recently I had to migrate a long-running SaaS documentation site from WordPress to Payload CMS. The site had been through a commercial page builder like Divi, an SEO plugin, a custom FAQ plugin, and at least three different image upload mechanisms over the years.
This post is about what worked, what didn't, and the specific patterns that held up when the script ran for the 30th time at 1 AM and finally produced clean output.
1. Design the Payload Schema Before Writing Any Migration Code
The migration script is throwaway code. The Payload collection is what the editorial team lives inside for the next several years. Spending a week on schema decisions before writing a single line of migration code is what keeps the project from becoming a rewrite of WordPress with extra steps.
Three decisions in this project carried more weight than everything else combined:
Model hierarchy with a self-referencing relationship, then derive the path from it. The documentation pages, policy pages, and integration catalog all had parent-child relationships, some three or four levels deep. Each collection has a parent field pointing at the same collection, plus read-only fullSlug and fullUrl fields that get recomputed in a beforeChange hook from the parent's stored fullSlug:
// policy-pages collection: beforeChange hook
async ({ data, req }) => {
  if (!data?.slug) return data
  const fullSlug = await generateFullSlug(req.payload, data.slug, data.parent)
  data.fullSlug = fullSlug
  data.fullUrl = `/${fullSlug}`
  return data
}

async function generateFullSlug(payload, slug, parentId) {
  if (!parentId) return `policies/${slug}`
  const parent = await payload.findByID({
    collection: 'policy-pages', id: parentId, depth: 0,
  })
  if (parent?.fullSlug) return `${parent.fullSlug}/${slug}`
  return `policies/${slug}`
}
This does two things. First, renaming a parent page doesn't permanently break any of its children: the next save of a child rewrites its full path. Second, the unique constraint goes on fullSlug (like docs/integrations/[provider]/setup), not on the individual slug. Multiple pages called "setup" under different parents are fine, because each resolves to a unique full path. WordPress handles this with friendly URL aliases. Payload handles it with a derived path. Both work, but the model needs to be committed to before the migration starts.
Stamp every migrated document with its WordPress ID. A wpId number field, tucked away in the sidebar, read-only. It's there for the migration script, not for the editors. This is what makes re-running the script safe instead of catastrophic. Section 2 covers the mechanics, but the schema decision needs to happen here: if the field doesn't exist on the collection, none of the idempotency logic works.
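A minimal sketch of what those two schema decisions could look like as field definitions. The FieldSketch type is a local stand-in so the snippet compiles on its own; in a real project the type would come from Payload itself, and the exact admin options used on the original site are an assumption here.

```typescript
// Local stand-in for Payload's field config type (assumption: only the
// options used below are modeled; the real type has many more).
type FieldSketch = {
  name: string
  type: 'number' | 'text'
  unique?: boolean
  index?: boolean
  admin?: { position?: 'sidebar'; readOnly?: boolean; description?: string }
}

const migrationFields: FieldSketch[] = [
  {
    // WordPress post ID, written once by the migration script.
    name: 'wpId',
    type: 'number',
    index: true, // looked up on every run, so worth indexing
    admin: {
      position: 'sidebar',
      readOnly: true,
      description: 'WordPress ID. Used by migration scripts only.',
    },
  },
  {
    // Derived path; uniqueness lives here rather than on the bare slug.
    name: 'fullSlug',
    type: 'text',
    unique: true,
    index: true,
    admin: { position: 'sidebar', readOnly: true },
  },
]
```

wpId is indexed but deliberately not marked unique in this sketch, on the assumption that editors will later create documents that have no WordPress ID at all.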
Decide which WordPress structures become their own collections vs. blocks within a collection. The source site had landing pages, policy pages, documentation, an integration catalog, FAQs, and blog posts. These all lived in WordPress's posts and pages tables, differentiated by parent IDs, custom post types, and template names. In Payload they became separate collections because each had fundamentally different fields: integration pages have a category relationship and platform-specific metadata. Documentation pages have a breadcrumbs array and cross-linking to related articles. Policy pages have none of that. FAQs are just a question/answer pair. If these had been shoe-horned into one pages collection with conditional fields, the admin panel would have been unusable.
2. Use a wpId Field for Idempotent Migrations
The migration script runs more than once. A lot more than once. Different subsets, different flags, different content types, recovery attempts after something breaks. In this project the documentation migration script alone ran dozens of times before everything settled.
Without a way to identify "is this WordPress post already in Payload?", duplicates pile up on every run. Once is annoying. Thirty times is a cleanup project of its own.
The fix is a small in-memory map populated at the start of every run:
const wpIdToPayloadId = new Map<number, number>()

async function loadWpIdMap(payload: any) {
  const result = await payload.find({
    collection: 'docs',
    limit: 500, // raise this or paginate if the collection can exceed 500 documents
    select: { id: true, wpId: true },
  })
  result.docs.forEach((doc: any) => {
    if (doc.wpId) wpIdToPayloadId.set(doc.wpId, Number(doc.id))
  })
}
Then during import:
const existingId = wpIdToPayloadId.get(wpPost.id)
if (existingId) {
  await payload.update({ collection: 'docs', id: existingId, data })
} else {
  await payload.create({
    collection: 'docs',
    data: { ...data, wpId: wpPost.id },
  })
}
The script is now idempotent. Same input produces the same result. This unlocks everything else: staging refreshes, retry logic, partial reruns when a parsing bug surfaces halfway through a batch.
Seven collections in this project ended up with wpId fields. Every migration script uses the same map-and-upsert pattern. It's boring, and that's the point.
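To make the idempotency claim concrete, here is a self-contained sketch of the map-and-upsert pattern run against an in-memory stand-in for a collection. FakeCollection, WpPost, and runImport are hypothetical names invented for illustration; the real scripts talk to Payload's Local API instead.

```typescript
type WpPost = { id: number; title: string }
type Doc = { id: number; wpId?: number; title: string }

// In-memory stand-in for a Payload collection (hypothetical, illustration only).
class FakeCollection {
  private docs = new Map<number, Doc>()
  private nextId = 1

  async find(): Promise<Doc[]> {
    return Array.from(this.docs.values())
  }

  async create(data: Omit<Doc, 'id'>): Promise<Doc> {
    const doc: Doc = { id: this.nextId++, ...data }
    this.docs.set(doc.id, doc)
    return doc
  }

  async update(id: number, data: Partial<Omit<Doc, 'id'>>): Promise<Doc> {
    const existing = this.docs.get(id)
    if (!existing) throw new Error(`No document with id ${id}`)
    const updated: Doc = { ...existing, ...data }
    this.docs.set(id, updated)
    return updated
  }
}

async function runImport(
  collection: FakeCollection,
  posts: WpPost[],
): Promise<{ created: number; updated: number }> {
  // Rebuild the wpId -> Payload ID map at the start of every run.
  const wpIdToPayloadId = new Map<number, number>()
  for (const doc of await collection.find()) {
    if (doc.wpId !== undefined) wpIdToPayloadId.set(doc.wpId, doc.id)
  }

  let created = 0
  let updated = 0
  for (const post of posts) {
    const existingId = wpIdToPayloadId.get(post.id)
    if (existingId !== undefined) {
      await collection.update(existingId, { title: post.title })
      updated++
    } else {
      const doc = await collection.create({ wpId: post.id, title: post.title })
      // Record the new ID so a repeated post in the same run updates instead of duplicating.
      wpIdToPayloadId.set(post.id, doc.id)
      created++
    }
  }
  return { created, updated }
}
```

Running runImport twice with the same posts creates documents on the first pass and only updates them on the second, so the collection size does not change between runs.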
3. Media Migration Won't Be One Script
This is where the clean narrative "write a media resolver" falls apart.
WordPress had a complicated relationship with its own media library. The same image could appear in body HTML as a full CDN URL, a WordPress-generated size variant (screenshot-300x200.png), an artifact from a half-completed format migration (screenshot.png.webp), a protocol-less reference, or a relative path with no domain.
And that's just the images that were in the media library. This particular site also had:
- 60+ images uploaded directly to the server via FTP years ago, never registered in WordPress's media library at all. They lived on disk but had no database entry. The XML export didn't mention them.
- ~100 theme images from the child theme's CSS, such as retina variants at 2x and 3x, icon sprites, and background patterns. These were referenced in stylesheets, not in post content, but the new site needed them in Payload's media collection so the storage adapter could serve them from S3.
That's three separate media migration scripts. The first handles the XML export. The second is a hardcoded list of FTP URLs discovered by crawling the staging server. The third is a hardcoded list of theme asset paths. None of them shared code with each other beyond the Payload create call.
The first script also had a subtle bug that took weeks to surface. Retina images like icon@2x.png and logo@3x.png were all failing silently. The WordPress XML stored their attachment URLs with %40 (URL-encoded @) in the filename: icon%402x.png. The migration script extracted the filename with path.basename(new URL(...).pathname) which doesn't decode URL encoding. Payload received icon%402x.png instead of icon@2x.png. Thirteen images affected. The fix was a one-liner (decodeURIComponent), but finding it required inspecting the XML by hand and wondering why only images with @ in the name were broken. This is the kind of bug that only shows up when you stop looking at the script and start looking at the data.
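The encoding problem is easy to reproduce in isolation. The sketch below is not the original script: decodedFilenameFromUrl is a hypothetical helper, and the placeholder base URL is only there so protocol-less and relative references parse too.

```typescript
// Hypothetical helper: extract a decoded filename from an attachment URL.
function decodedFilenameFromUrl(src: string): string {
  // URL.pathname keeps percent-encoding as written, e.g. "/uploads/icon%402x.png".
  const pathname = new URL(src, 'https://placeholder.invalid').pathname
  const raw = pathname.slice(pathname.lastIndexOf('/') + 1)
  try {
    return decodeURIComponent(raw) // "icon%402x.png" becomes "icon@2x.png"
  } catch {
    // Malformed escape sequence: keep the raw name instead of aborting the run.
    return raw
  }
}
```

The try/catch matters because decodeURIComponent throws a URIError on malformed sequences, and one bad filename should not stop a batch.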
In this project the content-body media resolver (used during HTML-to-Lexical conversion) was a lookup function, not a migration script.
function findMediaId(src: string): number | null {
  const filename = extractFilename(src)
  const baseName = getBaseFilename(filename) // strips .png.webp → base

  // 1. Exact base match
  if (mediaFilenameToId.has(baseName)) return mediaFilenameToId.get(baseName)!

  // 2. Full lowercase filename
  if (mediaFilenameToId.has(filename.toLowerCase())) return mediaFilenameToId.get(filename.toLowerCase())!

  // 3. Partial match (handles WP size variants like -300x200)
  for (const [key, id] of mediaFilenameToId) {
    if (key.startsWith(baseName + '-') || baseName.startsWith(key + '-')) return id
  }

  // 4. Substring match (catches double-extension edge cases)
  for (const [key, id] of mediaFilenameToId) {
    if (key.startsWith(baseName + '.') || key === baseName) return id
  }

  missingMediaSrcs.add(`${filename}::${currentPageSlug}`)
  return null
}
A tiered lookup: exact match, then lowercase, then strip the WordPress size suffix, then substring fallback. Anything that still misses goes into a missingMediaSrcs set that prints at the end of the run, grouped by page slug.
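The size-suffix and double-extension cases can also be collapsed into a single normalized key before any lookup happens. This is a sketch of that idea, not the project's getBaseFilename: normalizeMediaName and its list of extensions are assumptions.

```typescript
// Hypothetical normalizer: reduce a filename to a lowercase stem for lookups.
function normalizeMediaName(filename: string): string {
  return filename
    .toLowerCase()
    // Strip one or two trailing image extensions: "shot.png.webp" becomes "shot".
    .replace(/(\.(png|jpe?g|gif|webp|svg|avif)){1,2}$/, '')
    // Strip a trailing WordPress size suffix: "shot-300x200" becomes "shot".
    .replace(/-\d+x\d+$/, '')
}
```

One caveat: an original upload that is genuinely named something like banner-1200x628.png would lose its suffix too, so a normalized key works best as a fallback tier rather than the only one.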
An important decision: don't fail the document write when a media reference can't be resolved. Some source HTML will point at images that aren't in the export at all. If the script throws on missing media, the whole migration stops on the first batch. Logging and continuing produces a clean triage list at the end instead of a hard halt at the start.
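The triage list itself is a few lines of grouping. The sketch below assumes the "filename::pageSlug" entry shape that findMediaId writes into missingMediaSrcs; groupMissingMedia and formatMissingMediaReport are hypothetical names.

```typescript
// Group "filename::pageSlug" entries by page slug.
function groupMissingMedia(entries: Set<string>): Map<string, string[]> {
  const bySlug = new Map<string, string[]>()
  entries.forEach((entry) => {
    // Split on the last "::" so filenames containing colons survive.
    const sep = entry.lastIndexOf('::')
    const filename = sep === -1 ? entry : entry.slice(0, sep)
    const slug = sep === -1 ? '(unknown page)' : entry.slice(sep + 2)
    const list = bySlug.get(slug) ?? []
    list.push(filename)
    bySlug.set(slug, list)
  })
  return bySlug
}

// Render the grouped entries as a plain-text report for the end of a run.
function formatMissingMediaReport(entries: Set<string>): string {
  const lines: string[] = []
  groupMissingMedia(entries).forEach((files, slug) => {
    lines.push(`${slug} (${files.length} missing)`)
    files.sort().forEach((file) => lines.push(`  - ${file}`))
  })
  return lines.join('\n')
}
```

Printing the report once, after the last document is written, keeps the per-document logs readable while still leaving a single list to work through.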
4. The HTML-to-Lexical Body Parser Is Where Most of the Time Goes
WordPress stores rich content as HTML. Payload wants a structured Lexical JSON tree. The documentation parser alone ended up at ~900 lines. Here's why:
Page builder wrappers everywhere. The source HTML wasn't clean semantic markup. It was wrapped in multiple layers of page builder containers like section > row > column > module > text-inner > actual content. Sometimes there were additional wrapper divs with custom CSS classes. The parser had to recognize and recurse through all of them without treating them as content.
HTML entity decoding. Smart quotes, &nbsp; sequences, numeric entities, etc. What's worth noting is that two separate implementations of decodeHtmlEntities() ended up in the codebase because policy pages and documentation were migrated as separate scripts with slightly different entity handling. Copy-paste creeps in when scripts are written independently.
Page-builder shortcodes that need to become Payload blocks. A toggle module becomes a CollapsibleBlock. A styled button with a specific CSS class becomes a CustomLink block. A row of grouped tabbed links for different integration methods becomes a multi-link block. Each of these has different Lexical node shapes.
Inline images that need to become MediaBlock references. WordPress drops <img> tags anywhere: inside paragraphs, inside list items, inside blockquotes. Payload's Lexical editor represents images as block-level nodes with a relationship to a media document. The parser has to extract inline images, resolve them through the media lookup, and emit them as separate nodes.
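A sketch of the image-hoisting step, using deliberately simplified node shapes. InlineNode and BlockNode are stand-ins rather than real Lexical nodes, the 'mediaBlock' slug is an assumption, and resolveMedia plays the role of the media lookup.

```typescript
type InlineNode = { kind: 'text'; text: string } | { kind: 'img'; src: string }
type TextNode = { type: 'text'; text: string }
type BlockNode =
  | { type: 'paragraph'; children: TextNode[] }
  | { type: 'block'; fields: { blockType: 'mediaBlock'; media: number } }

// Split one paragraph's inline content into paragraphs and block-level media nodes.
function hoistInlineImages(
  inline: InlineNode[],
  resolveMedia: (src: string) => number | null,
): BlockNode[] {
  const out: BlockNode[] = []
  let run: TextNode[] = []

  const flush = (): void => {
    // Emit the pending text run, skipping runs that are only whitespace.
    if (run.some((t) => t.text.trim() !== '')) out.push({ type: 'paragraph', children: run })
    run = []
  }

  for (const node of inline) {
    if (node.kind === 'text') {
      run.push({ type: 'text', text: node.text })
      continue
    }
    const mediaId = resolveMedia(node.src)
    if (mediaId === null) continue // unresolved image: drop it here, report it elsewhere
    flush() // text before the image becomes its own paragraph
    out.push({ type: 'block', fields: { blockType: 'mediaBlock', media: mediaId } })
  }
  flush()
  return out
}
```

A paragraph with an image in the middle comes out as three nodes: the text before, the media block, and the text after. An image that cannot be resolved leaves the paragraph in one piece.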
The pattern that works: use JSDOM to walk the source HTML, classify each node, emit the corresponding Lexical structure. Be explicit and boring. Don't try to be clever.
The central processor ended up looking like this:
function processBlockNodes(container: Element, isNested: boolean = false): any[] {
  const children: any[] = []
  let inlineBuffer: any[] = []

  function flushBuffer() {
    if (inlineBuffer.length === 0) return
    children.push(makeParagraph(inlineBuffer))
    inlineBuffer = []
  }

  function processNode(node: ChildNode) {
    // ... classify and emit
    // Headings → createHeading()
    // Paragraphs → extract images first, then processInlineNodes()
    // Lists → createList() (with nested list and trailing content handling)
    // Blockquotes → createQuote() + extract trailing images
    // Page builder wrappers → recurse
    // FAQ accordions → processFaqList()
    // CTA buttons → createCustomLinkSingle()
    // Bare text → accumulate in inlineBuffer
  }

  container.childNodes.forEach(processNode)
  flushBuffer()
  return children
}
The inlineBuffer pattern is worth calling out. WordPress HTML sometimes has bare text nodes between block elements. Lexical requires all text to live inside a paragraph node. The buffer collects loose text and flushes it into a paragraph whenever a real block element comes along.
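Here is a runnable miniature of the same shape, with plain objects standing in for DOM nodes so it needs no JSDOM. SrcNode, OutNode, and the WRAPPER_CLASSES list are all simplifications invented for this sketch; the real parser works on JSDOM elements and emits full Lexical nodes.

```typescript
// Simplified stand-ins for DOM nodes and Lexical output (not the real shapes).
type SrcNode =
  | { kind: 'text'; text: string }
  | { kind: 'element'; tag: string; className?: string; children: SrcNode[] }

type OutNode =
  | { type: 'heading'; tag: string; text: string }
  | { type: 'paragraph'; text: string }

// Hypothetical wrapper classes; the real list depends on the page builder in use.
const WRAPPER_CLASSES = ['section', 'row', 'column', 'module', 'text-inner']

function textOf(node: SrcNode): string {
  return node.kind === 'text' ? node.text : node.children.map((c) => textOf(c)).join('')
}

function isWrapper(node: { tag: string; className?: string }): boolean {
  if (node.tag !== 'div') return false
  const classes = (node.className ?? '').split(/\s+/)
  return classes.some((c) => WRAPPER_CLASSES.indexOf(c) !== -1)
}

function processBlocks(nodes: SrcNode[]): OutNode[] {
  const out: OutNode[] = []
  let inlineBuffer: string[] = []

  const flushBuffer = (): void => {
    const text = inlineBuffer.join('').trim()
    if (text !== '') out.push({ type: 'paragraph', text })
    inlineBuffer = []
  }

  for (const node of nodes) {
    if (node.kind === 'text') {
      inlineBuffer.push(node.text) // bare text: hold until the next block arrives
    } else if (/^h[1-6]$/.test(node.tag)) {
      flushBuffer()
      out.push({ type: 'heading', tag: node.tag, text: textOf(node).trim() })
    } else if (node.tag === 'p') {
      flushBuffer()
      out.push({ type: 'paragraph', text: textOf(node).trim() })
    } else if (isWrapper(node)) {
      flushBuffer()
      out.push(...processBlocks(node.children)) // wrapper: recurse, emit nothing for it
    } else {
      inlineBuffer.push(textOf(node)) // unknown inline element: keep its text
    }
  }
  flushBuffer() // anything still buffered at the end becomes a final paragraph
  return out
}
```

The final flushBuffer call after the loop is the part that is easy to forget: without it, trailing bare text at the end of a container disappears.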
The classification rules are specific to the source site. In this project, the documentation parser handled things the policy page parser didn't (FAQ accordions, tabbed link blocks) and vice versa (tables, back-navigation links). Each script handles its own content type.
One thing the code above doesn't convey: the parser wasn't designed correctly on the first pass, or the second, or the third. The first dry-run produced 9 media blocks across 80 pages. That was clearly wrong. The source HTML had images everywhere. The fix for images inside <p> tags bumped it to 160. Then it turned out blocks nested inside <li> elements were being silently skipped. Then support buttons inside <p> tags weren't becoming CustomLink blocks. Then querySelector was only grabbing the first page builder text block on pages that had several. Each dry-run revealed a new class of missed content.
The parser wasn't designed but discovered, one bug at a time.
One bug from this phase didn't surface until much later, which is a lesson in itself. The parser generated Lexical node IDs with Date.now() plus a short random hex suffix. On fast iterations, Date.now() returns the same millisecond twice, and the random suffix occasionally collided, producing duplicate node IDs within a document. The migrated content rendered fine on the front-end, so nothing looked wrong. But opening those documents in the Payload admin editor crashed it with a React duplicate-key error. A parser bug, surfacing as a CMS crash. The fix was a proper unique ID generator; finding it meant tracing a React error in the admin panel all the way back to a migration script that had already been deleted.
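One way to rule out that class of collision is to stop depending on the clock at all. The generator below is a sketch, not the fix that shipped: createNodeIdGenerator is a hypothetical name, and the ID format is an assumption that may need adapting to whatever the block's id field expects.

```typescript
// Hypothetical ID generator: a counter guarantees uniqueness within one generator,
// and a random run tag makes clashes between separate runs unlikely.
function createNodeIdGenerator(prefix: string = 'node'): () => string {
  let counter = 0
  const runTag = Math.random().toString(16).slice(2, 10)
  return () => `${prefix}-${runTag}-${(counter++).toString(36)}`
}
```

Creating one generator per script run and passing it into the parser keeps IDs unique no matter how fast the loop iterates. Node's crypto.randomUUID() is another option when a standard UUID format is preferred.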
Speaking of which.
5. Accept That You'll Have Separate Scripts Per Content Type
This is something I chose not to gloss over, and it's worth being honest about. The project didn't end up with one unified migration system. I ended up with twelve separate scripts:
- migrate-media.ts: XML export media
- migrate-media-ftp.ts: 60+ FTP images not in WordPress
- migrate-theme-media.ts: ~100 theme images from the child theme
- migrate-clients.ts
- migrate-categories.ts
- migrate-platforms.ts: taxonomy terms for the integration catalog
- migrate-pages.ts: root landing pages (page builder hero detection)
- migrate-policy-pages.ts: 5 policy pages with table support
- migrate-docs.ts: ~65 hierarchical documentation pages (the big one)
- migrate-integrations.ts: ~100 integration catalog pages
- migrate-faqs.ts: FAQ custom post type from XML
- migrate-bespoke.ts: one hand-crafted page that didn't fit the general parser
Each has its own CLI flags (--dry-run, --limit, --slug, --verbose, --force-update), its own HTML-to-Lexical converter (ranging from "strip all tags and split by newlines" for landing pages to the full 900-line parser for documentation), and its own idempotency logic.
A planned unified CLI entry point (scripts/index.ts) never got built. Running individual scripts worked fine and took less time as well.
The fragmentation wasn't a failure of planning. It was a consequence of the "design the schema first" advice. Each content type had different fields, different block types, different HTML structures, and different parent/child relationships. The documentation needed hierarchy sorting and related-articles linking. The policy pages needed back-navigation stripping and table support. The integration catalog needed category taxonomy mapping. The FAQ custom post type had to be parsed from XML, not JSON.
I didn't want to optimize for DRY. It feels like a trap!
6. Lexical Block Constraints Need a Sanitization Pass
This one caught me off guard, and it cost a day of mysterious validation errors.
Lexical blocks have rules about what they can contain. A CollapsibleBlock might allow h2 through h6, paragraphs, lists, media, and horizontal rules but not h1, blockquotes, or nested collapsibles. Real source HTML doesn't always follow those rules.
The documentation parser hits this constantly because the WordPress FAQ plugin generates accordion sections, and those sections contain whatever HTML the editor put in them including other FAQ sections. Nest an accordion inside an accordion and Payload rejects the write.
The fix is a sanitization pass that runs between parsing and writing:
function sanitizeForCollapsible(lexical: any): any {
  if (!lexical?.root?.children) return lexical
  const sanitized: any[] = []

  for (const node of lexical.root.children) {
    // h1 → h2
    if (node.type === 'heading' && node.tag === 'h1') {
      sanitized.push({ ...node, tag: 'h2' })
      continue
    }
    // blockquote → paragraphs
    if (node.type === 'quote') {
      const paras = (node.children || []).filter((c: any) => c.type === 'paragraph')
      sanitized.push(...(paras.length > 0 ? paras : [makeParagraph(extractText(node))]))
      continue
    }
    // nested collapsible → flatten content into parent
    if (node.type === 'block' && node.fields?.blockType === 'collapsible') {
      sanitized.push(...(node.fields?.content?.root?.children || []))
      continue
    }
    sanitized.push(node)
  }

  // Only top-level children are inspected; deeper nodes pass through unchanged.
  return { root: { ...lexical.root, children: sanitized } }
}
Without this, roughly a quarter of documents can fail validation at write time. The error messages don't indicate which nested rule was violated, which is most of what makes this expensive to debug the first time.
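Since the validation errors don't name the offending node, a small pre-write check that does can shorten the first debugging session. This is a sketch: the three rules are the ones from the CollapsibleBlock example above, and LexicalNodeSketch and findCollapsibleViolations are hypothetical names.

```typescript
// Minimal node shape for the check (real Lexical nodes carry more fields).
type LexicalNodeSketch = {
  type: string
  tag?: string
  fields?: { blockType?: string }
  children?: LexicalNodeSketch[]
}

// Walk the tree and describe every node that breaks a collapsible's content rules.
function findCollapsibleViolations(
  children: LexicalNodeSketch[],
  path: string = 'root',
): string[] {
  const problems: string[] = []
  children.forEach((node, i) => {
    const here = `${path}.children[${i}]`
    if (node.type === 'heading' && node.tag === 'h1') {
      problems.push(`${here}: h1 is not allowed`)
    }
    if (node.type === 'quote') {
      problems.push(`${here}: blockquote is not allowed`)
    }
    if (node.type === 'block' && node.fields?.blockType === 'collapsible') {
      problems.push(`${here}: nested collapsible is not allowed`)
    }
    if (node.children) {
      problems.push(...findCollapsibleViolations(node.children, here))
    }
  })
  return problems
}
```

Logging these paths in dry-run mode points at the exact node, including ones nested inside list items that a top-level pass would not reach.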
7. WordPress URL Aliases Need Their Own Mapping...sometimes
This took a full afternoon to figure out and it's not obvious.
WordPress has a concept of "friendly" URL aliases. A documentation page might live at /docs/integrations/provider-a/setup-guide-full but also be reachable at /docs/provider-a/setup. The short URL is what appears in navigation links and "related articles" sections within other pages.
Payload doesn't have this concept. A document has one slug, one parent chain, one fullSlug. So when the source HTML contains a related-articles section with links like /docs/provider-a/setup, that path can't just be looked up in Payload. It doesn't exist.
The fix was a hardcoded lookup table mapping WordPress friendly URLs to the actual Payload fullSlug:
const wpFriendlyToFullSlug: Record<string, string> = {
  'docs/provider-a/setup': 'docs/integrations/provider-a/setup-guide-full',
  'docs/provider-b/webhooks': 'docs/integrations/provider-b/configuring-webhook-endpoints',
  'docs/api/v1': 'docs/api/v1/getting-started',
  'docs/api/v2': 'docs/api/v2/schema-reference',
  // ... 30 more entries
}
This table was built by manually inspecting which WordPress page each friendly URL pointed to. It's not glamorous work. But without it, related-articles links would silently fail to resolve, and the migrated site would have broken cross-linking.
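Using the table comes down to normalizing each link before the lookup. The sketch below is one way that could work; normalizePath, resolveFullSlug, and the knownFullSlugs set are assumptions for illustration, and the placeholder base URL only exists so relative links parse.

```typescript
// Reduce an href (absolute, protocol-less, or relative) to a bare path
// without leading or trailing slashes, query string, or fragment.
function normalizePath(href: string): string {
  const pathname = new URL(href, 'https://placeholder.invalid').pathname
  return pathname.replace(/^\/+|\/+$/g, '')
}

// Resolve a link to a Payload fullSlug: alias table first, then a direct match.
function resolveFullSlug(
  href: string,
  aliases: Record<string, string>,
  knownFullSlugs: Set<string>,
): string | null {
  const path = normalizePath(href)
  // hasOwnProperty avoids false hits on inherited keys such as "constructor".
  if (Object.prototype.hasOwnProperty.call(aliases, path)) return aliases[path] ?? null
  return knownFullSlugs.has(path) ? path : null
}
```

A null result is worth logging with the page slug, in the same spirit as the missing-media list, so unmapped aliases show up in the dry-run output instead of as broken links after launch.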
The SEO plugin's breadcrumb schema data was useful here as it was a structured BreadcrumbList JSON with absolute URLs and titles, which helped verify that the mapping was correct.
8. Build Dry-Run Mode From Day One
The highest-leverage decision is adding proper CLI flags at the start of the project, not at the end:
tsx migrate-docs.ts --dry-run
tsx migrate-docs.ts --dry-run --limit=5
tsx migrate-docs.ts --slug=getting-started
tsx migrate-docs.ts --force-update
These take an hour to add per script. They save days of cleanup over the course of the project. Once they exist, make --dry-run the default. Running the script accidentally should be harmless.
The --slug flag is particularly valuable when debugging a single problematic page. Running the parser against one document and inspecting the full Lexical output without touching the database saves 20 minutes of debugging.
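A dependency-free sketch of flag parsing with dry-run as the default. The --write flag is hypothetical: making dry-run the default needs some explicit opt-in for real writes, and the original scripts' name for it isn't given. parseFlags and MigrationFlags are also invented for this example.

```typescript
type MigrationFlags = {
  dryRun: boolean
  limit: number | null
  slug: string | null
  verbose: boolean
  forceUpdate: boolean
}

function parseFlags(argv: string[]): MigrationFlags {
  let write = false
  let forcedDryRun = false
  const flags: MigrationFlags = {
    dryRun: true,
    limit: null,
    slug: null,
    verbose: false,
    forceUpdate: false,
  }

  for (const arg of argv) {
    const eq = arg.indexOf('=')
    const name = eq === -1 ? arg : arg.slice(0, eq)
    const value = eq === -1 ? '' : arg.slice(eq + 1)

    if (name === '--dry-run') {
      forcedDryRun = true
    } else if (name === '--write') {
      write = true // hypothetical opt-in flag, not taken from the original scripts
    } else if (name === '--verbose') {
      flags.verbose = true
    } else if (name === '--force-update') {
      flags.forceUpdate = true
    } else if (name === '--slug') {
      if (value === '') throw new Error('--slug needs a value, e.g. --slug=getting-started')
      flags.slug = value
    } else if (name === '--limit') {
      if (!/^[1-9]\d*$/.test(value)) throw new Error(`--limit needs a positive integer, got "${value}"`)
      flags.limit = Number(value)
    } else {
      // Fail loudly on typos instead of silently running with defaults.
      throw new Error(`Unknown flag: ${arg}`)
    }
  }

  // An explicit --dry-run always wins, whatever else was passed.
  flags.dryRun = forcedDryRun || !write
  return flags
}
```

In a script this would be called as parseFlags(process.argv.slice(2)). Rejecting unknown flags is deliberate: a mistyped --dry-run that gets ignored is exactly the accident the default is meant to prevent.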
Running the migration in dry-run mode with verbose logging also turned out to be useful beyond catching script bugs. Inspecting every document on its way through the parser surfaced things worth flagging:
- Slug collisions across different parents, usually legitimate. Three pages named "setup" under different parent integrations is fine when the unique constraint is on fullSlug.
- Orphaned drafts with no slug, reachable only by ?page_id=12345 URLs.
- Media references without matching files in the export.
- One page whose HTML was unique enough that it warranted its own bespoke migration script rather than trying to bend the general parser to handle it.
The dry-run output ended up serving as the definitive content inventory.
9. Sort by Hierarchy Before Migrating
When content is hierarchical, parents need to be migrated before children.
The documentation pages formed a tree. The wpIdToPayloadId map needs the parent's Payload ID before setting the child's parent relationship field. Process children first and the lookup fails, leaving orphaned documents.
The fix was a topological sort that guarantees parents come before children:
function sortPagesByHierarchy(pages: WordPressPage[]): WordPressPage[] {
  const sorted: WordPressPage[] = []
  const remaining = [...pages]
  const added = new Set<number>()
  const allPageIds = new Set(pages.map((p) => p.id))

  while (remaining.length > 0) {
    const beforeLength = remaining.length

    for (let i = remaining.length - 1; i >= 0; i--) {
      const page = remaining[i]
      const parentInSet = allPageIds.has(page.parent)
      const parentMigrated = added.has(page.parent)

      if (!parentInSet || parentMigrated) {
        sorted.push(page)
        added.add(page.id)
        remaining.splice(i, 1)
      }
    }

    // Circular dependency guard
    if (remaining.length === beforeLength) {
      console.warn(`Circular dependency, force-adding ${remaining.length} pages`)
      sorted.push(...remaining)
      break
    }
  }

  return sorted
}
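The parentInSet check treats a page whose parent isn't in the set as a root, which mattered because one page in the export referenced a parent that didn't exist in the set. The circular dependency guard covers the remaining failure mode: if a full pass adds nothing, the leftover pages can only be waiting on each other, and without the guard the sort loops forever.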
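A cheap way to trust the ordering is to check it independently before any writes happen. The sketch below assumes pages carry the same id and parent fields the sort reads; PageRef and findOrderViolations are hypothetical names.

```typescript
type PageRef = { id: number; parent: number }

// Return the IDs of pages that appear before a parent that exists in the same set.
function findOrderViolations(sorted: PageRef[]): number[] {
  const allIds = new Set(sorted.map((p) => p.id))
  const seen = new Set<number>()
  const violations: number[] = []

  for (const page of sorted) {
    // Parents outside the set (such as 0 for top-level pages) count as roots.
    if (allIds.has(page.parent) && !seen.has(page.parent)) violations.push(page.id)
    seen.add(page.id)
  }
  return violations
}
```

An empty result means every parent precedes its children. Anything else lists the pages whose parent relationship would fail to resolve, which includes any pages that were force-added by the circular dependency guard.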
Closing Thoughts
The chaos isn't in the code. It's in the decisions and the modeling. Which content types become their own collections? Which URL conventions carry over? What counts as a "block" versus a "field"? The migration script just translates those decisions one at a time.
PS - I didn't write all of this code alone, either. Claude and GLM were on the team for most of it, especially through the parser layer where pattern-matching twelve variations of the same shortcode is exactly the tedious work AI is good at. The architectural calls stayed mine; the boilerplate didn't. If a deeper post on that workflow would be useful, covering what I prompted, where it failed, and where it shaved off days versus quietly added them back, say so in the comments and I'll write it up.
Keep your dry-runs greedy and your logs verbose.

























