Preserving semantic styles from DOCX/ODT in Go
jk-kaluga · 2026-05-29 · via DEV Community

Most manuscript conversion tools are very good at moving text from one format to another.

That sounds like enough until you work with a manuscript where styles are not just decoration.

In many Word or LibreOffice files, a paragraph style called Poem does not mean “make this text indented”. It means “this is a poem”. A character style called Foreign - Latin does not only mean “italic text”. It means “this span is foreign-language text, probably with language-specific handling later”.

That distinction is easy to lose.

Once it is gone, every later step has to guess. Is this italic text a thought, a book title, a foreign phrase, emphasis, or just an accident from the editor? Is this indented block a quote, a poem, a letter, or a layout workaround?

I wanted a converter that treats named styles as manuscript semantics first, and visual output second.

That is the idea behind Tessera.

The problem with “just convert it”

DOCX and ODT are not pleasant formats, but they do carry useful information. Authors and editors already use named paragraph and character styles in tools they understand.

The problem is that a lot of conversion workflows flatten those styles into presentation:

  • Poem becomes indented paragraphs.
  • Epigraph becomes a styled quote.
  • Foreign - Latin becomes italic text.
  • Direct Thought becomes italic text too.

After conversion, two very different meanings can look identical.

This may be fine for a one-off document.

It is less fine for book production, where the same source may need to produce EPUB, PDF, LaTeX, test fixtures, and review artifacts. At that point, preserving meaning becomes more useful than preserving whatever the original document happened to look like on one machine.

Why not just use Pandoc or Calibre?

Pandoc and Calibre are excellent tools. I use them, and I do not think Tessera should be described as a replacement for either of them.

They solve broader problems.

Tessera is narrower. It is built around one specific assumption: in a manuscript, named styles are the source of truth.

That means I do not want to infer that a paragraph is a poem because it is indented or italic. I want the manuscript to say it is a poem through a style name, then carry that role through the whole pipeline.

For generic conversion, broad format support is a strength. For this workflow, a stricter model is useful:

  • parse DOCX or ODT;
  • map named styles to known roles;
  • build a semantic intermediate representation;
  • render EPUB, LaTeX, and PDF from that representation;
  • make the output reproducible enough to test.

That is a smaller problem than “convert anything to anything”, but it gives more control over the book-specific cases I care about.
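
The "map named styles to known roles" step can be sketched in a few lines of Go. This is only an illustration of the idea, not Tessera's actual code: the Role type, the role names, the styleRoles table, and RoleForStyle are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Role is a semantic role carried through the pipeline.
// The role names below are illustrative, not Tessera's actual set.
type Role string

const (
	RolePoem          Role = "poem"
	RoleEpigraph      Role = "epigraph"
	RoleForeignLatin  Role = "foreign-latin"
	RoleDirectThought Role = "direct-thought"
)

// styleRoles maps normalized style names to roles (hypothetical table).
var styleRoles = map[string]Role{
	"poem":            RolePoem,
	"epigraph":        RoleEpigraph,
	"foreign - latin": RoleForeignLatin,
	"direct thought":  RoleDirectThought,
}

// RoleForStyle looks up a style name, ignoring case and surrounding spaces.
// Unknown styles return ok == false so the caller can report them
// instead of guessing a role from formatting.
func RoleForStyle(name string) (role Role, ok bool) {
	role, ok = styleRoles[strings.ToLower(strings.TrimSpace(name))]
	return role, ok
}

func main() {
	for _, name := range []string{"Poem", "Foreign - Latin", "Body Text"} {
		if role, ok := RoleForStyle(name); ok {
			fmt.Printf("%q -> %s\n", name, role)
		} else {
			fmt.Printf("%q -> no known role\n", name)
		}
	}
}
```

The lookup never inspects indentation or italics. A style either has a known role or it does not, which matches the assumption that named styles are the source of truth.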

The pipeline

The pipeline is:

DOCX / ODT -> semantic IR -> LaTeX + EPUB

The input parser reads the document package, extracts text, metadata, styles, images, notes, and structure, then maps known style names
into semantic roles.

The intermediate representation is the important layer. It is not HTML, and it is not LaTeX. It is closer to “what the manuscript means”.

From there, Tessera can render different outputs:

  • EPUB XHTML for ebooks;
  • LaTeX for print-oriented workflows;
  • PDF through a TeX engine;
  • canonical IR JSON for tests and debugging.

That IR layer makes the rest of the system easier to reason about. If a parser bug changes document meaning, it shows up in the IR before anything is rendered. If an EPUB renderer and a LaTeX renderer disagree, they are disagreeing over the same structured input.

A small example

Imagine a manuscript with these styles:

Paragraph style: Poem
Paragraph style: Epigraph
Character style: Foreign - Latin
Character style: Direct Thought

A generic conversion pipeline might turn several of those into italics and indentation.

Tessera keeps them separate.

A phrase marked as Foreign - Latin can become language-aware text:

veritas

A span marked as Direct Thought can remain a thought role:

a private thought

A paragraph marked as Poem can become a verse block instead of just a visually indented paragraph:

\begin{verse}
First semantic line\\
Second semantic line
\end{verse}

An Epigraph can remain an epigraph in both EPUB and LaTeX output, instead of becoming just another quote-like block.

The point is not that these exact tags or macros are magical. The point is that the decision is made once, from manuscript intent, and
then each output format gets a suitable representation.

The annoying parts

The hardest parts were not the obvious ones.

Parsing XML is expected. Working with ZIP-based document formats is expected. The more annoying work was making the pipeline boring and
testable.

DOCX and ODT are packages, not single files. The useful data is spread across XML files, relationships, metadata, media, and style
definitions. Small differences between Word and LibreOffice matter. Style names can have visible names, internal IDs, localized names, and
inherited behavior.
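
To illustrate the package structure and the split between internal style IDs and visible names, here is a small sketch that reads word/styles.xml from a DOCX held in memory. The part path and the w:styleId and w:name attributes are standard WordprocessingML; the stylesDoc type, StyleNames, and the sampleDocx helper are hypothetical and exist only to keep the example self-contained.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
)

// stylesDoc mirrors the parts of word/styles.xml needed here.
// Struct tags without a namespace match on the local name, so the
// w: prefix does not have to be spelled out.
type stylesDoc struct {
	Styles []struct {
		ID   string `xml:"styleId,attr"`
		Name struct {
			Val string `xml:"val,attr"`
		} `xml:"name"`
	} `xml:"style"`
}

// StyleNames returns a map from internal style ID to visible style name.
func StyleNames(docx []byte) (map[string]string, error) {
	zr, err := zip.NewReader(bytes.NewReader(docx), int64(len(docx)))
	if err != nil {
		return nil, err
	}
	f, err := zr.Open("word/styles.xml")
	if err != nil {
		return nil, err
	}
	defer f.Close()
	raw, err := io.ReadAll(f)
	if err != nil {
		return nil, err
	}
	var doc stylesDoc
	if err := xml.Unmarshal(raw, &doc); err != nil {
		return nil, err
	}
	names := make(map[string]string, len(doc.Styles))
	for _, s := range doc.Styles {
		names[s.ID] = s.Name.Val
	}
	return names, nil
}

// sampleDocx builds a tiny ZIP that contains only a styles part.
// It is not a complete DOCX; it only feeds the example above.
func sampleDocx() []byte {
	const styles = `<?xml version="1.0" encoding="UTF-8"?>
<w:styles xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:style w:type="paragraph" w:styleId="Poem"><w:name w:val="Poem"/></w:style>
  <w:style w:type="character" w:styleId="ForeignLatin"><w:name w:val="Foreign - Latin"/></w:style>
</w:styles>`
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create("word/styles.xml")
	if err != nil {
		panic(err)
	}
	if _, err := w.Write([]byte(styles)); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	names, err := StyleNames(sampleDocx())
	if err != nil {
		panic(err)
	}
	fmt.Println(names["ForeignLatin"])
}
```

Paragraphs and runs refer to styles by ID, while authors see the visible name, so a mapping like this is needed before any style-to-role lookup can happen.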

Then there is deterministic output.

For normal use, a generated EPUB just needs to open. For tests and CI, it is much better if the same input produces the same output. That
means paying attention to timestamps, ordering, metadata, ZIP entries, generated identifiers, and canonical JSON.

EPUB linting was another useful constraint. It is easy to generate XHTML that looks fine in one reader and is still structurally wrong. A
built-in lint pass catches project-specific mistakes early, and external tools like epubcheck can still be used in stricter workflows.

None of this is glamorous, but it is the difference between a demo converter and something I can trust in a repeatable publishing
pipeline.

Current status

Tessera is still early.

The core shape is in place:

  • DOCX and ODT input;
  • semantic IR;
  • EPUB output;
  • LaTeX output;
  • PDF builds through TeX;
  • CLI commands for build, inspect, lint, and doctor;
  • Docker and GitHub Action support;
  • tests around the parser and renderers.

The project is intentionally not a GUI, not a SaaS uploader, and not a general-purpose Markdown converter. It is a command-line tool for
manuscripts where named styles carry the structure of the book.

If that sounds close to a problem you have, the repository is here:

https://github.com/balyakin/tessera