Every developer has that one tool they rely on that's closed-source, overpriced, and does exactly one thing no open-source alternative does properly.
For me, it was a PDF to SVG converter. Specifically, one that keeps text as actual text. Not outlines. Not paths. Real, selectable, searchable text. Every open-source tool I tried (and I tried them all) either rasterized the text, converted it to paths, or mangled the character encoding so "DROGUES" became "DECLES" and "Skip selected CIMA exams" became "6NLSVlOHFWlQR".
One commercial product did it perfectly. An 81MB binary. Closed source. Expensive license.
So I decompiled it.
The problem with PDF text
PDFs are not what you think they are. They're not documents. They're rendering instructions. A PDF doesn't say "here's the word Hello in Arial." It says "select font object 7, move to coordinates 234.5, 612.3, and paint glyph indices 43, 28, 55, 55, 62."
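Concretely, the drawing operators inside a PDF content stream look something like this (schematic; the hex string shows those same glyph codes 43, 28, 55, 55, 62 as bytes, which is an illustration, not output from a real file):

```
BT                    % begin a text object
/F7 12 Tf             % select font resource /F7 at 12pt
234.5 612.3 Td        % move the text cursor
<2B1C37373E> Tj       % paint glyph codes 0x2B 0x1C 0x37 0x37 0x3E
ET                    % end the text object
```

Nowhere in that stream does the word itself appear; recovering it means undoing the font's code-to-glyph mapping.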
The font inside the PDF maps those indices to shapes. But here's the problem: there's no single standard way that PDFs connect "the letter A" to "glyph number 43." Over the decades, Adobe and others invented at least six different encoding schemes for this: base encodings like WinAnsiEncoding, per-font /Differences arrays, glyph names resolved through the Adobe Glyph List, the font's own internal cmap table, predefined CID CMaps, and /ToUnicode CMaps. Some PDFs use one method, some use another, some use three at once.
Most open-source converters handle maybe two of these methods. When they hit a PDF using one they don't understand, the text comes out garbled. The commercial tool handled all six. That's why it worked and nothing else did.
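To make the problem concrete, here is a sketch of the kind of fallback chain a converter needs. The structure and names are mine, not lifted from any particular tool; each map stands in for one encoding mechanism a PDF may or may not carry:

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Each member is one of the encoding mechanisms a PDF font may carry.
struct FontEncoding {
    std::map<uint32_t, uint32_t> to_unicode;     // /ToUnicode CMap: code -> Unicode
    std::map<uint32_t, std::string> differences; // /Differences: code -> glyph name
    std::map<std::string, uint32_t> glyph_list;  // Adobe Glyph List: name -> Unicode
    std::map<uint32_t, uint32_t> base_encoding;  // e.g. WinAnsiEncoding: code -> Unicode
};

// Resolve a character code to Unicode, trying each source in priority order.
// A converter that only implements one or two of these branches produces
// garbage on PDFs that rely on the others.
std::optional<uint32_t> code_to_unicode(const FontEncoding& enc, uint32_t code) {
    if (auto it = enc.to_unicode.find(code); it != enc.to_unicode.end())
        return it->second;                        // best source when present
    if (auto it = enc.differences.find(code); it != enc.differences.end())
        if (auto g = enc.glyph_list.find(it->second); g != enc.glyph_list.end())
            return g->second;                     // glyph name -> Unicode
    if (auto it = enc.base_encoding.find(code); it != enc.base_encoding.end())
        return it->second;                        // fall back to the base encoding
    return std::nullopt;                          // this is where garbled text starts
}
```

When a chain like this silently returns the wrong branch for a font, you get exactly the "DROGUES" becomes "DECLES" failure mode.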
Enter Ghidra
Ghidra is the NSA's open-source reverse engineering tool. Yes, that NSA. They released it in 2019 and it's genuinely one of the best disassemblers/decompilers available. Free.
I pointed it at the 81MB binary and let it chew. The analysis took a while, but what came out was remarkable. Decompiled C code. Not pretty C code. The kind of C code that makes you want to lie down. But readable enough to understand the architecture.
The binary had over a dozen libraries baked into it. Font rendering, image decoding, compression, encryption, Unicode handling, even a JavaScript engine. All bundled into one giant executable. Plus a commercial PDF parsing library that costs real money to license.
From the build paths left in the binary (developers: strip your binaries), I could see the original source file names. The font processing pipeline was laid out in front of me.
The key discovery
Buried in the decompiled code was a font-building routine. Here's what it did in plain English: when you convert PDF text to SVG, you need to rebuild the font so that browsers know which character maps to which shape. The letter "A" needs to point to the right glyph. If that mapping is wrong, you get garbage.
The commercial product's entire advantage came down to one thing: it used an open-source font library (FreeType, free, available to anyone) and asked it "hey, what glyph goes with this character?" for every single character in the document. FreeType already knows how to handle every font format out there. One function call, correct answer every time.
That's it. That's the secret. The expensive proprietary tool was just using a free library correctly. Every open-source converter I'd tried was trying to figure out the font mapping manually, parsing internal font tables themselves, and getting it wrong. The commercial one just asked FreeType.
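The lookup in question is FreeType's public charmap API. A minimal sketch of the call, with error handling trimmed and a placeholder font path (the exact call site in the commercial binary is my inference, but `FT_Get_Char_Index` is the standard way to ask FreeType this question):

```cpp
#include <ft2build.h>
#include FT_FREETYPE_H
#include <cstdio>

int main() {
    FT_Library lib;
    if (FT_Init_FreeType(&lib)) return 1;

    FT_Face face;
    // Placeholder path: any font file extracted from the PDF.
    if (FT_New_Face(lib, "extracted_font.ttf", 0, &face)) return 1;

    // The whole trick: let FreeType parse the font's cmap table, in
    // whatever format it happens to use, and answer "which glyph
    // renders this character?" -- no manual table parsing.
    FT_UInt gid = FT_Get_Char_Index(face, 'A');
    std::printf("glyph index for 'A': %u\n", gid);  // 0 means .notdef

    FT_Done_Face(face);
    FT_Done_FreeType(lib);
}
```

Build against the system FreeType (e.g. `pkg-config --cflags --libs freetype2`). One function call per character, and the mapping is right for every font format FreeType supports.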
Let me be honest about what happened next
I am not a C++ systems programmer. I don't have deep knowledge of font file formats or PDF internals. I know enough to read decompiled code and understand what it's doing at a high level. I know enough to guide the work. But I did not sit down and hand-write the low-level font rebuilding code.
Claude Opus did.
I fed it everything Ghidra gave me. The decompiled functions, the extracted strings, the library references. Thousands of lines of reverse-engineered code that would take a human expert days to make sense of. Opus read through it, understood the patterns, identified what the code was actually doing, and reconstructed the entire pipeline in clean, modern C++.
Not blindly. I reviewed every file. I caught issues with font sizing, fixed edge cases where certain fonts rendered wrong, and debugged problems by comparing our output against the reference SVG pixel by pixel. But the heavy lifting, the part where you turn ugly decompiled code into structured, working C++ that correctly rebuilds fonts from scratch? That was Opus.
And it worked. Not "kind of worked." Worked worked. The same PDF that every open-source tool mangled came out with perfect text, correct fonts, proper positioning.
The implementation
The stack is simple:
- MuPDF for reading PDFs. It's open-source and lets you intercept everything the PDF is trying to draw.
- FreeType for font handling. Same library the commercial product used.
- C++, no frameworks, no dependencies beyond those two.
The converter hooks into MuPDF's rendering pipeline. As MuPDF reads through the PDF, it tells us "here's some text," "here's a shape," "here's an image." We capture all of that in order and translate it to SVG. Fonts get extracted, rebuilt so browsers can render them, and packed directly into the SVG file as base64 data. Same with images. The result is one single SVG file with everything baked in. No external dependencies. Open it in any browser and it looks exactly like the PDF, except now you can select the text.
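The embedding step is simple enough to sketch in plain C++. The function names are mine, and `font/ttf` as the data-URI media type is an assumption (the rebuilt font's exact flavor can vary); the base64 part is just the standard encoding:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Standard base64 encoding of raw bytes.
std::string base64(const std::vector<uint8_t>& data) {
    static const char* tbl =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    size_t i = 0;
    for (; i + 2 < data.size(); i += 3) {          // full 3-byte groups
        uint32_t n = (data[i] << 16) | (data[i + 1] << 8) | data[i + 2];
        out += tbl[(n >> 18) & 63]; out += tbl[(n >> 12) & 63];
        out += tbl[(n >> 6) & 63];  out += tbl[n & 63];
    }
    if (i + 1 == data.size()) {                    // one trailing byte
        uint32_t n = data[i] << 16;
        out += tbl[(n >> 18) & 63]; out += tbl[(n >> 12) & 63]; out += "==";
    } else if (i + 2 == data.size()) {             // two trailing bytes
        uint32_t n = (data[i] << 16) | (data[i + 1] << 8);
        out += tbl[(n >> 18) & 63]; out += tbl[(n >> 12) & 63];
        out += tbl[(n >> 6) & 63];  out += '=';
    }
    return out;
}

// Wrap a rebuilt font as a CSS @font-face rule for the SVG's <style>
// block; "family" is the name the <text> elements then reference.
std::string font_face_rule(const std::string& family,
                           const std::vector<uint8_t>& font_bytes) {
    return "@font-face { font-family: '" + family +
           "'; src: url(data:font/ttf;base64," + base64(font_bytes) + "); }";
}
```

With every font and image inlined this way, the SVG has nothing to fetch: it is one self-contained file.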
Build it:
cd pdf2svg_cpp/build && cmake .. && make
./pdf2svg input.pdf output.svg
One command. One file out. No dependencies at runtime.
What I actually learned
Decompilation is underrated as a learning tool. I learned more about how fonts actually work from reading decompiled code than from any official documentation. The docs tell you what's supposed to be there. The decompiled code shows you what actually matters and what you can ignore.
Commercial software isn't magic. The 81MB binary wasn't doing anything revolutionary. It was using the same free libraries anyone can download, with solid engineering around the font handling. The "secret sauce" was just doing the boring stuff correctly.
AI is genuinely good at this kind of work. Reconstructing a C++ codebase from decompiled output is exactly the kind of task where AI excels. It's pattern matching at scale, understanding data structures from mangled variable names, recognizing standard algorithms in non-standard code. I could not have done this alone. Not because I lack the ability to learn it, but because the time investment would have been months, not days.
Is this legal?
Reverse engineering for interoperability is generally protected in most jurisdictions. I didn't copy any proprietary code. I studied how the binary worked, understood the approach, and wrote a clean-room implementation using open-source libraries. The decompiled code was a reference for understanding the algorithm, not a source to copy from. The actual implementation is original C++ using MuPDF and FreeType's public APIs.
That said, I'm not a lawyer. I'm a developer who wanted his PDFs converted properly.
The point
There's a pattern here that keeps repeating in my projects. Find a problem that only expensive proprietary tools solve. Understand how they solve it. Build an open-source version.
The twist this time is that "understand how they solve it" meant pointing Ghidra at a binary and feeding the output to an AI that's better at reading decompiled C than I am. Five years ago this project would have required a team with deep font engineering expertise. Today it required one stubborn developer, an NSA decompiler, and Claude Opus.
The converter is fast, produces clean SVGs, and the text is real text. That's all I wanted.