HTML to Markdown vs PDF to Markdown
A side-by-side comparison of Convert HTML to Markdown and Convert PDF to Markdown.
HTML and PDF can look alike when rendered in a browser, but their source structure is very different. HTML declares its semantics with tags such as <h1>, <ul>, and <table>, so it converts to Markdown cleanly. PDF stores the positions of text glyphs and lines; that a line is a heading is implied by font size and weight, not declared.
That asymmetry is why PDF-to-Markdown is fundamentally lossier than HTML-to-Markdown. Tables in PDF often arrive as misaligned columns; headings as bold paragraphs; lists as glyphs followed by text.
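To make the "implied, not declared" point concrete, here is a minimal sketch of font-size heading inference. The `Span` type and its fields are hypothetical stand-ins for whatever a PDF extractor reports; real converters use more signals (spacing, position, numbering) than this.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Span:
    """One run of text as a hypothetical PDF extractor might report it."""
    text: str
    font_size: float
    bold: bool = False


def spans_to_markdown(spans):
    """Guess headings by comparing each span's font size to the body size."""
    # Assumption: the most frequent font size is body text.
    body_size = Counter(s.font_size for s in spans).most_common(1)[0][0]
    # Every larger size becomes a heading level, largest size first.
    heading_sizes = sorted(
        {s.font_size for s in spans if s.font_size > body_size}, reverse=True
    )
    lines = []
    for s in spans:
        if s.font_size in heading_sizes:
            level = min(heading_sizes.index(s.font_size) + 1, 6)
            lines.append("#" * level + " " + s.text)
        elif s.bold:
            # Bold at body size is ambiguous: heading or emphasis?
            # This sketch keeps it as a bold paragraph.
            lines.append("**" + s.text + "**")
        else:
            lines.append(s.text)
    return "\n\n".join(lines)


spans = [
    Span("Title", 24),
    Span("Intro text", 11),
    Span("Section", 16),
    Span("More text", 11),
    Span("Note", 11, bold=True),
]
print(spans_to_markdown(spans))
```

The last span shows the failure mode described above: a bold line at body size may really be a heading, but nothing in the data says so, and it comes out as a bold paragraph.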
When to use Convert HTML to Markdown
Use the HTML to Markdown converter when you have clean structured HTML — a CMS export, a blog scrape, an email body. Headings, lists, links, images, and tables translate predictably.
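As an illustration of why this direction is predictable, the sketch below maps a handful of tags straight to Markdown using only Python's standard `html.parser`. The `MiniMarkdown` class and `html_to_markdown` function are names made up for this example; the sketch covers only headings, paragraphs, unordered lists, and links, and it is not how any particular converter is implemented.

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class MiniMarkdown(HTMLParser):
    """Handles h1-h6, p, ul/li and a; text inside other tags passes through."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            # The tag name itself states the heading level.
            self.out.append("#" * int(tag[1]) + " ")
        elif tag == "li":
            self.out.append("- ")
        elif tag == "a":
            self.href = dict(attrs).get("href", "")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in HEADINGS or tag in ("p", "ul"):
            self.out.append("\n\n")
        elif tag == "li":
            self.out.append("\n")
        elif tag == "a":
            self.out.append("](" + (self.href or "") + ")")
            self.href = None

    def handle_data(self, data):
        # Drop whitespace-only text between tags.
        if data.strip():
            self.out.append(data)


def html_to_markdown(html):
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()


html = (
    '<h1>Title</h1><p>See <a href="https://example.com">the docs</a>.</p>'
    "<ul><li>One</li><li>Two</li></ul>"
)
print(html_to_markdown(html))
```

No guessing is involved: each output decision is read directly off a tag name, which is what "the structure is real" means in practice.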
When to use Convert PDF to Markdown
Use the PDF to Markdown converter when the only source you have is a PDF — academic papers, scanned documents, reports. Expect to clean up tables, footnotes, and column flow by hand afterward; the conversion is a starting point, not a finished doc.
Side-by-side comparison
| | Convert HTML to Markdown | Convert PDF to Markdown |
|---|---|---|
| Source structure | Semantic markup (h1, ul, table) | Positioned glyphs / lines |
| Heading detection | Explicit (h1–h6 tags) | Inferred from font size/weight |
| List detection | Explicit (ul, ol, li) | Inferred from bullets/numbering |
| Table quality | Good — explicit table markup | Poor — column alignment lost |
| Multi-column layouts | Content follows source order; CSS layout is dropped | May interleave columns in output |
| Images | Embedded as ![]() with src | Extracted as separate files |
| Lossy? | Mostly no — styles dropped | Yes — layout signals lost |
| Manual cleanup | Usually minimal | Often significant (tables, footnotes) |
Bottom line
If you can get HTML, convert from HTML — the structure is real. PDF is for when there is no other source, and the output always needs a human pass.
Frequently asked questions
Why are PDF tables such a mess after conversion?
PDFs store text glyphs at x/y positions — columns are inferred from spacing, not declared. Converters use heuristics that work for clean tables and fall apart on merged cells, footnotes, or rotated text.
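A rough sketch of such a heuristic is shown below. It assumes words arrive as hypothetical `(x, y, text)` tuples with `y` increasing down the page, and it splits a row into cells wherever the gap between word start positions exceeds a threshold. The `infer_columns` name and the `gap` value are illustrative choices, not taken from any specific converter.

```python
def infer_columns(words, gap=60.0):
    """Group (x, y, text) words into rows by y, then split rows into cells by x gaps."""
    rows = {}
    for x, y, text in words:
        # Words whose y rounds to the same value are treated as one row.
        rows.setdefault(round(y), []).append((x, text))

    table = []
    for y in sorted(rows):
        cells, current, last_x = [], [], None
        for x, text in sorted(rows[y]):
            # A large horizontal jump is taken to mean "next column".
            if last_x is not None and x - last_x > gap:
                cells.append(" ".join(current))
                current = []
            current.append(text)
            last_x = x
        cells.append(" ".join(current))
        table.append(cells)
    return table


words = [
    (50, 100, "Name"), (200, 100, "Qty"),
    (50, 120, "Widget"), (95, 120, "A"), (200, 120, "3"),
    (200, 140, "7"),  # row whose first cell is empty
]
print(infer_columns(words))
```

The first two rows come out as two cells each, but the third row yields a single cell with nothing to say it belongs in the second column. An empty or merged cell is enough to shift the data, which is the kind of breakage the answer above describes.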
Does OCR help PDFs?
Yes for scanned (image) PDFs: OCR is the only way to get text at all. For text-native PDFs (those with a real text layer), OCR is redundant and often worse than direct extraction.
Can I preserve footnotes from a PDF?
Most converters either leave footnote text inline where it sat at the bottom of the page or skip it. Round-tripping numbered footnotes (¹ ² ³) reliably requires manual editing in nearly every case.
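Part of that manual pass can be scripted. The sketch below rewrites superscript digits as `[^n]` references, the footnote syntax that several Markdown flavors support as an extension (it is not part of core CommonMark). The function name is made up for this example, and matching each reference to its footnote text still has to be done by hand.

```python
import re

# Map Unicode superscript digits to ordinary digits.
SUPERSCRIPT_DIGITS = str.maketrans("⁰¹²³⁴⁵⁶⁷⁸⁹", "0123456789")


def superscripts_to_footnote_refs(text):
    """Replace runs of superscript digits with [^n] footnote references."""
    return re.sub(
        r"[⁰¹²³⁴⁵⁶⁷⁸⁹]+",
        lambda m: "[^" + m.group().translate(SUPERSCRIPT_DIGITS) + "]",
        text,
    )


print(superscripts_to_footnote_refs("Results improved¹ and held² over time."))
```

One caveat: this also rewrites genuine exponents such as "m²", so the output needs a review before it is trusted.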
What about converting Markdown back to HTML or PDF?
Both are straightforward and essentially lossless going forward from Markdown: Markdown → HTML is what every renderer does, and Markdown → PDF typically goes through HTML or LaTeX. The lossy direction is the reverse.