---
title: "Zerox vs Docling vs Marker: AI PDF Parsing for Pipelines 2026"
author: "Nate Laquis"
author_role: "Founder & CEO"
date: "2026-07-05"
category: "Technology"
tags:
  - Zerox vs Docling vs Marker
  - AI PDF parsing 2026
  - vision LLM document parsing
  - RAG pipeline PDF extraction
  - self-hosted PDF to markdown
excerpt: "Vision LLM parsing, layout analysis, and heuristic conversion take fundamentally different paths to the same goal. We benchmarked Zerox, Docling, and Marker across 400 real-world PDFs to find out which approach wins."
reading_time: "13 min read"
canonical_url: "https://kanopylabs.com/blog/zerox-vs-docling-vs-marker-pdf-parsing"
---

# Zerox vs Docling vs Marker: AI PDF Parsing for Pipelines 2026

## Three Philosophies for Turning PDFs into LLM-Ready Text

PDF parsing sits at the foundation of every RAG pipeline, every document intelligence product, and every compliance automation workflow. Get it wrong and everything downstream suffers: embeddings drift, retrieval tanks, and your LLM hallucinates because it never saw the real data. In the last 18 months three open-source projects have pulled ahead of the pack, each betting on a fundamentally different strategy for cracking PDFs open.

**Zerox** sends each page as an image to a vision-capable LLM (GPT-4o, Claude, Gemini) and asks it to return structured markdown. **Docling**, open-sourced by IBM Research, runs specialized layout-analysis and table-extraction models locally to reconstruct document structure. **Marker** takes a speed-first heuristic approach, combining pdftext extraction with lightweight ML models for layout detection and LaTeX equation rendering.

We have deployed all three in client projects ranging from 2,000-page legal discovery batches to 300,000-page monthly ingestion pipelines. The right choice depends on your document complexity, throughput requirements, cost ceiling, and whether you can send data to external APIs. This guide covers the real tradeoffs with numbers from our own benchmarks.

![Server racks in a data center powering AI document parsing infrastructure](https://images.unsplash.com/photo-1558494949-ef010cbdcc31?w=800&q=80)

## Zerox: Vision LLM Parsing That Reads Like a Human

Zerox takes the most radical approach of the three. Instead of trying to decode PDF internals, it renders each page to an image, sends that image to a multimodal LLM, and prompts the model to return the page content as clean markdown. The idea is simple: if a vision model can describe a photograph, it can certainly read a document page.

### How It Works Under the Hood

Zerox converts each PDF page to a PNG at 300 DPI using pdf2image. Each image is sent to a configurable vision model with a system prompt instructing it to reproduce all visible text, tables, and structure as markdown. You can swap providers freely: GPT-4o, Claude Sonnet or Opus, Gemini 2.0 Flash, or any OpenAI-compatible endpoint. A single function call processes an entire PDF and returns concatenated markdown.
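The per-page loop described above can be sketched in a few lines. This illustrates the pattern only and is not Zerox's actual API: `pages_to_markdown`, `call_vision_model`, `SYSTEM_PROMPT`, and `fake_model` are hypothetical names, and the fake model stands in for a real provider call so the example runs offline.

```python
from typing import Callable, Iterable, List

# Hypothetical prompt; the real tool ships its own system prompt.
SYSTEM_PROMPT = (
    "Convert the following PDF page to markdown. "
    "Reproduce all visible text, tables, and structure."
)


def pages_to_markdown(
    page_images: Iterable[bytes],
    call_vision_model: Callable[[str, bytes], str],
) -> str:
    """Send each rendered page to a vision model and join the results."""
    chunks: List[str] = []
    for image in page_images:
        # One model call per page; the model returns that page as markdown.
        chunks.append(call_vision_model(SYSTEM_PROMPT, image).strip())
    # Pages are concatenated in document order, separated by a blank line.
    return "\n\n".join(chunks)


def fake_model(prompt: str, image: bytes) -> str:
    # Stand-in for a real provider call (assumption for this sketch).
    return f"## Page ({len(image)} bytes)\n"


if __name__ == "__main__":
    print(pages_to_markdown([b"page-1-png", b"page-2-png"], fake_model))
```

Keeping the model call behind a plain callable is what makes swapping providers cheap: only that one function changes.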

### Strengths

Accuracy on visually complex documents is outstanding. A frontier vision model interprets the page the way a human reader would, so Zerox handles scanned documents, multi-column layouts, nested tables, forms with checkboxes, and watermarked pages. In our testing it achieved 93% correct table structure extraction on complex financial tables, the highest of the three tools. Setup is trivial: install the package, pass your API key and a file path, and you get markdown back.

### Weaknesses

Cost is the elephant in the room. Every page requires a vision API call. Using GPT-4o, each page costs roughly $0.01 to $0.03. At 100K pages per month, that is $1,000 to $3,000 in pure API spend. Gemini 2.0 Flash cuts that by roughly 80%, but accuracy drops on complex tables.

Speed is constrained by API throughput. A single page takes 3 to 12 seconds depending on model and provider load. Parallelizing helps, but rate limits cap effective throughput. Measured per request, Zerox is the slowest option by a wide margin; it only approaches the throughput of a locally hosted parser when you run many requests concurrently.
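One common way to parallelize without tripping rate limits is to cap the number of in-flight requests with a semaphore. The sketch below assumes a hypothetical async `parse_one` function that wraps your provider call; `parse_pages` and `max_concurrency` are illustrative names, not part of any library.

```python
import asyncio
from typing import Awaitable, Callable, List


async def parse_pages(
    pages: List[bytes],
    parse_one: Callable[[bytes], Awaitable[str]],
    max_concurrency: int = 10,
) -> List[str]:
    """Parse pages concurrently while capping in-flight requests."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(page: bytes) -> str:
        async with semaphore:  # at most max_concurrency calls run at once
            return await parse_one(page)

    # gather preserves input order, so results come back in page order.
    return list(await asyncio.gather(*(guarded(p) for p in pages)))
```

In practice you would tune `max_concurrency` to your provider's rate limits and add retry with backoff around `parse_one`.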

Data privacy is the other concern. Every page image leaves your infrastructure and travels to a third-party API. For documents under strict data residency requirements, this can be a non-starter unless you run a self-hosted vision model, which introduces its own complexity and cost.

### Pricing

Zerox itself is free and MIT-licensed. Your cost is purely the LLM API bill. With GPT-4o: roughly $0.015/page. With Claude Sonnet: roughly $0.008/page. With Gemini 2.0 Flash: roughly $0.002/page. Self-hosted vision models (like a quantized LLaVA variant) can bring that to near-zero but with significant accuracy degradation on tables.
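To see how those per-page rates scale, here is a small estimator. The prices are the rough figures quoted above, not provider list prices, and the dictionary keys and `monthly_api_cost` are names made up for this sketch.

```python
# Rough per-page prices from this article (USD); treat them as assumptions.
PRICE_PER_PAGE_USD = {
    "gpt-4o": 0.015,
    "claude-sonnet": 0.008,
    "gemini-2.0-flash": 0.002,
}


def monthly_api_cost(pages_per_month: int, model: str) -> float:
    """Estimated monthly API spend in USD for a given page volume."""
    return round(pages_per_month * PRICE_PER_PAGE_USD[model], 2)


if __name__ == "__main__":
    for model in PRICE_PER_PAGE_USD:
        cost = monthly_api_cost(100_000, model)
        print(f"{model}: ${cost:,.0f}/month at 100K pages")
```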

## Docling: IBM's Layout-Analysis Powerhouse

Docling approaches the problem from the computer vision and document understanding research tradition. Rather than delegating comprehension to a general-purpose LLM, it runs purpose-built models that detect layout regions, classify element types, and reconstruct table structures from cell-level predictions.

### How It Works Under the Hood

Docling uses IBM's DocLayNet-based model for layout segmentation, identifying 11 element types including titles, section headers, body text, lists, tables, figures, and footnotes. For tables it runs a TableFormer model that predicts cell boundaries, spanning relationships, and row/column headers. OCR is pluggable: EasyOCR, Tesseract, or a custom backend. Everything runs locally on CPU or GPU with no external API calls.

Output is Docling's DoclingDocument format, preserving hierarchical relationships between elements, reading order, bounding box coordinates, and page-level segmentation. You can export to markdown, JSON, or feed directly into Docling's built-in chunker for RAG pipelines.

### Strengths

Self-hosting is Docling's defining advantage. Everything runs on your infrastructure under the MIT license with zero data leaving your network. Layout analysis quality is strong: multi-column detection, reading order reconstruction, and footnote handling are best-in-class among local tools. The structural metadata is the richest of any parser we have tested, which pays dividends for advanced chunking strategies.

CPU inference is genuinely practical. A 10-page PDF processes in about 8 seconds on an M2 MacBook or a c5.xlarge. GPU acceleration on a T4 drops that to roughly 2 seconds. For volumes under 50K pages per month, CPU-only deployment is viable.

### Weaknesses

Scanned document handling depends on your OCR backend. The default EasyOCR produces more errors on English text than Tesseract, and neither matches a vision LLM on degraded scans or unusual fonts. Table extraction falls behind Zerox on complex tables with irregular spanning or tables embedded inside graphics.

The ecosystem is still maturing. Documentation has improved since the 1.0 release but still has gaps. Edge cases like encrypted PDFs and malformed cross-reference tables require manual workarounds.

### Pricing

Free and open-source. Compute costs for self-hosting: roughly $50 to $100/month for CPU inference on a c5.xlarge, or $300 to $400/month for GPU inference on a g5.xlarge with roughly 5x higher throughput.

## Marker: Speed-First PDF to Markdown Conversion

Marker optimizes for a different objective than the other two. Where Zerox prioritizes accuracy and Docling prioritizes structure, Marker prioritizes speed and simplicity. It is designed to convert large volumes of PDFs to markdown as quickly as possible with good-enough quality for most downstream tasks.

### How It Works Under the Hood

Marker uses pdftext for native text extraction from digital PDFs, avoiding OCR when the PDF has an embedded text layer. For layout detection it runs a lightweight LayoutLM-based model that identifies headings, body text, tables, figures, and code blocks. Tables are reconstructed using heuristic rules and a small detection model. Equations are converted to LaTeX via a Nougat-derived model. For scanned PDFs, it falls back to Surya OCR, supporting 90+ languages. The entire pipeline runs on CPU, with GPU providing a 4 to 6x speedup.

### Strengths

Speed is Marker's calling card. On digital-native PDFs, it processes roughly 25 to 30 pages per second on a modern CPU, about 20x faster than Docling on CPU and more than 10x faster than Zerox running 10 concurrent GPT-4o requests. Even on scanned documents it maintains 3 to 5 pages per second on GPU. For batch processing millions of pages, nothing else comes close.

The markdown output is clean and predictable. Headings, lists, formatting, code blocks, and basic tables all convert reliably. LaTeX equation support is a unique strength: academic papers with heavy math convert far better in Marker than in either competitor.

Language support through Surya OCR covers 90+ languages. We have tested it on Japanese, Arabic, Hindi, and Mandarin with strong results, all running offline at full speed.

### Weaknesses

Complex table accuracy is the biggest gap. Marker's heuristic reconstruction works well on simple grids but struggles with merged cells, multi-level headers, and wide tables. In our benchmark it correctly extracted table structure 64% of the time on complex tables, compared to 93% for Zerox and 83% for Docling.

Structural metadata is minimal: heading levels but no bounding boxes, no reading-order scores, no hierarchical document model. If your pipeline needs to map text back to page regions or build structure-aware chunks, you will need custom logic. Marker is also PDF-only. If your pipeline ingests DOCX, PPTX, or HTML, you need a separate tool.

### Pricing

Free and open-source under the GPL (note: not MIT, which matters for some commercial use cases). Compute costs are minimal given the speed. A c5.xlarge handles most workloads under 500K pages per month at roughly $70/month. GPU acceleration on a g4dn.xlarge runs about $250/month and can push throughput past 1M pages per month.

![Laptop with code editor open showing a PDF parsing pipeline implementation](https://images.unsplash.com/photo-1517694712202-14dd9538aa97?w=800&q=80)

## Head-to-Head Benchmark Results Across 400 Documents

We ran all three tools against 400 PDFs: 100 financial reports, 80 legal contracts, 70 scanned medical forms, 80 technical manuals, and 70 academic papers.

### Table Extraction Accuracy

Zerox with GPT-4o: 93% structural accuracy, 96% content accuracy. Zerox with Gemini Flash: 81% structural, 89% content. Docling with Tesseract: 83% structural, 91% content. Docling with EasyOCR: 79% structural, 87% content. Marker: 64% structural, 88% content. Marker extracts the right text but loses table structure on complex layouts.

### Reading Order and Layout Fidelity

On multi-column documents, Docling correctly preserved reading order 94% of the time. Zerox with GPT-4o hit 91%. Marker scored 78%, frequently merging text across columns in documents with narrow gutters or uneven column widths. For single-column documents all three tools were above 98%.

### OCR Accuracy on Scanned Documents

On 70 scanned medical forms with mixed print and handwriting, Zerox with GPT-4o achieved 97.2% character-level accuracy. Docling with Tesseract achieved 94.1%. Marker with Surya OCR achieved 93.8%. On clean scans without handwriting, Docling and Marker both exceeded 96% and the gap narrowed significantly. The vision LLM approach has a clear edge on degraded or mixed-quality scans.
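Character-level accuracy can be scored in more than one way. A common definition, assumed here for illustration, is one minus the edit distance between the OCR output and a ground-truth transcript, divided by the transcript length. The function names below are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, hypothesis: str) -> float:
    """1 - normalized edit distance, clamped to the range [0, 1]."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))
```

Under this definition, misreading a single digit as a letter in an 11-character field costs about 9 points of accuracy for that field, which is why small percentage gaps matter on forms.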

### Processing Speed

Marker on CPU: 25 pages/second for digital PDFs, 4 pages/second for scanned. Docling on CPU: 1.2 pages/second. Docling on T4 GPU: 5 pages/second. Zerox with GPT-4o (10 concurrent): 2 pages/second. Zerox with Gemini Flash (10 concurrent): 6 pages/second. Processing 100K pages takes Marker about 70 minutes, Docling on GPU about 5.5 hours, and Zerox with GPT-4o about 14 hours.
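The wall-clock estimates follow directly from the measured throughput: pages divided by pages per second. A short sketch of that arithmetic, using the rates listed above (the dictionary labels and `hours_to_process` are illustrative names):

```python
# Measured throughput from this benchmark, in pages per second.
PAGES_PER_SECOND = {
    "Marker (CPU, digital PDFs)": 25,
    "Docling (CPU)": 1.2,
    "Docling (T4 GPU)": 5,
    "Zerox + GPT-4o (10 concurrent)": 2,
    "Zerox + Gemini Flash (10 concurrent)": 6,
}


def hours_to_process(pages: int, pages_per_second: float) -> float:
    """Wall-clock hours to process a batch at a steady rate."""
    return pages / pages_per_second / 3600


if __name__ == "__main__":
    for name, rate in PAGES_PER_SECOND.items():
        print(f"{name}: {hours_to_process(100_000, rate):.1f} h per 100K pages")
```

These are steady-state numbers; queueing, retries, and cold starts will add to them in a real pipeline.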

### Cost Per 100K Pages

All figures assume 100K pages processed in one month. Marker self-hosted on CPU: roughly $70 in compute. Docling self-hosted on CPU: roughly $90 in compute. Docling self-hosted on GPU: roughly $350. Zerox with Gemini Flash: roughly $200 in API costs plus minimal compute. Zerox with Claude Sonnet: roughly $800 in API costs. Zerox with GPT-4o: roughly $1,500 in API costs. For cost-sensitive high-volume pipelines, Marker and Docling on CPU are an order of magnitude cheaper than Zerox with a frontier model such as GPT-4o or Claude Sonnet.

## RAG Pipeline Integration and Chunking Strategies

Parsing accuracy only tells half the story. How well the output integrates into your retrieval pipeline determines whether better parsing translates to better answers. We tested each tool in a standardized RAG setup: text-embedding-3-large, Pinecone, and GPT-4o for generation.

### Docling's Structural Chunking Wins on Retrieval Quality

Docling's built-in hierarchical chunker splits documents along structural boundaries: section headings, table boundaries, list blocks. Each chunk carries metadata about its position in the document hierarchy. In our benchmark of 200 questions across the test corpus, this produced 79% recall@5, the highest of any configuration we tested. The structural metadata also lets you implement hybrid retrieval strategies where you filter by document section before running similarity search.
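For readers who want to reproduce this kind of measurement, here is one way to compute recall@k. It assumes the hit-rate definition, where a question counts as recalled if at least one relevant chunk appears in the top k results; other definitions exist, and the function and argument names are illustrative.

```python
from typing import Dict, List, Set


def recall_at_k(
    retrieved: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 5,
) -> float:
    """Fraction of questions with at least one relevant chunk in the top k."""
    hits = 0
    for question, gold in relevant.items():
        top_k = retrieved.get(question, [])[:k]
        if gold.intersection(top_k):
            hits += 1
    return hits / len(relevant) if relevant else 0.0


if __name__ == "__main__":
    retrieved = {"q1": ["c3", "c9", "c1"], "q2": ["c4", "c5"]}
    relevant = {"q1": {"c1"}, "q2": {"c7"}}
    print(recall_at_k(retrieved, relevant, k=5))
```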

### Zerox Markdown Works Well with Standard Splitters

Zerox outputs clean markdown, which means standard markdown-aware splitters (like LangChain's MarkdownTextSplitter or LlamaIndex's MarkdownNodeParser) work effectively. Using MarkdownTextSplitter with 1,000-token chunks and 200-token overlap, we achieved 76% recall@5. The markdown headers provide natural split points, and tables stay intact as markdown tables within single chunks. If you are already using LangChain or LlamaIndex, Zerox's output slots in with minimal friction.
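To show why markdown headers make good split points, here is a dependency-free sketch of header-aware splitting that also records the heading path for each chunk. It is not the LangChain or LlamaIndex implementation and does not enforce a token budget; `split_by_headings` is a hypothetical helper.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> List[Dict[str, object]]:
    """Split markdown at headings and attach the heading path to each chunk."""
    chunks: List[Dict[str, object]] = []
    path: List[str] = []  # current heading hierarchy, e.g. ["Report", "Revenue"]
    body: List[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"headings": list(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # ignore '#' lines inside code fences
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or deeper level, then push this one.
            del path[level - 1:]
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Storing the heading path alongside each chunk gives you a filterable metadata field in the vector store at no extra parsing cost.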

### Marker Needs Custom Chunking Work

Marker's markdown output is clean but carries less structural signal than the other two. Using the same MarkdownTextSplitter setup, recall@5 was 71%. The gap came primarily from table-heavy documents where Marker's structural errors propagated into retrieval. We found that adding a post-processing step to detect and protect table blocks from being split improved recall to 74%, but that required custom code. If your documents are primarily narrative text with few tables, the gap between Marker and the other tools narrows to 1 to 2 percentage points.
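The table-protection step can be quite small. The sketch below is an illustration rather than our production code: it groups consecutive pipe-delimited rows into a single atomic block so that a downstream splitter only ever sees the surrounding text. `is_table_row` and `split_protecting_tables` are hypothetical names, and the row test is a simple heuristic.

```python
from typing import List, Tuple


def is_table_row(line: str) -> bool:
    """Heuristic: a markdown table row starts and ends with a pipe."""
    stripped = line.strip()
    return len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|")


def split_protecting_tables(markdown: str) -> List[Tuple[str, str]]:
    """Return (kind, text) blocks where each markdown table stays whole."""
    blocks: List[Tuple[str, str]] = []
    current: List[str] = []
    current_kind = "text"

    def flush() -> None:
        text = "\n".join(current).strip()
        if text:
            blocks.append((current_kind, text))
        current.clear()

    for line in markdown.splitlines():
        kind = "table" if is_table_row(line) else "text"
        if kind != current_kind:
            flush()  # close the previous block before switching kinds
            current_kind = kind
        current.append(line)
    flush()
    return blocks
```

Blocks tagged `"table"` can be indexed as standalone chunks, while `"text"` blocks go through your usual splitter.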

### Practical Integration Patterns

For teams building [RAG architectures](/blog/rag-architecture-explained), here is what we recommend. With Docling, use its native chunker and export chunks as JSON with metadata. Feed the metadata into your vector store as filterable fields. With Zerox, pipe the markdown through a header-aware splitter and store the heading hierarchy as metadata. With Marker, add a table-detection wrapper before chunking and consider a separate indexing path for table-heavy pages. Regardless of parser, always index tables as standalone chunks with a "table" type tag so your retrieval can surface them distinctly.

If you are building a production document pipeline from scratch, our [document processing pipeline guide](/blog/how-to-build-an-ai-document-processing-pipeline) covers the full architecture including ingestion, parsing, chunking, embedding, and retrieval.

## Self-Hosting, OCR, and Language Support Compared

For many teams the decision comes down to operational requirements rather than raw accuracy. Here is how the three tools compare on the infrastructure and compliance dimensions that often drive the final choice.

### Self-Hosting Viability

Docling and Marker both run entirely on your infrastructure with no external dependencies. Docling needs 4 to 8 GB RAM; Marker is lighter at 2 GB for digital PDFs. Both run in Docker inside your VPC with zero internet access. Zerox requires an external LLM API by default. You can point it at a self-hosted vision model (vLLM serving LLaVA), but our tests showed a 15 to 20 point accuracy drop on tables compared to GPT-4o. Air-gapped deployment with competitive accuracy is not practical with Zerox today.

### OCR Capabilities

On digital-native PDFs all three produce comparable results because OCR is not involved. On scanned documents, Zerox sidesteps a dedicated OCR engine entirely by sending every page to the vision model as an image, giving it an edge on degraded scans. Docling supports pluggable OCR: Tesseract for best English accuracy, EasyOCR for broader language coverage. Marker uses Surya, its own OCR engine with strong multilingual support but no swappable backends.

### Language Support

Marker's Surya OCR supports 90+ languages natively, making it the best choice for multilingual document corpora. Docling's language support depends on the OCR backend: Tesseract covers 100+ languages, EasyOCR covers 80+. Zerox inherits language support from the underlying LLM, and frontier models like GPT-4o and Claude handle most major languages well, though accuracy on low-resource languages varies.

### License Considerations

Docling and Zerox are MIT-licensed with unrestricted commercial use. Marker uses the GPL, which requires derivative works you distribute to be released under the same license. If you are wrapping Marker in a commercial SaaS product, consult your legal team. Running it as an internal tool or standalone service is generally fine, but embedding it in proprietary code that you ship triggers GPL obligations.

![Team reviewing document parsing benchmark results on a large display in a meeting room](https://images.unsplash.com/photo-1553877522-43269d4ea984?w=800&q=80)

## When to Use Each Tool: Decision Framework

After running these tools across dozens of client deployments, here is our decision framework distilled into clear recommendations.

### Choose Zerox When Accuracy on Complex Documents Justifies the Cost

Zerox is the right choice for high-value documents where extraction errors have real business consequences: financial due diligence, contract analysis, medical record review, or regulatory filings. If a missed table cell means a missed liability in a $50M acquisition, $0.015 per page is trivial. It is also the fastest path to a working prototype since there are no models to host.

### Choose Docling When You Need Self-Hosted Structural Parsing

Docling wins for regulated industries where documents cannot leave your infrastructure, and for teams that need rich structural metadata for advanced RAG chunking. It strikes the best balance of accuracy, cost, and operational control. The learning curve is steeper, but the payoff is a flexible pipeline you fully own.

### Choose Marker When Volume and Speed Are the Priority

Marker is the right tool for hundreds of thousands or millions of pages of primarily digital-native PDFs. Knowledge base ingestion, academic paper indexing, and content migration all fit this profile. The lower table accuracy is an acceptable tradeoff when optimizing for throughput and cost. Marker also leads on equation-heavy technical documents thanks to its LaTeX conversion.

### Hybrid Architectures for Production

The most robust pipelines we have built use a routing layer that sends documents to different parsers based on complexity. Simple digital PDFs go to Marker. Complex scanned documents with dense tables route to Zerox or Docling. A lightweight classifier (even rule-based, checking file size and whether OCR is needed) makes this decision with minimal overhead. This pattern reduces cost by 40 to 60% compared to running everything through the highest-accuracy parser.
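A rule-based router along these lines fits in a dozen lines. The sketch below is a simplified illustration: the `DocStats` fields, the thresholds, and the on-prem flag are assumptions you would replace with signals from your own corpus.

```python
from dataclasses import dataclass


@dataclass
class DocStats:
    pages: int
    chars_in_text_layer: int  # characters extractable without OCR
    table_count: int          # from a cheap pre-scan (hypothetical signal)
    must_stay_on_prem: bool   # data residency constraint


def choose_parser(doc: DocStats) -> str:
    """Pick a parser name from cheap document-level signals."""
    pages = max(doc.pages, 1)
    needs_ocr = doc.chars_in_text_layer / pages < 100  # assumed threshold
    table_heavy = doc.table_count / pages >= 0.5       # assumed threshold

    if not needs_ocr and not table_heavy:
        return "marker"   # simple digital PDFs: fastest, cheapest path
    if doc.must_stay_on_prem:
        return "docling"  # complex, but data cannot leave the network
    return "zerox"        # complex and allowed to call an external API


if __name__ == "__main__":
    print(choose_parser(DocStats(12, 30_000, 1, False)))
```

Logging the routing decision next to each parsed document makes it easy to audit the thresholds later and adjust them as your document mix shifts.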

## Picking Your Parser and Moving Forward

The PDF parsing landscape in 2026 has matured to the point where there is no single best tool. Each of these three projects represents a genuine engineering philosophy, and the right choice depends on your specific constraints.

If you want the highest accuracy and can tolerate API costs and latency, Zerox with a frontier vision model is hard to beat. It turns a hard computer vision problem into a prompting problem, and 93% structural accuracy on complex tables speaks for itself. The tradeoff is cost, speed, and data privacy.

If you need everything on your own servers with rich document structure for downstream processing, Docling delivers. IBM's research pedigree shows in layout analysis quality, and the MIT license removes commercial friction. Budget for the learning curve, but the flexibility is worth it.

If you are processing high volumes of standard PDFs and need results fast, Marker is the pragmatic choice. At 25 pages per second on CPU, it can chew through document mountains that would take the other tools days. Just plan for its limitations on complex tables and build in quality checks for those edge cases.

The broader trend is clear: vision LLM approaches will get cheaper as model costs drop, and local tools will get more accurate as their models improve. Within 12 to 18 months the accuracy gap will likely shrink. But the architectural decision you make today, cloud API vs. self-hosted, structural vs. flat output, speed vs. accuracy, will shape your pipeline for years. Make that decision based on your actual documents, not hypothetical benchmarks.

We have built document parsing pipelines for over 40 companies across finance, healthcare, legal, and SaaS. If you want a second opinion on which tool fits your use case, or need help building the pipeline around it, [book a free strategy call](/get-started) and we will work through it together.

---

*Originally published on [Kanopy Labs](https://kanopylabs.com/blog/zerox-vs-docling-vs-marker-pdf-parsing)*