R0mb0/Local_pdf_extractor
Local pdf extractor


A modern client-side tool to bulk convert PDFs into a unified document. Features drag-and-drop, auto dark mode, and multi-language support. Built with React and PDF.js.


🚀 Features

  • Bulk Extraction: Drag & drop any number of PDF files at once; the only limit is your browser's available memory.
  • Smart Formatting: Heuristic algorithms attempt to reconstruct paragraphs and detect headers (H1, bold text) from the raw PDF stream.
  • Multi-Format Output: Export merged content as:
    • Markdown (.md): Perfect for LLM context or note-taking apps.
    • HTML (.html): Ready for web use.
    • Plain Text (.txt): Raw data.
  • Runs entirely in the browser: No server, no backend, no installation required.
  • Privacy-first: Your files never leave your computer.
  • Auto-Adaptive UI: Automatically detects system language (EN/IT) and Theme (Light/Dark).
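
The auto-adaptive behaviour can be illustrated with a minimal sketch. It assumes the browser's preferred languages and colour scheme are read from `navigator.languages` and the `prefers-color-scheme` media query; the helper names `pickLanguage` and `pickTheme` are hypothetical and not taken from this repository.

```typescript
// Hypothetical helpers, not the project's actual code.
type Lang = "en" | "it";
type Theme = "light" | "dark";

// Return the first supported language found in a list of BCP 47 tags
// (for example the value of navigator.languages), or the fallback.
function pickLanguage(tags: readonly string[], fallback: Lang = "en"): Lang {
  for (const tag of tags) {
    const primary = tag.toLowerCase().split("-")[0];
    if (primary === "it" || primary === "en") {
      return primary;
    }
  }
  return fallback;
}

// Map the result of the prefers-color-scheme media query to a theme name.
function pickTheme(prefersDark: boolean): Theme {
  return prefersDark ? "dark" : "light";
}

// In a browser the inputs would typically come from:
//   pickLanguage(navigator.languages)
//   pickTheme(window.matchMedia("(prefers-color-scheme: dark)").matches)
```

Keeping the two functions pure (inputs passed in rather than read from globals) makes them easy to test outside a browser.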

🛠️ How it works

  1. Upload your PDF(s) via the drag & drop interface.
  2. The tool uses PDF.js to parse the binary data of each file locally.
  3. It extracts text items and sorts them by coordinates (Y/X) to reconstruct the reading order.
  4. An algorithm analyzes font size and spacing to determine line breaks and headers.
  5. Turndown converts the structure into clean Markdown (if selected).
  6. Download the single, unified document containing all data.
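
Steps 3 to 5 can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's actual implementation: the `TextItem` shape is a simplified stand-in for what PDF.js returns, the tolerance and header ratio are arbitrary values, and the sketch emits Markdown directly instead of going through Turndown.

```typescript
// Simplified, hypothetical text item; real PDF.js items carry more fields.
interface TextItem {
  str: string;      // text run
  x: number;        // left edge in PDF units
  y: number;        // baseline in PDF units (larger y is higher on the page)
  fontSize: number; // approximate glyph height
}

interface Line {
  text: string;
  fontSize: number;
}

// Group items whose baselines lie within yTolerance of each other,
// top to bottom, then order each group left to right.
function groupIntoLines(items: readonly TextItem[], yTolerance = 2): Line[] {
  const byY = [...items].sort((a, b) => b.y - a.y);
  const rows: { y: number; items: TextItem[] }[] = [];
  for (const item of byY) {
    const last = rows[rows.length - 1];
    if (last !== undefined && Math.abs(last.y - item.y) <= yTolerance) {
      last.items.push(item);
    } else {
      rows.push({ y: item.y, items: [item] });
    }
  }
  return rows.map((row) => {
    const ordered = [...row.items].sort((a, b) => a.x - b.x);
    return {
      text: ordered.map((i) => i.str).join(" ").replace(/\s+/g, " ").trim(),
      fontSize: Math.max(...ordered.map((i) => i.fontSize)),
    };
  });
}

// Treat lines noticeably larger than the median font size as headers.
function toMarkdown(lines: readonly Line[], headerRatio = 1.3): string {
  const sizes = lines.map((l) => l.fontSize).sort((a, b) => a - b);
  const body = sizes[Math.floor(sizes.length / 2)] ?? 0;
  return lines
    .map((l) => (body > 0 && l.fontSize >= body * headerRatio ? `# ${l.text}` : l.text))
    .join("\n\n");
}

// Example input, deliberately out of reading order.
const sample: TextItem[] = [
  { str: "world", x: 60, y: 700, fontSize: 12 },
  { str: "Title", x: 10, y: 750, fontSize: 24 },
  { str: "Hello", x: 10, y: 701, fontSize: 12 },
  { str: "Second line", x: 10, y: 680, fontSize: 12 },
];

const markdown = toMarkdown(groupIntoLines(sample));
```

Grouping by baseline before sorting by X avoids mis-ordering items whose Y values differ by a fraction of a unit on the same visual line.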

🏆 What makes it special?

  • Zero-Dependency Setup (Offline): Runs offline by opening index.html, provided the libraries have been downloaded locally.
  • Header & Format Detection: Unlike standard "Select All > Copy" methods, this tool tries to preserve the semantic structure of the document (Titles, Bold text).
  • No Server Limits: Since it runs on your client machine, you are not limited by server upload caps or timeouts; capacity depends only on your device's memory and CPU.

💡 Why use this tool?

  • LLM Context Preparation: Quickly merge 20+ PDFs into one Markdown file to feed into ChatGPT or Claude.
  • Research Consolidation: Combine multiple papers into a single searchable text file.
  • Privacy: Sensitive documents remain on your device.

🔒 Privacy & Security

  • All processing is done locally in your browser.
  • No file is sent to any server.
  • No data is stored; memory is cleared upon page refresh.

⚑ Getting Started

Online

Simply visit the Demo Page.

Local Installation (Offline)

  1. Clone this repository.
  2. Ensure the library files (react.js, pdf.js, etc.) are in the root folder.
  3. Toggle the comments in the <head> of index.html to switch from the CDN to the local libraries.
  4. Open index.html in your browser.

✨ Limitations & Notes

  • Text-Based PDFs Only: This tool extracts text layers. It does not perform OCR. If your PDF is a scanned image (without a text layer), use my PDF Accessibility Fixer instead.
  • Complex Layouts: While it handles standard documents well, complex multi-column layouts or tables may be extracted as a single linear stream, losing their column or cell structure.
  • Formatting: The reconstruction is heuristic; it may not be pixel-perfect compared to the original visual layout.
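
Because only text layers are extracted, a scanned PDF yields pages with no text at all. A minimal sketch of how such files could be flagged is shown below; the helper names and the 80% threshold are assumptions made for illustration and do not describe what this tool currently does.

```typescript
// Hypothetical helpers: each array entry is the text extracted from one page.

// Return the 1-based numbers of pages that produced no text.
function pagesWithoutText(pageTexts: readonly string[]): number[] {
  const empty: number[] = [];
  pageTexts.forEach((text, index) => {
    if (text.trim().length === 0) {
      empty.push(index + 1);
    }
  });
  return empty;
}

// Guess that a document is a scan when most of its pages have no text layer.
function looksScanned(pageTexts: readonly string[], threshold = 0.8): boolean {
  if (pageTexts.length === 0) {
    return false;
  }
  return pagesWithoutText(pageTexts).length / pageTexts.length >= threshold;
}
```

A check like this could be used to warn the user that a file probably needs OCR before extraction.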

📖 License

MIT License. See LICENSE for details.

🙏 Credits & Inspiration

Not made by AI
