Python

PDF Craft: Forget About "Dead" PDFs – Turn Scans into Live Text!

5,803 stars

You know the situation: you get handed a PDF document, or worse, an entire book as a scan. The text can't be copied, search doesn't work, and reading it on an e-reader is pure torture. Probably everyone who has ever worked with academic literature or old digitized documents has faced this problem. That's when a hero steps onto the stage, capable of breathing life into these "dead" files: a project called PDF Craft.

What is it and why do you need it?

PDF Craft is a Python tool built for one very important purpose: converting PDF files, especially scanned books, into more convenient and editable formats such as Markdown and EPUB. Imagine you have an old but very valuable book in PDF that someone once simply scanned. With PDF Craft, you can turn it into a full-fledged e-book for your reader or into a Markdown file that you can work with like regular text: search, copy, edit, reformat. It's a godsend for students, researchers, developers, and really anyone who values their time and convenience when working with information.

Key features that impressed me

The project doesn't just "extract" text. It does it smartly, recognizing the structure of the page as well as the characters on it.

Intelligent recognition and structure preservation

At the core of PDF Craft lies DeepSeek OCR, an optical character recognition model. This isn't the kind of OCR that just outputs a stream of characters. DeepSeek OCR can recognize complex content: tables, formulas, footnotes, and images within footnotes. It doesn't just scan text; it analyzes the document structure, separating the main text from headers and footers and keeping important elements intact.

By the way, do you remember how tables turn into a mess when copying from PDFs, and formulas become a set of incomprehensible symbols? PDF Craft solves this problem by trying to preserve these elements as close to the original as possible, whether it's an HTML table or a MathML formula.

Local and incredibly fast operation

One of the main highlights of version 1.0.0 and above is that large language models (LLMs) are no longer used for text correction. This means the entire conversion runs locally, without sending your data anywhere and without the delays that come with network requests. If you have a GPU, hardware acceleration makes the process considerably faster. Forget about long waits and connection drops!

If you still need the LLM correction function, though, the developers kindly left the option of using the older v0.2.8 version.

You can evaluate the speed and quality of work right now by trying the online demo.

PDF Craft Online Demo

Output flexibility: Markdown and EPUB with automatic table of contents creation

PDF Craft allows you to convert PDFs into two popular formats: Markdown and EPUB.

  • Markdown: Ideal for those who want simple, structured text that's easy to integrate into their notes, documentation, or blogs. Images are saved in a separate folder in this case.

    from pdf_craft import transform_markdown
    
    transform_markdown(
        pdf_path="input.pdf",
        markdown_path="output.md",
        markdown_assets_path="images",
    )
    

    PDF to Markdown

  • EPUB: Your choice if you want to create a full-fledged e-book for comfortable reading on an e-reader. PDF Craft automatically generates a table of contents, which is very convenient for navigating through the book.

    from pdf_craft import transform_epub, BookMeta
    
    transform_epub(
        pdf_path="input.pdf",
        epub_path="output.epub",
        book_meta=BookMeta(
            title="My Scanned Book",
            authors=["Author 1", "Author 2"],
        ),
    )
    

    PDF to EPUB
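
Since both converters take a PDF path and an output path, it's tempting to wrap them in one helper that picks the format from the output file's extension. Here is a minimal sketch of that idea. `pick_format` and `convert` are my own hypothetical helpers, not part of pdf-craft; the only library calls are the ones from the snippets above, and I'm assuming they accept exactly those arguments.

```python
from pathlib import Path
from typing import List


def pick_format(output_path: str) -> str:
    """Map an output file name to "markdown" or "epub" (hypothetical helper)."""
    suffix = Path(output_path).suffix.lower()
    if suffix in (".md", ".markdown"):
        return "markdown"
    if suffix == ".epub":
        return "epub"
    raise ValueError(f"Unsupported output format: {suffix or '(no extension)'}")


def convert(pdf_path: str, output_path: str, title: str, authors: List[str]) -> None:
    """Dispatch to the matching pdf-craft function based on the output extension."""
    # Imported lazily so pick_format() can be used without pdf-craft installed.
    from pdf_craft import BookMeta, transform_epub, transform_markdown

    if pick_format(output_path) == "markdown":
        transform_markdown(
            pdf_path=pdf_path,
            markdown_path=output_path,
            markdown_assets_path="images",
        )
    else:
        transform_epub(
            pdf_path=pdf_path,
            epub_path=output_path,
            book_meta=BookMeta(title=title, authors=authors),
        )
```

With this in place, `convert("input.pdf", "output.epub", "My Scanned Book", ["Author 1"])` would go down the EPUB branch, while an `.md` target would produce Markdown plus an `images` folder.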

Fine-tuning for your needs

The project offers many parameters for fine-tuning the conversion process. You can choose the OCR model size (from tiny to gundam), specify a path for model caching, and enable or disable footnote processing. Tables can be rendered as HTML (TableRender.HTML) or clipped from the page as a plain image (TableRender.CLIPPING). Formulas can be rendered as MathML (LaTeXRender.MATHML), as SVG (LaTeXRender.SVG), or clipped as an image (LaTeXRender.CLIPPING). This gives you full control over the final result.

By the way, there's even a mode where you can ignore rendering errors on individual PDF pages so as not to interrupt the entire process (ignore_pdf_errors=True). Very useful for "broken" files!
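
To keep these settings in one place, you could collect them into a dictionary and forward it to the converter. The sketch below is mine, not the project's. `ignore_pdf_errors` is the name quoted above, but `ocr_size` is an assumed keyword name, and I'm also assuming that the transform functions accept `models_cache_path` the same way `predownload_models` does. Check the project's README before relying on any of them.

```python
from typing import Any, Dict, Optional


def build_options(
    ocr_size: str = "gundam",
    models_cache_path: Optional[str] = None,
    ignore_pdf_errors: bool = False,
) -> Dict[str, Any]:
    """Collect keyword arguments for a pdf-craft conversion (hypothetical helper)."""
    options: Dict[str, Any] = {
        "ocr_size": ocr_size,  # ASSUMED keyword name; sizes run from "tiny" to "gundam"
        "ignore_pdf_errors": ignore_pdf_errors,  # name taken from the article
    }
    if models_cache_path is not None:
        # ASSUMED to be accepted by the transform functions as well.
        options["models_cache_path"] = models_cache_path
    return options


def convert_with_options(pdf_path: str, markdown_path: str, **options: Any) -> None:
    """Forward the collected options to transform_markdown."""
    # Imported lazily so build_options() works without pdf-craft installed.
    from pdf_craft import transform_markdown

    transform_markdown(pdf_path=pdf_path, markdown_path=markdown_path, **options)
```

For a damaged scan on a modest machine, something like `convert_with_options("input.pdf", "output.md", **build_options(ocr_size="tiny", ignore_pdf_errors=True))` would be the natural call, assuming the keyword names hold.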

How it works under the hood

As I mentioned, the heart of the OCR engine is DeepSeek OCR. The models for it are downloaded automatically from Hugging Face on the first run, but you can preload them in advance or specify your own cache path, which is especially convenient for production environments or offline work.

from pdf_craft import predownload_models

predownload_models(
    models_cache_path="./my_models",  # Use a custom directory for the model cache
)
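
For offline or production setups, you may want to download the models only when the cache directory is still empty. Here is a rough sketch of that check. `cache_is_populated` and `ensure_models` are hypothetical helpers of mine, and "the directory is not empty" is a crude heuristic: it can't tell a complete download from an interrupted one.

```python
from pathlib import Path


def cache_is_populated(cache_dir: Path) -> bool:
    """Return True if the directory exists and contains at least one entry."""
    return cache_dir.is_dir() and any(cache_dir.iterdir())


def ensure_models(cache_dir: Path) -> None:
    """Download the OCR models only if the cache looks empty."""
    if cache_is_populated(cache_dir):
        return
    # Imported lazily so the check above works without pdf-craft installed.
    from pdf_craft import predownload_models

    predownload_models(models_cache_path=str(cache_dir))
```

Calling `ensure_models(Path("./my_models"))` at image build time or before going offline would then leave later conversions with nothing to fetch, provided they point at the same cache path.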

For parsing PDF files, pdf-craft uses Poppler (through the pdf2image library). If Poppler is not in your PATH, you can always specify the path to it manually:

from pdf_craft import transform_markdown, DefaultPDFHandler

transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    pdf_handler=DefaultPDFHandler(poppler_path="/path/to/poppler/bin"),
)
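
If you'd rather not hard-code that path, you can check whether Poppler is already reachable before deciding to pass `poppler_path` at all. The sketch below uses only the standard library and looks for `pdftoppm`, one of Poppler's command-line tools that pdf2image relies on. `find_tool_dir` is a hypothetical helper of mine, and the candidate directories are up to you.

```python
import shutil
from typing import Optional, Sequence


def find_tool_dir(candidates: Sequence[str], tool: str = "pdftoppm") -> Optional[str]:
    """Locate a Poppler tool.

    Returns None if the tool is already on PATH (no explicit path needed),
    otherwise the first candidate directory that contains it.
    Raises FileNotFoundError if it cannot be found anywhere.
    """
    if shutil.which(tool) is not None:
        return None
    for directory in candidates:
        # shutil.which accepts a custom search path instead of the PATH variable.
        if shutil.which(tool, path=directory) is not None:
            return directory
    raise FileNotFoundError(f"{tool} not found on PATH or in {list(candidates)}")
```

If the result is None, call `transform_markdown` as usual; otherwise pass `pdf_handler=DefaultPDFHandler(poppler_path=result)` as in the snippet above.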

It's nice to see that the project is licensed under MIT, which makes it very flexible for use in various projects.

Practical applications: Where will PDF Craft come in handy?

  • Digitizing your library: Do you have piles of scanned books or old documents that you want to make searchable and editable? PDF Craft is your best helper.
  • Reading on any device: Convert boring PDFs into convenient EPUB files for reading on a Kindle, PocketBook, or any other e-reader. The automatically generated table of contents makes navigation pleasant.
  • Data extraction for analysis: Need to quickly extract text, tables, or formulas from dozens of scientific articles? This tool will do it for you while preserving the structure.
  • Creating educational materials: Convert PDF textbooks into editable formats for creating lecture notes or adapting to your needs.
  • Combining with other tools: The developers even mention the possibility of using it together with the epub-translator project, which can automatically translate EPUB books while preserving their format. Imagine: scanned book -> EPUB -> translated bilingual EPUB. That's just pure magic!
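
For the "dozens of scientific articles" case, a small batch script goes a long way. Here is a sketch that walks a folder, skips PDFs that already have a Markdown file, and converts the rest. `plan_batch` and `convert_folder` are hypothetical helpers and the one-folder-per-PDF layout is my own choice. The article doesn't say whether `markdown_assets_path` is resolved relative to the Markdown file or to the working directory, so I pass a full path and treat that as an assumption.

```python
from pathlib import Path
from typing import Iterator, Tuple


def plan_batch(src_dir: Path, out_dir: Path) -> Iterator[Tuple[Path, Path, Path]]:
    """Yield (pdf, markdown, assets_dir) for every PDF without a Markdown output yet."""
    for pdf in sorted(src_dir.glob("*.pdf")):
        target = out_dir / pdf.stem  # one output folder per source PDF
        markdown = target / f"{pdf.stem}.md"
        if markdown.exists():
            continue  # already converted on a previous run
        yield pdf, markdown, target / "images"


def convert_folder(src_dir: str, out_dir: str) -> None:
    """Convert every pending PDF in src_dir to Markdown under out_dir."""
    # Imported lazily so plan_batch() works without pdf-craft installed.
    from pdf_craft import transform_markdown

    for pdf, markdown, assets in plan_batch(Path(src_dir), Path(out_dir)):
        markdown.parent.mkdir(parents=True, exist_ok=True)
        transform_markdown(
            pdf_path=str(pdf),
            markdown_path=str(markdown),
            markdown_assets_path=str(assets),  # ASSUMPTION: a full path is accepted here
        )
```

Because finished files are skipped, an interrupted run can simply be started again with the same arguments.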

Conclusion: Is it worth trying?

Without a doubt, yes! If you've ever faced the problem of working with scanned PDFs, PDF Craft can become your salvation. It's not just a converter, but a smart tool that understands document structure and strives to preserve it.

It will be perfect for:

  • Those who work a lot with academic texts and scanned documents.
  • Developers who need to automate the PDF processing workflow.
  • E-book enthusiasts who want to transfer their paper libraries to digital format.

By the way, if you don't want to install anything, you can try the online demo. It's a great way to quickly evaluate the project's capabilities.

Try PDF Craft on GitHub and give your "dead" PDFs new life!
