GemPress: Formatting eBooks with Gemini

Amanvir Parhar • February 16, 2025

This week, I wrote a script that uses Google's Gemini-1.5-Flash model to fix typesetting and formatting issues in public domain eBooks from Project Gutenberg.

It's open source, and it works thanks to Gemini's insanely long context window (1M+ tokens)!

You can view the code for GemPress on GitHub.

Screenshot of formatted page from a book made by GemPress

This project was born out of a personal need: I was looking for a poetry collection by Robert Frost on gutenberg.org, a repository with thousands of free eBooks, but when I downloaded the ePub for this particular book, I was rather disappointed.

Screenshot of Wikipedia page of 'A Boy's Will'

The ePub I found wasn't properly formatted: it contained a duplicate, malformed copy of the table of contents spanning several pages, and the worst part was that every single poem was displayed in monospace for some reason.

In my experience, most books on Project Gutenberg are unfortunately formatted in a non-standard way, and a lot of books are actually plagued with much bigger issues than the one I was looking for. This is essentially the reason why initiatives like Standard Ebooks exist.

I started thinking about fixing some of these typesetting/formatting issues using an LLM, and I quickly realized that Gemini would be perfect for this project - the 1M+ token context window would allow me to paste the text for an entire book in one shot.

The script I ended up writing works like this: the model first takes in a prompt and the book's content. It's not asked to return a corrected version of the entire book (since the output limit is around ~8k tokens), but is instead asked to return JSON that aligns with a provided schema. This is possible thanks to structured generation (which is a topic I actually spoke about at DevIgnition 2024 - here's a link to the recording if you're interested in viewing that talk!).

The model is asked to respond with JSON that contains the indices for the first and last paragraphs of each chapter. Gemini is able to accurately return the index of any given paragraph, because each “paragraph” from the book's raw text is wrapped with a numbered tag which the model can reference:

These indices are used to create a new ePub that only includes meaningful parts of the book's text - the parts that we're interested in reading!

Currently, this is a pretty simple program, but I hope that I (or anyone else in the community!) can expand on it by adding functionality to tackle typos, gibberish text, and other issues found within public domain literature.

Hey, my name is Amanvir Parhar (but you can call me Aman)! I'm an undergrad studying C.S. at the University of Maryland, College Park.

If you liked this post, feel free to check out my other work, and follow me on X/Twitter!