Getting Text from epub Files in Python

Epub is a widely used standard format for digital books. If you want a lot of text data, or just have a lot of epubs you want to analyze, you can extract that text using Python. There are a few libraries for managing epub data in Python, but ebooklib is a good one: it can load an epub and hand you its contents item by item.

If you haven’t looked inside an epub file yourself, it’s remarkably easy to do. DRM-free epubs are simply zip archives with a different extension. You can change the .epub extension to .zip and unzip the file like any normal zip file.
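Since an epub is a zip archive, Python’s standard zipfile module can also open one directly, without renaming it. Here is a minimal sketch of peeking inside; the helper names list_epub_contents and html_files are made up for this example, not part of any library.

```python
import zipfile

def list_epub_contents(path):
    """Return the names of all files stored inside a DRM-free epub."""
    # zipfile reads the archive structure itself, so the .epub extension is fine
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()

def html_files(names):
    """Keep only the HTML/XHTML entries from a list of archive names."""
    return [n for n in names if n.lower().endswith(('.html', '.xhtml', '.htm'))]
```

Calling html_files(list_epub_contents(path)) on one of your own epubs should show you which files hold the markup for the book.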

Looking at the inside of an epub file, you can see that it mostly contains html files. The text of an epub is stored inside html tags, similar to a website. So, in addition to ebooklib, you can use BeautifulSoup to parse some of the data.

The epub is broken up into different parts, often around chapters in the book. Tables of contents, covers, copyright pages, etc. also often get their own sections in an epub. After loading the epub file into ebooklib, you’ll want to get every item in the epub.

import ebooklib
from ebooklib import epub
book = epub.read_epub(file_name)
items = list(book.get_items_of_type(ebooklib.ITEM_DOCUMENT))

You can also get items by ID or href, or just get all items. ebooklib.ITEM_DOCUMENT type items contain the text of the epub, so that’s what we are after here. The full list of item types is:

ebooklib.ITEM_UNKNOWN
ebooklib.ITEM_IMAGE
ebooklib.ITEM_STYLE
ebooklib.ITEM_SCRIPT
ebooklib.ITEM_NAVIGATION
ebooklib.ITEM_VECTOR
ebooklib.ITEM_FONT
ebooklib.ITEM_VIDEO
ebooklib.ITEM_AUDIO
ebooklib.ITEM_DOCUMENT
ebooklib.ITEM_COVER

The get_items_of_type call above should get you a list of ebooklib.epub.EpubHtml objects. From this you can loop over the items, doing whatever you want. For instance, you can use if 'chapter' in item.get_name(): to weed out tables of contents, copyright pages, etc., as long as the chapter files in your epub actually have 'chapter' in their names.

Of course, the real meat and potatoes comes with getting the actual text from the items themselves. Ebooklib has .get_content() and .get_body_content() methods, but the resultant strings will contain the text still in html format. This is where we need BeautifulSoup.

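from bs4 import BeautifulSoup

def chapter_to_str(chapter):
    # Parse the chapter's body html and join the text of every <p> tag
    soup = BeautifulSoup(chapter.get_body_content(), 'html.parser')
    text = [para.get_text() for para in soup.find_all('p')]
    return ' '.join(text)

# Keep only the items whose file name marks them as chapters
chapters = [item for item in items if 'chapter' in item.get_name()]
texts = {}
for c in chapters:
    texts[c.get_name()] = chapter_to_str(c)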

The above code gathers all text within <p> paragraph tags. That should gather the body text without getting any titles, images, or metadata. But you can similarly search only for titles, or only bold or italicized text, or anything else you can find with BeautifulSoup’s find functions. You can also get every bit of text simply by returning soup.text, but that often leaves many newline characters, so you may want to use return soup.text.replace('\n', ' ').strip() instead.
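As a rough illustration of those alternatives, the sketch below runs a few different searches over a made-up snippet of html standing in for what chapter.get_body_content() returns. Which tags your epub uses for titles and emphasis is an assumption here; some books use styled <span> or <div> tags instead, so inspect your own files first.

```python
from bs4 import BeautifulSoup

# A made-up chapter body, standing in for chapter.get_body_content()
html = ('<h1>Chapter One</h1>\n'
        '<p>It was a <b>dark</b> night.</p>\n'
        '<p>Rain <i>fell</i>.</p>')
soup = BeautifulSoup(html, 'html.parser')

# Only headings (find_all accepts a list of tag names)
titles = [tag.get_text() for tag in soup.find_all(['h1', 'h2', 'h3'])]

# Only bold or italicized text
emphasized = [tag.get_text() for tag in soup.find_all(['b', 'strong', 'i', 'em'])]

# Every bit of text, with newlines flattened to spaces
all_text = soup.text.replace('\n', ' ').strip()
```

Note that soup.text simply concatenates every string in the document, so headings and paragraphs run together unless there is whitespace between the tags in the original html.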

For more on ways to use BeautifulSoup, check out my post about getting started with it. For more about ebooklib, here is the documentation. With this you can easily gather a lot of text data, or just perform some analysis on something you already know you enjoy.

Student of Data Science
