Provides functions supporting the reading and parsing of internal e-book content from EPUB files. E-book metadata and text content are parsed separately and joined together in a tidy, nested tibble data frame. E-book formatting is not completely standardized across all literature. It can be challenging to curate parsed e-book content across an arbitrary collection of e-books perfectly and in completely general form, to yield a singular, consistently formatted output. Many EPUB files do not even contain all the same pieces of information in their respective metadata. EPUB file parsing functionality in this package is intended for relatively general application to arbitrary EPUB e-books. However, poorly formatted e-books or e-books with highly uncommon formatting may not work with this package. There may even be cases where an EPUB file has DRM or some other property that makes it impossible to read with 'epubr'. Text is read 'as is' for the most part. The only nominal changes are minor substitutions, for example curly quotes changed to straight quotes. Substantive changes are expected to be performed subsequently by the user as part of their text analysis. Additional text cleaning can be performed at the user's discretion, such as with functions from packages like 'tm' or 'qdap'.
Read metadata and textual content of epub files.
epubr provides functions supporting the reading and parsing of internal e-book content from EPUB files. E-book metadata and text content are parsed separately and joined together in a tidy, nested tibble data frame.
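To make the "nested" shape concrete, here is a minimal base R sketch of one row per book with a list column holding that book's sections. The two books and their values are hypothetical stand-ins, not epubr output, and with a real tibble a tool such as tidyr::unnest would usually replace the manual flattening step.

```r
# Hypothetical stand-in for the nested result: one row per book,
# with a list column `data` holding a per-book table of sections.
books <- data.frame(title = c("Book A", "Book B"), stringsAsFactors = FALSE)
books$data <- list(
  data.frame(section = c("s1", "s2"), nword = c(10L, 20L), stringsAsFactors = FALSE),
  data.frame(section = "s1", nword = 5L, stringsAsFactors = FALSE)
)

# Flatten: repeat each book's title alongside its section rows, then stack.
parts <- Map(
  function(title, d) data.frame(title = title, d, stringsAsFactors = FALSE),
  books$title, books$data
)
flat <- do.call(rbind, unname(parts))
rownames(flat) <- NULL

flat
```

Keeping sections nested means book-level metadata is stored once per book, while section-level text stays available when it is needed.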
E-book formatting is non-standard enough across all literature that no function can curate parsed e-book content across an arbitrary collection of e-books, in completely general form, resulting in a singular, consistently formatted output containing all the same variables.
EPUB file parsing functionality in this package is intended for relatively general application to arbitrary EPUB e-books. However, poorly formatted e-books or e-books with highly uncommon formatting may not work with this package. Text is read 'as is'. Additional text cleaning should be performed by the user at their discretion, such as with functions from packages like tm or qdap.
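As an illustration of the kind of cleaning left to the user, the following base R sketch normalizes a parsed text string before analysis. The helper name clean_text is hypothetical and not part of epubr, and the ASCII-only character filter is a deliberate simplification.

```r
# Hypothetical helper (not part of epubr): light normalization of parsed text.
clean_text <- function(x) {
  x <- tolower(x)
  # Replace anything that is not a letter, digit, apostrophe, or whitespace
  # with a space. ASCII-only: accented letters are dropped too.
  x <- gsub("[^a-z0-9'\\s]", " ", x, perl = TRUE)
  # Collapse runs of whitespace (including newlines) into a single space.
  x <- gsub("\\s+", " ", x, perl = TRUE)
  trimws(x)
}

clean_text("CHAPTER X\nLetter, Dr. Seward to Hon. Arthur")
```

Dedicated text-mining packages offer more careful tokenization; a helper like this is only a starting point.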
Install epubr from CRAN with:
Install the development version from GitHub with:
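The commands themselves do not appear in this text, so the following is a sketch of what the two routes would typically look like. The CRAN call is the standard one; the use of the remotes package and the repository path ropensci/epubr are assumptions.

```r
# CRAN release
install.packages("epubr")

# Development version. Assumes the 'remotes' package is available and that
# the repository lives at ropensci/epubr on GitHub.
# install.packages("remotes")
remotes::install_github("ropensci/epubr")
```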
Bram Stoker's Dracula novel sourced from Project Gutenberg is a good example of an EPUB file with unfortunate formatting. The first thing that stands out is that the naming convention, item followed by some ordered digits, does not differentiate sections like the book preamble from the chapters. The numbering also starts in an odd place. But it is actually worse than this. Notice that sections are not broken into chapters; they can begin and end in the middle of chapters!
These annoyances aside, the metadata and contents can still be read into a convenient table. Text mining analyses can still be performed on the overall book, if not so easily on individual chapters.
file <- system.file("dracula.epub", package = "epubr")
(x <- epub(file))
#> # A tibble: 1 x 9
#>   rights  identifier creator  title language subject date  source  data
#>   <chr>   <chr>      <chr>    <chr> <chr>    <chr>   <chr> <chr>   <lis>
#> 1 Public~            Bram St~ Drac~ en       Horror~ 1995~ http:/~ <tib~

# The data column is a list; extract the first book's sections as a tibble
x$data[[1]]
#> # A tibble: 15 x 4
#>    section       text                                          nword nchar
#>    <chr>         <chr>                                         <int> <int>
#>  1 item6         "The Project Gutenberg EBook of Dracula, by ~ 11252 60972
#>  2 item7         "But I am not in heart to describe beauty, f~ 13740 71798
#>  3 item8         "\" 'Lucy, you are an honest-hearted girl, I~ 12356 65522
#>  4 item9         "CHAPTER VIIIMINA MURRAY'S JOURNAL\nSame day~ 12042 62724
#>  5 item10        "CHAPTER X\nLetter, Dr. Seward to Hon. Arthu~ 12599 66678
#>  6 item11        "Once again we went through that ghastly ope~ 11919 62949
#>  7 item12        "CHAPTER XIVMINA HARKER'S JOURNAL\n23 Septem~ 12003 62234
#>  8 item13        "CHAPTER XVIDR. SEWARD'S DIARY-continued\nIT~ 13812 72903
#>  9 item14        "\"Thus when we find the habitation of this ~ 13201 69779
#> 10 item15        "\"I see,\" I said. \"You want big things th~ 12706 66921
#> 11 item16        "CHAPTER XXIIIDR. SEWARD'S DIARY\n3 October.~ 11818 61550
#> 12 item17        "CHAPTER XXVDR. SEWARD'S DIARY\n11 October, ~ 12989 68564
#> 13 item18        " \nLater.-Dr. Van Helsing has returned. He ~  8356 43464
#> 14 item19        "End of the Project Gutenberg EBook of Dracu~  2669 18541
#> 15 coverpage-wr~ ""                                                0     0
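Because sections here can start and end mid-chapter, one possible workaround is to collapse the section text and re-split it on chapter headings. The base R sketch below uses a small hypothetical stand-in for the section table rather than the real book, and it assumes headings of the form "CHAPTER" followed by a Roman numeral, which may not hold for other e-books.

```r
# Hypothetical stand-in for a section table whose sections break mid-chapter.
sections <- data.frame(
  section = c("item6", "item7", "item8"),
  text = c(
    "Preamble text.\nCHAPTER I\nFirst chapter begins",
    " and continues here.\nCHAPTER II\nSecond chapter begins",
    " and ends here.\nCHAPTER III\nThird chapter."
  ),
  stringsAsFactors = FALSE
)

# Collapse all sections so chapter boundaries no longer depend on section boundaries.
full_text <- paste(sections$text, collapse = "")

# Locate each heading (assumed pattern: "CHAPTER" plus a Roman numeral).
m <- gregexpr("CHAPTER [IVXLC]+", full_text)
starts <- as.integer(m[[1]])
headings <- regmatches(full_text, m)[[1]]

# Each chapter runs from its heading to just before the next heading.
# Any text before the first heading (the preamble) is dropped.
ends <- c(starts[-1] - 1L, nchar(full_text))
chapters <- data.frame(
  chapter = headings,
  text = substring(full_text, starts, ends),
  stringsAsFactors = FALSE
)
chapters$nword <- lengths(strsplit(trimws(chapters$text), "\\s+"))

chapters[, c("chapter", "nword")]
```

A heading pattern this simple can also match incidental text, so the result is worth checking by eye before relying on it.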
tesseract by @jeroen for more direct control of the OCR process.
pdftools for extracting metadata and text from PDF files (therefore more specific to PDF, and without a Java dependency).
tabulizer by @leeper and @tpaskhalis, bindings for the Tabula PDF Table Extractor Library, to extract tables (rather than text) from PDF files.
rtika by @goodmansasha for more general text parsing.
gutenbergr by @dgrtwo for searching and downloading public domain texts from Project Gutenberg.
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
epub_cat function for pretty printing to console as a helpful way to quickly inspect the parsed text in a more easily readable format than looking at the quoted strings in the table entries.
epub_cat can take an EPUB filename string (may be a vector) as its first argument or a data frame already returned by
epub_head accepts EPUB character filenames or now also a data frame already returned by epub based on those files. Because of this change, the first argument has been renamed from
epub function, defaulting to UTF-8.
title field when missing, redundant or requiring remapping/renaming. All outputs of epub now include a title as well as data field, even if the e-book does not have a metadata field named
epub_head function for previewing the opening text of each e-book section.
epub_meta for strictly parsing EPUB metadata without reading the full file contents.
unlink immediately after use, rather than after all files are read into memory or by overwriting files in a single temp directory.
epubr functions is not too inflexible.