Read EPUB File Metadata and Text

Provides functions supporting the reading and parsing of internal e-book content from EPUB files. E-book metadata and text content are parsed separately and joined together in a tidy, nested tibble data frame.


epubr

Author: Matthew Leonawicz
License: MIT

Read EPUB files in R

Read EPUB text and metadata.

The epubr package provides functions supporting the reading and parsing of internal e-book content from EPUB files. E-book metadata and text content are parsed separately and joined together in a tidy, nested tibble data frame.

E-book formatting is not completely standardized across all literature. It can be challenging to curate parsed e-book content across an arbitrary collection of e-books perfectly and in completely general form, to yield a singular, consistently formatted output. Many EPUB files do not even contain all the same pieces of information in their respective metadata.

EPUB file parsing functionality in this package is intended for relatively general application to arbitrary EPUB e-books. However, poorly formatted e-books or e-books with highly uncommon formatting may not work with this package. There may even be cases where an EPUB file has DRM or some other property that makes it impossible to read with epubr.

Text is read 'as is' for the most part. The only nominal changes are minor substitutions, for example curly quotes changed to straight quotes. Substantive changes are expected to be performed subsequently by the user as part of their text analysis. Additional text cleaning can be performed at the user's discretion, such as with functions from packages like tm or qdap.
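As a rough illustration of the kind of follow-up cleaning left to the user, here is a minimal base R sketch. The helper name `clean_text` and its rules (lowercase everything, keep only ASCII letters, digits and apostrophes, collapse whitespace) are assumptions made for this example; they are not part of epubr, tm or qdap, and a real analysis may well need different rules.

```r
# Hypothetical helper, not part of epubr: a simple, ASCII-only normalization.
clean_text <- function(x) {
  x <- tolower(x)
  # Replace anything that is not a letter, digit, apostrophe, space or newline.
  x <- gsub("[^a-z0-9' \n]", " ", x)
  # Collapse runs of spaces and newlines into a single space.
  x <- gsub("[ \n]+", " ", x)
  trimws(x)
}

clean_text("CHAPTER I\nJonathan Harker's Journal -- (Kept in shorthand.)")
```

A function like this could be applied to the text column of a parsed e-book before tokenizing or counting terms.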

Installation

Install epubr from CRAN with:

install.packages("epubr")

Install the development version from GitHub with:

remotes::install_github("ropensci/epubr")

Example

Bram Stoker's Dracula, sourced from Project Gutenberg, is a good example of an EPUB file with unfortunate formatting. The first thing that stands out is that the section naming convention, item followed by ordered digits, does not differentiate sections like the book preamble from the chapters. The numbering also starts at an arbitrary point (item6). It is actually worse than this: sections are not broken at chapter boundaries, so they can begin and end in the middle of chapters.

These annoyances aside, the metadata and contents can still be read into a convenient table. Text mining analyses can still be performed on the overall book, if not so easily on individual chapters. See the package vignette for examples of how to further improve the structure of an e-book with formatting like this.

file <- system.file("dracula.epub", package = "epubr")
(x <- epub(file))
#> # A tibble: 1 x 9
#>   rights  identifier   creator  title language subject date  source   data 
#>   <chr>   <chr>        <chr>    <chr> <chr>    <chr>   <chr> <chr>    <lis>
#> 1 Public~ http://www.~ Bram St~ Drac~ en       Horror~ 1995~ http://~ <tib~
 
x$data[[1]]
#> # A tibble: 15 x 4
#>    section        text                                          nword nchar
#>    <chr>          <chr>                                         <int> <int>
#>  1 item6          "The Project Gutenberg EBook of Dracula, by ~ 11446 60972
#>  2 item7          "But I am not in heart to describe beauty, f~ 13879 71798
#>  3 item8          "\" 'Lucy, you are an honest-hearted girl, I~ 12474 65522
#>  4 item9          "CHAPTER VIIIMINA MURRAY'S JOURNAL\nSame day~ 12177 62724
#>  5 item10         "CHAPTER X\nLetter, Dr. Seward to Hon. Arthu~ 12806 66678
#>  6 item11         "Once again we went through that ghastly ope~ 12103 62949
#>  7 item12         "CHAPTER XIVMINA HARKER'S JOURNAL\n23 Septem~ 12214 62234
#>  8 item13         "CHAPTER XVIDR. SEWARD'S DIARY-continued\nIT~ 13990 72903
#>  9 item14         "\"Thus when we find the habitation of this ~ 13356 69779
#> 10 item15         "\"I see,\" I said. \"You want big things th~ 12866 66921
#> 11 item16         "CHAPTER XXIIIDR. SEWARD'S DIARY\n3 October.~ 11928 61550
#> 12 item17         "CHAPTER XXVDR. SEWARD'S DIARY\n11 October, ~ 13119 68564
#> 13 item18         " \nLater.-Dr. Van Helsing has returned. He ~  8435 43464
#> 14 item19         "End of the Project Gutenberg EBook of Dracu~  2665 18541
#> 15 coverpage-wra~ ""                                                0     0
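
One way to address the mid-chapter section breaks seen above is to re-split the text on chapter headings with epub_recombine. The following is a sketch only: the regular expression and the 1000-word threshold are assumptions chosen for this particular book, and the sift argument is used here in the form the package notes describe (an argument list passed through to epub_sift).

```r
library(epubr)

file <- system.file("dracula.epub", package = "epubr")
x <- epub(file)

# Assumed pattern: headings of the form "CHAPTER" followed by a Roman numeral.
pat <- "CHAPTER [IVX]+"

# Re-split the text at each pattern match instead of at the original item breaks.
x2 <- epub_recombine(x, pat)

# Same split, additionally dropping sections with fewer than 1000 words
# (an arbitrary threshold picked for illustration).
x3 <- epub_recombine(x, pat, sift = list(n = 1000))

x3$data[[1]]
```

Short matches, such as entries in a table of contents, also trigger a split, which is why filtering out small sections afterwards can help.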

Reference

Documentation

Complete package reference and function documentation

Related packages

tesseract by @jeroen for more direct control of the OCR process.

pdftools for extracting metadata and text from PDF files (therefore more specific to PDF, and without a Java dependency)

tabulizer by @leeper and @tpaskhalis, Bindings for Tabula PDF Table Extractor Library, to extract tables, therefore not text, from PDF files.

rtika by @goodmansasha for more general text parsing.

gutenbergr by @dgrtwo for searching and downloading public domain texts from Project Gutenberg.


Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

News

epubr 0.6.0

  • Added count_words helper function.
  • Improved word count accuracy in epub. Now also splitting words on new line characters rather than only on spaces. Now also ignoring vector elements in the split result that are most likely to not be words, such as stranded pieces of punctuation.
  • Added epub_recombine for breaking apart and recombining text sections into new data frame rows using alternative breaks based on a regular expression pattern.
  • Added epub_sift function for filtering out small text sections based on low word or character count. This function can also be used directly inside calls to epub_recombine through an argument list.
  • Added epub_reorder for reordering a specified (by index) subset of text section data frame rows according to a text parsing function (several template functions are available for convenient use to address common cases).
  • Refactored code to remove purrr dependency.
  • Added unit tests.
  • Updated function documentation, readme and vignette.
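
The word-splitting behavior described in these notes can be approximated in base R. This is a minimal sketch under stated assumptions, not the package's implementation: the function name `count_words_sketch` is hypothetical, and "likely to be a word" is taken to mean "contains at least one ASCII letter or digit".

```r
# Hypothetical sketch, not epubr's count_words.
count_words_sketch <- function(x) {
  # Split on spaces and newline characters rather than only on spaces.
  pieces <- strsplit(x, "[ \n]+")
  # Count only pieces containing a letter or digit, which ignores
  # stranded punctuation and empty strings.
  vapply(pieces, function(p) sum(grepl("[A-Za-z0-9]", p)), integer(1))
}

count_words_sketch("One two\nthree - four ...")
```

Here the stranded hyphen and the trailing ellipsis are ignored, and "two" and "three" are counted separately even though only a newline separates them.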

epubr 0.5.0

  • Added epub_cat function for pretty printing to console as a helpful way to quickly inspect the parsed text in a more easily readable format than looking at the quoted strings in the table entries. epub_cat can take an EPUB filename string (may be a vector) as its first argument or a data frame already returned by epub.
  • Like epub_cat, epub_head accepts EPUB character filenames or now also a data frame already returned by epub based on those files. Because of this change, the first argument has been renamed from file to x.
  • Added encoding argument to epub function, defaulting to UTF-8.
    • This helps significantly with reading EPUB archive files properly, e.g., providing ability to parse and substitute all the curly single and double quotes, apostrophes, various forms of hyphens and ellipses.
    • Previously, these were not substituted (e.g., replacing curly quotes with straight quotes), but attempting to do so would have failed anyway because they were not initially read correctly due to the lack of encoding specification.
    • Now non-standard characters are more likely to be read correctly, and those mentioned above are substituted with standard versions. If necessary, the encoding can be changed from UTF-8 via the new argument.
    • It appears that the EPUB format requires UTF encoding. Currently the only permissible option other than UTF-8 is UTF-16. This keeps things very simple and straightforward. Users should not encounter EPUB files in other encodings.
  • Added unit tests and updated documentation.

epubr 0.4.1

  • Improved handling of errors and better messages.
  • More robust handling of title field when missing, redundant or requiring remapping/renaming. All outputs of epub now include a title as well as data field, even if the e-book does not have a metadata field named title.
  • Minor improvements to e-book section handling.
  • Added epub_head function for previewing the opening text of each e-book section.
  • Removed R version from Depends field of DESCRIPTION. Package Imports that necessitated a higher R version were previously removed.
  • Minor fixes.
  • Updated documentation, vignette, unit tests.

epubr 0.4.0

  • Enhanced function documentation details.
  • Added epub_meta for strictly parsing EPUB metadata without reading the full file contents.
  • When working with a vector of EPUB files, functions now clean up each unzipped archive temp directory with unlink immediately after use, rather than after all files are read into memory or by overwriting files in a single temp directory.
  • Added initial introduction vignette content.
  • Minor function refactors.
  • Minor bug fixes.
  • Added unit tests.

epubr 0.3.0

  • Refactor functions.
  • Further reduce package dependencies.
  • Update unit tests and documentation.

epubr 0.2.0

  • Refactor functions.
  • Move contextual and e-book collection-specific functionality to other packages.
  • Make any other remaining edge-case related options hidden arguments so that general usage of epubr functions is not too inflexible.
  • Reduce package dependencies.
  • Add basic unit tests.
  • Add example public domain EPUB book for examples and testing.
  • Update readme and documentation.

epubr 0.1.0

  • Added initial package scaffolding and function.

Reference manual


0.6.0 by Matthew Leonawicz, 10 months ago


https://github.com/ropensci/epubr


Report a bug at https://github.com/ropensci/epubr/issues


Browse source code at https://github.com/cran/epubr


Authors: Matthew Leonawicz [aut, cre]


MIT + file LICENSE license


Imports xml2, xslt, magrittr, dplyr, tidyr

Suggests testthat, knitr, rmarkdown, lintr, covr, readr

