Text Extraction, Rendering and Converting of PDF Documents

Utilities based on 'libpoppler' for extracting text, fonts, attachments and metadata from a PDF file. Also supports high-quality rendering of PDF documents to PNG, JPEG, or TIFF format, or to raw bitmap vectors for further processing in R.
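As a rough illustration of the features described above, the following sketch extracts text and metadata from a PDF and renders its first page. It assumes pdftools is installed; the test PDF is generated on the fly with base R graphics, and the file names are made up for this example.

```r
library(pdftools)

# Build a one-page PDF with base R graphics so the example needs no external file.
pdf_path <- tempfile(fileext = ".pdf")
grDevices::pdf(pdf_path)
plot.new()
text(0.5, 0.5, "Hello pdftools")
invisible(dev.off())

# Text extraction: a character vector with one element per page.
txt <- pdf_text(pdf_path)

# Metadata and font information.
info  <- pdf_info(pdf_path)   # list with entries such as `pages`
fonts <- pdf_fonts(pdf_path)  # data frame describing the fonts used

# Render page 1 to a raw bitmap array for further processing in R.
bitmap <- pdf_render_page(pdf_path, page = 1, dpi = 72)

# Convert page 1 directly to a PNG file (hypothetical output name).
png_file <- pdf_convert(pdf_path, format = "png", pages = 1, dpi = 72,
                        filenames = file.path(tempdir(), "page1.png"),
                        verbose = FALSE)
```

The bitmap returned by pdf_render_page() can be passed to packages such as 'png' or 'jpeg' for writing, while pdf_convert() writes image files directly.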


News

2.2

  • Import new qpdf package with pdf transformation tools
  • Enable pdf_data() in poppler 0.71 now that Debian has backported the encoding patch
  • Document new PPA for Ubuntu 16.04 and 18.04 with poppler 0.74

2.1

  • Windows / MacOS: update poppler to 0.73.0
  • Remove code that used the 'unstable' xpdf api
  • Use unique_ptr objects to fix memory leaks

2.0

  • Windows, MacOS: update poppler to 0.72.1 (backported UTF-8 fix)
  • Enable pdf_data() on systems with 0.72.1 or newer
  • Several encoding related fixes in text/metadata extraction
  • Add a 'tibble' class to data frames for pretty printing

1.8

  • Run configure script in bash

1.7

  • Change autobrew script to avoid dependency on xQuartz

1.6

  • pdf_render_page() and pdf_convert() gain argument to control 'antialias'
  • Small tweaks in pdf_text() for dealing with malformed pdf files

1.5

  • On Windows and MacOS we now bundle poppler-data to support non-latin text
  • Windows: Upgrade libpoppler to 0.61.0 from rwinlib
  • Windows: patch libpoppler bug that would cause pdf_convert() to generate corrupt files
  • PDF rendering errors are relayed via message() instead of warning()

1.4

  • Hide symbols on supported platforms
  • Upgrade libpoppler on Windows

1.3

  • Improve support for reading password-protected and encrypted pdf files (+ unit tests)
  • Support direct conversion from pdf to png, jpeg, tiff (+ unit tests)
  • Switch to Rcpp automatic symbol registration
  • Tweak autobrew script for legacy Mavericks builds

1.2

  • Fix autobrew for OSX Mavericks

1.1

  • Extract autobrew script to separate repo

1.0

  • Add workaround for poppler landscape truncation bug (fixes #7)

0.5

  • Rebuild poppler on Windows to support PDF rendering

0.4

  • Update Homebrew URL in configure script
  • Fix autobrew (rename libopenjpeg -> libopenjp2)
  • Update libpoppler to 0.46 on Windows

0.3

  • Update libpoppler to 0.42 on Windows
  • Use the COMPILED_BY variable on Windows to support R 3.3

0.2

  • Switch pdf_render_page to 1 based indexing
  • Fix for red/blue channel mixup in pdf_render_page
  • Update example to use local PDF file

0.1

  • Initial release


install.packages("pdftools")

2.3.1 by Jeroen Ooms, 11 days ago


https://docs.ropensci.org/pdftools (website) https://github.com/ropensci/pdftools#readme (devel) https://poppler.freedesktop.org (upstream)


Report a bug at https://github.com/ropensci/pdftools/issues


Browse source code at https://github.com/cran/pdftools


Authors: Jeroen Ooms [aut, cre]




License: MIT + file LICENSE


Imports: Rcpp, qpdf

Suggests: jpeg, png, webp, tesseract, testthat

Linking to: Rcpp

System requirements: Poppler C++ API: libpoppler-cpp-dev (deb) or poppler-cpp-devel (rpm). The unit tests also require the 'poppler-data' package (rpm/deb)


Imported by SwimmeR, TextForecast, crminer, findR, fulltext, gdiff, pdfsearch, rcoreoa, readtext, speech, tesseract, textreadr.

Suggested by LexisNexisTools, fplot, goldi, gridGraphics, hunspell, magick, pagedown, slickR, spelling, staplr, texPreview, tm.

