Open Source OCR Engine

Bindings to 'Tesseract': a powerful optical character recognition (OCR) engine that supports over 100 languages. The engine is highly configurable in order to tune the detection algorithms and obtain the best possible results.

Simple example

text <- ocr("")

Roundtrip test: render PDF to image and OCR it back to text

library(tesseract)
library(pdftools)
# A PDF file with some text
news <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
orig <- pdf_text(news)[1]
# Render pdf to jpeg/tiff image
bitmap <- pdf_render_page(news, dpi = 300)
tiff::writeTIFF(bitmap, "page.tiff")
# Extract text from images
out <- ocr("page.tiff")
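The roundtrip above stops once the text is extracted. One way to check how much of the original survived is to compare the two strings word by word. The following is only a rough sketch in base R: `word_overlap` is a hypothetical helper (not part of tesseract or pdftools), and the share of shared unique words is an assumed, fairly crude similarity measure.

```r
# Hypothetical helper (not part of the tesseract package):
# share of unique words in `orig` that also appear in `out`.
word_overlap <- function(orig, out) {
  tokenize <- function(x) {
    # Lowercase, split on runs of non-alphanumeric characters, drop empties
    words <- unlist(strsplit(tolower(x), "[^[:alnum:]]+"))
    unique(words[nzchar(words)])
  }
  a <- tokenize(orig)
  b <- tokenize(out)
  # No words in the reference text: the ratio is undefined
  if (length(a) == 0) return(NA_real_)
  length(intersect(a, b)) / length(a)
}
```

With the variables from the roundtrip example, `word_overlap(orig, out)` gives a value between 0 and 1, where values close to 1 suggest the OCR output recovered most of the words on the page.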

On Windows and MacOS the binary package can be installed from CRAN:

install.packages("tesseract")

Installation from source on Linux or OSX requires the Tesseract library (see below).

On Debian or Ubuntu install libtesseract-dev and libleptonica-dev. Also install tesseract-ocr-eng to run the English examples.

sudo apt-get install -y libtesseract-dev libleptonica-dev tesseract-ocr-eng

On Fedora and CentOS we need tesseract-devel and leptonica-devel:

sudo yum install tesseract-devel leptonica-devel

On OS-X use tesseract from Homebrew:

brew install tesseract

Tesseract uses training data to perform OCR. Most systems default to English training data. To improve OCR performance for other languages you can install the training data from your distribution. For example, the Spanish training data is packaged as tesseract-ocr-spa on Debian and Ubuntu, and as tesseract-langpack-spa on Fedora.

On other platforms you can manually download training data from GitHub and store it in a directory on disk, then pass that directory in the datapath parameter. Alternatively you can set a default path via the TESSDATA_PREFIX environment variable.
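As a minimal sketch of the datapath approach: the helper below is hypothetical (`ocr_with_datapath` is not part of the package) and assumes the chosen directory directly contains a file named `<language>.traineddata`. It only relies on the `language` and `datapath` arguments of `tesseract()` and the `engine` argument of `ocr()`.

```r
# Hypothetical helper (not part of the tesseract package): run OCR with
# training data stored in a custom directory.
# Assumption: `datapath` directly contains "<language>.traineddata".
ocr_with_datapath <- function(image, language, datapath) {
  traineddata <- file.path(datapath, paste0(language, ".traineddata"))
  # Fail early with a readable message instead of an engine init error
  if (!file.exists(traineddata)) {
    stop("Training data not found: ", traineddata)
  }
  # Create an engine that reads its training data from the custom path
  engine <- tesseract::tesseract(language = language, datapath = datapath)
  tesseract::ocr(image, engine = engine)
}
```

A call could look like `ocr_with_datapath("page.tiff", "spa", "~/tessdata")`, where both the image and the directory are placeholders. Setting `Sys.setenv(TESSDATA_PREFIX = "~/tessdata")` before creating an engine is the environment-variable alternative mentioned above.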



  • Windows, MacOS: Upgrade to upstream Tesseract 4.0! Completely new OCR engine.
  • Tesseract 4 has a new training data format. On Windows / MacOS you need to re-download your language data with tesseract_download(). The package uses separate directories for storing Tesseract 3 vs 4 data so they shouldn't get mixed up (hopefully).
  • Drop hard-dependency on tibble (only load if available)


  • Fix problem with setlocale() not properly restoring locale.
  • Switch examples from dontrun{} to donttest{}, and '--run-donttest' on travis/appveyor


  • Fixes for breaking changes in Tesseract 4.0.0 beta.3
  • Set LC_ALL = C when initiating tesseract
  • Include <tesseract/*> to support Tesseract 4


  • Fixes for 4.0.0-beta.1: they switched to semver + other data branch
  • Set LC_CTYPE to "C" when loading training data (required for some asian languages)
  • Add back OSD training data on Windows


  • Set tesseract parameters at init so that all parameters types now actually work!
  • New function tesseract_params() lists all supported parameters and their defaults
  • Added 'config' argument to tesseract() which specifies a file with parameter values
  • Internally validate parameter names before init to prevent tesseract crashes
  • Rewrite the ocr_data() function in C++ to make it much faster
  • Tesseract 4 now gets data from the tessdata_fast repo as recommended upstream
  • Use default resolution of 300dpi when image does not contain resolution info


  • Tesseract 4 now downloads training data from the "tessdata_fast" repo
  • Add ocr_data() function that parses the hOCR output


  • Add support for HOCR output (#20)
  • Remove 'script' and 'orientation' attributes in output (doesn't work in Tesseract 4)

1.7 (internal)

  • Add support for upcoming Tesseract 4 (compiler fix + separate tessdata dir)
  • Configure script now explicitly tests for CXX11 (required by Tesseract 4)


  • Windows: update libtesseract to 3.05.01
  • tesseract_download now uses 3.04 tree (instead of 4.00) as suggested in readme
  • For static packages on Win/Mac, languages stored in: rappdirs::user_data_dir('tesseract')
  • Use 'png' instead of 'tiff' to read magick images
  • Compile with $(C_VISIBILITY) to hide internal symbols (requires Rcpp 0.12.12)
  • Use Rcpp symbol registration


  • Run engine finalizer on R exit (requires Rcpp 0.12.10)
  • Move autobrew script to separate repository
  • Add symbol registration


  • tesseract() gains an 'options' parameter for setting engine variables
  • New tesseract_download() function for installing training data on Win/Mac
  • Initiate default tesseract engine onAttach() to fail for missing training data
  • Add support for ocr() on magick images


  • Try to fix build for CRAN OS-X, again.


  • Try to fix build for CRAN OS-X build server
  • Show 'loaded' and 'available' languages in print.tesseract()


  • Initial CRAN release


5.0.0 by Jeroen Ooms, 17 days ago (website) (devel)

Authors: Jeroen Ooms [aut, cre]

Task views: Natural Language Processing

License: Apache License 2.0

Imports: Rcpp, pdftools, curl, rappdirs, digest

Suggests: magick, spelling, knitr, tibble, rmarkdown

Linking to: Rcpp

System requirements: Tesseract >= 3.03 (libtesseract-dev / tesseract-devel) and Leptonica (libleptonica-dev / leptonica-devel). On Debian you need to install the English training data separately (tesseract-ocr-eng).

Suggested by imagerExtra, magick, pdftools, textreadr.