Extract Data Tables and Comments from 'Microsoft' 'Word' Documents

'Microsoft Word' 'docx' files provide an 'XML' structure that is fairly straightforward to navigate, especially for 'Word' tables and comments. Tools are provided to determine table count and structure, count comments, and extract and clean tables and comments from 'Microsoft Word' 'docx' documents. There is also nascent support for '.doc' and '.pptx' files.
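
To make the "fairly straightforward" XML structure concrete, here is a minimal sketch in Python (standard library only, chosen so the example is self-contained). It is not the package's implementation or API: the function names extract_tables, extract_comments and make_sample_docx are hypothetical, and the sample archive contains only the two parts the sketch reads, so it is a stand-in rather than a complete Word file. The sketch assumes the usual WordprocessingML layout, where tables are w:tbl elements in word/document.xml and comments are w:comment elements in word/comments.xml.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by docx parts.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def q(tag):
    """Return the namespace-qualified tag name ElementTree expects."""
    return "{%s}%s" % (W_NS, tag)


def cell_text(node):
    """Join the text of every w:t run found under a node."""
    return "".join(t.text or "" for t in node.iter(q("t")))


def extract_tables(docx_bytes):
    """Return each table as a list of rows; each row is a list of cell strings."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    tables = []
    # Note: iter() also visits tables nested inside cells; a real tool
    # would need to decide how to treat those.
    for tbl in root.iter(q("tbl")):
        rows = [[cell_text(tc) for tc in tr.findall("w:tc", NS)]
                for tr in tbl.findall("w:tr", NS)]
        tables.append(rows)
    return tables


def extract_comments(docx_bytes):
    """Return comments as dicts with author and text; empty list if none."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        if "word/comments.xml" not in archive.namelist():
            return []
        root = ET.fromstring(archive.read("word/comments.xml"))
    return [{"author": c.get(q("author")), "text": cell_text(c)}
            for c in root.iter(q("comment"))]


def make_sample_docx():
    """Build a tiny in-memory archive with one table and one comment (hypothetical data)."""
    def cell(text):
        return "<w:tc><w:p><w:r><w:t>%s</w:t></w:r></w:p></w:tc>" % text

    document = (
        '<w:document xmlns:w="%s"><w:body><w:tbl>'
        "<w:tr>%s%s</w:tr><w:tr>%s%s</w:tr>"
        "</w:tbl></w:body></w:document>"
    ) % (W_NS, cell("Name"), cell("Qty"), cell("Apples"), cell("3"))
    comments = (
        '<w:comments xmlns:w="%s">'
        '<w:comment w:id="0" w:author="Reviewer">'
        "<w:p><w:r><w:t>Check this figure</w:t></w:r></w:p>"
        "</w:comment></w:comments>"
    ) % W_NS
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document)
        archive.writestr("word/comments.xml", comments)
    return buffer.getvalue()


sample = make_sample_docx()
print(len(extract_tables(sample)), "table(s) found")
print(extract_tables(sample)[0])
print(extract_comments(sample))
```

The table count is simply the number of w:tbl elements, and each table's shape falls out of its w:tr and w:tc children, which is why counting and describing tables before extracting them is cheap.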



  • Fix for errors introduced by an update of the tidyverse


  • Enable support for accepting or rejecting tracked changes when reading in the document. Ref #19


  • .doc input supported (via Chris Muir)
  • UTF-8 filename support for Windows-1252 locale


  • add a logical preserve parameter to the tbl extraction functions to support preserving intra-cell whitespace (implements #9)
  • use httr vs download.file() for URL retrieval (fixes #10)

0.3.0 WIP

  • return tibbles where possible & not stomping on input type (#7)
  • change tests to test for tbl vs data.frame (related to #7)
  • don't stomp on data frame-ish input type in assign_colnames()
  • prefix :: (non-user facing tweak)
  • switch all *apply() to purrr calls since we bother to import purrr (non-user facing tweak)
  • Make Column Names Great Again! (mcga() function added; the janitor package has a more robust function.)

0.2.0 released

  • update for new xml2 pkg compatibility
  • added ability to extract comments

0.1.1 released

  • had to change budget docx url since it was 404'ing


  • new function to extract all tables and a function to cleanup column names in scraped tables



0.6.5 by Bob Rudis, 2 years ago


Report a bug at https://gitlab.com/hrbrmstr/docxtractr/issues

Browse source code at https://github.com/cran/docxtractr

Authors: Bob Rudis [aut, cre], Mark Dulhunty [ctb], Karlo Guidoni-Martins [ctb], Chris Muir [aut, ctb], John Muschelli [ctb]


License: MIT + file LICENSE

Imports: tools, xml2, purrr, dplyr, utils, httr, magrittr

Suggests: covr, tinytest

System requirements: LibreOffice (<https://www.libreoffice.org/>) required to extract data from .doc files or perform .pptx conversion.

Imported by ariExtra.
