Read Text Out of Modern Office Files

Reads in text from 'unstructured' modern Microsoft Office files (XML-based files) such as Word and PowerPoint. This does not read in structured data (from Excel or Access), as there are many other great packages that do so already.


A package for reading in text from 'unstructured' modern Microsoft Office file types

Why do I want this?

If you do any kind of text analysis work, you probably have text arrive in inconvenient formats like Word or PowerPoint. While copy and paste can be an effective way of getting the text into an easily readable format, this package aims to make loading in those files even easier.

Supported files

.docx and .pptx files are supported.

Usage

read_docx("file/path/to/word.docx")

Returns a character vector with one element for each paragraph in the file.
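
As a small sketch of what can be done with that return value, the example below uses a hand-written character vector as a stand-in for the output of read_docx(). The variable names and sample text are illustrative assumptions, not part of the package. It drops empty paragraphs and joins the rest into one string.

```r
# Stand-in for the result of read_docx("file/path/to/word.docx");
# the sample text is made up for illustration.
paragraphs <- c("Quarterly report", "", "Sales rose in the north.", "   ")

# Drop paragraphs that are empty or contain only whitespace.
non_empty <- paragraphs[nzchar(trimws(paragraphs))]

# Join the remaining paragraphs into a single string for text analysis.
full_text <- paste(non_empty, collapse = "\n")
```

Keeping the paragraphs as a vector until the last step makes it easy to filter or inspect them individually first.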

read_pptx("file/path/to/powerpoint.pptx")

Returns a list with one element per slide. Each element is a character vector with one entry for each paragraph of text on the slide. Slides are kept in order.
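
One way to work with that result is to flatten it into a table with one row per paragraph. The sketch below assumes the list-of-character-vectors shape described here and uses made-up sample data in place of a real read_pptx() call. The 0.2.2 release notes describe a richer named-list structure, so the code may need adjusting for that version.

```r
# Stand-in for the result of read_pptx("file/path/to/powerpoint.pptx"),
# assuming one character vector per slide; the text is illustrative only.
slides <- list(
  c("Title slide", "Subtitle text"),
  c("Agenda", "First item", "Second item")
)

# One row per paragraph, tagged with the number of the slide it came from.
slide_text <- data.frame(
  slide = rep(seq_along(slides), lengths(slides)),
  text = unlist(slides),
  stringsAsFactors = FALSE
)
```

Because slides are kept in order, the slide column matches the slide numbers in the original deck.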

News

readOffice 0.2.2

Updated release with minor improvements to functions to read in Microsoft Word and PowerPoint files.

Improvements

Components on PowerPoint slides are stored in a named list to preserve structure. Tables on PowerPoint slides are now detected and extracted as character matrices.

.docx support

The file is read in, split at the paragraphs defined in its XML, and returned as a character vector.
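
To illustrate the idea of splitting on XML-defined paragraphs, here is a minimal sketch using xml2 (one of the package's imports). It is not the package's actual implementation. It parses a tiny hand-written fragment shaped like the body of a Word document, where each w:p element is a paragraph; in a real .docx file this XML lives in word/document.xml inside the zip archive.

```r
library(xml2)

# A tiny hand-written fragment in the shape of a Word document body.
# Built with paste0() so no stray whitespace ends up between elements.
doc <- read_xml(paste0(
  '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">',
  '<w:body>',
  '<w:p><w:r><w:t>First paragraph.</w:t></w:r></w:p>',
  '<w:p><w:r><w:t>Second </w:t></w:r><w:r><w:t>paragraph.</w:t></w:r></w:p>',
  '</w:body>',
  '</w:document>'
))

# Each w:p element is one paragraph; xml_text() joins the text of its runs.
paragraph_nodes <- xml_find_all(doc, "//w:p", ns = xml_ns(doc))
paragraphs <- xml_text(paragraph_nodes)
```

Note that the second paragraph is split across two runs in the XML but still comes back as a single element, which is why paragraphs rather than runs are the natural unit.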

.pptx support

The file is read in, and each slide is processed and returned as an element of a list. On each slide, most components (titles, subtitles, text blocks, shapes, tables) are identified and their text is extracted. The text is returned as either a data.frame or a matrix (for tables), with minor formatting details provided, and stored in a named list whose names are the slide component names.

install.packages("readOffice")

0.2.2 by Mark Ewing, 2 years ago


Browse source code at https://github.com/cran/readOffice


Authors: Mark Ewing



Unlimited license


Imports xml2, rvest, purrr, magrittr

