Reads in text from 'unstructured' modern Microsoft Office files (XML-based files) such as Word and PowerPoint. This does not read in structured data (from Excel or Access), as there are many other great packages that do so already.
A package for reading in text from 'unstructured' modern Microsoft Office file types
If you do any kind of text analysis work, you probably have text arrive in inconvenient formats like Word or PowerPoint. While copy and paste can be an effective way of getting the text into an easily readable format, this package aims to make loading in those files even easier.
.docx and .pptx files are supported
For .docx files, returns a character vector with one element for each paragraph in the file.
For .pptx files, returns a list with one element for each slide. Each element contains a character vector with one entry for each paragraph of text on the slide. Slides are kept in order.
Updated release with minor improvements to functions to read in Microsoft Word and PowerPoint files.
Components on PowerPoint slides are stored in a named list to preserve structure. Tables on PowerPoint slides are now detected and extracted as character matrices.
The .docx file is read in, split at each XML-defined paragraph, and returned as a character vector.
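To make "XML-defined paragraph" concrete, here is a minimal sketch in Python (standard library only) of how a .docx file can be split into paragraphs. It is an illustration of the file format, not this package's implementation. It assumes the usual .docx layout: a zip archive whose main text lives in word/document.xml, with paragraphs as w:p elements and text as w:t elements. The names docx_paragraphs and make_demo_docx are hypothetical helpers invented for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return one string per <w:p> element found in word/document.xml."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more <w:t> elements.
        pieces = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(pieces))
    return paragraphs


def make_demo_docx():
    """Build a tiny in-memory archive holding only word/document.xml.

    This is enough for the sketch above, but it is not a complete,
    Word-openable document.
    """
    texts = ["First paragraph.", "Second paragraph."]
    body = "".join(f"<w:p><w:r><w:t>{t}</w:t></w:r></w:p>" for t in texts)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for line in docx_paragraphs(make_demo_docx()):
        print(line)
```

The sketch walks every w:p element in document order, so paragraphs nested inside tables or text boxes would also be picked up. A real reader may need to treat those cases separately.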
The .pptx file is read in, and each slide is processed and returned as an element of a list. For each slide, most components (titles, subtitles, text blocks, shapes, tables) are identified and their text is extracted. This text is returned as either a data.frame or a matrix (for tables), with minor formatting details provided, and is stored in a named list whose names are the slide component names.
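The per-slide structure can be sketched in the same way. The following Python example (standard library only) is an illustration under stated assumptions, not this package's implementation. It assumes slides are stored as ppt/slides/slideN.xml, that each text-bearing component is a p:sp element whose name sits on p:cNvPr, and that text lives in a:p and a:t elements. It orders slides by the number in the file name, which is a simplification: the displayed order is defined elsewhere in the archive. Tables and formatting details are left out. The names pptx_slides and make_demo_pptx are hypothetical helpers.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def pptx_slides(source):
    """Return a list with one dict per slide: component name -> paragraphs."""
    slides = []
    with zipfile.ZipFile(source) as archive:
        names = [n for n in archive.namelist() if SLIDE_RE.match(n)]
        # Simplification: order slides by the number in the file name.
        names.sort(key=lambda n: int(SLIDE_RE.match(n).group(1)))
        for name in names:
            root = ET.fromstring(archive.read(name))
            components = {}
            for shape in root.iter(f"{{{P_NS}}}sp"):
                props = shape.find(f"{{{P_NS}}}nvSpPr/{{{P_NS}}}cNvPr")
                label = props.get("name") if props is not None else "unnamed"
                # Shapes that share a name would overwrite each other here.
                components[label] = [
                    "".join(t.text or "" for t in para.iter(f"{{{A_NS}}}t"))
                    for para in shape.iter(f"{{{A_NS}}}p")
                ]
            slides.append(components)
    return slides


def make_demo_pptx():
    """Build a tiny in-memory archive with two slide parts (not a full deck)."""
    def shape(name, paras):
        body = "".join(f"<a:p><a:r><a:t>{p}</a:t></a:r></a:p>" for p in paras)
        return (f'<p:sp><p:nvSpPr><p:cNvPr id="1" name="{name}"/></p:nvSpPr>'
                f"<p:txBody>{body}</p:txBody></p:sp>")

    def slide_xml(title, lines):
        return (f'<p:sld xmlns:a="{A_NS}" xmlns:p="{P_NS}"><p:cSld><p:spTree>'
                + shape("Title 1", [title]) + shape("Content 2", lines)
                + "</p:spTree></p:cSld></p:sld>")

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # Written out of order on purpose to exercise the numeric sort.
        archive.writestr("ppt/slides/slide2.xml", slide_xml("Second", ["c"]))
        archive.writestr("ppt/slides/slide1.xml", slide_xml("First", ["a", "b"]))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for number, slide in enumerate(pptx_slides(make_demo_pptx()), start=1):
        print(number, slide)
```

Keying each slide's text by component name mirrors the named-list idea described above: a caller can pick out a title or a content block by name instead of by position.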