Import Articles from 'Factiva' Using the 'tm' Text Mining Framework

Provides a 'tm' Source to create corpora from articles exported from the Dow Jones 'Factiva' content provider as XML or HTML files. It reads both the text content and the meta-data of each article (including source, date, title, author, subject, geographical coverage, company, industry, and various provider-specific fields).
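
As a rough illustration of the workflow described above, the following minimal sketch wraps an exported file in a FactivaSource and builds a 'tm' corpus from it. The helper name read_factiva() and the file name "factiva_export.html" are hypothetical and only serve the example; the sketch assumes that both 'tm' and 'tm.plugin.factiva' are installed and relies on the default reader control of VCorpus().

```r
# Hypothetical helper (not part of the package): build a tm corpus
# from a single file exported from Factiva (XML or HTML).
read_factiva <- function(path) {
  # Fail early if the export file is missing.
  stopifnot(file.exists(path))

  # Wrap the export file as a tm Source.
  src <- tm.plugin.factiva::FactivaSource(path)

  # Build a volatile corpus, one document per article.
  tm::VCorpus(src)
}

# Hypothetical path; replace with a real Factiva export.
path <- "factiva_export.html"

if (file.exists(path)) {
  corpus <- read_factiva(path)

  # Show the meta-data attached to the first article.
  print(tm::meta(corpus[[1]]))
}
```

If the file exists, the printed meta-data should contain the fields listed in the description, to the extent that they are present in the export.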


Version 1.7 - 2017-11-06
* Port from XML to xml2 package to support tm 0.8.

Version 1.6 - 2017-02-08
* Avoid importing each article twice with new Factiva HTML format.
* Add screencast showing how to export correct HTML files in ?FactivaSource.

Version 1.5 - 2014-07-05
* Fix encoding issues on non-UTF-8 systems, adding back the 'encoding' argument to work around a bug in package XML.

Version 1.4 - 2014-05-31
* Adapt to tm 0.6.
* Remove the 'encoding' argument to FactivaSource() as it is not supported by tm 0.6 (normally not needed).
* Change all tags to lowercase (for consistency with tm).
* Ensure meta-data variables which are supposed to contain only one value always do so.

Version 1.3 - 2014-01-10
* Extract Company, Industry, Information Provider Code (IPC) and Information Provider Description (IPD) meta-data (based on a patch by Grigorij Ljubownikow).
* Remove inconsistent line breaks in HTML format.
* Update to support tm 0.5-10 and clean the code a bit.

Version 1.2 - 2013-01-28
* Extract Subject and Coverage meta-data.
* Add Reuters21578 example.
* Fix handling of articles with no header or body.
* Split lead paragraphs into separate lines.
* Fix package help page to mention HTML.

Version 1.1 - 2012-06-30
* Add support for HTML files since Factiva no longer allows exporting to XML.
* Work around encoding issues on Windows (for HTML only).
* Preserve paragraph information so that e.g. makeChunks() from tm can be used to split documents into smaller pieces.

Version 1.0 - 2012-05-14
* Initial release with support for XML files.

Milan Bouchet-Valat [aut, cre], Grigorij Ljubownikow [ctb], Juliane Krueger [ctb], Tom Nicholls [ctb]

