Provides a 'tm' Source to create corpora from articles exported from the 'LexisNexis' content provider as HTML files. It reads both the text content and the metadata (including source, date, title, author and pages). Note that the file format is highly unstable: there is no guarantee that this package will work for your corpus, and you may have to adjust the code to adapt it to your particular format.
Version 1.4.0 - 2018-06-05 * Rework parsing code to make it more robust to variations in the HTML format.
Version 1.3.1 - 2017-06-30 * Fix date parsing on Mac (thanks to Simon Naitram for reporting this).
Version 1.3 - 2016-06-29 * Support more variants of the format (though many likely remain unsupported).
Version 1.2 - 2015-02-22 * Support importing English HTML files (thanks to Oriol Mirosa for sending an example file).
Version 1.1 - 2014-05-31 * Adapt to tm 0.6. * Change all tags to lowercase (for consistency with tm).
Version 1.0 - 2014-02-10 * Initial release with support for HTML files saved in French only.