Import Articles from 'Europresse' Using the 'tm' Text Mining Framework

Provides a 'tm' Source to create corpora from articles exported from the 'Europresse' content provider as HTML files. It reads both the text content and the metadata of each article (including source, date, title, author and pages).
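As a rough illustration of the workflow described above, the sketch below builds a corpus from a single exported file. It assumes that both 'tm' and this package are installed and that the source constructor is `EuropresseSource()`, as documented in the package reference manual. The file name `europresse_export.html` is a hypothetical placeholder for an HTML export you have downloaded yourself.

```r
# Minimal sketch: build a 'tm' corpus from a Europresse HTML export.
# Assumption: both packages are installed from CRAN.
have_pkgs <- requireNamespace("tm", quietly = TRUE) &&
  requireNamespace("tm.plugin.europresse", quietly = TRUE)

# Hypothetical path to an HTML file exported from Europresse.
export_file <- "europresse_export.html"

if (have_pkgs && file.exists(export_file)) {
  library(tm)
  library(tm.plugin.europresse)

  # EuropresseSource() wraps the file; Corpus() reads one document per article.
  corpus <- Corpus(EuropresseSource(export_file))

  # Summary of the corpus (number of documents).
  print(corpus)

  # Metadata of the first article (source, date, title, author, pages, ...).
  print(meta(corpus[[1]]))
}
```

The guards only keep the snippet from failing when the packages or the file are missing; in an interactive session the two `library()` calls and the `Corpus()` line are the part that matters.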


Version 1.4 - 2016-08-23
* Fix failures with a new variant of the Europresse HTML file format (reported by Tristan Guerra).

Version 1.3 - 2014-07-25
* Support recently-introduced Europresse HTML file format.

Version 1.2 - 2014-03-26
* Fix removal of search terms highlighted in red (reported by Patrick Lâm Lê).

Version 1.1 - 2014-05-25
* Adapt to tm 0.6.
* Change all tags to lowercase (for consistency with tm).
* Stop truncating document IDs.

Version 1.0.1 - 2014-02-11
* Fix small bug when parsing dates on Mac OS X.

Version 1.0 - 2014-02-10
* Initial release with support for HTML files.



1.4 by Milan Bouchet-Valat, 4 years ago


Authors: Milan Bouchet-Valat [aut, cre]

Documentation:   PDF Manual  

Task views: Natural Language Processing

License: GPL (>= 2)

Imports: utils, NLP, tm, XML

Imported by R.temis.

Suggested by RcmdrPlugin.temis.
