Miscellaneous Tools for Chinese Text Mining and More

This package aims to make Chinese text mining easier, faster, and more robust to errors. A document-term matrix can be generated with a single line of code; encoding detection, word segmentation, and stop-word removal are done automatically. Some convenience tools are also supplied.


               chinese.misc NEWS and CHANGELOG


0.2.0 (2019-03-04)

  • [BUGFIX] #1: The package no longer depends on package Ruchardet. Encoding detection is now done by package stringi.

0.1.9 (2018-06-15)

  • [NEW FEATURE] #1: Add a new function sparse_left to check the number of words left under certain sparse values.

  • [NEW FEATURE] #2: Function scancn now has a collapse argument to specify symbols linking characters together.

  • [NEW FEATURE] #3: as.character2 and as.numeric2 now keep the original order of the input object.

0.1.8 (2018-02-06)

  • [NEW FEATURE] #1: Add a new function to conveniently convert objects among matrix, dgCMatrix, simple_triplet_matrix, DocumentTermMatrix, and TermDocumentMatrix.

  • [BUGFIX] #1: Only some slight changes, user-invisible.

0.1.7 (2017-05-12)

  • [NEW FEATURE] #1: Add V, VC, VR, VCR, VRC to facilitate copying data from Excel-like tables.

0.1.6 (2017-05-03)

  • [BUGFIX] #2: scancn is modified to remove Unicode replacement characters from texts.

  • [BUGFIX] #1: The $v of a dtm created by corp_or_dtm no longer has names, so it is compatible with topicmodels::LDA.

0.1.5 (2017-04-07)

  • [NEW FEATURE] #3: A new function dictionary_dtm is added to count term frequencies in groups.

  • [NEW FEATURE] #2: The computation in sort_tf and word_cor is now done on sparse matrices to save memory, rather than first converting the object into a dense matrix.

  • [NEW FEATURE] #1: The function word_cor can now compute correlations for up to 200 words; the previous limit was 30.

0.1.4 (2017-03-23)

  • [NEW FEATURE] #2: Users can now set their own locale in options() and view it with get_tmp_chi_locale().

  • [NEW FEATURE] #1: Add a new function create_ttm to generate a term-term matrix.

0.1.3 (2017-03-11)

  • [BUGFIX] #1: Modify the function scancn, with no user-visible change.

  • [NEW FEATURE] #3: Some functions temporarily modify locale values internally.

  • [NEW FEATURE] #2: Add a new function topic_trend to compute the increase or decrease of topics over the years.

  • [NEW FEATURE] #1: Add a new function word_cor to compute word correlation.

0.1.2 (2017-03-04)

  • [BUGFIX] #2: This version is compatible with package tm (>=0.7), whereas the function corp_or_dtm in the previous version sometimes raised errors due to the update of tm.

  • [BUGFIX] #1: The function as.character2 is slightly modified with no user-visible change.

  • [NEW FEATURE] #4: The argument control in function corp_or_dtm has a new default value "auto", which calls the control list named DEFAULT_control1 in the previous version. "auto1" also points to this value. "auto2" points to the value named DEFAULT_control2 in the previous version. However, DEFAULT_control1 and DEFAULT_control2 can also be used by users.

  • [NEW FEATURE] #3: The argument control in the function corp_or_dtm now differs significantly from that used by DocumentTermMatrix in package tm. Please see details in the help page of corp_or_dtm.

  • [NEW FEATURE] #2: The functions scancn and make_stoplist now have an enhanced ability to deal with unrecognizable characters.

  • [NEW FEATURE] #1: User-visible change: make_stoplist and slim_text have new arguments, which remain compatible with the functions in the previous version.

0.1.1 (2017-02-20)

  • [BUGFIX] #2: The function as.character2(x) is changed to as.character2(...), so as to coerce multiple objects at one time. The same is done to as.numeric2(...). Accordingly, some other functions of the package and their documentation are also modified.

  • [NEW FEATURE] #1: The URL of a Chinese manual is added to "chinese.misc-package" in the English manual.

  • [BUGFIX] #1: The automatically created objects DEFAULT_cutter, DEFAULT_control1, and DEFAULT_control2 can now be directly used or modified by users.

0.1.0 (2017-02-17)

  • First Release: Version 0.1.0


0.2.2 by Jiang Wu, 21 days ago


Browse source code at https://github.com/cran/chinese.misc

Authors: Jiang Wu [aut, cre] (from Capital Normal University)

GPL-3 license

Imports jiebaR, NLP, tm, stringi, slam, Matrix, purrr
