R package based on 'Rcpp' for 'MeCab': Yet Another Part-of-Speech and Morphological Analyzer. The purpose of this package is to provide a seamless development and analysis environment for CJK texts. It uses parallel programming to provide a highly efficient text preprocessing function, 'posParallel()'. For installation, please refer to the README.md file.
This package, RcppMeCab, is an Rcpp wrapper for the part-of-speech morphological analyzer MeCab. It supports native UTF-8 encoding in its C++ code and works with the CJK (Chinese, Japanese, and Korean) MeCab libraries. It uses the speed Rcpp brings to R computation to analyze texts faster.
First, install MeCab for your language of choice.
Korean: MeCab-Ko from the Bitbucket repository
Chinese: MeCab Chinese Dic from MeCab-Chinese
Second, you can install RcppMeCab from CRAN with:
    install.packages("RcppMeCab")  # build from source

    # or install the development version from GitHub
    # install.packages("devtools")
    devtools::install_github("junhewk/RcppMeCab")
Set the language you want to analyze with the environment variable MECAB_LANG before installing the package. The default value is ko. If you want to analyze Japanese or Chinese, set it to ja.
    install.packages("RcppMeCab")  # installs the Korean version

    # or, to install for Japanese
    Sys.setenv(MECAB_LANG = 'ja')
    install.packages("RcppMeCab", type = "source")  # build from source

    # or install the development version
    # install.packages("devtools")
    devtools::install_github("junhewk/RcppMeCab")
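Because the language is read from an environment variable, it can help to confirm what the current R session will pass to the installer. The following is a minimal sketch in base R; the variable name target_lang is only illustrative, and the value 'ja' is taken from the installation commands above.

```r
# Sketch: set MECAB_LANG and confirm it before installing RcppMeCab.
# `target_lang` is a hypothetical helper variable used only in this example.
target_lang <- "ja"  # "ko" is described above as the default

Sys.setenv(MECAB_LANG = target_lang)

# Read the value back; fall back to "ko" if the variable is unset.
mecab_lang <- Sys.getenv("MECAB_LANG", unset = "ko")
message("MECAB_LANG is set to: ", mecab_lang)

# The installation itself is commented out so the snippet has no side effects.
# install.packages("RcppMeCab", type = "source")
```

The variable only affects the R session in which it is set, so run the installation from that same session.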
To run the analysis, you also need the MeCab binary and a dictionary.
Install the MeCab binary, then pass the dictionary directory to the RcppMeCab functions through sys_dic. For example:
pos(sentence, sys_dic = "C:/PROGRA~2/mecab/dic/ipadic")
This package has the pos and posParallel functions.
    pos(sentence)                         # returns a list; the sentences become the names of the list
    pos(sentence, join = FALSE)           # yields morphemes only (tags are given as the vector names)
    pos(sentence, format = "data.frame")  # returns the result as a data frame
    pos(sentence, user_dic = user_dic)    # uses a compiled user dictionary
    posParallel(sentence, user_dic = user_dic)  # parallelized version; uses more memory, but is much faster than a single-threaded loop
sys_dic: the directory in which the dicrc file is located. The default value is "", or you can set your own default using options(mecabSysDic = "").
user_dic: a user dictionary file compiled by mecab_dict_index. The default value is also "".
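The default system dictionary can be registered once per session and read back with getOption(). The following is a minimal sketch under stated assumptions: dic_dir reuses the example path from this README and should be replaced with your own dictionary directory, and the sample sentence is a placeholder. The call to pos() only runs if the package and a dicrc file are actually present.

```r
# Sketch: register a session-wide default for the system dictionary.
# `dic_dir` is the example path from this README; replace it with your own.
dic_dir <- "C:/PROGRA~2/mecab/dic/ipadic"

options(mecabSysDic = dic_dir)

# Read the preference back; "" is returned if the option was never set.
sys_dic_default <- getOption("mecabSysDic", default = "")

# A usable system dictionary directory should contain a dicrc file.
has_dicrc <- file.exists(file.path(sys_dic_default, "dicrc"))
message("Default system dictionary: ", sys_dic_default,
        if (has_dicrc) "" else " (no dicrc file found there)")

# Only call the tagger when the package and the dictionary are both available.
if (has_dicrc && requireNamespace("RcppMeCab", quietly = TRUE)) {
  sentence <- enc2utf8("replace this with a sentence in your language")
  tagged <- RcppMeCab::pos(sentence, sys_dic = sys_dic_default)
  str(tagged)
}
```

Checking for dicrc first gives a readable message when the path is wrong, instead of an error from inside the analyzer.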
The MeCab API has a DictionaryCompiler, but it contains die(). Calling it from Rcpp therefore crashes the entire R session, so it will not be included in RcppMeCab.
For Japanese, please refer to the MeCab documentation.
You need a model file (model_file) if you want the library to estimate costs automatically.
    $ /usr/local/libexec/mecab/mecab-dict-index -m <model_file> -d <mecab_dic_location> -u <user_dictionary_file_name> -f <CSV file charset> -t <original dictionary charset> <target_csv>

    # example
    $ /usr/local/libexec/mecab/mecab-dict-index -m /usr/local/lib/mecab/dic/mecab-ko-dic/model.bin -d ~/mecab-ko-dic-2.0.3-20170922 -u userdic.dic -f utf8 -t utf8 ~/person.csv
The MeCab binary version for Windows also includes mecab-dict-index. You can use it in the same way the Linux binary compiles the dictionary.
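Since the compiler has to run outside the R session, one option is to assemble the mecab-dict-index call in R and launch it as a separate process with system2(). The following is a minimal sketch, not part of RcppMeCab: build_dict_index_args is a hypothetical helper, and the paths are the example values from the command above.

```r
# Sketch: build the mecab-dict-index arguments and run the tool in its own
# process, so a failure in the compiler cannot take down the R session.
# `build_dict_index_args` is a hypothetical helper, not an RcppMeCab function.
build_dict_index_args <- function(model_file, dic_dir, out_file, csv_file,
                                  csv_charset = "utf8", dic_charset = "utf8") {
  # path.expand() resolves "~" before the path reaches the external program.
  c("-m", path.expand(model_file),
    "-d", path.expand(dic_dir),
    "-u", path.expand(out_file),
    "-f", csv_charset,
    "-t", dic_charset,
    path.expand(csv_file))
}

indexer <- "/usr/local/libexec/mecab/mecab-dict-index"
cli_args <- build_dict_index_args(
  model_file = "/usr/local/lib/mecab/dic/mecab-ko-dic/model.bin",
  dic_dir    = "~/mecab-ko-dic-2.0.3-20170922",
  out_file   = "userdic.dic",
  csv_file   = "~/person.csv"
)

if (file.exists(indexer)) {
  # system2() returns the exit status of the process (0 means success).
  status <- system2(indexer, args = shQuote(cli_args))
  if (status != 0) warning("mecab-dict-index exited with status ", status)
} else {
  message("mecab-dict-index not found at: ", indexer)
}
```

Once the dictionary has been compiled, the resulting file can be passed to pos() or posParallel() through the user_dic parameter.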
Junhewk Kim ([email protected])
pos function is fixed (duplicated result)
sys_dic is now working properly
getOption("mecabSysDic") is used to get the user preference for the MeCab system dictionary
pos() will return a list
posParallel function is added to support parallelization
join parameter is added to yield an output of morphemes only
user_dic parameter is added to support user dictionary usage