'Rcpp' Wrapper for 'MeCab' Library

An R package based on 'Rcpp' for 'MeCab': Yet Another Part-of-Speech and Morphological Analyzer. The package aims to provide a seamless environment for developing with and analyzing CJK texts. Its 'posParallel()' function uses parallel programming for efficient text preprocessing. For installation instructions, see the README.md file.


RcppMeCab is an Rcpp wrapper for MeCab, the part-of-speech and morphological analyzer. It supports native UTF-8 encoding in its C++ code and works with the CJK (Chinese, Japanese, and Korean) MeCab libraries. It relies on Rcpp to move the analysis into C++, which speeds up text processing in R.

Installation

Linux and Mac OSX

First, install the MeCab build for your language of choice.

Second, install RcppMeCab from CRAN, or install the development version from GitHub:

install.packages("RcppMeCab") # build from source

# or, install the development version from GitHub
# install.packages("devtools")
devtools::install_github("junhewk/RcppMeCab")

Windows

Set the language you want to analyze with the environment variable MECAB_LANG. The default value is ko; to analyze Japanese or Chinese, set it to ja before installing the package.

install.packages("RcppMeCab") # for installing Korean version

# or, install for Japanese
Sys.setenv(MECAB_LANG = 'ja') # select the Japanese build before installing
install.packages("RcppMeCab", type="source") # build from source

# or, install the development version from GitHub
# install.packages("devtools")
devtools::install_github("junhewk/RcppMeCab")

To run the analysis, you also need the MeCab binary and a dictionary.

For Korean:

Install the mecab-ko-msvc and mecab-ko-dic-msvc builds that match your 32-bit or 64-bit Windows version into C:\mecab. Then pass the dictionary directory location to the RcppMeCab functions.

For Japanese:

Install the MeCab binary, then pass the dictionary directory location to the RcppMeCab functions. For example: pos(sentence, sys_dic = "C:/PROGRA~2/mecab/dic/ipadic")
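Instead of passing sys_dic on every call, the dictionary location can be stored once per session in the mecabSysDic option, which the RcppMeCab functions check. The following is a minimal sketch; the path is the example directory used above and should be replaced with the directory on your machine that contains the dicrc file.

```r
# Example location taken from the text above; replace it with the
# directory that contains the dicrc file on your machine.
sys_dic_dir <- "C:/PROGRA~2/mecab/dic/ipadic"

# Store the preference for the current R session.
options(mecabSysDic = sys_dic_dir)

# RcppMeCab functions consult this option to find the system dictionary.
getOption("mecabSysDic")
```

The option lasts only for the current session, so it is typically set at the top of a script or in a startup file.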

Usage

This package provides two functions, pos and posParallel.

pos(sentence) # returns a list; the input sentences become the names of the list
pos(sentence, join = FALSE) # yields morphemes only (tags are given as the vector names)
pos(sentence, format = "data.frame") # returns the result as a data frame
pos(sentence, user_dic = user_dic) # uses a compiled user dictionary
posParallel(sentence, user_dic = user_dic) # parallelized version; uses more memory but is much faster than the single-threaded loop
  • sentence: the text to analyze
  • join: if TRUE, the output form is (morpheme/tag); if FALSE, the output form is (morpheme) with the tag stored in an attribute.
  • format: the default is a list; if set to "data.frame", the function returns the result as a data frame.
  • sys_dic: the directory in which the dicrc file is located; the default value is "", or you can set your own default with options(mecabSysDic = "")
  • user_dic: a user dictionary file compiled by mecab-dict-index; the default value is also ""
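The calls above can be combined into one short script. This is a sketch under stated assumptions: the two Korean sentences are illustrative input, and the analysis only runs if RcppMeCab and a matching MeCab system dictionary are installed. Errors are caught so that a missing dictionary produces a message rather than stopping the script.

```r
# Illustrative input; any character vector of sentences is handled the same way.
sentences <- c("안녕하세요.", "오늘은 날씨가 좋습니다.")

# Guarded so the script does nothing when RcppMeCab is not installed.
if (requireNamespace("RcppMeCab", quietly = TRUE)) {
  result <- tryCatch(
    list(
      joined    = RcppMeCab::pos(sentences),                        # morpheme/tag strings
      morphemes = RcppMeCab::pos(sentences, join = FALSE),          # tags in the names
      table     = RcppMeCab::pos(sentences, format = "data.frame"), # data frame output
      parallel  = RcppMeCab::posParallel(sentences)                 # multi-threaded
    ),
    error = function(e) {
      # Typically reached when MeCab cannot find a system dictionary.
      message("MeCab analysis failed: ", conditionMessage(e))
      NULL
    }
  )
  # Show only the top-level structure of each result.
  if (!is.null(result)) str(result, max.level = 1)
}
```

For a handful of sentences pos is sufficient; posParallel is aimed at large character vectors, where the extra memory use pays off.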

Compiling User Dictionary

The MeCab API includes DictionaryCompiler, but its code contains die() calls, so invoking it from Rcpp crashes the entire R session. For that reason, RcppMeCab does not provide a function for compiling dictionaries.

For Japanese, please refer to the MeCab documentation.

Unix and Mac OSX

You need a model file (model_file) if you want MeCab to estimate costs automatically.

To compile a Korean user dictionary, you need the entire mecab-ko-dic source. The user dictionary entries must also be prepared as a CSV file; the CSV structure is described in the Japanese (MeCab) and Korean (mecab-ko-dic) documentation.

Compile:

$ /usr/local/libexec/mecab/mecab-dict-index -m `model_file` -d `mecab_dic_location` -u `user_dictionary_file_name` -f `CSV file charset` -t `original dictionary charset` `target_csv`

# example

$ /usr/local/libexec/mecab/mecab-dict-index -m /usr/local/lib/mecab/dic/mecab-ko-dic/model.bin -d ~/mecab-ko-dic-2.0.3-20170922 -u userdic.dic -f utf8 -t utf8 ~/person.csv

Windows

  • Korean: mecab-ko-msvc includes mecab-dict-index.exe.
  • Japanese: the MeCab binary distribution includes mecab-dict-index.exe.

Use it the same way as the Linux binary to compile the dictionary.
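Once the dictionary is compiled, the resulting file is passed to pos or posParallel through user_dic. The sketch below wraps that call in a small check; pos_with_user_dic is a hypothetical helper, not part of RcppMeCab, and userdic.dic is the output file name from the compile example above.

```r
# Output file name from the compile example; adjust to your own location.
user_dic_path <- "userdic.dic"

# Hypothetical helper (not part of RcppMeCab): fail early with a clear
# message if the compiled dictionary is missing, otherwise analyze with it.
pos_with_user_dic <- function(sentence, user_dic) {
  if (!file.exists(user_dic)) {
    stop("Compiled user dictionary not found: ", user_dic)
  }
  RcppMeCab::pos(sentence, user_dic = normalizePath(user_dic, winslash = "/"))
}

# Usage (requires RcppMeCab, MeCab, and the compiled dictionary):
# pos_with_user_dic(sentence, user_dic_path)
```

Checking for the file in R first gives a readable error message when the path is wrong, before anything is passed to MeCab.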

TODOs

  • Test multilanguage support
  • Provide other useful functions
  • Provide multilanguage manuals for international support

Author

Junhewk Kim

News

RcppMeCab 0.0.1.2

  • the loop version of the pos function is fixed (it returned duplicated results)
  • sys_dic now works properly
  • each function checks getOption("mecabSysDic") to get the user's preferred MeCab system dictionary
  • the input character vectors are presented in the names attribute of the result list
  • a single character vector input in pos() now returns a list
  • an option for the result type is added: use the argument format="data.frame"

RcppMeCab 0.0.1.1

  • the posParallel function is added to support parallelization
  • the join parameter is added to yield an output of morphemes only
  • RcppParallel is added as a dependency
  • the user_dic parameter is added to support user dictionaries
  • Published on CRAN

RcppMeCab 0.0.1.0

  • First release
  • Windows support is resolved; further work is needed for multiarch installation

0.0.1.2 by Junhewk Kim, 10 months ago


Report a bug at https://github.com/junhewk/RcppMeCab/issues


Browse source code at https://github.com/cran/RcppMeCab


Authors: Junhewk Kim [aut, cre] , Taku Kudo [aut]


GPL license


Imports Rcpp, RcppParallel

Linking to Rcpp, RcppParallel, BH

System requirements: MeCab 0.996 (or mecab-ko 0.9.2) or higher, GNU make