Extractive Summarization of Text with the LexRank Algorithm

An R implementation of the LexRank algorithm described by G. Erkan and D. R. Radev (2004).



Installation

# install from CRAN
install.packages("lexRankr")

# install from this GitHub repo
devtools::install_github("AdamSpannbauer/lexRankr")

Overview

lexRankr is an R implementation of the LexRank algorithm discussed by Güneş Erkan & Dragomir R. Radev in LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. LexRank is designed to summarize a cluster of documents by proposing which sentences subsume the most information in that particular set of documents. The algorithm may not perform well on a set of unclustered or unrelated documents.

As the paper's title suggests, the sentences are ranked based on their centrality in a graph. The graph is built upon the pairwise similarities of the sentences (where similarity is measured with an idf-modified cosine similarity function). The paper describes multiple ways to calculate centrality, and these options are available in the R package. The sentences can be ranked according to their degree centrality or by using the PageRank algorithm (both of these methods require setting a minimum similarity threshold for a sentence pair to be included in the graph). A third variation is Continuous LexRank, which does not require a minimum similarity threshold but instead uses a weighted graph of sentences as the input to PageRank.
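To make the three ranking options concrete, here is a minimal base R sketch. It is not the package's internal code: the similarity values are made up, self-similarity is ignored for simplicity, and the pagerank() helper is a hypothetical power-iteration function written only for this illustration.

```r
# Toy symmetric similarity matrix for 4 sentences (values are invented).
simil <- matrix(c(
  0.0, 0.5, 0.3,  0.0,
  0.5, 0.0, 0.1,  0.0,
  0.3, 0.1, 0.0,  0.25,
  0.0, 0.0, 0.25, 0.0
), nrow = 4, byrow = TRUE)

threshold <- 0.2

# Unweighted graph: keep an edge only where similarity exceeds the threshold.
adj <- (simil > threshold) * 1

# Option 1: degree centrality = number of neighbours in the thresholded graph.
degree <- rowSums(adj)

# Hypothetical helper: PageRank by power iteration on a weight matrix.
# Assumes every row of `w` has at least one non-zero entry.
pagerank <- function(w, damping = 0.85, tol = 1e-10, max_iter = 1000) {
  n <- nrow(w)
  trans <- w / rowSums(w)          # row-normalise into transition probabilities
  p <- rep(1 / n, n)               # start from a uniform distribution
  for (i in seq_len(max_iter)) {
    p_new <- (1 - damping) / n + damping * as.vector(t(trans) %*% p)
    if (sum(abs(p_new - p)) < tol) break
    p <- p_new
  }
  p_new
}

# Option 2: PageRank on the thresholded (unweighted) graph.
pr_thresholded <- pagerank(adj)

# Option 3: continuous variant, no threshold, similarities used as edge weights.
pr_continuous <- pagerank(simil)

data.frame(sentence = 1:4, degree, pr_thresholded, pr_continuous)
```

In this toy graph the thresholded edges form a chain (2-1-3-4), so sentences 1 and 3 receive the highest degree and the highest thresholded PageRank score, while the continuous variant also takes the weak 0.1 similarity between sentences 2 and 3 into account.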

Note: the LexRank algorithm is designed to work on a cluster of documents. LexRank is built on the idea that a cluster of documents will focus on similar topics.

Note: pairwise sentence similarity is calculated for the entire set of documents passed to the function. This can be a computationally intensive process (especially with a large set of documents).
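As a rough illustration of why the cost grows quickly, the number of unordered sentence pairs is n * (n - 1) / 2 for n sentences. The sentence counts below are arbitrary examples, not benchmarks of the package.

```r
# Number of unordered sentence pairs to compare for n sentences.
n_sentences <- c(10, 100, 1000, 10000)
n_pairs <- choose(n_sentences, 2)

data.frame(n_sentences, n_pairs)
```

Going from 1,000 to 10,000 sentences raises the pair count from 499,500 to 49,995,000, roughly a hundredfold increase for a tenfold increase in input.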

Basic Usage

library(lexRankr)
library(dplyr)

# one row per document
df <- tibble(doc_id = 1:3,
             text = c("Testing the system. Second sentence for you.",
                      "System testing the tidy documents df.",
                      "Documents will be parsed and lexranked."))

df %>%
    # split each document into one row per sentence
    unnest_sentences(sents, text) %>%
    # add a lexrank score for each sentence
    bind_lexrank(sents, doc_id, level = 'sentences') %>%
    # highest scoring sentences first
    arrange(desc(lexrank))

News

lexRankr 0.5.2

  • fix damping bug where damping parameter wasn't passed to igraph::page_rank

lexRankr 0.5.1

  • changed smart_stopwords to be internal data so that the package doesn't need to be explicitly loaded with library to be able to parse

lexRankr 0.5.0

  • bug fix in sentence parsing for exclamatory sentences
  • converted idf calculation from idf(d, t) = log( n / df(d, t) ) to idf(d, t) = log( n / df(d, t) ) + 1 to avoid zeroing out common word tfidf values
  • removed dplyr, tidyr, stringr, magrittr, & tm as dependencies
  • created option to bypass the assumption that each row/vector element is a different document in lexRank and unnest_sentences
  • various bug fixes in token & sentence parsing
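
The effect of the idf change listed above can be seen in a small base R sketch. The toy corpus and variable names are invented for illustration, and this is not the package's own tokenizing or idf code.

```r
# Toy corpus: each element holds the tokens of one document.
docs <- list(
  c("system", "testing", "documents"),
  c("system", "parsed"),
  c("system", "documents", "ranked")
)
n <- length(docs)
vocab <- sort(unique(unlist(docs)))

# Document frequency: in how many documents does each term appear?
doc_freq <- sapply(vocab, function(term) {
  sum(sapply(docs, function(d) term %in% d))
})

idf_old <- log(n / doc_freq)      # a term found in every document gets idf 0
idf_new <- log(n / doc_freq) + 1  # the same term now keeps a positive weight

data.frame(term = vocab, doc_freq, idf_old, idf_new, row.names = NULL)
```

Here "system" appears in all three documents, so its idf is 0 under the old formula and 1 under the new one, which keeps its tf-idf value from being zeroed out.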

lexRankr 0.4.1

  • added bug report url: (https://github.com/AdamSpannbauer/lexRankr/issues/)
  • formatting updates to README.md

lexRankr 0.4.0

  • added functions unnest_sentences and unnest_sentences_ to parse sentences in a dataframe following tidy data principles
  • added functions bind_lexrank and bind_lexrank_ to calculate lexrank scores for sentences in a dataframe following tidy data principles (unnest_sentences & bind_lexrank can be used on a df in a magrittr pipeline)
  • added vignette for using lexrank to analyze tweets

lexRankr 0.3.0

  • sentence similarity from sentenceSimil now calculated using Rcpp. Improves speed by ~25%-30% over old implementation using proxy package

lexRankr 0.2.0

  • Added logic to avoid naming conflicts in proxy::pr_DB in sentenceSimil (#1, @AdamSpannbauer)
  • Added check and error for cases where no sentences are above the threshold in lexRankFromSimil (#2, @AdamSpannbauer)
  • tokenize now has stricter punctuation removal. Removes all non-alphanumeric characters as opposed to removing [:punct:]

0.5.2 by Adam Spannbauer, 2 months ago


https://github.com/AdamSpannbauer/lexRankr/


Report a bug at https://github.com/AdamSpannbauer/lexRankr/issues/


Browse source code at https://github.com/cran/lexRankr


Authors: Adam Spannbauer [aut, cre] , Bryan White [ctb]



MIT + file LICENSE license


Imports SnowballC, igraph, Rcpp

Suggests covr, testthat, R.rsp

Linking to Rcpp

