Easily Harvest (Scrape) Web Pages

Wrappers around the 'xml2' and 'httr' packages to make it easy to download, then manipulate, HTML and XML.


rvest helps you scrape information from web pages. It is designed to work with magrittr to make it easy to express common web-scraping tasks, inspired by libraries like Beautiful Soup.

library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
 
rating <- lego_movie %>% 
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()
rating
 
cast <- lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()
cast
#>  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"    
#>  [4] "Alison Brie"     "David Burrows"   "Anthony Daniels"
#>  [7] "Charlie Day"     "Amanda Farinos"  "Keith Ferguson" 
#> [10] "Will Ferrell"    "Will Forte"      "Dave Franco"    
#> [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"
 
poster <- lego_movie %>%
  html_nodes(".poster img") %>%
  html_attr("src")
poster
#> [1] "http://ia.media-imdb.com/images/M/MV5BMTg4MDk1ODExN15BMl5BanBnXkFtZTgwNzIyNjg3MDE@._V1_UX182_CR0,0,182,268_AL_.jpg"

The most important functions in rvest are:

  • Create an html document from a url, a file on disk or a string containing html with read_html().

  • Select parts of a document using CSS selectors: html_nodes(doc, "table td") (or if you're a glutton for punishment, use XPath selectors with html_nodes(doc, xpath = "//table//td")). If you haven't heard of SelectorGadget, make sure to read vignette("selectorgadget") to learn about it.

  • Extract components with html_tag() (the name of the tag), html_text() (all text inside the tag), html_attr() (contents of a single attribute) and html_attrs() (all attributes).

  • (You can also use rvest with XML files: parse with xml(), then extract components using xml_node(), xml_attr(), xml_attrs(), xml_text() and xml_tag().)

  • Parse tables into data frames with html_table().

  • Extract, modify and submit forms with html_form(), set_values() and submit_form().

  • Detect and repair encoding problems with guess_encoding() and repair_encoding().

  • Navigate around a website as if you're in a browser with html_session(), jump_to(), follow_link(), back(), forward(), submit_form() and so on. (This is still a work in progress, so I'd love your feedback.)
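As a minimal, self-contained sketch of the extraction functions above (the HTML snippet is made up for illustration; html_name() is the newer name for html_tag()):

```r
library(rvest)

# read_html() also accepts a URL or a file on disk; here we parse a string
doc <- read_html('<p class="lead">Hello <b>rvest</b></p>')

node <- doc %>% html_node("p")
html_name(node)           # name of the tag: "p"
html_text(node)           # all text inside the tag: "Hello rvest"
html_attr(node, "class")  # contents of a single attribute: "lead"
html_attrs(node)          # all attributes
```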

To see examples of these functions in use, check out the demos.
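For the form helpers, here is a hedged sketch using a made-up form parsed from a string (actually submitting it would additionally need a live html_session(), so that step is omitted):

```r
library(rvest)

# A made-up search form, parsed from a string rather than a live site
page <- read_html('
  <form method="get" action="/search">
    <input type="text" name="q" value="">
    <input type="submit">
  </form>')

form <- html_form(page)[[1]]                # extract the first (only) form
form <- set_values(form, q = "lego movie")  # fill in the q field
form$fields$q$value
#> [1] "lego movie"
```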

Install the release version from CRAN:

install.packages("rvest")

Or the development version from GitHub:

# install.packages("devtools")
devtools::install_github("hadley/rvest")

News

rvest 0.3.2

  • Fixes to follow_link() and back() to correctly manage session history.

  • If you're using xml2 1.0.0, html_node() will now return a "missing node".

  • Parse rowspans and colspans sensibly, filling cells by repetition from left to right (colspan) and top to bottom (rowspan) (#111).

  • Updated a few examples and demos where the website structure has changed.

  • Made compatible with both xml2 0.1.2 and 1.0.0.
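The span-filling rule above can be illustrated on a made-up table (the cell under the colspan is repeated into both columns):

```r
library(rvest)

# A table whose second row spans both columns
doc <- read_html('
  <table>
    <tr><th>a</th><th>b</th></tr>
    <tr><td colspan="2">x</td></tr>
  </table>')

df <- html_table(doc)[[1]]
df$a  # "x"
df$b  # "x", filled from the colspan by repetition
```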

rvest 0.3.1

  • Fix invalid link for SSA example.

  • Parse <options> that don't have value attribute (#85).

  • Remove all remaining uses of html() in favor of read_html() (@jimhester, #113).

rvest 0.3.0

  • rvest has been rewritten to take advantage of the new xml2 package. xml2 provides a fresh binding to libxml2, avoiding many of the work-arounds previously needed for the XML package. Now rvest depends on the xml2 package, so all the xml functions are available, and rvest adds a thin wrapper for html.

  • A number of functions have changed names. The old versions still work, but are deprecated and will be removed in rvest 0.4.0.

    • html_tag() -> html_name()
    • html() -> read_html()

  • html_node() now throws an error if there are no matches, and a warning if there's more than one match. I think this should make it more likely to fail clearly when the structure of the page changes.

  • xml_structure() has been moved to xml2. New html_structure() (also in xml2) highlights id and class attributes (#78).

  • submit_form() now works with forms that use GET (#66).

  • submit_request() (and hence submit_form()) is now case-insensitive, and so will find <input type=SUBMIT> as well as <input type="submit">.

  • submit_request() (and hence submit_form()) recognizes forms with <input type="image"> as a valid form submission button, per http://www.w3.org/TR/html-markup/input.image.html.

rvest 0.2.0

  • html() and xml() pass ... on to httr::GET() so you can more finely control the request (#48).

  • Add xml support: parse with xml(), then work with it using xml_node(), xml_attr(), xml_attrs(), xml_text() and xml_tag() (#24).

  • xml_structure(): new function that displays the structure (i.e. tag and attribute names) of an xml/html object (#10).

  • follow_link() now accepts css and xpath selectors. (#38, #41, #42)

  • html() does a better job of dealing with encodings (passing the problem on to XML::htmlParse()) instead of trying to do it itself (#25, #50).

  • html_attr() returns the default value when its input is NULL (#49).

  • Add missing html_node() method for session.

  • html_nodes() now returns an empty list if no elements are found (#31).

  • submit_form() converts relative paths to absolute URLs (#52). It also deals better with 0-length inputs (#29).
