Parse XML

Work with XML files using a simple, consistent interface. Built on top of the 'libxml2' C library.


The xml2 package is a binding to libxml2, making it easy to work with HTML and XML from R. The API is somewhat inspired by jQuery.

You can install xml2 from CRAN,

install.packages("xml2")

or you can install the development version from GitHub, using devtools:

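# install.packages("devtools")
devtools::install_github("hadley/xml2")

library("xml2")

# Parse an XML string into a document
x <- read_xml("<foo> <bar> text <baz/> </bar> </foo>")
x

xml_name(x)                # name of the root node
xml_children(x)            # child elements, returned as a nodeset
xml_text(x)                # text of the node and its descendants
xml_find_all(x, ".//baz")  # nodes matching an XPath expression

# read_html() also accepts incomplete markup such as unclosed tags
h <- read_html("<html><p>Hi <b>!")
h
xml_name(h)
xml_text(h)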

There are three key classes:

  • xml_node: a single node in a document.

  • xml_document: the complete document. Acting on a document is usually the same as acting on the root node of the document.

  • xml_nodeset: a set of nodes within the document. Operations on xml_nodesets are vectorised: the operation is applied to each node in the set.
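
As a rough illustration of the three classes, the sketch below parses a small, made-up document and inspects what each call returns. The variable names are arbitrary, and the check on the document uses the class name "xml_document", which is how released versions of xml2 report it as far as I know.

```r
library(xml2)

doc <- read_xml("<foo><bar>one</bar><bar>two</bar></foo>")

# The parsed document (it also inherits from "xml_node")
class(doc)

# A single node
first_bar <- xml_find_first(doc, ".//bar")
class(first_bar)

# A set of nodes
bars <- xml_find_all(doc, ".//bar")
class(bars)

# Vectorised: xml_text() returns one string per node in the set
bar_text <- xml_text(bars)
bar_text

# Acting on the document behaves like acting on its root node
root_name <- xml_name(doc)
root_name
```

Because most functions accept any of the three classes, the same call (for example xml_text()) can usually be applied without first checking which kind of object you hold.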

xml2 has similar goals to the XML package. The main differences are:

  • xml2 takes care of memory management for you. It will automatically free the memory used by an XML document as soon as the last reference to it goes away.

  • xml2 has a very simple class hierarchy, so you don't need to think about exactly what type of object you have; xml2 will just do the right thing.

  • More convenient handling of namespaces in XPath expressions - see xml_ns() and xml_ns_strip() to get started.
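
The namespace helpers mentioned in the last point can be sketched as follows. The document and its namespace URIs are invented for illustration, and the example assumes a version of xml2 that provides xml_ns_strip() (1.0.0 or later, according to the news below).

```r
library(xml2)

x <- read_xml(
  "<foo xmlns='http://foo.com'><baz/><bar xmlns='http://bar.com'><baz/></bar></foo>"
)

# Unprefixed names do not match elements that live in a default namespace
n_plain <- length(xml_find_all(x, "//baz"))

# xml_ns() assigns the prefixes d1, d2, ... to default namespaces
ns <- xml_ns(x)
ns
n_d1 <- length(xml_find_all(x, "//d1:baz", ns))
n_d2 <- length(xml_find_all(x, "//d2:baz", ns))

# xml_ns_strip() removes the default namespaces from x in place,
# after which plain element names match again
xml_ns_strip(x)
n_stripped <- length(xml_find_all(x, "//baz"))
```

Stripping is convenient for quick exploration, while passing the xml_ns() object keeps the namespaces intact if the document will be written back out.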

News

xml2 1.0.0

  • xml_integer() and xml_double() functions to make it easy to extract integer and double text from nodes (@jimhester, #97, #99).

  • xml2 now supports modification and creation of XML nodes. New functions xml_new_document(), xml_new_child(), xml_new_sibling(), xml_set_namespace(), xml_remove(), xml_replace(), xml_root() and replacement methods for xml_name(), xml_attr(), xml_attrs() and xml_text() (@jimhester, #9, #76).

  • xml_ns() now keeps namespace prefixes that point to the same URI (@jimhester, #35, #95).

  • read_xml() and read_html() methods added for httr::response() objects. (@jimhester, #63, #93)

  • xml_child() function to make selecting children a little easier (@jimhester, #23, #94)

  • xml_find_one() has been deprecated in favor of xml_find_first() (@jimhester, #58, #92)

  • xml_read() functions now default to passing the document's namespace object. Namespace definitions can now be removed as well as added, and xml_ns_strip() has been added to remove all default namespaces from a document. (@jimhester, #28, #89)

  • read_xml() and read_html() gain an options argument to control all available parsing options, including HUGE to turn off limits when parsing very large documents. Blank text nodes are now dropped by default, mimicking the default behavior of the XML package. (@jimhester, #49, #62, #85, #88)

  • write_xml() expands the path on filenames, so directories can be specified with '~/' (@jimhester, #86, #80)

  • xml_find_one() now returns an 'xml_missing' node object if there are 0 matches (@jimhester, #55, #53, hadley/rvest#82).

  • xml_find_num(), xml_find_chr(), xml_find_lgl() functions added to return numeric, character and logical results from XPath expressions. (@jimhester, #55)

  • xml_name() and xml_text() always correctly encode returned value as UTF-8 (#54).
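
Several of the 1.0.0 additions above can be tried together in a short session. This is only a sketch on an invented document: the element names and values are assumptions chosen for illustration, and it uses xml_find_first(), the replacement for the deprecated xml_find_one().

```r
library(xml2)

x <- read_xml(
  "<order><item><qty>2</qty><price>1.5</price></item><item><qty>3</qty><price>4</price></item></order>"
)

# xml_integer() / xml_double(): parse node text as numbers
qty <- xml_integer(xml_find_all(x, ".//qty"))
price <- xml_double(xml_find_all(x, ".//price"))

# xml_find_num() / xml_find_chr() / xml_find_lgl(): typed XPath results
total_qty <- xml_find_num(x, "sum(.//qty)")
first_price <- xml_find_chr(x, "string(.//price)")
has_items <- xml_find_lgl(x, "boolean(.//item)")

# A query with no matches yields an xml_missing object rather than an error
none <- xml_find_first(x, ".//discount")
none_text <- xml_text(none)

# Replacement methods modify the document in place
first_item <- xml_find_first(x, ".//item")
xml_attr(first_item, "id") <- "a1"
first_price_node <- xml_find_first(first_item, "./price")
xml_text(first_price_node) <- "2.5"

# xml_remove() drops a node from the document
items <- xml_find_all(x, ".//item")
xml_remove(items[[2]])

remaining_prices <- xml_double(xml_find_all(x, ".//price"))
n_items <- xml_length(x)
```

Note that nodes are references into the document, so the assignments to first_item and first_price_node change x itself rather than a copy.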

xml2 0.1.2

  • Improved configure script - now works again on R-devel on Windows.

  • Compiles with older versions of libxml2.

xml2 0.1.1

  • Make configure script more cross-platform.

  • Add xml_length() to count the number of children (#32).

1.1.1 by James Hester, 9 months ago


https://github.com/hadley/xml2/


Report a bug at https://github.com/hadley/xml2/issues/


Browse source code at https://github.com/cran/xml2


Authors: Hadley Wickham [aut], James Hester [aut, cre], Jeroen Ooms [aut], RStudio [cph], R Foundation [ctb] (Copy of R-project homepage cached as example)



Task views: Web Technologies and Services


GPL (>= 2) license


Imports Rcpp

Suggests testthat, curl, covr, knitr, rmarkdown, magrittr, httr

Linking to Rcpp, BH

System requirements: libxml2: libxml2-dev (deb), libxml2-devel (rpm)


Imported by BETS, EML, MazamaSpatialUtils, OAIHarvester, OECD, RDML, RNHANES, RNRCS, Rcrawler, RefManageR, ReporteRs, Rilostat, Rnightlights, SanFranBeachWater, SchemaOnRead, W3CMarkupValidator, addinslist, algorithmia, ari, atlantistools, aws.iam, aws.s3, aws.sns, aws.sqs, banxicoR, baseballDBR, bikedata, binman, biolink, bold, bomrang, boxoffice, breathtestcore, bulletr, camsRad, ccafs, cdcfluview, cifti, congressbr, cptec, crminer, cycleRtools, dataRetrieval, datasus, dataverse, discgolf, docxtractr, dwapi, ecb, ecoseries, edeaR, edgarWebR, eechidna, enigma, epidata, etl, europepmc, fastqcr, finch, finreportr, foghorn, fulltext, gcite, gesis, getlandsat, gfer, ggiraph, gifti, glassdoor, googleformr, googlesheets, htmltidy, icesSAG, incadata, ionicons, itunesr, kableExtra, kokudosuuchi, lumendb, mRchmadness, mapsapi, midas, move, mregions, mschart, natserv, neotoma, neurohcp, nhanesA, nonmemica, oai, officer, openadds, originr, osmdata, pangaear, pdfetch, postlightmercury, prisonbrief, rNOMADS, rbcb, rbhl, rcrossref, rdefra, rdnb, rdryad, readODS, readOffice, refimpact, reqres, rerddap, rgbif, rnoaa, rnrfa, ropercenter, roxygen2, rversions, rvg, scholar, sejmRP, sidrar, slickR, smapr, softermax, solrium, sparklyr, speaq, spelling, splashr, sss, taxize, texPreview, textreadr, tidyRSS, tidycensus, tidyquant, tidyverse, traits, tropr, unpivotr, vdiffr, waccR, waterData, webchem, wikilake, wikipediatrend, wikitaxa, xesreadR.

Depended on by RSauceLabs, aptg, dicecrawler, rvest, seleniumPipes, xslt.

Suggested by DBI, OpenML, ameco, assertive.types, ckanr, covr, flextable, googleLanguageR, httptest, httr, icd, pollstR, polmineR, prodigenr, rccmisc, repurrrsive, rio, rscopus, sbtools, selectr, sharpshootR, svglite, tuber, tubern, units, xmlparsedata.

Enhanced by svgPanZoom.

