Parse XML

Work with XML files using a simple, consistent interface. Built on top of the 'libxml2' C library.



The xml2 package is a binding to libxml2, making it easy to work with HTML and XML from R. The API is somewhat inspired by jQuery.

Installation

You can install xml2 from CRAN,

install.packages("xml2")

or you can install the development version from GitHub, using devtools:

# install.packages("devtools")
devtools::install_github("r-lib/xml2")

Usage

library("xml2")
x <- read_xml("<foo> <bar> text <baz/> </bar> </foo>")
x
 
xml_name(x)
xml_children(x)
xml_text(x)
xml_find_all(x, ".//baz")
 
h <- read_html("<html><p>Hi <b>!")
h
xml_name(h)
xml_text(h)

There are three key classes:

  • xml_node: a single node in a document.

  • xml_doc: the complete document. Acting on a document is usually the same as acting on the root node of the document.

  • xml_nodeset: a set of nodes within the document. Operations on xml_nodesets are vectorised: they apply the operation to each node in the set.
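As a minimal sketch of how the three classes show up in practice (the sample document and variable names below are made up for illustration), the following checks the class of each kind of object. Note that class() reports the document class as "xml_document".

```r
library(xml2)

# Illustrative document; the element names are arbitrary.
doc <- read_xml("<foo><bar>a</bar><bar>b</bar></foo>")

# The whole document.
inherits(doc, "xml_document")

# A single node: the first <bar> element.
first_bar <- xml_find_first(doc, ".//bar")
inherits(first_bar, "xml_node")

# A set of nodes: every <bar> element.
all_bars <- xml_find_all(doc, ".//bar")
inherits(all_bars, "xml_nodeset")

# Operations on a nodeset are vectorised: one result per node.
xml_text(all_bars)
```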

Compared to the XML package

xml2 has similar goals to the XML package. The main differences are:

  • xml2 takes care of memory management for you. It will automatically free the memory used by an XML document as soon as the last reference to it goes away.

  • xml2 has a very simple class hierarchy, so you don't need to think about exactly what type of object you have; xml2 will just do the right thing.

  • xml2 offers more convenient handling of namespaces in XPath expressions - see xml_ns() and xml_ns_strip() to get started.
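The namespace helpers mentioned above can be sketched as follows. The document and the namespace URI are invented for the example; the point is only that an element in a default namespace is not matched by an unprefixed XPath name until you either use the prefix that xml_ns() assigns or strip the default namespace.

```r
library(xml2)

# Illustrative document with a default namespace (the URI is made up).
x <- read_xml('<root xmlns="http://example.com/ns"><item>1</item></root>')

# An unprefixed name does not match elements in a default namespace.
before <- xml_find_all(x, ".//item")
length(before)

# xml_ns() lists the namespaces; default namespaces get prefixes such as d1.
ns <- xml_ns(x)
with_prefix <- xml_find_all(x, ".//d1:item", ns)
length(with_prefix)

# Alternatively, remove default namespaces so plain names match.
xml_ns_strip(x)
after <- xml_find_all(x, ".//item")
length(after)
```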

News

xml2 1.2.0

Breaking changes

  • as_list() on xml_document objects now includes the root node in the returned list; previously the root node was omitted. The previous behavior can be obtained by using as_list()[[1L]] in place of as_list().
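A small sketch of what this change looks like, assuming xml2 1.2.0 or later and an illustrative two-level document:

```r
library(xml2)

doc <- read_xml("<foo><bar>text</bar></foo>")

# The top level of the list is now the root node itself.
l <- as_list(doc)
names(l)

# Dropping one level recovers the older shape, which started at the
# children of the root.
old_style <- as_list(doc)[[1L]]
names(old_style)
```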

New Features

  • download_xml() and download_html() helper functions to make it easy to download files (#193).

  • xml_attr() can now set attributes with no value (#198).

  • xml_serialize() and xml_unserialize() now create file connections when given character input (#179).
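As a hedged illustration of the character-input behaviour of xml_serialize() and xml_unserialize(), the sketch below round-trips a small document through a temporary file. The file name and document are arbitrary choices for the example.

```r
library(xml2)

doc <- read_xml("<foo><bar>text</bar></foo>")

# Passing a file path (character input) instead of an open connection;
# xml2 creates the file connection itself.
path <- tempfile(fileext = ".rds")
xml_serialize(doc, path)

# Read the document back from the same path.
restored <- xml_unserialize(path)

xml_name(restored)
xml_text(xml_find_first(restored, ".//bar"))
```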

Bugfixes

  • xml_find_first() no longer de-duplicates results, so the results are always the same length as the inputs (as documented) (#194).

  • xml2 can now build using libxml2 2.7.0.

  • Use Rcpp symbol registration and visibility to prevent symbol conflicts on Linux.

  • xml_add_child() now requires fewer resources to insert a node when called with .where = 0L (@heckendorfc, #175).

  • Fixed failing examples due to a change in an external resource.
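The xml_find_first() fix above can be illustrated with a short sketch (the document is invented for the example): searching within three nodes gives three results, even when one of them has no match.

```r
library(xml2)

# Three <a> elements; the middle one has no <b> child.
doc <- read_xml("<r><a><b>1</b></a><a/><a><b>3</b></a></r>")

a_nodes <- xml_find_all(doc, ".//a")

# One result per input node, so the output lines up with the input.
first_b <- xml_find_first(a_nodes, ".//b")

length(a_nodes)
length(first_b)
```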

xml2 1.1.1

  • This is a small point release addressing installation issues found with older libxml2 versions shipped with RedHat Linux 6 / CentOS 6 (#163, #164).

xml2 1.1.0

New Features

  • write_xml() and write_html() now accept connections as well as filenames for output. (#157)

  • xml_add_child() now takes a .where argument specifying where to add the new children. (#138)

  • as_xml() generic function to convert R objects to XML. The most important method is for lists, which enables full roundtrip support to and from XML. (#137, #143)

  • xml_new_root() can be used to create a new document and a root node in one step (#131).

  • xml_add_parent() inserts a new node between the node and its parent (#129)

  • Add xml_validate() to validate a document against an XML schema (#31, @jeroenooms).

  • Export xml2_types.h to allow for extension packages such as xslt.

  • xml_comment() allows you to add comment nodes to a document. (#111)

  • xml_cdata() allows you to add CDATA nodes to a document. (#128)

  • Add xml_set_text() and xml_set_name() equivalent to xml_text<- and xml_name<-. (#130).

  • Add xml_set_attr() and xml_set_attrs() equivalent to xml_attr<- and xml_attrs<-. (#109, #130)

  • Add write_html() method (#133).
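Several of the node-creation features listed above combine naturally. The following is a minimal sketch, with made-up element and attribute names, that creates a document and root in one step, inserts a child at the front with .where, and sets an attribute on every child.

```r
library(xml2)

# Create a new document and its root node in one step.
doc <- xml_new_root("catalog")

# Append a child, then insert another one before it using .where = 0L.
xml_add_child(doc, "book", "Second")
xml_add_child(doc, "book", "First", .where = 0L)

# Set the same attribute on every child (vectorised over the nodeset).
books <- xml_children(doc)
xml_set_attr(books, "lang", "en")

# Add a comment node at the end of the root.
xml_add_child(doc, xml_comment("end of catalog"))

xml_text(books)
xml_attr(books, "lang")
```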

Bugfixes

  • xml_new_document() now explicitly sets the encoding (default UTF-8) (#142)

  • Document formatting options for write_xml() (#132)

  • Add missing methods for xml_missing objects. (#134)

  • Bugfix for xml_length.xml_nodeset that caused it to fail unconditionally. (#140)

  • is.na() now returns TRUE for xml_missing objects. (#139)

  • Trim non-breaking spaces in xml_text(trim = TRUE) (#151).

  • Allow setting non-character attributes (values are coerced to characters). (@sjp, #117, #122).

  • Fixed return value in call to vapply in xml_integer.xml_nodeset. (@ddiez, #146, #147).

  • Allow docs missing a root element to be created and printed. (@sjp, #126, #121).

  • xml_add_* methods now return invisibly. (@sjp, #124)

  • as_list() now preserves element names when attributes exist, and escapes XML attributes that conflict with special R attributes (@peterfoley, #115).
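A few of these fixes are easy to check interactively. The sketch below uses an invented document to show a missing node reporting TRUE from is.na(), xml_length() working on a nodeset, and a numeric attribute value being coerced to character.

```r
library(xml2)

doc <- read_xml("<r><a><b/><b/></a><a/></r>")

# A search with no match yields an xml_missing object, and is.na() is TRUE.
missing_node <- xml_find_first(doc, ".//nope")
is.na(missing_node)

# xml_length() on a nodeset counts the children of each node.
a_nodes <- xml_find_all(doc, ".//a")
xml_length(a_nodes)

# Non-character attribute values are coerced to character.
first_a <- a_nodes[[1]]
xml_set_attr(first_a, "n", 5)
xml_attr(first_a, "n")
```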

xml2 1.0.0

  • All C++ functions now use checked_get() instead of get() where possible, so NULL XPtrs properly throw an error rather than crashing. (@jimhester, #101, #104).

  • xml_integer() and xml_double() functions to make it easy to extract integer and double text from nodes (@jimhester, #97, #99).

  • xml2 now supports modification and creation of XML nodes. New functions: xml_new_document(), xml_new_child(), xml_new_sibling(), xml_set_namespace(), xml_remove(), xml_replace(), xml_root(), plus replacement methods for xml_name(), xml_attr(), xml_attrs() and xml_text() (@jimhester, #9, #76).

  • xml_ns() now keeps namespace prefixes that point to the same URI (@jimhester, #35, #95).

  • read_xml() and read_html() methods added for httr::response() objects. (@jimhester, #63, #93)

  • xml_child() function to make selecting children a little easier (@jimhester, #23, #94)

  • xml_find_one() has been deprecated in favor of xml_find_first() (@jimhester, #58, #92)

  • xml_read() functions now default to passing the document's namespace object. Namespace definitions can now be removed as well as added, and xml_ns_strip() has been added to remove all default namespaces from a document. (@jimhester, #28, #89)

  • xml_read() gains an options argument to control all available parsing options, including HUGE to turn off limits for parsing very large documents. It now drops blank text nodes by default, mimicking the default behavior of the XML package. (@jimhester, #49, #62, #85, #88)

  • xml_write() expands the path on filenames, so directories can be specified with '~/' (@jimhester, #86, #80)

  • xml_find_one() now returns an 'xml_missing' node object if there are 0 matches (@jimhester, #55, #53, hadley/rvest#82).

  • xml_find_num(), xml_find_chr(), xml_find_lgl() functions added to return numeric, character and logical results from XPath expressions. (@jimhester, #55)

  • xml_name() and xml_text() always correctly encode returned value as UTF-8 (#54).
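The typed extraction helpers introduced in this release can be sketched as follows. The document is illustrative; the XPath expressions use the standard sum(), string() and count() functions.

```r
library(xml2)

doc <- read_xml("<items><item>3</item><item>4.5</item></items>")

# XPath expressions that evaluate to a number, a string, or a boolean.
total <- xml_find_num(doc, "sum(//item)")
first_text <- xml_find_chr(doc, "string(//item[1])")
has_many <- xml_find_lgl(doc, "count(//item) > 1")

# Convert node text directly to numeric types.
values <- xml_double(xml_find_all(doc, "//item"))

# xml_child() selects a child by position; here, the first <item>.
first_value <- xml_integer(xml_child(doc, 1))
```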

xml2 0.1.2

  • Improved configure script - now works again on R-devel on Windows.

  • Compiles with older versions of libxml2.

xml2 0.1.1

  • Make configure script more cross platform.

  • Add xml_length() to count the number of children (#32).


1.2.0 by James Hester, 4 months ago


https://github.com/r-lib/xml2


Report a bug at https://github.com/r-lib/xml2/issues


Browse source code at https://github.com/cran/xml2


Authors: Hadley Wickham [aut], James Hester [aut, cre], Jeroen Ooms [aut], RStudio [cph], R Foundation [ctb] (Copy of R-project homepage cached as example)




Task views: Web Technologies and Services


GPL (>= 2) license


Imports Rcpp

Suggests testthat, curl, covr, knitr, rmarkdown, magrittr, httr

Linking to Rcpp

System requirements: libxml2: libxml2-dev (deb), libxml2-devel (rpm)


Imported by AMR, BANEScarparkinglite, BAwiR, BETS, BIS, CDECRetrieve, EML, EdSurvey, MazamaSpatialUtils, OAIHarvester, OECD, RDML, RNHANES, RNRCS, RNeXML, Rcrawler, RefManageR, ReporteRs, Rgretl, Rilostat, Rnightlights, SanFranBeachWater, SchemaOnRead, W3CMarkupValidator, addinslist, adjustedcranlogs, aire.zmvm, algorithmia, ari, atlantistools, aws.alexa, aws.iam, aws.s3, aws.sns, aws.sqs, ballr, banxicoR, baseballDBR, bikedata, biolink, bold, bomrang, breathtestcore, bulletr, camsRad, ccafs, cdcfluview, cifti, coalitions, congressbr, cptec, crminer, cycleRtools, dataRetrieval, datasus, dataverse, discgolf, docxtractr, dwapi, ecb, ecoseries, edgarWebR, eechidna, epidata, epitab, ess, essurvey, etl, europepmc, fastqcr, finch, finreportr, foghorn, fulltext, gcite, geniusr, gesis, getlandsat, gfer, ggiraph, gifti, glassdoor, goodpractice, googleformr, googlesheets, htmltidy, icesSAG, incadata, insect, ionicons, ipumsr, itunesr, kableExtra, kokudosuuchi, mRchmadness, malariaAtlas, mapsapi, midas, mlbgameday, move, mregions, mschart, natserv, neotoma, neurohcp, nhanesA, nonmemica, oai, officer, openadds, originr, osmdata, pangaear, pdfetch, petro.One, pkgdown, polmineR, postlightmercury, prisonbrief, psychmeta, rNOMADS, randquotes, rbcb, rbhl, rcrossref, rdefra, rdfp, rdnb, rdryad, readODS, readOffice, readtext, refimpact, reqres, rerddap, rfbCNPJ, rgbif, rnoaa, rnrfa, ropercenter, rorcid, roxygen2, rversions, rvg, salesforcer, scholar, sejmRP, seoR, sidrar, slickR, smapr, softermax, soilDB, solrium, sparklyr, speaq, spelling, splashr, sss, swatches, taxize, texPreview, textreadr, tidyRSS, tidycensus, tidyquant, tidyverse, tiobeindexr, tm, tm.plugin.factiva, traits, tropr, unpivotr, vdiffr, vstsr, waccR, waterData, webchem, wikilake, wikipediatrend, wikitaxa, wosr, xesreadR.

Depended on by BMRBr, RSauceLabs, aptg, crypto, rvest, seleniumPipes, spanish, x3ptools, xslt.

Suggested by DBI, OpenML, ameco, assertive.types, censusr, ckanr, codemetar, covr, flextable, googleLanguageR, httptest, httr, icd, knitr, mrgsolve, prodigenr, rccmisc, rdflib, repurrrsive, rio, rplos, rscopus, rtika, sbtools, selectr, sharpshootR, svglite, testthat, tuber, tubern, units, webmockr, xmlparsedata.

Enhanced by svgPanZoom.

