xmltools package to help convert XML data to tidy data frames
I created a new, small package called xmltools that helps simplify the process of converting XML data into tidy data frames.
It has not yet been tested on a ton of XML files so it may have some bugs. I also have not created any tests. But, at least for me, it helps drastically cut down on the code I have to write to get the data I want from an XML file.
Below is what I wrote as part of the README to the package (it’s really more of a vignette).
Motivation for xmltools
There are 3 things I felt were missing from the two wonderful packages XML and xml2:
- An easier, more condensed way to see the structure of XML data. xml2::xml_structure provides a way to look at the structure, but I find that it is not very easy to read and takes up too much console space. xmltools::xml_view_tree is more condensed and attempts to emulate the tree command line program.
- A quick way to determine all feasible xpaths and to identify terminal nodes. Data values of interest are contained in terminal nodes (nodes of “length zero” that do not dig any deeper). Quickly getting xpaths to the parents of these nodes makes extracting data much easier, and faster if you do not recursively dig deeper. xmltools::xml_get_paths can find all paths for a given nodeset or XML document. It has options to help tag terminal nodes (mark_terminal) and the option to return the parent of any terminal nodes (mark_terminal_parent).
- Other alternatives for converting XML data to data frames. XML::xmlToDataFrame exists, but it seems to always dig recursively. This leads to some crappy data frames.
  - I offer two alternatives, xml_to_df and xml_dig_df.
    - xml_to_df uses the XML and data.table packages.
    - xml_dig_df is based on the xml2 and tidyverse packages.
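To make the idea of terminal nodes and their parents concrete, here is a minimal sketch that uses only xml2, not xmltools, on a small made-up document. The inline XML, the variable names, and the XPath expression //*[not(*)] are assumptions chosen for illustration and say nothing about how xmltools works internally. The sketch finds elements with no child elements, lists their xpaths, and builds one data frame row per parent without digging recursively.

```r
library(xml2)

# A tiny document invented for this illustration (not the ebay data)
doc <- read_xml(
  "<root>
     <item><name>a</name><price>1</price></item>
     <item><name>b</name><price>2</price></item>
   </root>"
)

# Terminal nodes: elements that have no child elements
terminal <- xml_find_all(doc, "//*[not(*)]")

# Full xpath of every terminal node
terminal_paths <- xml_path(terminal)

# Unique xpaths of the parents of those terminal nodes
parent_paths <- unique(xml_path(xml_parent(terminal)))

# One row per parent, one column per direct child (no recursion)
rows <- lapply(parent_paths, function(p) {
  kids <- xml_children(xml_find_first(doc, p))
  as.data.frame(
    as.list(setNames(xml_text(kids), xml_name(kids))),
    stringsAsFactors = FALSE
  )
})
df <- do.call(rbind, rows)
```

Here parent_paths holds one xpath per item element, which is the level that maps naturally onto one row of a data frame. Note that rbind assumes every parent has the same set of children, which holds for this toy document but not for XML in general.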
Installation
Run the following.
devtools::install_github('ultinomics/xmltools')
library(xmltools)
Examples
Let’s set up the first example using some eBay data from the UW XML Data Repository. These data come as part of the package because I dropped the really annoying description field that makes the data hard to look at. (Parses it just fine!)
library(xmltools)
library(magrittr) # provides the %>% pipe used below, in case it is not already attached
# USING ebay.xml ------------------------------------------------
# load the data
file <- system.file("extdata", "ebay.xml", package = "xmltools")
doc <- file %>%
  xml2::read_xml()
nodeset <- doc %>%
  xml2::xml_children() # get top level nodeset
View XML trees/structures
Let’s look at the structure of the data using the function xml_view_tree.
# `xml_view_tree` structure
# we can get a tree for each node of the doc
doc %>%
  xml_view_tree()
doc %>% # we can also vary the depth
  xml_view_tree(depth = 2)
# easier to read and understand than `xml2::xml_structure()` and has the `depth` option
nodeset[1] %>% xml2::xml_structure()
#> [[1]]
#>
# or, we can extract from nodesets
class(nodeset[1])
#> [1] "xml_nodeset"
nodeset[1] %>% xml_view_trees()
#> └