xmltools package to help convert XML data to tidy data frames
I created a new, small package called xmltools
that helps simplify the process of converting XML data into tidy data frames.
It has not yet been tested on a ton of XML files so it may have some bugs. I also have not created any tests. But, at least for me, it helps drastically cut down on the code I have to write to get the data I want from an XML file.
Below is what I wrote as part of the README to the package (it’s really more of a vignette.)
Motivation for xmltools
There are 3 things I felt were missing from the two wonderful packages XML
and xml2
:
- A easier, more condensed way to see the structure of xml data.
xml2::xml_structure
provides a way to look at the structure, but I find that it is not very easy to read and takes up too much console space.xmltools::xml_view_tree
is more condense and attempts to emulate thetree
command line program.
- A quick way to determine all feasible xpaths and to identify terminal nodes. Data values of interest are contained in terminal nodes (nodes of “length zero” that do no dig any deeper). Quickly getting xpaths to the parents of these nodes makes extracting data much easier—and faster if you do not recursively dig deeper.
xmltools::xml_get_paths
can find all paths for a given nodeset or xml document. It has options to help tag terminal nodes (mark_terminal
) and the option to return the parent of any terminal nodes (mark_terminal_parent
).
- Other alternatives for converting xml data to data frames.
XML::xmlToDataFrame
exists but it seems to always dig recursively. This leads to some crappy data frames.- I offer two alternatives,
xml_to_df
andxml_dig_df
.xml_to_df
uses theXML
anddata.table
packagesxml_dig_df
is based ofxml2
andtidyverse
packages.
Installation
Run the following.
::install_github('ultinomics/xmltools')
devtoolslibrary(xmltools)
Examples
Let’s set up the first example using some ebay data from the UW XML Data Repository. These data come as part of the package because I dropped the really annoying description
field that makes the data hard to look at. (Parses it just fine!)
library(xmltools)
# USING ebay.xml ------------------------------------------------
# load the data
<- system.file("extdata", "ebay.xml", package = "xmltools")
file <- file %>%
doc ::read_xml()
xml2<- doc %>%
nodeset ::xml_children() # get top level nodeset xml2
View XML trees/structures
Let’s look at the structure of the data. The function
# `xml_view_tree` structure
# we can get a tree for each node of the doc
%>%
doc xml_view_tree()
%>% # we can also vary the depth
doc xml_view_tree(depth = 2)
# easier to read and understand than `xml2::xml_structure()` and has the `depth` option
1] %>% xml2::xml_structure()
nodeset[#> [[1]]
#> <listing>
#> <seller_info>
#> <seller_name>
#> {text}
#> <seller_rating>
#> {text}
#> <payment_types>
#> {text}
#> <shipping_info>
#> {text}
#> <buyer_protection_info>
#> {text}
#> <auction_info>
#> <current_bid>
#> {text}
#> <time_left>
#> {text}
#> <high_bidder>
#> <bidder_name>
#> {text}
#> <bidder_rating>
#> {text}
#> <num_items>
#> {text}
#> <num_bids>
#> {text}
#> <started_at>
#> {text}
#> <bid_increment>
#> {text}
#> <location>
#> {text}
#> <opened>
#> {text}
#> <closed>
#> {text}
#> <id_num>
#> {text}
#> <notes>
#> {text}
#> <bid_history>
#> <highest_bid_amount>
#> {text}
#> <quantity>
#> {text}
#> <item_info>
#> <memory>
#> {text}
#> <hard_drive>
#> {text}
#> <cpu>
#> {text}
#> <brand>
#> {text}
## or, we can extract from nodesets
class(nodeset[1])
#> [1] "xml_nodeset"
1] %>%
nodeset[xml_view_trees()
#> └── listing
#> ├── payment_types
#> ├── shipping_info
#> ├── buyer_protection_info
#> ├── seller_info
#> ├── seller_name
#> └── seller_rating
#> ├── auction_info
#> ├── current_bid
#> ├── time_left
#> ├── num_items
#> ├── num_bids
#> ├── started_at
#> ├── bid_increment
#> ├── location
#> ├── opened
#> ├── closed
#> ├── id_num
#> ├── notes
#> └── high_bidder
#> ├── bidder_name
#> └── bidder_rating
#> ├── bid_history
#> ├── highest_bid_amount
#> └── quantity
#> └── item_info
#> ├── memory
#> ├── hard_drive
#> ├── cpu
#> └── brand
1] %>%
nodeset[xml_view_trees(depth=2)
#> └── listing
#> ├── payment_types
#> ├── shipping_info
#> ├── buyer_protection_info
#> ├── seller_info
#> ├── auction_info
#> ├── bid_history
#> └── item_info
## will not work with class "xml_node" (can't use lapply on those, apparently)
class(nodeset[[1]])
#> [1] "xml_node"
try(nodeset[[1]] %>%
xml_view_tree()
)
Get Terminal Nodes
Terminal nodes in XMLs are nodes that do no have any “children”. These nodes contain the information we generally want to extract into a tidy data frame.
I’ve found myself wanting easy access to all XML paths but could find no tool to do so easily and quickly. I especially wanted the xpaths to terminal nodes for any XML structure. This is accomplished using the xml_get_paths
function.
# one can see all the paths per node of a doc
%>%
doc xml_get_paths()
# can look at one nodeset
## NOTE that nodesets can vary, so looking at one doesn't mean you'll find all feasible paths
1] %>%
nodeset[xml_get_paths()
#> [[1]]
#> [1] "/root/listing"
#> [2] "/root/listing/payment_types"
#> [3] "/root/listing/shipping_info"
#> [4] "/root/listing/buyer_protection_info"
#> [5] "/root/listing/seller_info"
#> [6] "/root/listing/seller_info/seller_name"
#> [7] "/root/listing/seller_info/seller_rating"
#> [8] "/root/listing/auction_info"
#> [9] "/root/listing/auction_info/current_bid"
#> [10] "/root/listing/auction_info/time_left"
#> [11] "/root/listing/auction_info/num_items"
#> [12] "/root/listing/auction_info/num_bids"
#> [13] "/root/listing/auction_info/started_at"
#> [14] "/root/listing/auction_info/bid_increment"
#> [15] "/root/listing/auction_info/location"
#> [16] "/root/listing/auction_info/opened"
#> [17] "/root/listing/auction_info/closed"
#> [18] "/root/listing/auction_info/id_num"
#> [19] "/root/listing/auction_info/notes"
#> [20] "/root/listing/auction_info/high_bidder"
#> [21] "/root/listing/auction_info/high_bidder/bidder_name"
#> [22] "/root/listing/auction_info/high_bidder/bidder_rating"
#> [23] "/root/listing/bid_history"
#> [24] "/root/listing/bid_history/highest_bid_amount"
#> [25] "/root/listing/bid_history/quantity"
#> [26] "/root/listing/item_info"
#> [27] "/root/listing/item_info/memory"
#> [28] "/root/listing/item_info/hard_drive"
#> [29] "/root/listing/item_info/cpu"
#> [30] "/root/listing/item_info/brand"
1] %>%
nodeset[xml_get_paths(mark_terminal = ">>") # can mark terminal nodes
#> [[1]]
#> [1] "/root/listing"
#> [2] ">>/root/listing/payment_types"
#> [3] ">>/root/listing/shipping_info"
#> [4] ">>/root/listing/buyer_protection_info"
#> [5] "/root/listing/seller_info"
#> [6] ">>/root/listing/seller_info/seller_name"
#> [7] ">>/root/listing/seller_info/seller_rating"
#> [8] "/root/listing/auction_info"
#> [9] ">>/root/listing/auction_info/current_bid"
#> [10] ">>/root/listing/auction_info/time_left"
#> [11] ">>/root/listing/auction_info/num_items"
#> [12] ">>/root/listing/auction_info/num_bids"
#> [13] ">>/root/listing/auction_info/started_at"
#> [14] ">>/root/listing/auction_info/bid_increment"
#> [15] ">>/root/listing/auction_info/location"
#> [16] ">>/root/listing/auction_info/opened"
#> [17] ">>/root/listing/auction_info/closed"
#> [18] ">>/root/listing/auction_info/id_num"
#> [19] ">>/root/listing/auction_info/notes"
#> [20] "/root/listing/auction_info/high_bidder"
#> [21] ">>/root/listing/auction_info/high_bidder/bidder_name"
#> [22] ">>/root/listing/auction_info/high_bidder/bidder_rating"
#> [23] "/root/listing/bid_history"
#> [24] ">>/root/listing/bid_history/highest_bid_amount"
#> [25] ">>/root/listing/bid_history/quantity"
#> [26] "/root/listing/item_info"
#> [27] ">>/root/listing/item_info/memory"
#> [28] ">>/root/listing/item_info/hard_drive"
#> [29] ">>/root/listing/item_info/cpu"
#> [30] ">>/root/listing/item_info/brand"
## we can find all feasible paths then collapse
<- doc %>% ## get all xpaths
terminal xml_get_paths()
<- terminal %>% ## collapse xpaths to unique only
xpaths unlist() %>%
unique()
## but what we really want is the parent node of terminal nodes.
## use the `only_terminal_parent = TRUE` to do this
<- doc %>% ## get all xpaths to parents of parent node
terminal_parent xml_get_paths(only_terminal_parent = TRUE)
<- terminal_parent %>% ## collapse xpaths to unique only
terminal_xpaths unlist() %>%
unique()
Extracting XML Data to Tidy Data Frames
Next, we use the terminal xpaths above to extract the data we want.
First, I want to show the issue with using XML::xmlToDataFrame
.
# xmlToDataFrame works great on terminal nodes IF there are no non-terminal nodes any deeper.
## we extract a data frame for each parent of terminal nodes
<- lapply(terminal_xpaths, function(x) {
df0 <- file %>% XML::xmlInternalTreeParse()
doc <- XML::getNodeSet(doc, x)
nodeset ::xmlToDataFrame(nodeset, stringsAsFactors = FALSE) %>%
XML::as_data_frame()
dplyr })
There is data contained in the terminal nodes
/root/listing/payment_types
/root/listing/shipping_info
/root/listing/buyer_protection_info
with the parent node
/root/listing
But XML::xmlToDataFrame
will keep digging into /root/listing
and extract data from xpaths like /root/listing/seller_info
. But it does so extracting data in a non tidy way. We can see this below comparing the data in df0[[1]] %>% dplyr::select(seller_info)
vs df0[[2]]
, which is data extracted from just from /root/listing/seller_info
and deeper.
# problem with xmlToDataFrame is it keeps digging into other nodes recursively in "/root/listing"
1] # /root/listing is terminal parent but xmlToDataFrame keeps digging
xpaths[#> [1] "/root/listing"
1]] %>%
df0[[::select(seller_info) # not good; keeps diving into other nodes but fails to separate
dplyr#> # A tibble: 5 × 1
#> seller_info
#> <chr>
#> 1 cubsfantony 848
#> 2 ct-inc 403
#> 3 ct-inc 403
#> 4 bestbuys4systems 28
#> 5 sales@ctgcom.com 219
2]
xpaths[#> [1] "/root/listing/payment_types"
2]] # works because the recursive dig down hits only the terminal nodes
df0[[#> # A tibble: 5 × 2
#> seller_name seller_rating
#> <chr> <chr>
#> 1 cubsfantony 848
#> 2 ct-inc 403
#> 3 ct-inc 403
#> 4 bestbuys4systems 28
#> 5 sales@ctgcom.com 219
The solution that worked for me was to write a function that, by default, does not dig into non-terminal nodes. That is, when given the xpath /root/listing
, the function will only go into terminal nodes
/root/listing/payment_types
/root/listing/shipping_info
/root/listing/buyer_protection_info
And ignore the other, non-terminal nodes unless instructed to do so otherwise by setting option dig = TRUE
.
The xml_to_df
Function
The xml_to_df
function is built on the XML
package and data.table
. By default, it does not dig into non-terminal nodes (dig = FALSE
) when given any xpath. One can pass an xpath to an XML file (is_xml = FALSE
) or an already parsed XML file (is_xml = TRUE
).
# xml_to_df (XML package based)
## does not dig by default
## use the terminal xpaths to get data frames
terminal_xpaths#> [1] "/root/listing"
#> [2] "/root/listing/seller_info"
#> [3] "/root/listing/auction_info"
#> [4] "/root/listing/auction_info/high_bidder"
#> [5] "/root/listing/bid_history"
#> [6] "/root/listing/item_info"
## we send each terminal xpath to `xml_to_df`.
## the file source is the parsed xml object `doc`, so we set `is_xml = TRUE`
## we do no want to dig, which quickly gets us the data we want for each terminal xpath `dig = FALSE` (default)
<- lapply(terminal_xpaths, xml_to_df, file = doc, is_xml = TRUE, dig = FALSE) %>%
df1 ::bind_cols() dplyr
The xml_dig_df
Function
The other fuction is xml_dig_df
which is built on xml2
and tidyverse
packages.
The important distinction is that we first need to find all the terminal nodesets we want to parse. We find these using xml2::xml_find_all
on each of the terminal_xpaths
on the original parsed xml file (doc
).
For each terminal_nodeset
, we then apply xml_dig_df
. For each nodeset, we will get single row of data, so we bind the results for each nodeset. The final data frame is created by column binding. (I convert all empty strings to NA
for good measure.)
# xml_dig_df (xml2 package based)
<- lapply(terminal_xpaths, xml2::xml_find_all, x = doc)
terminal_nodesets <- terminal_nodesets %>%
df2 ::map(xml_dig_df) %>% ## does not dig by default
purrr::map(dplyr::bind_rows) %>%
purrr::bind_cols() %>%
dplyr::mutate_all(empty_as_na) dplyr
# they're the same!
identical(df1, data.table::as.data.table(df2))
#> [1] TRUE
Example 2
Below is another example to work through. The output is hidden but copy and paste the code or find the source file in this repository examples.R
.
# USING wsu.xml ------------------------------------------------
# larger file
# using xml_to_df
<- "http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/courses/wsu.xml"
file <- file %>%
doc ::read_xml()
xml2<- doc %>%
nodeset ::xml_children()
xml2length(nodeset) # lots of nodes!
1] %>% # lets look at ONE node's tree
nodeset[xml_view_tree()
## takes a long time. likely can extract from a single node
# terminal_paths <- doc %>% ## get the xpath to parents of terminal node
# xml_get_paths(only_terminal_parent = TRUE)
# lets assume that most nodes share the same structure
<- nodeset[1] %>%
terminal_paths xml_get_paths(only_terminal_parent = TRUE)
<- terminal_paths %>% ## collapse xpaths to unique only
terminal_xpaths unlist() %>%
unique()
# xml_to_df (XML package based)
## note that we use file, not doc, hence is_xml = FALSE
<- lapply(xpaths, xml_to_df, file = file, is_xml = FALSE, dig = FALSE) %>%
df1 ::bind_cols()
dplyr
df1
# xml_dig_df (xml2 package based)
## faster!
<- lapply(terminal_xpaths, xml2::xml_find_all, x = doc) # use xml docs, not nodesets! I think this is because it searches the 'root'.
terminal_nodesets <- terminal_nodesets %>%
df2 ::map(xml_dig_df) %>%
purrr::map(dplyr::bind_rows) %>%
purrr::bind_cols() %>%
dplyr::mutate_all(empty_as_na)
dplyr
df2
# they're the same!
identical(df1, data.table::as.data.table(df2))