
I am new to the R XML package and want to parse the XML below into a data.frame. From searching Stack Overflow, it seems that XPath is the best way to do this. The data.frame I want looks like this:

   locationName                      StartTime    MaxT  MinT
    Taipei City      2015-08-06T12:00:00+08:00      34    30
    Taipei City      2015-08-06T18:00:00+08:00      30    25
    Taipei City      2015-08-07T06:00:00+08:00      30    25
New Taipei City      2015-08-06T12:00:00+08:00      33    30
New Taipei City      2015-08-06T18:00:00+08:00      30    25
New Taipei City      2015-08-07T06:00:00+08:00      30    25

However, I don't know how to select the weather elements by their elementName and combine the results into a data.frame.

Below is my sample XML:

<?xml version="1.0" encoding="UTF-8"?>
<cwbopendata xmlns="urn:cwb:gov:tw:cwbcommon:0.1">
    <identifier >6a9fd4e8-cf93-7884-fa2e-4a30f6960e13</identifier>
    <sender >weather@cwb.gov.tw</sender>
    <sent >2015-08-06T11:09:03+08:00</sent>
    <status >Actual</status>
    <msgType >Issue</msgType>
    <source >MFC</source>
    <dataid >C0032-001</dataid>
    <scope >Public</scope>
    <dataset >
        <datasetInfo>
            <datasetDescription>36 hours wealther predicts</datasetDescription>
            <issueTime>2015-08-06T11:00:00+08:00</issueTime>
            <update>2015-08-06T11:09:03+08:00</update>
</datasetInfo>
        <location>
            <locationName>Taipei City</locationName>
            <weatherElement>
                <elementName>Wx</elementName>
                <time>
                    <startTime>2015-08-06T12:00:00+08:00</startTime>
                    <endTime>2015-08-06T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>Cloudy</parameterName>
                        <parameterValue>12</parameterValue>
</parameter>
</time>
                <time>
                    <startTime>2015-08-06T18:00:00+08:00</startTime>
                    <endTime>2015-08-07T06:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>Rain</parameterName>
                        <parameterValue>12</parameterValue>
</parameter>
</time>
                <time>
                    <startTime>2015-08-07T06:00:00+08:00</startTime>
                    <endTime>2015-08-07T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>Rain</parameterName>
                        <parameterValue>26</parameterValue>
</parameter>
</time>
</weatherElement>
            <weatherElement>
                <elementName>MaxT</elementName>
                <time>
                    <startTime>2015-08-06T12:00:00+08:00</startTime>
                    <endTime>2015-08-06T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>34</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-06T18:00:00+08:00</startTime>
                    <endTime>2015-08-07T06:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>30</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-07T06:00:00+08:00</startTime>
                    <endTime>2015-08-07T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>30</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
</weatherElement>
            <weatherElement>
                <elementName>MinT</elementName>
                <time>
                    <startTime>2015-08-06T12:00:00+08:00</startTime>
                    <endTime>2015-08-06T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>30</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-06T18:00:00+08:00</startTime>
                    <endTime>2015-08-07T06:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>25</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-07T06:00:00+08:00</startTime>
                    <endTime>2015-08-07T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>25</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
</weatherElement>
            <weatherElement>
                <elementName>CI</elementName>
                <time>
                    <startTime>2015-08-06T12:00:00+08:00</startTime>
                    <endTime>2015-08-06T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>HOT</parameterName>
</parameter>
</time>
                <time>
                    <startTime>2015-08-06T18:00:00+08:00</startTime>
                    <endTime>2015-08-07T06:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>comforatble</parameterName>
</parameter>
</time>
                <time>
                    <startTime>2015-08-07T06:00:00+08:00</startTime>
                    <endTime>2015-08-07T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>comforatble</parameterName>
</parameter>
</time>
</weatherElement>
            <weatherElement>
                <elementName>PoP</elementName>
                <time>
                    <startTime>2015-08-06T12:00:00+08:00</startTime>
                    <endTime>2015-08-06T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>50</parameterName>
                        <parameterUnit>percentage</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-06T18:00:00+08:00</startTime>
                    <endTime>2015-08-07T06:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>70</parameterName>
                        <parameterUnit>percentage</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-07T06:00:00+08:00</startTime>
                    <endTime>2015-08-07T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>80</parameterName>
                        <parameterUnit>percentage</parameterUnit>
</parameter>
</time>
</weatherElement>
</location>
        <location>
            <locationName>New Taipei City</locationName>
            <weatherElement>
                <elementName>Wx</elementName>
                <time>
                    <startTime>2015-08-06T12:00:00+08:00</startTime>
                    <endTime>2015-08-06T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>rainly</parameterName>
                        <parameterValue>12</parameterValue>
</parameter>
</time>
                <time>
                    <startTime>2015-08-06T18:00:00+08:00</startTime>
                    <endTime>2015-08-07T06:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>rainly</parameterName>
                        <parameterValue>12</parameterValue>
</parameter>
</time>
                <time>
                    <startTime>2015-08-07T06:00:00+08:00</startTime>
                    <endTime>2015-08-07T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>rainly</parameterName>
                        <parameterValue>26</parameterValue>
</parameter>
</time>
</weatherElement>
            <weatherElement>
                <elementName>MaxT</elementName>
                <time>
                    <startTime>2015-08-06T12:00:00+08:00</startTime>
                    <endTime>2015-08-06T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>33</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-06T18:00:00+08:00</startTime>
                    <endTime>2015-08-07T06:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>30</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-07T06:00:00+08:00</startTime>
                    <endTime>2015-08-07T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>30</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
</weatherElement>
            <weatherElement>
                <elementName>MinT</elementName>
                <time>
                    <startTime>2015-08-06T12:00:00+08:00</startTime>
                    <endTime>2015-08-06T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>30</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-06T18:00:00+08:00</startTime>
                    <endTime>2015-08-07T06:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>25</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-07T06:00:00+08:00</startTime>
                    <endTime>2015-08-07T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>25</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
</weatherElement>
            <weatherElement>
                <elementName>CI</elementName>
                <time>
                    <startTime>2015-08-06T12:00:00+08:00</startTime>
                    <endTime>2015-08-06T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>Hot</parameterName>
</parameter>
</time>
                <time>
                    <startTime>2015-08-06T18:00:00+08:00</startTime>
                    <endTime>2015-08-07T06:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>Hot</parameterName>
</parameter>
</time>
                <time>
                    <startTime>2015-08-07T06:00:00+08:00</startTime>
                    <endTime>2015-08-07T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>Hot</parameterName>
</parameter>
</time>
</weatherElement>
            <weatherElement>
                <elementName>PoP</elementName>
                <time>
                    <startTime>2015-08-06T12:00:00+08:00</startTime>
                    <endTime>2015-08-06T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>50</parameterName>
                        <parameterUnit>pertcentage</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-06T18:00:00+08:00</startTime>
                    <endTime>2015-08-07T06:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>60</parameterName>
                        <parameterUnit>pertcentage</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-07T06:00:00+08:00</startTime>
                    <endTime>2015-08-07T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>70</parameterName>
                        <parameterUnit>pertcentage</parameterUnit>
</parameter>
</time>
</weatherElement>
</location>
</dataset>
</cwbopendata>
bgoldst
James Chien

1 Answer

Dude, this was among the most difficult questions I've ever worked on. The problem may look fairly straightforward, but the general complexities of navigating XML data in code, combined with the challenge of extracting a subset of an XML structure and laying it out in a regular tabular format, combined with the specific complexity of your question that you actually need to merge the MaxT and MinT data on StartTime, all made this very difficult. But I'm happy to say, I think I got it.

library('XML');
doc <- xmlInternalTreeParse('sample.xml');
ns <- c(m=xmlNamespaceDefinitions(doc)[[1]]$uri);
df <- do.call(rbind,xpathApply(doc,'//m:location',namespaces=ns,function(locationNode) {
    locationName <- xpathSApply(locationNode,'m:locationName/text()',namespaces=ns,xmlValue);
    cbind(locationName,do.call(merge,xpathApply(locationNode,'m:weatherElement[m:elementName/text()="MaxT" or m:elementName/text()="MinT"]',namespaces=ns,function(elementNode) {
        elementName <- xpathSApply(elementNode,'m:elementName/text()',namespaces=ns,xmlValue);
        startTimes <- xpathSApply(elementNode,'m:time/m:startTime/text()',namespaces=ns,xmlValue);
        values <- xpathSApply(elementNode,'m:time/m:parameter/m:parameterName/text()',namespaces=ns,xmlValue);
        setNames(data.frame(startTimes,values,stringsAsFactors=F),c('StartTime',elementName));
    })));
}));
## fix data types from raw character strings
df$MaxT <- as.integer(df$MaxT);
df$MinT <- as.integer(df$MinT);
tzoSuffixRegex <- '([+-])(\\d{2}):(\\d{2})$';
## Four notes on the StartTime conversion below:
## (1) We have to use lapply() because the tz= parameter of as.POSIXct() (also strptime()) is unfortunately not vectorized.
## (2) Because POSIXct cannot store a time zone offset, but rather requires a time zone, we must "adapt" the full offset suffix to the truncated Etc/GMT pseudo time zone name.
## (3) We must use do.call(c,lapply(...)) rather than sapply(...) because sapply() weirdly simplifies to a named numeric vector, rather than a POSIXct vector.
## (4) We have to reverse the offset sign, because the Etc/GMT pseudo time zone names are bizarrely reversed from the standard notation; see <https://en.wikipedia.org/wiki/Tz_database#Area>
df$StartTime <- do.call(c,lapply(df$StartTime,function(t) as.POSIXct(t,format='%Y-%m-%dT%H:%M:%S',tz=chartr('+-','-+',sub(perl=TRUE,'\\b0+','',sub(perl=TRUE,paste0('.*',tzoSuffixRegex),'Etc/GMT\\1\\2',t))))));
df;
##      locationName           StartTime MaxT MinT
## 1     Taipei City 2015-08-06 00:00:00   34   30
## 2     Taipei City 2015-08-06 06:00:00   30   25
## 3     Taipei City 2015-08-06 18:00:00   30   25
## 4 New Taipei City 2015-08-06 00:00:00   33   30
## 5 New Taipei City 2015-08-06 06:00:00   30   25
## 6 New Taipei City 2015-08-06 18:00:00   30   25

The code is obviously heavily tied to the design of the XML package, so refer to the package's documentation for important information. Here's my own summary of the XML functions I used in my code:

  • xmlInternalTreeParse() I use this to parse your source data, which I had saved as sample.xml in the pwd of my R session. Note that there's also an xmlTreeParse() function. The difference is that the internal function uses "internal" C pointer nodes which is more powerful, since it allows navigating the tree structure backwards; the non-internal function returns the data as a plain old R recursive list structure. I didn't actually have to take advantage of upward traversal for my solution (although I almost did, since I considered a different design initially that would have required it), but it's better to use the more powerful version, generally speaking. See http://www.omegahat.org/RSXML/shortIntro.html.
  • xmlNamespaceDefinitions() Namespace issues in XML are common and annoying. Generally, if nodes in an XML document are labelled with an xmlns attribute or live underneath such an element, they will be considered to exist within that namespace, and you'll have to identify them in your XPath expressions accordingly. For your document, there's one top-level namespace which is urn:cwb:gov:tw:cwbcommon:0.1. I used this XML function to extract this namespace and build a named vector around it. The name I use is m. I pass this vector to the namespaces argument of my subsequent XML function calls, which allows me to use the concise prefix m to prefix all element tag names.
  • xmlValue() When you retrieve a text node from the document, you can get its raw text content with this function. It's important to understand the distinction between a text node and the actual text content of the node; the text node is a data structure representing the node and its relation to the rest of the document; the text content is just a character string of the raw text of the node.
  • xpathApply() Like lapply(), runs a function once for each element of a list. In this case, however, the list consists of all matches of an XPath query against a given XML document or node or node set. There are four important arguments here: (1) the XML document or node, (2) the XPath query as a character string, (3) the namespaces in effect in the XPath query, and (4) the function to run on each matching node.
  • xpathSApply() This is to xpathApply() as sapply() is to lapply(): it tries to simplify the resulting list into a vector or matrix. Generally, you won't simplify node lists, so it only makes sense to use this when you pass a custom lambda and return a primitive value like a character vector. I used this with xmlValue() to get the raw text values of text nodes.
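
To make the namespace prefix and the xpathApply()/xpathSApply() distinction concrete, here's a minimal sketch on a toy document. The urn:example:demo namespace and the root/item element names are made up for illustration and are not part of your data; the XML functions are the same ones described in the list above.

```r
library(XML)

# Toy document with a default namespace (the URI is invented for this sketch).
xml_text <- '<root xmlns="urn:example:demo"><item>a</item><item>b</item></root>'
doc <- xmlParse(xml_text, asText = TRUE)

# Bind the prefix "m" to the document's namespace URI.
ns <- c(m = 'urn:example:demo')

# xpathApply() returns a list with one entry per matching node.
as_list <- xpathApply(doc, '//m:item', xmlValue, namespaces = ns)

# xpathSApply() simplifies the same result to a character vector.
as_vec <- xpathSApply(doc, '//m:item', xmlValue, namespaces = ns)
print(as_vec)
```

The prefix m is arbitrary; what matters is that it maps to the same URI that the document declares.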

The XML traversal begins by applying the XPath expression //m:location, which locates all location elements anywhere in the document (there are only two in your sample).

For each location node, I then retrieve the location name with the relative XPath m:locationName/text(), evaluated with the location node as the context node, utilizing the xpathSApply()+xmlValue() pattern to get the raw text. I then dive into the contained weatherElement nodes with the relative XPath m:weatherElement[m:elementName/text()="MaxT" or m:elementName/text()="MinT"]. Note how I'm using a predicate to filter for specific weather element nodes; inside the predicate, I use further relative XPath subexpressions to test the raw text of each weather element's elementName. Note that XPath 2.0 allows a more concise syntax: m:weatherElement[m:elementName/text()=("MaxT","MinT")], but the R XML package is built on libxml2, which only implements XPath 1.0, so that syntax isn't available here.
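
Here's a small sketch of that predicate filter in isolation. The document is a cut-down stand-in for your data: the loc, we, name and v element names and the urn:example:demo namespace are hypothetical, chosen only to keep the example short.

```r
library(XML)

# Cut-down stand-in for a location node with three weather elements.
xml_text <- paste0(
  '<loc xmlns="urn:example:demo">',
  '<we><name>Wx</name><v>Cloudy</v></we>',
  '<we><name>MaxT</name><v>34</v></we>',
  '<we><name>MinT</name><v>30</v></we>',
  '</loc>')
doc <- xmlParse(xml_text, asText = TRUE)
ns <- c(m = 'urn:example:demo')

# The predicate in [...] keeps only the elements whose name text is MaxT or MinT.
xp <- '//m:we[m:name/text()="MaxT" or m:name/text()="MinT"]/m:name'
kept <- xpathSApply(doc, xp, xmlValue, namespaces = ns)
print(kept)  # the Wx element is filtered out
```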

For each weather element node, I retrieve its name, the start times of all time nodes underneath it, and all required numerical values (which are actually under elements with tag name parameterName, weirdly) using the xpathSApply()+xmlValue() pattern. I then build a data.frame of the result, which contains two columns: the start times as StartTime, and the values as whatever element name the current element had in the XML document.

Thus, the return value of the xpathApply() call that ran on both weather elements will be a list with two components, each component consisting of a data.frame of the required data underneath that weather element. The first data.frame will have columns StartTime and MaxT, and the second will have columns StartTime and MinT. We can then merge those with a simple call to merge(). We could've saved the return value and then run the call manually, e.g. merge(returnValue[[1]],returnValue[[2]]), but I decided to get a little fancy here and just call do.call(merge,...), which has the same effect.
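
The equivalence between the manual merge() call and do.call() can be sketched with two hand-built data.frames that mimic what the inner function returns for one location (the values are copied from the Taipei City sample; the variable names are just for this illustration):

```r
# Two per-element data.frames, shaped like the inner function's return values.
max_t <- data.frame(
  StartTime = c('2015-08-06T12:00:00+08:00', '2015-08-06T18:00:00+08:00', '2015-08-07T06:00:00+08:00'),
  MaxT = c('34', '30', '30'),
  stringsAsFactors = FALSE)
min_t <- data.frame(
  StartTime = c('2015-08-06T12:00:00+08:00', '2015-08-06T18:00:00+08:00', '2015-08-07T06:00:00+08:00'),
  MinT = c('30', '25', '25'),
  stringsAsFactors = FALSE)
pieces <- list(max_t, min_t)

# Manual call: merge() joins on the shared column, StartTime.
merged_manual <- merge(pieces[[1]], pieces[[2]])

# do.call() passes the list components as the x and y arguments of merge().
merged_docall <- do.call(merge, pieces)

print(merged_docall)
```

One caveat: this only works because the list has exactly two components. With three or more data.frames, the third would land in merge()'s by argument, so something like Reduce(merge, pieces) would be needed instead.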

Then, still within the location node context, we have to cbind() the location name as a leading column of the merged data.frame, and we can return that up to top-level.

The last step at top-level is to rbind() the data.frames from all locations that were matched by the initial XPath query, and capture the result in a variable.

And I figured you'd also want to coerce the raw text to appropriate data types, so I added a few lines for that. The MaxT and MinT look like they should be integer (although you could also use as.double() for doubles), and the StartTime should be POSIXct or POSIXlt (personally I always use POSIXct since it's more compact).

I tried to be maximally robust with the StartTime conversion, but IMO, modern software is often just not capable of fully handling the complexities of date/time data. In this case, we have date/time values with time zone offsets in the XML data, but the R POSIXct type can only take an optional time zone specifier. We don't have a time zone specifier for it. My solution was to use the somewhat deprecated and confusingly designed Etc/GMT... Olson time zone names, which let me communicate the time zone offset to the POSIXct coercion function, accurate to the hour. Funnily enough, upon collapsing all the values into a single vector, which necessarily must drop their individual tzone attributes and replace them with a single such attribute, or none at all, the tzone is in fact dropped completely from the resulting vector. Fortunately, the times themselves are still correct (to the hour), since the times are stored internally as seconds since 1970-01-01 00:00:00 UTC, but the original time zone offsets are lost. The time is displayed in the user's current time zone, which for me is EDT (UTC-4), which is why the raw times in my demo output are 12 hours behind the times in the XML data.
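
For comparison, here's a sketch of a different technique from the Etc/GMT one used above: strip the colon from the offset suffix and let the %z conversion specification parse it. This relies on R's strptime() accepting %z on input, which its documentation says it does, and it is vectorized, so no lapply() is needed. I fix tz='UTC' here only so the printed result doesn't depend on the local time zone; the variable names are just for this illustration.

```r
raw <- c('2015-08-06T12:00:00+08:00',
         '2015-08-06T18:00:00+08:00',
         '2015-08-07T06:00:00+08:00')

# %z expects an offset like +0800, so drop the colon from the +08:00 suffix.
compact <- sub('([+-]\\d{2}):(\\d{2})$', '\\1\\2', raw)

# Parse all values in one vectorized call; the offset is applied during parsing.
parsed <- as.POSIXct(compact, format = '%Y-%m-%dT%H:%M:%S%z', tz = 'UTC')

# Show the instants in UTC, i.e. 8 hours behind the +08:00 wall-clock times.
utc_text <- format(parsed, '%Y-%m-%d %H:%M:%S', tz = 'UTC')
print(utc_text)
# "2015-08-06 04:00:00" "2015-08-06 10:00:00" "2015-08-06 22:00:00"
```

As with the Etc/GMT approach, the original offsets are not stored in the result; only the instants are.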


Just to add one thing, you may notice there is an xmlToDataFrame() function in the XML package which might appear to be perfect for this task. However, it's quite a limited function, and depends on a very regular structure in order to produce a sensible result. In general, if you want to extract a subset of XML data that is somewhat scattered around the document, you're going to have to navigate the document yourself in code.

Here's a demonstration of how we can't use a simple call or applied set of calls to get the data we need:

Attempt #1: from location nodes

xpathApply(doc,'//m:location',namespaces=ns,xmlToDataFrame);
## Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c("Wx",  :
##   duplicate subscripts for columns

The data structure underneath location nodes is too irregular, and xmlToDataFrame() refuses to try to jam it into a data.frame.

Attempt #2: from weather nodes

xpathApply(doc,'//m:location/m:weatherElement',namespaces=ns,xmlToDataFrame);
## [[1]]
##   text                 startTime                   endTime parameter
## 1   Wx                      <NA>                      <NA>      <NA>
## 2 <NA> 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00  Cloudy12
## 3 <NA> 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00    Rain12
## 4 <NA> 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00    Rain26
##
## [[2]]
##   text                 startTime                   endTime parameter
## 1 MaxT                      <NA>                      <NA>      <NA>
## 2 <NA> 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00       34C
## 3 <NA> 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00       30C
## 4 <NA> 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00       30C
##
## [[3]]
##   text                 startTime                   endTime parameter
## 1 MinT                      <NA>                      <NA>      <NA>
## 2 <NA> 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00       30C
## 3 <NA> 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00       25C
## 4 <NA> 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00       25C
##
## [[4]]
##   text                 startTime                   endTime   parameter
## 1   CI                      <NA>                      <NA>        <NA>
## 2 <NA> 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00         HOT
## 3 <NA> 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00 comforatble
## 4 <NA> 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00 comforatble
##
## [[5]]
##   text                 startTime                   endTime    parameter
## 1  PoP                      <NA>                      <NA>         <NA>
## 2 <NA> 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00 50percentage
## 3 <NA> 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00 70percentage
## 4 <NA> 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00 80percentage
##
## [[6]]
##   text                 startTime                   endTime parameter
## 1   Wx                      <NA>                      <NA>      <NA>
## 2 <NA> 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00  rainly12
## 3 <NA> 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00  rainly12
## 4 <NA> 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00  rainly26
##
## [[7]]
##   text                 startTime                   endTime parameter
## 1 MaxT                      <NA>                      <NA>      <NA>
## 2 <NA> 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00       33C
## 3 <NA> 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00       30C
## 4 <NA> 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00       30C
##
## [[8]]
##   text                 startTime                   endTime parameter
## 1 MinT                      <NA>                      <NA>      <NA>
## 2 <NA> 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00       30C
## 3 <NA> 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00       25C
## 4 <NA> 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00       25C
##
## [[9]]
##   text                 startTime                   endTime parameter
## 1   CI                      <NA>                      <NA>      <NA>
## 2 <NA> 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00       Hot
## 3 <NA> 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00       Hot
## 4 <NA> 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00       Hot
##
## [[10]]
##   text                 startTime                   endTime     parameter
## 1  PoP                      <NA>                      <NA>          <NA>
## 2 <NA> 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00 50pertcentage
## 3 <NA> 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00 60pertcentage
## 4 <NA> 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00 70pertcentage
##

The above data is missing the location names, and so even if we collected all location names separately, we wouldn't know which data.frames came from which location nodes--unless we wanted to start making some assumptions about which weather element names occur under all location nodes, which is theoretically doable, but obviously this is starting to get a little unreasonable. And clearly there's still a lot of work to be done to filter and reshape to get the required data in the required shape.

It should also be noted that the parameterName and parameterUnit text content have been concatenated into a single parameter column (parameter being the name of their parent element). The result actually looks kind of reasonable in this case, at least for the temperature parameters, because together they comprise numerical values with units (e.g. 30C), which is very common notation, but generally this behavior is a little bit questionable, and if you really want the numerical values without the units, you'd have to do some text processing to "undo" the concatenation, which, again, is starting to deviate from the realm of reasonable code.
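
To give an idea of what that "undo" text processing might look like, here's a rough sketch that splits a few of the concatenated values from the output above back into a number and a unit. It assumes the value is a run of leading digits, which holds for the MaxT, MinT and PoP rows but not for the Wx rows (e.g. Cloudy12), so it's not a general solution.

```r
# A few concatenated values as they appear in the parameter column.
parameter <- c('34C', '30C', '50percentage')

# Keep only the leading digits as the numeric value.
value <- as.integer(sub('^([0-9]+).*$', '\\1', parameter))

# Strip the leading digits to recover the unit.
unit <- sub('^[0-9]+', '', parameter)

print(data.frame(value, unit))
```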

Attempt #3: from time nodes

xpathApply(doc,'//m:location/m:weatherElement/m:time',namespaces=ns,xmlToDataFrame);
## [[1]]
##                        text parameterName parameterValue
## 1 2015-08-06T12:00:00+08:00          <NA>           <NA>
## 2 2015-08-06T18:00:00+08:00          <NA>           <NA>
## 3                      <NA>        Cloudy             12
##
## [[2]]
##                        text parameterName parameterValue
## 1 2015-08-06T18:00:00+08:00          <NA>           <NA>
## 2 2015-08-07T06:00:00+08:00          <NA>           <NA>
## 3                      <NA>          Rain             12
##
## [[3]]
##                        text parameterName parameterValue
## 1 2015-08-07T06:00:00+08:00          <NA>           <NA>
## 2 2015-08-07T18:00:00+08:00          <NA>           <NA>
## 3                      <NA>          Rain             26
##
## [[4]]
##                        text parameterName parameterUnit
## 1 2015-08-06T12:00:00+08:00          <NA>          <NA>
## 2 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            34             C
##
## [[5]]
##                        text parameterName parameterUnit
## 1 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 3                      <NA>            30             C
##
## [[6]]
##                        text parameterName parameterUnit
## 1 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            30             C
##
## [[7]]
##                        text parameterName parameterUnit
## 1 2015-08-06T12:00:00+08:00          <NA>          <NA>
## 2 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            30             C
##
## [[8]]
##                        text parameterName parameterUnit
## 1 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 3                      <NA>            25             C
##
## [[9]]
##                        text parameterName parameterUnit
## 1 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            25             C
##
## [[10]]
##                        text parameterName
## 1 2015-08-06T12:00:00+08:00          <NA>
## 2 2015-08-06T18:00:00+08:00          <NA>
## 3                      <NA>           HOT
##
## [[11]]
##                        text parameterName
## 1 2015-08-06T18:00:00+08:00          <NA>
## 2 2015-08-07T06:00:00+08:00          <NA>
## 3                      <NA>   comforatble
##
## [[12]]
##                        text parameterName
## 1 2015-08-07T06:00:00+08:00          <NA>
## 2 2015-08-07T18:00:00+08:00          <NA>
## 3                      <NA>   comforatble
##
## [[13]]
##                        text parameterName parameterUnit
## 1 2015-08-06T12:00:00+08:00          <NA>          <NA>
## 2 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            50    percentage
##
## [[14]]
##                        text parameterName parameterUnit
## 1 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 3                      <NA>            70    percentage
##
## [[15]]
##                        text parameterName parameterUnit
## 1 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            80    percentage
##
## [[16]]
##                        text parameterName parameterValue
## 1 2015-08-06T12:00:00+08:00          <NA>           <NA>
## 2 2015-08-06T18:00:00+08:00          <NA>           <NA>
## 3                      <NA>        rainly             12
##
## [[17]]
##                        text parameterName parameterValue
## 1 2015-08-06T18:00:00+08:00          <NA>           <NA>
## 2 2015-08-07T06:00:00+08:00          <NA>           <NA>
## 3                      <NA>        rainly             12
##
## [[18]]
##                        text parameterName parameterValue
## 1 2015-08-07T06:00:00+08:00          <NA>           <NA>
## 2 2015-08-07T18:00:00+08:00          <NA>           <NA>
## 3                      <NA>        rainly             26
##
## [[19]]
##                        text parameterName parameterUnit
## 1 2015-08-06T12:00:00+08:00          <NA>          <NA>
## 2 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            33             C
##
## [[20]]
##                        text parameterName parameterUnit
## 1 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 3                      <NA>            30             C
##
## [[21]]
##                        text parameterName parameterUnit
## 1 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            30             C
##
## [[22]]
##                        text parameterName parameterUnit
## 1 2015-08-06T12:00:00+08:00          <NA>          <NA>
## 2 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            30             C
##
## [[23]]
##                        text parameterName parameterUnit
## 1 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 3                      <NA>            25             C
##
## [[24]]
##                        text parameterName parameterUnit
## 1 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            25             C
##
## [[25]]
##                        text parameterName
## 1 2015-08-06T12:00:00+08:00          <NA>
## 2 2015-08-06T18:00:00+08:00          <NA>
## 3                      <NA>           Hot
##
## [[26]]
##                        text parameterName
## 1 2015-08-06T18:00:00+08:00          <NA>
## 2 2015-08-07T06:00:00+08:00          <NA>
## 3                      <NA>           Hot
##
## [[27]]
##                        text parameterName
## 1 2015-08-07T06:00:00+08:00          <NA>
## 2 2015-08-07T18:00:00+08:00          <NA>
## 3                      <NA>           Hot
##
## [[28]]
##                        text parameterName parameterUnit
## 1 2015-08-06T12:00:00+08:00          <NA>          <NA>
## 2 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            50   pertcentage
##
## [[29]]
##                        text parameterName parameterUnit
## 1 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 3                      <NA>            60   pertcentage
##
## [[30]]
##                        text parameterName parameterUnit
## 1 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            70   pertcentage
##

Now we are missing not only the location names and their mappings, but also the weather element names and their mappings to the data.frames above.

Attempt #4: single call passing location nodes

xmlToDataFrame(doc,nodes=xpathApply(doc,'//m:location',namespaces=ns));
## Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c("Taipei City",  :
##   duplicate subscripts for columns

Same problem.

Attempt #5: single call passing weather nodes

xmlToDataFrame(doc,nodes=xpathApply(doc,'//m:location/m:weatherElement',namespaces=ns));
## Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c("Wx",  :
##   duplicate subscripts for columns

Data underneath individual weather element nodes appears to be regular enough for xmlToDataFrame() since it worked in Attempt #2, but combining them all into one data.frame does not work.
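The "duplicate subscripts for columns" error in Attempts #4 and #5 looks consistent with repeated child names: the error message shows the columns being indexed by names(nodes[[i]]), and a weather element node has several time children (just as a location node has several weatherElement children). Here is a minimal sketch of that check, assuming a tiny stand-in document with made-up values rather than the real file (no namespace, placeholder times):

```r
library(XML)

# Hypothetical stand-in for one <weatherElement>; the real file also has a default namespace.
txt <- paste0(
  '<weatherElement><elementName>MaxT</elementName>',
  '<time><startTime>a</startTime></time>',
  '<time><startTime>b</startTime></time>',
  '</weatherElement>'
)
node <- xmlRoot(xmlParse(txt, asText = TRUE))

# Names of the direct children; these are what would become column names.
child_names <- unname(sapply(xmlChildren(node), xmlName))
print(child_names)
## [1] "elementName" "time"        "time"

# A repeated name cannot be used as a unique column subscript.
print(anyDuplicated(child_names) > 0)
## [1] TRUE
```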

Attempt #6: single call passing time nodes

xmlToDataFrame(doc,nodes=xpathApply(doc,'//m:location/m:weatherElement/m:time',namespaces=ns));
##                    startTime                   endTime     parameter
## 1  2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00      Cloudy12
## 2  2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00        Rain12
## 3  2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00        Rain26
## 4  2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00           34C
## 5  2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00           30C
## 6  2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00           30C
## 7  2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00           30C
## 8  2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00           25C
## 9  2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00           25C
## 10 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00           HOT
## 11 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00   comforatble
## 12 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00   comforatble
## 13 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00  50percentage
## 14 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00  70percentage
## 15 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00  80percentage
## 16 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00      rainly12
## 17 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00      rainly12
## 18 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00      rainly26
## 19 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00           33C
## 20 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00           30C
## 21 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00           30C
## 22 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00           30C
## 23 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00           25C
## 24 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00           25C
## 25 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00           Hot
## 26 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00           Hot
## 27 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00           Hot
## 28 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00 50pertcentage
## 29 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00 60pertcentage
## 30 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00 70pertcentage

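As you can see, the same problems I described earlier show up here: the children of each parameter node are collapsed into a single string (for example Cloudy12 or 34C), and the rows carry neither the location name nor the weather element name.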
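To see where values such as Cloudy12 come from, it helps to look at what xmlValue() returns for a node that has element children: it concatenates the text of all descendants. The following is a small sketch on a stand-in time node (an assumption for illustration; namespace omitted), not on the full document:

```r
library(XML)

# Hypothetical stand-in for one <time> node from the question's XML.
txt <- paste0(
  '<time><startTime>2015-08-06T12:00:00+08:00</startTime>',
  '<parameter><parameterName>Cloudy</parameterName>',
  '<parameterValue>12</parameterValue></parameter></time>'
)
time_node <- xmlRoot(xmlParse(txt, asText = TRUE))
param <- time_node[["parameter"]]

# Text of the whole <parameter> subtree, run together.
whole <- xmlValue(param)
print(whole)
## [1] "Cloudy12"

# Asking for each child separately keeps the pieces apart.
p_name  <- xmlValue(param[["parameterName"]])
p_value <- xmlValue(param[["parameterValue"]])
print(c(p_name, p_value))
## [1] "Cloudy" "12"
```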

Thus, this is not a viable approach. As I said, manual traversal of the XML tree is what's required here.

Lastly, one could argue that we could combine manual traversal with calls to xmlToDataFrame(), and that doesn't sound so unreasonable to me. However, it wouldn't get us very far; we'd still have to handle a lot of the navigation ourselves, and we'd still have a lot of work to do to reshape the results into the required output. IMO, attempting to leverage xmlToDataFrame() inside an already-complex manual traversal scheme doesn't provide enough bang for the buck. We might as well just extract everything we need manually, and combine it into a data.frame using our own invocation of the constructor function data.frame(), as I do in my solution. That provides maximum control and (relative) simplicity.
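For illustration only, here is a rough sketch of what manual extraction plus a direct data.frame() call can look like. It is not a copy of my solution; it runs on a cut-down stand-in document, the helper element_values() is a hypothetical name introduced here, and it assumes that the MaxT and MinT time entries of a location appear in the same order.

```r
library(XML)

# Cut-down stand-in for the question's XML (one location, two time slots).
mk_time <- function(start, value) paste0(
  '<time><startTime>', start, '</startTime><parameter><parameterName>', value,
  '</parameterName><parameterUnit>C</parameterUnit></parameter></time>')
t1 <- '2015-08-06T12:00:00+08:00'
t2 <- '2015-08-06T18:00:00+08:00'
txt <- paste0(
  '<cwbopendata xmlns="urn:cwb:gov:tw:cwbcommon:0.1"><dataset><location>',
  '<locationName>Taipei City</locationName>',
  '<weatherElement><elementName>MaxT</elementName>',
  mk_time(t1, 34), mk_time(t2, 30), '</weatherElement>',
  '<weatherElement><elementName>MinT</elementName>',
  mk_time(t1, 30), mk_time(t2, 25), '</weatherElement>',
  '</location></dataset></cwbopendata>')
doc <- xmlParse(txt, asText = TRUE)
ns <- c(m = 'urn:cwb:gov:tw:cwbcommon:0.1')

# Hypothetical helper: text of a leaf under each <time> of one weather element.
element_values <- function(loc, element, leaf) {
  path <- sprintf("./m:weatherElement[m:elementName='%s']/m:time/%s", element, leaf)
  xpathSApply(loc, path, xmlValue, namespaces = ns)
}

# One small data.frame per location, then stack them.
rows <- lapply(getNodeSet(doc, '//m:location', namespaces = ns), function(loc) {
  data.frame(
    locationName = xpathSApply(loc, './m:locationName', xmlValue, namespaces = ns),
    startTime = element_values(loc, 'MaxT', 'm:startTime'),
    # Assumes MinT times line up with MaxT times in document order.
    MaxT = as.integer(element_values(loc, 'MaxT', 'm:parameter/m:parameterName')),
    MinT = as.integer(element_values(loc, 'MinT', 'm:parameter/m:parameterName')),
    stringsAsFactors = FALSE
  )
})
res <- do.call(rbind, rows)
print(res)
```

The single locationName value is recycled across the rows of its location, which is what supplies the location mapping that the xmlToDataFrame() attempts lose.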

bgoldst