I am using the xml2 package to read a huge XML file into memory, and the command fails with the following error:
Error: Char 0x0 out of allowed range [9]
My code looks like the following:
library(xml2)
doc <- read_xml('~/Downloads/FBrf.xml')
The data can be downloaded from ftp://ftp.flybase.net/releases/FB2015_05/reporting-xml/FBrf.xml.gz (about 140 MB); unpacked, the file is about 1.8 GB.
Does anyone have advice on how to figure out which characters are problematic, or how to clean the file before reading it?
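One thing I could try, as a minimal sketch only: scan the file in chunks for NUL bytes and, if any turn up, write a copy without them. This assumes the error really is caused by literal 0x00 bytes in the file, which I have not confirmed. The helper names find_nul_offsets and strip_nul are hypothetical (they are not part of xml2), and the demo runs on a tiny temporary file rather than on FBrf.xml.

```r
# Hypothetical helper (not part of xml2): return the 1-based byte positions of
# NUL (0x00) bytes in a file, reading it chunk by chunk to keep memory use low.
find_nul_offsets <- function(path, chunk_size = 1000000L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  offsets <- numeric(0)  # doubles, so positions beyond 2^31 do not overflow
  consumed <- 0          # number of bytes read so far
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    hits <- which(chunk == as.raw(0))
    offsets <- c(offsets, consumed + hits)
    consumed <- consumed + length(chunk)
  }
  offsets
}

# Hypothetical helper: copy src to dest, dropping every NUL byte on the way.
strip_nul <- function(src, dest, chunk_size = 1000000L) {
  input <- file(src, open = "rb")
  output <- file(dest, open = "wb")
  on.exit({ close(input); close(output) })
  repeat {
    chunk <- readBin(input, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    writeBin(chunk[chunk != as.raw(0)], output)
  }
  invisible(dest)
}

# Demo on a small temporary file with one NUL byte in the middle.
tmp <- tempfile(fileext = ".xml")
writeBin(c(charToRaw("<a>x"), as.raw(0), charToRaw("y</a>")), tmp)
nul_positions <- find_nul_offsets(tmp)

cleaned <- tempfile(fileext = ".xml")
strip_nul(tmp, cleaned)
remaining <- find_nul_offsets(cleaned)
```

If nul_positions comes back empty for the real file, the cause is presumably something other than literal NUL bytes, and stripping them would not help.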
EDIT
OK, since the file is pretty big, I searched for other solutions on Stack Overflow and tried to implement a solution by Martin Morgan, which he presented in "Combine values in huge XML-files".
This is the code I have so far:
library(XML)
branchFunction <- function(progress=10) {
    res <- new.env(parent=emptyenv())   # for results
    it <- 0L                            # iterator -- nodes visited
    list(publication=function(elt) {
        ## handle 'publication' nodes
        if (getNodeSet(elt, "not(/publication/feature/id)"))
            ## early exit -- no feature id
            return(NULL)
        it <<- it + 1L
        if (it %% progress == 0L)
            message(it)
        publication <- getNodeSet(elt, "string(/publication/id/text())") # 'key'
        res[[publication]] <-
            list(miniref=getNodeSet(elt,
                   "normalize-space(/publication/miniref/text())"),
                 features= xpathSApply(elt, "//feature/id/text()", xmlValue))
    }, getres = function() {
        ## retrieve the 'res' environment when done
        res
    }, get=function() {
        ## retrieve 'res' environment as data.frame
        publication <- ls(res)
        miniref <- unlist(eapply(res, "[[", "miniref"), use.names=FALSE)
        feature <- eapply(res, "[[", "features")
        len <- sapply(feature, length)
        data.frame(publication=rep(publication, len),
                   feature=unlist(feature, use.names=FALSE),
                   miniref=rep(miniref, len))
    })
}
branches <- branchFunction()
xmlEventParse("~/Downloads/jnk.xml", handlers=NULL, branches=branches)
# xmlEventParse("~/Downloads/FBrf.xml", handlers=NULL, branches=branches)
branches$get()
I uploaded a small sample XML file to my server: http://download.dejung.net/jnk.xml
The file is only a few kB, but the problem is the result. The second publication entry has the id FBrf0162243 and the miniref Schwartz et al., 2003, Mol. Cell. Biol. 23(19): 6876--6886.
The output of the code above pairs the wrong publication id with that miniref, although the feature ids are correct:
FBrf0050934 FBgn0003277 Schwartz et al., 2003, Mol. Cell. Biol. 23(19): 6876--6886
I am not sure why my code reports the wrong values. Maybe someone can help me with the closures, since they are very new to me.
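One thing that may be worth checking (a guess on my part, not confirmed on the real data): in get(), the publication ids come from ls(res), which returns names sorted alphabetically by default, while the minirefs and features come from eapply(), whose result order is documented as arbitrary for hashed environments. If the two orders differ, ids and minirefs would be paired incorrectly. A minimal sketch with made-up keys and values (pubA, pubB, pubC are placeholders, not real FlyBase ids) to compare the two orders and to look values up by name instead of by position:

```r
# Toy environment with placeholder keys; new.env() is hashed by default.
res <- new.env(parent = emptyenv())
res[["pubC"]] <- list(miniref = "miniref C", features = c("f5"))
res[["pubA"]] <- list(miniref = "miniref A", features = c("f1", "f2"))
res[["pubB"]] <- list(miniref = "miniref B", features = c("f3", "f4"))

keys_ls <- ls(res)                        # sorted alphabetically by default
minirefs <- eapply(res, "[[", "miniref")  # named list, order not guaranteed
keys_eapply <- names(minirefs)

# TRUE only if both calls happen to return the keys in the same order.
same_order <- identical(keys_ls, keys_eapply)

# Order-safe alternative: index the eapply() result by name, so each miniref
# is matched to its own key regardless of the order eapply() returned.
aligned <- unlist(minirefs[keys_ls], use.names = FALSE)
```

If same_order is FALSE on the real data, taking the ids from names(feature) rather than from ls(res) would be one way to keep the three columns aligned.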