I am using the xml2 package to read a huge XML file into memory and the command fails with the following error:

Error: Char 0x0 out of allowed range [9]

My code looks like the following:

doc <- read_xml('~/Downloads/FBrf.xml')

The data can be downloaded at ftp://ftp.flybase.net/releases/FB2015_05/reporting-xml/FBrf.xml.gz (about 140 MB compressed, about 1.8 GB unpacked).

Does anyone have advice on how to figure out which characters are problematic, or how to clean the file before reading it?


OK, since the file is pretty big, I searched for other solutions on Stack Overflow and tried to implement a solution from Martin Morgan, which he presented here: Combine values in huge XML-files

What I have done so far is the following:

library(XML)

branchFunction <- function(progress=10) {
    res <- new.env(parent=emptyenv())   # for results
    it <- 0L                            # iterator -- nodes visited
    list(publication=function(elt) {
        ## handle 'publication' nodes 
        if (getNodeSet(elt, "not(/publication/feature/id)"))
            ## early exit -- no feature id
            return(NULL)
        it <<- it + 1L
        if (it %% progress == 0L)
            message(it)                 # report progress every 'progress' nodes
        publication <- getNodeSet(elt, "string(/publication/id/text())") # 'key'
        ## 'miniref' path assumed from the element name used in get() below
        res[[publication]] <-
            list(miniref=getNodeSet(elt, "string(/publication/miniref/text())"),
                 features= xpathSApply(elt, "//feature/id/text()", xmlValue))
    }, getres = function() {
        ## retrieve the 'res' environment when done
        res
    }, get=function() {
        ## retrieve 'res' environment as data.frame
        publication <- ls(res)
        miniref <- unlist(eapply(res, "[[", "miniref"), use.names=FALSE)
        feature <- eapply(res, "[[", "features")
        len <- sapply(feature, length)
        data.frame(publication=rep(publication, len),
                   feature=unlist(feature, use.names=FALSE), 
                   miniref=rep(miniref, len))
    })
}
branches <- branchFunction()
xmlEventParse("~/Downloads/jnk.xml", handlers=NULL, branches=branches)
# xmlEventParse("~/Downloads/FBrf.xml", handlers=NULL, branches=branches)

I uploaded a small sample XML file to my server: http://download.dejung.net/jnk.xml

The file is only a few kB; the problem is the result. The second publication entry has the id FBrf0162243 and a miniref of Schwartz et al., 2003, Mol. Cell. Biol. 23(19): 6876--6886.

The result from the code above pairs the wrong publication id with that miniref. The feature ids are correct:

FBrf0050934 FBgn0003277 Schwartz et al., 2003, Mol. Cell. Biol. 23(19): 6876--6886

Not sure why my code is reporting the wrong values, maybe someone can help me with the closures since this is very new to me.

  • maybe try `read_html` as this tries to guess the encoding by itself. – Rentrop Feb 02 '16 at 14:19
  • @hrbrmstr can I just delete these characters then? And how? @Floo0 I tried `read_html` yesterday but it took more than 20 minutes and I had to quit the process. Will try it again today and see if it is running through. The `read_xml` command quits after a few minutes. – drmariod Feb 03 '16 at 07:00
  • I tried to use closures now but can not get everything correct... – drmariod Feb 03 '16 at 13:31

2 Answers


I occasionally encounter "embedded NULL" error messages that may be similar to this (if the 0x0 in this message means the same NULL issue). My approach is to try to delete them before reading in the file, as I have not found an R package that ignores them.

If you are on Unix or OS X, you could invoke sed in your R program via:

system( 'sed "s/\\0//g" ~/Downloads/dirty.xml > ~/Downloads/clean.xml' )

If this doesn't do the trick, you might want to expand this "blacklist" of characters -- see for example Unicode Regex; Invalid XML characters

If something is still wrong, I sometimes build a character whitelist instead and delete everything that is not in the specified character set:

sed 's/[^A-Za-z0-9 _.,"]//g' ~/Downloads/dirty.csv > ~/Downloads/clean.csv

This is the one I use for .csv data files (where I don't care about characters such as <, / or >), so for XML you'd want to expand it, maybe to something like [^[:ascii:]].

If you are on Windows, you likely have to go outside of R for this approach -- for example you can use Cygwin instead of the system() invocation above.
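If you would rather stay inside R (for example on Windows), the same "delete the NULs first" idea can be sketched with base R only. This is a rough sketch, not something I have run on the 1.8 GB file: `strip_nul_bytes` and `chunk_size` are names made up for illustration, and it assumes the only offending bytes are 0x00. It reads the file in binary chunks so the whole file never has to fit in memory.

```r
# Hypothetical helper: copy `infile` to `outfile`, dropping every 0x00 byte.
# Works chunk by chunk, so memory use is bounded by `chunk_size` bytes.
strip_nul_bytes <- function(infile, outfile, chunk_size = 1e6) {
  src <- file(infile, open = "rb")
  on.exit(close(src), add = TRUE)
  dst <- file(outfile, open = "wb")
  on.exit(close(dst), add = TRUE)
  removed <- 0
  repeat {
    chunk <- readBin(src, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break        # end of file reached
    is_nul <- chunk == as.raw(0)          # TRUE for every NUL byte
    removed <- removed + sum(is_nul)
    writeBin(chunk[!is_nul], dst)         # write everything except NULs
  }
  removed                                 # number of bytes dropped
}

# Small demonstration on a temporary file with one embedded NUL.
dirty <- tempfile(fileext = ".xml")
clean <- tempfile(fileext = ".xml")
writeBin(c(charToRaw("<a>x"), as.raw(0), charToRaw("y</a>")), dirty)
n_removed <- strip_nul_bytes(dirty, clean)
cleaned_text <- rawToChar(readBin(clean, what = "raw", n = file.size(clean)))
```

The return value tells you how many NUL bytes were found, which also gives a quick answer to whether 0x0 characters are really what the parser is complaining about.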


At the command line, I ran the command iconv -f utf-8 -t utf-8 FBrf.xml > outfile.xml on your file. It made a difference visible to the eye, but I don't have R installed to test it.

(If you are on Windows, you would need to install Cygwin to get access to iconv.)

  • iconv is another option, although I have not had as much success with it. It's worth noting that even in Windows R there is a function for it. (See `?iconv`) So a combination of `readLines` and `iconv` could be an option. – C8H10N4O2 Feb 05 '16 at 16:20
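
A rough sketch of the `readLines` plus `iconv` combination mentioned in the comment above, using base R only. The function name `clean_utf8_lines` and the `lines_per_chunk` parameter are invented for illustration, and the sketch assumes the file is meant to be UTF-8. It relies on two documented arguments: `skipNul = TRUE` in `readLines`, which skips embedded NULs, and `sub = ""` in `iconv`, which drops bytes that cannot be converted. It has not been tried on the FlyBase file.

```r
# Hypothetical helper: rewrite `infile` as `outfile`, skipping embedded NULs
# and dropping byte sequences that are not valid UTF-8.
clean_utf8_lines <- function(infile, outfile, lines_per_chunk = 10000L) {
  src <- file(infile, open = "r")
  on.exit(close(src), add = TRUE)
  dst <- file(outfile, open = "w")
  on.exit(close(dst), add = TRUE)
  repeat {
    # Read a block of lines; skipNul = TRUE skips 0x00 bytes while reading
    lines <- readLines(src, n = lines_per_chunk, warn = FALSE, skipNul = TRUE)
    if (length(lines) == 0L) break        # end of file reached
    # sub = "" removes any bytes that are not valid in the source encoding
    fixed <- iconv(lines, from = "UTF-8", to = "UTF-8", sub = "")
    writeLines(fixed, dst, useBytes = TRUE)
  }
  invisible(outfile)
}

# Small demonstration: one invalid byte (0xff) and one NUL inside a line.
dirty <- tempfile(fileext = ".xml")
clean <- tempfile(fileext = ".xml")
writeBin(c(charToRaw("ab"), as.raw(0xff), charToRaw("c"),
           as.raw(0), charToRaw("d\n")), dirty)
clean_utf8_lines(dirty, clean)
result <- readLines(clean)
```

Processing a fixed number of lines at a time keeps memory use modest, although a file with extremely long lines would still need each line to fit in memory.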