
First, I would like to apologize for asking a new question: my reputation does not yet allow me to comment on other people's answers, in particular on two SO posts I have seen. So please bear with this older guy :-)

I am trying to read a list of about 100 text files, ranging in size from roughly 90 KB to 2 MB, and then use the qdap package to compute some statistics on the text I extract from them, namely counting sentences, words, etc. The files contain web page source previously scraped using RSelenium::remoteDriver$getPageSource() and saved to file using write(pgSource, "fileName.txt"). I am reading the files in a loop using:

pgSource <- readChar(file.path(fPath, fileNames[i]), nchars = 1e6)
doc <- read_html(pgSource)
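
For reference, here is a minimal, self-contained sketch of such a loop (the path and file names are stand-ins created by the example itself, not my real ones). Taking nchars from file.info() rather than a fixed 1e6 also avoids silently truncating the larger 2 MB sources:

```r
# Illustrative sketch only; fPath and fileNames are stand-ins created
# here so the example runs on its own.
library(xml2)

fPath <- tempdir()
fileNames <- c("page1.txt")
write("<html><body><p>sample</p></body></html>",
      file.path(fPath, fileNames[1]))

for (i in seq_along(fileNames)) {
  f <- file.path(fPath, fileNames[i])
  # read the whole file: nchars from the actual size, not a fixed 1e6
  pgSource <- readChar(f, nchars = file.info(f)$size)
  doc <- read_html(pgSource)
}
```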

which for some files throws:

Error in eval(substitute(expr), envir, enclos) : 
  Excessive depth in document: 256 use XML_PARSE_HUGE option [1] 

I have seen these posts, SO33819103 and SO31419409, which point to similar problems, but I cannot fully understand how to use @shabbychef's workaround suggested in both posts via the snippet posted by @glossarch in the first link above.

library(drat)
drat:::add("shabbychef")
install.packages("xml2")
library(xml2)

EDIT: I noticed that when I previously ran another script that scraped the data live from the web pages using URLs, I did not encounter this problem. The code was the same; I was simply calling doc <- read_html(pgSource) directly after fetching the source from RSelenium's remoteDriver.

What I would like to ask this gentle community is whether I am following the right steps in installing and loading xml2 after adding shabbychef's drat repository, or whether I need to add some other step, as suggested in the SO17154308 post. Any help or suggestions are greatly appreciated. Thank you.

  • Those sizes are pretty reasonable and I suspect this may be malformed HTML as one of the SO posts you linked suggests. Can you provide some of the data? If not, running the HTML through [`htmltidy`](https://github.com/hrbrmstr/htmltidy) (use the GH version as I need to do a CRAN push of it soon) may "fix" it enough to prevent the parser error. w/r/t using Steven's code, you can also do `devtools::install_github("shabbychef/xml2")` if the drat method wasn't working. – hrbrmstr Sep 24 '16 at 09:43
  • Thanks for your help. I have tried to install both `htmltidy` and `shabbychef/xml2` as kindly suggested by you. I also had to install RTools beforehand. This time I did not get the error as before, as RStudio kept crashing after `doc …` – salvu Sep 24 '16 at 12:44
  • If you had a core dump for `htmltidy` it's probably due to using the CRAN version vs the GitHub one (I fixed a bug that has yet to make it into CRAN). However, please install the `xml2` package from CRAN in a fresh R session. Then try `pg …` – hrbrmstr Sep 24 '16 at 14:10
  • @hrbrmstr - Thanks for your valuable help. For some reason, in the old session, `htmltidy` was not being downloaded and installed properly. I also used `xml2` from CRAN. Now read_html worked when tested with one of the problem files. In the meantime, I was using the live version to download the page sources and use them without saving, as that did not produce errors. But this solution enables me to use saved source rather than go live. I don't want to trigger any web server's bells :-). Thanks a million. – salvu Sep 25 '16 at 13:14

1 Answer


I don't know if this is the right thing to do, but my question was answered by @hrbrmstr in one of his comments. I decided to post an answer so that people stumbling upon this question see that it has at least one answer.

The problem is basically solved by using the "HUGE" option when reading the HTML source. My problem occurred only when loading previously saved source; I did not see it when using the "live" version of the application, i.e. reading the source from the website directly.

Anyway, the August 2016 update of the excellent xml2 package permits the use of the HUGE option as follows:

doc <- read_html(pageSource, options = "HUGE")

For more information, please read the xml2 reference manual here: CRAN-xml2
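
To illustrate the fix end to end, here is a small self-contained sketch (the deeply nested toy source below stands in for my real saved page files). "HUGE" passes libxml2's XML_PARSE_HUGE flag, which lifts the default nesting-depth limit of 256 that produced the error:

```r
library(xml2)

# Toy source standing in for a real saved page: 300 nested <div>s,
# deeper than libxml2's default limit of 256.
pageSource <- paste0(paste(rep("<div>", 300), collapse = ""),
                     "deep",
                     paste(rep("</div>", 300), collapse = ""))

# Without options = "HUGE" this depth triggers
# "Excessive depth in document: 256 use XML_PARSE_HUGE option"
doc <- read_html(pageSource, options = "HUGE")
```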

I wish to thank @hrbrmstr again for his valuable contribution.

  • It's 100% cool to answer your own question! Glad it helped (these bugs can be really frustrating, so it's great that you can get on with analyzing vs data wrangling :-) – hrbrmstr Sep 25 '16 at 14:27