First, I'd like to apologize for asking a new question: my reputation does not yet allow me to comment on other people's answers, in particular on two SO posts I've seen. So please bear with this older guy :-)
I am trying to read a list of about 100 character files, ranging in size from roughly 90 KB to 2 MB, and then use the `qdap` package to compute some statistics on the text I extract from them, namely counting sentences, words, etc. The files contain webpage source previously scraped with `RSelenium::remoteDriver$getPageSource()` and saved to disk with `write(pgSource, "fileName.txt")`. I am reading the files in a loop using:

```r
pgSource <- readChar(file.path(fPath, fileNames[i]), nchars = 1e6)
doc <- read_html(pgSource)
```
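For context, the full loop looks roughly like this (`fPath` and `fileNames` are from my real script; the `tryCatch` wrapper is just something I added to see which files fail):

```r
library(xml2)

fileNames <- list.files(fPath, pattern = "\\.txt$")
for (i in seq_along(fileNames)) {
  # read up to 1e6 characters of raw page source from each file
  pgSource <- readChar(file.path(fPath, fileNames[i]), nchars = 1e6)
  doc <- tryCatch(
    read_html(pgSource),
    error = function(e) {
      message("Failed on ", fileNames[i], ": ", conditionMessage(e))
      NULL
    }
  )
  if (is.null(doc)) next
  # ... extract text from doc and feed it to qdap here ...
}
```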
which, for some files, throws:

```
Error in eval(substitute(expr), envir, enclos) :
  Excessive depth in document: 256 use XML_PARSE_HUGE option [1]
```
I have seen two posts, SO33819103 and SO31419409, that point to similar problems, but I cannot fully understand how to apply @shabbychef's workaround suggested in both of them, using the snippet posted by @glossarch in the first link above:
```r
library(drat)
drat:::add("shabbychef")
install.packages('xml2')
library("xml2")
```
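My understanding from the linked posts is that the point of the patched `xml2` build is to let the parser options (including `HUGE`, i.e. `XML_PARSE_HUGE`) be passed through to libxml2, so after installing it I am calling `read_html` like this (the `options` argument is what I gathered from those posts, not something I have confirmed in the documentation):

```r
library(xml2)

# default read_html options plus HUGE, which should lift the
# hard-coded depth/size limits that trigger the error above
doc <- read_html(pgSource,
                 options = c("RECOVER", "NOERROR", "NOBLANKS", "HUGE"))
```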
EDIT: I noticed that when I previously ran another script that scraped the data live from the webpages by URL, I did not encounter this problem. The code was the same; I was just calling `doc <- read_html(pgSource)` right after reading the page source from RSelenium's `remoteDriver`.
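For completeness, the live version that worked was essentially this (`remDr` is the RSelenium remote driver object from my script, and `url` stands in for the pages I was scraping):

```r
library(RSelenium)
library(xml2)

remDr$navigate(url)
# getPageSource() returns a list; the source string is its first element
pgSource <- remDr$getPageSource()[[1]]
doc <- read_html(pgSource)  # this never threw the depth error
```

So the only difference seems to be whether the source string passed through `write()`/`readChar()` on its way to `read_html()`.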
What I would like to ask this gentle community is whether I am following the right steps in installing and loading `xml2` after adding shabbychef's drat repository, or whether I need some additional step, as suggested in post SO17154308. Any help or suggestions are greatly appreciated. Thank you.