Is there a disadvantage to using readLines for HTML / XML parsing?

Question

Due to global IT settings, I am having a hard time using htmlParse or read_html. The solution for my purposes was to use readLines from the base package and then parse the result with htmlParse. Is there a disadvantage to this process that I am not aware of?

At least for my MWE it seems to yield the same output. Maybe this will be different for more elaborate HTML code.

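library(XML)

mailing_url <- "http://www.r-project.org/mail.html"

# Route 1: download the page as lines of text, then parse that text
mailing_lines <- readLines(mailing_url)
mailing_doc.RL <- htmlParse(mailing_lines)

# Route 2: let htmlParse fetch and parse the URL itself
mailing_doc.HTML <- htmlParse(mailing_url)

# Compare the two parsed documents
all.equal(mailing_doc.RL, mailing_doc.HTML)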
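
One caveat about the check itself: the parsed documents are external pointer objects, so all.equal may report TRUE without inspecting their content (an assumption worth verifying on your own setup). A minimal sketch of a stricter comparison is to serialize both trees with saveXML and compare the strings. The inline `html_lines` data and the names `same_markup` and `detects_change` are hypothetical, chosen so the sketch runs without network access.

```r
library(XML)

# Hypothetical inline page standing in for the output of readLines(),
# so the sketch does not depend on network access.
html_lines <- c(
  "<html><head><title>Demo</title></head>",
  "<body><p>Hello</p></body></html>"
)

# Route 1: a character vector of lines, as readLines() would return.
doc_lines <- htmlParse(html_lines, asText = TRUE)

# Route 2: the same content collapsed into a single string.
doc_text <- htmlParse(paste(html_lines, collapse = "\n"), asText = TRUE)

# A deliberately different document, to confirm the check can fail.
doc_other <- htmlParse("<html><body><p>Bye</p></body></html>", asText = TRUE)

# Compare the serialized markup rather than the pointer objects.
same_markup    <- identical(saveXML(doc_lines), saveXML(doc_text))
detects_change <- !identical(saveXML(doc_lines), saveXML(doc_other))
```

Applied to the MWE above, the same idea would be identical(saveXML(mailing_doc.RL), saveXML(mailing_doc.HTML)).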

What exactly do your IT settings block, such that this works but using `htmlParse` directly doesn't? I wouldn't think they would be any different. — MrFlick, Jul 02 '18 at 20:14
I am at my home computer, but I think it was something like `could not resolve host name`. I am trying to contact my IT department as well, but they are rather touchy on these topics and I do not want to wake any sleeping dogs. Since the code works on my home computer and does not on my office computer, I am assuming it is because of the IT settings. — Max M, Jul 02 '18 at 20:18
It's hard to believe that `readLines` would work but `htmlParse` would fail. They would both have to resolve host names. You're sure they are different? — MrFlick, Jul 02 '18 at 20:20
For instance, using `read_html` yields `Error in open.connection(x, "rb") : Could not resolve host: www.r-project.org` — Max M, Jul 03 '18 at 10:51
I found a workaround for my IT problem: https://stackoverflow.com/questions/36043172/package-rvest-for-web-scraping-https-site-with-proxy/38463559#38463559 and https://stackoverflow.com/questions/33295686/rvest-error-in-open-connectionx-rb-timeout-was-reached. This is not an answer to my original question, but it sidesteps it. — Max M, Jul 04 '18 at 07:19

0 Answers