
I am trying to read the member type and comments on the link below using the rvest package. However, my code always returns only the top 10 comments. I suspect that read_html(url) is not reading the full page. Please help me with this.

Below is the code I am using:

library(rvest)

url <- "http://mmb.moneycontrol.com/stock-message-forum/axisbank/comments/3142?utm_source=PC_SENTI"

# Read the page and pull the text of the author/comment links
html_content <- read_html(url)
html_main_node <- html_nodes(html_content, ".info a")
html_text(html_main_node)

Thanks!

Apoorv
  • `rvest` is reading the complete page. It’s just that the page keeps loading more content *dynamically*. – Konrad Rudolph Feb 09 '16 at 14:18
  • In that case, is there a way to increase the read time of the session? Any other suggestion to address this would be really helpful. – Apoorv Feb 09 '16 at 14:23
  • That wouldn’t help. The site loads content *when you scroll down*. But even if it loaded after a certain time, `read_html` wouldn’t execute scripts on the site; it reads the static content only — as it should. I’m not aware of an easy way to scrape this content. It’s a sad fact that such “fancy” scripts break usability (and, in this case, machine readability). That said, the website in question probably doesn’t *want* to be scraped, so of course they’re not interested in making it easy. – Konrad Rudolph Feb 09 '16 at 14:26
  • I was able to achieve it using the RSelenium package, following the approach in [this answer](http://stackoverflow.com/questions/29861117/r-rvest-scraping-a-dynamic-ecommerce-page). – Apoorv Feb 09 '16 at 15:52
  • Posting this so folks don't go Selenium-crazy. Here's the dynamic URL: [http://mmb.moneycontrol.com/india/messageboard/get_ajaxv2_topic_output.php?topic_id=3142&que=latest&pgno=5&last_table=msg_detail&limit=100](http://mmb.moneycontrol.com/india/messageboard/get_ajaxv2_topic_output.php?topic_id=3142&que=latest&pgno=5&last_table=msg_detail&limit=100) (change the 100 to as high as you think you can go). Developer Tools -> Network -> XHR clicks FTW. – hrbrmstr Feb 09 '16 at 17:30
  • I'm having the same problem with https://www.rottentomatoes.com/browse/in-theaters/ ... any ideas? My workaround is to download the page manually, save it as "Webpage, Complete", and then read from the downloaded file. For example, read_html("https://www.rottentomatoes.com/browse/in-theaters/") %>% html_nodes(".movie_info") doesn't find anything, but there's no problem with the downloaded version. – jtr13 Sep 20 '19 at 22:08
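To save others a trip into Selenium, here is a minimal sketch of the XHR approach from hrbrmstr's comment above. It assumes the AJAX endpoint still serves an HTML fragment and that the `.info a` selector from the question also matches inside that fragment; the `pgno` and `limit` values are illustrative and may need adjusting:

```r
library(rvest)

# Query the AJAX endpoint that the page calls when you scroll,
# instead of the scrolling page itself. Parameter names come from
# the URL observed in Developer Tools -> Network -> XHR.
ajax_url <- paste0(
  "http://mmb.moneycontrol.com/india/messageboard/get_ajaxv2_topic_output.php",
  "?topic_id=3142&que=latest&pgno=1&last_table=msg_detail&limit=100"
)

page <- read_html(ajax_url)

# Extract the same nodes as in the question; inspect the fragment
# first if this selector returns nothing.
comments <- html_text(html_nodes(page, ".info a"))
head(comments)
```

Paging through `pgno` in a loop (until the endpoint returns an empty fragment) would collect the full comment history without a headless browser.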
