I'm trying to extract a YouTube video description using rvest. I know it would be easier to use the API, but the end goal is to get more familiar with rvest rather than just to get the video description. This is what I have done so far:

library(rvest)

# defining website
page <- "https://www.youtube.com/watch?v=4PqdqWWSHJY"

# setting Xpath
Xp <- '/html/body/div[2]/div[4]/div/div[5]/div[2]/div[2]/div/div[2]/meta[2]'

# getting page
Website <- read_html(page)

# selecting the node the XPath points to
Description <- html_nodes(Website, xpath = Xp)

# printing description
html_attr(Description, name = "content")

While this does point to the video description, I don't get the full video description but a character string that is cut off after a few lines:

[1] "The Conservatives and Labour have been outlining their main pitch to voters. The Prime Minister Boris Johson in his first major speech of the campaign said a..."

The expected output would be the full description:

"The Conservatives and Labour have been outlining their main pitch to voters. The Prime Minister Boris Johnson in his first major speech of the campaign said a Conservative government would unite the country and "level up" the prospects for people with massive investment in health, better infrastructure, more police, and a green revolution. But he said the key issue to solve was Brexit. Meanwhile Labour vowed to outspend the Tories on the NHS in England. 

Labour leader Jeremy Corbyn has also faced questions over his position on allowing a second referendum on Scottish independence. Today at the start of a two-day tour of Scotland, he said wouldn't allow one in the first term of a Labour government but later rowed back saying it wouldn't be a priority in the early years. 

Sophie Raworth presents tonight's BBC News at Ten and unravels the day's events with the BBC's political editor Laura Kuenssberg, health editor Hugh Pym and Scotland editor Sarah Smith.


Please subscribe HERE: LINK"

Is there any way of getting the full description with rvest?

Ju Ko

1 Answer

Since you said your focus is on learning, I first show the code and then explain how I arrived at it.

Reproducible code:

library(rvest)
library(magrittr)
url <- "https://www.youtube.com/watch?v=4PqdqWWSHJY"
url %>% 
  read_html %>% 
  html_nodes(xpath = "//*[@id = 'eow-description']") %>% 
  html_text

Explanation:

1. Locate the element

There are several ways to approach this. A common first step is to right-click your target element in the browser and select "Inspect element". The inspector then shows the element that holds the description, along with attributes such as its id.

Next, you can try to extract the data.

url %>% 
  read_html %>% 
  html_nodes(xpath = "//*[@id = 'description']")

Unfortunately, this doesn't work in your case.
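A quick way to confirm that a selector matched nothing is to count the nodes it returns. The following is only a minimal sketch: sample_html is a made-up stand-in for a downloaded page (not the real YouTube markup), and the names sample_html and hits are hypothetical.

```r
library(rvest)

# Hypothetical stand-in for a downloaded page: the id here is NOT 'description'
sample_html <- '<html><body><p id="eow-description">Some video description</p></body></html>'
doc <- read_html(sample_html)

# Same XPath pattern as in the attempt above
hits <- html_nodes(doc, xpath = "//*[@id = 'description']")

# Zero nodes means the selector matched nothing in this document
length(hits)
```

If the count is zero, the id seen in the browser inspector probably does not exist in the HTML that read_html received.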

2. Ensure you have the correct source

Next, make sure that your target data is actually in the document you loaded. You can check this in the network activity of your browser or, if you prefer to check within R, with a small function I wrote for that:

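# Write the parsed document to a temporary file and open it in the RStudio viewer
showHtmlPage <- function(doc){
  tmp <- tempfile(fileext = ".html")
  doc %>% toString %>% writeLines(con = tmp)
  # rstudioapi::viewer requires a running RStudio session
  tmp %>% browseURL(browser = rstudioapi::viewer)
}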

Usage:

url %>% read_html %>% showHtmlPage

You will see that your target data is in fact within the document you downloaded, so you can stick with rvest. Next, you have to find the XPath (or CSS selector) that points to it.
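If you would rather not open a viewer, the same check can be sketched as a plain text search over the serialised document. This is an illustration under assumptions, not part of the original approach: sample_html is a made-up stand-in for the downloaded page, and found is a hypothetical name.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page
sample_html <- '<html><body><p id="eow-description">The Conservatives and Labour have been outlining their main pitch to voters.</p></body></html>'
doc <- read_html(sample_html)

# Serialise the parsed document and search it for a known snippet of the target text
found <- grepl("The Conservatives and", as.character(doc), fixed = TRUE)
found
```

With a real page you would replace the read_html(sample_html) call with read_html(url) and search for a phrase you know appears in the description.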

3. Locate target tag within downloaded document

You can search for tags that contain the text you are looking for:

doc <- read_html(url)
doc %>% html_nodes(xpath = "//*[contains(text(), 'The Conservatives and ')]")

The output will be:

{xml_nodeset (1)}
[1] <p id="eow-description" class="">The Conservatives and Labour have ....

This shows that the tag you are looking for has the id eow-description.
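The two steps, finding the tag by its text and then selecting it by id, can be tried offline on a small fragment. This is a sketch under assumptions: sample_html is an invented stand-in for the page, and match, found_id, via_xpath and via_css are hypothetical names. It also shows the CSS selector that appears to be equivalent to the XPath used at the top of this answer.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page
sample_html <- '<html><body><div id="other">Unrelated</div><p id="eow-description">The Conservatives and Labour have been outlining their main pitch to voters.</p></body></html>'
doc <- read_html(sample_html)

# Step 1: find the element whose own text contains a known snippet, then read its id
match <- html_nodes(doc, xpath = "//*[contains(text(), 'The Conservatives and ')]")
found_id <- html_attr(match, "id")

# Step 2: select by that id, once with XPath and once with a CSS selector
via_xpath <- html_text(html_nodes(doc, xpath = "//*[@id = 'eow-description']"))
via_css <- html_text(html_nodes(doc, css = "#eow-description"))

# Both selectors should return the same text
identical(via_xpath, via_css)
```

Whether you use XPath or CSS here is mostly a matter of taste; the text search in step 1 needs XPath, because CSS selectors cannot match on text content.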

Tonio Liebrand