I am new to rvest. How do I extract only the elements that have two class names, or only the elements that have a single class name, in the tag?

This is my code and issue:

doc <- paste("<html>",
             "<body>",
             "<span class='a1 b1'> text1 </span>",
             "<span class='b1'> text2 </span>",
             "</body>",
             "</html>"
            )
library(rvest)
read_html(doc) %>% html_nodes(".b1")  %>% html_text()
# Output: text1, text2
# What I want: text2

# I also want to extract only the elements with 2 class names
read_html(doc) %>% html_nodes(".a1 .b1") %>% html_text()
# Output that I want: text1

This is my machine spec:

Operating System: Windows 10.

rvest version: 0.3.2

R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Can anybody help?

addicted

1 Answer

You can use CSS selectors as follows:

Select elements whose class contains b1 but not a1:

read_html(doc) %>% html_nodes(".b1:not(.a1)")
# {xml_nodeset (1)}
# [1] <span class="b1"> text2 </span>

Or use an attribute selector, which matches only when the whole class attribute is exactly b1:

read_html(doc) %>% html_nodes("[class='b1']")
# {xml_nodeset (1)}
# [1] <span class="b1"> text2 </span>
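
One caveat with the attribute form, sketched below under the assumption that the document is the same two-span example from the question: `[class='b1']` compares the entire attribute string, so an element with, say, `class='b1 c1'` would not match it, whereas `.b1:not(.a1)` would. The `~=` operator matches a single word in the space-separated class list instead, which makes it behave like `.b1`. The variable names `page`, `exact` and `word` are illustrative only.

```r
library(rvest)

# Same markup as in the question, rebuilt so the snippet runs on its own
doc <- paste0(
  "<html><body>",
  "<span class='a1 b1'> text1 </span>",
  "<span class='b1'> text2 </span>",
  "</body></html>"
)
page <- read_html(doc)

# Exact match on the whole attribute string: only class="b1" qualifies
exact <- page %>% html_nodes("[class='b1']") %>% html_text(trim = TRUE)

# Word match within the whitespace-separated class list: behaves like ".b1"
word <- page %>% html_nodes("[class~='b1']") %>% html_text(trim = TRUE)

print(exact)
print(word)
```

Here `exact` should hold just text2, while `word` should hold both text1 and text2.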

Select elements whose class contains both a1 and b1:

read_html(doc) %>% html_nodes(".a1.b1")
# {xml_nodeset (1)}
# [1] <span class="a1 b1"> text1 </span>
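
The reason `.a1 .b1` in the question returns nothing is the space: with a space it is a descendant selector (a `.b1` element nested inside an `.a1` element), while without a space it requires both classes on the same element. A minimal sketch comparing the variants on the question's markup follows; the names `page`, `descendant`, `compound` and `only_b1` are just for illustration.

```r
library(rvest)

# Same markup as in the question, rebuilt so the snippet runs on its own
doc <- paste0(
  "<html><body>",
  "<span class='a1 b1'> text1 </span>",
  "<span class='b1'> text2 </span>",
  "</body></html>"
)
page <- read_html(doc)

# With a space: a .b1 element nested inside an .a1 element (none exist here)
descendant <- page %>% html_nodes(".a1 .b1") %>% html_text(trim = TRUE)

# Without a space: a single element carrying both classes
compound <- page %>% html_nodes(".a1.b1") %>% html_text(trim = TRUE)

# b1 only, excluding elements that also carry a1
only_b1 <- page %>% html_nodes(".b1:not(.a1)") %>% html_text(trim = TRUE)

print(descendant)
print(compound)
print(only_b1)
```

`html_text(trim = TRUE)` strips the surrounding spaces, so `compound` should come back as text1 and `only_b1` as text2, with `descendant` empty.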
Psidom
  • Thanks! For your first solution, what is `:not()`? Is it a single piece of syntax, or can the `:` be used in conjunction with other tags/classes/ids? – addicted Aug 02 '17 at 03:45
  • `:not()` means literally that: the element must not match the selector in the parentheses. And yes, you can combine it with a tag name or an id, as in `span.b1:not(.a1)`. You can check [here](https://stackoverflow.com/questions/1028248/how-to-combine-class-and-id-in-css-selector) for more info. – Psidom Aug 02 '17 at 03:58