I use R and its package xml2 to parse an HTML document. I extracted a piece of the HTML file, which looks like this:
text <- ('<div>
<p><span class="number">1</span>First <span class="small-accent">previous</span></p>
<p><span class="number">2</span>Second <span class="accent">current</span></p>
<p><span class="number">3</span>Third </p>
<p><span class="number">4</span>Fourth <span class="small-accent">last</span> A</p>
</div>')
My goal is to extract information from this text and convert it into a data frame that looks like this:
  number  label     text_of_accent  type_of_accent
1 1       First     previous        small-accent
2 2       Second    current         accent
3 3       Third
4 4       Fourth A  last            small-accent
I tried the following code:
library(xml2)
library(magrittr)

html_1 <- text %>%
  read_html() %>%
  xml_find_all("//span[@class='number']")

number <- html_1 %>% xml_text()

label <- html_1 %>%
  xml_parent() %>%
  xml_text(trim = TRUE)

text_of_accent <- html_1 %>%
  xml_siblings() %>%
  xml_text()

type_of_accent <- html_1 %>%
  xml_siblings() %>%
  xml_attr("class")
Unfortunately, label, text_of_accent, and type_of_accent are not extracted as I expect:
label
[1] "1First previous" "2Second current" "3Third" "4Fourth last A"
text_of_accent
[1] "previous" "current" "last"
type_of_accent
[1] "small-accent" "accent" "small-accent"
Is it possible to achieve my goal with just xml2, or do I need some additional tools? At the very least, is it possible to extract the pieces of text for label?
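For reference, here is a minimal sketch of one direction that might work with xml2 alone. It starts from each <p> element instead of from the number spans, so that a paragraph without an accent span still produces an entry. The names paragraphs, number_nodes, accent_nodes, and result are my own, and treating "any span whose class is not number" as the accent span is an assumption based on the sample above. Note that the missing accent in row 3 comes out as NA here, not as an empty string.

```r
library(xml2)

text <- '<div>
<p><span class="number">1</span>First <span class="small-accent">previous</span></p>
<p><span class="number">2</span>Second <span class="accent">current</span></p>
<p><span class="number">3</span>Third </p>
<p><span class="number">4</span>Fourth <span class="small-accent">last</span> A</p>
</div>'

# One node per row of the desired data frame.
paragraphs <- xml_find_all(read_html(text), "//p")

# xml_find_first() returns one result per input node (a missing node when
# nothing matches), so these stay aligned with the paragraphs.
number_nodes <- xml_find_first(paragraphs, "./span[@class='number']")
# Assumption: the accent is the first span whose class is not "number".
accent_nodes <- xml_find_first(paragraphs, "./span[not(@class='number')]")

# Label: only the text nodes that are direct children of <p>,
# joined together with runs of whitespace collapsed.
label <- vapply(paragraphs, function(p) {
  pieces <- xml_text(xml_find_all(p, "./text()"))
  trimws(gsub("\\s+", " ", paste(pieces, collapse = "")))
}, character(1))

result <- data.frame(
  number = xml_text(number_nodes),
  label = label,
  text_of_accent = xml_text(accent_nodes),       # NA where no accent span exists
  type_of_accent = xml_attr(accent_nodes, "class"),
  stringsAsFactors = FALSE
)

print(result)
```

The XPath step ./text() selects only the direct text children of each paragraph, which is what keeps the contents of the spans out of label.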