I am currently studying the book Text Mining with R by Silge and Robinson and, being a newcomer, I can't quite understand how the regex "^chapter [\\divxlc]" works out the chapter numbers when tidying the texts. I have tried it in regex101, though I may not know how to use that tool properly for this. Can somebody help me figure it out? This is the code I am referring to:
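    library(dplyr)
    library(stringr)
    library(tidytext)
    library(janeaustenr)

    tidy_books <- austen_books() %>%
      group_by(book) %>%
      # number the lines within each book and count chapter headings seen so far
      mutate(linenumber = row_number(),
             chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                     ignore_case = TRUE)))) %>%
      ungroup() %>%
      unnest_tokens(word, text)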
My take is that this also identifies chapter numbers written in Roman numerals (`\d` alone would have sufficed for decimal numbers, I think). Is that so? Is there a general formula to identify chapter numbers regardless of the numbering style? If so, how would it identify chapters such as III or XXI, where some Roman numerals repeat?
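To make the question concrete, here is a minimal sketch of how I would test my reading of the pattern. The sample lines are made up by me (they are not from the Austen texts), and the variable names `samples`, `pattern`, `hits` and `chapter_id` are only for illustration.

```r
library(stringr)

# Hypothetical sample lines, invented for this test (not from the book's data)
samples <- c("CHAPTER 1",         # decimal digit after "chapter "
             "Chapter III",       # Roman numeral, repeated letters
             "chapter XXI",       # Roman numeral, mixed letters
             "Chapter the last",  # "t" is not in the character class
             "In chapter 3")      # "chapter" is not at the start of the line

# Same pattern as in the book's code: "chapter ", then ONE character
# that is either a digit or one of i, v, x, l, c (case-insensitive)
pattern <- regex("^chapter [\\divxlc]", ignore_case = TRUE)

# TRUE where a line looks like a chapter heading
hits <- str_detect(samples, pattern)

# Running count of headings, which is what ends up in the chapter column
chapter_id <- cumsum(hits)

print(data.frame(samples, hits, chapter_id))
```

If I understand it correctly, the pattern only checks the first character after "chapter ", so III and XXI would be detected because they start with I and X, not because the whole numeral is parsed. Is that reading right?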
I would appreciate any pointers or references that could clarify this.
Thanks in advance.