
I am currently studying the book Text Mining with R by Silge and Robinson and, being a beginner, I can't quite understand how the regex "^chapter [\\divxlc]" finds the chapter numbers when tidying the texts. I have tried it on regex101, though I may not know how to set it up properly for what I need. Can somebody help me figure it out? This is the code I am referring to:

library(dplyr)
library(stringr)
library(tidytext)
library(janeaustenr)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

My take on it is that this will also identify chapter numbers written in Roman numerals (`\d` alone would have sufficed for decimal numbers, I think). Is that so? Is there a general formula to identify chapter numbers regardless of the numbering scheme? If so, how would it identify chapters III, XXI, etc., where some Roman numerals repeat?

I would appreciate any indication or reference to look for clarification.

Thanks in advance.

  • See also https://stackoverflow.com/questions/4736/learning-regular-expressions – jonrsharpe Feb 11 '18 at 11:40
  • This is the reference you need - [Reference - What does this regex mean?](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean). And [**here is the regex demo**](https://regex101.com/r/3kzOvH/1), feel free to test your patterns there. `(?i)` stands for `ignore_case = TRUE`. – Wiktor Stribiżew Feb 11 '18 at 11:40

1 Answer


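The character class `[\\divxlc]` matches exactly one character from the set between the square brackets: a digit (`\d`) or one of the letters i, v, x, l, c (in either case, because of `ignore_case = TRUE`). If the character after "chapter (space)" is a digit or a Roman numeral letter, you already have a match, and you don't particularly care what follows it. That is why III and XXI are matched too: only the first I or X is needed. You could add `+` to say "one or more", but this doesn't change which lines are matched, and omitting it saves a few cycles.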
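To see this on a small scale, here is a minimal sketch. The sample lines are made up for illustration and are not taken from the `austen_books()` data. It uses base R's `grepl()` with `perl = TRUE` in place of `str_detect()`, so it runs without extra packages; this is a substitution on my part, and the pattern itself is unchanged.

```r
# Made-up sample lines (an assumption for illustration, not the real data)
text <- c(
  "SENSE AND SENSIBILITY",
  "CHAPTER 1",
  "The family of Dashwood had long been settled in Sussex.",
  "Chapter III",
  "chapter by chapter, she read on",
  "CHAPTER XXI",
  "Chapter one"
)

# TRUE where a line starts with "chapter " followed by one digit
# or one of the letters i, v, x, l, c (case-insensitive)
is_heading <- grepl("^chapter [\\divxlc]", text,
                    ignore.case = TRUE, perl = TRUE)
print(is_heading)
# FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE

# cumsum() turns the TRUE/FALSE flags into a running chapter counter
chapter <- cumsum(is_heading)
print(chapter)
# 0 1 1 2 2 3 3

# Adding + ("one or more") flags exactly the same lines
with_plus <- grepl("^chapter [\\divxlc]+", text,
                   ignore.case = TRUE, perl = TRUE)
print(identical(is_heading, with_plus))
# TRUE
```

Note that the regex never reads the chapter number itself. It only flags heading lines, and `cumsum()` does the counting, so lines before the first heading get chapter 0. "Chapter one" is not flagged here because "o" is not in the character class.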

tripleee