
I have a long character string, extracted from a PDF, that I want to process. It contains recurring instances of `Table X. Name of the table`, and in my string each of them is always followed by `\r\n`.

However, when I try to extract all the table titles into a list using `List_Tables <- str_extract_all(Plain_Text, "Table\\s+\\d+\\.\\s+(([A-z]|\\s))+\\r\\n")`, the extraction often still includes the following line, e.g.

> List_Tables
[[1]]
 [1] "Table 1. Real GDP\r\n                                                           Percentage changes\r\n"                                                                    
 [2] "Table 2. Nominal GDP\r\n                                          Percentage changes\r\n"    

What have I missed in my code?

Has QUIT--Anony-Mousse
Anthony Martin
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input in a reproducible format that can be used to test and verify possible solutions. This is one of the cases where you don't want an extra slash if you want to match the line feeds. Maybe try `str_extract_all(Plain_Text, "Table\\s+\\d+\\.\\s+(([A-z]|\\s))+\r\n")` – MrFlick Oct 28 '19 at 16:16
  • `[A-z]` matches more than just letters, have a look at an ASCII table, there are some special characters between `Z` and `a`. – Toto Nov 02 '19 at 13:52
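
As a quick check of the point about `[A-z]`, here is a minimal sketch, assuming the stringr package is installed. The vector name `between` is made up for illustration; it holds some of the ASCII characters that sit between `Z` and `a`.

```r
library(stringr)

# Some of the ASCII characters located between "Z" and "a" (hypothetical test vector).
between <- c("[", "]", "^", "_", "`")

# The range A-z spans those characters too, so each one is detected.
matched_by_wide_range <- str_detect(between, "[A-z]")

# Two explicit ranges cover letters only, so none of them is detected.
matched_by_letters_only <- str_detect(between, "[A-Za-z]")
```

If only letters are wanted in the table titles, `[A-Za-z]` is probably the safer choice.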

1 Answer

`\s` matches all whitespace, including line breaks! When combined with the greedy quantifier `+`, this means that `(([A-z]|\\s))+` matches, in your first example,

 Real GDP\r\n       […]       Percentage changes\r\n

The easiest way to fix this is to use a non-greedy quantifier: i.e. `+?` instead of `+`.
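
A minimal sketch of the difference, assuming the stringr package is installed. The sample text is invented to mimic the output shown in the question, and the variable names `greedy` and `lazy` are only illustrative.

```r
library(stringr)

# Hypothetical sample text mimicking the PDF extraction in the question.
Plain_Text <- paste0(
  "Table 1. Real GDP\r\n",
  "          Percentage changes\r\n",
  "  2018 2019\r\n",
  "Table 2. Nominal GDP\r\n",
  "          Percentage changes\r\n"
)

# Greedy: the group keeps consuming letters and whitespace (line breaks included),
# so each match runs on to the last \r\n it can still reach.
greedy <- str_extract_all(Plain_Text, "Table\\s+\\d+\\.\\s+([A-z]|\\s)+\\r\\n")[[1]]

# Non-greedy: the group stops at the first \r\n after the table title.
lazy <- str_extract_all(Plain_Text, "Table\\s+\\d+\\.\\s+([A-z]|\\s)+?\\r\\n")[[1]]
```

With this sample, `lazy` should hold only the two title lines, while `greedy` also carries the "Percentage changes" line in each element.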

Just for completeness’ sake I’ll mention that there are alternatives, but they get more complicated. For instance, you could use a negative lookahead assertion as an “if” test, so that whitespace is only matched when it isn’t a line break character; or you could use the character class `[ \t]` instead of `\s`, which is more restrictive but also more explicit and probably closer to what you want.
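
Both alternatives might look roughly like this. This is a sketch assuming stringr is installed; the sample text and the pattern variable names are invented for illustration, and `[A-Za-z]` is used in place of `[A-z]` so that only letters are matched.

```r
library(stringr)

# Hypothetical sample text mimicking the PDF extraction in the question.
Plain_Text <- paste0(
  "Table 1. Real GDP\r\n",
  "          Percentage changes\r\n",
  "  2018 2019\r\n",
  "Table 2. Nominal GDP\r\n",
  "          Percentage changes\r\n"
)

# Alternative 1: explicit character class, allowing only letters, spaces and tabs
# in the title, so the title cannot run across a line break.
class_pattern <- "Table\\s+\\d+\\.\\s+[A-Za-z \\t]+\\r\\n"

# Alternative 2: negative lookahead, allowing \s only where the next character
# is neither \r nor \n.
lookahead_pattern <- "Table\\s+\\d+\\.\\s+(?:[A-Za-z]|(?![\\r\\n])\\s)+\\r\\n"

by_class <- str_extract_all(Plain_Text, class_pattern)[[1]]
by_lookahead <- str_extract_all(Plain_Text, lookahead_pattern)[[1]]
```

Because neither pattern lets the title part consume a line break, a plain greedy `+` is enough here.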

Konrad Rudolph
  • Thank you that works. `[ \t]` works too. For anyone interested, `\t` is a more restrictive subset of `\s` as described a bit here : https://stackoverflow.com/questions/17950842/what-is-the-difference-between-s-and-t – Anthony Martin Oct 28 '19 at 16:34