
I'm having trouble with a Regex statement that I want to use in R to extract full matches of a pattern from a data frame.

I have 11 sentence patterns, and I want to select only the records in my data frame that fully match one of these patterns, using a single regex. (I've been able to get this to work with multiple regexes, but it's a real hassle.) Any help on how to simplify this would be much appreciated.

These are my sentences:

  • A change to headings 0101 through 0106 from any other chapter.
  • A change to subheadings 0712.20 through 0712.39 from any other chapter.
  • A change to heading 0903 from any other chapter.
  • A change to subheading 1806.20 from any other heading.
  • A change to subheading 1207.99 from any other chapter.
  • A change to heading 4302 from any other heading.
  • A change to subheading 4105.10 from heading 4102 or any other chapter.
  • A change to subheading 4105.30 from heading 4102, subheading 4105.10 or any other chapter.
  • A change to subheading 4106.21 from subheading 4103.10 or any other chapter.
  • A change to subheading 4106.22 from subheadings 4103.10 or 4106.21 or any other chapter.
  • A change to tariff item 7304.41.30 from subheading 7304.49 or any other chapter.

This is the regex I have now. It selects partial matches as well as full matches (this is where I'm stuck), so I end up getting records I don't want from my data frame in addition to these sentences. (I know this is messy; it's just an example.)

^A change to (?:headings|heading|subheadings|subheading|tariff item) (?:\d+\S\d+\S\d+|\d+\S\d+) (?:through \d+\S\d+ from any other chapter.|from any other chapter.|from any other heading.|)|from heading \d+\S\d+ or any other chapter.|from (?:heading|subheading|subheadings) \d+\S\d+|, subheading \d+\S\d+ or any other chapter| or any other chapter.| or \d+\S\d+

This is as far as I can get with a regex that still matches all 11 sentences. I'm having trouble continuing to group cleanly after this:

^A change to (?:tariff item|headings|heading|subheading|subheadings) (?:\d+\S\d+|\d+\S\d+\S\d+|\d+\S\d+) (?:from|through) 


Ryan
  • What comes in-between "A change" and "chapter." or "heading." that does NOT match your patterns? – Chris Ruehlemann Jan 13 '19 at 17:05
  • I see what you're getting at: why not just select everything that starts and ends with those combinations, like this: ^(?:A change to)(?:.*)(?:other chapter.|other heading.)$ I wanted to select only these exact patterns so that nothing unwanted was selected by accident, but you're right, that would be the easiest solution. Thanks. – Ryan Jan 14 '19 at 13:26

1 Answer


You may use

rx <- "A\\s+change\\s+to\\s+(?:(?:sub)?headings?|tariff\\s+item)\\s+\\d[0-9.]*(?:\\s+through\\s+\\d[0-9.]*)?\\s+from(?:(?:,?\\s+(?:sub)?headings?\\s+\\d[0-9.]*)+(?:\\s+or\\s+\\d[0-9.]*)*\\s+or)?\\s+any\\s+other\\s+(?:heading|chapter)\\."

See the regex demo. Note that \s+ matches 1 or more whitespace chars, and will match even if the number and type of whitespace between the words is not constant.
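As a small illustration of the whitespace point, the sketch below compares a literal single-space search with the \s+ version on two made-up strings (the vector x is hypothetical, not taken from the question's data):

```r
# Two hypothetical inputs: one with single spaces, one with a double
# space and a tab between the words.
x <- c("A change to heading 0903", "A  change\tto heading 0903")

# Literal search: requires exactly one space between the words.
literal <- grepl("A change to", x, fixed = TRUE)

# \s+ search: accepts any run of one or more whitespace characters.
flexible <- grepl("A\\s+change\\s+to", x, perl = TRUE)

print(literal)   # TRUE FALSE
print(flexible)  # TRUE TRUE
```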

Details

  • A\\s+change\\s+to\\s+ - A change to substring
  • (?:(?:sub)?headings?|tariff\\s+item) - subheading, subheadings, heading, headings, tariff item substrings
  • \\s+\\d[0-9.]* - 1+ whitespaces, 1 digit and 0 or more digits or .
  • (?:\\s+through\\s+\\d[0-9.]*)? - an optional sequence of:
    • \\s+ - 1+ whitespaces
    • through - through
    • \\s+ - 1+ whitespaces
    • \\d[0-9.]* - 1 digit and 0 or more digits or .
  • \\s+from - 1+ whitespaces and from
  • (?:(?:,?\\s+(?:sub)?headings?\\s+\\d[0-9.]*)+(?:\\s+or\\s+\\d[0-9.]*)*\\s+or)? - an optional sequence of:
    • (?:,?\\s+(?:sub)?headings?\\s+\\d[0-9.]*)+ - 1 or more sequences of:
      • ,? - an optional ,
      • \\s+ - 1+ whitespaces
      • (?:sub)?headings? - an optional sub, then heading and then an optional s
      • \\s+ - 1+ whitespaces
      • \\d[0-9.]* - a digit and then 0+ digits or . chars
    • (?:\\s+or\\s+\\d[0-9.]*)* - 0 or more sequences of:
      • \\s+ - 1+ whitespaces
      • or\\s+\\d[0-9.]* - or, 1+ whitespaces, a digit and then 0+ digits or . chars
    • \\s+or - 1+ whitespaces and or
  • \\s+any\\s+other\\s+(?:heading|chapter)\\. - any other heading. or any other chapter.
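The breakdown above can also be mirrored in code by assembling the pattern from named pieces with paste0(), which may be easier to maintain than one long string. This is only a sketch: the helper names num, unit, target, through_part and sources are made up for illustration, and the final check confirms that the assembled string is identical to the rx shown above.

```r
# Reusable pieces (hypothetical names chosen for this sketch).
num  <- "\\d[0-9.]*"           # a digit, then 0+ digits or dots
unit <- "(?:sub)?headings?"    # heading(s) or subheading(s)

# What is being changed: "(sub)heading(s)" or "tariff item", plus a number.
target <- paste0("(?:", unit, "|tariff\\s+item)\\s+", num)

# Optional "through <number>" range.
through_part <- paste0("(?:\\s+through\\s+", num, ")?")

# Optional list of source (sub)headings ending in "or".
sources <- paste0(
  "(?:(?:,?\\s+", unit, "\\s+", num, ")+",
  "(?:\\s+or\\s+", num, ")*\\s+or)?"
)

rx_parts <- paste0(
  "A\\s+change\\s+to\\s+", target, through_part,
  "\\s+from", sources,
  "\\s+any\\s+other\\s+(?:heading|chapter)\\."
)

# The pattern from the answer, repeated here for comparison.
rx <- "A\\s+change\\s+to\\s+(?:(?:sub)?headings?|tariff\\s+item)\\s+\\d[0-9.]*(?:\\s+through\\s+\\d[0-9.]*)?\\s+from(?:(?:,?\\s+(?:sub)?headings?\\s+\\d[0-9.]*)+(?:\\s+or\\s+\\d[0-9.]*)*\\s+or)?\\s+any\\s+other\\s+(?:heading|chapter)\\."

same <- identical(rx_parts, rx)
print(same)  # TRUE
```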

All 11 matches are returned in this online R demo:

text <- "A change to headings 0101 through 0106 from any other chapter.
A change to subheadings 0712.20 through 0712.39 from any other chapter.
A change to heading 0903 from any other chapter.
A change to subheading 1806.20 from any other heading.
A change to subheading 1207.99 from any other chapter.
A change to heading 4302 from any other heading.
A change to subheading 4105.10 from heading 4102 or any other chapter.
A change to subheading 4105.30 from heading 4102, subheading 4105.10 or any other chapter.
A change to subheading 4106.21 from subheading 4103.10 or any other chapter.
A change to subheading 4106.22 from subheadings 4103.10 or 4106.21 or any other chapter.
A change to tariff item 7304.41.30 from subheading 7304.49 or any other chapter."
rx <- "A\\s+change\\s+to\\s+(?:(?:sub)?headings?|tariff\\s+item)\\s+\\d[0-9.]*(?:\\s+through\\s+\\d[0-9.]*)?\\s+from(?:(?:,?\\s+(?:sub)?headings?\\s+\\d[0-9.]*)+(?:\\s+or\\s+\\d[0-9.]*)*\\s+or)?\\s+any\\s+other\\s+(?:heading|chapter)\\."
regmatches(text, gregexpr(rx, text))
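Since the question asks for full matches on records of a data frame, one possible way to apply the same pattern is to anchor it with ^ and $ and filter with grepl(). This is a sketch under assumptions: the data frame df and its columns id and rule are hypothetical, and perl = TRUE is used here although the demo above relies on R's default regex engine.

```r
rx <- "A\\s+change\\s+to\\s+(?:(?:sub)?headings?|tariff\\s+item)\\s+\\d[0-9.]*(?:\\s+through\\s+\\d[0-9.]*)?\\s+from(?:(?:,?\\s+(?:sub)?headings?\\s+\\d[0-9.]*)+(?:\\s+or\\s+\\d[0-9.]*)*\\s+or)?\\s+any\\s+other\\s+(?:heading|chapter)\\."

# Hypothetical data frame; the third row only contains a matching
# sentence as a substring, and the fourth does not match at all.
df <- data.frame(
  id = 1:4,
  rule = c(
    "A change to heading 0903 from any other chapter.",
    "A change to subheading 4106.22 from subheadings 4103.10 or 4106.21 or any other chapter.",
    "A change to heading 0903 from any other chapter. See note 2.",
    "No change in tariff classification is required."
  ),
  stringsAsFactors = FALSE
)

# Unanchored: TRUE whenever the pattern occurs anywhere in the string.
partial <- grepl(rx, df$rule, perl = TRUE)

# Anchored: TRUE only when the whole string is one of the sentences.
full <- grepl(paste0("^", rx, "$"), df$rule, perl = TRUE)

matched <- df[full, ]
print(matched$id)  # 1 2
```

Anchoring works cleanly here because the pattern has no top-level alternation, so ^ and $ apply to the whole expression.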
Wiktor Stribiżew
  • A wonderful and wonderfully complex regex plus an excellent walk-through. What does `?:` signify? It seems that it can be left out in `(?:sub)` as well as in `?:,?` (but not in the other instances!): the regex still finds all 11 sentences. – Chris Ruehlemann Jan 14 '19 at 17:16
  • @ChrisRuehlemann See [this thread](https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-what-does-do) about non-capturing groups. The text matched by such a group is not stored in memory. [Here](https://www.regular-expressions.info/brackets.html) is another good reference. – Wiktor Stribiżew Jan 14 '19 at 17:22
  • Just came across a good explanation at https://stackoverflow.com/questions/18799948/can-anyone-explain-in-regular-expression: " the `?:` within a parenthetical group turns off the capturing of that group." – Chris Ruehlemann Jan 14 '19 at 17:23
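To make the capturing versus non-capturing distinction from these comments concrete, here is a minimal sketch using regexec(), which reports the full match followed by any captured groups. The input string x is made up for illustration, and perl = TRUE is an assumption of this sketch.

```r
x <- "subheading 4105.10"

# Capturing group: the result holds the full match AND the group text.
with_capture <- regmatches(x, regexec("(sub)?heading", x, perl = TRUE))[[1]]

# Non-capturing group: same match, but nothing extra is stored.
without_capture <- regmatches(x, regexec("(?:sub)?heading", x, perl = TRUE))[[1]]

print(with_capture)     # "subheading" "sub"
print(without_capture)  # "subheading"
```

Both patterns match the same text, which is consistent with the observation that dropping ?: still finds the sentences; the difference is only in what is stored alongside the match.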