
I'm using this regex: (?i)(?<!\d)Item.*?1A.*?Risk.*?Factors.*?\n*(.+?)\n*Item.*?1B to grab the text between the two headings below:

ITEM 1A.    RISK FACTORS

In addition to other information in this Form 10-K, the following risk factors should be carefully considered in evaluating us and our business because these factors currently have a significant impact or 

In addition to other information in this Form 10-K, the following risk factors should be carefully considered in evaluating us and our business because these factors currently have a significant impact or 


ITEM 1B.

But the capturing group doesn't capture anything unless the text is a single paragraph, like this:

ITEM 1A.    RISK FACTORS

In addition to other information in this Form 10-K, the following risk factors should be carefully considered in evaluating us and our business because these factors currently have a significant impact or 

ITEM 1B.
Casimir et Hippolyte
md123

2 Answers


Your regex matches any number of newlines, then any amount of text on one line, then any number of newlines. It only looks for a single "paragraph" between newlines, because . does not match newline characters by default.
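To see that behaviour in isolation, here is a minimal sketch. It assumes Python's re module (the question doesn't name a language), and the sample string is made up for illustration.

```python
import re

# Hypothetical two-paragraph sample, separated by a blank line.
sample = "first paragraph\n\nsecond paragraph"

# By default `.` does not match "\n", so `.+` stops at the end of the first line.
dot_match = re.match(r".+", sample).group(0)

# `[\s\S]` matches whitespace or non-whitespace, i.e. any character including "\n".
any_match = re.match(r"[\s\S]+", sample).group(0)

print(repr(dot_match))
print(repr(any_match))
```

In Python, passing re.DOTALL (or using the inline flag (?s)) should have the same effect on . as the character class does here.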

Try replacing the . in the capturing group with something like [\s\S], which matches any character at all, including newlines. Note that this will capture any number of paragraphs, with any amount of whitespace between them.

(?i)(?<!\d)Item.*?1A.*?Risk.*?Factors\n*([\s\S]*?)\n*Item.*?1B

  • (?i)(?<!\d)Item.*?1A.*?Risk.*?Factors Match up to the end of "Risk Factors".
  • \n* Match as many newlines as needed until we hit the next paragraph.
  • ([\s\S]*?) Capture anything, across any number of lines (lazy).
  • \n* Match as many newlines as needed until we hit the closing heading.
  • Item.*?1B Match the closing "Item 1B" heading. (This doesn't match the . at the very end; did you mean for it to? If so, add \. to the end.)
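The breakdown above can be checked with a short script. This is only a sketch, assuming Python's re module; the filing text is a shortened, hypothetical stand-in for the excerpt in the question, and extract_risk_factors is a helper name invented for this example.

```python
import re

# The regex from the question and the revised one from this answer.
ORIGINAL = r"(?i)(?<!\d)Item.*?1A.*?Risk.*?Factors.*?\n*(.+?)\n*Item.*?1B"
FIXED = r"(?i)(?<!\d)Item.*?1A.*?Risk.*?Factors\n*([\s\S]*?)\n*Item.*?1B"

# Hypothetical, shortened stand-in for the 10-K excerpt (two paragraphs).
filing = (
    "ITEM 1A.    RISK FACTORS\n\n"
    "First paragraph of risk factors.\n\n"
    "Second paragraph of risk factors.\n\n\n"
    "ITEM 1B.\n"
)


def extract_risk_factors(pattern, text):
    """Return the text captured by group 1, or None if the pattern does not match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# The original pattern cannot span the blank line between the two paragraphs.
original_result = extract_risk_factors(ORIGINAL, filing)

# The revised pattern captures both paragraphs, without the surrounding newlines.
fixed_result = extract_risk_factors(FIXED, filing)

print(original_result)
print(repr(fixed_result))
```

Because the capture group is lazy and is followed by \n*, the newlines before ITEM 1B end up outside the captured text.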


Nick Reed

Try

(?i)(?<!\d)Item.*?1A.*?Risk.*?Factors.*?\n*((.*\n*)+)\n*Item.*?1B

And for the sake of your future regex headaches, an incredible resource: https://regex101.com

Cheers-

Cedric Druck
  • This regex fails if there are more than two paragraphs. Not part of OP's question, but just wanted to point that out. – Nick Reed Oct 05 '19 at 18:46
  • "Fixed" is debatable - you've got back-to-back `.*` wildcards now, which can (and will) cause catastrophic backtracking. According to [this test,](https://regex101.com/r/cCfYXo/2) your regex takes more than **250,000** steps to determine a match, with growing time as you add more paragraphs. That can seriously hamper production code! (There's also an empty match just before `ITEM 1B`) – Nick Reed Oct 05 '19 at 18:52
  • Ok, I'll admit yours is more elegant / performant! – Cedric Druck Oct 05 '19 at 19:01
  • Yours is extremely close to being just as powerful and performant. Have a look at this: [link](https://regex101.com/r/cCfYXo/3). By removing the second `.*` from your capture group, performance time drops by two orders of magnitude! In general, try to avoid `.*.*` - [here's](https://www.regular-expressions.info/catastrophic.html) a great resource why. The tl;dr of it is: for `.*.*` the first one will grab everything and the second will fail, then the first will give up one and the second will take it but fail, and so on. The time grows exponentially as it tests `aaa a` `aa aa` `a aaa` etc. – Nick Reed Oct 05 '19 at 19:04
  • I never noticed regex101 had a steps/performance indication! They're even more awesome than I thought. Thanks, I learnt something today! – Cedric Druck Oct 05 '19 at 19:09
  • Glad to help! Please consider modifying your answer so I can remove the downvote ;) – Nick Reed Oct 05 '19 at 19:13
  • Did so, thanks, you're a gentleman. – Cedric Druck Oct 05 '19 at 19:18