I have only a crude understanding of RegEx and I'm stumped!

My file is formatted like this:

>>  

    www.google.com  some stuff I don't want
    www.yahoo.com

<<section>>

    www.bing.com
    www.yahoo.com

<<section>>

    www.bing.com
    https://github.com/zeeshanu/learn-regex

Here are the conditions I am hoping to match:

  • only the lines after the first ">>" and before the first "<<"
  • on each line, only the first block of non-whitespace text, with nothing that follows the first whitespace
  • any leading whitespace on a line should be ignored

I have been able to write this regex, which selects the part of the line I want:

^([^>>]\s*\S*){1}

But I can't get it to apply only to the lines between those markers.

halfer
jakejake
  • So basically you just want the URLs? Do they always consist of only lowercase letters and dots? If you can narrow down what you're looking for a bit more, this might have a rather simple solution. Regarding your current pattern, `[^>>]` is a negated character class. Including `>` twice is redundant. The quantifier `{1}` is also redundant, as all characters/groups are implicitly quantified once unless otherwise indicated. – CAustin Oct 03 '17 at 19:14
  • Anyway, this might be all you need, at least according to the example you posted: `[a-z]+\.[a-z.]+` – CAustin Oct 03 '17 at 19:15
  • You could always use this `^\h*\K[^\s]+(?=.*$[\s\S]*<<section>>\h*$)`. Unfortunately, most regex flavors don't support quantifiers in lookbehinds, otherwise you could use one. The only other option I can think of is to use some sort of code capsule to ensure that there is a `>>` somewhere before it. See http://www.rexegg.com/regex-disambiguation.html#codecapsule for more info – ctwheels Oct 03 '17 at 19:20
  • Yes, I just want the URLs, but they could be any URL. ctwheels' expression worked, but if there were more double brackets, it selected beyond the first occurrence: https://regex101.com/r/e9dKzo/3 – jakejake Oct 03 '17 at 19:37
  • Try [`(?:\G(?!\A).*\R(?!\h*<<section>>\s*)\K\S+`](https://regex101.com/r/gGFenL/2) – Wiktor Stribiżew Oct 03 '17 at 20:05
  • @WiktorStribiżew thanks, this works, but can it stop at the first `<<section>>`, so it only captures the first group of URLs? https://regex101.com/r/e9dKzo/5 – jakejake Oct 03 '17 at 20:08
  • Is [`<<section>>[\s\S]*\z(*SKIP)(*F)|(?:\G(?!\A).*\R\h*|^>>\s*)\K\S+`](https://regex101.com/r/e9dKzo/6) doing what you need? – Wiktor Stribiżew Oct 03 '17 at 20:11
  • Yes! Thanks Wiktor, I would never have been able to figure that out – jakejake Oct 03 '17 at 20:17
  • Thanks for deconstructing my mistakes @CAustin. That makes sense, I guess I was way off! – jakejake Oct 03 '17 at 20:19
  • @jakejake Posted as an answer, please consider accepting. – Wiktor Stribiżew Oct 03 '17 at 20:23
  • @ctwheels There is no need to use code here; the `\G` operator comes in handy. – Wiktor Stribiżew Oct 03 '17 at 20:24
  • @WiktorStribiżew Thanks! I knew there was a reset token somewhere but couldn't find it in the documentation. I'll keep `\G` in mind for future regex. – ctwheels Oct 03 '17 at 20:26
  • @ctwheels You may check [Regexp Quote-Like Operators](https://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators), scroll to `\G assertion` – Wiktor Stribiżew Oct 03 '17 at 20:28

1 Answer

You may use

(?:\G(?!\A).*\R\h*|^>>\s*)\K\S+

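See the regex demo. You will most probably want to pass the i modifier to make the pattern case-insensitive. If >> is not at the very start of the input, the m (multiline) modifier is also needed so that ^ matches at the start of each line.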

Details

  • (?:\G(?!\A).*\R\h*|^>>\s*) - match the end of the previous match (\G(?!\A)) and then any 0+ chars other than line break chars, as many as possible (.*), then a line break (\R) and then any 0+ horizontal whitespaces (\h*), or (|) a >> substring at the start of the line and then 0+ whitespaces (\s*)
  • \K - omit the text matched so far
  • \S+ - and match and return just 1 or more chars other than whitespace.
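
For engines that lack \G, \K, \h and \R (Python's built-in re module is one of them), the same selection can be sketched as a line-by-line scan. This is a minimal sketch of a different technique, not a port of the pattern above. The function name first_tokens, the choice to ignore anything on the ">>" line itself, and the rule that any line starting with "<<" ends the block are all assumptions made for illustration.

```python
import re


def first_tokens(text):
    """Return the first non-whitespace token of each line that lies
    after the first '>>' line and before the first '<<' line."""
    tokens = []
    inside = False
    for line in text.splitlines():
        stripped = line.strip()
        if not inside:
            # Assumption: the block opens on a line that starts with '>>'.
            if stripped.startswith(">>"):
                inside = True
            continue
        # Assumption: the first line that starts with '<<' closes the block.
        if stripped.startswith("<<"):
            break
        # Skip leading whitespace, capture the first run of non-whitespace.
        match = re.match(r"\s*(\S+)", line)
        if match:
            tokens.append(match.group(1))
    return tokens


sample = """>>

    www.google.com  some stuff I don't want
    www.yahoo.com

<<section>>

    www.bing.com
    www.yahoo.com
"""

print(first_tokens(sample))  # ['www.google.com', 'www.yahoo.com']
```

Scanning stops at the first "<<" line, so later sections are never visited, and blank lines are skipped because \S+ has nothing to match on them.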
Wiktor Stribiżew