Regex multiline match first part of lines between two strings

Question

I have only a crude understanding of RegEx and I'm stumped!

My file is formatted like this:

>>  

    www.google.com  some stuff I don't want
    www.yahoo.com

<<section>>

    www.bing.com
    www.yahoo.com

<<section>>

    www.bing.com
    https://github.com/zeeshanu/learn-regex

Here are the conditions I am hoping to match:

Only the lines after the first ">>" and before the first "<<".
Only the first block of text on each line, with nothing after the first whitespace that follows it.
Ignore any leading whitespace on the line.

I have been able to make this regex to select the part of the line I want:

^([^>>]\s*\S*){1}

But I can't get it to work within the proper strings.

So basically you just want the URLs? Do they always consist of only lowercase letters and dots? If you can narrow down what you're looking for a bit more, this might have a rather simple solution. Regarding your current pattern, `[^>>]` is a negated character class. Including `>` twice is redundant. The quantifier `{1}` is also redundant, as all characters/groups are implicitly quantified once unless otherwise indicated. — CAustin, Oct 03 '17 at 19:14
Anyway, this might be all you need, at least according to the example you posted: `[a-z]+\.[a-z.]+` — CAustin, Oct 03 '17 at 19:15
You could always use this `^\h*\K[^\s]+(?=.*$[\s\S]*<>\h*$`. Unfortunately, regex doesn't support quantifiers in lookbehinds, otherwise you could do that. The only other option I can think of is to use some sort of code capsule to ensure that a `>>` appears somewhere before it. See http://www.rexegg.com/regex-disambiguation.html#codecapsule for more info — ctwheels, Oct 03 '17 at 19:20
Yes, I just want the URLs, but they could be any URL. ctwheels' expression worked, but if there were more double brackets, it selected beyond the first occurrence: https://regex101.com/r/e9dKzo/3 — jakejake, Oct 03 '17 at 19:37
Try [`(?:\G(?!\A).*\R(?!\h*<>\s*)\K\S+`](https://regex101.com/r/gGFenL/2) — Wiktor Stribiżew, Oct 03 '17 at 20:05
@WiktorStribiżew thanks, this works, but can it stop at the first <> so that it only captures the first group of URLs? https://regex101.com/r/e9dKzo/5 — jakejake, Oct 03 '17 at 20:08
Is [`<>[\s\S]*\z(*SKIP)(*F)|(?:\G(?!\A).*\R\h*|^>>\s*)\K\S+`](https://regex101.com/r/e9dKzo/6) doing what you need? — Wiktor Stribiżew, Oct 03 '17 at 20:11
Yes! thanks Wiktor I would never have been able to figure that out — jakejake, Oct 03 '17 at 20:17
Thanks for deconstructing my mistakes, @CAustin. That makes sense; I guess I was way off! — jakejake, Oct 03 '17 at 20:19
@ctwheels There is no need to use code here; the `\G` operator comes in handy. — Wiktor Stribiżew, Oct 03 '17 at 20:24
@WiktorStribiżew Thanks! I knew there was a reset token somewhere but couldn't find it in the documentation. I'll keep `\G` in mind for future regex. — ctwheels, Oct 03 '17 at 20:26
@ctwheels You may check [Regexp Quote-Like Operators](https://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators), scroll to `\G assertion` — Wiktor Stribiżew, Oct 03 '17 at 20:28

Wiktor Stribiżew · Accepted Answer · 2017-10-03T20:26:47.690

2

You may use

(?:\G(?!\A).*\R\h*|^>>\s*)\K\S+

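See the regex demo. You will most probably want to pass the i modifier to make the pattern match in a case-insensitive way. If >> is not at the very start of the input, the m (multiline) modifier is also needed so that ^ can match at the start of a line.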

Details

(?:\G(?!\A).*\R\h*|^>>\s*) - match the end of the previous match (\G(?!\A)) and then any 0+ chars other than line break chars, as many as possible (.*), then a line break (\R) and then any 0+ horizontal whitespaces (\h*), or (|) a >> substring at the start of the line and then 0+ whitespaces (\s*)
\K - omit the text matched so far
\S+ - and match and return just 1 or more chars other than whitespace.
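
For readers whose regex engine lacks `\G`, `\K`, `\h`, or `\R` (Python's built-in `re` module is one such engine), a similar result can be approximated in two steps. The sketch below is not the pattern above: it swaps in a different technique, slicing out the text between the first line that starts with `>>` and the next line that starts with `<<`, and then taking the first run of non-whitespace characters on each line. The function name `first_tokens` and the `SAMPLE` string are made up for illustration.

```python
import re

# Hypothetical sample input, modelled on the file layout in the question.
SAMPLE = """>>

    www.google.com  some stuff I don't want
    www.yahoo.com

<<section>>

    www.bing.com
    www.yahoo.com
"""


def first_tokens(text):
    """Return the first token of each line between the first '>>' line
    and the next line that starts with '<<'."""
    # Step 1: find the first line that starts with '>>'.
    start = re.search(r"^>>", text, flags=re.MULTILINE)
    if start is None:
        return []
    rest = text[start.end():]

    # Step 2: cut the text at the first following line that starts with '<<'.
    stop = re.search(r"^<<", rest, flags=re.MULTILINE)
    block = rest if stop is None else rest[:stop.start()]

    # Step 3: on each line, skip leading whitespace (but not line breaks)
    # and capture the first run of non-whitespace characters.
    return re.findall(r"^[^\S\r\n]*(\S+)", block, flags=re.MULTILINE)


print(first_tokens(SAMPLE))
```

One behavioural difference is worth noting: as far as I can tell, the `\G`-based pattern stops at the first line where `\S+` cannot match (a blank line, for example), whereas this sketch keeps collecting tokens until it reaches a line that begins with `<<`.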

edited Oct 03 '17 at 20:26

answered Oct 03 '17 at 20:19


Could you explain why "<>[\s\S]*\z(*SKIP)(*F)" is needed? I took it out and it seems to work the same.
– jakejake Oct 03 '17 at 20:24
Ok, you may omit that part. – Wiktor Stribiżew Oct 03 '17 at 20:26
Some more reference links: [*`\K` match reset* operator](https://www.regular-expressions.info/keep.html) and [*`\G` operator*](https://stackoverflow.com/questions/21971701/when-is-g-useful-application-in-a-regex). – Wiktor Stribiżew Oct 03 '17 at 20:33
