1

I’m having a hard time figuring out the regex code in Google Sheets to check a cell then return everything including new lines \n and returns \r before a certain pattern \*+.

A little more background: I'm using REGEXEXTRACT(A:A,"...") format inside a bigger ArrayFormula so that it automatically updates when a new row is added. This one’s working properly. It’s only the regex part I’m having trouble with.

So, for the purpose of this question, let's say I'm only worried about extracting the data from the A1 cell before a certain pattern and return that value in cell B1. Which brings us to this code in cell B1:

REGEXEXTRACT(A1,"...")

For example, this is how my A1 cell looks like:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus accumsan risus id ex dapibus sodales. 

Curabitur dui lacus, tincidunt vel ligula quis, volutpat mattis eros. 

In quis metus at ex auctor lobortis. Aliquam sed nisi purus. Sed cursus odio erat, ut tristique sapien interdum interdum. Morbi vel sollicitudin ante, non pellentesque libero. 

***********

Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Aenean egestas urna facilisis massa posuere, quis accumsan erat ornare. 

Curabitur at dapibus nibh. Nam nec vestibulum ligula. Phasellus bibendum mi urna, ac hendrerit libero interdum non. Suspendisse semper non elit aliquam auctor. 

Morbi vel sem tortor. Donec a sapien quis erat condimentum consequat in ut sem. Quisque in tellus sed est lobortis ultricies sed vitae enim.

I want to return this value in B1:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus accumsan risus id ex dapibus sodales. 

Curabitur dui lacus, tincidunt vel ligula quis, volutpat mattis eros. 

In quis metus at ex auctor lobortis. Aliquam sed nisi purus. Sed cursus odio erat, ut tristique sapien interdum interdum. Morbi vel sollicitudin ante, non pellentesque libero. 

Which is basically anything before the pattern *******. In Python, I can add the re.DOTALL to the .* but I can't get this to work in Google Sheets.

TheMaster
  • 32,296
  • 6
  • 31
  • 56
aylim14
  • 49
  • 8

4 Answers4

1

To make a dot match line breaks, you need to add (?s) to the pattern. To match any char, you may use a .. To match up to the leftmost occurrence, use lazy quantifier, *?. To actually extract a substring you need, wrap the part of the pattern you are interested in getting with capturing parentheses.

So, to match up to the first ******* substring, you may use

(?s)^(.*?)\*\*\*\*\*\*\*

or (?s)^(.*?)\*{7}. See the regex demo (note that Go regex engine is also RE2, so you may test your patterns there, at regex101.com).

  • (?s) - a DOTALL modifier
  • ^ - start of string
  • (.*?) - Group 1: any 0+ chars as few as possible
  • \*\*\*\*\*\*\* - 7 literal asterisk symbols.

Note you cannot rely on a negated character class (that matches line breaks) if your substring may contain * chars, that is, ^([^*]*)\*\*\*\*\*\*\* won't work in those cases.

If you just want to match any chars up to the first * in the string, your regex will simplify greatly to

^([^*]+)

It matches

  • ^ - start of string
  • ([^*]+) - Capturing group 1: one or more chars other than *.
Wiktor Stribiżew
  • 484,719
  • 26
  • 302
  • 397
  • Thanks for this! I’ll try out the longer code later. So far, what I have is the short code `[^*]+` and looks like it’s working. But as you said, if there’s a * anywhere inside, it’ll break. So far, it looks like my text only uses it for breaks and not inside the copy. BUT I’ll continue to monitor and change this if i need to. Thanks again for the explanation. Really helps that you explained it part by part. – aylim14 Sep 09 '18 at 00:22
1

re.DOTALL flag in python corresponds to (?s) single line mode flag in re2.

Python:

(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

re2:

Flags: s let . match \n (default false)

So,

=REGEXEXTRACT(A1,"(?s)(.*?)\*")

This corresponds to re.findall()

Community
  • 1
  • 1
TheMaster
  • 32,296
  • 6
  • 31
  • 56
0

Not regex though might suit someone wanting the same result but less particular about the method:

  =ArrayFormula(LEFT(A1:A,Find("***********",A1:A)-3))
pnuts
  • 54,806
  • 9
  • 74
  • 122
0

If you really only want to match everything before the first *:

=REGEXEXTRACT(A1;"[^*]*")

If you want to allow a single star in the text and only stop at multiple (2 or more) stars (possibly divided by newlines) at the beginning of a line, you could try:

=REGEXEXTRACT(A1;"(?s)^(.*)\n(\*\n?){2,}")

But you would have to strip the stars. E.g.

=REGEXREPLACE(REGEXEXTRACT(A1;"(?s)^(.*)\n(\*\n?){2,}"); "\n(\*\n?){2,}"; "")

A lookahead does not seem to work in Google Sheets.

morja
  • 7,779
  • 2
  • 31
  • 51