
Pattern 1: Delimited by |

    Input : a|b|c|d
    Output: a|b|c|d

Pick everything when the string is delimited only by single pipes.

Pattern 2: Delimited by | and ||

Example 1:

    Input : a|b||c||d
    Output: a|b||c

Pick everything before the last double pipe.

Example 2:

    Input : a|b||c|d
    Output: a|b

Pattern 3: The beginning of the string can have multiple pipes (an odd or even number), and the rest is further delimited by | and ||

    Input : |||a|b||c||d
    Output: |||a|b||c

Pick everything before the last double pipe. The beginning of the string might have an odd or even number of pipes, and they must be selected.

The query below covers every scenario except scenario 1. My requirement is to cover all scenarios in one regexp_extract:

    spark.sql("select regexp_extract('name|place|thing|ink', '(.*)(?=\\\\|\\\\|)') as demo").show(false)
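
To illustrate why the lookahead-only pattern misses scenario 1, here is a minimal sketch. It uses Python's `re` module as a stand-in for the Java regex engine behind Spark (an assumption; the lookahead syntax used here should behave the same in both). The helper `extract_group1` is hypothetical: it imitates `regexp_extract` by returning an empty string when nothing matches.

```python
import re

# The pattern from the query above, as the regex engine sees it
# once all string-literal escaping has been removed.
LOOKAHEAD_ONLY = re.compile(r"(.*)(?=\|\|)")


def extract_group1(pattern, text):
    """Hypothetical stand-in for regexp_extract(text, pattern, 1):
    return group 1 of the first match, or "" when there is no match."""
    match = pattern.search(text)
    return match.group(1) if match else ""


samples = ["a|b||c||d", "a|b||c|d", "|||a|b||c||d", "name|place|thing|ink"]
for text in samples:
    print(f"{text!r} -> {extract_group1(LOOKAHEAD_ONLY, text)!r}")
```

The lookahead `(?=\|\|)` requires a double pipe to exist somewhere, so a string with only single pipes produces no match at all, which is the scenario 1 gap.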

If it cannot be done in one regexp_extract, can you suggest other options?

Please advise.

SeaBean
Priya
  • Please _format_ your question so that the code and data are monospaced. Add four spaces or more on each line of code/data. – Tim Biegeleisen Feb 26 '21 at 06:07
  • Welcome to SO, Priya. Your question is quite interesting. Please see my answer below. Please also follow the guidelines for properly formatting the code and data in the question, and consider [accepting an answer](https://stackoverflow.com/help/someone-answers) if you find it fulfills your requirement. Let me know if any clarification is required. – SeaBean Feb 26 '21 at 10:29

1 Answer


Use the following RegEx:

^(\|*(?:(?!\|\|(?!.*\|\|)).)*)

See the RegEx Demo showing all the matches
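
As a quick check that does not depend on the demo link, the following sketch runs the pattern against the four samples from the question. It uses Python's `re` module rather than Spark (an assumption made only to keep the check self-contained); the name `LAST_DOUBLE_PIPE` is purely illustrative.

```python
import re

# The answer's pattern, written as a raw string so that each \| reaches
# the regex engine unchanged.
LAST_DOUBLE_PIPE = re.compile(r"^(\|*(?:(?!\|\|(?!.*\|\|)).)*)")

# (input, expected output) pairs copied from the question.
cases = [
    ("a|b|c|d", "a|b|c|d"),            # Pattern 1
    ("a|b||c||d", "a|b||c"),           # Pattern 2, example 1
    ("a|b||c|d", "a|b"),               # Pattern 2, example 2
    ("|||a|b||c||d", "|||a|b||c"),     # Pattern 3
]

for text, expected in cases:
    got = LAST_DOUBLE_PIPE.match(text).group(1)
    print(f"{text!r} -> {got!r} (expected {expected!r})")
    assert got == expected
```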

This is a rather complicated requirement. It calls for a Tempered Greedy Token with a Negative Lookahead inside the tempering pattern. Let me explain the logic below:

Logic

  • ^ to match only from the beginning of string
  • (...) encloses the entire pattern after ^ to make it a capturing group
  • \|* handles Pattern 3: it matches the leading | characters, as many as possible (hence the greedy *)
  • (?:(?!...).)* is the main construct (skeleton) of the Tempered Greedy Token, whose details I will explain below:
  • \|\|(?!.*\|\|) is the main body (core) of the Tempered Greedy Token, placed inside the (?!...) of the skeleton. The first part, \|\|, matches a double pipe ||. The second part, (?!.*\|\|), requires that this || is not followed by another || anywhere later in the string. Together they identify the last ||, so the token consumes characters up to, but not including, the last double pipe, as per the requirement.
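
The breakdown above can be sketched as an annotated pattern. This is an illustration in Python's `re.VERBOSE` mode, not the form you would pass to Spark; the names `ANNOTATED`, `COMPACT`, and `NO_LEADING_PIPES` are hypothetical, and the input `||a|b` is an extra sample that does not appear in the question.

```python
import re

ANNOTATED = re.compile(
    r"""
    ^                   # match only from the beginning of the string
    (                   # capturing group 1: the part to keep
      \|*               # leading pipes, as many as possible
      (?:               # tempered greedy token, one character per step
        (?!             #   do not step forward when standing on ...
          \|\|          #     a double pipe ...
          (?!.*\|\|)    #     with no other double pipe after it
        )
        .               #   otherwise consume one character
      )*                # repeat while the guard allows it
    )
    """,
    re.VERBOSE,
)

COMPACT = re.compile(r"^(\|*(?:(?!\|\|(?!.*\|\|)).)*)")

# Same pattern with the leading \|* removed, to show what that part contributes.
NO_LEADING_PIPES = re.compile(r"^((?:(?!\|\|(?!.*\|\|)).)*)")

for text in ["a|b||c||d", "|||a|b||c||d", "||a|b"]:
    assert ANNOTATED.match(text).group(1) == COMPACT.match(text).group(1)
    print(
        f"{text!r}: with leading-pipe part -> {COMPACT.match(text).group(1)!r}, "
        f"without -> {NO_LEADING_PIPES.match(text).group(1)!r}"
    )
```

Without `\|*`, a string whose only double pipe is at the very start would stop immediately, because the guard treats that leading `||` as the last double pipe.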

In fact, I think the question is quite interesting and requires sophisticated RegEx features to support it. This is also the first example I have seen so far that requires a Negative Lookahead within a Tempered Greedy Token construct.

SeaBean
  • Thanks! The above solution works in Hive, but Spark SQL throws the error "Dangling metacharacter * at index 3". Since * is a quantifier, I think we should not escape it. I tried escaping it with \\* but the dangling metacharacter error still occurred. – Priya Feb 26 '21 at 15:02
  • Sorry, I'm not familiar with Spark SQL. Anyway, the regex you tried, `(.*)(?=\\\\|\\\\|)`, also has `*` in it, so I guess the problem isn't really the `*` quantifier. My wild guess is that it is the backslash \ used for escaping symbols. You used 4 \ to escape the `|` symbol, while I used only one \, which is accepted in other RegEx environments when properly quoted. Could you try adding more \ to the above regex and try again? – SeaBean Feb 26 '21 at 15:44
  • Hi @Priya, while you are customizing the regex to make it work under Spark SQL, I have further fine-tuned it to cover some more sample cases not mentioned in your question, e.g. `||||a|b||c|a||f`, where the double-pipe segments at the end are not necessarily adjacent (there can be single-pipe segments in between). Please use this enhanced regex for your work. – SeaBean Feb 26 '21 at 16:58
  • @SeaBean For Spark SQL I used the approach in this post: https://stackoverflow.com/questions/66388303/dangling-metacharacter-sparksql. Nevertheless, I appreciate your help. I am new to regex, so it would be great if you could share pointers on learning it. – Priya Feb 26 '21 at 19:49
  • @Priya It's great that you found a solution from someone who knows Spark SQL. The person who answered you is also a guru in RegEx. – SeaBean Feb 26 '21 at 19:53
  • @Priya You're welcome! It was my pleasure to help, and I also learned a lot while answering. To learn regex, you can read online guides such as [rexegg.com/](http://www.rexegg.com/). You can also read the regex posts here on StackOverflow, especially those answered by Wiktor Stribiżew, who helped you in the other post. The concept I used in this answer was actually learned from some of his posts. You can also study the [SO RegEx FAQ](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean/22944075#22944075). – SeaBean Feb 26 '21 at 21:06
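
One plausible reading of the "Dangling metacharacter" error reported in the comments (not confirmed in this thread) is that each pipe lost its backslash before the regex engine saw the pattern, leaving `^(|*...` with a `*` at index 3 that has nothing to repeat. The sketch below models this under a loud assumption: the pattern passes through two string-literal layers (the host-language literal, then the SQL literal), and each layer collapses a backslash plus any character into that character. The functions `unescape_once`, `with_backslashes`, and `after_two_layers` are hypothetical helpers, and Python's `re` stands in for the Java regex engine.

```python
import re


def unescape_once(literal):
    """Simplified, hypothetical model of one string-literal layer:
    a backslash followed by any character collapses to that character."""
    return re.sub(r"\\(.)", lambda m: m.group(1), literal)


# What the regex engine must finally receive (the answer's pattern).
TARGET = r"^(\|*(?:(?!\|\|(?!.*\|\|)).)*)"


def with_backslashes(count):
    """Source text in which every pipe escape is typed with `count` backslashes."""
    return TARGET.replace("\\|", "\\" * count + "|")


def after_two_layers(source_text):
    # Layer 1: host-language string literal; layer 2: SQL string literal.
    return unescape_once(unescape_once(source_text))


two = after_two_layers(with_backslashes(2))    # the pipes lose their escapes
four = after_two_layers(with_backslashes(4))   # one backslash per pipe survives

print("typed with 2 backslashes ->", two)
print("typed with 4 backslashes ->", four)

try:
    re.compile(two)
except re.error as err:
    # Python words the error differently from Java, but it rejects the same '*'.
    print("rejected:", err)
```

Under this model, four backslashes per pipe in the source text are needed for the engine to receive the answer's pattern intact, which is consistent with the four backslashes used in the question's own query.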