Insert enter marks before the selected word

Question

I need to insert line breaks (enter marks) into a string, one before each new word starts.

String:

test (n) trial, experiment, check run (v) race, rush speed (n) race, sprint, rush, dash, zoom

Expected:

test (n) trial, experiment, check 
run (v) race, rush 
speed (n) race, sprint, rush, dash, zoom

This regular expression selects the word before the parenthesis. But how do I insert an enter mark at the right place?

\w+(?=\s+(\(.*?\)))

Update:

The answer does not apply to the actual string that I need to process. Are Unicode strings treated differently by regular expressions?

import re

regex = r"(\w+)(?= (?:[()])).*?"

test_str = "खत (स्त्री) पाहा : भेट मुलगा (पु) पोर‚ पोरगा‚ पोरटा‚  कारटा‚ किशोर‚ कुमार‚ कुमारिका‚ तरुण; लग्नाचा/उपवधू मुलगा; पाहा : पुत्र ‚ पुरुष (n) boy, lad, kid, urchin; पाहा : पुत्र ‚ पुरुष मुलगी (स्त्री) पोर‚ पोरगी‚ पोरटी‚ बाला‚ बाळा‚ बालिका‚ छोकरी‚ छोटी‚ बेटी‚ कारटी‚ नग्निका‚"

subst = "\\n\\1"

result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)

if result:
    print(result)

The first line break is correct ("\nखत"), but the second one is misplaced at "पुरु\nष ". The third and fourth are missing.

Expected:

खत (स्त्री) पाहा : भेट 
मुलगा (पु) पोर‚ पोरगा‚ पोरटा‚  कारटा‚ किशोर‚ कुमार‚ कुमारिका‚ तरुण; लग्नाचा/उपवधू मुलगा; पाहा : पुत्र ‚ 
पुरुष (n) boy, lad, kid, urchin; पाहा : पुत्र ‚ पुरुष 
मुलगी (स्त्री) पोर‚ पोरगी‚ पोरटी‚ बाला‚ बाळा‚ बालिका‚ छोकरी‚ छोटी‚ बेटी‚ कारटी‚ नग्निका‚

Please explain how you are identifying a new word. – Jun 16 '20 at 05:08
The word before the parenthesis is a new word. – shantanuo, Jun 16 '20 at 05:19
Does [**this**](https://regex101.com/r/dXfQsC/1) help? – Jun 16 '20 at 05:23
Updated my question with a Unicode string example. – shantanuo, Jun 16 '20 at 05:38

score 1 · Answer 1 · answered Jun 16 '20 at 05:19

Here's a regex-replace statement that does that:

import re

text = "test (n) trial, experiment, check run (v) race, rush speed (n) race, sprint, rush, dash, zoom"

re.sub(r"(\w+ \(\w\))", r"\n\1", text)

Output:

'\ntest (n) trial, experiment, check \nrun (v) race, rush \nspeed (n) race, sprint, rush, dash, zoom'

When printed, it provides:

test (n) trial, experiment, check 
run (v) race, rush 
speed (n) race, sprint, rush, dash, zoom

JvdV · Accepted Answer · 2020-06-16T08:43:09.990

You may try:

\s+([^(\s]+\s+(?=\(.*?\)))


\s+ - One or more whitespace characters. Consuming them here prevents trailing spaces later on.
( - Opens the 1st capturing group.
[^(\s]+ - Negated character class: one or more characters that are neither an opening parenthesis nor whitespace.
\s+ - One or more whitespace characters.
(?=\(.*?\)) - Positive lookahead for a literal opening parenthesis, any characters other than newline (zero or more, lazy) and a literal closing parenthesis.
) - Closes the 1st capturing group.
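
To make the breakdown above concrete, here is a minimal sketch that applies this pattern with the standard `re` module. The helper name `break_before_headwords` and the shortened Marathi sample are only for illustration. Because the pattern relies on "not whitespace, not an opening parenthesis" rather than on `\w`, it should not care which script the words are written in.

```python
import re

# Pattern from above: leading whitespace, then a token that is followed by
# whitespace and a parenthesised tag (checked by the lookahead, not consumed).
PATTERN = re.compile(r"\s+([^(\s]+\s+(?=\(.*?\)))")


def break_before_headwords(text):
    """Replace the whitespace in front of each headword with a newline."""
    return PATTERN.sub(r"\n\1", text)


english = "test (n) trial, experiment, check run (v) race, rush speed (n) race, sprint, rush, dash, zoom"
marathi = "खत (स्त्री) पाहा : भेट मुलगा (पु) पोर"  # shortened from the question

print(break_before_headwords(english))
print(break_before_headwords(marathi))
```

The first word of the string is left alone because nothing matches `\s+` in front of it, so the result does not start with an empty line.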

As an alternative, you could use the regex module instead of re and try this pattern:

\s+((?<=\s+)[\p{Devanagari}\p{L}]+(?=\s*\(.*))

\s+( - One or more whitespace characters, then the opening of the 1st capturing group. Consuming the whitespace prevents trailing spaces once we put in newlines later on.
(?<=\s+) - Positive lookbehind for whitespace, so that the word at the very start of the string is not matched.
[\p{Devanagari}\p{L}]+ - Character class: one or more characters that are Devanagari or a letter from any language.
(?=\s*\(.*) - Positive lookahead for optional whitespace, a literal opening parenthesis, then zero or more characters other than newline.
) - Closes the 1st capturing group.

Python Code:

import regex
test_str = "खत (स्त्री) पाहा : भेट मुलगा (पु) पोर‚ पोरगा‚ पोरटा‚  कारटा‚ किशोर‚ कुमार‚ कुमारिका‚ तरुण; लग्नाचा/उपवधू मुलगा; पाहा : पुत्र ‚ पुरुष (n) boy, lad, kid, urchin; पाहा : पुत्र ‚ पुरुष मुलगी (स्त्री) पोर‚ पोरगी‚ पोरटी‚ बाला‚ बाळा‚ बालिका‚ छोकरी‚ छोटी‚ बेटी‚ कारटी‚ नग्निका‚"
str_new = regex.sub(r'\s+((?<=\s+)[\p{Devanagari}\p{L}]+(?=\s*\(.*))', r'\n\1', test_str)

print(str_new)

Prints:

खत (स्त्री) पाहा : भेट
मुलगा (पु) पोर‚ पोरगा‚ पोरटा‚  कारटा‚ किशोर‚ कुमार‚ कुमारिका‚ तरुण; लग्नाचा/उपवधू मुलगा; पाहा : पुत्र ‚
पुरुष (n) boy, lad, kid, urchin; पाहा : पुत्र ‚ पुरुष
मुलगी (स्त्री) पोर‚ पोरगी‚ पोरटी‚ बाला‚ बाळा‚ बालिका‚ छोकरी‚ छोटी‚ बेटी‚ कारटी‚ नग्निका‚
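
As for why the `re` attempt in the question breaks पुरुष in the middle: in Python's `re`, `\w` covers alphanumeric characters and the underscore, but not combining marks such as the Devanagari vowel signs, so `\w+` stops at each of them. The sketch below shows this with the standard library only. `is_word_char` is a hypothetical helper, not part of any library, and is just one possible way to treat marks as word characters.

```python
import re
import unicodedata

word = "पुरुष"  # one of the headwords from the question

# General category per code point: "Lo" is a letter, "Mn" a nonspacing mark.
categories = [unicodedata.category(ch) for ch in word]
print(categories)

# \w skips the combining vowel signs, so the word comes back in pieces.
pieces = re.findall(r"\w+", word)
print(pieces)


def is_word_char(ch):
    """Hypothetical helper: alphanumerics plus any combining mark (M*)."""
    return ch.isalnum() or unicodedata.category(ch).startswith("M")


print(all(is_word_char(ch) for ch in word))
```

This is the gap that `\p{Devanagari}` in the regex module fills, since it matches by script and therefore includes the marks.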

