
I need to insert line breaks (newlines) into a string before each new entry word starts.

String:

test (n) trial, experiment, check run (v) race, rush speed (n) race, sprint, rush, dash, zoom

Expected:

test (n) trial, experiment, check 
run (v) race, rush 
speed (n) race, sprint, rush, dash, zoom

This regular expression selects the word before the parenthesis. But how do I insert a line break at the right place?

\w+(?=\s+((.*?)))


Update:

The answers do not work on the actual string that I need to process. Are Unicode strings treated differently by regular expressions?

import re

regex = r"(\w+)(?= (?:[()])).*?"

test_str = "खत (स्त्री) पाहा : भेट मुलगा (पु) पोर‚ पोरगा‚ पोरटा‚  कारटा‚ किशोर‚ कुमार‚ कुमारिका‚ तरुण; लग्नाचा/उपवधू मुलगा; पाहा : पुत्र ‚ पुरुष (n) boy, lad, kid, urchin; पाहा : पुत्र ‚ पुरुष मुलगी (स्त्री) पोर‚ पोरगी‚ पोरटी‚ बाला‚ बाळा‚ बालिका‚ छोकरी‚ छोटी‚ बेटी‚ कारटी‚ नग्निका‚"

subst = "\\n\\1"

result = re.sub(regex, subst, test_str , 0, re.MULTILINE)

if result:
    print (result)

The first line break is correct ("\nखत"), but the second one is misplaced at "पुरु\nष ". The third and fourth are missing.

Expected:

खत (स्त्री) पाहा : भेट 
मुलगा (पु) पोर‚ पोरगा‚ पोरटा‚  कारटा‚ किशोर‚ कुमार‚ कुमारिका‚ तरुण; लग्नाचा/उपवधू मुलगा; पाहा : पुत्र ‚ 
पुरुष (n) boy, lad, kid, urchin; पाहा : पुत्र ‚ पुरुष 
मुलगी (स्त्री) पोर‚ पोरगी‚ पोरटी‚ बाला‚ बाळा‚ बालिका‚ छोकरी‚ छोटी‚ बेटी‚ कारटी‚ नग्निका‚
shantanuo

2 Answers


Here's a regex-replace statement that does that:

import re

text = "test (n) trial, experiment, check run (v) race, rush speed (n) race, sprint, rush, dash, zoom"

re.sub(r"(\w+ \(\w\))", r"\n\1", text)

Output:

'\ntest (n) trial, experiment, check \nrun (v) race, rush \nspeed (n) race, sprint, rush, dash, zoom'

When printed, it provides:

test (n) trial, experiment, check 
run (v) race, rush 
speed (n) race, sprint, rush, dash, zoom
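
A small variation on the statement above, offered as a sketch rather than as part of the original answer: `\w+` inside the parentheses also accepts tags longer than one letter (an assumption about the data), and `lstrip("\n")` drops the newline that would otherwise precede the first entry.

```python
import re

text = "test (n) trial, experiment, check run (v) race, rush speed (n) race, sprint, rush, dash, zoom"

# \w+ inside the parentheses accepts tags longer than one letter (assumption).
# lstrip("\n") removes the newline inserted before the very first entry.
result = re.sub(r"(\w+ \(\w+\))", r"\n\1", text).lstrip("\n")

print(result)
```

Note that the space before each inserted newline is kept, so every line except the last ends with a trailing space.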
Roy2012

You may try:

\s+([^(\s]+\s+(?=\(.*?\)))

See the Online Demo


  • \s+ - A whitespace character, one or more times. Consuming it here prevents trailing spaces before the inserted newline.
  • ( - Opening 1st capturing group.
  • [^(\s]+ - Negated character class: any character that is not an opening parenthesis or whitespace, one or more times.
  • \s+ - A whitespace character, one or more times.
  • (?=\(.*?\)) - Positive lookahead for a literal opening parenthesis, any character other than newline zero or more times (lazy), and a literal closing parenthesis.
  • ) - Closing 1st capturing group.
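
The pattern above can be applied with Python's built-in `re` module roughly as sketched below. The replacement string `\n\1` and the shortened Marathi sample (cut from the string in the question) are assumptions for illustration. Because `[^(\s]+` accepts any character that is neither whitespace nor an opening parenthesis, it does not depend on `\w` and should keep Devanagari words intact.

```python
import re

# Pattern from the answer: whitespace, then a captured token plus its
# trailing whitespace, provided a parenthesised tag follows.
pattern = r"\s+([^(\s]+\s+(?=\(.*?\)))"

english = "test (n) trial, experiment, check run (v) race, rush speed (n) race, sprint, rush, dash, zoom"
# Shortened sample taken from the question's string (assumption: enough to show the idea).
marathi = "खत (स्त्री) पाहा : भेट मुलगा (पु) पोर‚ पोरगा‚ पाहा : पुत्र ‚ पुरुष (n) boy, lad"

# Replace the leading whitespace with a newline and keep the captured token.
english_out = re.sub(pattern, r"\n\1", english)
marathi_out = re.sub(pattern, r"\n\1", marathi)

print(english_out)
print(marathi_out)
```

The first entry is left alone because the pattern requires whitespace before the token, so no leading newline is produced.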

As an alternative, you could use the regex module instead of re with this pattern:

\s+((?<=\s+)[\p{Devanagari}\p{L}]+(?=\s*\(.*))

  • \s+( - One or more whitespace characters, then the opening of the 1st capture group. Consuming the whitespace here prevents trailing spaces once the newlines are inserted.
  • (?<=\s+) - Positive lookbehind for whitespace (so that the word at the start of the string is not matched).
  • [\p{Devanagari}\p{L}]+ - Character class: one or more Devanagari characters or letters from any script.
  • (?=\s*\(.*) - Positive lookahead for optional whitespace, a literal opening parenthesis, and zero or more characters other than newline.
  • ) - Close 1st capture group.
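
A likely reason the `\w`-based pattern in the question breaks words such as पुरुष: in the built-in `re` module, `\w` follows `str.isalnum()`, which excludes combining marks (Unicode categories Mn and Mc) such as Devanagari vowel signs. The short check below illustrates this with one word from the sample; it is a diagnostic sketch, not part of the original answer.

```python
import re
import unicodedata

# The word मुलगा spelled out by code point, so the combining marks are visible.
word = "\u092e\u0941\u0932\u0917\u093e"

# Show, per character, its Unicode category and whether re's \w matches it.
for ch in word:
    is_word_char = re.fullmatch(r"\w", ch) is not None
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} matched by \\w: {is_word_char}")

# \w+ stops at each combining mark, so the word is split into fragments.
pieces = re.findall(r"\w+", word)
print(pieces)
```

The vowel signs (categories Mn and Mc) are not matched, so `\w+` sees fragments rather than the whole word, which is consistent with the misplaced and missing line breaks reported in the question.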

Python Code:

import regex
test_str = "खत (स्त्री) पाहा : भेट मुलगा (पु) पोर‚ पोरगा‚ पोरटा‚  कारटा‚ किशोर‚ कुमार‚ कुमारिका‚ तरुण; लग्नाचा/उपवधू मुलगा; पाहा : पुत्र ‚ पुरुष (n) boy, lad, kid, urchin; पाहा : पुत्र ‚ पुरुष मुलगी (स्त्री) पोर‚ पोरगी‚ पोरटी‚ बाला‚ बाळा‚ बालिका‚ छोकरी‚ छोटी‚ बेटी‚ कारटी‚ नग्निका‚"
str_new = regex.sub(r'\s+((?<=\s+)[\p{Devanagari}\p{L}]+(?=\s*\(.*))', r'\n\1', test_str)

print(str_new)

Prints:

खत (स्त्री) पाहा : भेट
मुलगा (पु) पोर‚ पोरगा‚ पोरटा‚  कारटा‚ किशोर‚ कुमार‚ कुमारिका‚ तरुण; लग्नाचा/उपवधू मुलगा; पाहा : पुत्र ‚
पुरुष (n) boy, lad, kid, urchin; पाहा : पुत्र ‚ पुरुष
मुलगी (स्त्री) पोर‚ पोरगी‚ पोरटी‚ बाला‚ बाळा‚ बालिका‚ छोकरी‚ छोटी‚ बेटी‚ कारटी‚ नग्निका‚

Python Demo

JvdV