How to match typical tri-phone using regex?

Question

For example, Chinese has the following vowel and consonant phonemes:

vowels  = ['a', 'ai', 'an', 'ang', 'ao', 'e', 'ei', 'en', 'eng', 'er', 'i', 'ia', 'ian', 'iang', 'iao', 'ie', 'ii', 'iii', 'in', 'ing', 'iong', 'iou', 'o', 'ong', 'ou', 'u', 'ua', 'uai', 'uan', 'uang', 'uei', 'uen', 'ueng', 'uo', 'v', 'van', 've', 'vn', 'zh']

consonants = ['b','c','ch', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 'sh', 'sp', 'sil', 't', 'x', 'z']

Suppose I have tri-phones written like this:

The tri-phone 'a-b+c' means that the previous, current, and following phonemes are a, b, and c.

I want to use a regex to find tri-phones with adjacent vowels, that is, patterns of the form vowel-vowel+* and *-vowel+vowel.

For example

Match: zh-uei+x, b-ai+vn, e-uang+x

Don't match: sil-z+ai, vn-l+v, x-ia+f

I use this code:

v = '|'.join(vowels)           # Or v = '^'+'|'.join(consonants)
p = r'({0}\-{0}\+.*)|(.*\-{0}\+{0})'.format(v)

However, re.match(p, 'z-en+iang') still finds no match. How can I fix this? Thanks.

To help you get an answer faster, you might want to edit your question so that a programmer with no experience in linguistics can answer it. It seems like it should be an easy solution, but it isn't totally clear what you're asking without looking up all the jargon. — 3ocene, Dec 13 '18 at 02:06
`\1` in regex matches exactly the contents (not the pattern) of group 1. — Michael Butscher, Dec 13 '18 at 02:14
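
The point about `\1` can be shown with a small sketch. The two-symbol alphabet below is made up for illustration and is not the phoneme list from the question: a backreference requires the same text the group captured, while writing the alternation out again accepts any of its alternatives.

```python
import re

# Hypothetical two-symbol alphabet, only for illustration.
backref = re.compile(r'(a|b)-\1')        # \1 must repeat the captured text
repeated = re.compile(r'(a|b)-(?:a|b)')  # the alternation is written out again

print(bool(backref.fullmatch('a-a')))    # True: same text on both sides
print(bool(backref.fullmatch('a-b')))    # False: the text differs
print(bool(repeated.fullmatch('a-b')))   # True: either alternative is accepted
```

So a pattern meant to say "vowel, then another vowel" needs the vowel alternation repeated, not a backreference to it.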

Mohamed Benkedadra · Accepted Answer · 2018-12-13T03:05:07.023

import re

vowels  = ['a', 'ai', 'an', 'ang', 'ao', 'e', 'ei', 'en', 'eng', 'er', 'i', 'ia', 'ian', 'iang', 'iao', 'ie', 'ii', 'iii', 'in', 'ing', 'iong', 'iou', 'o', 'ong', 'ou', 'u', 'ua', 'uai', 'uan', 'uang', 'uei', 'uen', 'ueng', 'uo', 'v', 'van', 've', 'vn', 'zh']
consonants = ['b','c','ch', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 'sh','sp', 'sil', 't', 'x', 'z']
# joining vowels with |
vowels_string = '|'.join(vowels)
# joining consonants with |
consonants_string = '|'.join(consonants)

# joining all characters with |
all_chars = "{}|{}".format(vowels_string, consonants_string)

# raw strings keep the \+ escape intact and avoid invalid-escape warnings
reg1 = r'^(?:{1})-(?:{0})\+(?:{0})$'.format(vowels_string, all_chars) # any phoneme - vowel + vowel
reg2 = r'^(?:{0})-(?:{0})\+(?:{1})$'.format(vowels_string, all_chars) # vowel - vowel + any phoneme

# compiling the regex
regex = re.compile(
    '({})|({})'.format(reg1, reg2)
)

# testing
print(re.match(regex, 'zh-uei+x'))
print(re.match(regex, 'b-ai+vn'))
print(re.match(regex, 'e-uang+x'))
print(re.match(regex, 'z-en+iang'))

print(re.match(regex, 'sil-z+ai'))
print(re.match(regex, 'vn-l+v'))
print(re.match(regex, 'x-ia+f'))

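vowels_string contains all vowels joined with | (regex alternation, i.e. "or").
consonants_string contains all consonants joined with |.
all_chars contains every phoneme (vowels and consonants) joined with |.

The first regex, reg1, breaks down as follows ({1} is all_chars and {0} is vowels_string):

^        -> beginning of string
(?:{1})  -> any phoneme
-        -> literal hyphen
(?:{0})  -> a vowel
\+       -> literal plus sign (escaped, because + is a regex quantifier)
(?:{0})  -> a vowel
$        -> end of string

reg2 is the same with the first and last groups swapped: a vowel, a vowel, then any phoneme.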
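
The grouping in this breakdown is what the pattern in the question was missing: `|` binds more loosely than concatenation, so an alternation without parentheses splits the whole pattern instead of just one slot. A minimal sketch, using a shortened vowel list that is only an assumption for the demo:

```python
import re

# Shortened list for illustration only; the full lists are defined above.
v = '|'.join(['en', 'iang', 'ai'])

# Without grouping this reads as: 'en' OR 'iang' OR 'ai-en' OR 'iang' OR 'ai'
ungrouped = re.compile(r'{0}-{0}'.format(v))
# With grouping, each side of the hyphen is one complete alternation.
grouped = re.compile(r'(?:{0})-(?:{0})'.format(v))

print(ungrouped.pattern)                      # en|iang|ai-en|iang|ai
print(bool(ungrouped.fullmatch('en-iang')))   # False
print(bool(grouped.fullmatch('en-iang')))     # True
```

Wrapping each `{0}` in `(?:...)` (or plain parentheses) is therefore required for correctness, not just for style.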

What is the purpose of `?:`? If I delete it, the result changes — partida, Dec 13 '18 at 03:03
@partida check out this post for ?: https://stackoverflow.com/questions/36524507/notation-in-regular-expression — Mohamed Benkedadra, Dec 13 '18 at 03:06
I found that `?:` means a non-capturing group. It seems it doesn't affect the result and only uses less memory? — partida, Dec 13 '18 at 03:31
@partida it sometimes affects the results, check out this answer: https://stackoverflow.com/a/3513858/6147182 ... Matching works the same with a non-capturing group (?:) as with a normal capturing group (), but the text matched by a non-capturing group is not returned in the results. I just added them out of habit; if you need the matched text returned, remove the ?: — Mohamed Benkedadra, Dec 13 '18 at 03:35
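
To make the difference discussed in these comments concrete, here is a small sketch with a made-up mini pattern (not the full phoneme regex). Both versions match the same strings; what differs is what `groups()` and `findall` hand back.

```python
import re

# Hypothetical mini pattern: a consonant, a hyphen, then a vowel.
capturing = re.compile(r'(b|z)-(ai|en)')
non_capturing = re.compile(r'(?:b|z)-(?:ai|en)')

m1 = capturing.fullmatch('z-en')
m2 = non_capturing.fullmatch('z-en')

print(m1.group(0), m1.groups())   # z-en ('z', 'en')
print(m2.group(0), m2.groups())   # z-en ()

# findall returns tuples of groups if there are any, otherwise whole matches.
print(capturing.findall('z-en b-ai'))      # [('z', 'en'), ('b', 'ai')]
print(non_capturing.findall('z-en b-ai'))  # ['z-en', 'b-ai']
```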
