
For example, Chinese has the following vowel and consonant phonemes:

vowels  = ['a', 'ai', 'an', 'ang', 'ao', 'e', 'ei', 'en', 'eng', 'er', 'i', 'ia', 'ian', 'iang', 'iao', 'ie', 'ii', 'iii', 'in', 'ing', 'iong', 'iou', 'o', 'ong', 'ou', 'u', 'ua', 'uai', 'uan', 'uang', 'uei', 'uen', 'ueng', 'uo', 'v', 'van', 've', 'vn', 'zh']

consonants = ['b', 'c', 'ch', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 'sh', 'sp', 'sil', 't', 'x', 'z']

Suppose I have tri-phones of the form 'a-b+c', where a is the previous phoneme, b is the current phoneme, and c is the following phoneme.


I want to use a regex to extract the tri-phones that contain adjacent vowels, i.e. the patterns vowel-vowel+* and *-vowel+vowel.

For example:

Match: zh-uei+x, b-ai+vn, e-uang+x

Don't match: sil-z+ai, vn-l+v, x-ia+f

I use this code:

v = '|'.join(vowels)           # Or v = '^'+'|'.join(consonants)
p = r'({0}\-{0}\+.*)|(.*\-{0}\+{0})'.format(v)

However, re.match(p, 'z-en+iang') still fails to match, although it should. How can I fix this? Thanks.

partida
  • To help you get an answer faster, you might want to edit your question so that a programmer with no experience in linguistics can answer it. It seems like it should have an easy solution, but it isn't totally clear what you're asking without looking up all the jargon. – 3ocene Dec 13 '18 at 02:06
  • @3ocene ok I will improve my question. Thank you – partida Dec 13 '18 at 02:10
  • `\1` in regex matches exactly the contents (not the pattern) of group 1. – Michael Butscher Dec 13 '18 at 02:14
  • @MichaelButscher Thanks for pointing that out, I will try another way. – partida Dec 13 '18 at 02:24

1 Answer

import re

vowels = ['a', 'ai', 'an', 'ang', 'ao', 'e', 'ei', 'en', 'eng', 'er', 'i', 'ia', 'ian', 'iang', 'iao', 'ie', 'ii', 'iii', 'in', 'ing', 'iong', 'iou', 'o', 'ong', 'ou', 'u', 'ua', 'uai', 'uan', 'uang', 'uei', 'uen', 'ueng', 'uo', 'v', 'van', 've', 'vn', 'zh']
consonants = ['b', 'c', 'ch', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 'sh', 'sp', 'sil', 't', 'x', 'z']

# Join each phoneme list into a regex alternation, e.g. 'a|ai|an|...'
vowels_string = '|'.join(vowels)
consonants_string = '|'.join(consonants)

# Alternation over every phoneme, vowel or consonant
all_chars = '{}|{}'.format(vowels_string, consonants_string)

# Raw strings keep '\+' as a regex escape. Each alternation sits in its own
# (?:...) group so that '|' cannot swallow the neighbouring '-' or '+'.
reg1 = r'^(?:{1})-(?:{0})\+(?:{0})$'.format(vowels_string, all_chars)  # allchars-vowel+vowel
reg2 = r'^(?:{0})-(?:{0})\+(?:{1})$'.format(vowels_string, all_chars)  # vowel-vowel+allchars

# Compile the two patterns as alternatives
regex = re.compile('({})|({})'.format(reg1, reg2))

# These should match (a match object is printed)
for triphone in ['zh-uei+x', 'b-ai+vn', 'e-uang+x', 'z-en+iang']:
    print(regex.match(triphone))

# These should not match (None is printed)
for triphone in ['sil-z+ai', 'vn-l+v', 'x-ia+f']:
    print(regex.match(triphone))

  • vowels_string contains all vowels separated by | (regex "or")
  • consonants_string contains all consonants separated by |
  • all_chars contains all phonemes (vowels and consonants) separated by |

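The regex is built as follows, shown here for reg1 ({0} is vowels_string and {1} is all_chars):

^        -> beginning of string
(?:{1})  -> any phoneme
-        -> literal hyphen
(?:{0})  -> a vowel
\+       -> literal plus sign
(?:{0})  -> a vowel
$        -> end of string

reg2 is the same with the first and last groups swapped.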
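
The grouping is the part that matters most. When a joined alternation such as 'a|ai|en' is dropped into a pattern without parentheses, the | operators split the whole expression into branches rather than choosing between individual vowels. The following is a minimal sketch of that effect, assuming a shortened vowel list made up for illustration (it is not the full list from the question):

```python
import re

# Shortened vowel list, used here only for illustration (an assumption,
# not the full list from the question).
vowels = ['a', 'ai', 'en', 'iang']
v = '|'.join(vowels)

# Ungrouped: expands to 'a|ai|en|iang-a|ai|en|iang\+.*', so '|' separates
# whole branches such as 'iang-a' instead of individual vowels.
ungrouped = re.compile(r'{0}-{0}\+.*'.format(v))

# Grouped: each alternation is confined to its own non-capturing group.
grouped = re.compile(r'(?:{0})-(?:{0})\+.*'.format(v))

# The ungrouped pattern cannot cover the whole tri-phone ...
print(ungrouped.fullmatch('en-iang+z'))
# ... and match() only succeeds on the lone branch 'en'.
print(ungrouped.match('en-iang+z').group())
# The grouped pattern covers the whole tri-phone.
print(grouped.fullmatch('en-iang+z'))
```

Using fullmatch (or the ^ and $ anchors, as in the answer) also guards against a short branch matching only a prefix of the string.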
Mohamed Benkedadra
  • What is the purpose of `?:`? If I delete it, the result changes. – partida Dec 13 '18 at 03:03
  • @partida Check out this post about ?: https://stackoverflow.com/questions/36524507/notation-in-regular-expression – Mohamed Benkedadra Dec 13 '18 at 03:06
  • I found that `?:` means a non-capturing group. It seems it does not affect the result and only uses less memory? – partida Dec 13 '18 at 03:31
  • @partida It sometimes affects the results; check out this answer: https://stackoverflow.com/a/3513858/6147182. A non-capturing group (?:) matches exactly like a normal capturing group (), but the string it matched is not returned in the results. I only added them by force of habit; if you need the string to be returned, remove the ?: – Mohamed Benkedadra Dec 13 '18 at 03:35
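
The difference described in these comments can be seen in a small sketch. It assumes a short vowel list invented for illustration; both patterns accept the same strings, and only the captured groups differ:

```python
import re

# Hypothetical short vowel list, for illustration only.
v = '|'.join(['a', 'ai', 'en', 'iang'])

# Same pattern twice: once with capturing groups, once with non-capturing ones.
capturing = re.compile(r'^({0})-({0})\+(.*)$'.format(v))
non_capturing = re.compile(r'^(?:{0})-(?:{0})\+(?:.*)$'.format(v))

m1 = capturing.match('en-iang+z')
m2 = non_capturing.match('en-iang+z')

print(m1.group(0), m1.groups())  # whole match plus the three captured parts
print(m2.group(0), m2.groups())  # same whole match, but nothing captured
```

So whether to keep ?: depends on whether the individual phonemes are needed afterwards via groups().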