Treat an emoji as one character in a regex

Question

Here's a small example:

reg = ur"((?P<initial>[👍+\-])(?P<rest>.+?))$"

(In both cases the file has -*- coding: utf-8 -*-)

In Python 2:

re.match(reg, u"👍hello").groupdict()
# => {u'initial': u'\ud83d', u'rest': u'\udc4dhello'}
# unicode why must you do this

Whereas, in Python 3:

re.match(reg, "👍hello").groupdict()
# => {'initial': '👍', 'rest': 'hello'}

The Python 3 behaviour above is exactly what I want, but switching to Python 3 is currently not an option. What's the best way to replicate 3's results in 2, in a way that works in both narrow and wide Python builds? The 👍 appears to be coming to me in the format "\ud83d\udc4d", which is what's making this tricky.

It looks like your Python 2 installation is a narrow build, so it has to break up Unicode chars with a codepoint >= 0x10000. Does `unichr(0x10000)` raise an error, or does it return `u'\U00010000'`? — PM 2Ring, Jan 16 '18 at 06:56
Although it doesn't solve your problem, there's some info about narrow vs wide build here: https://stackoverflow.com/questions/29109944/python-returns-length-of-2-for-single-unicode-character-string — PM 2Ring, Jan 16 '18 at 07:02
Also see https://stackoverflow.com/questions/35404144/correctly-extract-emojis-from-a-unicode-string — PM 2Ring, Jan 16 '18 at 07:08
A reminder that some emoji consist of more than one Unicode codepoint (involving combining characters and zero width joiners). It should be possible to write a regex to capture that, but it's not going to be trivial (and I'm not even going to attempt the feat). — Marius Gedminas, Jan 20 '18 at 10:57
@MariusGedminas Yeah, that's precisely the problem I'm having here - the character consists of a string with two "\u" codepoints. — naiveai, Jan 20 '18 at 12:48
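
The narrow-versus-wide check discussed in the comments above can be sketched without calling unichr at all. This is a minimal sketch that assumes only the standard library: sys.maxunicode is 0xFFFF on a narrow Python 2 build and 0x10FFFF on wide builds and on Python 3.3+. The helper name is_narrow_build is made up for illustration.

```python
from __future__ import print_function
import sys

def is_narrow_build():
    """Return True on a narrow (UTF-16) build, where non-BMP characters take two code units."""
    return sys.maxunicode == 0xFFFF

thumbs_up = u"\U0001F44D"

# len() is 2 on a narrow build (a surrogate pair) and 1 otherwise.
print(is_narrow_build(), len(thumbs_up))
```

If this prints True and 2, any regex character class containing the emoji will see two separate code units rather than one character.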

Mark Tolonen · Accepted Answer · 2018-01-20T14:45:57.750

In a Python 2 narrow build, non-BMP characters are stored as two surrogate code units, so you can't use them correctly inside the [] syntax. u'[👍]' is equivalent to u'[\ud83d\udc4d]', which means "match one of \ud83d or \udc4d". Python 2.7 example:

>>> u'\U0001f44d' == u'\ud83d\udc4d' == u'👍'
True
>>> re.findall(u'[👍]',u'👍')
[u'\ud83d', u'\udc4d']
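
For reference, the two values \ud83d and \udc4d follow from the standard UTF-16 surrogate arithmetic. The sketch below is illustrative only; the function name surrogate_pair is hypothetical and not part of any library.

```python
from __future__ import print_function

def surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP have a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = surrogate_pair(0x1F44D)
print(hex(high), hex(low))
```

On a narrow build these two code units are what the regex engine sees, which is why the character class matches each half on its own.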

To fix this in both Python 2 and 3, match u'👍' OR [+-] with an alternation instead of putting the emoji inside the character class. This returns the correct result in both Python 2 and 3:

#coding:utf8
from __future__ import print_function
import re

# Note the 'ur' syntax is an error in Python 3, so properly
# escape backslashes in the regex if needed.  In this case,
# the backslash was unnecessary.
reg = u"((?P<initial>👍|[+-])(?P<rest>.+?))$"

tests = u'👍hello',u'-hello',u'+hello',u'\\hello'
for test in tests:
    m = re.match(reg,test)
    if m:
        print(test,m.groups())
    else:
        print(test,m)

Output (Python 2.7):

👍hello (u'\U0001f44dhello', u'\U0001f44d', u'hello')
-hello (u'-hello', u'-', u'hello')
+hello (u'+hello', u'+', u'hello')
\hello None

Output (Python 3.6):

👍hello ('👍hello', '👍', 'hello')
-hello ('-hello', '-', 'hello')
+hello ('+hello', '+', 'hello')
\hello None
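
If the set of leading symbols grows, the same alternation idea can be generated rather than typed by hand. This is only a sketch under that assumption: the helper one_of is hypothetical, and it takes a list of strings (not a single string) so that a narrow build never iterates over half of a surrogate pair.

```python
from __future__ import print_function
import re

def one_of(symbols):
    """Build a non-capturing alternation that matches any one of the given strings."""
    # Longest first, so a longer symbol is never shadowed by a shorter prefix.
    ordered = sorted(symbols, key=len, reverse=True)
    return u"(?:" + u"|".join(re.escape(s) for s in ordered) + u")"

initial = one_of([u"\U0001F44D", u"+", u"-"])
reg = re.compile(u"(?P<initial>" + initial + u")(?P<rest>.+?)$")

for test in (u"\U0001F44Dhello", u"-hello", u"+hello", u"\\hello"):
    m = reg.match(test)
    print(repr(test), m.groupdict() if m else None)
```

Because each symbol is a whole alternative, the emoji is matched as a unit whether it is stored as one code point or as a surrogate pair.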

The Obscure Question · Answer 2 · score 3 · 2018-01-16T06:40:26.713

Just use the u prefix by itself.

In Python 2.7:

>>> reg = u"((?P<initial>[👍+\-])(?P<rest>.+?))$"
>>> re.match(reg, u"👍hello").groupdict()
{'initial': '👍', 'rest': 'hello'}

edited Jan 16 '18 at 06:40

answered Jan 16 '18 at 06:32


I get `{'initial': '\xf0', 'rest': '\x9f\x91\x8dhello'}` when i try your example using `python 2.7.13` – Sohaib Farooqi Jan 16 '18 at 06:36
I don't get that, I get the exact same error as before. – naiveai Jan 16 '18 at 06:39
No. Putting Unicode into plain Python 2 strings is not a good idea. Only use `u` strings for Unicode. – PM 2Ring Jan 16 '18 at 06:40
Wait, actually, I get the same error when I run under my real usecase. In the REPL I get the same result as @GarbageCollector – naiveai Jan 16 '18 at 06:41
@naiveai @Garbage Collector Interesting, seems like this is system dependent then. Try the `u` prefix by itself instead. I believe the `ur` is messing with the regex. – The Obscure Question Jan 16 '18 at 06:41
@TheObscureQuestion Doesn't make a difference for me in both REPL and usecase. – naiveai Jan 16 '18 at 06:48
This will likely only work on wide unicode builds of Python 2. These are default on Linux, but other platforms tend to default to narrow builds. – Marius Gedminas Jan 20 '18 at 10:55

score 2 · Answer 3 · answered Jan 16 '18 at 06:43

This is because Python2 doesn't distinguish between bytes and unicode strings.

Note that the Python 2.7 interpreter represents the character as 4 bytes. To get the same behavior in Python 3, you have to explicitly convert the unicode string to a bytes object.

# Python 2.7
>>> s = "👍hello"
>>> s
'\xf0\x9f\x91\x8dhello'

# Python 3.5
>>> s = "👍hello"
>>> s
'👍hello'

So for Python 2, just use the hex representation of that character for the search pattern (including specifying the length) and it works.

>>> reg = "((?P<initial>[+\-\xf0\x9f\x91\x8d]{4})(?P<rest>.+?))$"
>>> re.match(reg, s).groupdict()
{'initial': '\xf0\x9f\x91\x8d', 'rest': 'hello'}
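
It is worth checking what the pattern above really accepts. A character class followed by {4} matches any four bytes drawn from the class, not the UTF-8 sequence as a unit, so the plain + and - cases stop matching. The sketch below compares it with an alternation on byte strings; the variable names are illustrative only.

```python
from __future__ import print_function
import re

s = u"\U0001F44Dhello".encode("utf-8")  # the four UTF-8 bytes followed by hello

# Character class with {4}: any four of +, -, \xf0, \x9f, \x91, \x8d.
char_class = re.compile(br"(?P<initial>[+\-\xf0\x9f\x91\x8d]{4})(?P<rest>.+?)$")
# Alternation: either a single + or -, or the exact four-byte sequence.
alternation = re.compile(br"(?P<initial>[+-]|\xf0\x9f\x91\x8d)(?P<rest>.+?)$")

for name, pattern in (("class{4}", char_class), ("alternation", alternation)):
    for test in (s, b"-hello", b"+-+-x"):
        m = pattern.match(test)
        print(name, repr(test), m.groupdict() if m else None)
```

The alternation handles the emoji and the single-character prefixes alike, while the class version rejects b"-hello" and wrongly accepts b"+-+-" as an initial.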

UnoriginalNick


This works for the emoji but seems to break the regular `+` and `-` cases. – naiveai Jan 16 '18 at 06:56
@naiveai `[+\-\xf0\x9f\x91\x8d]{4}` => `[+-]|\xf0\x9f\x91\x8d` – Wiktor Stribiżew Jan 16 '18 at 08:14
@WiktorStribiżew That does work in the REPL, but in my usecase, I'm getting passed string with `\u` in them, which I'm pretty sure is my fault - any way I can make them match? Simply using `\ud83d\udc4d` doesn't seem to work. – naiveai Jan 16 '18 at 08:31
You need to use the \u escape code for the character you want to match. \ud83d\udc4d is from the 4-byte sequence being improperly interpreted as two 2-byte unicode characters. The \u escape sequence for the thumbs up emoji is u"\U0001F44D". – UnoriginalNick Jan 16 '18 at 16:10
@UnoriginalNick Ok, there is definitely some other bug, because even that doesn't work. Thanks so much for the answer though, it'll help me get a little closer. – naiveai Jan 16 '18 at 16:27
"This is because Python2 doesn't distinguish between bytes and unicode strings.". What do you think `u''` vs. `''` is? In this case you've intentionally used a byte string. Your regex is matching exactly four of *any of* `+` or `-` or `\xf0` or `\x9f` or `\x91` or `\x8d`. – Mark Tolonen Jan 20 '18 at 14:37

score 1 · Answer 4 · answered Jan 16 '18 at 06:29

There is one way to convert that escaped Unicode text to an emoji in Python 2.7:

b = data['vote']  # data stands for your own dict; assign the escaped value to b
print b.decode('unicode-escape')

I don't know whether this is exactly what you are looking for, but I think you can use it to resolve the issue in some way.
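
A hedged sketch of that idea, assuming the incoming value is a byte string holding literal backslash escapes such as \ud83d\udc4d. After unicode-escape decoding, the two surrogates may still be separate code points (for example on Python 3 or a wide Python 2 build), so the sketch round-trips through UTF-16 to join them. The function name decode_escaped is hypothetical, and the Python 2 branch relies on Python 2's lenient UTF-16 codec, which is an assumption not exercised by the check below.

```python
from __future__ import print_function
import sys

def decode_escaped(raw):
    """Decode literal \\uXXXX escapes and join any surrogate pair into one character."""
    text = raw.decode("unicode-escape")
    if sys.version_info[0] >= 3:
        # Python 3 refuses to encode lone surrogates unless told to pass them through.
        return text.encode("utf-16", "surrogatepass").decode("utf-16")
    # Assumption: Python 2's UTF-16 codec encodes surrogates without complaint.
    return text.encode("utf-16").decode("utf-16")

result = decode_escaped(b"\\ud83d\\udc4dhello")
print(result == u"\U0001F44Dhello")
```

On a narrow Python 2 build the result is still stored as a surrogate pair internally, so the regex itself should still use an alternation rather than a character class.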
