10

Here's a small example:

reg = ur"((?P<initial>[👍+\-])(?P<rest>.+?))$"

(In both cases the file has -*- coding: utf-8 -*-)

In Python 2:

re.match(reg, u"👍hello").groupdict()
# => {u'initial': u'\ud83d', u'rest': u'\udc4dhello'}
# unicode why must you do this

Whereas, in Python 3:

re.match(reg, "👍hello").groupdict()
# => {'initial': '👍', 'rest': 'hello'}

The above behaviour is 100% perfect, but switching to Python 3 is currently not an option. What's the best way to replicate 3's results in 2, that works in both narrow and wide Python builds? The 👍 appears to be coming to me in the format "\ud83d\udc4d", which is what's making this tricky.

naiveai
  • It looks like your Python 2 installation is a narrow build, so it has to break up Unicode chars with a codepoint >= 0x10000. Does `unichr(0x10000)` raise an error, or does it return `u'\U00010000'`? – PM 2Ring Jan 16 '18 at 06:56
  • Although it doesn't solve your problem, there's some info about narrow vs wide build here: https://stackoverflow.com/questions/29109944/python-returns-length-of-2-for-single-unicode-character-string – PM 2Ring Jan 16 '18 at 07:02
  • Also see https://stackoverflow.com/questions/35404144/correctly-extract-emojis-from-a-unicode-string – PM 2Ring Jan 16 '18 at 07:08
  • A reminder that some emoji consist of more than one Unicode codepoint (involving combining characters and zero width joiners). It should be possible to write a regex to capture that, but it's not going to be trivial (and I'm not even going to attempt the feat). – Marius Gedminas Jan 20 '18 at 10:57
  • @MariusGedminas Yeah, that's precisely the problem I'm having here - the character consists of a string with two "\u" codepoints. – naiveai Jan 20 '18 at 12:48
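
The narrow-versus-wide question raised in the comments can be checked directly. A minimal sketch, assuming only the standard `sys.maxunicode` attribute; the helper name `is_narrow_build` is made up for illustration:

```python
import sys

def is_narrow_build():
    """Return True on a narrow (UTF-16) build, where sys.maxunicode is 0xFFFF."""
    return sys.maxunicode == 0xFFFF

# U+1F44D (thumbs up) is outside the BMP, so a narrow build stores it
# as two surrogate code units and len() reports 2 instead of 1.
thumbs_up = u"\U0001F44D"
print(is_narrow_build(), len(thumbs_up))
```

On a narrow build the length is 2; Python 3.3 and later always behave like a wide build, so the length there is 1.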

4 Answers

4

In a Python 2 narrow build, non-BMP characters are two surrogate code points, so you can't use them in the [] syntax correctly. u'[👍]' is equivalent to u'[\ud83d\udc4d]', which means "match one of \ud83d or \udc4d". Python 2.7 example:

>>> u'\U0001f44d' == u'\ud83d\udc4d' == u'👍'
True
>>> re.findall(u'[👍]',u'👍')
[u'\ud83d', u'\udc4d']

To fix in both Python 2 and 3, match u'👍' OR [+-]. This returns the correct result in both Python 2 and 3:

#coding:utf8
from __future__ import print_function
import re

# Note the 'ur' syntax is an error in Python 3, so properly
# escape backslashes in the regex if needed.  In this case,
# the backslash was unnecessary.
reg = u"((?P<initial>👍|[+-])(?P<rest>.+?))$"

tests = u'👍hello',u'-hello',u'+hello',u'\\hello'
for test in tests:
    m = re.match(reg,test)
    if m:
        print(test,m.groups())
    else:
        print(test,m)

Output (Python 2.7):

👍hello (u'\U0001f44dhello', u'\U0001f44d', u'hello')
-hello (u'-hello', u'-', u'hello')
+hello (u'+hello', u'+', u'hello')
\hello None

Output (Python 3.6):

👍hello ('👍hello', '👍', 'hello')
-hello ('-hello', '-', 'hello')
+hello ('+hello', '+', 'hello')
\hello None
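
The same alternation idea can be generalized when there are several possible leading characters. The following is only a sketch: `one_of` and `initials` are hypothetical names, not part of the code above, and the pattern is assembled with `re.escape` and `|` instead of a `[]` class so that a surrogate pair is treated as one unit on narrow builds.

```python
# -*- coding: utf-8 -*-
import re

def one_of(chars):
    """Build a non-capturing alternation that matches exactly one item of `chars`."""
    # Longest items first, so a multi-code-unit item is tried before any shorter prefix.
    ordered = sorted(chars, key=len, reverse=True)
    return u"(?:" + u"|".join(re.escape(c) for c in ordered) + u")"

# Hypothetical list of allowed leading characters (thumbs up, plus, minus).
initials = [u"\U0001F44D", u"+", u"-"]
reg = u"(?P<initial>" + one_of(initials) + u")(?P<rest>.+?)$"

for test in (u"\U0001F44Dhello", u"-hello", u"+hello", u"\\hello"):
    m = re.match(reg, test)
    print(repr(test), m.groupdict() if m else None)
```

Because each item is a whole alternative, this form should also cope with emoji made of several code points, which a character class cannot express.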
Mark Tolonen
3

Just use the u prefix by itself.

In Python 2.7:

>>> reg = u"((?P<initial>[👍+\-])(?P<rest>.+?))$"
>>> re.match(reg, u"👍hello").groupdict()
{'initial': '👍', 'rest': 'hello'}
The Obscure Question
2

This is because Python2 doesn't distinguish between bytes and unicode strings.

Note that the Python 2.7 interpreter represents the character as 4 bytes. To get the same behavior in Python 3, you have to explicitly convert the unicode string to a bytes object.

# Python 2.7
>>> s = "👍hello"
>>> s
'\xf0\x9f\x91\x8dhello'

# Python 3.5
>>> s = "👍hello"
>>> s
'👍hello'

So for Python 2, just use the hex representation of that character for the search pattern (including specifying the length) and it works.

>>> reg = "((?P<initial>[+\-\xf0\x9f\x91\x8d]{4})(?P<rest>.+?))$"
>>> re.match(reg, s).groupdict()
{'initial': '\xf0\x9f\x91\x8d', 'rest': 'hello'}
  • This works for the emoji but seems to break the regular `+` and `-` cases. – naiveai Jan 16 '18 at 06:56
  • @naiveai `[+\-\xf0\x9f\x91\x8d]{4}` => `[+-]|\xf0\x9f\x91\x8d` – Wiktor Stribiżew Jan 16 '18 at 08:14
  • @WiktorStribiżew That does work in the REPL, but in my usecase, I'm getting passed string with `\u` in them, which I'm pretty sure is my fault - any way I can make them match? Simply using `\ud83d\udc4d` doesn't seem to work. – naiveai Jan 16 '18 at 08:31
  • You need to use the \u escape code for the character you want to match. \ud83d\udc4d is from the 4-byte sequence being improperly interpreted as two 2-byte unicode characters. The \u escape sequence for the thumbs up emoji is u"\U0001F44D". – UnoriginalNick Jan 16 '18 at 16:10
  • @UnoriginalNick Ok, there is definitely some other bug, because even that doesn't work. Thanks so much for the answer though, it'll help me get a little closer. – naiveai Jan 16 '18 at 16:27
  • "This is because Python2 doesn't distinguish between bytes and unicode strings.". What do you think `u''` vs. `''` is? In this case you've intentionally used a byte string. Your regex is matching exactly four of *any of* `+` or `-` or `\xf0` or `\x9f` or `\x91` or `\x8d`. – Mark Tolonen Jan 20 '18 at 14:37
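
For byte strings, the correction suggested in the comments (alternation instead of a repeated character class) might look like the sketch below. It assumes the input is UTF-8 encoded bytes; `split_initial` is an illustrative name, not something from the answer.

```python
import re

# UTF-8 bytes of U+1F44D, matched as one four-byte alternative
# rather than as four independent members of a character class.
reg = re.compile(b"(?P<initial>[+-]|\xf0\x9f\x91\x8d)(?P<rest>.+?)$")

def split_initial(data):
    """Return the named groups for a UTF-8 byte string, or None if it does not match."""
    m = reg.match(data)
    return m.groupdict() if m else None

print(split_initial(u"\U0001F44Dhello".encode("utf-8")))
print(split_initial(b"+hello"))
print(split_initial(b"\\hello"))
```

The groups come back as bytes, so they would still need `.decode("utf-8")` before being compared with text.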
1

One option is to decode that escaped Unicode value into the emoji in Python 2.7:

b = dict['vote'] # assign that unicode value to b 
print b.decode('unicode-escape')

I don't know if this is exactly what you are looking for, but I think you can use it to resolve the issue in some way.
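
If the input really arrives as literal backslash escapes such as \ud83d\udc4d, one possible normalization step before matching is sketched below. This rests on an assumption about the input that the question does not confirm: it expects an ASCII byte string, and `decode_escaped` is a hypothetical helper. The UTF-16 round trip is a common idiom for joining a surrogate pair into a single code point.

```python
import sys

# Python 3 refuses to encode lone surrogates unless "surrogatepass" is given.
# Python 2 has no such handler and, to my knowledge, encodes them as-is.
_ERRORS = "surrogatepass" if sys.version_info[0] >= 3 else "strict"

def decode_escaped(raw):
    """Decode literal \\uXXXX escapes in an ASCII byte string to text."""
    text = raw.decode("unicode-escape")
    # Round-trip through UTF-16 so a surrogate pair becomes one code point
    # where the build supports it (narrow builds keep the pair).
    return text.encode("utf-16", _ERRORS).decode("utf-16")

thumbs_up = decode_escaped(b"\\ud83d\\udc4dhello")
print(thumbs_up == u"\U0001F44Dhello")
```

The decoded text can then be passed to a regex that matches the emoji as a whole alternative.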

Vikas P