
I'm trying to transform this unicode value:

string_value = u'd\xe9cid\xe9'

to

string_value = u'décidé'

I feel like I've tried everything:

decoded_str = string_value.decode('utf-8')

or

string_value = str(string_value)
decoded_str = string_value.encode('latin1').decode('utf-8')

or

string_value = string_value.decode('latin-1')

for this one the result is:

d\xc3\xa9cid\xc3\xa9

I have the same result if I do:

string_value = string_value.encode('utf-8')

I've read from: How do I convert 'blah \xe9 blah' to 'blah é blah'

also from: Why does Python print unicode characters when the default encoding is ASCII?

and: How do I convert a unicode to a string at the Python level?

EDIT:

My problem is that I need to work with the data. For example, if I have:

string_value = u'mai 2017 \u2013 Aujourd\u2019hui'

which is:

mai 2017 – Aujourd’hui

I want to do:

string_list = string_value.split('-')

But the result is:

[u'mai 2017 \u2013 Aujourd\u2019hui']

And I would like:

['mai 2017', 'Aujourd’hui']

NEW EDIT:

Thanks to your answers, I understand that I was going in the wrong direction. \xe9 is the right representation of 'é' and it's not a problem. My real question is: why does json.loads() transform 'mai 2017 – Aujourd’hui' into 'mai 2017 \u2013 Aujourd\u2019hui'?

deceze
PAscalinox
  • Why do you care how a string is represented in your source code? Does it not come out correctly when you `print` it? – Jongware Mar 20 '18 at 10:38
  • string_value = string_value.encode('utf-8') is working for me in python2.7 – Rakesh Mar 20 '18 at 10:38
  • `u'd\xe9cid\xe9'` already represents `u'décidé'`. You don't need to do anything. – deceze Mar 20 '18 at 10:43
  • Thanks for your answer @usr2564301, the problem is that I need to format the data. For example, I have this unicode u'mai 2017 \u2013 Aujourd\u2019hui', which is 'mai 2017 – Aujourd’hui', and I want to split it at '-' and it's not working. I'm going to edit my question with this example – PAscalinox Mar 20 '18 at 10:43
  • `-` is simply not the same character as `–` (`\u2013`). – deceze Mar 20 '18 at 10:47
  • Don't edit your question in a way that invalidates all existing answers. – deceze Mar 20 '18 at 11:21

2 Answers


I am not sure what you're asking: \xe9 is a representation of the code point 233 (e9 in hexadecimal), which simply is the letter "é":

>>> u'é' == u'\xe9'
True

Your confusion might stem from the fact that the repr of a Python string is (in Python 2) in ASCII, so non-ASCII characters are escaped. The Python console displays a value using repr if you do not print it explicitly:

>>> print(repr(u'é'))
u'\xe9'

>>> print(repr(u'\xe9'))
u'\xe9'

However, when you print the value, that conversion doesn't happen and everything works as expected:

>>> print(u'é')
é

>>> print(u'\xe9')
é
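The same reasoning applies to the json.loads() case raised in the question's latest edit: the parser does not alter the text, and \u2013 only shows up when the result is displayed through repr. A minimal sketch using the standard json module and the sample string from the question (the variable names raw and value are illustrative):

```python
import json

# JSON text in which the en dash and the curly apostrophe are written
# as \u escapes (the doubled backslashes keep them literal in Python).
raw = '"mai 2017 \\u2013 Aujourd\\u2019hui"'
value = json.loads(raw)

# The decoded value contains the real characters U+2013 and U+2019;
# the escapes are only a way of writing them down.
print(value == u'mai 2017 \u2013 Aujourd\u2019hui')  # True
print(u'\u2013' in value)                            # True
```

Printing value itself, rather than its repr, should show the dash and apostrophe as-is on a terminal that can display them.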

Also note that in Python 3, repr returns Unicode:

Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print(repr(u'\xe9'))
'é'

Update after the question was edited:

As pointed out in the comments, \u2013 is not the same character as - (just as a and b are separate characters). So you'll need to split on \u2013 instead of splitting on -.
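A short sketch of that split, using the string from the question. Stripping the surrounding spaces is an addition here, assumed from the desired output shown in the question:

```python
string_value = u'mai 2017 \u2013 Aujourd\u2019hui'

# Splitting on the ASCII hyphen-minus (U+002D) finds no delimiter,
# so the whole string comes back as a single element.
print(len(string_value.split(u'-')))  # 1

# Splitting on the en dash (U+2013) works; strip() drops the spaces
# that surrounded the dash.
parts = [part.strip() for part in string_value.split(u'\u2013')]
print(parts == [u'mai 2017', u'Aujourd\u2019hui'])  # True
```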

Florian Brucker

splitting a string with a unicode delimiter?

so...

print(string_value.split(u"\u2013"))
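If the input might contain different kinds of dashes, one possible variation is to split on a small set of them with re.split. This is only a sketch: the pattern name DASHES and the choice of which dash characters to accept are assumptions, not something stated in the question.

```python
import re

string_value = u'mai 2017 \u2013 Aujourd\u2019hui'

# Hypothetical pattern: hyphen-minus (U+002D), en dash (U+2013) or
# em dash (U+2014), together with any whitespace around it.
DASHES = re.compile(u'\\s*[\\-\u2013\u2014]\\s*')

parts = DASHES.split(string_value)
print(parts == [u'mai 2017', u'Aujourd\u2019hui'])  # True
```

Note that this also splits on ordinary hyphens inside the text (for example in a hyphenated word), so it only fits data where every dash is a separator.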
Chris Curvey