
I downloaded a dataset of Facebook messages, and the text in it is formatted like this:

f\u00c3\u00b8rste student

It's supposed to be første student, but I can't seem to decode it correctly.

I tried:

text = 'f\u00c3\u00b8rste student'
print(text)
# fÃ¸rste student

text = 'f\u00c3\u00b8rste student'
print(text.encode('utf-8'))
# b'f\xc3\x83\xc2\xb8rste student'

But it didn't work.

vhflat
  • `'ø'` is `'\u00f8'` – timgeb Dec 03 '18 at 21:53
  • Your string is in fact: 'fÃ¸rste student' – Babak Dec 03 '18 at 21:54
  • Well, I'm trying to figure out how I can get from `\u00c3\u00b8` to `ø`, seeing that my whole data set is formatted like this. – vhflat Dec 03 '18 at 21:55
  • Put `# -*- coding: utf-8 -*-` on top of your Python script. – Rafael Dec 03 '18 at 21:56
  • @Babak when I open it in sublime it looks like f\u00c3\u00b8rste student. Is that wrong? – vhflat Dec 03 '18 at 21:56
  • @Rafael That will not help: `# -*- coding: utf-8 -*-` only specifies the encoding of the source file itself. – quant Dec 03 '18 at 21:57
  • @Prune, this is not a duplicate! That question is not the same at all. My data seems to be double-encoded, via Latin-1, for some reason. ARG Facebook! – vhflat Dec 03 '18 at 22:01
  • @Prune This is not a UTF-8 encoding issue. The issue is that there are multiple characters looking like 'ø' that are quite similar. So yes, \u00f8 is such a character, but `\xC3\xb8` too. With this the answer is obvious. – quant Dec 03 '18 at 22:02
  • If you are sure that the original name is første, then my guess would be something has messed your source data up! – Babak Dec 03 '18 at 22:03
  • @Babak yes, I'm sure, because it's my own Facebook messages that I've downloaded. I found [this](https://stackoverflow.com/questions/50008296/facebook-json-badly-encoded) question that I think is the same problem. – vhflat Dec 03 '18 at 22:05
  • @vhflat: Sorry; I reopened. – Prune Dec 03 '18 at 22:10
  • @vhflat yup... looks like you've got to encode/decode between things there... hopefully Martijn's answer helps there on the link you posted? – Jon Clements Dec 03 '18 at 22:14
  • Possible duplicate of [Facebook JSON badly encoded](https://stackoverflow.com/questions/50008296/facebook-json-badly-encoded) – snakecharmerb Feb 24 '19 at 10:03

1 Answer


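Your text was originally UTF-8, but its bytes were decoded as ISO-8859-1 (Latin-1) somewhere along the way, so each byte of a multi-byte character became a separate character. To undo this, first convert the characters back to the bytes with the same ordinals by encoding as ISO-8859-1, then decode those bytes as UTF-8: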

>>> 'f\u00c3\u00b8rste student'.encode('iso-8859-1').decode('utf-8')
'første student'
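
Since the whole data set is affected, the same round trip can be wrapped in a helper and applied to every string in the parsed JSON. The following is only a minimal sketch, assuming the export is JSON; the names `fix_mojibake` and `fix_nested` and the keys `sender_name` and `timestamp_ms` in the sample are made up for illustration and may not match your file.

```python
import json


def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mis-decoded as Latin-1.

    Strings that do not fit that pattern are returned unchanged.
    """
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the
        # resulting bytes are not valid UTF-8: leave it alone.
        return text


def fix_nested(value):
    """Apply fix_mojibake to every string in a parsed JSON structure."""
    if isinstance(value, str):
        return fix_mojibake(value)
    if isinstance(value, list):
        return [fix_nested(item) for item in value]
    if isinstance(value, dict):
        return {fix_nested(k): fix_nested(v) for k, v in value.items()}
    return value  # numbers, booleans, None


# How the damage arises: the UTF-8 bytes of 'ø' read as Latin-1.
assert 'ø'.encode('utf-8').decode('latin-1') == '\u00c3\u00b8'

# Hypothetical sample record; the keys are assumptions for illustration.
raw = '{"sender_name": "f\\u00c3\\u00b8rste student", "timestamp_ms": 1}'
data = fix_nested(json.loads(raw))
print(data)  # {'sender_name': 'første student', 'timestamp_ms': 1}
```

The `try`/`except` means strings that are already correct (or plain ASCII) pass through untouched. One caveat: a string that genuinely contains a sequence like `Ã¸` would also be "repaired", so this is only appropriate when you know the data went through this particular mis-decoding.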
jwodder