
I'm trying to read a Twitter stream using Python.

The lines in my file, which seem to be correct, look like:

{"delete":{"status":{"id":471622360253345792,"user_id":2513833684,"id_str":"471622360253345792","user_id_str":"2513833684"}}}

When I read this line into memory using readline and call json.loads() on it, I get the following error:

No JSON object could be decoded

I'm thinking I have to convert the line somehow before calling json.loads() on it?

Some notes:

  1. If I paste the string from the file into IPython and call json.loads() on it, then everything works fine.
  2. When I print the line in IPython, it adds a strange character at the front and puts spaces between the rest of the characters. The first few characters look like:

    �{" d e l e t e " : { " s t a t u s

  3. If I display the string in IPython without calling print, the first few characters are:

    \xff\xfe{\x00"\x00d\x00e\x00l\x00e\x00t\x00e\x00"\x00:\x00{\x00"\x00s\x00t\x00a\x00t\x00u\x00s\x00"\x00

I have no idea how to fix this.

Edit: By request, the code that reads the twitter stream is here:

https://github.com/uwescience/datasci_course_materials/blob/master/assignment1/twitterstream.py

hahdawg
  • Can we see your actual code? Particularly the part that does the reading. – Cory Kramer Jul 02 '14 at 22:27
  • The __FF FE__ header is a byte order mark showing that the byte stream is encoded as UTF-16 (little-endian), not UTF-8. For background on byte order marks, see: http://stackoverflow.com/questions/2223882/whats-different-between-utf-8-and-utf-8-without-bom – johntellsall Jul 02 '14 at 23:01

3 Answers


By the looks of it, your file is stored in an encoding that is not ASCII-based, and your parser isn't handling that encoding.

If you check the documentation on the json library you see:

If the contents of fp are encoded with an ASCII based encoding other than UTF-8 
(e.g. latin-1), then an appropriate encoding name must be specified. Encodings 
that are not ASCII based (such as UCS-2) are not allowed, and should be wrapped 
with codecs.getreader(encoding)(fp), or simply decoded to a unicode object and 
passed to loads().

So I would first check that your JSON is properly formatted and then look into the encoding.
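
As a minimal sketch of the "decode first, then pass to loads()" route from the documentation quoted above, assuming the file is UTF-16 with a BOM (which the `\xff\xfe` prefix in the question suggests): `io.open` decodes while reading, so each line reaches `json.loads()` as text. The helper name `read_json_lines` and the temporary sample file are made up for illustration.

```python
import io
import json
import os
import tempfile


def read_json_lines(path, encoding="utf-16"):
    """Yield one parsed JSON object per non-empty line of a text file."""
    # io.open decodes while reading; the "utf-16" codec consumes the BOM
    # and takes the byte order from it, so each line is already text.
    with io.open(path, encoding=encoding) as fp:
        for line in fp:
            line = line.strip()
            if line:
                yield json.loads(line)


# Build a small UTF-16 sample file (hypothetical data modelled on the question).
sample = u'{"delete":{"status":{"id":471622360253345792,"user_id":2513833684}}}\n'
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="utf-16") as fp:
    fp.write(sample)

records = list(read_json_lines(path))
os.remove(path)
```

Reading in text mode also avoids a pitfall of binary `readline()`: a UTF-16 newline is two bytes, so a line split on the single byte `\n` can end in the middle of a character.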

Damon Swayn
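Decode the UTF-16 bytes to a unicode string first, because the encoding argument of json.loads() only supports ASCII-based encodings:

    json.loads(twitter_data.decode('utf-16'))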
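
A hedged sketch of handling the raw bytes before parsing, assuming `data` holds the complete content rather than a single line read in binary mode (such a line can end mid-character, since a UTF-16 newline is two bytes). The function name `loads_maybe_utf16` is hypothetical, and the check does not distinguish UTF-32, whose little-endian BOM starts with the same two bytes.

```python
import codecs
import json


def loads_maybe_utf16(data):
    """Parse JSON from raw bytes that may be UTF-16 with a BOM, else UTF-8."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM, picks the byte order, and drops it.
        text = data.decode("utf-16")
    else:
        # "utf-8-sig" also tolerates an optional UTF-8 BOM.
        text = data.decode("utf-8-sig")
    return json.loads(text)


# Same byte pattern as in the question: FF FE, then a NUL after each ASCII char.
raw = codecs.BOM_UTF16_LE + u'{"delete": {"status": {"id": 471622360253345792}}}'.encode("utf-16-le")
parsed = loads_maybe_utf16(raw)
```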
jossgray

Are you using Windows for the assignments? The default encoding for the text file retrieved under Windows is UCS-2 LE with a BOM (in practice, UTF-16 little-endian with a byte order mark), which json.loads() does not recognize. You can either work on Linux or use a third-party tool such as Notepad++, which lets you conveniently re-save the file as UTF-8.
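
If re-saving by hand is inconvenient, the same conversion can be sketched in Python. This assumes the source file is UTF-16 with a BOM; the function name `utf16_to_utf8` and the file names are made up for illustration.

```python
import io
import json
import os
import shutil
import tempfile


def utf16_to_utf8(src_path, dst_path):
    """Rewrite a UTF-16 text file as UTF-8 (no BOM), keeping line endings."""
    # newline="" turns off newline translation on both sides.
    with io.open(src_path, encoding="utf-16", newline="") as src:
        with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
            for line in src:
                dst.write(line)


# Demonstration on a throwaway file (hypothetical data).
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "stream_utf16.txt")
dst_path = os.path.join(tmp_dir, "stream_utf8.txt")
sample = u'{"delete": {"status": {"id": 471622360253345792}}}\n'
with io.open(src_path, "w", encoding="utf-16", newline="") as fp:
    fp.write(sample)

utf16_to_utf8(src_path, dst_path)

with io.open(dst_path, "rb") as fp:
    converted = fp.readline()  # a binary readline now ends on a whole character
result = json.loads(converted.decode("utf-8"))
shutil.rmtree(tmp_dir)
```

After the conversion, the original approach of reading a line and calling `json.loads()` on it should work, since every character in the sample is a single byte in UTF-8.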

Fontaine007