
I have a script that collects JSON data from Twitter's API. This script collects data and parses it with jq every minute. This data gets collected into a single file that ends up looking like the following:

[
  {"text": "Tweet 01",
   "id": "001"
  },
  {"text": "Tweet 02",
   "id": "002"
  },
  {"text": "Tweet 03",
   "id": "003"
  }
]
[
  {"text": "Tweet 04",
   "id": "004"
  },
  {"text": "Tweet 05",
   "id": "005"
  },
  {"text": "Tweet 06",
   "id": "006"
  },
  {"text": "Tweet 07",
   "id": "007"
  },
  {"text": "Tweet 08",
   "id": "008"
  }
]
[
  {"text": "Tweet 09",
   "id": "009"
  },
  {"text": "Tweet 10",
   "id": "010"
  }
]

I've previously had a single list of JSON data per file, and Pandas can easily work with one list in a file. But how can I efficiently iterate over these multiple lists, which are NOT comma-separated and NOT necessarily all the same length?

My ultimate goal is to aggregate ALL the JSON data from this one file and convert it to a CSV file, where each column is a key in the JSON data. It should end up looking like:

text, id
Tweet 01, 001
Tweet 02, 002
Tweet 03, 003
Tweet 04, 004
Tweet 05, 005
Tweet 06, 006
Tweet 07, 007
Tweet 08, 008
Tweet 09, 009
Tweet 10, 010

If I try reading the file as-is, the following occurs:

>>> import pandas as pd
>>> df = pd.read_json("sample.json")
>>> df.head()
Traceback (most recent call last):
  File "lists.py", line 3, in <module>
    df = pd.read_json("sample.json")
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/util/_decorators.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/json/_json.py", line 608, in read_json
    result = json_reader.read()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/json/_json.py", line 731, in read
    obj = self._get_object_parser(self.data)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/json/_json.py", line 753, in _get_object_parser
    obj = FrameParser(json, **kwargs).parse()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/json/_json.py", line 857, in parse
    self._parse_no_numpy()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/json/_json.py", line 1089, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None
ValueError: Trailing data
  • Possible solutions here using `JSONDecoder.raw_decode`: https://stackoverflow.com/questions/27907633/how-to-extract-multiple-json-objects-from-one-file/27907893#27907893. JSON is not a good format for appending data like this unless you skip pretty-printing and write each JSON document on a single line. You could also add your own delimiters like "===============" between records so you can split the file before parsing the JSON. But really, choose a better format; a simple CSV would work well here. – tdelaney Jul 15 '20 at 17:55
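The newline-delimited layout tdelaney describes can be sketched roughly like this (the filename tweets.ndjson is hypothetical): the collector appends one compact JSON object per line, and pandas then reads the file directly with lines=True.

```python
import json
import pandas as pd

# Writer side: append each tweet as a single compact line (NDJSON / JSON Lines)
tweets = [{"text": "Tweet 01", "id": "001"},
          {"text": "Tweet 02", "id": "002"}]
with open("tweets.ndjson", "a") as f:
    for tweet in tweets:
        f.write(json.dumps(tweet) + "\n")

# Reader side: pandas parses newline-delimited JSON natively;
# dtype=False avoids coercing string ids like "001" to integers
df = pd.read_json("tweets.ndjson", lines=True, dtype=False)
```

With this layout each minute's batch can simply be appended, and no after-the-fact splicing of `][` boundaries is needed.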

3 Answers

  • The file contents need to be converted to a standard list by reading the file in and merging the individual lists into a single list.
  • .readlines reads the file in as a list of strings, one per line.
    • Use a list comprehension to iterate through each row, removing whitespace and newlines at the front and tail of each string with str.strip.
  • str.join combines the items in a list into a single string.
  • Use str.replace to replace '][' with ','.
  • Use ast.literal_eval to convert the string back to a list.
from ast import literal_eval
import pandas as pd

# open and clean the contents of the file
with open('test.json', 'r') as f:
    data = literal_eval(''.join([row.strip() for row in f.readlines()]).replace('][', ','))

# print(data)
[{'text': 'Tweet 01', 'id': '001'},
 {'text': 'Tweet 02', 'id': '002'},
 {'text': 'Tweet 03', 'id': '003'},
 {'text': 'Tweet 04', 'id': '004'},
 {'text': 'Tweet 05', 'id': '005'},
 {'text': 'Tweet 06', 'id': '006'},
 {'text': 'Tweet 07', 'id': '007'},
 {'text': 'Tweet 08', 'id': '008'},
 {'text': 'Tweet 09', 'id': '009'},
 {'text': 'Tweet 10', 'id': '010'}]

# load into pandas
df = pd.json_normalize(data)

# display(df)
       text   id
0  Tweet 01  001
1  Tweet 02  002
2  Tweet 03  003
3  Tweet 04  004
4  Tweet 05  005
5  Tweet 06  006
6  Tweet 07  007
7  Tweet 08  008
8  Tweet 09  009
9  Tweet 10  010
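A close variant of the approach above uses json.loads instead of ast.literal_eval; unlike literal_eval, it also accepts JSON-only literals such as true and null. A minimal sketch that recreates the same file layout first (the filename test.json is just for the demonstration):

```python
import json
import pandas as pd

# Recreate a small file with two pretty-printed arrays back to back
sample = ('[\n  {"text": "Tweet 01",\n   "id": "001"\n  }\n]\n'
          '[\n  {"text": "Tweet 02",\n   "id": "002"\n  }\n]\n')
with open('test.json', 'w') as f:
    f.write(sample)

# Strip per-line whitespace, splice the arrays, and parse as real JSON
with open('test.json') as f:
    text = ''.join(line.strip() for line in f)
data = json.loads(text.replace('][', ','))

df = pd.json_normalize(data)
```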
Trenton McKinney

As @Trenton McKinney said, you need to clean the data. You could use f.read() to get the file contents as a string, then str.replace() to replace the ']\n[' sequences with ',', since those are what cause the Trailing data error, and finally parse the result with pd.read_json:

import pandas as pd

with open('data.json') as f:
    data = f.read()

# splice the consecutive lists into one
data = data.replace(']\n[', ',')

df = pd.read_json(data)

Output:

df
       text  id
0  Tweet 01   1
1  Tweet 02   2
2  Tweet 03   3
3  Tweet 04   4
4  Tweet 05   5
5  Tweet 06   6
6  Tweet 07   7
7  Tweet 08   8
8  Tweet 09   9
9  Tweet 10  10
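If the file might contain Windows line endings or blank lines between the arrays, an exact ']\n[' match will miss some boundaries. A regex splice over any whitespace between the brackets is a more forgiving sketch of the same idea:

```python
import json
import re

# Example with \r\n line endings and a blank line between arrays
raw = ('[{"text": "Tweet 01", "id": "001"}]\r\n'
       '\r\n'
       '[{"text": "Tweet 02", "id": "002"}]')

# Replace "]<any whitespace>[" with "," to merge the arrays
merged = re.sub(r'\]\s*\[', ',', raw)
data = json.loads(merged)
```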
MrNobody33

As long as the original JSON file fits in memory, you can use raw_decode to pull the JSON lists out one at a time. raw_decode returns the parsed object along with the index where parsing stopped, so you can trim the string and continue until no data remains.

import csv
import json

with open('data.json') as f:
    rawdata = f.read()

decoder = json.JSONDecoder().raw_decode

with open('data.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=["text", "id"])
    writer.writeheader()
    while rawdata:
        # raw_decode returns (parsed_list, index_where_parsing_stopped)
        data, idx = decoder(rawdata)
        writer.writerows(data)
        rawdata = rawdata[idx:].lstrip()

print(open('data.csv').read())
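The same raw_decode loop can also feed pandas instead of the csv module, accumulating every list into one DataFrame before writing the CSV. A sketch, assuming the concatenated-arrays layout from the question with the data inlined as a string:

```python
import json
import pandas as pd

# Inline stand-in for the concatenated file contents
rawdata = ('[{"text": "Tweet 01", "id": "001"}]\n'
           '[{"text": "Tweet 02", "id": "002"}]')

decoder = json.JSONDecoder().raw_decode
records = []
while rawdata:
    # raw_decode returns (parsed_list, index_where_parsing_stopped)
    chunk, idx = decoder(rawdata)
    records.extend(chunk)
    rawdata = rawdata[idx:].lstrip()

# Strings from the parsed dicts are preserved, so "001" keeps its zeros
df = pd.DataFrame(records)
df.to_csv('data.csv', index=False)
```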
tdelaney