
While making persistent API calls, I am looping over a large list in order to reorganize my data and save it to a file, like so:

import json
from collections import defaultdict

for item in music:
    # initialize data container
    data = defaultdict(list)
    genre = item[0]
    artist = item[1]
    track = item[2]
    # in actual code, api calls happen here, processing genre, artist and track
    data['genre'] = genre
    data['artist'] = artist
    data['track'] = track
    # use 'a' - append mode
    with open('data.json', mode='a') as f:
        f.write(json.dumps([data], indent=4))

NOTE: Since I have a window of one hour to make api calls (after which the token expires), I must save data to disk on the fly, inside the for loop.

The method above appends data to the data.json file, but the dumped lists are not comma-separated, and the file ends up populated like so:

[
  {
    "genre": "Alternative",
    "artist": "Radiohead",
    "track": "Ok computer"
  }
]
[
  {
    "genre": "Electronic",
    "artist": "Kraftwerk",
    "track": "Computer World"
  }
]

So, how can I dump my data so that I end up with the lists separated by commas, inside one valid JSON array?

8-Bit Borges
  • Your representation doesn't make sense. Either you want `{[...], [...]}`, or {...}\n{...}... so which is it? – cs95 May 25 '18 at 04:48
  • why would you do `[data]` – Prakash Palnati May 25 '18 at 04:49
  • to avoid errors like `ValueError: Extra data: line 21452 column 2 - line 95339735 column 2 (char 649677 - 2869023268)` when an api call is made and returns no dictionary at all. – 8-Bit Borges May 25 '18 at 04:49
  • @coldspeed any representation which can be indexed and retrieved later. – 8-Bit Borges May 25 '18 at 04:52
  • Rule of thumb: If you're opening the same file in every iteration of your loop, you're doing something wrong. Start by building your result, then dump it to the file. Don't do both at the same time. – Aran-Fey May 25 '18 at 04:53
  • data must be saved to file on a regular basis because I'm looping thru API results, whose connection breaks after some time and must be resumed from file's last entry. – 8-Bit Borges May 25 '18 at 04:56
  • Don't indent your dumped json. Each line will then contain a valid json document that can be parsed independently. This is called the jsonl format. If you must indent then this answer will help you load your data https://stackoverflow.com/a/50384432/529630 – Dunes May 25 '18 at 07:28
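A minimal sketch of the JSON Lines idea from the last comment (the `music` list and file name here are hypothetical): each iteration appends one compact JSON object per line, so every line is independently parseable and the loop can stop at any point without corrupting the file.

```python
import json
import os

music = [('Alternative', 'Radiohead', 'OK Computer'),
         ('Electronic', 'Kraftwerk', 'Computer World')]

# start fresh for this demo; the real loop would just keep appending
if os.path.exists('data.jsonl'):
    os.remove('data.jsonl')

for genre, artist, track in music:
    record = {'genre': genre, 'artist': artist, 'track': track}
    # no indent: one compact JSON document per line (the "jsonl" format)
    with open('data.jsonl', mode='a') as f:
        f.write(json.dumps(record) + '\n')

# reading back: each line parses on its own, so resuming is easy
with open('data.jsonl') as f:
    records = [json.loads(line) for line in f]
```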

3 Answers


One approach is to read the JSON file back in before each write, append the new record, and rewrite the whole list.

Ex:

import json
from collections import defaultdict

for item in music:
    # initialize data container
    data = defaultdict(list)
    genre = item[0]
    artist = item[1]
    track = item[2]
    data['genre'] = genre
    data['artist'] = artist
    data['track'] = track

    # Read the existing JSON list (start fresh if the file is missing)
    try:
        with open('data.json', mode='r') as f:
            fileData = json.load(f)
    except FileNotFoundError:
        fileData = []
    fileData.append(data)

    # Rewrite the whole list so the file is always valid JSON
    with open('data.json', mode='w') as f:
        f.write(json.dumps(fileData, indent=4))
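As a quick check of this read-modify-write pattern, it can be exercised on a hypothetical two-item `music` list; because the whole list is rewritten on every pass, the file is valid JSON after each iteration, not just at the end.

```python
import json
import os

music = [['Alternative', 'Radiohead', 'OK Computer'],
         ['Electronic', 'Kraftwerk', 'Computer World']]

# start from a clean slate for this demo
if os.path.exists('data.json'):
    os.remove('data.json')

for genre, artist, track in music:
    data = {'genre': genre, 'artist': artist, 'track': track}
    # load whatever has been saved so far (empty list on the first pass)
    try:
        with open('data.json') as f:
            saved = json.load(f)
    except FileNotFoundError:
        saved = []
    saved.append(data)
    # rewrite the full list so the file is always well-formed JSON
    with open('data.json', mode='w') as f:
        json.dump(saved, f, indent=4)

with open('data.json') as f:
    result = json.load(f)
```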
Rakesh

Something like this would work:

import json

music = [['Alternative', 'Radiohead', 'Ok computer'], ['Electronic', 'Kraftwerk', 'Computer World']]


output = list()

for item in music:
    data = dict()
    genre = item[0]
    artist = item[1]
    track = item[2]
    data['genre'] = genre
    data['artist'] = artist
    data['track'] = track
    output.append(data)


# use 'w' so reruns overwrite the file instead of appending a second list
with open('data.json', mode='w') as f:
    f.write(json.dumps(output, indent=4))

My data.json contains:

[
    {
        "genre": "Alternative", 
        "track": "Ok computer", 
        "artist": "Radiohead"
    }, 
    {
        "genre": "Electronic",
        "track": "Computer World", 
        "artist": "Kraftwerk"
    }
]
user1596115
  • my problem here is that I cannot wait for the loop to end in order to save to file; it must be done inside the for loop, like in the example. 'music' is just a simplification: in the actual code there is an api call that processes music data, and I have a window of one hour until my token expires, so I must save data to disk while the api calls persist. I cannot append to output AFTER one hour... I hope I made myself clear. – 8-Bit Borges May 25 '18 at 05:58

For large datasets, pandas (for serializing) and pickle (for saving) work together like a charm.

import pandas as pd
from collections import defaultdict

df = pd.DataFrame()

for item in music:
    # initialize data container
    data = defaultdict(list)
    genre = item[0]
    artist = item[1]
    track = item[2]
    # in actual code, api calls happen here, processing genre, artist and track
    data['genre'] = genre
    data['artist'] = artist
    data['track'] = track
    # DataFrame.append was removed in pandas 2.0; concat a one-row frame instead
    df = pd.concat([df, pd.DataFrame([data])], ignore_index=True)
    # save on every iteration so progress survives a dropped connection
    df.to_pickle('data.pkl')
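To pick up where a broken connection left off, the pickle can be reloaded with `pd.read_pickle`, the standard counterpart to `to_pickle` (the file name and sample row below are illustrative):

```python
import pandas as pd

# build and save a one-row frame the same way the loop above does
df = pd.DataFrame([{'genre': 'Alternative', 'artist': 'Radiohead',
                    'track': 'OK Computer'}])
df.to_pickle('data.pkl')

# after a dropped connection, reload and see which entry was saved last
resumed = pd.read_pickle('data.pkl')
last_track = resumed['track'].iloc[-1]
```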
8-Bit Borges