
This may be redundant, but after reading previous posts and answers I still have not gotten my code to work. I have a very large file containing multiple json objects that are not delimited by any values:

{"_index": "1234", "_type": "11", "_id": "1234", "_score": 0.0, "fields": {"c_u": ["url.com"], "tawgs.id": ["p6427"]}}{"_index": "1234", "_type": "11", "_id": "786fd4ad2415aa7b", "_score": 0.0, "fields": {"c_u": ["url2.com"], "tawgs.id": ["p12519"]}}{"_index": "1234", "_type": "11", "_id": "5826e7cbd92d951a", "_score": 0.0, "fields": {"tawgs.id": ["p8453", "p8458"]}}

I've read that this is what a stream of concatenated JSON objects looks like, but I still can't open and parse the file to create a dataframe in python.

I tried something of the format of:

import json

def parse_objects(s):
    i = 0
    d = json.JSONDecoder()
    while True:
        try:
            obj, i = d.raw_decode(s, i)
        except ValueError:
            return
        yield obj

but it didn't work.

I've also tried a basic:

with open('output.json','r') as f:
    data = json.load(f)

but am thrown the error:

JSONDecodeError: Extra data: line 1 column 184 (char 183)

Trying json.loads() line by line with append didn't work either and returned an empty list []:

data = []
with open('es-output.json', 'r') as f:
    for line in f:
        try:
            data.append(json.loads(line))
        except json.decoder.JSONDecodeError:
            pass # skip this line 

Please help! Thanks in advance

aesthetics

2 Answers


This will try to decode the JSON stream inside s iteratively:

s = '''{"_index": "1234", "_type": "11", "_id": "1234", "_score": 0.0, "fields": {"c_u": ["url.com"], "tawgs.id": ["p6427"]}}{"_index": "1234", "_type": "11", "_id": "786fd4ad2415aa7b", "_score": 0.0, "fields": {"c_u": ["url2.com"], "tawgs.id": ["p12519"]}}{"_index": "1234", "_type": "11", "_id": "5826e7cbd92d951a", "_score": 0.0, "fields": {"tawgs.id": ["p8453", "p8458"]}}'''

import json

d = json.JSONDecoder()

idx = 0
while True:
    if idx >= len(s):
        break
    data, i = d.raw_decode(s[idx:])
    idx += i
    print(data)
    print('*' * 80)

Prints:

{'_index': '1234', '_type': '11', '_id': '1234', '_score': 0.0, 'fields': {'c_u': ['url.com'], 'tawgs.id': ['p6427']}}
********************************************************************************
{'_index': '1234', '_type': '11', '_id': '786fd4ad2415aa7b', '_score': 0.0, 'fields': {'c_u': ['url2.com'], 'tawgs.id': ['p12519']}}
********************************************************************************
{'_index': '1234', '_type': '11', '_id': '5826e7cbd92d951a', '_score': 0.0, 'fields': {'tawgs.id': ['p8453', 'p8458']}}
********************************************************************************
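Since the end goal is a dataframe, the same raw_decode loop can be wrapped in a reusable generator (a sketch; the whitespace-skipping step is an addition for the case where objects are separated by newlines, which the sample above doesn't have):

```python
import json

def iter_json_objects(s):
    """Yield each object from a string of concatenated JSON objects."""
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(s):
        # skip any whitespace between objects, if present
        while idx < len(s) and s[idx].isspace():
            idx += 1
        if idx >= len(s):
            break
        # raw_decode returns the parsed object and the index just past it
        obj, end = decoder.raw_decode(s[idx:])
        idx += end
        yield obj

s = '{"_id": "a", "_score": 0.0}{"_id": "b", "_score": 1.5}'
objs = list(iter_json_objects(s))
print(len(objs))  # 2
```

From there, the list of dicts can go straight into pandas, e.g. `pd.json_normalize(objs)` to flatten the nested `fields` dicts into dotted columns (assuming pandas is available).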
Andrej Kesely
  • so if my "s" value is a json file that isn't a string object because I used json.dump() to write the file, how would I initially convert my file to become a json string type? when I try to use json.dumps() to write my file, I get back an empty set – aesthetics Jul 17 '19 at 18:19
  • @aesthetics Just load the content of the file with JSON objects inside `s`: `s = open('your_file.txt', 'r').read()` – Andrej Kesely Jul 17 '19 at 18:20
  • @aesthetics I don't have any experience with elasticsearch, but you either already have some string containing JSON values or you will need to load that string from a file. – Andrej Kesely Jul 17 '19 at 18:23
  • as a next step, and just for clarification, would there be a way to flatten the json if the type is a string to get it into a dataframe? – aesthetics Jul 17 '19 at 18:30
  • @aesthetics That's a question for Pandas/NumPy specialists, but I bet there are some methods for loading data directly from JSON. You may open another question; these comments are not suitable for it. – Andrej Kesely Jul 17 '19 at 18:34

The problem is in the data itself: it contains three values without keys.

The first one is:

{"_index".... ["p6427"]}}

The second one is:

{"_index".... ["p12519"]}}

The third one is:

{"_index".... ["p8458"]}}

You should assign a key to each value, for example:

{
"k1":{"_index": "1234", "_type": "11", "_id": "1234", "_score": 0.0, "fields": {"c_u": ["url.com"], "tawgs.id": ["p6427"]}},

"k2":{"_index": "1234", "_type": "11", "_id": "786fd4ad2415aa7b", "_score": 0.0, "fields": {"c_u": ["url2.com"], "tawgs.id": ["p12519"]}},

"k3":{"_index": "11_20190714_184325_01", "_type": "11", "_id": "5826e7cbd92d951a", "_score": 0.0, "fields": {"tawgs.id": ["p8453", "p8458"]}}
}

This way the file is valid JSON and json.load() will parse it in one call.
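For illustration, once each object has a key the whole file parses with a single json.load/json.loads call (a minimal sketch with trimmed-down objects; the keys k1/k2 follow the example above):

```python
import json

# keyed structure as suggested above, shortened for brevity
keyed = '''{
"k1": {"_index": "1234", "_score": 0.0},
"k2": {"_index": "1234", "_score": 0.0}
}'''

data = json.loads(keyed)
print(sorted(data))  # ['k1', 'k2']
```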