
I am trying to unzip some .json.gz files, but gzip seems to add some characters to them, which makes the result unreadable as JSON.

What do you think is the problem, and how can I solve it?

If I use unzipping software such as 7zip to unzip the file, this problem disappears.

This is my code:

import gzip
import json

with gzip.open('filename', 'rb') as f:
    json_content = json.loads(f.read())

This is the error I get:

Exception has occurred: json.decoder.JSONDecodeError
Extra data: line 2 column 1 (char 1585)

I used this code:

with gzip.open ('filename', mode='rb') as f:
    print(f.read())

and realized that the file starts with b' (as shown below):

b'{"id":"tag:search.twitter.com,2005:5667817","objectType":"activity"

I think b' is what makes the file unworkable for the next stage. Do you have any solution to remove the b'? There are millions of these zipped files, and I cannot do that manually.

I uploaded a sample of these files at the following link: just a few json.gz files

martineau
Mike Sal
  • It's not because it starts with "b'". The `b` prefix indicates a bytes object, and the `'` quotes delimit the literal. – Geno Chen Feb 16 '19 at 17:36
  • @GenoChen, so you think b' does not cause the problem? Can you tell me what you think might cause it? – Mike Sal Feb 16 '19 at 17:41
  • @MikeSal Could you please provide the whole output of that `print` instead of **stripping just the first line**? – Geno Chen Feb 16 '19 at 17:47
  • @GenoChen here it is: https://docs.google.com/document/d/1ZEa5hXe3vvD5CtKAIGehDB3yfZ9l0m6J26S8OqtQUXw/edit?usp=sharing – Mike Sal Feb 16 '19 at 17:51
  • @MikeSal From my test, "line 2 column 1 (char 1585)" means the first line has 1584 chars, and the beginning of the second line generated the problem. However, I still don't know how to reproduce the "Extra data". – Geno Chen Feb 16 '19 at 17:56
  • Not a duplicate... Your file is a mix of two JSON objects. Split them. – Geno Chen Feb 16 '19 at 18:04
  • Mike: I reopened your question because it's not a strict duplicate of the other one. I think you should upload your `.gz` file somewhere and add a link to it to your question (because I no longer think the issue is just a text vs binary mode I/O problem). – martineau Feb 16 '19 at 22:04
  • @martineau. thanks a lot. ok. I will do it – Mike Sal Feb 17 '19 at 01:30
  • I don't know why you say it works if you decompress with 7zip. `json.load()` fails the same way with your 7zip-uncompressed file. – Charles Duffy Feb 17 '19 at 01:57
  • @CharlesDuffy can you think of any solutions? – Mike Sal Feb 17 '19 at 02:39
  • @CharlesDuffy I went over the question you marked as a duplicate. these two questions are not similar. I want to store the whole json file, but that question was about scraping some data from a json file. – Mike Sal Feb 17 '19 at 02:45
  • The essential aspects aren't about extracting the specific pieces but about parsing an input document with more than one object; the answers there, particularly the one by Dunes, can clearly be adopted. Anyhow, if you're having trouble applying the techniques taught there, the right thing to do is write a question that shows your effort to apply them and a specific issue encountered in the process, with a self-contained reproducer. – Charles Duffy Feb 17 '19 at 03:00
  • I went back and looked at the marked-duplicate question and its answers again. **Absolutely nothing** about the answer by Dunes is specific to cases where only a few specific fields are being scraped; every single element of that answer is directly and completely applicable to your question without any modification or adaptation needed. – Charles Duffy Feb 17 '19 at 04:56
  • @Charles Duffy: Please reopen this question so I can answer it. Doing what's in Dunes' answer to the duplicate you chose is not quite what's needed here, and I would like to post an answer to this question. – martineau Feb 17 '19 at 15:10
  • @martineau, done, as per your request. – Charles Duffy Feb 17 '19 at 16:07
  • ...that said, I disagree that the answer in question requires any modification to be applied here -- see it successfully decoding the OP's content at https://ideone.com/TPl5gY – Charles Duffy Feb 17 '19 at 20:59

1 Answer


The problem isn't the b prefix you're seeing with print(f.read()). That prefix just means the data is a bytes object (a sequence of raw byte values) rather than a regular Python str, and json.loads() accepts either (in Python 3.6+). The JSONDecodeError occurs because the data in the gzipped file isn't a single valid JSON document, which is required. The format looks like something known as JSON Lines, with one JSON object per line, which the Python standard library json module doesn't (directly) support.
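To see that the b prefix is harmless on its own, here's a quick check (my own illustration, not from the original post) showing json.loads() parsing a bytes object directly, which works in Python 3.6 and later:

```python
import json

# json.loads() accepts both str and bytes (Python 3.6+),
# so the b'' prefix shown by print() isn't what breaks parsing.
as_bytes = b'{"id": "tag:search.twitter.com,2005:5667817", "objectType": "activity"}'
as_text = as_bytes.decode('utf-8')

assert json.loads(as_bytes) == json.loads(as_text)
print(json.loads(as_bytes)['objectType'])  # activity
```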

Dunes' answer to the question that @Charles Duffy marked this as a duplicate of (at one point) wouldn't have worked as presented because of this formatting issue. However, from the sample file you linked in your question, it looks like there is a valid JSON object on each line of the file. If that's true of all of your files, then a simple workaround is to process each file line by line.

Here's what I mean:

import json
import gzip


filename = '00_activities.json.gz'  # Sample file.

json_content = []
with gzip.open(filename, 'rb') as gzip_file:
    for line in gzip_file:  # Read one line.
        line = line.rstrip()
        if line:  # Any JSON data on it?
            obj = json.loads(line)
            json_content.append(obj)

print(json.dumps(json_content, indent=4))  # Pretty-print the parsed data.

Note that the printed output shows what the data might have looked like as valid JSON.
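As an aside, opening the file in text mode sidesteps the bytes-vs-text question entirely, since gzip then decodes each line to str for you. A minimal variant of the code above (with a small self-made two-line sample standing in for the real data, since I don't have your files):

```python
import json
import gzip

filename = 'sample.json.gz'  # Stand-in name for a real .json.gz file.

# Write a tiny two-line JSON Lines sample so the example is self-contained.
with gzip.open(filename, 'wt', encoding='utf-8') as f:
    f.write('{"id": 1}\n{"id": 2}\n')

json_content = []
with gzip.open(filename, 'rt', encoding='utf-8') as gzip_file:  # Text mode.
    for line in gzip_file:  # Each line arrives as a str, not bytes.
        line = line.strip()
        if line:
            json_content.append(json.loads(line))

print(json_content)  # [{'id': 1}, {'id': 2}]
```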

martineau
  • If the OP has jsonl, I would have closed with a different duplicate -- wonder if maybe I was looking at `jq .` – Charles Duffy Feb 17 '19 at 17:59
  • See https://stackoverflow.com/questions/50475635/python-loading-jsonl-file-as-json-objects as one such standing duplicate. – Charles Duffy Feb 17 '19 at 18:00
  • ...that said, the answer by Dunes should work in this case as well -- the code searching past whitespace should work to find code on the next line just as easily as it finds code after a single regular space or a tab, unless perhaps `re.MULTILINE` or a similar flag needed to be set. – Charles Duffy Feb 17 '19 at 18:33
  • @Charles: Yes, I agree it could be a dup of that JSON Lines question (but I don't think Dunes' answer to the other one, as posted, would be able to parse them). To be honest, I didn't realize there was any kind of official [JSONL format](http://jsonlines.org/), but was thinking recently there ought to be (or the official JSON format should be extended to include it—as has been done to some degree in the past—because it's so common). – martineau Feb 17 '19 at 18:50
  • See the function from Dunes' answer successfully decoding the OP's content with no changes whatsoever at https://ideone.com/TPl5gY – Charles Duffy Feb 17 '19 at 20:58
  • @Charles: While it is possible (I just did it) to use Dunes' answer to decode the contents of the files, doing so requires some additional scaffolding to deal with the bytes vs text issue that's caused by the JSON data source being in a gzipped file—so, while there are similarities, I personally still don't think it's a duplicate question according to this site's definition. I also strongly suspect the OP might have a hard time using it even though it's not really that hard—but who knows... – martineau Feb 17 '19 at 21:22
  • Switch from `'rb'` to `'rt'` (in Python 3), and the "bytes-vs-text issue" is gone. If it's *not* a duplicate, that's an indicator that the other question needs editing to make it more general. (That said, we *do* have the ability to edit the duplicate list, and so could have this listed both of the other question we're already discussing, and one that focuses narrowly on the bytes-vs-text issue). – Charles Duffy Feb 18 '19 at 04:18
  • @Charles: It's not just a simple matter of opening the gzipped file in `'rt'` mode because that cannot be passed directly to Dunes' `decode_stacked()` generator function, which expects a string. The entire file can be passed to it with `gzip_file.read().decode('utf-8')`, but that requires reading the entire file into memory—something less-than-ideal, especially if dealing with a huge file (often the reason they were compressed). The code currently posted in my answer doesn't require that, only that there's one object per line (JSONL format)—so I don't think your scheme's a good one. – martineau Feb 18 '19 at 17:07
  • I agree that Dunes' answer isn't ideal with huge files, but that's a reason for a new question focused *specifically on that failing*, which this question isn't. And moreover, we *do* -- as aforementioned -- already have other JSONL-based duplicates. – Charles Duffy Feb 18 '19 at 17:11
  • @Charles: I'm just going by what's currently in the OP's question and the single sample `.gz` file they provided. I will defer to you since you greatly out-rank me—i.e. your call. – martineau Feb 18 '19 at 17:19