6

I downloaded the file from the Yelp dataset challenge website (https://www.yelp.com/dataset_challenge). The download succeeded, but I cannot open the file because it has no extension. It is about 4 GB. I thought it might be a JSON file, because from what I found when searching around, it used to be one. However, I can't figure out how to open it or convert it to CSV. I'd like to do some analysis on this data with Python. Can anyone help me? Thank you.

3 Answers

7

I was having the same issue. Turns out that the file inside the tar (the one without the extension) is a tar file as well - so the download is basically a tar file inside a tar file. After extracting the original file, add the tar extension to it, and then extract that. After extracting that, you'll have all the different json files for the data set.
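
The two extraction steps can also be done from Python without renaming anything. This is only a minimal sketch: `extract_archive` is a hypothetical helper name, and the file names in the usage comments are assumptions, not taken from the download. Mode `"r:*"` asks `tarfile` to detect the compression itself, so it should work whether the inner file is a plain tar or a gzip-compressed one.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a tar archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:*" auto-detects plain tar, gzip, bz2, or xz, whatever the file is named.
    with tarfile.open(archive_path, "r:*") as tar:
        names = tar.getnames()
        tar.extractall(dest)  # only extract archives from a source you trust
    return names


# Hypothetical usage (file names are assumptions):
# extract_archive("yelp_dataset.tar", "outer")         # the downloaded file
# extract_archive("outer/yelp_dataset", "yelp_json")   # the extensionless inner file
```

If the second call raises `tarfile.ReadError`, the inner file is probably not a tar archive at all.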

Bjafri5
2

The GitHub project for Yelp dataset examples has a few samples; one of them is "json_to_csv_converter", which should help you do what you're asking for.

Yelp's Academic Dataset Examples
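
For a rough idea of what such a conversion involves, here is a minimal standard-library sketch. It is not the `json_to_csv_converter` script from that project. It assumes each extracted `.json` file holds one JSON object per line, which is worth checking on the first few lines of your copy, and `json_lines_to_csv` is a hypothetical name. It reads the file twice so that a multi-gigabyte file never has to fit in memory.

```python
import csv
import json


def json_lines_to_csv(json_path, csv_path):
    """Convert a file with one JSON object per line into a flat CSV file."""
    # First pass: collect every top-level key so the header covers all records.
    fieldnames = []
    seen = set()
    with open(json_path, encoding="utf-8") as src:
        for line in src:
            if line.strip():
                for key in json.loads(line):
                    if key not in seen:
                        seen.add(key)
                        fieldnames.append(key)

    # Second pass: write one row per record; nested values are kept as JSON text.
    count = 0
    with open(json_path, encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames, restval="")
        writer.writeheader()
        for line in src:
            if not line.strip():
                continue
            record = json.loads(line)
            row = {key: json.dumps(value) if isinstance(value, (dict, list)) else value
                   for key, value in record.items()}
            writer.writerow(row)
            count += 1
    return count  # number of records written
```

Nested fields are not flattened into separate columns here; they end up as JSON strings in a single cell.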

Let me know if this helps!

William Cross
  • I looked at this, but I was under the impression it had to be a json file. I will try it and I will see if this works. Thank you. – Jonathan Villegas Apr 26 '17 at 02:47
  • From what I can tell, the download is a TAR file (compressed like a ZIP folder). You may need to extract the contents before you can view the datasets. Make sure your computer is set to show all file extensions. I haven't looked at the data myself, but it sounds like the file may have an extension that is simply not showing on your computer. I could be wrong, but this is my gut feeling. – William Cross Apr 26 '17 at 02:50
  • I used 7-zip to extract it. I'm not sure if that is the right tool or not. It seemed to work, but then the file came out with no extension. I wanted to try to view the contents in some sort of plain text, but the file is too large. When I open it in an IDE such as pycharm, it asks what kind of file it is, and if I pick text or JSON, it still displays with a lot of weird characters. Thank you for your response. – Jonathan Villegas Apr 26 '17 at 05:18
  • Unfortunately, I still haven't had any luck. I am going to try to use the Python function/method this afternoon, but I'm not sure if it will work if it's not specifically a json file. I'm hoping it might, but I'll try it and see. – Jonathan Villegas Apr 26 '17 at 17:56
  • From what I can see, there should be a file called yelp_academic_dataset.json, which would be the dataset in JSON format. I haven't downloaded the sample data so I cannot confirm, but this is what the documentation references. – William Cross Apr 27 '17 at 02:07
  • Turns out it's a .tar file inside of a .tar file. After changing the extension and unzipping the file again, the .json file is extracted. – Jonathan Villegas May 01 '17 at 18:08
0

Sorry for responding to an old question, but the issue still exists. It's certainly not a tar inside a tar; it's a tar.gz without the gz extension. The backend function probably has a bug.

To open it the usual way, just rename the file from yelp_dataset.tar to yelp_dataset.tar.gz.

You don't have to rename it, though. The Python 3 code below worked fine for me:

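import tarfile

# 'r:gz' forces gzip decompression even though the file name ends in .tar
with tarfile.open('yelp_dataset.tar', 'r:gz') as tar:
    # List the names of all members in the archive
    print([f.name for f in tar.getmembers()])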

The result is:

['.',
 './yelp_academic_dataset_user.json',
 './yelp_academic_dataset_tip.json',
 './yelp_academic_dataset_checkin.json',
 './Dataset_User_Agreement.pdf',
 './yelp_academic_dataset_business.json',
 './yelp_academic_dataset_review.json']
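
To look at a few records without extracting anything to disk, a member can be read straight from the archive. This is a sketch under assumptions: `peek_records` is a hypothetical helper, and it assumes the member stores one JSON object per line.

```python
import itertools
import json
import tarfile


def peek_records(archive_path, member_name, limit=5):
    """Return the first `limit` JSON records of one member of the archive."""
    # "r:*" lets tarfile detect the gzip compression by itself.
    with tarfile.open(archive_path, "r:*") as tar:
        member_file = tar.extractfile(member_name)  # file-like object yielding bytes
        if member_file is None:
            raise ValueError(f"{member_name} is not a regular file")
        first_lines = itertools.islice(member_file, limit)
        return [json.loads(line) for line in first_lines]


# Usage with a member name from the listing above:
# peek_records('yelp_dataset.tar', './yelp_academic_dataset_business.json')
```

The member name has to match the listing exactly, including the leading `./`.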
Ivan Usalko