-1

I have a JSON file which contains 46k+ tweets in english and other languages as well which I want to save as csv file. Below is a part of json file.

    [{"user_id": 938118866135343104, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @PTISPOfficial: پاکستان تحریک انصاف کے وائس چیئرمین شاہ محمود قریشی  بغیر کسی پروٹوکول کے پاکستان سپر لیگ کا میچ دیکھنے کے لئے اسٹیڈیم م…", "tweet_id": 976166125502427136}
{"user_id": 959235642, "date_time": "03/20/2018 18:38:35", "tweet_content": "At last, Pakistan Have Witnessed The Most Thrilling Match Of Cricket In Pakistan, The Home. \n\n#PZvQG \n#ABC", "tweet_id": 976166125535973378}
{"user_id": 395163528, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @thePSLt20: SIX! 19.4 Liam Dawson to Anwar Ali\nWatch ball by ball highlights at (link removed)\n\n#PZvQG #HBLPSL #PSL2018 @_crici…", "tweet_id": 976166126202839040}
{"user_id": 3117825702, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @JeremyMcLellan: Rumor has it Amir Liaquat isn’t allowed to play in #PSL2018 because he keeps switching teams every week.", "tweet_id": 976166126483902466}
{"user_id": 3310967346, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @daniel86cricket: Peshawar beat Quetta by 1 run in one of the best T20 thrillers. PSL played in front of full house in Lahore Pakistan i…", "tweet_id": 976166126559354880}
{"user_id": 701494826194354179, "date_time": "03/20/2018 18:38:35", "tweet_content": "I wanted a super over\n#PZvQG", "tweet_id": 976166126836178944}
{"user_id": 347132028, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @hinaparvezbutt: Congratulations Peshawar Zalmi over great win but Quetta Gladiators won our hearts ♥️  #PZvQG", "tweet_id": 976166126685171713}
{"user_id": 3461853618, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @walterMiitty: It's harder than I thought to tell the truth\nIt's gonna leave you in pieces\nAll alone with your demons\nAnd I know that we…", "tweet_id": 976166126924201986}]

I followed this solution to convert it into CSV but got invalid syntax error on an urdu tweet. I also tried this:

    import json
with open("PeshVsQuetta.json") as f:
all_tweets = []
for line in f:
    text_dict = json.loads(line)
    all_tweets.append(text_dict)

print(all_tweets[0]['tweet_content'])

which gives me following error.

    UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 148: character maps to <undefined>

I even saved json file as txt file and tried this:

    import pandas as pd
    from ast import literal_eval
    columns = ['Tweet ID','Author ID','Tweet','Time']
    df1 = pd.DataFrame(columns = columns)
    f = open('PeshvsQuetta.txt',encoding = 'utf-8')
    counter = 1
    for line in f:
         if(counter != 1):
             s1 = literal_eval(line)
             ser = pd.Series([s1['tweet_id'],s1['user_id'],s1['tweet_content'],s1["date_time"]],index=['Tweet ID','Author ID','Tweet','Time'])
             df1 = df1.append(ser,ignore_index=True)
    counter = counter + 1
    df1.to_csv('PeshVsQuetta1.csv', encoding='utf-8',index=False,columns = columns)

But the resulting csv file has each series saved in one cell and it has lots of empty rows and some tweets are saved over multiple rows. Below is the image.

enter image description here Any help would be truly appreciated.

cs95
  • 274,032
  • 76
  • 480
  • 537
farheen
  • 101
  • 1
  • 2
  • 12
  • have you tried `pd.read_csv()`? – Ashish Acharya May 28 '18 at 01:35
  • I've closed, but it isn't a 100% duplicate. – cs95 May 28 '18 at 01:37
  • Hmm, I'm going to reopen this again, the duplicate doesn't cut it for me. – cs95 May 28 '18 at 01:43
  • @coldspeed for me it is an exact duplicate, the error is the same and the solution is the same. – eyllanesc May 28 '18 at 01:47
  • @eyllanesc I agree, but there were a couple of other things with OP's code that needed fixing. Okay, I've exhausted my vote for this question, but I'll try and see if I can have it closed. – cs95 May 28 '18 at 01:47
  • @eyllanesc Relax, I didn't do it to erase your vote, I only had a change in heart. Nothing underhand or unethical here. [I'm trying to get it closed again](https://chat.stackoverflow.com/transcript/message/42698663#42698663) now. – cs95 May 28 '18 at 01:52
  • @eyllanesc Sure, no problem. By the way, [is it still a dupe](https://stackoverflow.com/questions/50558077/convert-json-to-csv-with-pandas?noredirect=1#comment88127169_50558086)? – cs95 May 28 '18 at 01:54
  • If my question is that simple or duplicate then please provide me a solution to it. Down vote doesn't help. – farheen May 28 '18 at 02:28
  • @farheen provide a [mcve] – eyllanesc May 28 '18 at 03:19
  • Try forcing an encoding with more character support than standard ASCII (it looks like you have also arabic characters). Or test each row and keep only the one that doesn't throw the exception (plain UTF-8) – DDS May 28 '18 at 09:08

1 Answers1

3

You should just be able to use Pandas as follows:

import pandas as pd

with open('PeshVsQuetta.json', encoding='utf-8-sig') as f_input:
    df = pd.read_json(f_input)

df.to_csv('PeshVsQuetta.csv', encoding='utf-8', index=False)

This assumes that your JSON file contains a BOM at the start. For the data you have given above, this produces the following CSV file:

date_time,tweet_content,tweet_id,user_id
2018-03-20 18:38:35,RT @PTISPOfficial: پاکستان تحریک انصاف کے وائس چیئرمین شاہ محمود قریشی  بغیر کسی پروٹوکول کے پاکستان سپر لیگ کا میچ دیکھنے کے لئے اسٹیڈیم م…,976166125502427136,938118866135343104
2018-03-20 18:38:35,"At last, Pakistan Have Witnessed The Most Thrilling Match Of Cricket In Pakistan, The Home. 

#PZvQG 
#ABC",976166125535973378,959235642
2018-03-20 18:38:35,"RT @thePSLt20: SIX! 19.4 Liam Dawson to Anwar Ali
Watch ball by ball highlights at (link removed)

#PZvQG #HBLPSL #PSL2018 @_crici…",976166126202839040,395163528
2018-03-20 18:38:35,RT @JeremyMcLellan: Rumor has it Amir Liaquat isn’t allowed to play in #PSL2018 because he keeps switching teams every week.,976166126483902466,3117825702
2018-03-20 18:38:35,RT @daniel86cricket: Peshawar beat Quetta by 1 run in one of the best T20 thrillers. PSL played in front of full house in Lahore Pakistan i…,976166126559354880,3310967346
2018-03-20 18:38:35,"I wanted a super over
#PZvQG",976166126836178944,701494826194354179
2018-03-20 18:38:35,RT @hinaparvezbutt: Congratulations Peshawar Zalmi over great win but Quetta Gladiators won our hearts ♥️  #PZvQG,976166126685171713,347132028
2018-03-20 18:38:35,"RT @walterMiitty: It's harder than I thought to tell the truth
It's gonna leave you in pieces
All alone with your demons
And I know that we…",976166126924201986,3461853618

Note: some of your fields contain newlines, so the output might look a bit odd. An application reading this though will handle it correctly (as long as you tell it the encoding is UTF-8 when importing it)

Martin Evans
  • 37,882
  • 15
  • 62
  • 83
  • Thanks. The csv file is handled correctly when I read it in data frame but this data is collected for a research project so I would like it to be in proper format. – farheen May 28 '18 at 19:25
  • The output I have shown is correct. Why do you not think it is in a proper format? – Martin Evans May 28 '18 at 19:27