8

I have a data like this in Pandas dataframe

   id     import_id              investor_id     loan_id      meta
   35736  unremit_loss_100312         Q05         0051765139  {u'total_paid': u'75', u'total_expense': u'75'}
   35737  unremit_loss_100313         Q06         0051765140  {u'total_paid': u'77', u'total_expense': u'78'}
   35739  unremit_loss_100314         Q06         0051765141  {u'total_paid': u'80', u'total_expense': u'65'}

How to sort based on total_expense which is value of json field
ex: total_expense on meta field

Output should be

id     import_id              investor_id     loan_id      meta
35739  unremit_loss_100314         Q06         0051765141  {u'total_paid': u'80', u'total_expense': u'65'}
35736  unremit_loss_100312         Q05         0051765139  {u'total_paid': u'75', u'total_expense': u'75'}
35737  unremit_loss_100313         Q06         0051765140  {u'total_paid': u'77', u'total_expense': u'78'}
Jameel Grand
  • 1,903
  • 13
  • 30
  • Try `df = df.iloc[df['meta'].str.get('total_expense').argsort().values, :]` and see if that works. If not, you will need to show us more than 1 row of data. – cs95 Apr 09 '19 at 07:47
  • it is not working, all rows are becoming the same values if I use your logic. question modified and added more rows @coldspeed – Jameel Grand Apr 10 '19 at 06:39
  • I've provided two options. It seems you need to convert the extracted values to integer before sorting. Take a look and please clearly let me know what the issue is if it didn't work. Thanks. – cs95 Apr 11 '19 at 19:04
  • @JameelGrand I have attempted a simple way to achieve it, does it solve your problem? – Bhishan Poudel Apr 12 '19 at 19:52

4 Answers4

4

Setup and Preprocessing

import ast
import numpy as np

if isinstance(x.at[0, 'meta'], str):
    df['meta'] = df['meta'].map(ast.literal_eval)

str.get with Series.argsort

df.iloc[df['meta'].str.get('total_expense').astype(int).argsort()]

      id            import_id investor_id   loan_id                                         meta
2  35739  unremit_loss_100314         Q06  51765141  {'total_paid': '80', 'total_expense': '65'}
0  35736  unremit_loss_100312         Q05  51765139  {'total_paid': '75', 'total_expense': '75'}
1  35737  unremit_loss_100313         Q06  51765140  {'total_paid': '77', 'total_expense': '78'}

List Comprehension

df.iloc[np.argsort([int(x.get('total_expense', '-1')) for x in df['meta']])]

      id            import_id investor_id   loan_id                                         meta
2  35739  unremit_loss_100314         Q06  51765141  {'total_paid': '80', 'total_expense': '65'}
0  35736  unremit_loss_100312         Q05  51765139  {'total_paid': '75', 'total_expense': '75'}
1  35737  unremit_loss_100313         Q06  51765140  {'total_paid': '77', 'total_expense': '78'}

If you need to handle NaNs/missing data, use

u = [  
  int(x.get('total_expense', '-1')) if isinstance(x, dict) else -1 
  for x in df['meta']
]
df.iloc[np.argsort(u)]

      id            import_id investor_id   loan_id                                         meta
2  35739  unremit_loss_100314         Q06  51765141  {'total_paid': '80', 'total_expense': '65'}
0  35736  unremit_loss_100312         Q05  51765139  {'total_paid': '75', 'total_expense': '75'}
1  35737  unremit_loss_100313         Q06  51765140  {'total_paid': '77', 'total_expense': '78'}
cs95
  • 274,032
  • 76
  • 480
  • 537
2

Use:

print (df)
      id            import_id investor_id   loan_id  \
0  35736  unremit_loss_100312         Q05  51765139   
1  35736  unremit_loss_100312         Q05  51765139   
2  35736  unremit_loss_100312         Q05  51765139   

                                               meta  
0   {u'total_paid': u'75', u'total_expense': u'75'}  
1   {u'total_paid': u'75', u'total_expense': u'20'}  
2  {u'total_paid': u'75', u'total_expense': u'100'}  

import ast

df['meta'] = df['meta'].apply(ast.literal_eval)

df = df.iloc[df['meta'].str['total_expense'].astype(int).argsort()]

print (df)
      id            import_id investor_id   loan_id  \
1  35736  unremit_loss_100312         Q05  51765139   
0  35736  unremit_loss_100312         Q05  51765139   
2  35736  unremit_loss_100312         Q05  51765139   

                                           meta  
1   {'total_paid': '75', 'total_expense': '20'}  
0   {'total_paid': '75', 'total_expense': '75'}  
2  {'total_paid': '75', 'total_expense': '100'} 

If possible if missing total_expense key for some row convert missing values to some integer lower like all another values, like -1 for first position of these rows:

print (df)
      id            import_id investor_id   loan_id  \
0  35736  unremit_loss_100312         Q05  51765139   
1  35736  unremit_loss_100312         Q05  51765139   
2  35736  unremit_loss_100312         Q05  51765139   

                                              meta  
0  {u'total_paid': u'75', u'total_expense': u'75'}  
1  {u'total_paid': u'75', u'total_expense': u'20'}  
2                           {u'total_paid': u'75'} 

df['meta'] = df['meta'].apply(ast.literal_eval)


df = df.iloc[df['meta'].str.get('total_expense').fillna(-1).astype(int).argsort()]
print (df)
      id            import_id investor_id   loan_id  \
2  35736  unremit_loss_100312         Q05  51765139   
1  35736  unremit_loss_100312         Q05  51765139   
0  35736  unremit_loss_100312         Q05  51765139   

                                          meta  
2                         {'total_paid': '75'}  
1  {'total_paid': '75', 'total_expense': '20'}  
0  {'total_paid': '75', 'total_expense': '75'}  

Another solution:

df['new'] = df['meta'].str.get('total_expense').astype(int)
df = df.sort_values('new').drop('new', axis=1)
jezrael
  • 629,482
  • 62
  • 918
  • 895
  • Thank you for the response , this is making all my rows to the same value. ie. all my id field values becoming same number, Updated question with more rows. – Jameel Grand Apr 10 '19 at 06:42
  • @JameelGrand - Can you add output after my solution to question? – jezrael Apr 10 '19 at 06:43
  • @JameelGrand - added new solution, can you check it? – jezrael Apr 10 '19 at 06:45
  • no luck :( it is giving unsorted dataframe and df['meta'].str.get('total_expense') is giving all Nan values , not actually taking values – Jameel Grand Apr 10 '19 at 07:35
  • @JameelGrand - do you use `df['meta'] = df['meta'].apply(ast.literal_eval)` like first step? and then `df['new'] = df['meta'].str.get('total_expense').astype(int)` ? – jezrael Apr 10 '19 at 07:36
1

EDIT 2:

Found a slightly better way to do this without using apply:

from pandas.io.json import json_normalize

df = pd.concat([df, json_normalize(df['meta'])], axis = 1)\
       .sort_values(by = 'total_expense')\
       .drop(columns = ['total_paid', 'total_expense'])

EDIT:

df = pd.concat([df, df['meta'].apply(pd.Series)], axis = 1).sort_values(by = 'total_expense').drop(columns = ['total_paid', 'total_expense'])

If you want it to look like the original, just drop the columns you concat after sorting.

Original:

df = pd.concat([df, df['meta'].apply(pd.Series)], axis = 1).drop(columns ='meta').sort_values(by = 'total_expense')

df['meta'].apply(pd.Series) makes the dicts in the meta column into its own df. We can concat it with its original, drop the meta column (as its redundant), then sort the values by 'total expense'

Ben Pap
  • 2,434
  • 1
  • 5
  • 16
0

Using regex:

df = pd.read_clipboard(r'\s\s+')
pattern = r"""u'total_expense': u'([0-9.]+)'"""
df['total_expense'] = df.meta.str.extract(pattern)
df.sort_values('total_expense')

Using apply:

df['total_expense'] = df.meta.apply(eval).apply(
                        lambda x: x.get('total_expense', -1))
df.sort_values('total_expense')

Output:

      id            import_id investor_id   loan_id  \
2  35739  unremit_loss_100314         Q06  51765141   
0  35736  unremit_loss_100312         Q05  51765139   
1  35737  unremit_loss_100313         Q06  51765140   

                                              meta total_expense  
2  {u'total_paid': u'80', u'total_expense': u'65'}            65  
0  {u'total_paid': u'75', u'total_expense': u'75'}            75  
1  {u'total_paid': u'77', u'total_expense': u'78'}            78  
Bhishan Poudel
  • 1
  • 9
  • 63
  • 108