2

Summary

In short, I need to extract data from a pandas series containing individual OrderedDicts. So far progress has been good but I have now hit a stumbling block.

When I define my own dataframe for demonstration purposes on Stack Overflow I'm able to use the OrderedDict indexing functionality to find the data that I'm after within an OrderedDict. However, when I work with real data where I'm not defining the OrderedDict within the dataframe I have to parse the OrderedDict via the standard Json package using a function.

The OrderedDicts I'm working with have multiple nested hierarchies that can manipulated the usual way of...

from collections import OrderedDict

example = OrderedDict([('attributes', OrderedDict([('type', 'Name'), ('url', 'URLHERE')])), ('UserRole', OrderedDict([('attributes', OrderedDict([('type', 'UserRole'), ('url', 'URLHERE')])), ('Name', 'Telephone Sales')]))])

print(example['UserRole']['Name'])

The above code will result in 'Telephone Sales'. However, this only works when I have defined the DataFrame manually for the example as I have to use the collections.OrderedDict Package without the need to parse.

Background

Below is some code I prepared for StackOverflow that loosely demonstrates my problem.

import pandas as pd
import json
from collections import OrderedDict

# Settings
pd.set_option('display.max_colwidth', -1)


# Functions
def extract_odict_item(odict, key_1, key_2=None):
    data = json.dumps(odict)
    final_data = json.loads(data)

    if key_2 is None:
        if final_data is not None:
            return final_data[key_1]
        else:
            return None

    elif key_2 is not None:
        if final_data is not None:
            return final_data[key_1][key_2]
        else:
            return None

# Data
accounts = [
    OrderedDict([('attributes', OrderedDict([('type', 'Account'), ('url', 'URLHERE')])), ('Name', 'Supermarket'), ('AccountNumber', 'ACC1234'), ('MID__c', '123456789')]),
    OrderedDict([('attributes', OrderedDict([('type', 'Account'), ('url', 'URLHERE')])), ('Name', 'Bar'), ('AccountNumber', 'ACC9876'), ('MID__c', '987654321')]),
    OrderedDict([('attributes', OrderedDict([('type', 'Account'), ('url', 'URLHERE')])), ('Name', 'Florist'), ('AccountNumber', 'ACC1298'), ('MID__c', '123459876')])
]

owner = [
    OrderedDict([('attributes', OrderedDict([('type', 'Name'), ('url', 'URLHERE')])), ('UserRole', OrderedDict([('attributes', OrderedDict([('type', 'UserRole'), ('url', 'URLHERE')])), ('Name', 'Telephoone Sales')]))]),
    OrderedDict([('attributes', OrderedDict([('type', 'Name'), ('url', 'URLHERE')])), ('UserRole', OrderedDict([('attributes', OrderedDict([('type', 'UserRole'), ('url', 'URLHERE')])), ('Name', 'Field Sales')]))]),
    OrderedDict([('attributes', OrderedDict([('type', 'Name'), ('url', 'URLHERE')])), ('UserRole', OrderedDict([('attributes', OrderedDict([('type', 'UserRole'), ('url', 'URLHERE')])), ('Name', 'Online Sale')]))])
]

# Dataframe
df = pd.DataFrame({'ConvertedAccounts': accounts,
                   'Owner': owner
                   })

# Extract data from OrderedDict using usual indexing
df['MerchantID'] = df['ConvertedAccounts'].apply(lambda x: x['MID__c'])
df['UserRole'] = df['Owner'].apply(lambda x: x['UserRole']['Name'])

# Extract data from OrderedDict using function
df['extracted_MerchantID'] = df['ConvertedAccounts'].apply(lambda x: extract_odict_item(x, 'MID__c'))
df['extracted_UserRole'] = df['Owner'].apply(
    lambda x: extract_odict_item(x, 'UserRole', 'Name'))

# Drop junk columns
df = df.drop(columns=['ConvertedAccounts', 'Owner'])

print(df)

In the code above I have the function extract_odict_item() which I can use to extract data from each individual OrderedDict within the dataframe and put it into a new column as long as I specify what I want. However, I want to be able to specify as many arguments as I want via *args to represent how many nests I want to traverse and extract the value from the final key.

Expected Results

I want to be able to use the below function to accept multiple arguments and create a nested index selector like so...

# Functions
def extract_odict_item(odict, *args):
    data = json.dumps(odict)
    final_data = json.loads(data)
    if len(args) == 0:
        raise Exception('Requires atleast 1 argument')

    elif len(args) == 1:
        if final_data is not None:
            return final_data[args[0]]
        else:
            return None

    elif len(args) > 1:
        ### Pseudo Code ###
        # if final_data is not None:
        #     return final_data[args[0]][args[1]][args[2]] etc.....
        # else:
        #     return None

So if I call extract_odict_item

extract_odict_item(odict, 'item1', 'item2', 'item3')

It should return final_data['item1']['item2']['item3']

I may have over complicated this but I can't think of anything else or if this is even possible within Python.

Answer

I was able to use a recursive function to handle the selection of what data I needed from the ordereddict

import json
from collections import OrderedDict

# Settings
pd.set_option('display.max_colwidth', -10)

# Data
owner = [
    OrderedDict([('attributes', OrderedDict([('type', 'Name'), ('url', 'URLHERE')])), ('UserRole', OrderedDict([('attributes', OrderedDict([('type', 'UserRole'), ('url', 'URLHERE')])), ('Name', 'Telephoone Sales')]))]),
    OrderedDict([('attributes', OrderedDict([('type', 'Name'), ('url', 'URLHERE')])), ('UserRole', OrderedDict([('attributes', OrderedDict([('type', 'UserRole'), ('url', 'URLHERE')])), ('Name', 'Field Sales')]))]),
    OrderedDict([('attributes', OrderedDict([('type', 'Name'), ('url', 'URLHERE')])), ('UserRole', OrderedDict([('attributes', OrderedDict([('type', 'UserRole'), ('url', 'URLHERE')])), ('Name', 'Online Sale')]))])
]

# Functions
def rec_ext(odict, item_list):
    new_list = item_list.copy()
    data = json.dumps(odict)
    final_data = json.loads(data)
    el = new_list.pop()
    if isinstance(final_data[el], dict):
        return rec_ext(final_data[el], new_list)
    else:
        return final_data[el]


# Dataframe
df = pd.DataFrame({'owner': owner
                   })

my_columns = ['UserRole', 'Name']
my_columns.reverse()
df['owner2'] = df['owner'].apply(lambda x: rec_ext(x, my_columns))

print(df['owner2'])
Ryan Davies
  • 167
  • 1
  • 9

1 Answers1

2

This is not an exact answer - but you can try recursion if I am understanding your question correctly -

d = {1: {2: {3: {4: 5}}}}#Arbitrarily nested dict
l = [1, 2, 3, 4]

def rec_ext(my_dict, my_list):
     el = my_list.pop()
     if isinstance(my_dict[el], dict):
         return rec_ext(my_dict[el], my_list)
     else:
         return my_dict[el]

l.reverse() #we reverse because we are "popping" in the function 
rec_ext(d, l)
#Returns 5
Mortz
  • 1,600
  • 12
  • 27
  • Thanks Mortz your answer has helped alot! Problem is that I get an IndexError: pop from empty list when I apply the function to a Pandas column using a Lambda `my_columns = ['UserRole', 'Name']` `my_columns.reverse()` `df['owner2'] = df['owner'].apply(lambda x: rec_ext(x, my_columns))` – Ryan Davies May 09 '19 at 09:38
  • `def rec_ext2(odict, column_list): new_list = column_list data = json.dumps(odict) final_data = json.loads(data) try: el = new_list.pop() if isinstance(final_data[el], dict): return rec_ext(final_data[el], new_list) else: return final_data[el] except IndexError: pass` Demonstrates the problem well as the function works for 1 item within the pandas series – Ryan Davies May 09 '19 at 09:43
  • 1
    The reason is that `my_columns` list is modified in place - which means that it works for the first row of your dataframe and for the subsequent rows, the `my_columns` list is `[]`. You can work around this problem by calling your function as `df['owner2'] = df['owner'].apply(lambda x: rec_ext(x, my_columns[:]))`. `my_columns[:]` ensures that you are passing a copy of the list to the function call and not a reference to the list. Hope this helps – Mortz May 09 '19 at 10:45
  • Oh wow I didn't know this thanks a lot! It does make me wonder if .reverse() & [:] steps can be built into the function itself? – Ryan Davies May 09 '19 at 10:50
  • 1
    Looks like you can include a list.copy() into the function to avoid having to use a list[:] slice ```new_list = item_list.copy()``` – Ryan Davies May 09 '19 at 10:55