Parent) as csv (700,000 rows) input

Child   Parent
fA00    f0
fA9 fA0
fA31    fA0
fA30    fA0
fA1 fA00
dccfA1  fA00
fA2 fA00
fA3 fA00
fA01    fA00
fA4 fA00
fA5 fA00
fA6 fA00
fA7 fA00
fA0 fA00
fA142149    fA00
fA02    fA00
fA8 fA00
qA1 fA10
fA22    fA10
fA23    fA10
fA11    fA10
qA2     fA10
fA15    fA11
fA13    fA11
fA12    fA11
fA14    fA13
fA17    fA16
fA18    fA17
fA19    fA17
fA20    fA17
fA21    fA19
etc....

The hierarchy goes up to 14 levels deep, and the top parent is f0.

I want to walk the child-parent relationships to determine the full path from the top parent down to each node.

Expected Result

f0 --- top
f0\fa00
f0\fa00\.Child
f0\fa00\.Child2etc
f0\fA0
f0\fA0\.Child
f0\fA0\.Child2etc

How can I do this in Python?

Shadow
  • Have you tried anything? People are usually keen to point out errors in your attempts but not to write code for you. – Paul Rooney Dec 20 '17 at 04:12
  • Your question is a little vague. Do you want to be able to give a child name and from that obtain the path to the top parent? Are all the node names unique? Is the data structure fixed, or do you need to add &/or remove nodes? You _could_ create a custom class for your nodes, but I'd probably just use nested dictionaries. – PM 2Ring Dec 20 '17 at 04:18
  • Can you post a small but complete example data set, say 3 levels deep, and an example of the output you want? I don't see where the 'Child2etc' comes from in your question as it stands. – Mike Robins Dec 20 '17 at 07:56
  • @PM 2Ring - I was hoping not to manually define dictionaries as the data is variable. – Cameron Stewart Dec 20 '17 at 21:47
  • @PaulRooney I'm new to this, thank you for the tip. I understand from what I have read that although the data is a hierarchy, it is actually a graph structure. I was reading around to see whether networkx can create the graph from a CSV import, as opposed to manually creating edges and nodes. – Cameron Stewart Dec 20 '17 at 21:52
  • @MikeRobins Here is the sample data https://docs.google.com/spreadsheets/d/e/2PACX-1vQ1y72RGE6-hBHl9uhRDumuLeBgpPliCxVUPLmK2zCEZ_Ltot4-E2Dgig7b4J2tih-h57NXup5428vT/pubhtml – Cameron Stewart Dec 20 '17 at 21:53
  • Ok. So it looks like you want to print all the paths from the 'f0' root node to every single child node. But I don't quite get the logic in your expected output: why are some separators backslash and some backslash-dot? (BTW, backslash is a somewhat annoying character to work with, do you really need that?). I've had a look at your Google Docs data, which was a little problematic in my old browser; it'd be easier if it were a simple ASCII file. – PM 2Ring Dec 21 '17 at 00:57
  • (cont) Anyway, I counted 2730 nodes in that data, including the 'f0' root node. However, 12 of those nodes don't have a parent node: fA700 fA763 fA982 fA993 qA49 qA54 qA58 qA69 qA70 qA76 qA94 qA97. How do you want to handle such nodes? – PM 2Ring Dec 21 '17 at 00:58
  • @PM2Ring No, it doesn't have to be backslashes. The nodes without parents are a result of how I sampled the dataset. In the complete dataset they would have parents, but they have been cut off. Down the track, it is possible that orphaned children could occur, so it would be good if they started a new tree, i.e. became the top level. – Cameron Stewart Dec 21 '17 at 23:54
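
The orphan check discussed in these comments can be sketched roughly as follows. This is a minimal illustration, assuming the rows are already available as (child, parent) pairs; the sample rows and the helper name `find_roots` are hypothetical and not taken from the real data set. A name that appears in the Parent column but never in the Child column is treated as the top of its own tree.

```python
# Hypothetical sample rows as (child, parent); not taken from the real data.
rows = [
    ("a0", "f0"),
    ("b0", "a0"),
    ("b1", "a0"),
    ("x1", "x0"),  # x0 never appears as a child, so it starts a second tree
]

def find_roots(pairs):
    """Return the sorted names that appear as a parent but never as a child."""
    children = {child for child, _ in pairs}
    parents = {parent for _, parent in pairs}
    return sorted(parents - children)

roots = find_roots(rows)
print(roots)
```

With the full data set, anything this returns besides f0 would be a node whose own parent row is missing.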

1 Answer

I started out thinking this needed a complicated recursive construction of tree structures, but it is actually very simple. Create a mapping from each child to its parent. Then, starting at a child, list its parent, then the parent's parent, and so on up to the top. A short recursive routine extracts each child's ancestry.

'''
This is the family tree:
------------------------
f0:
    a0:
        b0
        b1
        b2
    a1:
        b3
        b4
    a2:
        b5:
            c0
            c1
'''
ancestry = [
    ('b1', 'a0'),
    ('c1', 'b5'),
    ('b2', 'a0'),
    ('b3', 'a1'),
    ('b4', 'a1'),
    ('b5', 'a2'),
    ('a0', 'f0'),
    ('a1', 'f0'),
    ('a2', 'f0'),
    ('b0', 'a0'),
    ('c0', 'b5'),
]

And the code is:

parents = set()   # every name that appears as a parent
children = {}     # maps each child to its parent
for c, p in ancestry:
    parents.add(p)
    children[c] = p

# recursively collect ancestors until a node has no parent
def ancestors(p):
    return (ancestors(children[p]) if p in children else []) + [p]

# for each child that has no children of its own, print the genealogy
for k in (set(children.keys()) - parents):
    print('/'.join(ancestors(k)))

Output (the line order may differ between runs, because sets are unordered):

f0/a1/b4
f0/a0/b0
f0/a0/b1
f0/a0/b2
f0/a1/b3
f0/a2/b5/c1
f0/a2/b5/c0

I'll leave it as an exercise to read the CSV file and perhaps sort the output more usefully.
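
For readers who want a starting point for that exercise, here is one possible sketch using the standard `csv` module. It assumes a comma-delimited file with a header row of `Child,Parent`; if the real file is tab-separated, `csv.DictReader` accepts `delimiter="\t"`. The in-memory sample, the file name in the comment, and the helper names are all hypothetical. The upward walk is written as a loop rather than recursion, and it assumes the data contains no cycles.

```python
import csv
import io

# Stand-in for the real file. With a file on disk, use something like
# open("relationships.csv", newline="") instead (hypothetical file name).
sample = io.StringIO("Child,Parent\na0,f0\nb0,a0\nb1,a0\na1,f0\n")

def load_parent_map(handle):
    """Read Child,Parent rows and return a dict mapping child -> parent."""
    reader = csv.DictReader(handle)
    return {row["Child"]: row["Parent"] for row in reader}

def ancestors(node, parent_of):
    """Walk upward and return the names from the top parent down to node."""
    path = [node]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path[::-1]

parent_of = load_parent_map(sample)

# Leaves are children that never appear as anyone's parent.
leaves = set(parent_of) - set(parent_of.values())

# Sorting the joined strings gives a stable, readable order.
paths = sorted("/".join(ancestors(leaf, parent_of)) for leaf in leaves)
for p in paths:
    print(p)
```

Because the mapping is a plain dict, a child listed twice with different parents would keep only the last parent read.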

Mike Robins
  • Thanks for this. The only issue I could see is that it didn't create the top ancestor as a standalone entry, e.g. `f0 f0/a0 f0/a1 f0/a0/b0 f0/a1/b4 f0/a0/b0` – Cameron Stewart Dec 21 '17 at 23:56
  • Got it going with the import and dynamic dictionary ` # In[1]: import pandas as pd # In[2]: df = pd.read_csv('Parent Child Sample Data - Sheet1.csv') # In[9]: df['Dictionery'] = list(zip(df['Child'], df['Parent'])) # In[11]: ancestry = df['Dictionery']` – Cameron Stewart Dec 22 '17 at 00:58
  • # In[12]: parents = set() children = {} for c,p in ancestry: parents.add(p) children[c] = p # In[18]: # recursively determine parents until child has no parent def ancestors(p): return (ancestors(children[p]) if p in children else []) + [p] `# In[19]: #for each child that has no children print the geneology for k in (set(children.keys()) - parents): print('\\'.join(ancestors(k)))` – Cameron Stewart Dec 22 '17 at 00:59
  • @Cameron Stewart, there is no need to descend from one top level node, the code just traces up until it can't go further. You may like to consider a visualization that I describe on [another question](https://stackoverflow.com/questions/47787295/creating-a-hierarchical-tree-visualization-from-a-dictionary-structure-in-python). I hope you accept my answer. – Mike Robins Dec 22 '17 at 01:55
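
The request in the comments above, one line per node with the top ancestor listed on its own, could be sketched as follows. This is not the answer's code but a variation on it: it emits a path for every node rather than only for leaves, and it caches each path so long chains are not re-walked for every descendant, which may matter at 700,000 rows. The function name `all_paths` and the sample pairs are hypothetical, and the sketch assumes the data contains no cycles.

```python
def all_paths(pairs, sep="/"):
    """Return a sorted list with one path per node, top-level nodes included."""
    parent_of = dict(pairs)  # child -> parent
    cache = {}               # node -> tuple of names from the top down to node

    def path_to(node):
        # Climb until we hit a cached node or a node with no recorded parent.
        chain = []
        while node not in cache and node in parent_of:
            chain.append(node)
            node = parent_of[node]
        if node not in cache:  # a top-level node: its path is just itself
            cache[node] = (node,)
        prefix = cache[node]
        # Fill in the cache on the way back down the chain.
        for name in reversed(chain):
            prefix = prefix + (name,)
            cache[name] = prefix
        return prefix

    nodes = set(parent_of) | set(parent_of.values())
    return sorted(sep.join(path_to(n)) for n in nodes)

# Hypothetical sample: x0 has no parent row, so it starts its own tree.
pairs = [("a0", "f0"), ("b0", "a0"), ("x1", "x0")]
result = all_paths(pairs)
for line in result:
    print(line)
```

Orphaned nodes need no special handling here: the walk stops wherever a parent is missing, so each such node becomes the first element of its own paths.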