df1
[601601,859078]
[601601,137726]
[620253,859078]

This is a pull from an SQL database, returned as a dataframe. Each of these data grabs can be tens of thousands of lines deep. I then need to read the right column and feed it back into the SQL database as a new input, and the results need to be associated correctly, i.e., with the previous input and the input below:

df2
[859078,682084]
[859078,783085]

I want to get the following output (merge df1's right column with df2's left column, an outer merge):

[601601,859078,682084]
[601601,859078,783085]
[601601,137726,]
[620253,859078,682084]
[620253,859078,783085]
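The example above can be reproduced with a single outer merge; a minimal sketch (the column names x and y are assumed here, matching the query further down):

```python
import pandas as pd

# hypothetical column names; the real query returns x, y
df1 = pd.DataFrame([[601601, 859078], [601601, 137726], [620253, 859078]],
                   columns=["x", "y"])
df2 = pd.DataFrame([[859078, 682084], [859078, 783085]],
                   columns=["x", "y"])

# join df1's right column (y) onto df2's left column (x)
merged = df1.merge(df2, how="outer", left_on="y", right_on="x",
                   suffixes=("_1", "_2"))
print(merged[["x_1", "y_1", "y_2"]])
```

The row [601601, 137726] has no match in df2, so its y_2 comes back as NaN, matching the trailing empty slot in the desired output.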

This is pretty much what pd.merge(df1, df2, how='outer') is for, and I was able to get this to work, but I use a recursive function to make all the calls to the database. I ran across a Stack Overflow question that says growing a dataframe on each call is horribly inefficient, since the dataframe needs to be copied each time, and sure enough, it is painfully slow. So I followed that example, and this is what I have:

def sqlFormat(arg, array):
    array = removeDuplicates(array)
    # quote each value and build "... IN ('a','b',...)"
    values = ",".join(f"'{x}'" for x in array)
    return f"{arg}({values})"
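For reference, a self-contained sketch of what sqlFormat produces, with removeDuplicates stood in by an order-preserving dict.fromkeys (a hypothetical implementation, since the original isn't shown):

```python
def sqlFormat(arg, array):
    # order-preserving de-duplication; stand-in for removeDuplicates
    array = list(dict.fromkeys(array))
    values = ",".join(f"'{x}'" for x in array)
    return f"{arg}({values})"

query = sqlFormat("SELECT x, y FROM SQL.Table WHERE x IN ", [859078, 859078, 137726])
# query == "SELECT x, y FROM SQL.Table WHERE x IN ('859078','137726')"
```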

def recursiveFunction(arg, data, cnxn, counter):
    sql_query = pd.read_sql_query(arg, cnxn).add_suffix(f'_{counter}').astype('Int32')
    if not sql_query.empty:
        data.append(sql_query)
        counter += 1
        # look up the previous level's y column by name instead of eval()
        next_in = sql_query[f'y_{counter - 1}'].to_numpy(dtype='Int32')
        recursiveFunction(sqlFormat("SELECT x, y FROM SQL.Table WHERE x IN ", next_in),
                          data, cnxn, counter)
    # return data unconditionally so an empty first query doesn't return None
    return data

def readSQL(cnxn, array):
    data = []
    counter = 1
    dfPreLoad = ...  # dataframe return of SQL query; takes cnxn, array
    # arg used below is modified here
    data = recursiveFunction(arg, data, cnxn, counter)
    dfOutput = pd.concat(data, ignore_index=True)

Basically, I pass data, a list, through the recursive function, which appends each query result as it runs. I can then turn the list into a single dataframe with pd.concat(data), which returns this:

(screenshot of the concatenated output)

The picture doesn't have matching values in it, but pd.concat is effectively placing each dataframe in its own columns, starting after the last row of the previous one.

I would like to be able to split data back into separate dataframes and merge those, though that may have the same issue as before. I could also write a custom merge function after converting the concatenated result back to an array. Any ideas on how to achieve this?
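Since recursiveFunction already keeps one suffixed frame per level in a list, one way to do the split-and-merge is to fold that list with a chained merge instead of concatenating it. A sketch with hypothetical per-level data:

```python
from functools import reduce
import pandas as pd

# hypothetical per-level frames, as produced by the recursive queries
lvl1 = pd.DataFrame({"x_1": [601601, 601601, 620253],
                     "y_1": [859078, 137726, 859078]})
lvl2 = pd.DataFrame({"x_2": [859078, 859078],
                     "y_2": [682084, 783085]})

def chain(left, right):
    # join each level's last (y) column onto the next level's first (x) column
    return left.merge(right, how="outer",
                      left_on=left.columns[-1], right_on=right.columns[0])

result = reduce(chain, [lvl1, lvl2])
```

Because the suffixes make every column name unique, each merge keeps the running frame's last column available as the join key for the next level; there is one merge per level, not one copy per row.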

Python 3.7

Raw SQL Database, simplified

Output Example

Actual End Goal

Human Readable Version

onyex
    Did you read the info here: https://stackoverflow.com/questions/53645882/pandas-merging-101 ? – Paul H Dec 29 '20 at 22:27
    You should be doing this operation in your SQL query, it would likely be a simple `LEFT JOIN` – Nick Dec 29 '20 at 22:27
  • @PaulH I have read through the pandas docs already. I will take a look to see if there is anything in there I missed, but one of the issues I have is getting the data array into multiple dataframes so I can apply the pandas merging methods – onyex Dec 29 '20 at 22:43
  • @Nick I will look into this. I was not aware it was possible to write a recursive SQL function – onyex Dec 29 '20 at 22:50
  • @onyex it depends on your version of SQL. If you could share the queries you are using in the question it might be possible to come up with a possible combination query. – Nick Dec 29 '20 at 22:51
  • @Nick Sure, I kept it out earlier to keep it simple. The post is updated. I am using pyodbc. – onyex Dec 29 '20 at 23:03
  • Is your underlying DBMS SQL Server? Can you share the table structure and some sample data and expected output? I'm pretty sure this can be done with one query if it is SQL Server or SQLite. Also, what depth of recursion might be required? – Nick Dec 30 '20 at 00:27
  • @Nick Yep, I do believe it is an RDBMS SQL Server. The table itself is 2m+ rows and I'm going to simplify it. The data gathered is unique identifiers for parent and child parts. I am attempting to generate an indented BOM based on the parent-child info. The number of recursions required is unknown, since it depends on the depth of the BOM. I have included tables at the end of the main post. The first one is a simplified table; I am only using col d and f. The next table is what I am asking for. The table after is what I actually need, but I can post-process. The last table is how someone would read it. – onyex Dec 30 '20 at 04:37
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/226620/discussion-between-onyex-and-nick). – onyex Dec 30 '20 at 19:00

1 Answer


Based on your comments & additions to the question, you could generate the results you want using a recursive CTE query which joins each row to its parent row in the CTE, building a comma-separated list of f values:

WITH cte AS (
  SELECT f, CAST(f AS VARCHAR(255)) AS ids
  FROM test
  WHERE d IS NULL
  UNION ALL
  SELECT test.f, CAST(CONCAT(cte.ids, ',', test.f) AS VARCHAR(255))
  FROM test
  JOIN cte ON cte.f = test.d
)
SELECT ids
FROM cte

For the sample data you provided this will produce the following output:

ids
1
2
1,3
1,4
1,4,6
1,4,7
1,3,5

This can then be readily split into columns and ordered in python.
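The splitting step is a one-liner with pandas' vectorised string methods; a sketch using the output above:

```python
import pandas as pd

df = pd.DataFrame({"ids": ["1", "2", "1,3", "1,4", "1,4,6", "1,4,7", "1,3,5"]})
# one column per BOM level; shorter paths are padded with None
levels = df["ids"].str.split(",", expand=True)
print(levels)
```

Sorting the resulting frame by its columns (e.g. with sort_values) then groups each parent with its children, giving the indented-BOM ordering.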

Nick
  • This was the answer. It was modified with `WITH UNQ AS ( SELECT DISTINCT d, f FROM test ),` – onyex Dec 31 '20 at 19:39