df1

```
[601601,859078]
[601601,137726]
[620253,859078]
```
This is a pull from an SQL database, returned as a dataframe. Each of these grabs can be tens of thousands of rows deep. I then need to read the right-hand column and use it as a new input to the SQL database, and the results need to be associated correctly. E.g., with the input above and the input below:
df2

```
[859078,682084]
[859078,783085]
```
I want to get the following output (df1's right column merged with df2's left column, as an outer merge):

```
[601601,859078,682084]
[601601,859078,783085]
[601601,137726,]
[620253,859078,682084]
[620253,859078,783085]
```
This is pretty much what `pd.merge(df1, df2, how='outer')` is for, and I was able to get it to work, but I use a recursive function to make all the calls to the database. I ran across a Stack Overflow question which says that merging into a dataframe repeatedly like this is horribly inefficient, since the dataframe needs to be copied each time, and sure enough it is painfully slow. So I followed that example, and this is what I have:
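For reference, here is a minimal sketch of that single merge on the sample data above. The column names `x` and `y` are assumptions (they aren't in the original), and the suffixes mirror the `add_suffix` naming used later:

```python
import pandas as pd

# Sample frames from the question; column names 'x' and 'y' are assumed.
df1 = pd.DataFrame([[601601, 859078], [601601, 137726], [620253, 859078]],
                   columns=['x', 'y'])
df2 = pd.DataFrame([[859078, 682084], [859078, 783085]],
                   columns=['x', 'y'])

# Outer-merge df1's right column onto df2's left column.
merged = df1.merge(df2, how='outer', left_on='y', right_on='x',
                   suffixes=('_1', '_2'))
# merged has 5 rows; the unmatched [601601, 137726] row keeps NaN on the df2 side.
```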
```python
def sqlFormat(arg, array):
    array = removeDuplicates(array)
    for x in range(len(array)):
        array[x] = f"'{array[x]}'"
    return arg + "({})".format(",".join(array))
```
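Since `removeDuplicates` isn't shown, here is a self-contained sketch of the same IN-clause builder, with an order-preserving dedup via `dict.fromkeys` standing in for it (note that string-interpolating values into SQL is injection-prone; parameterized queries are safer where the driver supports them):

```python
def sql_format(arg, values):
    # Stand-in for removeDuplicates: dedupe while preserving order.
    unique = list(dict.fromkeys(values))
    # Quote each value, then join them into an IN (...) clause.
    quoted = [f"'{v}'" for v in unique]
    return arg + "({})".format(",".join(quoted))

# sql_format("SELECT x, y FROM SQL.Table WHERE x IN ", [859078, 137726, 859078])
# -> "SELECT x, y FROM SQL.Table WHERE x IN ('859078','137726')"
```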
```python
def recursiveFunction(arg, data, cnxn, counter):
    sql_query = pd.read_sql_query(arg, cnxn).add_suffix('_{}'.format(counter)).astype('Int32')
    if not sql_query.empty:
        data.append(sql_query)
        counter += 1
        recursiveFunction(sqlFormat("SELECT x, y FROM SQL.Table WHERE x IN ",
                                    sql_query['y_{}'.format(counter - 1)].to_numpy(dtype='Int32')),
                          data, cnxn, counter)
    return data
```

(I originally built the column lookup with `eval`, but indexing the column by name does the same thing without it.)
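One way to sidestep recursion-depth limits entirely is to make the same walk iterative. This is a hedged sketch, not the original code: `fetch_levels`, `fetch`, and `max_depth` are hypothetical names, and `fetch` stands in for the `pd.read_sql_query` + `sqlFormat` pair, returning a dataframe with columns `x` and `y`:

```python
import pandas as pd

def fetch_levels(fetch, seed_keys, max_depth=32):
    """Iteratively follow x -> y links, collecting one DataFrame per level.

    `fetch` is any callable mapping a list of keys to a DataFrame with
    columns 'x' and 'y' (a stand-in for the database round-trip).
    """
    data = []
    keys = list(dict.fromkeys(seed_keys))  # order-preserving dedupe
    counter = 1
    while keys and counter <= max_depth:
        df = fetch(keys).add_suffix(f'_{counter}')
        if df.empty:
            break
        data.append(df)
        # The y column of this level becomes the key list for the next one.
        keys = df[f'y_{counter}'].dropna().unique().tolist()
        counter += 1
    return data
```

The list of per-level frames it returns plays the same role as `data` in the recursive version.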
```python
def readSQL(cnxn, array):
    data = []
    counter = 1
    dfPreLoad =  # dataframe return of SQL query; takes cnxn, array
    # arg used below is modified here
    data = recursiveFunction(arg, data, cnxn, counter)
    dfOutput = pd.concat(data, ignore_index=True)
```
Basically, I pass `data`, a list, through the recursive function, which appends a dataframe to it on each call; I can then combine everything with `pd.concat(data)`, which returns the result in the screenshot. The screenshot doesn't have matching values in it, but `pd.concat` is effectively placing each dataframe in its own columns, starting after the last row of the previous one.
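That block-diagonal layout can be reproduced without the screenshot. Because `add_suffix` gives every level distinct column names, `pd.concat` has nothing to align on (the toy values here are made up):

```python
import pandas as pd

a = pd.DataFrame({'x_1': [1], 'y_1': [2]})
b = pd.DataFrame({'x_2': [3], 'y_2': [4]})

out = pd.concat([a, b], ignore_index=True)
# out has columns x_1, y_1, x_2, y_2; row 0 fills only x_1/y_1 and
# row 1 fills only x_2/y_2 - each frame lands below the previous one.
```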
I would like to be able to split `data` back into dataframes and merge them, though that may have the same inefficiency as before. I could also write a custom merge function after converting the concat result back to an array. Any ideas on how to achieve this?
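For that last step, one option is to fold the per-level frames together with successive merges instead of `concat`. This is a sketch under assumptions: `chain_merge` is a hypothetical helper, and each frame is assumed to keep the `x_i`/`y_i` column names produced by `add_suffix`:

```python
from functools import reduce

def chain_merge(frames):
    def step(left, right):
        # Join the newest y_* column of the accumulated frame
        # onto the x_* column of the next level.
        left_key = [c for c in left.columns if c.startswith('y_')][-1]
        right_key = [c for c in right.columns if c.startswith('x_')][0]
        return left.merge(right, how='outer', left_on=left_key, right_on=right_key)
    return reduce(step, frames)
```

Each level is merged exactly once into the accumulated result, so nothing is re-copied per database call the way repeated appends are.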
Python 3.7