1

I am calculating the correlation between columns in a large table using the in-database Python engine in SQL Server 2019 and, since this calculation returns a diagonal matrix, would like to be able to view the results in SSMS with the rows labeled mirroring the column names.

I know the basics of SQL querying, but not much beyond, so maybe I'm not phrasing my searches precisely.

Here is an example of my code:

execute sp_execute_external_script 
@language = N'Python',
@script = N'
import pandas as pd
from pandas import DataFrame

df = InputDataSet.corr()
OutputDataSet = df

',
@input_data_1 = N'select GHI ,
MNO,
JKL
from PIVOTED_TIME_ID_MATRIX'

with result sets ((GHI float,
MNO float,
JKL float))

This returns:

***** GHI | MNO | JKL
Row 1   1   0.5   0.5
Row 2 0.5     1   0.5
Row 3 0.5   0.5     1 

and I would like to see:

***** GHI | MNO | JKL
GHI     1   0.5   0.5
MNO   0.5     1   0.5
JKL   0.5   0.5     1 

Is this possible?

Barbaros Özhan
  • 39,060
  • 9
  • 21
  • 45
b00m1
  • 33
  • 3
  • Can you try `df.set_index(df.columns)` below `df = InputDataSet.corr()`? – ResidentSleeper Apr 08 '19 at 18:49
  • Yes, it outputs nicely to the console window with a print(), but it doesn't pass the dataframe index to the actual result set where I can save the output into a CSV. Thanks for the suggestion though, I might have to figure out a way to use that as a workaround if there isn't a way in SQL to make the index mirror the columns. Just trying to automate this for my team as much as possible. – b00m1 Apr 08 '19 at 19:45
  • Try `df.set_index(df.columns, inplace=True)` this should work. – ResidentSleeper Apr 09 '19 at 01:29
  • Thanks @ResidentSleeper! I ended up combining this solution with a workaround to produce the output I'm after. Since the DataFrame's index doesn't appear to get passed back into the result set to SSMS, I appended the results of the call to df.columns to a new column, then shifted it to the front of the DataFrame and added a new column in the result set. – b00m1 Apr 10 '19 at 14:33

1 Answers1

0

I ended up combining the advice above to use df.columns, as well as a method for rearranging columns from here with a workaround to produce the output I was looking for.

...'
df = InputDataSet.corr()
#puts the names of the existing columns into a new column on the end of df
df["columns"] = df.columns 
cols = df.columns.tolist()
#shift the "columns" column to the front of the dataframe
cols = cols[-1:] + cols[:-1]
df = df[cols]
OutputDataSet = df

',
@input_data_1 = N'select GHI ,
MNO,
JKL
from PIVOTED_TIME_ID_MATRIX'

with result sets ((column_names varchar(max), --add a new column in the result set
GHI float,
MNO float,
JKL float))
b00m1
  • 33
  • 3