
Given a dataset with the following structure:

time  var1  var2  var2  var1  var3
      loc1  loc1  loc2  loc2  loc1
1     11    12    13    14    15
2     21    22    23          25
3           32    33    34    35

Given as a .csv:

time,var1,var2,var2,var1,var3
,loc1,loc1,loc2,loc2,loc1
1,11,12,13,14,15
2,21,22,23,,25
3,,32,33,34,35

Note: some values are missing; not all variables are available for all locations; timestamps are available for every record; columns may appear out of order, but the timestamp is reliably the first column. I'm not sure all of these aspects are relevant to an optimal solution, but there they are.

I didn't have too much trouble setting up a three-dimensional xarray array that lets me access values by timestamp, location and variable name. My approach was to determine the unique location names, loop over them, filter the data by location, and add the results one location at a time. But I am wondering what a Pythonic and, for lack of a better word, pandastic solution would look like.

Question: Is there some compact and efficient way, likely using pandas and xarray, to load this dataset or any similar one (with different variable and location names) from .csv into a 3d-array like an xarray DataArray?

Grismar

1 Answer

import pandas as pd

df = pd.read_csv('tst.csv', header=[0, 1], index_col=0).sort_index(axis=1)

time  var1       var2      var3
      loc1  loc2 loc1 loc2 loc1
1     11.0  14.0   12   13   15
2     21.0   NaN   22   23   25
3      NaN  34.0   32   33   35
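For reference, here is a self-contained version of this read step, with the CSV from the question inlined as a string so it runs anywhere (recent pandas versions require the `axis` keyword in `sort_index`):

```python
import io
import pandas as pd

# the CSV from the question, inlined so the snippet is self-contained
csv_text = """time,var1,var2,var2,var1,var3
,loc1,loc1,loc2,loc2,loc1
1,11,12,13,14,15
2,21,22,23,,25
3,,32,33,34,35
"""

# two header rows -> a two-level column MultiIndex; first column is the index
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0)

# sort columns lexicographically by (variable, location)
df = df.sort_index(axis=1)

print(df.shape)             # (3, 5) -- three timestamps, five populated columns
print(df.columns.tolist())  # tuples of (variable, location)
```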

However, to get a 3-D array, we must project this onto the Cartesian product of the available axis labels (every variable × every location), filling the gaps with NaN.

cols = pd.MultiIndex.from_product(df.columns.levels)

d1 = df.reindex(columns=cols)
d1

   var1       var2      var3     
   loc1  loc2 loc1 loc2 loc1 loc2
1  11.0  14.0   12   13   15  NaN
2  21.0   NaN   22   23   25  NaN
3   NaN  34.0   32   33   35  NaN

Then use numpy.reshape and numpy.transpose:

d1.values.reshape(3, 3, 2).transpose(1, 0, 2)

array([[[ 11.,  14.],
        [ 21.,  nan],
        [ nan,  34.]],

       [[ 12.,  13.],
        [ 22.,  23.],
        [ 32.,  33.]],

       [[ 15.,  nan],
        [ 25.,  nan],
        [ 35.,  nan]]])
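Since the question asked for an xarray DataArray, the reshaped array can be wrapped with labeled coordinates. A minimal sketch, assuming xarray is installed, which also derives the target shape from the data rather than hard-coding 3, 3, 2:

```python
import io
import pandas as pd
import xarray as xr

# the CSV from the question, inlined so the snippet is self-contained
csv_text = """time,var1,var2,var2,var1,var3
,loc1,loc1,loc2,loc2,loc1
1,11,12,13,14,15
2,21,22,23,,25
3,,32,33,34,35
"""

df = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0).sort_index(axis=1)

# project onto the full cartesian product of (variable, location)
cols = pd.MultiIndex.from_product(df.columns.levels)
d1 = df.reindex(columns=cols)

# derive the target shape from the data instead of hard-coding 3, 3, 2
n_time = len(d1.index)
n_var, n_loc = (len(level) for level in d1.columns.levels)
arr = d1.to_numpy().reshape(n_time, n_var, n_loc).transpose(1, 0, 2)

# label the axes so values are addressable by variable, time and location
da = xr.DataArray(
    arr,
    coords=[d1.columns.levels[0], d1.index, d1.columns.levels[1]],
    dims=['variable', 'time', 'location'],
)
print(float(da.sel(variable='var2', time=2, location='loc2')))  # 23.0
```

This makes the lookup fully label-based, and it generalizes to other variable and location names because the shape and coordinates come from the MultiIndex levels. An alternative worth exploring is naming the two column levels and using `df.stack()` together with `to_xarray()`, but the reshape above sticks closest to the answer's approach.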
piRSquared
  • Ah yes, I got your first suggestion working with a realistic dataset (after some technical issues), but then I found it wasn't quite the shape I expected. I was coming back here to write about that and to ask about transposition, but you beat me to it. – Grismar Oct 04 '17 at 03:56
  • Just a note: another complication in my actual source file was that the first column is junk. I told pandas to take the index from column 1 and to read column 0 as str, but that removes .levels from .columns. I solved it by dropping the excess column in a pre-processing function and feeding the result to pandas. – Grismar Oct 04 '17 at 05:15
  • I have a similar problem to this one, could you help me out with it? Here is my question: https://stackoverflow.com/q/58358170/5838180 – NeStack Oct 13 '19 at 21:15