
I have a dataset of chemical properties that I downloaded from here.

I want to filter this dataset. For each combination of compound, phase, and pressure, I want to keep only the measurements taken at temperatures above the temperature at which the lowest value of the measured property occurs.

For example, for specific heat capacity, I want something like:

aggregate(
  seq(nrow(data)),
  list(data$phase, data$compound, data$p), 
  function(ids) {  
    subset = data[ids,]
    subset[ subset$T > subset$T[  subset$Cp == min(subset$Cp)  ] ,]   
  } 
)

However, that returns something I can't make sense of. If I had to guess, I'd say it returns a data frame in which the cells of one column hold vectors containing the contents of the data frames I return from the callback function.

Is there any way I can convince aggregate() to call rbind() on the data frames returned by the callback? Is there a function I should be using besides aggregate()?


1 Answer


This is lame, but I did find a way to get around the problem by returning a vector of ids from the callback:

id.list = aggregate(
  seq(nrow(data)),
  list(data$phase, data$compound, data$p), 
  function(ids) {  
    subset = data[ids,]
    ids[ subset$T > subset$T[  subset$Cp == min(subset$Cp)  ] ]   
  }
)

This returns a data frame where a column, x, stores vectors of ids.

If I select that column:

id.list$x

I get a list of vectors, which, as I learned from this answer, can be flattened into a single vector:

stack(id.list$x)$values

which is a single vector of ids. I then just retrieve the rows from the original data frame:

data[stack(id.list$x)$values,]
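To see the flattening step in isolation, here is a small sketch on a made-up named list. The names `a` and `b` and the id values are purely illustrative and are not taken from the dataset. `stack()` builds its `ind` column from the element names, so it is safest on a named list; `unlist()` is a possible alternative that does not rely on names.

```r
# Toy stand-in for id.list$x: a named list of row-id vectors (made-up values).
ids_by_group <- list(a = c(2L, 3L), b = 7L)

# stack() turns the named list into a data frame with columns `values` and `ind`.
flat_stack <- stack(ids_by_group)$values

# unlist() flattens the list directly; use.names = FALSE drops the element names.
flat_unlist <- unlist(ids_by_group, use.names = FALSE)
```

Either vector can then be used as a row index, as in `data[flat_unlist, ]`.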

So the entire code reads:

id.list = aggregate(
  seq(nrow(data)),
  list(data$phase, data$compound, data$p), 
  function(ids) {  
    subset = data[ids,]
    ids[ subset$T > subset$T[  subset$Cp == min(subset$Cp)  ] ]   
  }
)
answer = data[stack(id.list$x)$values,]
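The same filter can also be sketched with `split()`, `lapply()`, and `do.call(rbind, ...)`, which is the rbind-style combination the question asks about. This is only a sketch under assumptions: the data frame below is made up for illustration (the values are not real measurements), with column names matching those used above. It uses `which.min()` so that a tie for the minimum `Cp` picks a single row rather than producing a vector comparison.

```r
# Made-up example data; column names follow the code above, values are illustrative.
data <- data.frame(
  compound = rep(c("water", "ethanol"), each = 4),
  phase    = "liquid",
  p        = 1,
  T        = rep(c(280, 300, 320, 340), times = 2),
  Cp       = c(4.20, 4.18, 4.19, 4.21, 2.40, 2.35, 2.50, 2.60)
)

# One data frame per (phase, compound, p) group; drop = TRUE discards
# combinations that never occur in the data.
groups <- split(data, list(data$phase, data$compound, data$p), drop = TRUE)

# Within each group, keep the rows whose T exceeds the T at the minimum Cp.
kept <- lapply(groups, function(g) g[g$T > g$T[which.min(g$Cp)], ])

# Bind the per-group results back into a single data frame.
answer <- do.call(rbind, kept)
```

The row names of `answer` are derived from the group labels; setting `rownames(answer) <- NULL` resets them if that is unwanted.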

I will concede the answer to anyone who can find a more succinct solution.
