
I have a data frame containing the pairs of elements found in a number of datasets. The order within a pair does not matter; each pair is given once, in alphabetical order. However, which database a pair first appears in may differ, as in the example.

data <- data.frame(i = c("b","b","b","c"), j = c("c","d","d","a"),
        database = c(1,1,2,3))

I would like to generate a score for each pair: the ratio of the databases that contain either member of the pair which also contain the pair itself.

I can imagine a crude function like this:

# For each database that contains either member of the pair (as i or j),
# test whether it also contains the pair itself, i.e. a line connecting
# the two elements in either order. Count the number of successes.

# Divide that by:
# the number of databases that contain either member of the pair in i or j.
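In code, that crude approach might look like the following base-R sketch (the `pmin`/`pmax` trick puts each pair into a canonical alphabetical order first; the example frame is repeated here so the snippet runs on its own):

```r
data <- data.frame(i = c("b","b","b","c"), j = c("c","d","d","a"),
                   database = c(1,1,2,3), stringsAsFactors = FALSE)

# Canonical ordering within each pair
x <- pmin(data$i, data$j)
y <- pmax(data$i, data$j)

pairs <- unique(data.frame(x, y, stringsAsFactors = FALSE))
pairs$score <- mapply(function(a, b) {
  # databases that contain this exact pair
  n_pair <- length(unique(data$database[x == a & y == b]))
  # databases that contain either element of the pair
  n_any  <- length(unique(data$database[x %in% c(a, b) | y %in% c(a, b)]))
  n_pair / n_any
}, pairs$x, pairs$y)

# scores: (b,c) = 1/3, (b,d) = 1, (a,c) = 1/2
```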

The results I would expect from the example data set (order unimportant) are:

a c 0.5
b c 0.33
b d 1

I can see how this crude loop system might work, but I'm quite sure there is a more elegant solution. Perhaps a graph library has a specific function for this? Anyone able to help? Thanks!

puslet88
  • I don't understand your example. The element "b" is contained in every database but only databases 1 and 2 contain bd. Hence I would expect "b d 2/3" instead of "b d 1" in our example output. – Patrick Roocks Mar 17 '15 at 12:22
  • You're right, I mistyped the database and missed the point that there will be uneven ratios for many instances at the moment. Fixed it. Thanks. – puslet88 Mar 17 '15 at 12:49
  • Could you rationalize the expected output for your sample data set? – Marat Talipov Mar 17 '15 at 14:15
  • The data contains pairs of variables that were found to significantly covary in the independent datasets stated in dataset numbers. It is very much improvised, but I am trying to find an algorithm that would find a type of "natural classes" between them. At the moment this is the closest I've made it to a measure of natural belonging. I'd imagine there are some established and more direct ways to do it, but unfortunately I'm not well informed on types of cluster analysis. – puslet88 Mar 17 '15 at 15:11
  • I definitely do not know how many classes there will be at the beginning, nor how many of the items will form natural classes, but I can specify a threshold on them. So here I have done that, and the result is a network with the nodes given above. I'm looking for a way to generalize over a number of such datasets, where these natural classes don't have to covary in all datasets, just under the noticeable conditions. – puslet88 Mar 17 '15 at 15:14

2 Answers


Just a bit of playing around with joins (i.e. merge):

library(dplyr)

data <- data.frame(i = c("b","b","b","c"), j = c("c","d","d","a"),
                   database = c(1,1,2,3), stringsAsFactors = FALSE)

# Sort pairs lexicographically and count occurrences of pairs
data2 <- mutate(data, x=pmin(i,j), y=pmax(i,j))
pairs_all <- summarize(group_by(data2, x, y), n_all = length(unique(database)))

# Introduce helper index to identify the pairs (for following joins)
pairs_all$pair_id <- 1:nrow(pairs_all)

# Count occurrences of the elements of the pairs
r <- 
 merge(pairs_all, 
         summarize(group_by(merge(merge(pairs_all,
                                        transmute(data2, x, db1 = database)), 
                                  transmute(data2, y, db2 = database)), pair_id),
                   n_any = length(unique(union(db1,db2)))))

# Finally calculate the result
transmute(r, x, y, n_all/n_any)
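On more recent dplyr the helper index can be avoided entirely; roughly the same counts fall out of a rowwise sketch (this reuses the `data2` construction from above and is an alternative formulation, not the original code):

```r
library(dplyr)

data <- data.frame(i = c("b","b","b","c"), j = c("c","d","d","a"),
                   database = c(1,1,2,3), stringsAsFactors = FALSE)
data2 <- mutate(data, x = pmin(i, j), y = pmax(i, j))

# For each canonical pair: n_all = databases containing the pair itself,
# n_any = databases containing either element of the pair
scores <- data2 %>%
  distinct(x, y) %>%
  rowwise() %>%
  mutate(
    n_all = n_distinct(data2$database[data2$x == x & data2$y == y]),
    n_any = n_distinct(data2$database[data2$x %in% c(x, y) |
                                        data2$y %in% c(x, y)]),
    score = n_all / n_any
  ) %>%
  ungroup()
```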
Patrick Roocks

Whew, this was awful! But I've coded my aforementioned hack, for anyone stumbling on equally obscure improvised network comparison in the future. If anyone knows of references that would simplify this, or make finding this type of natural group among network node pairs more solid, please let me know. :)

# Calculate the score one row at a time
for (linenr in 1:nrow(data)) {
  count_pair <- 0
  count_one <- 0
  # Loop through the databases. Note: unique() works whether database is
  # numeric or a factor; levels() returns NULL on a numeric column and
  # would silently skip the loop.
  for (setname in unique(data$database)) {
    dbset <- subset(data, database == setname)
    pair <- c(as.character(data$i[linenr]), as.character(data$j[linenr]))
    # Test whether either element of the pair appears in this database
    if (sum(pair %in% c(as.character(dbset$i), as.character(dbset$j))) > 0) {
      count_one <- count_one + 1
    }
    # Test whether this database has a line containing both elements
    for (line2nr in 1:nrow(dbset)) {
      if (sum(pair %in% c(as.character(dbset$i[line2nr]),
                          as.character(dbset$j[line2nr]))) == 2) {
        count_pair <- count_pair + 1
      }
    }
  }
  # Simple ratio calculation
  data$score[linenr] <- count_pair / count_one
}

frame <- data.frame(i = data$i, j = data$j, score = data$score)
# Remove database duplicates
result <- frame[!duplicated(frame), ]
#This still doesn't deal with changed order duplicates, but does the job now.
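One way to also fold together those changed-order duplicates is to put each pair into alphabetical order before deduplicating. A small sketch (the frame here is a stand-in for `data` after the loop above has filled in the score column):

```r
# Stand-in for `data` after the scoring loop has run
data <- data.frame(i = c("b","b","b","c"), j = c("c","d","d","a"),
                   score = c(1/3, 1, 1, 0.5), stringsAsFactors = FALSE)

# Canonical alphabetical order, so (c, a) and (a, c) collapse into one row
frame2 <- data.frame(x = pmin(data$i, data$j),
                     y = pmax(data$i, data$j),
                     score = data$score,
                     stringsAsFactors = FALSE)
result <- frame2[!duplicated(frame2[, c("x", "y")]), ]
```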
puslet88