I have the following directory structure:

Folder named A contains txt files named 1, 2, 3, .., 5
Folder named B contains txt files named 1, 2, 3, .., 5
|
--A (Folder)
  |---1.txt
  |---2.txt
  ....
  |---5.txt

--B (Folder)
  |---1.txt
  |---2.txt
  ....
  |---5.txt

I am reading these text files into data frames with two nested for loops. A single data frame looks like this:

df <- data.frame(Comp.1 = c(0.3, -0.2, -1, NA, 1),
         Comp.2 = c(-0.4, -0.1, NA, 0, 0.6),
         Comp.3 = c(0.2, NA, -0.4, 0.3, NA))
row.names(df) <- c("Param1", "Param2", "Param3", "Param4", "Param5")

Values always lie between -1 and +1. The number of rows (parameters) and the number of columns (components) are not the same across these data frames. For example, the data frame above has 3 components and 5 parameters; others can be 5x15, 4x10, 5x40, etc.

I want a plot that has:

1. parameters on x-axis
2. components on y-axis
3. values as points in the above graph 
4. shape of point representing folder name (A = square, B = triangle, C = circle, .., E)
5. color inside the point shape representing file name (1, 2, 3, .., 5)
6. color intensity describing value (e.g. light red [almost white] for values close to -1 such as -0.98, dark red for values close to 1 such as 0.98)

I have this code:

alphabets = c("A", "B", "C", "D", "E", "F")
numbers = c(1, 2, 3, 4, 5)

pca.plot <- ggplot(data = NULL, aes(xlab="Principal Components",ylab="Parameters"))

for (alphabet in alphabets){
   for(number in numbers){

   filename=paste("/filepath/",alphabet,"/",number,".txt", sep="")

   df <- read.table(filename)

   #Making all row dimensions = 62. Adding rows with NAs
   if(length(row.names.data.frame(df))<62){
      row_length = length(row.names.data.frame(df))
      for(i in row_length:61){
          new_row = c(NA, NA, NA, NA, NA, NA)
          df<-rbind(df, new_row)  
      }
   }

   df$row.names<-rownames(df)
   long.df<-melt(df,id=c("row.names"), na.rm = TRUE)
   pca.plot<-pca.plot+geom_point(data=long.df,aes(x=variable,y=row.names, shape = number, color=alphabet, size = value))
   }
}

The output of this code is: [image: plot produced by the code above]

EDIT: After following @Gregor's steps mentioned in the comments, I have a combined data frame, `big_data`, that looks like this:

head(big_data, 3)

  Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 params alphabet number
1     NA     NA     NA     NA     NA param1        A      1
2     NA     NA     NA   0.89     NA param2        A      1
3     NA  -0.95     NA     NA     NA param3        A      1

Globox
  • Combine your data into one data frame - one *tidy* data frame - and this will be trivial. I would recommend reading your data [into a list of data frames](http://stackoverflow.com/a/24376207/903061) and then combining them all at once. – Gregor Thomas Feb 14 '17 at 22:48
  • I have a list of data frames ready, filled with NAs wherever rows/columns were missing. How should I plot now? How do we access the attribute names of a data frame list? – Globox Feb 14 '17 at 23:45
  • Please notice the first sentence of my comment: **Combine your data into one data frame.** If you need help with this, see the section called *Combining a list of data frames into a single data frame* in [the answer I linked above](http://stackoverflow.com/a/24376207/903061). Make sure that the attributes you want to plot, including the file name and folder name, are columns in your data frame. If the file names are the names of your list, then, as stated in the link, `dplyr::bind_rows` or `data.table::rbindlist` will automatically add them as columns. – Gregor Thomas Feb 15 '17 at 00:16
  • Great. Can you show it in your question? If you post `dput(droplevels(head(your_data, 10)))` we will get a copy/pasteable version of the first 10 rows of your data. – Gregor Thomas Feb 15 '17 at 21:57
  • when i try to melt this big_data frame, `big_data.long – Globox Feb 15 '17 at 22:18

1 Answer

You need to melt the data frame to collapse all the Comp columns. The other columns should stay the same:

long_data = reshape2::melt(
    big_data,
    id.vars = c("params", "alphabet", "number"),
    variable.name = "comp",
    value.name = "value",
    na.rm = TRUE
)
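
As a small illustration of what this call produces, here is a sketch that applies it to the three rows shown in the question's `head(big_data, 3)`. The data frame is retyped by hand from that printout, so treat it as a stand-in rather than the real data.

```r
library(reshape2)

# The three rows printed in the question, retyped by hand (stand-in data)
big_data <- data.frame(
  Comp.1   = c(NA_real_, NA_real_, NA_real_),
  Comp.2   = c(NA, NA, -0.95),
  Comp.3   = c(NA_real_, NA_real_, NA_real_),
  Comp.4   = c(NA, 0.89, NA),
  Comp.5   = c(NA_real_, NA_real_, NA_real_),
  params   = c("param1", "param2", "param3"),
  alphabet = c("A", "A", "A"),
  number   = c(1, 1, 1)
)

# Collapse the Comp.* columns into a (comp, value) pair per cell,
# keeping params/alphabet/number as identifiers and dropping NA cells
long_data <- melt(
  big_data,
  id.vars = c("params", "alphabet", "number"),
  variable.name = "comp",
  value.name = "value",
  na.rm = TRUE
)

print(long_data)
```

Only the two non-NA cells survive, so `long_data` has one row per plotted point, with `params`, `alphabet`, `number`, `comp` and `value` as columns.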

Now, most of your requirements are easy:

  1. parameters on x-axis
  2. components on y-axis
  3. values as points in the above graph
  4. shape of point representing folder name (A = square, B = triangle, C = circle, .., E)
  5. color inside the point shape representing file name (1, 2, 3, .., 5)
  6. color intensity describing value (e.g. light red [almost white] for values close to -1 such as -0.98, dark red for values close to 1 such as 0.98)

ggplot(long_data, aes(
    x = params, y = comp, size = value,
    shape = alphabet, color = factor(number), alpha = value
)) +
    geom_point()

The tricky part is the requirement for both color intensity and overall color. The only way I know to approximate this in standard ggplot is to use transparency, as I did above: `color` carries the file number and `alpha` carries the value. This is the approach taken in, e.g., this question.
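
If "color inside the point shape" is meant literally, one possible variant (a sketch, not something tested on the real data) is to use the fillable point shapes 21 to 25 and map the file number to `fill` instead of `color`. The toy `long_data` below is made up, and the shape codes for D and E are my own assumption since the question only specifies A, B and C.

```r
library(ggplot2)

# Made-up long-format data in the shape that melt() returns
long_data <- data.frame(
  params   = c("param1", "param2", "param3", "param1", "param2", "param3"),
  comp     = c("Comp.1", "Comp.1", "Comp.2", "Comp.2", "Comp.3", "Comp.3"),
  alphabet = c("A", "A", "A", "B", "B", "B"),
  number   = c(1, 2, 1, 2, 1, 2),
  value    = c(-0.98, -0.2, 0.4, 0.98, 0.1, -0.6)
)

p <- ggplot(long_data, aes(
    x = params, y = comp,
    shape = alphabet, fill = factor(number), alpha = value
)) +
    geom_point(size = 5, colour = "black") +
    # Fillable shapes: 22 = square, 24 = triangle, 21 = circle (D, E assumed)
    scale_shape_manual(values = c(A = 22, B = 24, C = 21, D = 23, E = 25)) +
    # Fix the limits so -1 is always the faintest and +1 the most opaque
    scale_alpha_continuous(limits = c(-1, 1), range = c(0.1, 1)) +
    # The fill legend needs a fillable shape, otherwise its keys show no fill
    guides(fill = guide_legend(override.aes = list(shape = 21))) +
    labs(x = "Parameters", y = "Components", fill = "number")

print(p)
```

Fixing the alpha limits at -1 and 1 keeps the intensity comparable across plots even when a particular subset of the data does not span the full range.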


Note this is untested as your data isn't shared reproducibly. Share data with dput as suggested in the comments if there are issues that need testing.
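
In the meantime, a stand-in for `big_data` can be built from a named list of data frames with `dplyr::bind_rows`, as suggested in the comments. This is only a sketch: the values are invented, and naming the list elements `"A/1"`, `"B/2"` (folder/file) is my own convention, not something from the question.

```r
library(dplyr)

# Invented stand-ins for the data frames read from <folder>/<file>.txt;
# note they have different numbers of rows and columns
df_list <- list(
  "A/1" = data.frame(Comp.1 = c(0.3, -0.2), Comp.2 = c(-0.4, NA),
                     row.names = c("Param1", "Param2")),
  "B/2" = data.frame(Comp.1 = c(NA, 0.9, 0.1), Comp.2 = c(0.5, NA, -0.7),
                     Comp.3 = c(-1, 0.2, NA),
                     row.names = c("Param1", "Param2", "Param3"))
)

# Move row names into a regular column so they survive the bind
df_list <- lapply(df_list, function(d) {
  d$params <- rownames(d)
  rownames(d) <- NULL
  d
})

# bind_rows() fills columns missing from a data frame with NA and
# stores the list names in the .id column
big_data <- bind_rows(df_list, .id = "source")

# Split "A/1" into folder (alphabet) and file (number)
parts <- strsplit(big_data$source, "/", fixed = TRUE)
big_data$alphabet <- vapply(parts, `[`, character(1), 1)
big_data$number <- as.integer(vapply(parts, `[`, character(1), 2))
big_data$source <- NULL

print(big_data)
```

Because `bind_rows` pads missing columns, there is no need to add NA rows or columns by hand before combining.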

Gregor Thomas
  • Worked for me. Thanks Gregor. I can tweak the fancy part. Liked your way of leading me to solution. :) – Globox Feb 16 '17 at 18:14
  • Thanks! Glad it worked out - and glad you appreciated the approach. Not everyone loves it but I'm convinced you learn more from it :D – Gregor Thomas Feb 16 '17 at 18:21
  • Next time you share data though, do it with `dput`. Makes it *so* much easier to reproduce for the people trying to help you. – Gregor Thomas Feb 16 '17 at 18:22