
I'm plotting a GO (Gene Ontology) analysis with ggplot2. There are 15 groups, each of which has 3 items. What I want looks like this: [image: example of the desired plot]

But what I get from ggplot looks like this: [image: my current plot]

As you can see, groups that share the same item are combined, and the bars are too tightly packed to tell apart. I'd like them to be separated, so that each one is independent.

Is there any way I can achieve that? Thx!

Here is the data:

dput(t3)
structure(list(X = 1:15, GO = structure(c(4L, 1L, 9L, 4L, 7L, 
13L, 4L, 8L, 11L, 3L, 12L, 2L, 5L, 10L, 6L), .Label = c("GO:0002433", 
"GO:0006644", "GO:0006650", "GO:0007169", "GO:0007266", "GO:0008360", 
"GO:0033674", "GO:0038093", "GO:0038096", "GO:0051056", "GO:0051347", 
"GO:0090407", "GO:1901652"), class = "factor"), Category = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "GO Biological Processes", class = "factor"), 
    Description = structure(c(13L, 4L, 2L, 13L, 7L, 11L, 13L, 
    1L, 8L, 3L, 5L, 6L, 12L, 10L, 9L), .Label = c("Fc receptor signaling pathway", 
    "Fc-gamma receptor signaling pathway involved in phagocytosis", 
    "glycerophospholipid metabolic process", "immune response-regulating cell surface receptor signaling pathway involved in phagocytosis", 
    "organophosphate biosynthetic process", "phospholipid metabolic process", 
    "positive regulation of kinase activity", "positive regulation of transferase activity", 
    "regulation of cell shape", "regulation of small GTPase mediated signal transduction", 
    "response to peptide", "Rho protein signal transduction", 
    "transmembrane receptor protein tyrosine kinase signaling pathway"
    ), class = "factor"), LogP = c(-40.887821181404, -38.4419736211887, 
    -38.4419736211887, -43.8825656440477, -40.9658168441261, 
    -40.928738226587, -52.6082917563572, -50.1381337600203, -46.3514494215002, 
    -71.5496270612963, -67.9829155303795, -67.4635662792925, 
    -40.1994934987666, -36.6884206140971, -33.5740640802768), 
    Label = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 2L, 5L, 3L, 3L, 
    4L, 5L, 5L, 4L, 4L), .Label = c("A", "B", "C", "D", "E"), class = "factor"), 
    Pvalue = c(1.2947288298802e-41, 3.61431815145327e-39, 3.61431815145327e-39, 
    1.3104919452196e-44, 1.08189012276903e-41, 1.17831599603069e-41, 
    2.46438322354707e-53, 7.27555687363364e-51, 4.45195307925231e-47, 
    2.82080418123444e-72, 1.04012244840109e-68, 3.43901223333678e-68, 
    6.31693635444945e-41, 2.04917659044434e-37, 2.66646519778058e-34
    )), class = "data.frame", row.names = c(NA, -15L))

and here is my code:

ggplot(top3, aes(x = reorder(description, -log10pvalue), y = log10pvalue, fill = Label)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(x = "Description", y = "Log Pvalue") +
  coord_flip()
  • Where is `top3` in your question? – Duck Aug 18 '20 at 15:50
  • I'd recommend facets for this: keep your plot as is and add `+ facet_grid(~group)`, or whatever your grouping variable is called. If you need more help, please share some sample data reproducibly; `dput()` is best for that. You don't need the whole thing, but maybe all 3 items from each of 3 groups. – Gregor Thomas Aug 18 '20 at 15:56
  • @Duck already updated – richardzhang Aug 18 '20 at 16:14
  • Please use dput to add your data. It is hard to read in the format you provided, especially since you have adjacent character columns separated by spaces, with spaces within them too. – dww Aug 18 '20 at 16:22
  • I've already used dput() to share the data, thx! – richardzhang Aug 18 '20 at 16:31
  • @GregorThomas thank you, faceting does make it better, but it also makes the bars not start from the same place. – richardzhang Aug 18 '20 at 16:35

2 Answers

There are two fundamental approaches you can use here: (1) without facets, or (2) with facets. I personally prefer without facets in this case, but I'll show you both using your example data.

First of all, the data you posted does not contain the column log10pvalue; however, I will assume that we can use LogP. Your dataset as shared is t3, so we'll use that convention here too. The base plot is as follows:

ggplot(t3, aes(x=reorder(Description, -Pvalue), y=LogP, fill=Label)) +
  geom_bar(stat='identity', position=position_dodge()) +
  labs(x='Description', y='Pvalue') +
  coord_flip()

From the look of it, only one Description value contains more than one Label in this dataset, but it will work for demonstrating the solutions.

Pre-work notes

Before implementing solutions, note that your plot code can be simplified a bit. Using geom_bar(stat='identity') is just a longhand way of writing geom_col(), which by default draws bars positioned on x with heights given by y. You can read about the difference between the two geoms here.

Secondly, instead of defining x and y and then flipping with coord_flip(), you can map the discrete variable to y and the continuous variable to x directly (ggplot2 3.3.0 or later), so there is no need to flip the axes afterwards.

I'll implement these conventions in both solutions.
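As a quick check of the two notes above, here is a minimal sketch on a tiny made-up data frame. The names toy, term and score are hypothetical and not from your data, and it assumes ggplot2 3.3.0 or later, where geom_col() accepts the discrete variable on y.

```r
library(ggplot2)

# Hypothetical toy data, only for illustration
toy <- data.frame(term = c("a", "b", "c"), score = c(3, 1, 2))

# Longhand and shorthand versions of the same vertical bar chart
p_bar <- ggplot(toy, aes(x = term, y = score)) + geom_bar(stat = "identity")
p_col <- ggplot(toy, aes(x = term, y = score)) + geom_col()

# Horizontal bars without coord_flip(): discrete variable on y, continuous on x
p_horizontal <- ggplot(toy, aes(y = term, x = score)) + geom_col()

# layer_data() returns the values ggplot2 computed for drawing each layer
bar_heights <- layer_data(p_bar)$y
col_heights <- layer_data(p_col)$y
horizontal_lengths <- layer_data(p_horizontal)$x
```

The first two plots should produce the same bar heights, and the third should draw the same values as horizontal bar lengths.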

1. Without Facets

Your initial code was almost right for a plot without faceting, but it needs a bit of tweaking to look right. There are a few things to consider here:

  • Bar width is not preserved. In your plot and the plot above, you can see that the bar width depends on the number of labels. This means that the bar for a Description with one Label will be wider than the bars for a Description with more than one Label. To fix this, you can use the argument preserve="single" inside position_dodge() to ensure that the bars have equal width regardless of the number of Labels. You also want to use position_dodge2() here instead of position_dodge(). The difference is apparent if you switch, but basically position_dodge2() ensures that bars are centered correctly.

  • Labels are loooong. Unless you can abbreviate your Description names... these things are long and are taking up all your plot space. What you want is to wrap the labels of the axis text. I like the usage of wrap_format() within the scales package here.

  • Bars are going "the wrong way". You want to reverse the axis, which you can do by using scale_x_reverse().
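To see what the label wrapping does on its own, here is a small sketch that applies wrap_format() to one of your Description values. The width of 38 is simply the value I use in the plots; pick whatever fits your figure.

```r
library(scales)

# One of the long Description values from t3
long_label <- "immune response-regulating cell surface receptor signaling pathway involved in phagocytosis"

# wrap_format() returns a function that inserts line breaks
# so that each line stays within the given width
wrap_38 <- wrap_format(38)
wrapped <- wrap_38(long_label)

# Print the wrapped label to see where the breaks fall
cat(wrapped, "\n")
```

Passing the same function to the labels= argument of a discrete scale applies it to every axis label.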

Combining these solutions we get the following without using facets:

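ggplot(t3, aes(y=reorder(Description, -Pvalue), x=LogP, fill=Label)) +
  geom_col(position=position_dodge2(preserve='single'), color="black") +
  # Description is now on y and LogP on x, so the axis titles follow suit
  labs(x='Log Pvalue', y='Description') +
  scale_x_reverse() +
  # wrap_format() comes from the scales package
  scale_y_discrete(labels=scales::wrap_format(38))+
  theme_classic()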

Not too bad. In the example you showed, the bar colors are tied to the Description rather than the Label. To reproduce that, change the fill= aesthetic to Description. If you do, you also need to set the group= aesthetic so that position_dodge2() knows which variable to dodge on; otherwise the dodging will not work. The drawback of this approach is that it is no longer clear which bar corresponds to which Label. You could probably address that with annotations, but regardless, here's what it would look like:

ggplot(t3, aes(y=reorder(Description, -Pvalue), x=LogP, fill=Description, group=Label)) +
  geom_col(position=position_dodge2(preserve='single'), color="black", show.legend = FALSE) +
# you have to set show.legend=FALSE or it will be... bad
  # Description is on y and LogP on x, so the axis titles follow suit
  labs(x='Log Pvalue', y='Description') +
  scale_x_reverse() +
  # wrap_format() comes from the scales package
  scale_y_discrete(labels=scales::wrap_format(38))+
  theme_classic()

2. With Facets

To use facets, you can take the code from above and add facet_wrap() or facet_grid(). A lot of people prefer facet_wrap(), but honestly I always find it's easier to use facet_grid() and specify using . notation if you want to have the facets arranged vertically or horizontally. Here, we want them arranged vertically, and faceted based on Description. You also want to make sure to include scales="free_y". If you left this out, every facet would include a space for every other Description... and it would look really bad.

Finally, I remove the facet label (since we already have axis labels enabled) by setting the theme element strip.text to be element_blank().

ggplot(t3, aes(y=reorder(Description, -Pvalue), x=LogP, fill=Label)) +
  geom_col(position=position_dodge2(preserve='single'), color="black") +
  # Description is on y and LogP on x, so the axis titles follow suit
  labs(x='Log Pvalue', y='Description') +
  scale_x_reverse() +
  # wrap_format() comes from the scales package
  scale_y_discrete(labels=scales::wrap_format(38))+
  theme_classic() + facet_grid(Description~., scales='free_y') +
  theme(strip.text=element_blank())

You'll notice that your reordering by Pvalue does not work here. The facets follow the order of the factor levels of t3$Description, which is alphabetical by default. You can solve this by setting t3$Description to a factor yourself, with a specific order given to the levels= argument, before your plot code. One way to do this, using arrange() and the %>% pipe from dplyr, is the following:

library(dplyr)  # provides arrange() and the %>% pipe

# arrange by Pvalue
t3 <- t3 %>% arrange(Pvalue)
# create factor and set levels according to the way in which they appear
# in your arranged dataset
t3$Description <- factor(t3$Description, levels=unique(t3$Description))

# ... your plot code from above here again

chemdork123
  • Thank you for such a detailed answer! But as you can see, one of my problems is that bars sharing the same term are squeezed together, whereas in the example I showed they are separated: the 2nd and the 4th are both "Cell Cycle Phase", but they are drawn separately. – richardzhang Aug 18 '20 at 17:35

From the data you provided, I assume your grouping is represented by Label, and that you want to arrange the bars of each group next to each other, in order of decreasing statistical significance. You can achieve this by turning X (which has unique values) into a factor, with the levels ordered according to the order you want your bars in.

Typically, people use -log10 p-value for that, so higher values are more significant.
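As a small worked example of that transform, using the first Pvalue from the data in the question: -log10 turns a p-value of about 1.29e-41 into roughly 40.9, which is the stored LogP value with the sign flipped.

```r
# First row of t3: Pvalue and the LogP value stored alongside it
p_value <- 1.2947288298802e-41
stored_logp <- -40.887821181404

# -log10 maps smaller p-values to larger positive numbers
neg_log10_p <- -log10(p_value)

# The stored LogP is log10(Pvalue), so the two should differ only in sign
difference <- abs(neg_log10_p + stored_logp)
print(neg_log10_p)
print(difference < 1e-6)
```

With this convention, longer bars mean more significant terms, and the bars extend to the right from zero.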

Putting the ordering and the -log10 transform together, you get something like this:

library(ggplot2)
library(dplyr)
t3 <- structure(list(X = 1:15, GO = structure(c(4L, 1L, 9L, 4L, 7L, 
                                          13L, 4L, 8L, 11L, 3L, 12L, 2L, 5L, 10L, 6L), .Label = c("GO:0002433", 
                                                                                                  "GO:0006644", "GO:0006650", "GO:0007169", "GO:0007266", "GO:0008360", 
                                                                                                  "GO:0033674", "GO:0038093", "GO:0038096", "GO:0051056", "GO:0051347", 
                                                                                                  "GO:0090407", "GO:1901652"), class = "factor"), Category = structure(c(1L, 
                                                                                                                                                                         1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "GO Biological Processes", class = "factor"), 
               Description = structure(c(13L, 4L, 2L, 13L, 7L, 11L, 13L, 
                                         1L, 8L, 3L, 5L, 6L, 12L, 10L, 9L), .Label = c("Fc receptor signaling pathway", 
                                                                                       "Fc-gamma receptor signaling pathway involved in phagocytosis", 
                                                                                       "glycerophospholipid metabolic process", "immune response-regulating cell surface receptor signaling pathway involved in phagocytosis", 
                                                                                       "organophosphate biosynthetic process", "phospholipid metabolic process", 
                                                                                       "positive regulation of kinase activity", "positive regulation of transferase activity", 
                                                                                       "regulation of cell shape", "regulation of small GTPase mediated signal transduction", 
                                                                                       "response to peptide", "Rho protein signal transduction", 
                                                                                       "transmembrane receptor protein tyrosine kinase signaling pathway"
                                         ), class = "factor"), LogP = c(-40.887821181404, -38.4419736211887, 
                                                                        -38.4419736211887, -43.8825656440477, -40.9658168441261, 
                                                                        -40.928738226587, -52.6082917563572, -50.1381337600203, -46.3514494215002, 
                                                                        -71.5496270612963, -67.9829155303795, -67.4635662792925, 
                                                                        -40.1994934987666, -36.6884206140971, -33.5740640802768), 
               Label = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 2L, 5L, 3L, 3L, 
                                   4L, 5L, 5L, 4L, 4L), .Label = c("A", "B", "C", "D", "E"), class = "factor"), 
               Pvalue = c(1.2947288298802e-41, 3.61431815145327e-39, 3.61431815145327e-39, 
                          1.3104919452196e-44, 1.08189012276903e-41, 1.17831599603069e-41, 
                          2.46438322354707e-53, 7.27555687363364e-51, 4.45195307925231e-47, 
                          2.82080418123444e-72, 1.04012244840109e-68, 3.43901223333678e-68, 
                          6.31693635444945e-41, 2.04917659044434e-37, 2.66646519778058e-34
               )), class = "data.frame", row.names = c(NA, -15L))

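# convert to -log10(p) and sort within each Label group
t3 <- t3 %>% dplyr::mutate(LogP = -log10(Pvalue)) %>% 
    dplyr::arrange(Label, -Pvalue)
ggplot(t3, aes(x=factor(X, unique(X)), y=LogP, fill=Label)) +
    geom_bar(stat='identity') +
    # dashed red line at p = 0.01, i.e. -log10(p) = 2
    geom_hline(yintercept = 2, linetype=2, col="red") +
    labs(x=NULL, y=expression(-log[10] * ' p-value'))+
    coord_flip()+
    # use the X values in row order as breaks so each label is paired with its own row
    scale_x_discrete(breaks=t3$X, labels=as.character(t3$Description))+
    theme_classic()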

Created on 2020-08-18 by the reprex package (v0.3.0)

user12728748