1

so I have a CSV file that I imported into R studio. I ran my analysis and then was informed that there was a mistake with the data collection. Every 4 rows actually represents 1 month and I need to combine every 4 rows into 1 for each of 6 variables. The data looks like this:

             X2   X3   X4   X5   X6   X7
1          3830    0    0 1184    0 5765
2             0    0  153    0  153  153
3             0    0    0   73   73    0
4             0    0  153    0  153    0
5          3815    0 3354  651  903 5174
6             0    0  138  101  273  558
7             0    0  322    0  272  322
8             0    0    0  101  273  420
9             0    0 3682 3302   83 6485
10          165    0   63    0  293  165
11            0    0   98   98    0    0
12            0    0    0    0  230    0

So basically I want rows 1,2,3,4 combined into one new row. 5,6,7,8 into another etc. What is the easiest way to go about doing this in R???

Phil
  • 4,424
  • 3
  • 22
  • 55
Josh_PL
  • 25
  • 5

4 Answers4

3

Here's an approach with dplyr that relies on first creating a vector to summarize against. Then, we use group_by by to identify the rows to aggregate. Finally, we use across to perform the action on all the columns.

library(dplyr) #Version >= 1.0.0
result <- data %>%
            mutate(Aggregate = rep(seq(1,ceiling(nrow(data)/4)),each = 4)) %>%
            group_by(Aggregate) %>%
            summarise(across(everything(), sum))
result
# A tibble: 3 x 7
#  Aggregate    X2    X3    X4    X5    X6    X7
#      <int> <int> <int> <int> <int> <int> <int>
#1         1  3830     0   306  1257   379  5918
#2         2  3815     0  3814   853  1721  6474
#3         3   165     0  3843  3400   606  6650

If you're using an older version of dplyr, you could do this:

data %>%
  mutate(Aggregate = rep(seq(1,ceiling(nrow(data)/4)),each = 4)) %>%
  group_by(Aggregate) %>%
  summarise_all(sum)

Data

data <- structure(list(X2 = c(3830L, 0L, 0L, 0L, 3815L, 0L, 0L, 0L, 0L, 
165L, 0L, 0L), X3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L), X4 = c(0L, 153L, 0L, 153L, 3354L, 138L, 322L, 0L, 3682L, 
63L, 98L, 0L), X5 = c(1184L, 0L, 73L, 0L, 651L, 101L, 0L, 101L, 
3302L, 0L, 98L, 0L), X6 = c(0L, 153L, 73L, 153L, 903L, 273L, 
272L, 273L, 83L, 293L, 0L, 230L), X7 = c(5765L, 153L, 0L, 0L, 
5174L, 558L, 322L, 420L, 6485L, 165L, 0L, 0L)), row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"), class = "data.frame")
Ian Campbell
  • 19,690
  • 9
  • 20
  • 44
  • I get this error when I run that code: Error in seq.default(1, ceiling(nrow(data)/4)) : 'to' must be of length 1 – Josh_PL Jul 13 '20 at 19:00
  • Make sure that you replace `data`, both in the beginning (line2), and in that line, with the name of your actual data.frame. – Ian Campbell Jul 13 '20 at 19:01
  • Ahhh I missed it in line 2. Now it is saying this error: Error in across(everything(), sum) : could not find function "across" – Josh_PL Jul 13 '20 at 19:07
  • Check `packageVersion("dplyr")`. As stated in the code, it needs to be `>=1.0.0`. Or, alternatively, you could use `summarise_all(sum)`. – Ian Campbell Jul 13 '20 at 19:08
  • Sweet! Thank you! is there an easy way to save that output as a variable that I can then play with? – Josh_PL Jul 13 '20 at 19:16
  • Absolutely, simply change line 2 to `result %`. The result will be saved in a data.frame named `result`. – Ian Campbell Jul 13 '20 at 19:16
  • Sorry if this is silly, but for changing line 2, do you just tag it on in front of the mute or change the whole line, or does that become a line itself – Josh_PL Jul 13 '20 at 19:21
  • Remember that the `%>%` operator pipes the left hand side into the first argument of the right hand side. In this case, the data.frame is being piped into `mutate`. After all the evaluation on the the right side of `% ...`. – Ian Campbell Jul 13 '20 at 19:23
  • Ahhhh that makes sense! thank you! I appreciate it! – Josh_PL Jul 13 '20 at 19:26
  • Hey Ian, I have another question about doing something similar. I have to do another mutation of the same data set. But what I need to do is leave row 1 unchanged, aggregate the next three in a similar fashion. I have tried playing with this to make it work and I just cant quite get it. – Josh_PL Jul 14 '20 at 14:42
  • Hi Josh, I think that sounds like a new question. This question is positively scored, so you shouldn't have an issue asking too many questions. – Ian Campbell Jul 14 '20 at 14:44
  • Thanks! I posted another question but I found that you have been exceedingly helpful and I greatly appreciate it! Basically I want to do the same thing, here all four rows were aggregated into 1 new row. What I would like to do now is for every for rows, I would like to leave the first row unchanged and then aggregate the next three rows. So leave row 1 alone. Aggregate rows 2-4. Leave row 5 alone. Aggregate rows 6-8, etc. – Josh_PL Jul 14 '20 at 14:47
3

If we want to do a group by every 4 rows, create a grouping column with gl and use aggregate to get the sum of all the columns (assuming they are all numeric) with respect to the grouping column

df1$grp <- as.integer(gl(nrow(df1), 4, nrow(df1)))
aggregate(.~ grp, df1, FUN = sum,  na.rm = TRUE)
#  grp   X2 X3   X4   X5   X6   X7
#1   1 3830  0  306 1257  379 5918
#2   2 3815  0 3814  853 1721 6474
#3   3  165  0 3843 3400  606 6650

data

df1 <- structure(list(X2 = c(3830L, 0L, 0L, 0L, 3815L, 0L, 0L, 0L, 0L, 
165L, 0L, 0L), X3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L), X4 = c(0L, 153L, 0L, 153L, 3354L, 138L, 322L, 0L, 3682L, 
63L, 98L, 0L), X5 = c(1184L, 0L, 73L, 0L, 651L, 101L, 0L, 101L, 
3302L, 0L, 98L, 0L), X6 = c(0L, 153L, 73L, 153L, 903L, 273L, 
272L, 273L, 83L, 293L, 0L, 230L), X7 = c(5765L, 153L, 0L, 0L, 
5174L, 558L, 322L, 420L, 6485L, 165L, 0L, 0L)), class = "data.frame", 
row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
akrun
  • 674,427
  • 24
  • 381
  • 486
2

data.table version of akrun's aggregate answer

library(data.table)
setDT(df)

df[, lapply(.SD, sum), by = .(grp = df[, gl(.N, 4, .N)])]
#    df   X2 X3   X4   X5   X6   X7
# 1:  1 3830  0  306 1257  379 5918
# 2:  2 3815  0 3814  853 1721 6474
# 3:  3  165  0 3843 3400  606 6650
IceCreamToucan
  • 23,575
  • 2
  • 13
  • 25
0

You can try:

library(tidyverse)
#Data

df <- structure(list(X2 = c(3830L, 0L, 0L, 0L, 3815L, 0L, 0L, 0L, 0L, 
165L, 0L, 0L), X3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L), X4 = c(0L, 153L, 0L, 153L, 3354L, 138L, 322L, 0L, 3682L, 
63L, 98L, 0L), X5 = c(1184L, 0L, 73L, 0L, 651L, 101L, 0L, 101L, 
3302L, 0L, 98L, 0L), X6 = c(0L, 153L, 73L, 153L, 903L, 273L, 
272L, 273L, 83L, 293L, 0L, 230L), X7 = c(5765L, 153L, 0L, 0L, 
5174L, 558L, 322L, 420L, 6485L, 165L, 0L, 0L)), row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"), class = "data.frame")

#Create index
index <- seq(1,dim(df)[1],by=4)
index2 <- 1:length(index)
#Create empty var
df$index <- NA
df$index[index] <- index2
#Now fill
df %>% fill(index) %>% group_by(index) %>% summarise_all(sum,na.rm=T)

# A tibble: 3 x 7
  index    X2    X3    X4    X5    X6    X7
  <int> <int> <int> <int> <int> <int> <int>
1     1  3830     0   306  1257   379  5918
2     2  3815     0  3814   853  1721  6474
3     3   165     0  3843  3400   606  6650
Duck
  • 37,428
  • 12
  • 34
  • 70