Combining every four lines into 1

Question

so I have a CSV file that I imported into R studio. I ran my analysis and then was informed that there was a mistake with the data collection. Every 4 rows actually represents 1 month and I need to combine every 4 rows into 1 for each of 6 variables. The data looks like this:

             X2   X3   X4   X5   X6   X7
1          3830    0    0 1184    0 5765
2             0    0  153    0  153  153
3             0    0    0   73   73    0
4             0    0  153    0  153    0
5          3815    0 3354  651  903 5174
6             0    0  138  101  273  558
7             0    0  322    0  272  322
8             0    0    0  101  273  420
9             0    0 3682 3302   83 6485
10          165    0   63    0  293  165
11            0    0   98   98    0    0
12            0    0    0    0  230    0

So basically I want rows 1,2,3,4 combined into one new row. 5,6,7,8 into another etc. What is the easiest way to go about doing this in R???

When you say "combined", do you mean you want to sum the values in each column, so that each new row is a sum of four rows in your original dataset? — Tim Goodman, Jul 13 '20 at 18:53
Yes to both! So for X2, I want to have a new row that says 3830 for (3830 +0+0+0) — Josh_PL, Jul 13 '20 at 18:54

Ian Campbell · Accepted Answer · 2020-07-13T19:24:31.933

3

Here's an approach with dplyr that relies on first creating a vector to summarize against. Then, we use group_by by to identify the rows to aggregate. Finally, we use across to perform the action on all the columns.

library(dplyr) #Version >= 1.0.0
result <- data %>%
            mutate(Aggregate = rep(seq(1,ceiling(nrow(data)/4)),each = 4)) %>%
            group_by(Aggregate) %>%
            summarise(across(everything(), sum))
result
# A tibble: 3 x 7
#  Aggregate    X2    X3    X4    X5    X6    X7
#      <int> <int> <int> <int> <int> <int> <int>
#1         1  3830     0   306  1257   379  5918
#2         2  3815     0  3814   853  1721  6474
#3         3   165     0  3843  3400   606  6650

If you're using an older version of dplyr, you could do this:

data %>%
  mutate(Aggregate = rep(seq(1,ceiling(nrow(data)/4)),each = 4)) %>%
  group_by(Aggregate) %>%
  summarise_all(sum)

Data

data <- structure(list(X2 = c(3830L, 0L, 0L, 0L, 3815L, 0L, 0L, 0L, 0L, 
165L, 0L, 0L), X3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L), X4 = c(0L, 153L, 0L, 153L, 3354L, 138L, 322L, 0L, 3682L, 
63L, 98L, 0L), X5 = c(1184L, 0L, 73L, 0L, 651L, 101L, 0L, 101L, 
3302L, 0L, 98L, 0L), X6 = c(0L, 153L, 73L, 153L, 903L, 273L, 
272L, 273L, 83L, 293L, 0L, 230L), X7 = c(5765L, 153L, 0L, 0L, 
5174L, 558L, 322L, 420L, 6485L, 165L, 0L, 0L)), row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"), class = "data.frame")

edited Jul 13 '20 at 19:24

answered Jul 13 '20 at 18:55

Ian Campbell

19,690
9
20
44

I get this error when I run that code: Error in seq.default(1, ceiling(nrow(data)/4)) : 'to' must be of length 1 – Josh_PL Jul 13 '20 at 19:00
Make sure that you replace `data`, both in the beginning (line2), and in that line, with the name of your actual data.frame. – Ian Campbell Jul 13 '20 at 19:01
Ahhh I missed it in line 2. Now it is saying this error: Error in across(everything(), sum) : could not find function "across" – Josh_PL Jul 13 '20 at 19:07
Check `packageVersion("dplyr")`. As stated in the code, it needs to be `>=1.0.0`. Or, alternatively, you could use `summarise_all(sum)`. – Ian Campbell Jul 13 '20 at 19:08
Sweet! Thank you! is there an easy way to save that output as a variable that I can then play with? – Josh_PL Jul 13 '20 at 19:16
Absolutely, simply change line 2 to `result %`. The result will be saved in a data.frame named `result`. – Ian Campbell Jul 13 '20 at 19:16
Sorry if this is silly, but for changing line 2, do you just tag it on in front of the mute or change the whole line, or does that become a line itself – Josh_PL Jul 13 '20 at 19:21
Remember that the `%>%` operator pipes the left hand side into the first argument of the right hand side. In this case, the data.frame is being piped into `mutate`. After all the evaluation on the the right side of `% ...`. – Ian Campbell Jul 13 '20 at 19:23
Ahhhh that makes sense! thank you! I appreciate it! – Josh_PL Jul 13 '20 at 19:26
Hey Ian, I have another question about doing something similar. I have to do another mutation of the same data set. But what I need to do is leave row 1 unchanged, aggregate the next three in a similar fashion. I have tried playing with this to make it work and I just cant quite get it. – Josh_PL Jul 14 '20 at 14:42
Hi Josh, I think that sounds like a new question. This question is positively scored, so you shouldn't have an issue asking too many questions. – Ian Campbell Jul 14 '20 at 14:44
Thanks! I posted another question but I found that you have been exceedingly helpful and I greatly appreciate it! Basically I want to do the same thing, here all four rows were aggregated into 1 new row. What I would like to do now is for every for rows, I would like to leave the first row unchanged and then aggregate the next three rows. So leave row 1 alone. Aggregate rows 2-4. Leave row 5 alone. Aggregate rows 6-8, etc. – Josh_PL Jul 14 '20 at 14:47

score 3 · Answer 2 · answered Jul 13 '20 at 18:56

If we want to do a group by every 4 rows, create a grouping column with gl and use aggregate to get the sum of all the columns (assuming they are all numeric) with respect to the grouping column

df1$grp <- as.integer(gl(nrow(df1), 4, nrow(df1)))
aggregate(.~ grp, df1, FUN = sum,  na.rm = TRUE)
#  grp   X2 X3   X4   X5   X6   X7
#1   1 3830  0  306 1257  379 5918
#2   2 3815  0 3814  853 1721 6474
#3   3  165  0 3843 3400  606 6650

data

df1 <- structure(list(X2 = c(3830L, 0L, 0L, 0L, 3815L, 0L, 0L, 0L, 0L, 
165L, 0L, 0L), X3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L), X4 = c(0L, 153L, 0L, 153L, 3354L, 138L, 322L, 0L, 3682L, 
63L, 98L, 0L), X5 = c(1184L, 0L, 73L, 0L, 651L, 101L, 0L, 101L, 
3302L, 0L, 98L, 0L), X6 = c(0L, 153L, 73L, 153L, 903L, 273L, 
272L, 273L, 83L, 293L, 0L, 230L), X7 = c(5765L, 153L, 0L, 0L, 
5174L, 558L, 322L, 420L, 6485L, 165L, 0L, 0L)), class = "data.frame", 
row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))

score 2 · Answer 3 · answered Jul 13 '20 at 20:11

data.table version of akrun's aggregate answer

library(data.table)
setDT(df)

df[, lapply(.SD, sum), by = .(grp = df[, gl(.N, 4, .N)])]
#    df   X2 X3   X4   X5   X6   X7
# 1:  1 3830  0  306 1257  379 5918
# 2:  2 3815  0 3814  853 1721 6474
# 3:  3  165  0 3843 3400  606 6650

score 0 · Answer 4 · answered Jul 13 '20 at 18:57

You can try:

library(tidyverse)
#Data

df <- structure(list(X2 = c(3830L, 0L, 0L, 0L, 3815L, 0L, 0L, 0L, 0L, 
165L, 0L, 0L), X3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L), X4 = c(0L, 153L, 0L, 153L, 3354L, 138L, 322L, 0L, 3682L, 
63L, 98L, 0L), X5 = c(1184L, 0L, 73L, 0L, 651L, 101L, 0L, 101L, 
3302L, 0L, 98L, 0L), X6 = c(0L, 153L, 73L, 153L, 903L, 273L, 
272L, 273L, 83L, 293L, 0L, 230L), X7 = c(5765L, 153L, 0L, 0L, 
5174L, 558L, 322L, 420L, 6485L, 165L, 0L, 0L)), row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"), class = "data.frame")

#Create index
index <- seq(1,dim(df)[1],by=4)
index2 <- 1:length(index)
#Create empty var
df$index <- NA
df$index[index] <- index2
#Now fill
df %>% fill(index) %>% group_by(index) %>% summarise_all(sum,na.rm=T)

# A tibble: 3 x 7
  index    X2    X3    X4    X5    X6    X7
  <int> <int> <int> <int> <int> <int> <int>
1     1  3830     0   306  1257   379  5918
2     2  3815     0  3814   853  1721  6474
3     3   165     0  3843  3400   606  6650

Combining every four lines into 1

4 Answers4

data

Linked