0

Sample data:

library(dplyr)

id <- rep(LETTERS[1:5], each = 10)
x <- round(runif(50, -500, 200), digits = 0)
y <- round(runif(50, -700, 700), digits = 0)
z <- round(runif(50, 250, 300), digits = 0)

df.1 <- data.frame(id = id, x = x, y = y, z = z)
> summary(df.1)
 id           x                y                 z        
 A:10   Min.   :-497.0   Min.   :-665.00   Min.   :251.0  
 B:10   1st Qu.:-283.2   1st Qu.:-349.50   1st Qu.:261.2  
 C:10   Median :-128.0   Median : -33.50   Median :274.5  
 D:10   Mean   :-145.4   Mean   : -39.58   Mean   :275.3  
 E:10   3rd Qu.: -15.0   3rd Qu.: 293.25   3rd Qu.:288.0  
        Max.   : 171.0   Max.   : 696.00   Max.   :299.0  

What I'm trying to achieve are:

  1. put each id into its own dataframe
  2. create a new column called "direction" which would be a response to conditions below

a - identify the column with the widest range among x, y, z b - within the identified column, calculate direction by whether the next row value is bigger than the current row value - TRUE and FALSE return

i.e where y has the maximum range

  id    x    y   z direction
1  A -320   31 251      TRUE
2  A -199 -530 276     FALSE
3  A -228  390 264      TRUE
4  A -158  363 268      TRUE
5  A -308  150 267     FALSE
6  A  -47  345 261        NA

It is really important that the direction is calculated on the column that has the maximum range. In the sample data, it's likely column y is always the one with the maximum range, but in my real data it could be any column..

I imagine it'd involve mutate and ifelse?! but not sure how I'd go about it... I normally would use extensive for loop and only started using dplyr last week or two.. Trying not to fall back on to messy for loop and severely nested codes..

Appreciate your help very much! Thanks!

    for (i in 1:length(unique(id)) {

    x <- 
      df.1 %>% 
      filter(id == unique(id)[i] %>%
      mutate(direction = ifelse())

    assign(unique(id)[i], x)

      }
  • Sequentially named variables are terrible. Use a [list of data frames](https://stackoverflow.com/questions/17499013/how-do-i-make-a-list-of-data-frames), e.g., just `df_list = split(df, df$id)`. – Gregor Thomas Jun 12 '17 at 23:05

1 Answers1

0

put each id into its own dataframe

df_list = split(df.1, df.1$id)

create a new column called "direction" which would be a response to conditions below

  • identify the column with the widest range among x, y, z b
  • within the identified column, calculate direction by whether the next row value is bigger than the current row value
  • TRUE and FALSE return

Let's write a function that does this to one data frame:

foo = function(df) {
  # identify column with widest range within x, y, z
  sub_df = df[c("x", "y", "z")]
  ranges = sapply(sub_df, max) - sapply(sub_df, min)
  widest = which.max(ranges)
  # see which direction it goes
  direction = diff(sub_df[[widest]]) < 0
  # add this as a column to whole df
  df$direction = c(direction, NA)
  return(df)
}

We can then apply this function to each data frame:

df_list = lapply(df_list foo)

Complete demo here. I made your data little smaller just to keep it compact:

set.seed(47)
id <- rep(LETTERS[1:3], each = 6)
x <- round(runif(18, -500, 200), digits = 0)
y <- round(runif(18, -700, 700), digits = 0)
z <- round(runif(18, 250, 300), digits = 0)
df.1 <- data.frame(id = id, x = x, y = y, z = z)

df_list = split(df.1, df.1$id)

df_list = lapply(df_list, foo)
df_list
# $A
# id    x    y   z direction
# 1  A  184 -600 262     FALSE
# 2  A -238  -44 299      TRUE
# 3  A   33 -451 274     FALSE
# 4  A   76   80 284      TRUE
# 5  A  -99   22 253      TRUE
# 6  A  -16 -513 269        NA
# 
# $B
# id    x    y   z direction
# 7   B -228  265 280      TRUE
# 8   B -172 -168 297      TRUE
# 9   B -120 -653 268     FALSE
# 10  B  147 -648 260     FALSE
# 11  B -403   51 283     FALSE
# 12  B   -9  419 298        NA
# 
# $C
# id    x    y   z direction
# 13  C -386  348 269      TRUE
# 14  C  -80 -183 293     FALSE
# 15  C -146  -45 259      TRUE
# 16  C  131 -429 289     FALSE
# 17  C -220  556 253      TRUE
# 18  C -478  -84 252        NA

I have not used dplyr. dplyr is good at many things, but doing something to a column where the column depends on a condition is tough in dplyr, writing a function like foo above is easier. That said, you could adjust the function a little bit and do something like this:

library(dplyr)
library(tidyr)
df.1 %>% group_by(id) %>% 
  do(a = foo(.)) %>%  
  ungroup() %>% 
  unnest()

And still split at the end if you want the data frame separated, but the lapply is so easy (and doesn't require extra packages), whereas this seems more complicated for no gain.

Gregor Thomas
  • 104,719
  • 16
  • 140
  • 257
  • Thank you! This is so lovely! I should learn to write my own functions more :D Thanks you so much again! – user5813583 Jun 12 '17 at 23:37
  • it works perfect when I don't have a id column?! but since my id column is factor, max not meaningful for factor error come up?! but I reckon I can take it on from here :D Thanks!! – user5813583 Jun 12 '17 at 23:49
  • foo = function(df) { # identify column with the widest range nums = sapply(df, is.numeric) ranges = sapply(df[ , nums], max) - sapply(df[ , nums], min) widest = which.max(ranges) # see which direction it goes direction = diff(df[[widest]]) < 0 # add this as a column df$direction = c(direction, NA) return(df) } – user5813583 Jun 12 '17 at 23:52