Sorting Data in R by multiple Parameters

Question

I have the following Problem: I have a dataset including 3 different columns (there are more, but for the analysis they are not relevant). Here is an example Dataset (the origin dataset has much more observations):

Date               Company             Return
March              A                   0.03
March              A                   0.02
March              B                   0.01
April              B                   0.02       
April              A                   0.01
May                C                   0.02
June               B                   0.03

Now I want to find the biggest return for every month, but each company should show 3 times at max in the final output. I tried some for loops to create subsets for each month, but then I have no idea how I can extract the biggest return for each month while still have the restriction of each company showing up less than 3 times. Important is that when a company was returned three times in the output, the company with the next highest return has to be picked for the certain month. Every month hast to be returned one time.

I think you need to clarify what you're asking for. Can you provide your desired output for this dataset? — iod, Dec 06 '18 at 14:11
Possible duplicate of [Getting the top values by group](https://stackoverflow.com/questions/27766054/getting-the-top-values-by-group) — Ferroao, Dec 06 '18 at 14:43
The output should give me the biggest return for each month, with each company showing up max three times. When a company shows up three times, the company with the next highest return should be picked for the missing months. — Mexem, Dec 07 '18 at 10:33
Say that there are 12 months and 4 companies, c = 1,2,3,4, with returns R(c)=c for each month. So then the output should be 4,4,4,3,3,3,2,2,2,1,1,1? That is, in your problem the order of how we look for maxima clearly matters, and as I understand it has to be done chronologically rather than by somehow taking into account the return values, right? — Julius Vainora, Dec 08 '18 at 15:27

ebeneditos · Answer 1 · 2018-12-10T12:22:39.590

If you want to maximise the overall return, then you would need to move to an optimisation approach, with the formulation:

Objective function to maximise: Sum of returns
Monthly restriction: Each month must appear once in the solution
Company restriction: Each company can appear maximum 3 times in the solution

This could be done with library lpSolveAPI:

library(lpSolveAPI)

# Create data.table
dt <- data.frame(Date = c("March", "March", "March", "April", "April", "May", "June"),
                 Company = c("A", "A", "B", "B", "A", "C", "B"),
                 Return = c(0.03, 0.02, 0.01, 0.02, 0.01, 0.02, 0.03))

# Objective
obj <- -dt$Return

# Constraints
constraints <- list()

# Each month must appear once in the solution
for (month in unique(dt$Date)){
  constraints[[paste0('month', month)]] <- list(xt = as.numeric(dt$Date == month),
                                                type = "=",
                                                rhs = 1)
}

# Each company can appear maximum 3 times in the solution
for (com in unique(dt$Company)){
  constraints[[paste0('company', com)]] <- list(xt = as.numeric(dt$Company == com),
                                                type = "<=",
                                                rhs = 3)
}

# Build model
lprec <- make.lp(0, ncol = nrow(dt))
set.type(lprec, columns = seq(1,nrow(dt)), type = "binary")

set.objfn(lprec, obj = obj)

for (constraint in constraints){
  add.constraint(lprec, xt = constraint$xt, type = constraint$type, rhs = constraint$rhs)
}

# Compute Solution
solve(lprec)

# Visualize solution
solution <- dt[get.variables(lprec)==1,]

solution

#    Date Company Return
# 1 March       A   0.03
# 4 April       B   0.02
# 6   May       C   0.02
# 7  June       B   0.03

First, thank you for your response. The Code you suggested looks great, but as you pointed out, the overall return has to be maximized. You helped me figuring out why my code, which had a similar approach, did not worked correctly. If you have a quick recommendation for the optimization, i would be very grateful if you could help me there. Anyway, thank you for your effort! — Mexem, Dec 09 '18 at 19:10
I did not understand that from your question. Just edited with a possible optimization approach. You can refer to `lpSolveAPI` documentation for further details. — ebeneditos, Dec 09 '18 at 19:36
@mexem, let me know if this works for you and I'll award the bounty. — iod, Dec 09 '18 at 23:19

score 0 · Answer 2 · answered Dec 06 '18 at 14:05

like this?

> library(tidyverse)
> 
> dataset<-data.frame(Date=c("March","March","March","April","April","May","June"),Company=c("A","A","B","B","A","C","B"),Return=c(0.03,0.02,0.01,0.02,0.01,0.02,0.03))

> dataset %>% group_by(Date,Company) %>% summarise(max_ret=max(Return))
# A tibble: 6 x 3
# Groups:   Date [?]
Date  Company max_ret
<fct> <fct>     <dbl>
1 April A          0.01
2 April B          0.02
3 June  B          0.03
4 March A          0.03
5 March B          0.01
6 May   C          0.02

score 0 · Answer 3 · answered Dec 06 '18 at 14:07

0

First let's create a data set with more than 3 cases of any company, just so we can test it:

df1<-structure(list(Date = c("March", "March", "March", "April", "April", 
     "May", "June", "July"), Company = c("A", "A", "B", "B", "A", 
     "C", "B", "B"), Return = c(0.03, 0.02, 0.01, 0.02, 0.01, 0.02, 
     0.03, 0.02)), .Names = c("Date", "Company", "Return"), row.names = c(NA, 
     -8L), class = "data.frame")

Now, using dplyr:

df1 %>% group_by(Company) %>% 
  arrange(desc(Return)) %>% 
  filter(row_number() %in% 1:3) %>% 
  group_by(Date) %>% 
  filter(Return==max(Return))

First we keep only the top three results for each company, thus assuring that no matter what, our final output will not have more than three instances for each company. Then we group by Date and just keep the top Return for each month.

# A tibble: 5 x 3
# Groups:   Date [5]
  Date  Company Return
  <chr> <chr>    <dbl>
1 March A         0.03
2 June  B         0.03
3 April B         0.02
4 May   C         0.02
5 July  B         0.02

answered Dec 06 '18 at 14:07

iod

6,861
2
13
30

I now tested it and it actualy works and has the output i want, except that i want to have a return for each day. The function actually stops after it picked three companies, and doesnt return anything for months where this company has the highest return too. In this case it should pick the company with the next highest return. – Mexem Dec 06 '18 at 21:25
Sorry, I tried coming up with something that will do all the stuff you're trying to achieve at once, but I came up against a wall. I got as far as this: `df1 %>% group_by(Date,Company) %>% filter(Return==max(Return)) %>% group_by(Date) %>% mutate(rank=row_number(), Companyrank=paste0(rank,Company)) %>% select(-Company,-rank) %>% spread(Companyrank,Return)` which gives a table with the companies ranked in order for each date, but I'm not sure how to move forward in a generic manner. – iod Dec 07 '18 at 14:04
Thats the Point were i was clueless earlier too. My Solution was to Save the companys that come out from the first search/filter in a new data frame and substract the companies from the Origin data frame that show up three times in the new data frame. Then i applied the suggested filter again on this data frame. The problem there is that i have no idea how to go further from there, because i have to apply another filter that accounts the number of each company that i have picked before. I will come up with a more percise desciption and some code later. – Mexem Dec 07 '18 at 16:21

Sorting Data in R by multiple Parameters

3 Answers3