Simple lookup to insert values in an R data frame

Question

This is a seemingly simple R question, but I don't see an exact answer here. I have a data frame (alldata) that looks like this:

Case     zip     market
1        44485   NA
2        44488   NA
3        43210   NA

There are over 3.5 million records.

Then, I have a second data frame, 'zipcodes'.

market    zip
1         44485
1         44486
1         44488
...       ... (100 zips in market 1)
2         43210
2         43211
...       ... (100 zips in market 2, etc.)

I want to find the correct value for alldata$market for each case based on alldata$zip matching the appropriate value in the zipcode data frame. I'm just looking for the right syntax, and assistance is much appreciated, as usual.

you are looking for `merge`. search SO for `[r] merge` and you will find it — Ricardo Saporta, Jul 24 '13 at 20:38
@TylerRinker, I use `merge.data.table` on 400 Million rows often, and its quite fast. Good call on using `qdap` though — Ricardo Saporta, Jul 25 '13 at 03:59
Oh wow never worked with data that large. Thanks for the experience informed response. — Tyler Rinker, Jul 25 '13 at 04:10

score 14 · Answer 1 · answered Jul 24 '13 at 20:48

14

Since you don't care about the market column in alldata, you can first strip it off using and merge the columns in alldata and zipcodes based on the zip column using merge:

merge(alldata[, c("Case", "zip")], zipcodes, by="zip")

The by parameter specifies the key criteria, so if you have a compound key, you could do something like by=c("zip", "otherfield").

answered Jul 24 '13 at 20:48

Dan Garant

723
5
12

Won't this introduce NAs for zip codes that are not in alldata? – Rodrigo Mar 01 '18 at 23:42
@Rodrigo By default, you'll get only rows with matches in both data frames. That can be controlled with the `all`, `all.x`, and `all.y` options – Dan Garant Mar 02 '18 at 13:13

juandelsur · Answer 2 · 2018-02-14T14:26:18.403

9

Another option that worked for me and is very simple:

alldata$market<-with(zipcodes, market[match(alldata$zip, zip)])

edited Feb 14 '18 at 14:26

answered Jul 28 '17 at 13:56

juandelsur

753
7
6

So simple yet so great! – vashts85 Nov 01 '17 at 22:48
2

It would be great if you could explain what's going on there. – Rodrigo Mar 01 '18 at 23:43
From R docs: "`match` returns a vector of the positions of (first) matches of its first argument in its second". So basically `match` returns a vector of right indices to access zipcodes$market to get the right market values for alldata$market. – Aelian Oct 17 '20 at 06:11

Tyler Rinker · Answer 3 · 2014-05-08T20:47:41.853

3

With such a large data set you may want the speed of an environment lookup. You can use the lookup function from the qdapTools package as follows:

library(qdapTools)
alldata$market <- lookup(alldata$zip, zipcodes[, 2:1])

Or

alldata$zip %l% zipcodes[, 2:1]

edited May 08 '14 at 20:47

answered Jul 24 '13 at 22:13

Tyler Rinker

99,090
56
292
477

Tyler, I tested this on a bit bigger data and the step `lookup(.)` proved time consuming (I stopped it after a while). I've it as a [**gist**](https://gist.github.com/arunsrinivasan/ee2d9ef43bdc02c32958). Am I doing it right? With a quick debug, the time consuming portion seems to be `sapply(.)` inside `recoder` function. – Arun May 08 '14 at 22:58
@Arun Thanks for the gist. I added some improvements in the [dev version](https://github.com/trinker/qdapTools) based on your findings that makes significant speed boosts. – Tyler Rinker May 09 '14 at 04:30
Just out of curiosity did you bench `data.table` to see how it compares. I have read `data.table` has made some nice improvements in lookups including character lookups. – Tyler Rinker May 09 '14 at 05:37
1

I've edited the gist with two ways. 1) A join (although it is really not needed here) and 2) using `match` and `:=`, would be nicer to benchmark on huge data though. HTH. – Arun May 09 '14 at 05:52
1

Yes, quite a few welcoming changes from 1.9.0+. This should be convincing: `require(data.table); set.seed(1L); x = sample(paste0("V", 1:1e6)); system.time(ans1 – Arun May 09 '14 at 06:01

score 2 · Answer 4 · answered May 18 '17 at 10:14

2

Here's the dplyr way of doing it:

library(tidyverse)
alldata %>%
  select(-market) %>%
  left_join(zipcodes, by="zip")

which, on my machine, is roughly the same performance as lookup.

answered May 18 '17 at 10:14

James Brusey

171
1
8

score 0 · Answer 5 · answered Apr 14 '21 at 16:31

The syntax of match is a bit clumsy. You might find the lookup package easier to use.

alldata <- data.frame(Case=1:3, zip=c(44485,44488,43210), market=c(NA,NA,NA))
zipcodes <- data.frame(market=c(1,1,1,2,2), zip=c(44485,44486,44488,43210,43211))
alldata$market <- lookup(alldata$zip, zipcodes$zip, zipcodes$market)
alldata
##   Case   zip market
## 1    1 44485      1
## 2    2 44488      1
## 3    3 43210      2

Simple lookup to insert values in an R data frame

5 Answers5

Linked