13

This is a seemingly simple R question, but I don't see an exact answer here. I have a data frame (alldata) that looks like this:

Case     zip     market
1        44485   NA
2        44488   NA
3        43210   NA

There are over 3.5 million records.

Then, I have a second data frame, 'zipcodes'.

market    zip
1         44485
1         44486
1         44488
...       ... (100 zips in market 1)
2         43210
2         43211
...       ... (100 zips in market 2, etc.)

I want to find the correct value for alldata$market for each case based on alldata$zip matching the appropriate value in the zipcode data frame. I'm just looking for the right syntax, and assistance is much appreciated, as usual.

Kevin Wright
  • 2,077
  • 20
  • 24
Dino Fire
  • 399
  • 2
  • 5
  • 9

5 Answers5

14

Since you don't care about the market column in alldata, you can first strip it off using and merge the columns in alldata and zipcodes based on the zip column using merge:

merge(alldata[, c("Case", "zip")], zipcodes, by="zip")

The by parameter specifies the key criteria, so if you have a compound key, you could do something like by=c("zip", "otherfield").

Dan Garant
  • 723
  • 5
  • 12
  • Won't this introduce NAs for zip codes that are not in alldata? – Rodrigo Mar 01 '18 at 23:42
  • @Rodrigo By default, you'll get only rows with matches in both data frames. That can be controlled with the `all`, `all.x`, and `all.y` options – Dan Garant Mar 02 '18 at 13:13
9

Another option that worked for me and is very simple:

alldata$market<-with(zipcodes, market[match(alldata$zip, zip)])
juandelsur
  • 753
  • 7
  • 6
  • So simple yet so great! – vashts85 Nov 01 '17 at 22:48
  • 2
    It would be great if you could explain what's going on there. – Rodrigo Mar 01 '18 at 23:43
  • From R docs: "`match` returns a vector of the positions of (first) matches of its first argument in its second". So basically `match` returns a vector of right indices to access zipcodes$market to get the right market values for alldata$market. – Aelian Oct 17 '20 at 06:11
3

With such a large data set you may want the speed of an environment lookup. You can use the lookup function from the qdapTools package as follows:

library(qdapTools)
alldata$market <- lookup(alldata$zip, zipcodes[, 2:1])

Or

alldata$zip %l% zipcodes[, 2:1]
Tyler Rinker
  • 99,090
  • 56
  • 292
  • 477
  • Tyler, I tested this on a bit bigger data and the step `lookup(.)` proved time consuming (I stopped it after a while). I've it as a [**gist**](https://gist.github.com/arunsrinivasan/ee2d9ef43bdc02c32958). Am I doing it right? With a quick debug, the time consuming portion seems to be `sapply(.)` inside `recoder` function. – Arun May 08 '14 at 22:58
  • @Arun Thanks for the gist. I added some improvements in the [dev version](https://github.com/trinker/qdapTools) based on your findings that makes significant speed boosts. – Tyler Rinker May 09 '14 at 04:30
  • Just out of curiosity did you bench `data.table` to see how it compares. I have read `data.table` has made some nice improvements in lookups including character lookups. – Tyler Rinker May 09 '14 at 05:37
  • 1
    I've edited the gist with two ways. 1) A join (although it is really not needed here) and 2) using `match` and `:=`, would be nicer to benchmark on huge data though. HTH. – Arun May 09 '14 at 05:52
  • 1
    Yes, quite a few welcoming changes from 1.9.0+. This should be convincing: `require(data.table); set.seed(1L); x = sample(paste0("V", 1:1e6)); system.time(ans1 – Arun May 09 '14 at 06:01
2

Here's the dplyr way of doing it:

library(tidyverse)
alldata %>%
  select(-market) %>%
  left_join(zipcodes, by="zip")

which, on my machine, is roughly the same performance as lookup.

James Brusey
  • 171
  • 1
  • 8
0

The syntax of match is a bit clumsy. You might find the lookup package easier to use.

alldata <- data.frame(Case=1:3, zip=c(44485,44488,43210), market=c(NA,NA,NA))
zipcodes <- data.frame(market=c(1,1,1,2,2), zip=c(44485,44486,44488,43210,43211))
alldata$market <- lookup(alldata$zip, zipcodes$zip, zipcodes$market)
alldata
##   Case   zip market
## 1    1 44485      1
## 2    2 44488      1
## 3    3 43210      2
Kevin Wright
  • 2,077
  • 20
  • 24