5

I have a data frame that has the median salary of every US zip code as well as the all fast food restaurants located within that zip code. Here is a sample of that data frame:

Row_NUM ZIP   MEDIAN RESTAURANT
26800   1001  56663  McDonald's
33161   1007  79076  McDonald's
23706   1008  63980  McDonald's
23709   1008  63980  McDonald's
30007   1008  63980  Taco Bell
30008   1008  63980  McDonald's
30009   1011  63476  McDonald's
24429   1013  36578  McDonald's
15323   1020  50058  KFC
29196   1020  50058  McDonald's
33127   1020  50058  McDonald's
39362   1020  50058  Wendy's
44914   1020  50058  Taco Bell
2542    1027  58573  Burger King 
35242   1027  58573  McDonald's

I want to do two things. First, I want to create a new data frame that has only unique zip codes, the median salary of that zip code, and the total number of restaurants in that zip code. So, for this data frame sample:

ZIP    MEDIAN  TOTAL_RESTAURANTS
1001   56663   1
1007   79076   1
1008   63980   4

Second, I want to create a data frame that has only the unique zip code and a column with the total number of each type of fast food restaurant. So for this data frame sample:

ZIP    MEDIAN  TOTAL_MCDONALDS  TOTAL_TACOBELL  TOTAL_KFC
1001   56663   1                0               0
1007   79076   1                0               0
1008   63980   3                1               0

I thought the below code would work, but it only gives me the zip code and total restaurant, and I am not sure how to modify it to include the other three columns.

df <- ddply(df,~ZIP, summarise,TOTAL_RESTAURANTS=length(RESTAURANT))

Any help would be greatly appreciated.

Edit: Here is what data types I am working with in my data frame.

    str(df)
    data.frame':    50002 obs. of  3 variables:
 $ ZIP       : int  44126 24014 77011 2190 48509 21061 43213 70130 31907 19422 ...
 $ MEDIAN    : int  54496 50175 27113 74205 50895 62408 36734 47591 38710 103683 ...
 $ RESTAURANT: Factor w/ 10 levels "McDonald's","Burger King",..: 2 2 2 2 2 2 2 2 2 2 ...

Here is my session info:

    R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] graphics  grDevices utils     datasets  grid      stats     methods   base     

other attached packages:
 [1] lubridate_1.3.3 extrafont_0.17  jsonlite_0.9.16 dplyr_0.4.2     tidyr_0.2.0     tableplot_0.3-5 reshape2_1.4.1  RCurl_1.95-4.7 
 [9] bitops_1.0-6    gplots_2.17.0   ggthemes_2.2.1  ggplot2_1.0.1  

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.0        Rttf2pt1_1.3.3     magrittr_1.5       MASS_7.3-43        munsell_0.4.2      colorspace_1.2-6   R6_2.1.1          
 [8] stringr_1.0.0      plyr_1.8.3         caTools_1.17.1     tools_3.2.2        parallel_3.2.2     gtable_0.1.2       KernSmooth_2.23-15
[15] DBI_0.3.1          extrafontdb_1.0    gtools_3.5.0       lazyeval_0.1.10    assertthat_0.1     digest_0.6.8       memoise_0.2.1     
[22] labeling_0.3       gdata_2.17.0       stringi_0.5-5      scales_0.3.0       proto_0.3-10 
Spencer
  • 131
  • 8

1 Answers1

7

Uisng dplyr and reshape2, this would be one way to go. Since you want to summarise the data by ZIP, you can use the variable to group the data. One thing I am not clear is whether MEDIAN values are identical for each ZIP or not. Here, I assumed that you may have different values. Hence, I used median(). Using n(), you can find out how many fast food shops exist.

summarize(group_by(mydf, ZIP), mid = median(MEDIAN), total = n())

#If you have an identical MEDIAN for each ZIP, you could do;
summarize(group_by(mydf, ZIP), mid = first(MEDIAN), total = n())

#    ZIP   mid total
#  (int) (dbl) (int)
#1  1001 56663     1
#2  1007 79076     1
#3  1008 63980     4
#4  1011 63476     1
#5  1013 36578     1
#6  1020 50058     5
#7  1027 58573     2

For the second part, you can use dcast(). You want to see how many fast food stores exist by the types of fast food store. By the combination of ZIP and MEDIAN, you ask R to check how many shops (RESTAURANT) exist.

dcast(mydf, ZIP + MEDIAN ~ RESTAURANT, length, value.var = "RESTAURANT")

#   ZIP MEDIAN BurgerKing KFC McDonald's TacoBell Wendy's
#1 1001  56663          0   0          1        0       0
#2 1007  79076          0   0          1        0       0
#3 1008  63980          0   0          3        1       0
#4 1011  63476          0   0          1        0       0
#5 1013  36578          0   0          1        0       0
#6 1020  50058          0   1          2        1       1
#7 1027  58573          1   0          1        0       0

If you use data.table, you could do the following.

library(data.table)
setDT(mydf)[, list(mid = first(MEDIAN), total = .N), by = ZIP][]
# If you calculate median
setDT(mydf)[, list(mid = as.double(median(MEDIAN)), total = .N), by = ZIP][]

dcast(setDT(mydf), ZIP + MEDIAN ~ RESTAURANT, fun = length, value.var = "RESTAURANT")

DATA

mydf <-structure(list(Row_NUM = c(26800L, 33161L, 23706L, 23709L, 30007L, 
30008L, 30009L, 24429L, 15323L, 29196L, 33127L, 39362L, 44914L, 
2542L, 35242L), ZIP = c(1001L, 1007L, 1008L, 1008L, 1008L, 1008L, 
1011L, 1013L, 1020L, 1020L, 1020L, 1020L, 1020L, 1027L, 1027L
), MEDIAN = c(56663L, 79076L, 63980L, 63980L, 63980L, 63980L, 
63476L, 36578L, 50058L, 50058L, 50058L, 50058L, 50058L, 58573L, 
58573L), RESTAURANT = structure(c(3L, 3L, 3L, 3L, 4L, 3L, 3L, 
3L, 2L, 3L, 3L, 5L, 4L, 1L, 3L), .Label = c("BurgerKing", "KFC", 
"McDonald's", "TacoBell", "Wendy's"), class = "factor")), .Names = c("Row_NUM", 
"ZIP", "MEDIAN", "RESTAURANT"), class = "data.frame", row.names = c(NA, 
-15L))
jazzurro
  • 21,610
  • 35
  • 58
  • 70
  • 2
    @Spencer You may want to read [this question](http://stackoverflow.com/questions/22801153/dplyr-error-in-n-function-should-not-be-called-directly). You want to make sure that you upload plyr first, then dplyr. In your case, this may not be the case. – jazzurro Oct 15 '15 at 00:28
  • So when I run the summarize code it gives me the following error: `Error in n() : This function should not be called directly` – Spencer Oct 15 '15 at 00:30
  • @jazzurro Ok so following the link you posted solved that error, but produced a new one: `Error: loss of precision when attempting to convert a numeric to an integer` I also edited the post to show what data types I am dealing with. – Spencer Oct 15 '15 at 00:44
  • @Spencer `mydf` contains both ZIP and MEDIAN as integer, which is probably identical to what you have. I cannot reproduce the error message on my side. – jazzurro Oct 15 '15 at 00:51
  • @Spencer Time is up on my side. I gotta go now. If you are still in trouble, I am happy to help you out later. – jazzurro Oct 15 '15 at 01:00
  • Ya I saw that, which is why this new error does not really help me much. I added my session info to the original question. I will say that if I take out the mid = median(MEDIAN) portion, the code runs just fine (but doesn't include the median salary obviously) – Spencer Oct 15 '15 at 01:00
  • 1
    @Jazzurro The mid = first(MEDIAN) option that you provided works. Thank you so much!! – Spencer Oct 15 '15 at 01:09
  • @Spencer Glad to hear that! – jazzurro Oct 15 '15 at 05:42