2

I have two data sets extracted from a data set called babies2009 (three columns: count, name, gender).

One is girls2009, containing all the girls, and the other is boys2009. I want to find out what similar names exist between boys and girls.

I tried this:

common.names = (boys2009$name %in% girls2009$name)

When I try

babies2009[common.names, ] [1:10, ]

all I get is the girl names, not the common names.

I have confirmed that both data sets do contain boys and girls respectively by looking at the first 10 rows of each:

boys2009[1:10, ]
girls2009[1:10, ]

How else can I compare the two data sets and determine which values they share? Thanks.

Waldir Leoncio
akz
  • You will get much better answers if you make your question reproducible: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Ari B. Friedman Sep 18 '11 at 03:05

3 Answers

5

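common.names = (boys2009$name %in% girls2009$name) gives you a logical vector with one element per row of boys2009, i.e. of length length(boys2009$name). When you use it to index the much longer data.frame babies2009, as in babies2009[common.names, ][1:10, ], the TRUE/FALSE values are applied to rows by position (and recycled because the vector is shorter), not by name, so the rows you get back have nothing to do with the common names.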

Solution: use that logical vector on the proper data.frame!

boys2009 <- data.frame(name = c("Billy", "Bob"), data = runif(2), gender = "M", stringsAsFactors = FALSE)
girls2009 <- data.frame(name = c("Billy", "Mae", "Sue"), data = runif(3), gender = "F", stringsAsFactors = FALSE)
babies2009 <- rbind(boys2009, girls2009)

common.names <- (boys2009$name %in% girls2009$name)

> boys2009[common.names, ]$name
[1] "Billy"
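
If what you actually want is the matching rows of babies2009 itself (both the boy and the girl entries), one option is to build the logical index from babies2009$name, so that it has exactly one element per row of the data.frame being subset. A minimal sketch, assuming made-up toy data with the question's column names; shared, in.both and shared.rows are illustrative names only:

```r
# Toy data standing in for the real data sets (names and counts are made up).
boys2009 <- data.frame(name = c("Billy", "Bob"), count = c(10, 20),
                       gender = "M", stringsAsFactors = FALSE)
girls2009 <- data.frame(name = c("Billy", "Mae", "Sue"), count = c(3, 30, 40),
                        gender = "F", stringsAsFactors = FALSE)
babies2009 <- rbind(boys2009, girls2009)

# Names that occur in both subsets.
shared <- intersect(boys2009$name, girls2009$name)

# Build the index from babies2009 itself: one TRUE/FALSE per row of babies2009.
in.both <- babies2009$name %in% shared

# Rows of babies2009 (either gender) whose name is shared.
shared.rows <- babies2009[in.both, ]
```

Because in.both has the same length as nrow(babies2009), no recycling takes place and each row is kept or dropped based on its own name.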
Ari B. Friedman
2

Since you asked for similar names rather than exact matches, you should consider agrep, which does approximate (fuzzy) matching:

sapply(boys2009$name, agrep, girls2009$name, max.distance = 0.1)

You can adjust the max.distance argument to suit your needs.
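
By default agrep returns the positions of the matches in its second argument; passing value = TRUE returns the matched strings instead, which is usually easier to read. A rough sketch on made-up toy vectors (boy.names, girl.names and fuzzy are illustrative names, not part of the original data):

```r
# Toy vectors standing in for boys2009$name and girls2009$name (made up).
boy.names <- c("Billy", "Bob")
girl.names <- c("Billy", "Mae", "Sue")

# For each boy name, collect the girl names that agrep() treats as close enough.
# value = TRUE returns the matched strings rather than their positions.
fuzzy <- lapply(boy.names, function(p) {
  agrep(p, girl.names, max.distance = 0.1, value = TRUE)
})
names(fuzzy) <- boy.names
```

The result is a list with one element per boy name; names without a close match get an empty character vector. Note that agrep matches the pattern approximately against substrings of each candidate, so how loose the matching is depends heavily on max.distance.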

IRTFM
2

How about using set functions:

list(
    `only boys` = setdiff(boys2009$name, girls2009$name),
    `common` = intersect(boys2009$name, girls2009$name),
    `only girls` = setdiff(girls2009$name, boys2009$name)
)
Marek