8

First of all I am sorry if my formatting is bad, this is my first time posting, (also new to programming & R)

I am trying to merge two data frames together on string variables. I am merging university names, which might not match up perfectly, so I was hoping to merge using a fuzzy or approximate string matching function. I was happy when I found the ‘fuzzyjoin’ package.

from cranR: stringdist_join: Join two tables based on fuzzy string matching of their columns

stringdist_join(x, y, by = NULL, max_dist = 2, method = c("osa", "lv",
  "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw","soundex"), mode = "inner", ignore_case = FALSE, distance_col = NULL, ...)

my code:

stringdist_left_join(new, institutions, by = c("tm_9_undergradu" = "Institution.Name"))

Error:

Error in dists[include] <- stringdist::stringdist(v1[include], v2[include],  : 
NAs are not allowed in subscripted assignments

I know that there are some NA's in these columns, but I am not sure how I could remove them as I need them there as well. I know it other join & merge functions the NA's will simply be ignored. Does anyone know a way to get around this error for this package or to do an approximate join on strings another way. Thank you for your help.

Arthur Yip
  • 4,223
  • 21
  • 42
Brian
  • 93
  • 5
  • 1
    This would be easier for people to try to answer if you included the data or some of it, and what you expect the output to be. You can get code to create the dataframe by using the `dput` command. For example `dput(new)` and `dput(institutions)`. Without seeing the data my suggestion would be to limit the data to cases where the join variable is not NA. – Kerry Jackson Nov 01 '18 at 21:13
  • Thank you for your help Kerry, I was able to just remove the the rows with NA's in that column and the rejoin them after. I was hoping for a way to do this within the function but appreciate the help. – Brian Nov 05 '18 at 14:51
  • 1
    @Brian did you ever find a better solution? This is my exact question, but I have multiple columns I'm joining. I keep getting this error. – epi_n00b Mar 06 '19 at 19:37
  • @epi_n00b I ended up having to remove the rows with NA's beforehand and then binding them back on after the join. I couldn't find any way to make the function ignore them. – Brian Mar 08 '19 at 13:27

2 Answers2

4

This answer worked for me and is from GitHub

Step 1: figure out which Df has the NAs

`which(is.na(df1))
 which(is.na(df2))`

Step 2: replace NAs with something else. df1[is.na(df1)] <- "empty_string"

Step 3: run the join (the code I was working with when I got the error)

`test1 <- msa_table %>%
   as_tibble() %>% 
   unlist() %>%
   mutate(msa = sub("\\(.*)","", as.character(msa)) %>% 
   stringdist_full_join(msa_table, df1, by = 'msa', max_dist = 2)` 

The result for me was not having the same error, but still having NAs in my tables.

Hope this helps! Also, to be clear: this solution came from Anton Prokopyev '@prokopyev' on GitHub.

Luke Holcomb
  • 157
  • 7
2

Try

`test1 <- msa_table %>%
    as_tibble() %>% 
    unlist() %>%
    mutate(msa = stringr::str_squish(msa)) %>% 
    stringdist_full_join(msa_table, df1, by = 'msa', max_dist = 2)`