I am trying to join rows that share common text.

This is the code I am using:

df1 <- data.frame(freetext = c("open until monday night", "one more time to insert your coin"), numid = c(291,312))
df2 <- data.frame(freetext = c("open until monday night a day before", "one more time to insert your coin but I should mention"), id = c(2,1))
fuzzyjoin::stringdist_inner_join(df1, df2, by = 'freetext', max_dist = 10)

However I receive this output:

freetext.x numid      freetext.y id        
<0 rows> (or 0-length row.names)

What should I change to get the matching rows?

foc

1 Answer

None of the strings get matched by fuzzyjoin because your max_dist is too small. For instance, the distance between the strings "open until monday night" and "open until monday night a day before" is 13 (i.e. the number of single-character insertions, deletions, or substitutions needed to turn the first one into the second one). Setting max_dist = 13 gives you that match:

fuzzyjoin::stringdist_inner_join(df1, df2, by = 'freetext', max_dist = 13)

#                freetext.x numid                           freetext.y id
# 1 open until monday night   291 open until monday night a day before  2

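Increasing max_dist further will give you other matches as well. The second pair of strings differs by 21 characters (33 versus 54), so it only matches once max_dist is at least 21.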
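To see which distance each matched pair actually has, rather than guessing a threshold, the join can report it. The following is a minimal sketch, assuming the `distance_col` argument of `stringdist_inner_join`; the column name `dist` and the threshold of 21 are illustrative choices, not something from the question.

```r
library(fuzzyjoin)

df1 <- data.frame(
  freetext = c("open until monday night", "one more time to insert your coin"),
  numid = c(291, 312)
)
df2 <- data.frame(
  freetext = c("open until monday night a day before",
               "one more time to insert your coin but I should mention"),
  id = c(2, 1)
)

# distance_col adds a column holding the computed string distance
# for every pair that falls within max_dist
res <- stringdist_inner_join(df1, df2, by = "freetext",
                             max_dist = 21, distance_col = "dist")

# Inspect the ids of the matched rows together with their distance
res[, c("numid", "id", "dist")]
```

Looking at the `dist` column makes it easier to pick a max_dist that keeps the intended pairs and drops the unrelated ones.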

count orlok
  • Thank you. How do you calculate the max_dist for this and you know it is 13? – foc Jul 06 '20 at 09:20
  • I just counted the number of characters: the string `"open until monday night"` has 23 characters, and `"open until monday night a day before"` has 36 characters. The first string is the start of the second, so getting from one to the other means adding 13 characters. That is the distance, so `max_dist` has to be at least 13. See [this link](https://en.wikipedia.org/wiki/Levenshtein_distance) for more details. – count orlok Jul 06 '20 at 12:35
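Counting characters works here only because the shorter string is a prefix of the longer one. As a hedged sketch of a more general check, base R's `adist()` computes the (generalized) Levenshtein distance directly. Note that fuzzyjoin's default method is "osa", which also counts swaps of adjacent characters, so the two can differ on other inputs; the `"abc"`/`"xyz"` strings are made up for illustration.

```r
a <- "open until monday night"
b <- "open until monday night a day before"

# `a` is a prefix of `b`, so the length difference equals the edit distance
startsWith(b, a)          # TRUE
len_diff <- nchar(b) - nchar(a)
lev <- as.vector(adist(a, b))
len_diff                  # 13
lev                       # 13

# Strings of equal length can still be far apart:
# the length difference is 0, but every character must be substituted
same_len_diff <- nchar("abc") - nchar("xyz")
same_len_lev <- as.vector(adist("abc", "xyz"))
same_len_diff             # 0
same_len_lev              # 3
```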