
I have a List[String] with unicode characters, e.g.

val languages = List("Deutsch","english","español")
val results = languages.filter(_.contains("espan"))
results: List[String] = List()

but I want it to find List(español). Is there a good/fast way to implement this for characters like the German ä,ö,ü as well?

elmalto
  • Implement what, exactly? Partial matching? Translating some Unicode characters to the English alphabet? When should `List("español")` be returned, and when should it not be returned? – Michael Zajac Nov 06 '14 at 02:15
  • I want it returned on `espan` as well as on `españ` – elmalto Nov 06 '14 at 02:21
  • Have a look at this: http://stackoverflow.com/questions/1008802/converting-symbols-accent-letters-to-english-alphabet – dhg Nov 06 '14 at 03:48
  • possible duplicate of [How do I remove diacritics (accents) from a string in .NET?](http://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net). There are Java solutions in the answers that can be used. – The Archetypal Paul Nov 06 '14 at 07:40

1 Answer


Like this:

scala> import java.text.Normalizer
import java.text.Normalizer

scala>  def removeDiacritics(in: String) : String = {
     |     // Decompose accented characters (NFD), then remove the combining marks
     |     Normalizer.normalize(in, Normalizer.Form.NFD).replaceAll("\\p{M}", "")
     |   }
removeDiacritics: (in: String)String

scala>  val languages = List("Deutsch","english","español")
languages: List[String] = List(Deutsch, english, español)

scala>   val results = languages.map(removeDiacritics).filter(_.contains("espan"))
results: List[String] = List(espanol)


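The solution here provides a "removeDiacritics" function that you can map over the list before you do the contains("espan"). The key is that NFD normalization splits each accented character into its base letter followed by one or more separate combining marks, and the pattern \p{M} matches exactly those combining marks (the Unicode "Mark" category), so replaceAll deletes them and leaves the base letters. The same works for the German ä, ö and ü, which become a, o and u.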
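To see the decomposition itself, here is a small sketch that prints the code points before and after normalization. The object name NfdDemo and its helpers are made up for illustration, and it assumes Java 8 or later for String.codePoints().

```scala
import java.text.Normalizer

// Hypothetical demo object, not part of any library.
object NfdDemo {
  // Render each code point of a string as a U+XXXX label.
  def codePoints(s: String): List[String] =
    s.codePoints().toArray.toList.map(cp => f"U+$cp%04X")

  // Canonical decomposition: base letter followed by combining marks.
  def decompose(s: String): String =
    Normalizer.normalize(s, Normalizer.Form.NFD)
}
```

For the precomposed ñ ("\u00F1"), NfdDemo.codePoints gives List(U+00F1), while NfdDemo.codePoints(NfdDemo.decompose("\u00F1")) gives List(U+006E, U+0303): a plain n followed by a combining tilde, which is the part \p{M} removes.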

One side effect of this is that the string without the diacritics is returned. You might not want that but I'll leave it as an exercise to you to return the original now that you can do the comparison without the diacritics.
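One way to keep the original strings, sketched here under the assumption that you only want to ignore diacritics (not case), is to strip them inside the filter predicate instead of mapping first. Stripping the query as well lets both "espan" and "españ" match. The names DiacriticSearch and filterIgnoringDiacritics are hypothetical.

```scala
import java.text.Normalizer

// Hypothetical wrapper object for illustration.
object DiacriticSearch {
  // Decompose to NFD, then drop the combining marks matched by \p{M}.
  def removeDiacritics(in: String): String =
    Normalizer.normalize(in, Normalizer.Form.NFD).replaceAll("\\p{M}", "")

  // Compare diacritic-free forms, but return the untouched originals.
  def filterIgnoringDiacritics(items: List[String], query: String): List[String] = {
    val foldedQuery = removeDiacritics(query) // fold the query once, up front
    items.filter(item => removeDiacritics(item).contains(foldedQuery))
  }
}
```

With languages defined as above, DiacriticSearch.filterIgnoringDiacritics(languages, "espan") and DiacriticSearch.filterIgnoringDiacritics(languages, "españ") both return List(español). The match is still case-sensitive, so "Espan" would not match unless you also lowercase both sides.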

Reid Spencer