1

Using a regular expression, how can I retrieve only the words from a string, ignoring everything else such as commas, numbers, and other symbols? This is what I tried:

val words = text.split("\b([-A-Za-z])+\b")

For example:

This is a nice day, my name is...

I want to get:

This, is, a, nice, day, my, name, is

while ignoring `,` and `...`.

ScalaBoy

3 Answers

2

Split the string on runs of characters that are neither letters nor hyphens:

val words = text.split("[^-A-Za-z]+")
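
One caveat worth checking with this approach: `split` returns an empty first element when the text starts with a separator. The sketch below shows that behaviour; the `splitWords` helper and its `nonEmpty` filter are illustrative additions, not part of the answer itself.

```scala
object SplitWords {
  // Hypothetical helper: split on runs of characters that are neither
  // letters nor hyphens, then drop the empty leading element that
  // appears when the text starts with a separator.
  def splitWords(text: String): Array[String] =
    text.split("[^-A-Za-z]+").filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    println(splitWords("This is a nice day, my name is...").mkString(", "))
    // This, is, a, nice, day, my, name, is

    // Without the filter, leading punctuation yields "", "hello", "world".
    println("...hello world".split("[^-A-Za-z]+").length) // 3
    println(splitWords("...hello world").length)          // 2
  }
}
```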
Toto
  • Could you please explain what the symbol `+` means? – ScalaBoy Sep 29 '18 at 09:25
  • @ScalaBoy: It means one or more occurrences of the preceding element (here, the character class). See https://stackoverflow.com/q/22937618/372239 and https://www.regular-expressions.info/ for more information. – Toto Sep 29 '18 at 09:28
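
To make the effect of `+` concrete, here is a small sketch comparing the same split with and without the quantifier. The object name and the sample string are made up for illustration.

```scala
object PlusQuantifier {
  // A comma followed by a space: two consecutive separator characters.
  val sample = "day, my"

  // Without `+`, each separator character splits on its own,
  // leaving an empty string between the comma and the space.
  val withoutPlus: Array[String] = sample.split("[^A-Za-z]")

  // With `+`, the whole run ", " is treated as a single separator.
  val withPlus: Array[String] = sample.split("[^A-Za-z]+")

  def main(args: Array[String]): Unit = {
    println(withoutPlus.mkString("[", "|", "]")) // [day||my]
    println(withPlus.mkString("[", "|", "]"))    // [day|my]
  }
}
```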
2

To extract all words including hyphenated words, you may use

"""\b[a-zA-Z]+(?:-[a-zA-Z]+)*\b""".r.findAllIn(s)
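
As a rough illustration of what the optional `(?:-[a-zA-Z]+)*` group buys you, the sketch below runs the pattern on a made-up sentence that contains hyphenated words and a free-standing hyphen, and compares it with a plain split that merely allows hyphens. The object name, the `words` helper, and the sample sentence are assumptions for the example.

```scala
object HyphenatedWords {
  // Letters, optionally followed by one or more "-letters" groups.
  private val wordPattern = """\b[a-zA-Z]+(?:-[a-zA-Z]+)*\b""".r

  // Hypothetical helper wrapping findAllIn.
  def words(s: String): List[String] = wordPattern.findAllIn(s).toList

  def main(args: Array[String]): Unit = {
    val s = "A well-known, state-of-the-art trick - really"

    // Hyphens are kept only when they join two runs of letters.
    println(words(s))
    // List(A, well-known, state-of-the-art, trick, really)

    // Splitting on "not a letter or hyphen" keeps the lone "-" as a token.
    println(s.split("[^-A-Za-z]+").toList)
    // List(A, well-known, state-of-the-art, trick, -, really)
  }
}
```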

To support all Unicode letters, use \p{L} instead of the [a-zA-Z] character class:

val s = "This is a nice day, my name is..."
val res = """\b\p{L}+(?:-\p{L}+)*\b""".r.findAllIn(s)
println(res.toList)
// => List(This, is, a, nice, day, my, name, is)

See the Scala demo.

Wiktor Stribiżew
0
val p = """[a-zA-Z]+""".r

In REPL:

scala> val text = "This is a nice day, my name is..."
text: String = This is a nice day, my name is...

scala> p.findAllIn(text).toArray
res24: Array[String] = Array(This, is, a, nice, day, my, name, is)

scala> val text = "This is a nice_day, my_name is..."
text: String = This is a nice_day, my_name is...

scala> p.findAllIn(text).toArray
res26: Array[String] = Array(This, is, a, nice, day, my, name, is)
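
The second REPL run shows that underscores act as separators with a letters-only class. For contrast, here is a small sketch of how `\w+` would behave on similar input, since `\w` also matches digits and underscores. The object name and the sample text (with `42` added) are illustrative.

```scala
object LettersVsWordChars {
  val text = "This is a nice_day, my_name is 42..."

  // Letters only: underscores and digits are not matched.
  val letters: List[String] = """[a-zA-Z]+""".r.findAllIn(text).toList

  // \w also matches digits and '_', so nice_day and 42 survive as tokens.
  val wordChars: List[String] = """\w+""".r.findAllIn(text).toList

  def main(args: Array[String]): Unit = {
    println(letters)   // List(This, is, a, nice, day, my, name, is)
    println(wordChars) // List(This, is, a, nice_day, my_name, is, 42)
  }
}
```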
RAGHHURAAMM