15

I came across some regular expressions that contain [^\\p{L}]. I understand that this is using some form of a Unicode category, but when I checked the documentation, I found only the following "L" categories:

Lu  Uppercase letter    UPPERCASE_LETTER
Ll  Lowercase letter    LOWERCASE_LETTER
Lt  Titlecase letter    TITLECASE_LETTER
Lm  Modifier letter     MODIFIER_LETTER
Lo  Other letter        OTHER_LETTER

What is L in this context?

tchrist
uTubeFan

2 Answers

18

Taken from this link: http://www.regular-expressions.info/unicode.html

Check the Unicode Character Properties section.

\p{L} matches a single code point in the category "letter". If your input string is à encoded as U+0061 U+0300, it matches a without the accent. If the input is à encoded as U+00E0, it matches à with the accent. The reason is that both the code points U+0061 (a) and U+00E0 (à) are in the category "letter", while U+0300 is in the category "mark".
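The behavior described above can be checked with a minimal Java sketch. The class name `LetterCategoryDemo` and the helper `firstLetter` are hypothetical, introduced only for illustration; the only library calls used are `java.util.regex.Pattern` and `Matcher`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LetterCategoryDemo {
    // Returns the first substring matched by \p{L}, or null if nothing matches.
    static String firstLetter(String input) {
        Matcher m = Pattern.compile("\\p{L}").matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String decomposed = "a\u0300"; // U+0061 followed by U+0300 (combining grave accent)
        String precomposed = "\u00E0"; // U+00E0, a single code point

        // Only the base letter is matched; the combining mark is left out.
        System.out.println(firstLetter(decomposed).equals("a"));       // true
        // The precomposed character is itself a letter.
        System.out.println(firstLetter(precomposed).equals("\u00E0")); // true
        // U+0300 on its own is a mark, not a letter.
        System.out.println(Pattern.matches("\\p{L}", "\u0300"));       // false
        System.out.println(Pattern.matches("\\p{M}", "\u0300"));       // true
    }
}
```

If the accent should travel with its base letter, one option is to match `\p{L}\p{M}*` instead, which consumes any marks that follow the letter.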

Favonius
  • Thanks and +1 to you, too. Your comment on my comment/question to @Ned Batchelder's answer is appreciated. – uTubeFan May 11 '11 at 19:35
  • For an "official" reference to the "L" category, see here: http://unicode.org/reports/tr18/#General_Category_Property – CodeClimber Jun 15 '16 at 12:32
3

I don't see any explicit mention of it, but an example on this page indicates that \\p{L} means any letter:

Categories may be specified with the optional prefix Is: Both \p{L} and \p{IsL} denote the category of Unicode letters.
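To make the relationship to the five listed subcategories concrete, here is a small sketch, assuming Java's `java.util.regex.Pattern`. It checks that one sample character from each of Lu, Ll, Lt, Lm and Lo is matched by both `\p{L}` and `\p{IsL}`. The class name, the helper `isLetter`, and the sample characters are chosen purely for illustration.

```java
import java.util.regex.Pattern;

public class LetterSubcategories {
    private static final Pattern L = Pattern.compile("\\p{L}");
    private static final Pattern IS_L = Pattern.compile("\\p{IsL}");

    // True if the whole string is matched by both spellings of the letter category.
    static boolean isLetter(String s) {
        return L.matcher(s).matches() && IS_L.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",       // Lu: uppercase letter
            "a",       // Ll: lowercase letter
            "\u01C5",  // Lt: titlecase letter (U+01C5)
            "\u02B0",  // Lm: modifier letter (U+02B0)
            "\u05D0"   // Lo: other letter (U+05D0, Hebrew alef)
        };
        for (String s : samples) {
            System.out.println(isLetter(s)); // true for every sample
        }
        System.out.println(isLetter("7")); // false: digits are in category N
        System.out.println(isLetter("%")); // false: punctuation is in category P
    }
}
```

In other words, `L` acts as the union of the `Lu`, `Ll`, `Lt`, `Lm` and `Lo` categories listed in the question.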

Matthias
Ned Batchelder
  • That's what I thought, too, but then why does the following regex replace (with a space) **everything** that's **not** a letter? `String.replaceAll("[^\\p{L}]", " ")` – uTubeFan May 11 '11 at 19:32
  • @uTubeFan: Note that you are using *negation* in `^\\p{L}`. So `"Test akd ^^%!~+_)".replaceAll("[^\\p{L}]", " ")` outputs `Test akd` followed by one space for each replaced non-letter character. Conversely, `"Test akd ^^%!~+_)".replaceAll("[\\p{L}]", " ")` replaces every letter with a space and leaves `^^%!~+_)` intact at the end. – Favonius May 11 '11 at 19:42
  • @Favonius Thanks! So, can I conclude from this that `^%!~+_` are **not** considered letters? (I am basically looking to replace all non-letters (except apostrophe `'` as in `wasn't`) with a space, any suggestion?) – uTubeFan May 11 '11 at 19:47
  • @uTubeFan: Just saw your previous comment. Anyway you saved the work for me :) – Favonius May 11 '11 at 19:58
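One possible way to do the replacement asked about in the comments above (every non-letter except the apostrophe becomes a space) is to add `'` to the negated character class. This is a minimal sketch, not a definitive solution: the class and method names are hypothetical, and it treats only the ASCII apostrophe U+0027 as special, not the typographic apostrophe U+2019.

```java
public class NonLetterReplacer {
    // Replaces each character that is neither a Unicode letter nor an ASCII apostrophe with a space.
    static String keepLettersAndApostrophes(String input) {
        return input.replaceAll("[^\\p{L}']", " ");
    }

    // The pattern from the question: replaces each non-letter with a space.
    static String keepLetters(String input) {
        return input.replaceAll("[^\\p{L}]", " ");
    }

    public static void main(String[] args) {
        // Brackets make the trailing spaces visible in the output.
        System.out.println("[" + keepLetters("Test akd ^^%!~+_)") + "]");
        System.out.println("[" + keepLettersAndApostrophes("wasn't 100% sure!") + "]");
    }
}
```

Each replaced character yields exactly one space, so the output has the same length as the input. To collapse runs of spaces, a quantifier could be added, as in `[^\\p{L}']+`.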