Existing answers are either
- incorrect: they treat char as a separate character (code point), while it is actually a UTF-16 code unit that can be one half of a surrogate pair, or
- use libraries, which is not bad in itself but requires adding dependencies to your project, or
- use Java 8 Streams, which is perfectly valid but not always possible.
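As a minimal sketch of the first point, the snippet below counts the same string in UTF-16 code units and in code points. The class name CodeUnitsDemo and its two helper methods are hypothetical names used only for illustration; the calls inside them are standard java.lang methods available since Java 1.5.

```java
class CodeUnitsDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // One supplementary character (U+10428) written as a surrogate pair.
        String deseret = "\uD801\uDC28";
        System.out.println(codeUnits(deseret));  // 2
        System.out.println(codePoints(deseret)); // 1
        // Each char on its own is only half of the pair.
        System.out.println(Character.isHighSurrogate(deseret.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(deseret.charAt(1)));  // true
    }
}
```

Any code that reads charAt(0) and assumes it got a whole character only ever sees the high surrogate of such a string.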
Let's look at supplementary characters (each is encoded as a surrogate pair, i.e. two UTF-16 code units, or two Java chars) that have distinct upper- and lowercase variants:
IntStream.rangeClosed(0x01_0000, 0x10_FFFF)
    .filter(ch -> Character.toUpperCase(ch) != Character.toLowerCase(ch))
    .forEach(ch -> System.out.print(new String(new int[] { ch }, 0, 1)));
Many of them may render as 'tofu' (□) on your system, but most are valid characters from rare scripts, and some typefaces do support them.
For example, let's look at Deseret Small Letter Long I (𐐨), U+10428, "\uD801\uDC28":
System.out.println("U+" + Integer.toHexString(
"\uD801\uDC28".codePointAt(0)
)); // U+10428
System.out.println("U+" + Integer.toHexString(
Character.toTitleCase("\uD801\uDC28".codePointAt(0))
)); // U+10400 — ok! capitalized character is another code point
System.out.println("U+" + Integer.toHexString(new String(new char[] {
Character.toTitleCase("\uD801\uDC28".charAt(0)), "\uD801\uDC28".charAt(1)
}).codePointAt(0))); // U+10428 — oops! — cannot capitalize an unpaired surrogate
So a code point can be capitalized even in cases where a single char cannot be.
Considering this, let's write a correct (and Java 1.5 compatible!) capitalizer:
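// @Contract comes from org.jetbrains.annotations; it is optional and can be removed.
@Contract("null -> null")
public static CharSequence capitalize(CharSequence input) {
    int length;
    if (input == null || (length = input.length()) == 0) return input;
    return new StringBuilder(length)
        // title-case the first code point, not the first char
        .appendCodePoint(Character.toTitleCase(Character.codePointAt(input, 0)))
        // append the rest, skipping one or two chars depending on the first code point
        .append(input, Character.offsetByCodePoints(input, 0, 1), length);
}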
And check whether it works:
public static void main(String[] args) {
    // ASCII
    System.out.println(capitalize("whatever")); // w -> W
    // UTF-16, no surrogates
    System.out.println(capitalize("что-то")); // ч -> Ч
    // UTF-16 with surrogate pairs
    System.out.println(capitalize("\uD801\uDC28")); // 𐐨 -> 𐐀
}
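For comparison, here is a hedged sketch that puts a char-based capitalizer side by side with a code point based one that returns a String. The class CapitalizeComparison and both method names are hypothetical and are not taken from any of the other answers; the second method is only a String-returning variant of the idea above, using Character.charCount to skip the first code point.

```java
class CapitalizeComparison {
    // Naive approach: assumes the first char is a whole character.
    static String naiveCapitalize(String s) {
        if (s == null || s.length() == 0) return s;
        return Character.toTitleCase(s.charAt(0)) + s.substring(1);
    }

    // Code point aware approach: title-cases the first code point,
    // then copies the rest starting after its one or two chars.
    static String codePointCapitalize(String s) {
        if (s == null || s.length() == 0) return s;
        int first = s.codePointAt(0);
        return new StringBuilder(s.length())
                .appendCodePoint(Character.toTitleCase(first))
                .append(s, Character.charCount(first), s.length())
                .toString();
    }

    public static void main(String[] args) {
        String deseret = "\uD801\uDC28"; // U+10428
        // The naive version leaves the lone high surrogate unchanged.
        System.out.println(Integer.toHexString(
                naiveCapitalize(deseret).codePointAt(0)));     // 10428
        // The code point version maps to the capital letter.
        System.out.println(Integer.toHexString(
                codePointCapitalize(deseret).codePointAt(0))); // 10400
    }
}
```

For strings that contain only BMP characters the two methods agree; they diverge only when the first character is a supplementary one with a case mapping.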
See also: