Comparing 2 strings
The Java platform uses the Unicode Standard to define its characters. The Unicode Standard once defined characters as fixed-width, 16-bit values in the range U+0000 through U+FFFF. The U+ prefix signifies a valid Unicode character value as a hexadecimal number. The Java language conveniently adopted the fixed-width standard for the char type. Thus, a char value could represent any 16-bit Unicode character.
Most programmers are familiar with the length method. The following code counts the number of char values in a sample string. Notice that the sample String object contains a few simple characters and several characters defined with the Java language's \u notation. The \u notation defines a 16-bit char value as a hexadecimal number and is similar to the U+ notation used by the Unicode Standard.
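// Four ASCII letters, one CJK character, and one surrogate pair.
String testString = "abcd\u5B66\uD800\uDF30";
// length() counts 16-bit char values, not Unicode characters.
int charCount = testString.length();
System.out.printf("char count: %d\n", charCount);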
The length method counts the number of char values in a String object. The sample code prints this:
char count: 7
When Unicode version 4.0 defined a significant number of new characters above U+FFFF, the 16-bit char type could no longer represent all characters. Starting with the Java 2 Platform, Standard Edition 5.0 (J2SE 5.0), the Java platform began to support the new Unicode characters as pairs of 16-bit char values called surrogate pairs. The two char units of a pair act as a surrogate representation of a single Unicode character in the range U+10000 through U+10FFFF. Characters in this new range are called supplementary characters.
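The mapping from a supplementary character to its surrogate pair can be sketched as follows. This is a minimal illustration, not part of the original sample: the class name SurrogatePairDemo and the helper toSurrogates are hypothetical, and the arithmetic follows the UTF-16 encoding rules. The result is compared against the standard Character.toChars method.

```java
public class SurrogatePairDemo {
    // Hypothetical helper: split a supplementary code point
    // (U+10000 through U+10FFFF) into its UTF-16 surrogate pair.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // lower 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int gothicAhsa = 0x10330; // GOTHIC LETTER AHSA

        char[] manual = toSurrogates(gothicAhsa);
        char[] library = Character.toChars(gothicAhsa);

        System.out.printf("manual:  %04X %04X%n", (int) manual[0], (int) manual[1]);
        System.out.printf("library: %04X %04X%n", (int) library[0], (int) library[1]);

        // The standard API can also test and recombine the pair.
        System.out.println(Character.isSupplementaryCodePoint(gothicAhsa));
        System.out.println(Character.isSurrogatePair(library[0], library[1]));
        System.out.printf("U+%X%n", Character.toCodePoint(library[0], library[1]));
    }
}
```

In everyday code, Character.toChars and Character.toCodePoint are usually preferable to hand-written bit arithmetic; the manual helper is shown only to make the encoding visible.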
To find out how many Unicode character code points are in a string, use the codePointCount method:
String testString = "abcd\u5B66\uD800\uDF30";
int charCount = testString.length();
// Count code points over the whole string: char indices 0 (inclusive) to charCount (exclusive).
int characterCount = testString.codePointCount(0, charCount);
System.out.printf("character count: %d\n", characterCount);
This example prints this:
character count: 6
The testString variable contains two interesting characters, which are a Japanese character meaning "learning" and a character named GOTHIC LETTER AHSA. The Japanese character has Unicode code point U+5B66, which has the same hexadecimal char value \u5B66. The Gothic letter's code point is U+10330. In UTF-16, the Gothic letter is the surrogate pair \uD800\uDF30. The pair represents a single Unicode code point, and so the character code point count of the entire string is 6 instead of 7.
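To see where the difference between 7 and 6 comes from, the string can be walked one code point at a time. The following is a minimal sketch, assuming a hypothetical class CodePointWalk and helper countCodePoints; it uses only the standard String.codePointAt and Character.charCount methods.

```java
public class CodePointWalk {
    // Hypothetical helper: count code points by advancing one or two
    // char values at a time, depending on the code point at each index.
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String testString = "abcd\u5B66\uD800\uDF30";

        // Print each code point with the number of char values it occupies.
        for (int i = 0; i < testString.length(); ) {
            int codePoint = testString.codePointAt(i);
            int width = Character.charCount(codePoint);
            System.out.printf("index %d: U+%04X uses %d char value(s)%n",
                    i, codePoint, width);
            i += width;
        }

        System.out.printf("character count: %d%n", countCodePoints(testString));
    }
}
```

The loop reports U+10330 at index 5 as occupying two char values, which accounts for the one-unit gap between length and codePointCount.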