Knowledge

Tuesday, October 10, 2006

Comparing 2 strings

The Java platform uses the Unicode Standard to define its characters. The Unicode Standard once defined characters as fixed-width, 16-bit values in the range U+0000 through U+FFFF. The U+ prefix signifies a valid Unicode character value as a hexadecimal number. The Java language conveniently adopted the fixed-width standard for the char type. Thus, a char value could represent any 16-bit Unicode character.

Most programmers are familiar with the length method. The following code counts the number of char values in a sample string. Notice that the sample String object contains a few simple characters and several characters defined with the Java language's \u notation. The \u notation defines a 16-bit char value as a hexadecimal number and is similar to the U+ notation used by the Unicode Standard.

private String testString = "abcd\u5B66\uD800\uDF30";
int charCount = testString.length();
System.out.printf("char count: %d\n", charCount);

The length method counts the number of char values in a String object. The sample code prints this:

char count: 7

When Unicode version 4.0 defined a significant number of new characters above U+FFFF, the 16-bit char type could no longer represent all characters. Starting with the Java 2 Platform, Standard Edition 5.0 (J2SE 5.0), the Java platform began to support the new Unicode characters as pairs of 16-bit char values called a surrogate pair. Two char units act as a surrogate representation of Unicode characters in the range U+10000 through U+10FFFF. Characters in this new range are called supplementary characters.

To find out how many Unicode character code points are in a string, use the codePointCount method:

private String testString = "abcd\u5B66\uD800\uDF30";
int charCount = testString.length();
int characterCount = testString.codePointCount(0, charCount);
System.out.printf("character count: %d\n", characterCount);

This example prints this:

character count: 6

The testString variable contains two interesting characters, which are a Japanese character meaning "learning" and a character named GOTHIC LETTER AHSA. The Japanese character has Unicode code point U+5B66, which has the same hexadecimal char value \u5B66. The Gothic letter's code point is U+10330. In UTF-16, the Gothic letter is the surrogate pair \uD800\uDF30. The pair represents a single Unicode code point, and so the character code point count of the entire string is 6 instead of 7.

0 Comments:

Post a Comment

<< Home