Grapheme
Description
A <Grapheme> in VAST represents a user-perceived character and is the basic unit of a VAST <UnicodeString>.
Few programming languages with Unicode support treat a character as the written expression a user sees on screen rather than as a digital code point. The visual expression is called a glyph; the digital expression of this concept is called a "grapheme cluster".
In VAST, we call this a <Grapheme>. The <Grapheme> is logically composed of one or more <UnicodeScalar>s. It is identified using the extended grapheme cluster boundary algorithms from Text Segmentation in the Unicode Standard.
The VAST <Grapheme> abstracts the details of how to group enough Unicode scalars together to form what we would think of as a character on the screen. It also abstracts the details of normalization. Normalization matters because Unicode allows multiple binary representations of what is really the same string or character. If this is left unhandled, operations such as comparison and hashing give incorrect results.
However, <Grapheme> handles these details transparently. It detects and ensures a common normalized form for various operations where these differences would matter, so the user can focus on programming, not worrying about what normalization form a string is in.
<Grapheme> is the best Unicode analog to the standard Smalltalk <Character> class. A ü always maps to one <Grapheme>, even though, as described, it may be logically composed of 1 or 2 Unicode scalars. The original form is maintained for various reasons; a <Grapheme>'s internals are not implicitly converted from one encoded normalization form to another.
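The following is a minimal sketch of the behavior described above, using only selectors that appear elsewhere in this document (#value:, #unicodeScalars, #compareTo:, #hash). The expected results are inferred from the description and have not been verified against a running image.

```smalltalk
| precomposed decomposed |
"LATIN SMALL LETTER U WITH DIAERESIS as a single scalar"
precomposed := Grapheme value: 16rFC.
"LATIN SMALL LETTER U followed by COMBINING DIAERESIS"
decomposed := Grapheme value: #(16r75 16r308).

"Each grapheme is expected to keep the scalars it was created with"
self assert: [precomposed unicodeScalars contents size = 1].
self assert: [decomposed unicodeScalars contents size = 2].

"Comparison and hashing are expected to work on a common normalized form"
self assert: [precomposed = decomposed].
self assert: [(precomposed compareTo: decomposed) = 0].
self assert: [precomposed hash = decomposed hash].
```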
A <Grapheme> is also able to describe many of the same properties from the Unicode Character Database that a <UnicodeScalar> can.
Class State
• asciiGraphemes: <Array> of <Grapheme> objects. The first 128 code points in unicode exactly match ASCII.
Because of their frequency of use, the virtual machine will refer to objects in this grapheme cache to reduce object allocation and increase performance.
• crlf: <Grapheme>. crlf is the only grapheme in the ASCII range that is composed of 2 code points (Cr Lf) instead of 1. It is cached in its own variable slot.
Creation
A <Grapheme> is typically created automatically while cursoring through the #graphemes view of a <UnicodeString>, <String> or <ByteArray> object. It is also created by the normal iteration methods of a <UnicodeString>, since its elements are <Grapheme>s.
| firstGrapheme |
firstGrapheme := 'abc' asUnicodeString graphemes next.
self assert: [firstGrapheme = $a].
However, you can manually create a <Grapheme> by using the APIs from the Creation categories on the class side. Additionally, there are #asGrapheme extension methods provided in the system. Below are some examples of both.
"From Integer, which is interpreted as the unicode code point value"
self assert: [97 asGrapheme = (Grapheme value: 97)].
"From Character, which performs any necessary code page conversion
if the value is > 7-bit ascii range"
self assert: [97 asCharacter asGrapheme = (Grapheme value: 97)].
"From UTF-8 bytes"
self assert: [(Grapheme utf8: #[97]) = $a].
"From UTF-16 bytes (platform/little/big-endian)"
self assert: [(Grapheme utf16LE: #[97 0]) = (Grapheme utf16BE: #[0 97])].
self assert: [(Grapheme utf16LE: #[97 0]) = $a].
"From UTF-32 bytes (platform/little/big-endian)"
self assert: [(Grapheme utf32LE: #[97 0 0 0]) = (Grapheme utf32BE: #[0 0 0 97])].
self assert: [(Grapheme utf32LE: #[97 0 0 0]) = $a].
"The #value: API can accept an <Integer> or a <Character>"
self assert: [(Grapheme value: 97) = $a].
self assert: [(Grapheme value: $a) = $a].
"The #value: API can also accept a <String> or <UnicodeString> describing the grapheme in escaped syntax"
"See Grapheme class>>value: method comments for a complete list of escapes"
self assert: [(Grapheme value: '\r\n') = Grapheme crlf].
self assert: [
(Grapheme value: '\u{1F62E}\u{200D}\u{1F4A8}') name = 'FACE WITH OPEN MOUTH,ZERO WIDTH JOINER,DASH SYMBOL'].
"Factory methods for commonly used graphemes"
self assert: [Grapheme cr = Character cr].
self assert: [Grapheme lf = Character lf].
self assert: [Grapheme crlf = String crlf graphemes first].
"Special replacement character, which is used anywhere unicode content must be repaired"
self assert: [(UnicodeString utf8: #[255] repair: true) graphemes first = Grapheme replacementCharacter]
Properties
A <Grapheme> has many properties that are defined by the Unicode Standard. These can be found in the Properties category on the instance side.
• #isAscii - Boolean indicating if the grapheme is in ASCII range
• #isAsciiAlphabetic - Boolean indicating if the grapheme is an ASCII alphabetic char
• #isAsciiAlphaNumeric - Boolean indicating if the grapheme is an ASCII alphabetic or numeric char
• #isAsciiControl - Boolean indicating if the grapheme is an ASCII control char
• #isAsciiDigit - Boolean indicating if the grapheme is an ASCII digit
• #isAsciiGraphic - Boolean indicating if the grapheme is an ASCII graphic character
• #isAsciiHexDigit - Boolean indicating if the grapheme is an ASCII hex digit
• #isAsciiLowercase - Boolean indicating if the grapheme is an ASCII lowercase char
• #isAsciiPunctuation - Boolean indicating if the grapheme is an ASCII punctuation char
• #isAsciiUppercase - Boolean indicating if the grapheme is an ASCII uppercase char
• #isAsciiWhitespace - Boolean indicating if the grapheme is an ASCII whitespace char
• #isCased - Boolean indicating whether the grapheme is either lowercase, uppercase, or titlecase.
• #isLetter - Boolean indicating whether the grapheme is a letter.
• #isLowercase - Boolean indicating if the grapheme is considered lowercase
• #isNewline - Boolean indicating if the grapheme represents a newline
• #isNumeric - Boolean indicating whether the grapheme has the general category for numbers
• #isUppercase - Boolean indicating if the grapheme is considered uppercase
• #isWhitespace - Boolean indicating if the grapheme represents whitespace, including newlines
• #name - Human readable name of the grapheme, a concatenation of all its unicode scalar names
• #names - Collection of the grapheme's unicode scalar names
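As a small illustration of a few of these properties on a multi-scalar grapheme, consider the sketch below. It assumes that #name joins scalar names with a comma (the format shown in the Creation examples) and that #names answers a collection responding to #size; neither has been verified against a running image.

```smalltalk
| g |
"LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT"
g := Grapheme value: #(16r65 16r301).

self assert: [g name = 'LATIN SMALL LETTER E,COMBINING ACUTE ACCENT'].
self assert: [g names size = 2].
self assert: [g isLowercase].
self assert: [g isAscii not].
```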
Views
The following views are available for a <Grapheme>. These can be found in the Views category on the instance side.
• #unicodeScalars - Unicode scalar view of the grapheme
• #utf8 - UTF-8 encoded view of the grapheme
• #utf16 - UTF-16 platform-endian encoded view of the grapheme
• #utf16LE - UTF-16 little-endian encoded view of the grapheme
• #utf16BE - UTF-16 big-endian encoded view of the grapheme
• #utf32 - UTF-32 encoded view of the grapheme
• #utf32LE - UTF-32 little-endian encoded view of the grapheme
• #utf32BE - UTF-32 big-endian encoded view of the grapheme
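The sketch below shows what several of these views are expected to contain for a single non-ASCII grapheme. It assumes the #utf8 view answers #asByteArray in the same way the endian-specific UTF-16 and UTF-32 views do in the Conversion examples.

```smalltalk
| g |
"LATIN SMALL LETTER E WITH ACUTE, code point 16rE9 (233)"
g := 16rE9 asGrapheme.

self assert: [g unicodeScalars first = 16rE9 asUnicodeScalar].
"UTF-8 needs two bytes for this code point"
self assert: [g utf8 asByteArray = #[195 169]].
"UTF-16 and UTF-32 differ only in byte order and width"
self assert: [g utf16LE asByteArray = #[233 0]].
self assert: [g utf16BE asByteArray = #[0 233]].
self assert: [g utf32LE asByteArray = #[233 0 0 0]].
self assert: [g utf32BE asByteArray = #[0 0 0 233]].
```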
Equality/Comparison
A <Grapheme> should be compared to other objects using equality, not identity. The first 128 graphemes in Unicode (i.e., ASCII) are cached in the asciiGraphemes class instance variable, and the crlf grapheme is cached in the crlf class instance variable. Because of this, Grapheme crlf or any grapheme in the range [0, 127] will compare by identity, but code should not rely on this property.
"Yes, this is true here..."
self assert: [97 asGrapheme == 97 asGrapheme].
"...but don't ever count on it being true everywhere"
self assert: [300 asGrapheme ~~ 300 asGrapheme].
"Equality/Compare"
self assert: [97 asGrapheme = 97 asGrapheme].
self assert: [97 asGrapheme < 98 asGrapheme].
self assert: [97 asGrapheme <= 97 asGrapheme].
self assert: [98 asGrapheme >= 97 asGrapheme].
self assert: [98 asGrapheme > 97 asGrapheme].
"Special compare method for [-1, 0, 1]"
self assert: [(97 asGrapheme compareTo: 98 asGrapheme) = -1].
self assert: [(97 asGrapheme compareTo: 97 asGrapheme) = 0].
self assert: [(98 asGrapheme compareTo: 97 asGrapheme) = 1].
Conversion
Unicode Component
A <Grapheme> can convert itself to a <UnicodeString>.
self assert: [Grapheme cr asUnicodeString = UnicodeString cr].
UTF Encoding
A <Grapheme> can convert itself to any of the UTF encodings. Platform-endian accessors are provided in the Conversion category on the instance side, but any endian encoding can be accomplished with views.
"Platform-endian"
self assert: [Grapheme cr asUtf8 asByteArray = #[13]].
self assert: [Grapheme cr asUtf16 = (Utf16 with: 13)].
self assert: [Grapheme cr asUtf32 = (Utf32 with: 13)].
"Little/Big endian via views"
self assert: [Grapheme cr utf16LE asByteArray = #[13 0]].
self assert: [Grapheme cr utf16BE asByteArray = #[0 13]].
self assert: [Grapheme cr utf32LE asByteArray = #[13 0 0 0]].
self assert: [Grapheme cr utf32BE asByteArray = #[0 0 0 13]].
Case Mapping
A <Grapheme> can convert itself to its uppercase, lowercase and titlecase form. The result of these conversions is not a <Grapheme>, but rather a <UnicodeString>. Case mapping can change the number of unicode scalars. Depending on how they combine together, this may change the number of <Grapheme>s.
Uppercase
Here is an example that shows why case mapping answers a <UnicodeString>. Consider calling #asUppercase on the German sharp S 16rDF asGrapheme. When this is uppercased, it becomes two graphemes 16r53 (LATIN CAPITAL LETTER S) and 16r53 (LATIN CAPITAL LETTER S).
self assert: [16rDF asGrapheme asUppercase = 'SS' asUnicodeString]
Lowercase
This example also appears in the class documentation for <UnicodeScalar>, where lowercasing 16r0130 asUnicodeScalar (LATIN CAPITAL LETTER I WITH DOT ABOVE) produced two unicode scalars. However, these two unicode scalars form one user-perceived character, or <Grapheme>. If this lowercase form were rendered as a glyph on screen, the user would typically see a small letter i with a combining dot above.
self assert: [16r0130 asGrapheme asLowercase = (Grapheme value: #(16r0069 16r0307)) asUnicodeString].
Titlecase
There are several unicode characters that require special handling when they are used as the initial "character" in a word. One example is 16rFB01 asGrapheme (LATIN SMALL LIGATURE FI). When this is titlecased, it becomes two graphemes: 16r46 (LATIN CAPITAL LETTER F) and 16r69 (LATIN SMALL LETTER I).
self assert: [16rFB01 asGrapheme asTitlecase = 'Fi' asUnicodeString]
Character Compatibility
A <Grapheme> can be directly compared with a <Character> object. This is possible because the Smalltalk primitives that implement unicode have general awareness of <Character> and try to work with them where possible.
This compatibility is carried out in three different ways.
• Primitives: The virtual machine primitives can quickly detect and convert a <Character> if it is in the 7-bit ASCII range. A <Character> outside this range represents a value from a particular code page, and code pages can differ wildly in the upper 128 values. Because code page conversion is required in this case, a primitive failure is triggered.
• Primitive Fail Handlers: The primitive failure handlers in Smalltalk detect whether the argument was a <Character>, convert it to a <Grapheme> using the code page, and retry the primitive call (this time with a grapheme argument).
• Smalltalk Methods: Compatibility methods are provided in the Compatibility category of methods.
Important Note
• Compatibility relationship is uni-directional. A <Character> does not have direct knowledge of <Grapheme>.
• A <Grapheme> is NOT an immediate object like <Character>, so it is not good practice to compare graphemes using identity (==).
Here are the various ways that a <Grapheme> and a <Character> can work together.
"="
self assert: [97 asGrapheme = 97 asCharacter].
"<"
self assert: [97 asGrapheme < 98 asCharacter].
"<="
self assert: [97 asGrapheme <= 97 asCharacter].
">"
self assert: [98 asGrapheme > 97 asCharacter].
">="
self assert: [98 asGrapheme >= 97 asCharacter].
"hash"
self assert: [97 asGrapheme hash = 97 asCharacter hash].
"Only guarantee that 7-bit ascii will hash and = the same
for Character"
| s c |
s := 97 asGrapheme.
c := 97 asCharacter.
self assert: [s hash = c hash].
self assert: [s = c].
s := 159 asGrapheme. "159 unicode code point"
c := 159 asCharacter. "159 double-byte char value"
self assert: [s hash ~= c hash].
self assert: [s ~= c].
Class Methods
backspace
Answer the grapheme for a backspace.
Answers:
<Grapheme>
codePoint:
Create a new extended grapheme cluster by converting @anInteger
to an extended grapheme cluster (EGC). @anInteger is considered
to be a unicode scalar value.
Examples:
#'From unicode scalar value'.
self assert: [(Grapheme codePoint: 16r1F600) value = 16r1F600].
Arguments:
anInteger - <Integer> Unicode scalar value
Answers:
<Grapheme>
Raises:
<Exception> EsPrimErrValueOutOfRange if anInteger cannot be converted to a Grapheme
cr
Answer the grapheme containing a carriage return.
Answers:
<Grapheme>
crlf
Answer the grapheme containing a carriage return and a linefeed.
@Note - In Graphemes (digital representation of a user-perceived character), the
crlf is represented as 1 grapheme
Answers:
<Grapheme>
escape
Answer the grapheme for an escape.
Answers:
<Grapheme>
escaped:
Create a new extended grapheme cluster by converting @aStringObject
to an extended grapheme cluster (EGC).
Escaped Strings:
If @aStringObject is a String or UnicodeString, then the following escapes
will be parsed to create an extended grapheme cluster:
Escapes:
\x53 7-bit character code (exactly 2 digits, up to 0x7F)
\u{1F600} 24-bit Unicode character code (up to 6 digits)
\n Newline (This is the Lf character)
\r Carriage return (This is the Cr character)
\t Tab
\\ Backslash
\0 Nul
Examples:
#'From a single element string object'.
self assert: [(Grapheme escaped: '\x53') = $S asGrapheme].
self assert: [(Grapheme escaped: '\u{1F600}') name = 'GRINNING FACE'].
self assert: [(Grapheme escaped: '\t') = Grapheme tab].
self assert: [(Grapheme escaped: '\r\n') = Grapheme crlf].
Arguments:
aStringObject - <UnicodeString> unicode string containing escape characters
(Compat) <String> Smalltalk code-page string containing escape characters (requires conversion if outside ascii range)
Answers:
<Grapheme>
Raises:
<Exception> EsPrimErrValueOutOfRange if aStringObject cannot be converted to a Grapheme
lf
Answer the grapheme containing a line feed.
Answers:
<Grapheme>
newPage
Answer the grapheme for a new page.
Answers:
<Grapheme>
replacementCharacter
Answer the grapheme for the unicode replacement character.
Answers:
<Grapheme>
space
Answer the grapheme for space
Answers:
<Grapheme>
tab
Answer the grapheme for tab
Answers:
<Grapheme>
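The single-scalar factory methods above (backspace, escape, newPage, space, tab) carry no examples of their own. A brief sketch, assuming each answers the conventional ASCII control or space character and that #asciiValue answers its 7-bit value:

```smalltalk
"ASCII values of the single-scalar factory graphemes"
self assert: [Grapheme backspace asciiValue = 8].
self assert: [Grapheme tab asciiValue = 9].
self assert: [Grapheme newPage asciiValue = 12].
self assert: [Grapheme escape asciiValue = 27].
self assert: [Grapheme space asciiValue = 32].
```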
utf16:
Answer the grapheme constructed from @aByteObject which should be UTF-16 platform-endian encoded data.
@aByteObject is validated before any attempt is made to create a unicode string from its data.
Examples:
self assert: [(Grapheme utf16: 'a' utf16 contents) = $a].
Arguments:
aByteObject - <String | ByteArray> or byte shaped object
Answers:
<Grapheme>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-16
utf16BE:
Answer the grapheme constructed from @aByteObject which should be UTF-16 big-endian encoded data.
Examples:
self assert: [(Grapheme utf16BE: #[0 97]) = $a].
Arguments:
aByteObject - <String | ByteArray>
Answers:
<Grapheme>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-16BE
utf16LE:
Answer the grapheme constructed from @aByteObject which should be UTF-16 little-endian encoded data.
Examples:
self assert: [(Grapheme utf16LE: #[97 0]) = $a].
Arguments:
aByteObject - <String | ByteArray>
Answers:
<Grapheme>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-16LE
utf32:
Answer the grapheme constructed from @aByteObject which should be UTF-32 platform-endian encoded data.
@aByteObject is validated before any attempt is made to create a unicode string from its data.
Examples:
self assert: [(Grapheme utf32: 'a' utf32 contents) = $a].
Arguments:
aByteObject - <String | ByteArray | Utf32> or byte shaped object
Answers:
<Grapheme>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-32
utf32BE:
Answer the grapheme constructed from @aByteObject which should be UTF-32 big-endian encoded data.
Examples:
self assert: [(Grapheme utf32BE: #[0 0 0 97]) = $a].
Arguments:
aByteObject - <String | ByteArray | Utf32>
Answers:
<Grapheme>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-32BE
utf32LE:
Answer the grapheme constructed from @aByteObject which should be UTF-32 little-endian encoded data.
Examples:
self assert: [(Grapheme utf32LE: #[97 0 0 0]) = $a].
Arguments:
aByteObject - <String | ByteArray | Utf32>
Answers:
<Grapheme>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-32LE
utf8:
Answer the extended grapheme cluster constructed from @aByteObject which should be UTF-8 encoded data.
@aByteObject is validated before any attempt is made to create the grapheme from it.
Examples:
self assert: [(Grapheme utf8: #[97]) = $a]
Arguments:
aByteObject - <String | ByteArray>
Answers:
<Grapheme>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-8
value:
Create a new extended grapheme cluster by converting @anObject
to an extended grapheme cluster (EGC).
Escaped Strings:
If @anObject is a String or UnicodeString, then the following escapes
will be parsed to create an extended grapheme cluster:
Escapes:
\x53 7-bit character code (exactly 2 digits, up to 0x7F)
\u{1F600} 24-bit Unicode character code (up to 6 digits)
\n Newline (This is the Lf character)
\r Carriage return (This is the Cr character)
\t Tab
\\ Backslash
\0 Nul
Examples:
#'From unicode scalar value'.
self assert: [(Grapheme value: 16r1F600) value = 16r1F600].
#'From unicode scalar object'.
self assert: [(Grapheme value: 16r1F600 asUnicodeScalar) unicodeScalars first = 16r1F600 asUnicodeScalar].
#'From array of unicode scalar object/values'.
self assert: [(Grapheme value: { 16r65. 16r301 asUnicodeScalar }) utf32 contents = (Utf32 with: 16r65 with: 16r301)].
#'From a Character object'.
self assert: [(Grapheme value: $a) asciiValue = $a value].
#'From a single element string object'.
self assert: [(Grapheme value: '\x53') = $S asGrapheme].
self assert: [(Grapheme value: '\u{1F600}') name = 'GRINNING FACE'].
self assert: [(Grapheme value: '\t') = Grapheme tab].
self assert: [(Grapheme value: '\r\n') = Grapheme crlf].
Arguments:
anObject - <Integer> Unicode code point
<UnicodeScalar> unicode scalar
<Array> of <Integer | UnicodeScalar> array of unicode scalars
<UnicodeString> unicode string containing escape characters
(Compat) <Character> Smalltalk code-page character (requires conversion if outside ascii range)
(Compat) <String> Smalltalk code-page string containing escape characters (requires conversion if outside ascii range)
Answers:
<Grapheme>
Raises:
<Exception> EsPrimErrValueOutOfRange if anObject cannot be converted to a Grapheme
Instance Methods
<
Answer a Boolean indicating true if the receiver is less
than aGrapheme; answer false otherwise.
Arguments:
aGrapheme - <Grapheme> or <Character> for compatibility
Answers:
<Boolean>
<=
Answer a Boolean indicating true if the receiver is less than
or equal to aGrapheme; answer false otherwise
Arguments:
aGrapheme - <Grapheme> or <Character> for compatibility
Answers:
<Boolean>
=
Answer a Boolean indicating true if the receiver is equal
to aGrapheme; answer false otherwise.
Examples:
self assert: [Grapheme cr = Grapheme cr].
self assert: [Grapheme cr = Character cr].
Arguments:
aGrapheme - <Grapheme> or <Character> for compatibility
Answers:
<Boolean>
>
Answer a Boolean indicating true if the receiver is greater
than aGrapheme; answer false otherwise
Arguments:
aGrapheme - <Grapheme> or <Character> for compatibility
Answers:
<Boolean>
>=
Answer a Boolean indicating true if the receiver is greater
than or equal to aGrapheme; answer false otherwise
Arguments:
aGrapheme - <Grapheme> or <Character> for compatibility
Answers:
<Boolean>
asciiValue
Answer the ASCII encoding value of this grapheme, if it is ascii.
'\r\n' will be normalized to \n
Answers:
<Integer>
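A short sketch of the two cases described above. The crlf result follows from the stated normalization to \n and has not been verified against a running image.

```smalltalk
"A plain ASCII grapheme answers its 7-bit value"
self assert: [$a asGrapheme asciiValue = 97].
"crlf is normalized to \n, so the line feed value (10) is expected"
self assert: [Grapheme crlf asciiValue = 10].
```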
asGrapheme
Answer self
Answers:
<Grapheme>
asInteger
Answer an Integer representing the numeric value of the
receiver.
asLowercase
Answers a lowercased version of this grapheme.
Case conversion can result in multiple scalars or graphemes,
therefore the result must be expressed as a UnicodeString.
For example, the character 'İ' (16r0130 asUnicodeScalar LATIN CAPITAL LETTER I WITH DOT ABOVE)
becomes two scalars (16r0069 LATIN SMALL LETTER I, 16r0307 COMBINING DOT ABOVE)
when converted to lowercase (but still a single grapheme).
Examples:
self assert: [$A asGrapheme asLowercase = 'a' asUnicodeString].
self assert: [16r0130 asGrapheme asLowercase = (UnicodeString value: { 16r0069 asUnicodeScalar. 16r0307 asUnicodeScalar. })].
Answers:
<UnicodeString>
asNFC
Answer an NFC normalized copy of this grapheme.
If the grapheme is already normalized, then answer the receiver.
Otherwise, answer a new grapheme.
Examples:
'LATIN SMALL LETTER E, COMBINING ACUTE ACCENT -> LATIN SMALL LETTER E WITH ACUTE'.
self assert: [(Grapheme value: #(16r65 16r301)) asNFC unicodeScalars first value = 16rE9]
Answers:
<Grapheme>
asNFD
Answer an NFD normalized copy of this grapheme.
If the grapheme is already normalized, then answer the receiver.
Otherwise, answer a new grapheme.
Examples:
'LATIN SMALL LETTER E WITH ACUTE -> LATIN SMALL LETTER E, COMBINING ACUTE ACCENT'.
self assert: [16rE9 asGrapheme asNFD unicodeScalars contents = { 16r65 asUnicodeScalar. 16r301 asUnicodeScalar}]
Answers:
<Grapheme>
asNFKC
Answer an NFKC normalized copy of this grapheme.
If the grapheme is already normalized, then answer the receiver.
Otherwise, answer a new grapheme.
Examples:
'SUPERSCRIPT TWO -> DIGIT TWO'.
self assert: [16rB2 asGrapheme asNFKC = 16r32 asGrapheme]
Answers:
<Grapheme>
asNFKD
Answer an NFKD normalized copy of this grapheme.
If the grapheme is already normalized, then answer the receiver.
Otherwise, answer a new grapheme.
Examples:
'SUPERSCRIPT TWO -> DIGIT TWO'.
self assert: [16rB2 asGrapheme asNFKD = 16r32 asGrapheme]
Answers:
<Grapheme>
asString
Answer the receiver as a <UnicodeString> instance.
NOTE: Regardless of the selector, this method returns a UnicodeString
instead of a String, to ease the interplay between Graphemes and UnicodeStrings.
If you want a single byte string you can use #asSBString or #asUtf8.
Answers:
<UnicodeString>
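Because #asString answers a <UnicodeString>, it is expected to behave like #asUnicodeString. A minimal sketch under that reading:

```smalltalk
self assert: [$a asGrapheme asString = 'a' asUnicodeString].
self assert: [$a asGrapheme asString = $a asGrapheme asUnicodeString].
```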
asTitlecase
Answers a titlecased version of this grapheme.
Case conversion can result in multiple scalars or graphemes,
therefore the result must be expressed as a UnicodeString.
For example, the ligature 'ﬁ' (16rFB01 LATIN SMALL LIGATURE FI)
becomes 'Fi' (16r0046 LATIN CAPITAL LETTER F, 16r0069 LATIN SMALL LETTER I)
when converted to titlecase.
Examples:
self assert: [$a asGrapheme asTitlecase = 'A' asUnicodeString].
self assert: [16rFB01 asGrapheme asTitlecase = (UnicodeString value: { 16r46 asUnicodeScalar. 16r69 asUnicodeScalar. })].
Answers:
<UnicodeString>
asUnicodeString
Answer the receiver as a <UnicodeString> instance.
Answers:
<UnicodeString>
asUppercase
Answers an uppercased version of this grapheme.
Case conversion can result in multiple scalars or graphemes,
therefore the result must be expressed as a UnicodeString.
For example, the German letter 'ß' becomes 'SS' when converted
to uppercase.
Examples:
self assert: [$a asGrapheme asUppercase = 'A' asUnicodeString].
self assert: [16rDF asGrapheme asUppercase = (UnicodeString value: { 16r53 asUnicodeScalar. 16r53 asUnicodeScalar. })].
Answers:
<UnicodeString>
asUtf16
Answer a <Utf16> that contains the utf-16 encoded bytes of the receiver.
Example:
self assert: [233 asGrapheme asUtf16 = (Utf16 with: 233)]
Answers:
<Utf16>
asUtf32
Answer a <Utf32> that contains the utf-32 encoded bytes of the receiver.
Example:
self assert: [233 asGrapheme asUtf32 = (Utf32 with: 233)]
Answers:
<Utf32>
asUtf8
Answer a <Utf8> that contains the utf-8 encoded bytes of the receiver.
Example:
self assert: [233 asGrapheme asUtf8 = (Utf8 with: 195 with: 169)]
Answers:
<Utf8>
codePoint
Compatibility: An extended grapheme cluster only has an expressible codePoint
if it is defined by 1 scalar.
This is for compatibility with Character>>codePoint.
Answers:
<Integer>
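A minimal sketch for single-scalar graphemes, where the code point is expressible. The behavior for graphemes made of several scalars is not specified here, so the sketch does not exercise it.

```smalltalk
self assert: [97 asGrapheme codePoint = 97].
"GRINNING FACE is a single scalar outside the Basic Multilingual Plane"
self assert: [16r1F600 asGrapheme codePoint = 16r1F600].
"The Unicode replacement character is U+FFFD"
self assert: [Grapheme replacementCharacter codePoint = 16rFFFD].
```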
compareTo:
Orders the receiver relative to @aGrapheme.
Both the receiver and @aGrapheme will be guaranteed to have the same normalization
form before the comparison is made.
Fail if @aGrapheme is not a convertible <Grapheme> object.
Examples:
self assert: [(97 asGrapheme compareTo: 98 asGrapheme) = -1].
self assert: [(97 asGrapheme compareTo: 97 asGrapheme) = 0].
self assert: [(98 asGrapheme compareTo: 97 asGrapheme) = 1].
Arguments:
aGrapheme - <Grapheme>
Answers:
<Integer> -1 The receiver is less than @aGrapheme
0 The receiver is equal to @aGrapheme
1 The receiver is greater than @aGrapheme
digitValue
Answer an Integer corresponding to the numerical radix of
the receiver. Return 0-9 if the receiver is $0-$9, and
10-35 if it is $A-$Z; otherwise return -1.
NOTE: Since Graphemes might be composed of several scalars,
answer the digit value only if the receiver is ASCII
(that is, it is composed of a single ASCII scalar).
Answers:
<Integer>
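The ranges described above can be sketched as follows. Only cases stated in the description are shown; lowercase letters are left out because the description does not cover them.

```smalltalk
"Decimal digits map to 0-9"
self assert: [$7 asGrapheme digitValue = 7].
"Uppercase letters map to 10-35"
self assert: [$A asGrapheme digitValue = 10].
self assert: [$Z asGrapheme digitValue = 35].
"Anything else answers -1"
self assert: [$! asGrapheme digitValue = -1].
```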
escaped
Answer a copy of the receiver that has been escaped using the following
rules.
Escaped Strings:
Tab is escaped as `\t`
Carriage return is escaped as `\r`.
Line feed is escaped as `\n`.
Backslash is escaped as '\\'
Any character in the 'printable ASCII' range `16r20` .. `16r7E` inclusive is not escaped.
All other characters are given hexadecimal Unicode escapes `\u{NNNNNN}` where
`NNNNNN` is a hexadecimal uppercase representation
Example:
self assert: [Grapheme tab escaped = '\t'].
self assert: [Grapheme crlf escaped = '\r\n'].
self assert: [$a asGrapheme escaped = 'a'].
self assert: [$\ asGrapheme escaped = '\\'].
self assert: [0 asGrapheme escaped = '\u{0}'].
self assert: [16r1F37A asGrapheme escaped = '\u{1F37A}'].
Answers:
<UnicodeString>
isAscii
Answers true if the receiver is within the ASCII range, false otherwise
Examples:
self assert: [$A asGrapheme isAscii].
self assert: [233 asGrapheme isAscii not].
Answers:
<Boolean>
isAsciiAlphabetic
Answers true if the receiver is an ASCII alphabetic character.
U+0041 'A' ..= U+005A 'Z', or
U+0061 'a' ..= U+007A 'z'.
Examples:
self assert: [$A asGrapheme isAsciiAlphabetic].
self assert: [233 asGrapheme isAsciiAlphabetic not].
Answers:
<Boolean>
isAsciiAlphaNumeric
Answers true if the receiver is an ASCII alphanumeric character:
U+0041 'A' ..= U+005A 'Z', or
U+0061 'a' ..= U+007A 'z', or
U+0030 '0' ..= U+0039 '9'.
Examples:
self assert: [$A asGrapheme isAsciiAlphaNumeric].
self assert: [$5 asGrapheme isAsciiAlphaNumeric].
self assert: [233 asGrapheme isAsciiAlphaNumeric not].
Answers:
<Boolean>
isAsciiControl
Answers true if the receiver is an ASCII control character:
U+0000 NUL ..= U+001F UNIT SEPARATOR, or
U+007F DELETE.
Note that most ASCII whitespace characters are control characters, but SPACE is not.
Examples:
self assert: [Grapheme cr isAsciiControl].
self assert: [Grapheme space isAsciiControl not].
Answers:
<Boolean>
isAsciiDigit
Answers true if the receiver is an ASCII decimal digit:
U+0030 '0' ..= U+0039 '9'.
Examples:
self assert: [$5 asGrapheme isAsciiDigit].
self assert: [Grapheme space isAsciiDigit not].
Answers:
<Boolean>
isAsciiGraphic
Answers true if the receiver is an ASCII graphic character:
U+0021 '!' ..= U+007E '~'.
Examples:
self assert: [$! asGrapheme isAsciiGraphic].
self assert: [16r9 asGrapheme isAsciiGraphic not].
Answers:
<Boolean>
isAsciiHexDigit
Answers true if the receiver is an ASCII hexadecimal digit:
U+0030 '0' ..= U+0039 '9', or
U+0041 'A' ..= U+0046 'F', or
U+0061 'a' ..= U+0066 'f'.
Examples:
self assert: [$A asGrapheme isAsciiHexDigit].
self assert: [$G asGrapheme isAsciiHexDigit not].
Answers:
<Boolean>
isAsciiLowercase
Answers true if the receiver is an ASCII lowercase character:
U+0061 'a' ..= U+007A 'z'.
Examples:
self assert: [$a asGrapheme isAsciiLowercase].
self assert: [$A asGrapheme isAsciiLowercase not].
Answers:
<Boolean>
isAsciiPunctuation
Answers true if the receiver is an ASCII punctuation character:
U+0021 ..= U+002F ! <quote> # $ % & ' ( ) * + , - . /, or
U+003A ..= U+0040 : ; < = > ? @, or
U+005B ..= U+0060 [ \ ] ^ _ ` , or
U+007B ..= U+007E { | } ~
Examples:
self assert: [$! asGrapheme isAsciiPunctuation].
self assert: [$a asGrapheme isAsciiPunctuation not].
Answers:
<Boolean>
isAsciiUppercase
Answers true if the receiver is an ASCII uppercase character:
U+0041 'A' ..= U+005A 'Z'.
Examples:
self assert: [$A asGrapheme isAsciiUppercase].
self assert: [$a asGrapheme isAsciiUppercase not].
Answers:
<Boolean>
isAsciiWhitespace
Answers true if the receiver is an ASCII whitespace character:
U+0020 SPACE,
U+0009 HORIZONTAL TAB,
U+000A LINE FEED,
U+000C FORM FEED, or
U+000D CARRIAGE RETURN.
Note: This uses the WhatWG Infra Standard's definition of ASCII whitespace.
Examples:
self assert: [Grapheme space isAsciiWhitespace].
self assert: [$a asGrapheme isAsciiWhitespace not].
Answers:
<Boolean>
isCased
Answer true if the receiver changes under any form of case conversion,
false otherwise.
Examples:
self assert: [$a asGrapheme isCased].
self assert: [Grapheme space isCased not].
Answers:
<Boolean>
isDigit
Answer true if the receiver is a valid Smalltalk digit as described in
the ANSI Smalltalk Standard; otherwise answer false.
digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
Read #isSmalltalkDigit for more details.
Answers:
<Boolean>
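The sketch below contrasts #isDigit with #isNumeric. It assumes that a non-ASCII decimal digit such as ARABIC-INDIC DIGIT FIVE (16r0665) counts as numeric under its Unicode general category but falls outside the ANSI Smalltalk digit grammar; this has not been verified against a running image.

```smalltalk
self assert: [$5 asGrapheme isDigit].
self assert: [$a asGrapheme isDigit not].

"ARABIC-INDIC DIGIT FIVE: a number for Unicode, not a Smalltalk digit"
self assert: [16r0665 asGrapheme isNumeric].
self assert: [16r0665 asGrapheme isDigit not].
```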
isGrapheme
Answer true, since the receiver is a unicode grapheme object
Answers:
<Boolean>
isLetter
Answers true if the receiver represents a letter, false otherwise
Examples:
self assert: [$A asGrapheme isLetter].
self assert: [$5 asGrapheme isLetter not].
Answers:
<Boolean>
isLowercase
Answer true if the receiver is considered lowercase.
Lowercase characters change when converted to uppercase, but not
when converted to lowercase. The following characters are all lowercase
- 'é' (16r0065 LATIN SMALL LETTER E, 16r0301 COMBINING ACUTE ACCENT)
- 'и' (16r0438 CYRILLIC SMALL LETTER I)
- 'π' (16r03C0 GREEK SMALL LETTER PI)
Examples:
self assert: [16rE2 asGrapheme name = 'LATIN SMALL LETTER A WITH CIRCUMFLEX'].
self assert: [16rE2 asGrapheme isLowercase].
self assert: [16rC5 asGrapheme name = 'LATIN CAPITAL LETTER A WITH RING ABOVE'].
self assert: [16rC5 asGrapheme isLowercase not].
Answers:
<Boolean>
isNewline
Answers true if the receiver represents a newline.
Examples:
self assert: [Grapheme cr isNewline].
'LINE SEPARATOR'.
self assert: [16r2028 asGrapheme isNewline].
self assert: [Grapheme crlf isNewline].
Answers:
<Boolean>
isNFC
Answer true if the receiver is NFC normalized, false otherwise.
Examples:
self assert: [16rE9 asGrapheme isNFC].
self assert: [16rE9 asGrapheme asNFD isNFC not].
Answers:
<Boolean>
isNFD
Answer true if the receiver is NFD normalized, false otherwise.
Examples:
self assert: [16rE9 asGrapheme isNFD not].
self assert: [16rE9 asGrapheme asNFD isNFD].
Answers:
<Boolean>
isNFKC
Answer true if the receiver is NFKC normalized, false otherwise.
Examples:
'SUPERSCRIPT TWO'.
self assert: [16rB2 asGrapheme isNFKC not].
'DIGIT TWO'.
self assert: [16r32 asGrapheme isNFKC].
Answers:
<Boolean>
isNFKD
Answer true if the receiver is NFKD normalized, false otherwise.
Examples:
'SUPERSCRIPT TWO'.
self assert: [16rB2 asGrapheme isNFKD not].
'DIGIT TWO'.
self assert: [16r32 asGrapheme isNFKD].
Answers:
<Boolean>
isNumeric
Answers true if the receiver has one of the general categories for numbers, false otherwise
Examples:
self assert: [$3 asGrapheme isNumeric].
self assert: [16r2070 asGrapheme name = 'SUPERSCRIPT ZERO'].
self assert: [16r2070 asGrapheme isNumeric].
self assert: [16r1F40 asGrapheme name = 'GREEK SMALL LETTER OMICRON WITH PSILI'].
self assert: [16r1F40 asGrapheme isNumeric not].
Answers:
<Boolean>
isSeparator
Compatibility: Captures a superset of Character>>isSeparator
isSmalltalkAlphaNumeric
Answer true if the receiver is a valid Smalltalk letter or digit, false otherwise
Answers:
<Boolean>
isSmalltalkDigit
Read #isDigit for more details
Answers:
<Boolean>
isSmalltalkLetter
Answer true if the receiver is a valid Smalltalk letter as described in the ANSI Smalltalk Standard; otherwise answer false.
letter ::= uppercaseAlphabetic | lowercaseAlphabetic | nonCaseLetter
uppercaseAlphabetic ::= 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z'
lowercaseAlphabetic ::= 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z'
nonCaseLetter ::= '_'
It would be easier to simply send #isLetter, but this is not possible because some country codes have characters that report themselves as letters
but are not valid Smalltalk syntactic letters. The nonCaseLetter ('_') must also be allowed.
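The grammar above can be sketched as a plain range check on a <Character>. This is only an illustration of the rule, not the VAST implementation; the block name isSmalltalkLetter is hypothetical.

```smalltalk
| isSmalltalkLetter |
"Hypothetical block mirroring the grammar; not the VAST implementation."
isSmalltalkLetter := [:char |
    (char between: $A and: $Z)
        or: [(char between: $a and: $z)
            or: [char = $_]]].
self assert: [isSmalltalkLetter value: $q].
self assert: [isSmalltalkLetter value: $_].
self assert: [(isSmalltalkLetter value: $3) not].
"LATIN SMALL LETTER E WITH ACUTE is a letter, but not a Smalltalk syntactic letter."
self assert: [(isSmalltalkLetter value: (Character value: 233)) not]
```

The last assertion shows the kind of character for which #isLetter and the Smalltalk syntax rule disagree.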
isUppercase
Answer true if the receiver is considered uppercase.
Uppercase characters vary under case-conversion to lowercase,
but not when converted to uppercase. The following characters are
all uppercase.
- 'É' (16r0045 LATIN CAPITAL LETTER E, 16r0301 COMBINING ACUTE ACCENT)
- 'И' (16r0418 CYRILLIC CAPITAL LETTER I)
- 'Π' (16r03A0 GREEK CAPITAL LETTER PI)
Examples:
self assert: [16rC5 asGrapheme name = 'LATIN CAPITAL LETTER A WITH RING ABOVE'].
self assert: [16rC5 asGrapheme isUppercase].
self assert: [16rE2 asGrapheme name = 'LATIN SMALL LETTER A WITH CIRCUMFLEX'].
self assert: [16rE2 asGrapheme isUppercase not].
Answers:
<Boolean>
isWhitespace
Answers true if the receiver represents whitespace, including newlines,
false otherwise.
Examples:
self assert: [16r1680 asGrapheme name = 'OGHAM SPACE MARK'].
self assert: [16r1680 asGrapheme isWhitespace].
self assert: [16r1F40 asGrapheme name = 'GREEK SMALL LETTER OMICRON WITH PSILI'].
self assert: [16r1F40 asGrapheme isWhitespace not].
Answers:
<Boolean>
join:
Append the elements of the argument @aCollection, separating them by the receiver.
Examples:
self assert: [(Grapheme space join: #('VA' 'is' 'cool')) = 'VA is cool' asUnicodeString]
Arguments:
aCollection - <Collection>
Answers:
<String | Symbol>
name
Answer the name, which is a comma-delimited concatenation of the names of
all the unicode scalars in the grapheme. Single-scalar graphemes will
have the same name as their unicode scalar equivalent.
Examples:
self assert: [16r388 asGrapheme name = 'GREEK CAPITAL LETTER EPSILON WITH TONOS'].
self assert: [233 asGrapheme asNFD name = 'LATIN SMALL LETTER E,COMBINING ACUTE ACCENT']
Answers:
<UnicodeString>
names
Answer the names of all the unicode scalars in the receiver as an Array.
Examples:
self assert: [16r388 asGrapheme asNFD names = #('GREEK CAPITAL LETTER EPSILON' 'COMBINING ACUTE ACCENT')]
Answers:
<Array>
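As a small sketch of how #name and #names relate, the same decomposed grapheme can be queried both ways. The expected values are taken from the examples above; it is assumed that the elements answered by #names compare equal to string literals, as in the 16r388 example.

```smalltalk
| grapheme |
"LATIN SMALL LETTER E WITH ACUTE, decomposed into two scalars."
grapheme := 233 asGrapheme asNFD.
"One array element per scalar..."
self assert: [grapheme names = #('LATIN SMALL LETTER E' 'COMBINING ACUTE ACCENT')].
"...and the same names joined by commas."
self assert: [grapheme name = 'LATIN SMALL LETTER E,COMBINING ACUTE ACCENT']
```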
sameAs:
Answer whether the receiver is equal to aGrapheme, ignoring case.
Arguments:
aGrapheme - <Grapheme> or <Character> for compatibility
Answers:
<Boolean>
to:
Answer a collection of Graphemes with consecutive codepoints,
starting from the receiver's codepoint up to aGrapheme's codepoint.
Arguments:
aGrapheme - <Grapheme>
Answers:
<Array>
unicodeScalars
Answer the unicode scalar view of the receiver.
A unicode scalar <UnicodeScalar> represents a 'unicode scalar value', which is similar to,
but not the same as, a 'unicode code point' as it will never represent high/low-surrogate
code points reserved for UTF-16 encoding.
Example:
| view |
view := $H asGrapheme unicodeScalars.
self assert: [view size = 1].
self assert: [view contents = (Array with: $H asUnicodeScalar)].
self assert: [view asByteArray = (ByteArray with: $H value)].
self assert: [view next = $H asUnicodeScalar].
self assert: [view atEnd]
Answers:
<UnicodeScalarView>
utf16
Answer the utf16 platform-endian view of the receiver.
Each element in this view is a UTF-16 code unit. UTF-16 is a 16-bit
encoded form of unicode scalar values.
Example:
| view |
'MUSICAL NOTE - U+1F3B5'.
view := 16r1F3B5 asGrapheme utf16.
self assert: [view size = 2].
self assert: [view next = 55356].
self assert: [view next = 57269].
self assert: [view atEnd]
Answers:
<Utf16View>
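The two code units in the example above are the UTF-16 surrogate pair for U+1F3B5. As a rough sketch using plain Integer arithmetic only (the variable names are illustrative and not part of the VAST API), the pair can be derived by hand:

```smalltalk
| codePoint offset high low |
codePoint := 16r1F3B5.
"Scalars above 16rFFFF are encoded relative to 16r10000."
offset := codePoint - 16r10000.
"High surrogate: 16rD800 plus the upper 10 bits of the offset."
high := 16rD800 + (offset bitShift: -10).
"Low surrogate: 16rDC00 plus the lower 10 bits of the offset."
low := 16rDC00 + (offset bitAnd: 16r3FF).
self assert: [high = 55356].
self assert: [low = 57269]
```

This is why the view reports a size of 2 for a grapheme made of a single scalar.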
utf16BE
Answer the utf16 big-endian view of the receiver.
Each element in this view is a UTF-16 code unit. UTF-16 is a 16-bit
encoded form of unicode scalar values.
Example:
| view |
'MUSICAL NOTE - U+1F3B5'.
view := 16r1F3B5 asGrapheme utf16BE.
self assert: [view size = 2].
self assert: [view next = 15576].
self assert: [view next = 46559].
self assert: [view atEnd]
Answers:
<Utf16BigEndianView>
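The values 15576 and 46559 in the example above are the surrogate pair 55356 (16rD83C) and 57269 (16rDFB5) with the two bytes of each code unit exchanged. One plausible reading, which is an assumption and not stated here, is that the example was evaluated on a little-endian platform, so each big-endian code unit appears byte-swapped when read as an integer. The arithmetic can be sketched with a hypothetical helper block:

```smalltalk
| swap |
"Hypothetical helper block: exchange the two bytes of a 16-bit code unit."
swap := [:unit | ((unit bitAnd: 16rFF) bitShift: 8) + (unit bitShift: -8)].
self assert: [(swap value: 55356) = 15576].
self assert: [(swap value: 57269) = 46559]
```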
utf16LE
Answer the utf16 little-endian view of the receiver.
Each element in this view is a UTF-16 code unit. UTF-16 is a 16-bit
encoded form of unicode scalar values.
Example:
| view |
'MUSICAL NOTE - U+1F3B5'.
view := 16r1F3B5 asGrapheme utf16LE.
self assert: [view size = 2].
self assert: [view next = 55356].
self assert: [view next = 57269].
self assert: [view atEnd]
Answers:
<Utf16LittleEndianView>
utf32
Answer the utf32 view of the receiver.
Each element in this view is a UTF-32 code unit. UTF-32 is a 32-bit
encoded form of unicode scalar values.
Example:
| view |
'MUSICAL NOTE - U+1F3B5'.
view := 16r1F3B5 asGrapheme utf32.
self assert: [view size = 1].
self assert: [view next = 16r1F3B5].
self assert: [view atEnd]
Answers:
<Utf32View>
utf32BE
Answer the utf32 big-endian view of the receiver.
Each element in this view is a UTF-32 code unit. UTF-32 is a 32-bit
encoded form of unicode scalar values.
Example:
| view |
'MUSICAL NOTE - U+1F3B5'.
view := 16r1F3B5 asGrapheme utf32BE.
self assert: [view size = 1].
self assert: [view next = 3052601600].
self assert: [view asByteArray = #[0 1 243 181]].
self assert: [view atEnd]
Answers:
<Utf32BigEndianView>
utf32LE
Answer the utf32 little-endian view of the receiver.
Each element in this view is a UTF-32 code unit. UTF-32 is a 32-bit
encoded form of unicode scalar values.
Example:
| view |
'MUSICAL NOTE - U+1F3B5'.
view := 16r1F3B5 asGrapheme utf32LE.
self assert: [view size = 1].
self assert: [view next = 127925].
self assert: [view asByteArray = #[181 243 1 0]].
self assert: [view atEnd]
Answers:
<Utf32LittleEndianView>
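The byte arrays in the utf32BE and utf32LE examples hold the same four bytes in opposite order. As a sketch in plain Smalltalk (the helper block asBigEndianInteger is hypothetical and not part of the VAST API), folding each array most significant byte first reproduces both integers that appear in those examples:

```smalltalk
| asBigEndianInteger |
"Hypothetical helper block: fold bytes into an integer, most significant byte first."
asBigEndianInteger := [:bytes |
    bytes inject: 0 into: [:acc :byte | (acc bitShift: 8) + byte]].
"The big-endian bytes spell out the scalar value itself."
self assert: [(asBigEndianInteger value: #[0 1 243 181]) = 16r1F3B5].
"With the byte order reversed, the same fold gives the element value
 shown in the utf32BE example."
self assert: [(asBigEndianInteger value: #[181 243 1 0]) = 3052601600]
```

The second result suggests, as an assumption, that the utf32BE example was evaluated on a little-endian platform, where a big-endian code unit read as an integer appears byte-reversed.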
utf8
Answer the utf8 view of the receiver.
Each element in this view is a UTF-8 code unit. UTF-8 is an 8-bit
encoded form of unicode scalar values.
Example:
| view |
'LATIN SMALL LETTER E WITH ACUTE'.
view := 233 asGrapheme utf8.
self assert: [view size = 2].
self assert: [view next = 195].
self assert: [view next = 169].
self assert: [view atEnd]
Answers:
<Utf8View>
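The two code units in the example above follow from the two-byte UTF-8 form. A minimal sketch using only Integer arithmetic (the variable names are illustrative) derives them from the scalar value 233:

```smalltalk
| codePoint lead continuation |
codePoint := 233.
"The two-byte form applies to scalars from 16r80 through 16r7FF."
self assert: [codePoint between: 16r80 and: 16r7FF].
"Lead byte: bit pattern 110xxxxx carrying the upper 5 bits."
lead := 16rC0 + (codePoint bitShift: -6).
"Continuation byte: bit pattern 10xxxxxx carrying the lower 6 bits."
continuation := 16r80 + (codePoint bitAnd: 16r3F).
self assert: [lead = 195].
self assert: [continuation = 169]
```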
value
Compatibility: An extended grapheme cluster only has an expressible integer value
if it is defined by a single scalar.
This is for compatibility with Character>>value.
Answers:
<Integer>