UnicodeScalar
Description
A <UnicodeScalar> represents any Unicode code point except those in the surrogate range (U+D800 to U+DFFF), which is reserved for UTF-16 encoding.
Code points are the unique values that the Unicode Consortium assigns to the Unicode code space. (The code space is the complete range of possible Unicode values.) Depending on your definition of a character, a code point can represent a character. There are many other classifications for code points as well.
Since Unicode scalars are a subset of code points, many languages consider a Unicode scalar to be the native character, and accessing a string by an index value would result in a Unicode scalar.
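To make the distinction between code points and scalars concrete, here is a minimal sketch in Python. Python is used only because it is easy to run outside VAST; the sketch does not use the VAST API, and the helper name is_unicode_scalar is hypothetical. It relies only on two facts from the Unicode Standard: the code space runs from 0 to 16r10FFFF, and the surrogate range reserved for UTF-16 is U+D800 to U+DFFF.

```python
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF
MAX_CODE_POINT = 0x10FFFF


def is_unicode_scalar(code_point: int) -> bool:
    """Answer True if code_point is in the code space and not a surrogate."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        return False
    return not (SURROGATE_FIRST <= code_point <= SURROGATE_LAST)


# Number of scalars: every code point minus the 2048 surrogates.
SCALAR_COUNT = (MAX_CODE_POINT + 1) - (SURROGATE_LAST - SURROGATE_FIRST + 1)
```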
In VAST, a <UnicodeScalar> can describe many properties from the Unicode Character Database, such as a scalar's name, its case, and whether it is alphabetic or numeric. VAST <UnicodeView>s make it easy to access Unicode scalars from <Grapheme>s and <UnicodeString>s.
A <UnicodeScalar> provides several useful properties that the user can reflect on. See the Properties category on the instance side for the complete list.
A <UnicodeScalar> is not an immediate object. Do not use identity comparison (==), even though it sometimes appears to work because the Latin-1 scalars (the first 256 code points of Unicode) are cached.
Class State
• latin1Scalars: <Array> of <UnicodeScalar> objects. The first 256 code points in Unicode exactly match the Latin-1 encoding (also known as ISO-8859-1). This includes both ASCII and the Latin-1 Supplement, which are heavily used. Because of their frequency of use, the virtual machine refers to objects in this scalar cache to reduce object allocation and increase performance.
Creation
A <UnicodeScalar> is typically created automatically while cursoring through the #unicodeScalars view of a <UnicodeString>, <Grapheme>, <String> or <ByteArray> object.
| firstScalar |
firstScalar := 'abc' unicodeScalars next.
self assert: [firstScalar = $a]
However, you can manually create a <UnicodeScalar> by using the APIs from the Creation categories on the class side. Additionally, there are #asUnicodeScalar extension methods provided in the system. Below are some examples of both.
"From Integer, which is interpreted as the unicode code point value"
self assert: [97 asUnicodeScalar = (UnicodeScalar value: 97)].
"From Character, which performs any necessary code page conversion
if the value is > 7-bit ascii range"
self assert: [97 asCharacter asUnicodeScalar = (UnicodeScalar value: 97)].
"From UTF-8 bytes"
self assert: [(UnicodeScalar utf8: #[97]) = $a].
"From UTF-16 bytes (platform/little/big-endian)"
self assert: [(UnicodeScalar utf16LE: #[97 0]) = (UnicodeScalar utf16BE: #[0 97])].
self assert: [(UnicodeScalar utf16LE: #[97 0]) = $a].
"From UTF-32 bytes (platform/little/big-endian)"
self assert: [(UnicodeScalar utf32LE: #[97 0 0 0]) = (UnicodeScalar utf32BE: #[0 0 0 97])].
self assert: [(UnicodeScalar utf32LE: #[97 0 0 0]) = $a].
"The #value: API can accept an <Integer> or a <Character>"
self assert: [(UnicodeScalar value: 97) = $a].
self assert: [(UnicodeScalar value: $a) = $a].
"The #value: API can also accept a <String> or <UnicodeString> describing the scalar in escaped syntax"
"See UnicodeScalar class>>value: method comments for a complete list of escapes"
self assert: [(UnicodeScalar value: '\t') = UnicodeScalar tab].
self assert: [(UnicodeScalar value: '\u{1F600}') name = 'GRINNING FACE'].
"Factory methods for commonly used unicode scalars"
self assert: [UnicodeScalar cr = Character cr].
self assert: [UnicodeScalar lf = Character lf].
"Special replacement character, which is used anywhere unicode content must be repaired"
self assert: [(UnicodeString utf8: #[255] repair: true) unicodeScalars first = UnicodeScalar replacementCharacter]
Properties
A <UnicodeScalar> has many properties that are defined by the Unicode Standard. These can be found in the Properties category on the instance side.
• #canonicalCombiningClass - Canonical combining class corresponding to the 'Canonical Combining Class' property in the Unicode Standard.
• #generalCategory - General classification. This is its 'first-order, most usual categorization'
• #isAlphabetic - Boolean indicating if the unicode scalar is alphabetic
• #isAlphaNumeric - Boolean indicating if the unicode scalar is alphabetic or numeric
• #isAscii - Boolean indicating if the unicode scalar is in ASCII range
• #isAsciiAlphabetic - Boolean indicating if the unicode scalar is an ASCII alphabetic char
• #isAsciiAlphaNumeric - Boolean indicating if the unicode scalar is an ASCII alphabetic or numeric char
• #isAsciiControl - Boolean indicating if the unicode scalar is an ASCII control char
• #isAsciiDigit - Boolean indicating if the unicode scalar is an ASCII digit
• #isAsciiGraphic - Boolean indicating if the unicode scalar is an ASCII graphic character
• #isAsciiHexDigit - Boolean indicating if the unicode scalar is an ASCII hex digit
• #isAsciiLowercase - Boolean indicating if the unicode scalar is an ASCII lowercase char
• #isAsciiPunctuation - Boolean indicating if the unicode scalar is an ASCII punctuation char
• #isAsciiUppercase - Boolean indicating if the unicode scalar is an ASCII uppercase char
• #isAsciiWhitespace - Boolean indicating if the unicode scalar is an ASCII whitespace char
• #isCased - Boolean indicating whether the scalar is either lowercase, uppercase, or titlecase.
• #isControl - Boolean indicating whether the scalar has the general category for control codes
• #isLowercase - Boolean indicating if the unicode scalar has the unicode lowercase property
• #isNumeric - Boolean indicating whether the scalar has the general category for numbers
• #isTitlecase - Boolean indicating if the unicode scalar has the unicode titlecase property
• #isUppercase - Boolean indicating if the unicode scalar has the unicode uppercase property
• #isWhitespace - Boolean indicating if the unicode scalar has the unicode whitespace property
• #name - Human readable name of the scalar
Views
The following views are available for Unicode scalars. These can be found in the Views category on the instance side.
• #utf8 - UTF-8 encoded view of the unicode scalar value
• #utf16 - UTF-16 platform-endian encoded view of the unicode scalar value
• #utf16LE - UTF-16 little-endian encoded view of the unicode scalar value
• #utf16BE - UTF-16 big-endian encoded view of the unicode scalar value
• #utf32 - UTF-32 encoded view of the unicode scalar value
• #utf32LE - UTF-32 little-endian encoded view of the unicode scalar value
• #utf32BE - UTF-32 big-endian encoded view of the unicode scalar value
Equality/Comparison
A <UnicodeScalar> should be compared to other objects using equality, not identity. The first 256 unicode scalars in Unicode (i.e., Latin-1) are cached in the latin1Scalars class instance variable. Because of this, any unicode scalar in the range [0, 255] will compare by identity, but this is not a property that should be used in code.
"Yes, this is true here..."
self assert: [97 asUnicodeScalar == 97 asUnicodeScalar].
"...but don't ever count on it being true everywhere"
self assert: [300 asUnicodeScalar ~~ 300 asUnicodeScalar].
"Equality/Compare"
self assert: [97 asUnicodeScalar = 97 asUnicodeScalar].
self assert: [97 asUnicodeScalar < 98 asUnicodeScalar].
self assert: [97 asUnicodeScalar <= 97 asUnicodeScalar].
self assert: [98 asUnicodeScalar >= 97 asUnicodeScalar].
self assert: [98 asUnicodeScalar > 97 asUnicodeScalar].
"Special compare method for [-1, 0, 1]"
self assert: [(97 asUnicodeScalar compareTo: 98 asUnicodeScalar) = -1].
self assert: [(97 asUnicodeScalar compareTo: 97 asUnicodeScalar) = 0].
self assert: [(98 asUnicodeScalar compareTo: 97 asUnicodeScalar) = 1].
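The reason identity happens to hold only for the first 256 scalars can be illustrated with a toy cache. The following Python sketch is not the VAST implementation; the names Scalar, _LATIN1_CACHE and scalar are hypothetical and exist only to show how a Latin-1 cache makes identity succeed for small values while equality is the only reliable comparison everywhere else.

```python
class Scalar:
    """Toy stand-in for a non-immediate scalar object (hypothetical, not VAST)."""

    __slots__ = ("value",)

    def __init__(self, value: int) -> None:
        self.value = value

    def __eq__(self, other: object) -> bool:
        # Equality compares the scalar value, not the object identity.
        return isinstance(other, Scalar) and self.value == other.value

    def __hash__(self) -> int:
        return hash(self.value)


# One shared instance per Latin-1 value, built once.
_LATIN1_CACHE = [Scalar(v) for v in range(256)]


def scalar(value: int) -> Scalar:
    """Answer the cached instance for Latin-1 values, a fresh object otherwise."""
    if 0 <= value < 256:
        return _LATIN1_CACHE[value]
    return Scalar(value)
```

With this sketch, scalar(97) is scalar(97) holds because both calls answer the cached object, while two calls to scalar(300) answer distinct objects that are equal but not identical.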
Conversion
Unicode Component
A <UnicodeScalar> can convert itself to any other unicode component, such as a <Grapheme> or <UnicodeString>.
self assert: [UnicodeScalar cr asGrapheme = Grapheme cr].
self assert: [UnicodeScalar cr asUnicodeString = UnicodeString cr].
UTF Encoding
A <UnicodeScalar> can convert itself to any of the UTF encodings. Platform-endian accessors are provided in the Conversion category on the instance side, but any endian encoding can be accomplished with views.
"Platform-endian"
self assert: [UnicodeScalar cr asUtf8 asByteArray = #[13]].
self assert: [UnicodeScalar cr asUtf16 = (Utf16 with: 13)].
self assert: [UnicodeScalar cr asUtf32 = (Utf32 with: 13)].
"Little/Big endian via views"
self assert: [UnicodeScalar cr utf16LE asByteArray = #[13 0]].
self assert: [UnicodeScalar cr utf16BE asByteArray = #[0 13]].
self assert: [UnicodeScalar cr utf32LE asByteArray = #[13 0 0 0]].
self assert: [UnicodeScalar cr utf32BE asByteArray = #[0 0 0 13]].
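The byte layouts in the examples above follow directly from the UTF encoding rules and can be cross-checked outside VAST. The Python sketch below uses only the standard codecs; the helper name utf_encodings is hypothetical and is not part of the VAST API.

```python
def utf_encodings(code_point: int) -> dict:
    """Answer the encoded bytes of one scalar, as lists of integers, per encoding."""
    text = chr(code_point)
    names = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")
    return {name: list(text.encode(name)) for name in names}


cr = utf_encodings(13)         # CARRIAGE RETURN, U+000D
e_acute = utf_encodings(233)   # LATIN SMALL LETTER E WITH ACUTE, U+00E9
```

The explicit little-endian and big-endian codecs are used so that the result does not depend on the platform and no byte order mark is emitted.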
Case Mapping
A <UnicodeScalar> can convert itself to its uppercase, lowercase and titlecase form. The result of these conversions is not a <UnicodeScalar>, but rather a <UnicodeString>. Case mapping can change the number of unicode scalars.
Lowercase
Consider calling #asLowercase on $A asUnicodeScalar, which is the unicode scalar for 'LATIN CAPITAL LETTER A'. With simple examples like this, it may not be obvious why the answer is a unicode string 'a' instead of $a asUnicodeScalar (LATIN SMALL LETTER A).
It becomes clear when we look at the unicode scalar 16r0130 asUnicodeScalar, which is the 'LATIN CAPITAL LETTER I WITH DOT ABOVE'. When this is lowercased, it becomes two unicode scalars 16r0069 (LATIN SMALL LETTER I) and 16r0307 (COMBINING DOT ABOVE). This can't be represented by a single <UnicodeScalar>, so we must represent it as a <UnicodeString>.
In the example below, we:
1. Convert 16r0130 unicode scalar to lowercase which answers a <UnicodeString>
2. Request the unicode scalars view on the lowercase unicode string which answers a <UnicodeScalarView>
3. Request the contents from the view which answers the <Array> of lowercase <UnicodeScalar>s
self assert: [
16r0130 asUnicodeScalar asLowercase unicodeScalars contents = { 16r0069 asUnicodeScalar. 16r0307 asUnicodeScalar }].
Uppercase
Similar examples occur for #asUppercase such as the German 'sharp S' 16rDF asUnicodeScalar. When this is uppercased, it becomes two unicode scalars 16r53 (LATIN CAPITAL LETTER S) and 16r53 (LATIN CAPITAL LETTER S).
self assert: [
16rDF asUnicodeScalar asUppercase unicodeScalars contents = { 16r53 asUnicodeScalar. 16r53 asUnicodeScalar }]
Titlecase
There are several unicode characters that require special handling when they are used as the initial character in a word. One example is 16rFB01 asUnicodeScalar (LATIN SMALL LIGATURE FI). When this is titlecased, it becomes two unicode scalars 16r46 (LATIN CAPITAL LETTER F) and 16r69 (LATIN SMALL LETTER I).
self assert: [
16rFB01 asUnicodeScalar asTitlecase unicodeScalars contents = { 16r46 asUnicodeScalar. 16r69 asUnicodeScalar }]
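The one-to-many case mappings described in this section come from the Unicode Standard itself rather than from anything VAST-specific, so they can be observed in any runtime that applies full case mapping. As a cross-check, this Python sketch uses the built-in string methods; the helper name scalar_values is hypothetical.

```python
def scalar_values(text: str) -> list:
    """Answer the scalar values of text as a list of integers."""
    return [ord(ch) for ch in text]


# LATIN CAPITAL LETTER I WITH DOT ABOVE, lowercased
lowered = scalar_values("\u0130".lower())
# LATIN SMALL LETTER SHARP S, uppercased
uppered = scalar_values("\u00DF".upper())
# LATIN SMALL LIGATURE FI, titlecased
titled = scalar_values("\uFB01".title())
```

Each result contains two scalar values, which is why a single scalar cannot be the answer of a case conversion.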
Character Compatibility
A <UnicodeScalar> can be directly compared with a <Character> object. This is possible because the Smalltalk primitives that implement unicode have general awareness of <Character> and try to work with them where possible.
This compatibility is carried out in three different ways.
• Primitives: The virtual machine primitives can quickly detect and convert a <Character> if it is in the 7-bit ASCII range. A character outside this range represents a value from a particular code page; there are many code pages, and they can differ wildly in the upper 128 values. Because code page conversion is required in this case, a primitive failure is triggered.
• Primitive Fail Handlers: The primitive failure handlers in Smalltalk detect whether the argument was a <Character>. If so, they code-page convert the character to a <UnicodeScalar> and retry the primitive call (this time with a unicode scalar argument).
• Smalltalk Methods: Compatibility methods are provided in the Compatibility category of methods.
Important Note
• The compatibility relationship is unidirectional. A <Character> does not have direct knowledge of <UnicodeScalar>.
• A <UnicodeScalar> is NOT an immediate object like <Character>. Do not compare it using identity (==).
Here are the various ways that a <UnicodeScalar> and a <Character> can work together.
"="
self assert: [97 asUnicodeScalar = 97 asCharacter].
"<"
self assert: [97 asUnicodeScalar < 98 asCharacter].
"<="
self assert: [97 asUnicodeScalar <= 97 asCharacter].
">"
self assert: [98 asUnicodeScalar > 97 asCharacter].
">="
self assert: [98 asUnicodeScalar >= 97 asCharacter].
"hash"
self assert: [97 asUnicodeScalar hash = 97 asCharacter hash].
"hash but not equal if outside 7-bit ASCII and code page differs from Latin-1
This example assumes a code page like Windows-1252 (which only varies from
Latin-1 in the range 128-159)"
| s c |
s := 159 asUnicodeScalar. "159 unicode code point"
c := 159 asCharacter. "159 double-byte char value"
self assert: [s hash = c hash].
self assert: [s ~= c].
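Why value 159 is not equal across the two worlds can be seen by decoding the same byte under both interpretations. The Python sketch below uses the standard cp1252 and latin-1 codecs and is independent of VAST; the variable names are illustrative only.

```python
raw = bytes([159])

# Interpreted as a Windows-1252 code page character.
as_cp1252 = raw.decode("cp1252")
# Interpreted as a Unicode code point in the Latin-1 range.
as_latin1 = raw.decode("latin-1")

# In the 7-bit ASCII range both interpretations agree.
ascii_agrees = bytes([97]).decode("cp1252") == bytes([97]).decode("latin-1")
```

Under Windows-1252, byte 159 maps to U+0178 (LATIN CAPITAL LETTER Y WITH DIAERESIS), while code point 159 in Unicode is a C1 control code, so the two values denote different scalars.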
Class Methods
allDo:
Loop through all valid unicode scalars performing @aBlock for each.
The set of Unicode scalars is all code points in the code space except
for the high-surrogate and low-surrogate ranges.
Example:
| lcWhoseUcIsMultipleScalars |
'All lowercase scalars that turn into multiple scalars when uppercased'.
lcWhoseUcIsMultipleScalars := OrderedCollection new.
UnicodeScalar allDo: [:s | (s isLowercase and: [s asUppercase size > 1]) ifTrue: [lcWhoseUcIsMultipleScalars add: s]].
self assert: [lcWhoseUcIsMultipleScalars anySatisfy: [:s | s name = 'LATIN SMALL LETTER SHARP S']].
Arguments:
aBlock - <Block> 1-arg block
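The same enumeration can be sketched outside VAST to cross-check the example. This Python version is only an illustration of the idea; the generator name all_scalars is hypothetical, and Python's islower and upper are assumed to track the Unicode Lowercase property and full uppercase mapping closely enough for this purpose.

```python
def all_scalars():
    """Yield every Unicode scalar value: all code points minus the surrogates."""
    for code_point in range(0x110000):
        if not 0xD800 <= code_point <= 0xDFFF:
            yield code_point


# Lowercase scalars whose uppercase form is more than one scalar.
multi_upper = [
    cp for cp in all_scalars()
    if chr(cp).islower() and len(chr(cp).upper()) > 1
]
```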
backspace
Answer the unicode scalar for a backspace.
Answers:
<UnicodeScalar>
codePoint:
Answer a new unicode scalar with the value @anObject.
Examples:
#'From unicode scalar value'.
self assert: [(UnicodeScalar codePoint: 16r1F600) value = 16r1F600].
Arguments:
anObject - <Integer> Unicode scalar
Answers:
<UnicodeScalar>
Raises:
<Exception> EsPrimErrValueOutOfRange if anObject cannot be converted to a UnicodeScalar
cr
Answer the unicode scalar for a carriage return.
Answers:
<UnicodeScalar>
escape
Answer the unicode scalar for an escape.
Answers:
<UnicodeScalar>
escaped:
Create a new unicode scalar from @aStringObject.
Escaped Strings:
If @aStringObject is a String or UnicodeString, then the following escapes
will be parsed:
Escapes:
\x53 7-bit character code (exactly 2 digits, up to 0x7F)
\u{1F600} 24-bit Unicode character code (up to 6 digits)
\n Newline (This is the Lf character)
\r Carriage return (This is the Cr character)
\t Tab
\\ Backslash
\0 Nul
Examples:
#'From a single element string object'.
self assert: [(UnicodeScalar value: '\x53') = $S asUnicodeScalar].
self assert: [(UnicodeScalar value: '\u{1F600}') name = 'GRINNING FACE'].
self assert: [(UnicodeScalar value: '\t') = UnicodeScalar tab].
self assert: [(UnicodeScalar value: '\n') = UnicodeScalar lf].
Arguments:
aStringObject - <UnicodeString> unicode string containing escape characters
(Compat) <String> Smalltalk code-page string containing escape characters (requires conversion if outside ascii range)
Answers:
<UnicodeScalar>
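The escape table above can be read as a small grammar. The following Python sketch shows one way such a parser could look; it is not the VAST implementation, the name parse_escaped is hypothetical, and raising ValueError for unsupported input is an assumption made for the sketch.

```python
import re

# Single-letter escapes from the table: \n \r \t \\ \0
_SIMPLE = {"n": 0x0A, "r": 0x0D, "t": 0x09, "\\": 0x5C, "0": 0x00}


def parse_escaped(text: str) -> int:
    """Answer the scalar value described by one escape (or one plain character)."""
    match = re.fullmatch(r"\\x([0-9A-Fa-f]{2})", text)
    if match:
        value = int(match.group(1), 16)
        if value > 0x7F:
            raise ValueError("\\x escapes are limited to 7-bit values")
        return value
    match = re.fullmatch(r"\\u\{([0-9A-Fa-f]{1,6})\}", text)
    if match:
        value = int(match.group(1), 16)
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            raise ValueError("not a Unicode scalar value")
        return value
    if len(text) == 2 and text[0] == "\\" and text[1] in _SIMPLE:
        return _SIMPLE[text[1]]
    if len(text) == 1:
        return ord(text)
    raise ValueError("unsupported escape: %r" % text)
```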
lf
Answer the unicode scalar containing a line feed.
Answers:
<UnicodeScalar>
newPage
Answer the unicode scalar for a new page.
Answers:
<UnicodeScalar>
replacementCharacter
Answer the unicode scalar for the unicode
replacement character.
Answers:
<UnicodeScalar>
space
Answer the unicode scalar for space
Answers:
<UnicodeScalar>
tab
Answer the unicode scalar for tab
Answers:
<UnicodeScalar>
utf16:
Answer the unicode scalar constructed from @aByteObject which should be UTF-16 platform-endian encoded data.
@aByteObject is validated before any attempt is made to create a unicode scalar from its data.
Examples:
self assert: [(UnicodeScalar utf16: 'a' utf16 contents) = $a].
Arguments:
aByteObject - <String | ByteArray> or byte shaped object
Answers:
<UnicodeScalar>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-16
utf16BE:
Answer the unicode scalar constructed from @aByteObject which should be UTF-16 big-endian encoded data.
Examples:
self assert: [(UnicodeScalar utf16BE: #[0 97]) = $a].
Arguments:
aByteObject - <String | ByteArray>
Answers:
<UnicodeScalar>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-16BE
utf16LE:
Answer the unicode scalar constructed from @aByteObject which should be UTF-16 little-endian encoded data.
Examples:
self assert: [(UnicodeScalar utf16LE: #[97 0]) = $a].
Arguments:
aByteObject - <String | ByteArray>
Answers:
<UnicodeScalar>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-16LE
utf32:
Answer the unicode scalar constructed from @aByteObject which should be UTF-32 platform-endian encoded data.
@aByteObject is validated before any attempt is made to create a unicode scalar from its data.
Examples:
self assert: [(UnicodeScalar utf32: 'a' utf32 contents) = $a].
Arguments:
aByteObject - <String | ByteArray | Utf32> or byte shaped object
Answers:
<UnicodeScalar>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-32
utf32BE:
Answer the unicode scalar constructed from @aByteObject which should be UTF-32 big-endian encoded data.
Examples:
self assert: [(UnicodeScalar utf32BE: #[0 0 0 97]) = $a].
Arguments:
aByteObject - <String | ByteArray | Utf32>
Answers:
<UnicodeScalar>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-32BE
utf32LE:
Answer the unicode scalar constructed from @aByteObject which should be UTF-32 little-endian encoded data.
Examples:
self assert: [(UnicodeScalar utf32LE: #[97 0 0 0]) = $a].
Arguments:
aByteObject - <String | ByteArray>
Answers:
<UnicodeScalar>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-32LE
utf8:
Answer the unicode scalar constructed from @aByteObject which should be UTF-8 encoded data.
@aByteObject is validated before any attempt is made to create a unicode scalar from it.
Examples:
self assert: [(UnicodeScalar utf8: #[97]) = $a]
Arguments:
aByteObject - <String | ByteArray>
Answers:
<UnicodeScalar>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-8
value:
Answer a new unicode scalar with the value described by @anObject.
Escaped Strings:
If @anObject is a String or UnicodeString, then the following escapes
will be parsed to create the unicode scalar:
Escapes:
\x53 7-bit character code (exactly 2 digits, up to 0x7F)
\u{1F600} 24-bit Unicode character code (up to 6 digits)
\n Newline (This is the Lf character)
\r Carriage return (This is the Cr character)
\t Tab
\\ Backslash
\0 Nul
Examples:
#'From unicode scalar value'.
self assert: [(UnicodeScalar value: 16r1F600) value = 16r1F600].
#'From a Character object'.
self assert: [(UnicodeScalar value: $a) value = $a value].
#'From a single element string object'.
self assert: [(UnicodeScalar value: '\x53') = $S asUnicodeScalar].
self assert: [(UnicodeScalar value: '\u{1F600}') name = 'GRINNING FACE'].
self assert: [(UnicodeScalar value: '\t') = UnicodeScalar tab].
self assert: [(UnicodeScalar value: '\n') = UnicodeScalar lf].
Arguments:
anObject - <Integer> Unicode code point
<UnicodeString> unicode string containing escape characters
(Compat) <Character> Smalltalk code-page character (requires conversion if outside ascii range)
(Compat) <String> Smalltalk code-page string containing escape characters (requires conversion if outside ascii range)
Answers:
<UnicodeScalar>
Raises:
<Exception> EsPrimErrValueOutOfRange if anObject cannot be converted to a UnicodeScalar
Instance Methods
<
Answer a Boolean indicating true if the receiver is less
than aUnicodeScalar; answer false otherwise
Arguments:
aUnicodeScalar - <UnicodeScalar> or <Character> for compatibility
Answers:
<Boolean>
<=
Answer a Boolean indicating true if the receiver is less than or equal
to aUnicodeScalar; answer false otherwise
Arguments:
aUnicodeScalar - <UnicodeScalar> or <Character> for compatibility
Answers:
<Boolean>
=
Answer a Boolean indicating true if the receiver is equal
to aUnicodeScalar; answer false otherwise.
Examples:
self assert: [UnicodeScalar cr = UnicodeScalar cr].
self assert: [UnicodeScalar cr = Character cr].
Arguments:
aUnicodeScalar - <UnicodeScalar> or <Character> for compatibility
>
Answer a Boolean indicating true if the receiver is greater
than aUnicodeScalar; answer false otherwise
Arguments:
aUnicodeScalar - <UnicodeScalar> or <Character> for compatibility
Answers:
<Boolean>
>=
Answer a Boolean indicating true if the receiver is greater than or equal
to aUnicodeScalar; answer false otherwise
Arguments:
aUnicodeScalar - <UnicodeScalar> or <Character> for compatibility
Answers:
<Boolean>
asGrapheme
Convert unicode scalar to a <Grapheme>
Examples:
self assert: [UnicodeScalar cr asGrapheme = Grapheme cr]
Answers:
<Grapheme>
asLowercase
Answers a lowercased version of this unicode scalar.
Case conversion can result in multiple scalars or graphemes,
therefore the result must be expressed as a UnicodeString.
For example, the character 'İ' (16r0130 asUnicodeScalar LATIN CAPITAL LETTER I WITH DOT ABOVE)
becomes two scalars (16r0069 LATIN SMALL LETTER I, 16r0307 COMBINING DOT ABOVE)
when converted to lowercase (but still a single grapheme).
Examples:
self assert: [$A asUnicodeScalar asLowercase = 'a' asUnicodeString].
self assert: [16r0130 asUnicodeScalar name = 'LATIN CAPITAL LETTER I WITH DOT ABOVE'].
self assert: [16r0130 asUnicodeScalar asLowercase = (UnicodeString value: { 16r0069 asUnicodeScalar. 16r0307 asUnicodeScalar. })].
Answers:
<UnicodeString>
asTitlecase
Answers a titlecased version of this unicode scalar.
Case conversion can result in multiple scalars or graphemes,
therefore the result must be expressed as a UnicodeString.
For example, the ligature 'ﬁ' (16rFB01 LATIN SMALL LIGATURE FI)
becomes 'Fi' (16r0046 LATIN CAPITAL LETTER F, 16r0069 LATIN SMALL LETTER I)
when converted to titlecase.
Examples:
self assert: [$a asUnicodeScalar asTitlecase = 'A' asUnicodeString].
self assert: [16rFB01 asUnicodeScalar name = 'LATIN SMALL LIGATURE FI'].
self assert: [16rFB01 asUnicodeScalar asTitlecase = (UnicodeString value: { 16r46 asUnicodeScalar. 16r69 asUnicodeScalar. })].
Answers:
<UnicodeString>
asUnicodeScalar
Answer self.
Answers:
<UnicodeScalar>
asUnicodeString
Answer the receiver as a <UnicodeString> instance.
Examples:
self assert: [UnicodeScalar cr asUnicodeString = UnicodeString cr]
Answers:
<UnicodeString>
asUppercase
Answers an uppercased version of this unicode scalar.
Case conversion can result in multiple scalars or graphemes,
therefore the result must be expressed as a UnicodeString.
For example, the German letter 'ß' becomes 'SS' when converted
to uppercase.
Examples:
self assert: [$a asUnicodeScalar asUppercase = 'A' asUnicodeString].
self assert: [16rDF asUnicodeScalar name = 'LATIN SMALL LETTER SHARP S'].
self assert: [16rDF asUnicodeScalar asUppercase = (UnicodeString value: { 16r53 asUnicodeScalar. 16r53 asUnicodeScalar. })].
Answers:
<UnicodeString>
asUtf16
Answer a <Utf16> that contains the utf-16 encoded bytes of the receiver.
Example:
self assert: [233 asUnicodeScalar asUtf16 = (Utf16 with: 233)]
Answers:
<Utf16>
asUtf32
Answer a <Utf32> that contains the utf-32 encoded bytes of the receiver.
Example:
self assert: [233 asUnicodeScalar asUtf32 = (Utf32 with: 233)]
Answers:
<Utf32>
asUtf8
Answer a <Utf8> that contains the utf-8 encoded bytes of the receiver.
Example:
self assert: [233 asUnicodeScalar asUtf8 = (Utf8 with: 195 with: 169)]
Answers:
<Utf8>
canonicalCombiningClass
Answers the canonical combining class identifier for the receiver.
This property corresponds to the 'Canonical Combining Class' property in
the Unicode Standard.
@see UnicodeCanonicalCombiningClasses pragma for some
predefined classes.
Examples:
self assert: [$A asUnicodeScalar canonicalCombiningClass = UnicodeCanonicalCombiningClasses::NotReordered]
Answers:
<Integer>
codePoint
Compatibility: Answer the unicode scalar value as an Integer
Answers:
<Integer>
compareTo:
Orders the receiver relative to @aUnicodeScalar.
Both the receiver and @aUnicodeScalar will be guaranteed to have the same normalization
form before the comparison is made.
Fail if @aUnicodeScalar is not a convertible <UnicodeScalar> object.
Examples:
self assert: [(97 asUnicodeScalar compareTo: 98 asUnicodeScalar) = -1].
self assert: [(97 asUnicodeScalar compareTo: 97 asUnicodeScalar) = 0].
self assert: [(98 asUnicodeScalar compareTo: 97 asUnicodeScalar) = 1].
Arguments:
aUnicodeScalar - <UnicodeScalar> or <Character> for compatibility
Answers:
<Integer> -1 The receiver is less than @aUnicodeScalar
0 The receiver is equal to @aUnicodeScalar
1 The receiver is greater than @aUnicodeScalar
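The three-way result can be expressed compactly in terms of the ordinary comparisons. This Python sketch shows only the ordering by scalar value; the function name compare_to is hypothetical, and it operates on plain integers rather than on VAST objects.

```python
def compare_to(receiver: int, argument: int) -> int:
    """Answer -1, 0 or 1 when receiver is less than, equal to or greater than argument."""
    # In Python, True and False behave as 1 and 0 in arithmetic.
    return (receiver > argument) - (receiver < argument)
```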
escaped
Answer a copy of the receiver that has been escaped using the following
rules.
Escaped Strings:
Tab is escaped as `\t`.
Carriage return is escaped as `\r`.
Line feed is escaped as `\n`.
Backslash is escaped as `\\`.
Any character in the 'printable ASCII' range `16r20` .. `16r7E` inclusive is not escaped.
All other characters are given hexadecimal Unicode escapes `\u{NNNNNN}` where
`NNNNNN` is a hexadecimal uppercase representation
Example:
self assert: [UnicodeScalar tab escaped = '\t'].
self assert: [UnicodeScalar lf escaped = '\n'].
self assert: [$a asUnicodeScalar escaped = 'a'].
self assert: [$\ asUnicodeScalar escaped = '\\'].
self assert: [0 asUnicodeScalar escaped = '\u{0}'].
self assert: [16r1F37A asUnicodeScalar escaped = '\u{1F37A}'].
Answers:
<UnicodeString>
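The escaping rules listed for this method are simple enough to restate as code. The Python sketch below follows the rules and examples given above; it is not the VAST implementation, and the name escape_scalar is hypothetical.

```python
def escape_scalar(code_point: int) -> str:
    """Answer the escaped form of one scalar value, following the rules above."""
    simple = {0x09: "\\t", 0x0D: "\\r", 0x0A: "\\n", 0x5C: "\\\\"}
    if code_point in simple:
        return simple[code_point]
    # Printable ASCII, 16r20 through 16r7E inclusive, is left as is.
    if 0x20 <= code_point <= 0x7E:
        return chr(code_point)
    # Everything else becomes an uppercase hexadecimal escape without padding.
    return "\\u{%X}" % code_point
```

The backslash must be checked before the printable ASCII range, because 16r5C lies inside that range but still needs escaping.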
generalCategory
Answer the general classification of the receiver.
This is its 'first-order, most usual categorization'
For more information about the General_Category property,
see Chapter 4, Character Properties in the Unicode Standard.
@see UnicodeGeneralCategories pragma for categories.
Examples:
self assert: [$A asUnicodeScalar generalCategory = UnicodeGeneralCategories::UppercaseLetter]
Answers:
<Integer>
isAlphabetic
Answers true if the receiver has the Alphabetic property as described by
Chapter 4 (Character Properties) of the Unicode Standard.
Examples:
self assert: [$A asUnicodeScalar isAlphabetic].
self assert: [$5 asUnicodeScalar isAlphabetic not].
Answers:
<Boolean>
isAlphaNumeric
Answers true if the receiver isAlphabetic or isNumeric.
Examples:
self assert: [$A asUnicodeScalar isAlphaNumeric].
self assert: [$5 asUnicodeScalar isAlphaNumeric].
self assert: [$! asUnicodeScalar isAlphaNumeric not].
Answers:
<Boolean>
isAscii
Answers true if the receiver is within the ASCII range.
Examples:
self assert: [$A asUnicodeScalar isAscii].
self assert: [233 asUnicodeScalar isAscii not].
Answers:
<Boolean>
isAsciiAlphabetic
Answers true if the receiver is an ASCII alphabetic character.
U+0041 'A' ..= U+005A 'Z', or
U+0061 'a' ..= U+007A 'z'.
Examples:
self assert: [$A asUnicodeScalar isAsciiAlphabetic].
self assert: [233 asUnicodeScalar isAsciiAlphabetic not].
Answers:
<Boolean>
isAsciiAlphaNumeric
Answers true if the receiver is an ASCII alphanumeric character:
U+0041 'A' ..= U+005A 'Z', or
U+0061 'a' ..= U+007A 'z', or
U+0030 '0' ..= U+0039 '9'.
Examples:
self assert: [$A asUnicodeScalar isAsciiAlphaNumeric].
self assert: [$5 asUnicodeScalar isAsciiAlphaNumeric].
self assert: [233 asUnicodeScalar isAsciiAlphaNumeric not].
Answers:
<Boolean>
isAsciiControl
Answers true if the receiver is an ASCII control character:
U+0000 NUL ..= U+001F UNIT SEPARATOR, or
U+007F DELETE.
Note that most ASCII whitespace characters are control characters, but SPACE is not.
Examples:
self assert: [UnicodeScalar cr isAsciiControl].
self assert: [UnicodeScalar space isAsciiControl not].
Answers:
<Boolean>
isAsciiDigit
Answers true if the receiver is an ASCII decimal digit:
U+0030 '0' ..= U+0039 '9'.
Examples:
self assert: [$5 asUnicodeScalar isAsciiDigit].
self assert: [UnicodeScalar space isAsciiDigit not].
Answers:
<Boolean>
isAsciiGraphic
Answers true if the receiver is an ASCII graphic character:
U+0021 '!' ..= U+007E '~'.
Examples:
self assert: [$! asUnicodeScalar isAsciiGraphic].
self assert: [16r9 asUnicodeScalar isAsciiGraphic not].
Answers:
<Boolean>
isAsciiHexDigit
Answers true if the receiver is an ASCII hexadecimal digit:
U+0030 '0' ..= U+0039 '9', or
U+0041 'A' ..= U+0046 'F', or
U+0061 'a' ..= U+0066 'f'.
Examples:
self assert: [$A asUnicodeScalar isAsciiHexDigit].
self assert: [$G asUnicodeScalar isAsciiHexDigit not].
Answers:
<Boolean>
isAsciiLowercase
Answers true if the receiver is an ASCII lowercase character:
U+0061 'a' ..= U+007A 'z'.
Examples:
self assert: [$a asUnicodeScalar isAsciiLowercase].
self assert: [$A asUnicodeScalar isAsciiLowercase not].
Answers:
<Boolean>
isAsciiPunctuation
Answers true if the receiver is an ASCII punctuation character:
U+0021 ..= U+002F ! <quote> # $ % & ' ( ) * + , - . /, or
U+003A ..= U+0040 : ; < = > ? @, or
U+005B ..= U+0060 [ \ ] ^ _ ` , or
U+007B ..= U+007E { | } ~
Examples:
self assert: [$! asUnicodeScalar isAsciiPunctuation].
self assert: [$a asUnicodeScalar isAsciiPunctuation not].
Answers:
<Boolean>
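The four ranges listed for ASCII punctuation can be checked mechanically. This Python sketch restates them as a predicate and compares the result with the standard library's string.punctuation constant; the function name is_ascii_punctuation is hypothetical and the code does not use the VAST API.

```python
import string


def is_ascii_punctuation(code_point: int) -> bool:
    """Answer True if code_point falls in one of the four ASCII punctuation ranges."""
    return (
        0x21 <= code_point <= 0x2F
        or 0x3A <= code_point <= 0x40
        or 0x5B <= code_point <= 0x60
        or 0x7B <= code_point <= 0x7E
    )


# All 7-bit values accepted by the predicate, as characters.
accepted = {chr(cp) for cp in range(128) if is_ascii_punctuation(cp)}
```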
isAsciiUppercase
Answers true if the receiver is an ASCII uppercase character:
U+0041 'A' ..= U+005A 'Z'.
Examples:
self assert: [$A asUnicodeScalar isAsciiUppercase].
self assert: [$a asUnicodeScalar isAsciiUppercase not].
Answers:
<Boolean>
isAsciiWhitespace
Answers true if the receiver is an ASCII whitespace character:
U+0020 SPACE,
U+0009 HORIZONTAL TAB,
U+000A LINE FEED,
U+000C FORM FEED, or
U+000D CARRIAGE RETURN.
Note: This uses the WhatWG Infra Standard's definition of ASCII whitespace.
Examples:
self assert: [UnicodeScalar space isAsciiWhitespace].
self assert: [$a asUnicodeScalar isAsciiWhitespace not].
Answers:
<Boolean>
isCased
Answers true if the receiver is considered to be either lowercase, uppercase or titlecase
as described by Chapter 4 (Character Properties) of the Unicode Standard.
Examples:
self assert: [$a asUnicodeScalar isCased].
self assert: [UnicodeScalar space isCased not].
Answers:
<Boolean>
isControl
Answers true if the receiver has the general category for control codes as described by
Chapter 4 (Character Properties) of the Unicode Standard.
Examples:
self assert: [UnicodeScalar lf isControl].
self assert: [UnicodeScalar space isControl not].
Answers:
<Boolean>
isDigit
Answer true if the receiver is a valid Smalltalk digit as described in
the ANSI Smalltalk Standard; otherwise answer false.
digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
Read #isSmalltalkDigit for more details.
Answers:
<Boolean>
isLowercase
Answers true if the receiver has the Lowercase property as described by
Chapter 4 (Character Properties) of the Unicode Standard.
Examples:
self assert: [16rE2 asUnicodeScalar name = 'LATIN SMALL LETTER A WITH CIRCUMFLEX'].
self assert: [16rE2 asUnicodeScalar isLowercase].
self assert: [16rC5 asUnicodeScalar name = 'LATIN CAPITAL LETTER A WITH RING ABOVE'].
self assert: [16rC5 asUnicodeScalar isLowercase not].
Answers:
<Boolean>
isNumeric
Answers true if the receiver has one of the general categories for numbers.
Examples:
self assert: [$3 asUnicodeScalar isNumeric].
self assert: [16r2070 asUnicodeScalar name = 'SUPERSCRIPT ZERO'].
self assert: [16r2070 asUnicodeScalar isNumeric].
self assert: [16r1F40 asUnicodeScalar name = 'GREEK SMALL LETTER OMICRON WITH PSILI'].
self assert: [16r1F40 asUnicodeScalar isNumeric not].
Answers:
<Boolean>
isSeparator
Compatibility: Captures a superset of Character>>isSeparator
isSmalltalkAlphaNumeric
Answer true if the receiver is a valid Smalltalk letter or digit, false otherwise
Answers:
<Boolean>
isSmalltalkDigit
Read #isDigit for more details
Answers:
<Boolean>
isSmalltalkLetter
Answer true if the receiver is a valid Smalltalk letter as described in the ANSI Smalltalk Standard; otherwise answer false.
letter ::= uppercaseAlphabetic | lowercaseAlphabetic | nonCaseLetter
uppercaseAlphabetic ::= 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S'| 'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z'
lowercaseAlphabetic ::= 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z'
nonCaseLetter ::= '_'
It would be simpler to send #isLetter, but some country codes include characters that are classified as letters
yet are not valid Smalltalk syntactic letters. The nonCaseLetter ('_') must also be accepted.
Answers:
<Boolean>
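The difference between a general letter test and the grammar above can be sketched outside VAST. In the Python illustration below, `is_smalltalk_letter` is a hypothetical helper that accepts exactly the 52 ASCII letters plus the underscore, while `str.isalpha` stands in for a broader "is this a letter" test.

```python
import string

# Exactly the characters produced by the grammar above:
# uppercaseAlphabetic | lowercaseAlphabetic | nonCaseLetter
SMALLTALK_LETTERS = frozenset(string.ascii_letters + "_")


def is_smalltalk_letter(ch):
    """Answer True if ch is a letter under the ANSI Smalltalk grammar."""
    return ch in SMALLTALK_LETTERS


for ch in ("i", "_", "\u00e2", "3"):
    print(repr(ch), "isalpha:", ch.isalpha(),
          "smalltalk letter:", is_smalltalk_letter(ch))
```

LATIN SMALL LETTER A WITH CIRCUMFLEX passes the broad letter test but is rejected by the grammar, and the underscore behaves the other way around.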
isTitlecase
Answers true if the receiver has the Titlecase property as described by
Chapter 4 (Character Properties) of the Unicode Standard.
Examples:
self assert: [16r01C8 asUnicodeScalar name = 'LATIN CAPITAL LETTER L WITH SMALL LETTER J'].
self assert: [16r01C8 asUnicodeScalar isTitlecase].
self assert: [$a asUnicodeScalar isTitlecase not].
Answers:
<Boolean>
isUnicodeScalar
Answer true if the receiver is a Unicode scalar, false otherwise.
Answers:
<Boolean>
isUppercase
Answers true if the receiver has the Uppercase property as described by
Chapter 4 (Character Properties) of the Unicode Standard.
Examples:
self assert: [16rC5 asUnicodeScalar name = 'LATIN CAPITAL LETTER A WITH RING ABOVE'].
self assert: [16rC5 asUnicodeScalar isUppercase].
self assert: [16rE2 asUnicodeScalar name = 'LATIN SMALL LETTER A WITH CIRCUMFLEX'].
self assert: [16rE2 asUnicodeScalar isUppercase not].
Answers:
<Boolean>
isWhitespace
Answers true if the receiver has the White_Space property
from the Unicode Character Database.
Examples:
self assert: [16r1680 asUnicodeScalar name = 'OGHAM SPACE MARK'].
self assert: [16r1680 asUnicodeScalar isWhitespace].
self assert: [16r1F40 asUnicodeScalar name = 'GREEK SMALL LETTER OMICRON WITH PSILI'].
self assert: [16r1F40 asUnicodeScalar isWhitespace not].
Answers:
<Boolean>
name
Answers a <UnicodeString> containing the human-readable name of the scalar.
Examples:
self assert: [16r388 asUnicodeScalar name = 'GREEK CAPITAL LETTER EPSILON WITH TONOS']
Answers:
<UnicodeString>
sameAs:
Answer whether the receiver is equal to aUnicodeScalar, ignoring case.
Arguments:
aUnicodeScalar - <UnicodeScalar> or <Character> for compatibility
Answers:
<Boolean>
utf16
Answer the UTF-16 view of the receiver in platform (native) byte order.
Each element in this view is a UTF-16 code unit. UTF-16 is a 16-bit
encoding form for Unicode scalar values.
Example:
| view |
'MUSICAL NOTE - U+1F3B5'.
view := 16r1F3B5 asUnicodeScalar utf16.
self assert: [view size = 2].
self assert: [view next = 55356].
self assert: [view next = 57269].
self assert: [view atEnd]
Answers:
<Utf16View>
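The view in the example has two elements because U+1F3B5 lies above U+FFFF and is therefore encoded as a surrogate pair. The Python sketch below reproduces the arithmetic defined by the Unicode Standard; the function name `utf16_code_units` is hypothetical and the code is independent of VAST.

```python
def utf16_code_units(scalar):
    """Answer the UTF-16 code units (as integers) for a Unicode scalar value."""
    if scalar < 0 or scalar > 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if scalar < 0x10000:
        # Basic Multilingual Plane: one code unit, equal to the scalar value.
        return [scalar]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = scalar - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return [high, low]


print(utf16_code_units(0x1F3B5))  # [55356, 57269]
print(utf16_code_units(0xE9))     # [233]
```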
utf16BE
Answer the UTF-16 big-endian view of the receiver.
Each element in this view is a UTF-16 code unit. UTF-16 is a 16-bit
encoding form for Unicode scalar values.
Example:
| view |
'MUSICAL NOTE - U+1F3B5'.
view := 16r1F3B5 asUnicodeScalar utf16BE.
self assert: [view size = 2].
self assert: [view next = 15576].
self assert: [view next = 46559].
self assert: [view atEnd]
Answers:
<Utf16BigEndianView>
utf16LE
Answer the UTF-16 little-endian view of the receiver.
Each element in this view is a UTF-16 code unit. UTF-16 is a 16-bit
encoding form for Unicode scalar values.
Example:
| view |
'MUSICAL NOTE - U+1F3B5'.
view := 16r1F3B5 asUnicodeScalar utf16LE.
self assert: [view size = 2].
self assert: [view next = 55356].
self assert: [view next = 57269].
self assert: [view atEnd]
Answers:
<Utf16LittleEndianView>
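The elements in the utf16BE example (15576 and 46559) are the utf16LE elements (55356 and 57269) with their two bytes swapped. This is what one would see if the examples were evaluated on a little-endian machine and each element were the stored 16-bit unit read in native byte order. That reading is an inference from the numbers, not something the text states. The Python sketch below reproduces the values with the standard `struct` module.

```python
import struct

note = "\U0001F3B5"  # MUSICAL NOTE

be_bytes = note.encode("utf-16-be")  # bytes D8 3C DF B5
le_bytes = note.encode("utf-16-le")  # bytes 3C D8 B5 DF

# Interpret both byte sequences as little-endian 16-bit integers,
# as a little-endian machine would when reading them natively.
be_units_read_le = struct.unpack("<2H", be_bytes)
le_units_read_le = struct.unpack("<2H", le_bytes)

print(be_units_read_le)  # (15576, 46559)
print(le_units_read_le)  # (55356, 57269)
```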
utf32
Answer the UTF-32 view of the receiver in platform (native) byte order.
Each element in this view is a UTF-32 code unit. UTF-32 is a 32-bit
encoding form for Unicode scalar values.
Example:
| view |
'MUSICAL NOTE - U+1F3B5'.
view := 16r1F3B5 asUnicodeScalar utf32.
self assert: [view size = 1].
self assert: [view next = 16r1F3B5].
self assert: [view atEnd]
Answers:
<Utf32View>
utf32BE
Answer the UTF-32 big-endian view of the receiver.
Each element in this view is a UTF-32 code unit. UTF-32 is a 32-bit
encoding form for Unicode scalar values.
Example:
| view |
'MUSICAL NOTE - U+1F3B5'.
view := 16r1F3B5 asUnicodeScalar utf32BE.
self assert: [view size = 1].
self assert: [view next = 3052601600].
self assert: [view asByteArray = #[0 1 243 181]].
self assert: [view atEnd]
Answers:
<Utf32BigEndianView>
utf32LE
Answer the UTF-32 little-endian view of the receiver.
Each element in this view is a UTF-32 code unit. UTF-32 is a 32-bit
encoding form for Unicode scalar values.
Example:
| view |
'MUSICAL NOTE - U+1F3B5'.
view := 16r1F3B5 asUnicodeScalar utf32LE.
self assert: [view size = 1].
self assert: [view next = 127925].
self assert: [view asByteArray = #[181 243 1 0]].
self assert: [view atEnd]
Answers:
<Utf32LittleEndianView>
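The byte arrays and element values in the utf32BE and utf32LE examples can be cross-checked outside VAST. The Python sketch below builds both byte layouts for U+1F3B5 and reads each as a little-endian 32-bit integer. It assumes the examples were evaluated on a little-endian machine, which is the assumption under which the utf32BE element equals 3052601600.

```python
import struct

scalar = 0x1F3B5  # MUSICAL NOTE

be_bytes = scalar.to_bytes(4, "big")
le_bytes = scalar.to_bytes(4, "little")
print(list(be_bytes))  # [0, 1, 243, 181]
print(list(le_bytes))  # [181, 243, 1, 0]

# Read each 4-byte layout as one little-endian 32-bit unit.
(be_read_le,) = struct.unpack("<I", be_bytes)
(le_read_le,) = struct.unpack("<I", le_bytes)
print(be_read_le)  # 3052601600
print(le_read_le)  # 127925
```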
utf8
Answer the UTF-8 view of the receiver.
Each element in this view is a UTF-8 code unit. UTF-8 is an 8-bit
encoding form for Unicode scalar values.
Example:
| view |
'LATIN SMALL LETTER E WITH ACUTE'.
view := 233 asUnicodeScalar utf8.
self assert: [view size = 2].
self assert: [view next = 195].
self assert: [view next = 169].
self assert: [view atEnd]
Answers:
<Utf8View>
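The two elements in the example follow from the UTF-8 rule for scalars in the range U+0080 to U+07FF: the upper five bits go into a lead byte of the form 110xxxxx and the lower six bits into a trailing byte of the form 10xxxxxx. The Python sketch below shows that arithmetic for this two-byte range only; the function name `utf8_two_byte` is hypothetical.

```python
def utf8_two_byte(scalar):
    """Answer the two UTF-8 code units for a scalar in U+0080..U+07FF."""
    if not 0x80 <= scalar <= 0x7FF:
        raise ValueError("scalar is outside the two-byte UTF-8 range")
    lead = 0xC0 | (scalar >> 6)      # 110xxxxx: upper five bits
    trail = 0x80 | (scalar & 0x3F)   # 10xxxxxx: lower six bits
    return [lead, trail]


units = utf8_two_byte(233)  # LATIN SMALL LETTER E WITH ACUTE
print(units)  # [195, 169]

# Cross-check against Python's built-in encoder.
print(bytes(units) == "\u00e9".encode("utf-8"))  # True
```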
value
Answer the Unicode scalar value of the receiver as an <Integer>.
Answers:
<Integer>