UnicodeString
Description
A <UnicodeString> in VAST is an <AdditiveSequenceableCollection> of <Grapheme>s (a.k.a. user-perceived characters).
<UnicodeString> is designed with "Unicode-Correctness" in mind. The Unicode Standard does not specify how a Unicode String model should be implemented. Instead, the Unicode Standard should be thought of as an open toolkit with lots of definitions and rules relating to human writing systems which contain lots of contradictions and edge cases. The <UnicodeString> is designed to help you implement international support in your applications safely and correctly.
A <UnicodeString> is also designed to be as high-performance as possible, without sacrificing correctness. The VAST virtual machine contains several Unicode related primitives that do the heavy algorithmic lifting to give maximum performance. The following are some performance properties that make <UnicodeString> fast and memory efficient.
• Copy-On-Write: The storage of a <UnicodeString> can be shared (or partially shared) by many Unicode strings in the system.
• Performance Flags: Special bit flags, such as isAscii, isNFC and isSingleScalar, that lead Unicode strings to faster primitive code paths.
• SIMD: SIMD instructions are used for ASCII and UTF-8 validation, as well as string searching.
• Optimized Storage: At the ABI level, a <UnicodeString> is backed by compact valid UTF-8 bytes.
• C Function FFI: Zero-cost since the storage is backed by valid UTF-8 null-terminated bytes. We simply pass a pointer.
A <UnicodeString> is not locale sensitive. This means that comparisons are stable and performant. While locale is an important consideration, it is outside the scope of what the base Unicode Support provides.
Instance State
• storage: ABI object managed by the virtual machine primitives. This has the following forms:
  <EsUnicodeStorage> for storage owners or copies that share a view of the whole storage with the owner
  <EsUnicodeSlice> for a subview of the whole storage
• length: <Integer> content length
Creation
A <UnicodeString> has a rich collection of creational methods making it easy to create a Unicode string from many types of input sources.
The following is how you create an empty Unicode string with default capacity.
| str |
str := UnicodeString new.
self assert: [str size = 0].
self assert: [str reservedStorage = 0].
If you know ahead of time how big the string might get, then you can go ahead and preallocate the storage. Notice that #reservedStorage is independent of the #size of the string.
| str |
str := UnicodeString new: 100.
self assert: [str size = 0].
self assert: [str reservedStorage = 100].
The following shows how to create a <UnicodeString> with the grinning face emoji repeated 100 times using #new:withAll:.
| str |
str := UnicodeString new: 100 withAll: (Grapheme escaped: '\u{1F600}').
self assert: [str size = 100].
self assert: [str allSatisfy: [:g | g escaped = '\u{1F600}']].
A <UnicodeString> can be created from a string object containing escape syntax using the #escaped: method.
"\x followed by 2 digits for a 7-bit character code"
self assert: [(UnicodeString escaped: '\x53malltalk') = 'Smalltalk'].
"\u{} with up to 6 digits between the braces for a 24-bit Unicode character code"
self assert: [(UnicodeString escaped: '\u{1F914}') first name = 'THINKING FACE'].
self assert: [(UnicodeString escaped: 'e\u{301}') first asNFC name = 'LATIN SMALL LETTER E WITH ACUTE'].
"Support for whitespace: \t \r \n. \r\n will combine into the single Grapheme crlf character"
self assert: [ | str |
str := UnicodeString escaped: '\t\r\n\n'.
str size = 3 and: [str asArray = { Grapheme tab. Grapheme crlf. Grapheme lf }]].
"\0 for null character"
self assert: [(UnicodeString escaped: 'Smalltalk ~= \0') last = 0 asGrapheme].
"\\ if you need an actual backslash - which is formally named reverse solidus"
self assert: [(UnicodeString escaped: '\\') first name = 'REVERSE SOLIDUS']
#value: is a very flexible creation method that can create a <UnicodeString> from many types of inputs.
"From an <Integer> Unicode code point"
self assert: [(UnicodeString value: 16r1F600) first name = 'GRINNING FACE'].
"From a <UnicodeScalar>"
self assert: [(UnicodeString value: 16r1F600 asUnicodeScalar) first name = 'GRINNING FACE'].
"From a <Grapheme>"
self assert: [(UnicodeString value: 16r1F600 asGrapheme) first name = 'GRINNING FACE'].
"From an <Array> of <Integer> code points or <UnicodeScalar>s"
self assert: [(UnicodeString value: { 16r1F600. 16r1F923 }) names = #('GRINNING FACE' 'ROLLING ON THE FLOOR LAUGHING')].
"From a <UnicodeString> using escape syntax. @see #escaped: creational method for more details"
self assert: [ | str |
str := UnicodeString value: '\u{1F600}\u{1F923}' asUnicodeString.
str names = #('GRINNING FACE' 'ROLLING ON THE FLOOR LAUGHING')].
"Compatibility: From a VAST <Character>. Code page conversion occurs if character is outside ASCII range."
self assert: [(UnicodeString value: $S) first name = 'LATIN CAPITAL LETTER S'].
"Compatibility: From a VAST <String>. Code page conversion occurs if any element in the string is outside ASCII range"
self assert: [(UnicodeString value: '\u{1F600}\u{1F923}') names = #('GRINNING FACE' 'ROLLING ON THE FLOOR LAUGHING')].
A <UnicodeString> can also be created from various valid UTF encodings.
"From valid utf8"
self assert: [(UnicodeString utf8: #[240 159 152 130]) first name = 'FACE WITH TEARS OF JOY'].
"From valid utf16 expressed in platform/little/big endian formats"
self assert: [ | le be pe names |
le := #[61 216 2 222].
be := #[216 61 222 2].
pe := System bigEndian ifTrue: [be] ifFalse: [le].
names := Set new.
names add: (UnicodeString utf16: pe) first name.
names add: (UnicodeString utf16LE: le) first name.
names add: (UnicodeString utf16BE: be) first name.
names size = 1 and: [names asArray first = 'FACE WITH TEARS OF JOY']].
"From valid utf32 expressed in platform/little/big endian formats"
self assert: [ | le be pe names |
le := #[2 246 1 0].
be := #[0 1 246 2].
pe := System bigEndian ifTrue: [be] ifFalse: [le].
names := Set new.
names add: (UnicodeString utf32: pe) first name.
names add: (UnicodeString utf32LE: le) first name.
names add: (UnicodeString utf32BE: be) first name.
names size = 1 and: [names asArray first = 'FACE WITH TEARS OF JOY']].
Since not all data in the real world is valid, we need to be able to detect and repair invalid data. The VAST <UnicodeString> can do that too!
"Repair invalid 2nd utf8 byte by replacing it with the special Unicode replacement character"
self assert: [(UnicodeString utf8: #[72 253 108 108 111] repair: true) second name = 'REPLACEMENT CHARACTER'].
"Repair invalid utf16 by replacing invalid code units with the Unicode replacement character"
self assert: [ | le be pe names |
le := #(16rD834 16rDD1E 16r006D 16r0075 16r0073 16rDD1E 16r0069 16r0063 16rD834) asUtf16LE.
be := #(16r34D8 16r1EDD 16r6D00 16r7500 16r7300 16r1EDD 16r6900 16r6300 16r34D8) asUtf16BE.
pe := System bigEndian ifTrue: [be] ifFalse: [le].
names := Set new.
names add: (UnicodeString utf16: pe repair: true) last name.
names add: (UnicodeString utf16LE: le repair: true) last name.
names add: (UnicodeString utf16BE: be repair: true) last name.
names size = 1 and: [names asArray first = 'REPLACEMENT CHARACTER']].
"Repair invalid utf32 by replacing invalid code units with the Unicode replacement character"
self assert: [ | le be pe names |
le := #(16r61 16r1F30D 16r61 16r1F30D 16rD834) asUtf32LE.
be := #(16r61000000 16rDF30100 16r61000000 16rDF30100 16r34D80000) asUtf32BE.
pe := System bigEndian ifTrue: [be] ifFalse: [le].
names := Set new.
names add: (UnicodeString utf32: pe repair: true) last name.
names add: (UnicodeString utf32LE: le repair: true) last name.
names add: (UnicodeString utf32BE: be repair: true) last name.
names size = 1 and: [names asArray first = 'REPLACEMENT CHARACTER']].
Accessing
The elements of a <UnicodeString> are <Grapheme> objects. Grapheme is short for extended grapheme cluster, which is the digital approximation of what a user normally perceives as a character. A <Grapheme> may be composed of multiple Unicode code points or expressed as multiple code units in some encoded form. In most languages you would need to perform complex operations just to extract this kind of element. In VAST it is easy because the <UnicodeString> knows how to do this for you. We'll use an emoji in the example below since emoji tend to be more structurally complex.
| nationalParkStr |
"We'll use escape syntax to easily construct our sentence with the emoji for 'national park'"
nationalParkStr := UnicodeString escaped: 'I love to visit a \u{1F3DE}\u{FE0F}'.
"The last character is a complex emoji consisting of:"
self assert: [nationalParkStr last utf8 size = 7]. "7 utf8 bytes"
self assert: [nationalParkStr last utf16 size = 3]. "3 utf16 code units"
self assert: [nationalParkStr last unicodeScalars size = 2]. "2 Unicode scalars"
"In VAST we don't need to figure out what bytes to copy that make up the emoji.
We just need to know the character position in the string, just like a byte string"
self assert: [nationalParkStr last name = 'NATIONAL PARK,VARIATION SELECTOR-16']
Iteration
The graphemes of a Unicode string can be iterated using the normal <SequenceableCollection> protocol.
| str uppercaseStr |
str := 'Hello Smalltalk' asUnicodeString.
uppercaseStr := UnicodeString new.
self assert: [(str inject: uppercaseStr into: [:upStr :g | upStr addAll: g asUppercase; yourself]) = 'HELLO SMALLTALK'].
Unlike most programming languages, a VAST <UnicodeString> also answers the correct result when using protocols like #reverse on more complex strings with combining marks, emoji and other multi-scalar characters.
| str uhOhStr |
"To demonstrate the problem, let's create a string with 2 characters.
An 'e' with a combining 'accent mark' followed by an 'a'"
str := UnicodeString escaped: 'e\u{301}a'.
"In VAST we get what is expected when the string is reversed. An 'a' followed by an 'e' with a combining 'accent mark'"
self assert: [str reverse = (UnicodeString escaped: 'ae\u{301}')].
"However, languages that present their characters as bytes, code units or code points will typically give the wrong
answer. And you won't know it's wrong unless you semantically inspect it. Below we are going to reverse it by scalar
(code point)"
uhOhStr := UnicodeString new.
str unicodeScalars reverseDo: [:u | uhOhStr add: u asGrapheme].
"Uh-oh, this is bad. The combining accent mark always modifies the base character before it and now that we reversed at
the Unicode scalar level...we have a string composed of an 'a' with an 'accent mark' followed by an 'e'!"
self assert: [uhOhStr escaped = 'a\u{301}e'].
Iteration methods are mostly inherited, but are also available in the Iteration category on the instance side.
Normalization
NFC, NFD, NFKC, NFKD normalized forms are all supported. Normalization can either be done in-place or a copy can be made.
These can be found in the Normalization category on the instance side.
| str |
"LATIN SMALL LETTER E, COMBINING ACUTE ACCENT"
str := UnicodeString value: #(16r65 16r301).
self assert: [str isNFD].
self assert: [str asNFC unicodeScalars contents first value = 16rE9].
self assert: [str isNFD].
"In-Place normalization"
str ensureNFC.
self assert: [str isNFD not and: [str isNFC]].
FFI
These can be found in the PlatformInterface category on the instance side. Use #asPSZ when you want to pass a <UnicodeString> to a C function FFI call. This will pass a pointer to the utf-8 null-terminated backing storage.
Properties
These can be found in the Properties category on the instance side.
• #isAscii - Boolean indicating if the Unicode String is composed of all ASCII characters
• #names - Collection of the <UnicodeString>'s grapheme names
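As a minimal sketch of these two properties, the following uses only constructors shown earlier in this document. That #isAscii answers false for a string of emoji is inferred from the property's description, not taken from a documented example.

```smalltalk
| ascii emoji |
ascii := 'Smalltalk' asUnicodeString.
emoji := UnicodeString value: { 16r1F600. 16r1F923 }.
"Every element of 'Smalltalk' is in the 7-bit ASCII range"
self assert: [ascii isAscii].
"Assumption: a string holding emoji is not composed of all ASCII characters"
self assert: [emoji isAscii not].
"#names answers one name per grapheme, in order"
self assert: [emoji names = #('GRINNING FACE' 'ROLLING ON THE FLOOR LAUGHING')].
```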
Views
The following views are available for Unicode Strings. These can be found in the Views category on the instance side.
• #graphemes - Grapheme view of the Unicode string
• #unicodeScalars - Unicode scalar view of the Unicode string
• #utf8 - UTF-8 encoded view of the Unicode string
• #utf16 - UTF-16 platform-endian encoded view of the Unicode string
• #utf16LE - UTF-16 little-endian encoded view of the Unicode string
• #utf16BE - UTF-16 big-endian encoded view of the Unicode string
• #utf32 - UTF-32 encoded view of the Unicode string
• #utf32LE - UTF-32 little-endian encoded view of the Unicode string
• #utf32BE - UTF-32 big-endian encoded view of the Unicode string
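The views above expose the same content at different levels. The sketch below looks at one grapheme through several of them. It combines selectors that appear elsewhere in this document (#size, #unicodeScalars, #contents, #utf8, #utf16LE, #asByteArray); the exact protocol each view supports beyond these is not assumed.

```smalltalk
| str |
"LATIN SMALL LETTER E, COMBINING ACUTE ACCENT"
str := UnicodeString value: #(16r65 16r301).
"One grapheme (user-perceived character)..."
self assert: [str size = 1].
"...made of 2 Unicode scalars..."
self assert: [str unicodeScalars contents size = 2].
"...stored as 3 UTF-8 bytes (1 for 'e', 2 for U+0301)..."
self assert: [str utf8 size = 3].
"...or as 2 UTF-16 code units, shown here as little-endian bytes"
self assert: [str utf16LE asByteArray = #[101 0 1 3]].
```

Working at the grapheme level by default, and dropping to a view only when an encoding is needed, keeps element counts independent of the chosen encoding.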
Equality/Comparison
Like <Grapheme> comparisons, <UnicodeString> comparisons follow Unicode canonical equivalence.
"These are = according to Unicode canonical equivalence"
self assert: [#[195 169] asUnicodeString = #[101 204 129] asUnicodeString].
Conversion
Single-Byte String
A <UnicodeString> can convert itself to a single-byte <String> in a few ways.
If the Unicode string contains only characters in the current code page of VAST, then the #asSBString method can be used.
"7-bit ASCII will always convert"
self assert: ['Hello Smalltalk' asUnicodeString asSBString = 'Hello Smalltalk'].
"This will most likely answer nil since Hangul can only be mapped by code page 949"
self assert: [(UnicodeString escaped: '\u{D6C8}\u{BBFC}\u{C815}\u{C74C}') asSBString isNil].
If you need a form that is guaranteed to map to a single-byte string, use the #escaped method.
self assert: [ | hangulStr |
hangulStr := UnicodeString escaped: '\u{D6C8}\u{BBFC}\u{C815}\u{C74C}'.
hangulStr escaped asSBString = '\u{D6C8}\u{BBFC}\u{C815}\u{C74C}'].
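Because the escaped form is plain ASCII text, it can also serve as a lossless carrier through single-byte channels. The following sketch assumes that #escaped: parses back exactly what #escaped produced, which matches the examples in this document but is not stated there as a guarantee.

```smalltalk
| hangulStr ascii roundTrip |
hangulStr := UnicodeString escaped: '\u{D6C8}\u{BBFC}\u{C815}\u{C74C}'.
"Escape to ASCII text that any code page can hold..."
ascii := hangulStr escaped asSBString.
"...then parse the escapes again to rebuild the Unicode string"
roundTrip := UnicodeString escaped: ascii.
self assert: [roundTrip = hangulStr].
```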
UTF Encoding
A <UnicodeString> can convert itself to any of the UTF encodings. Platform-endian accessors are provided in the Conversion category on the instance side, but any endian encoding can be accomplished with views.
"Platform-endian"
self assert: [UnicodeString cr asUtf8 asByteArray = #[13]].
self assert: [UnicodeString cr asUtf16 = (Utf16 with: 13)].
self assert: [UnicodeString cr asUtf32 = (Utf32 with: 13)].
"Little/Big endian via views"
self assert: [UnicodeString cr utf16LE asByteArray = #[13 0]].
self assert: [UnicodeString cr utf16BE asByteArray = #[0 13]].
self assert: [UnicodeString cr utf32LE asByteArray = #[13 0 0 0]].
self assert: [UnicodeString cr utf32BE asByteArray = #[0 0 0 13]].
Case Mapping
A <UnicodeString> can convert itself to its uppercase, lowercase and titlecase form. Case mapping can change the number of Unicode scalars. Depending on how they combine, this may change the number of <Grapheme>s.
Uppercase
Here is an example that shows why case mapping answers a <UnicodeString>. Consider calling #asUppercase on the German sharp S 16rDF asUnicodeString. When this is uppercased, it becomes two graphemes 16r53 (LATIN CAPITAL LETTER S) and 16r53 (LATIN CAPITAL LETTER S).
self assert: [16rDF asUnicodeString asUppercase = 'SS' asUnicodeString]
Lowercase
This example was given in the class documentation for <UnicodeScalar>. This example produced two Unicode scalars when 16r0130 asUnicodeScalar (LATIN CAPITAL LETTER I WITH DOT ABOVE) was lowercased. However, these two Unicode scalars produce one user-perceived character or <Grapheme>. If this lowercase form were rendered as a Glyph on-screen, the user would typically see a small letter i with a combining dot above.
self assert: [16r0130 asUnicodeString asLowercase = (UnicodeString value: #(16r0069 16r0307))].
Titlecase
There are several Unicode characters that require special handling when they are used as the initial "character" in the word. One example is 16rFB01 asUnicodeString (LATIN SMALL LIGATURE FI). When this is titlecased, it becomes two graphemes 16r46 (LATIN CAPITAL LETTER F) and 16r69 (LATIN SMALL LETTER I).
self assert: [16rFB01 asUnicodeString asTitlecase = 'Fi' asUnicodeString]
String/Character Compatibility
This compatibility is carried out in three different ways.
• Primitives: The virtual machine primitives can quickly detect and convert a <String> or <Character> if it is in the 7-bit ASCII range. Outside this range, the string/char represents a value from a particular code page. There are many code pages, and they can differ wildly in the upper 128 values. Because code page conversion is required in this case, a primitive failure is triggered.
• Primitive Fail Handlers: The primitive failure handlers in Smalltalk detect if the argument was a <String> or <Character>, code page convert it to a <UnicodeString> or <Grapheme>, and try the primitive call again (this time with a Unicode string/grapheme argument).
• Smalltalk Methods: Compatibility methods are provided in the Compatibility category of methods.
Important Note
• Compatibility relationship is uni-directional. A <String> or <Character> does not have direct knowledge of <UnicodeString>.
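A short sketch of the compatibility direction described above, using only calls that appear in the method examples of this document. All values are kept in the ASCII range, so no code page conversion is involved.

```smalltalk
| str |
str := 'Small' asUnicodeString.
"A <String> argument is accepted and converted"
str addAll: 'talk'.
"A <Character> argument is accepted and converted"
str add: $!.
"A <UnicodeString> receiver can be compared with a <String> argument"
self assert: [str = 'Smalltalk!'].
"The reverse comparison, with the <String> as receiver, is deliberately not
 asserted here because <String> has no direct knowledge of <UnicodeString>"
```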
Class Methods
cr
Answer the unicode string containing a carriage return.
Answers:
<UnicodeString>
crlf
Answer the unicode string containing a carriage return and a linefeed.
Answers:
<UnicodeString>
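The line-ending constructors can be checked against the escape syntax described earlier. This sketch assumes that #crlf answers a string holding the single crlf grapheme, which follows from the \r\n combining rule shown in the #escaped: examples.

```smalltalk
"A carriage return is a single grapheme backed by one UTF-8 byte"
self assert: [UnicodeString cr size = 1].
self assert: [UnicodeString cr asUtf8 asByteArray = #[13]].
"cr followed by lf is still one grapheme, matching the escaped form"
self assert: [UnicodeString crlf size = 1].
self assert: [UnicodeString crlf first = Grapheme crlf].
self assert: [UnicodeString crlf = (UnicodeString escaped: '\r\n')].
```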
escaped:
Create a new unicode string from @aStringObject.
Escaped Strings:
If @aStringObject is a String or UnicodeString, then the following escapes
will be parsed:
Escapes:
\x53 7-bit character code (exactly 2 digits, up to 0x7F)
\u{1F600} 24-bit Unicode character code (up to 6 digits)
\n Newline (This is the Lf character)
\r Carriage return (This is the Cr character)
\t Tab
\\ Backslash
\0 Nul
Examples:
#'From a string or unicode string containing possible escapes'.
self assert: [(UnicodeString escaped: '\x53malltalk') = 'Smalltalk'].
self assert: [ | str1 str2 |
str1 := UnicodeString escaped: 'Line1\nLine2\r\t\x54abbedLine3\r\nEmoji: \u{1F600}'.
str2 := ('Line1' , String lf , 'Line2' , String cr , Character tab asString, 'TabbedLine3', String crlf , 'Emoji: ' , 16r1F600 asGrapheme utf8 contents) utf8AsUnicodeString.
str1 = str2].
Arguments:
aStringObject - <UnicodeString> unicode string containing escape characters
(Compat) <String> Smalltalk code-page string containing escape characters (requires conversion if outside ascii range)
Answers:
<UnicodeString>
lf
Answer the unicode string containing a line feed.
Answers:
<UnicodeString>
new
Create an empty unicode string.
Example:
| str |
str := UnicodeString new.
self assert: [str isEmpty].
self assert: [str ownsStorage].
self assert: [str reservedStorage = 0].
Answers:
<UnicodeString>
new:
Create a unicode string with @anInteger capacity reserved for fast growth.
Example:
| str |
str := UnicodeString new: 100.
self assert: [str isEmpty].
self assert: [str ownsStorage].
self assert: [str reservedStorage = 100].
Arguments:
anInteger - <Integer>
Answers:
<UnicodeString>
new:withAll:
Answer an instance of me, with number of elements equal to size, each
of which refers to the argument, value.
Example:
| str |
str := UnicodeString new: 100 withAll: (Grapheme value: 16r1F4A6).
self assert: [str size = 100].
self assert: [str utf8 size = ((Grapheme value: 16r1F4A6) utf8 size * 100)]
Arguments:
size - <Integer>
value - <Grapheme> @see implementors of #asGrapheme for additional argument types
Answers:
<UnicodeString>
utf16:
Answer the unicode string constructed from @aByteObject which should be UTF-16 platform-endian encoded data.
@aByteObject is validated before any attempt is made to create a unicode string from its data.
If @aByteObject is empty, then an empty <UnicodeString> is answered.
@Note - use #utf16:repair: true if you are not sure if @aByteObject is valid utf16 and you would like to
have replacement characters inserted.
Examples:
self assert: [(UnicodeString utf16: 'a' utf16 contents) = 'a'].
Arguments:
aByteObject - <String | ByteArray> or byte shaped object
Answers:
<UnicodeString>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-16
utf16:repair:
Answer the unicode string constructed from @aByteObject which should be UTF-16 platform-endian encoded data.
If @repair is true, then invalid utf16 will be replaced with the unicode replacement character U+FFFD.
If @aByteObject is empty, then an empty <UnicodeString> is answered.
Arguments:
aByteObject - <String | ByteArray>
repair - <Boolean> if true, then repair invalid utf16 using U+FFFD replacement characters.
Answers:
<UnicodeString>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-16
utf16BE:
Answer the unicode string constructed from @aByteObject which should be UTF-16 big-endian encoded data.
If @aByteObject is empty, then an empty <UnicodeString> is answered.
@Note - use #utf16BE:repair: true if you are not sure if @aByteObject is valid big-endian utf16 and you would like to
have replacement characters inserted.
Examples:
self assert: [(UnicodeString utf16BE: #[0 97]) = 'a'].
Arguments:
aByteObject - <String | ByteArray>
Answers:
<UnicodeString>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-16BE
utf16BE:repair:
Answer the unicode string constructed from @aByteObject which should be UTF-16 big-endian encoded data.
If @repair is true, then invalid utf16 will be replaced with the unicode replacement character U+FFFD.
If @aByteObject is empty, then an empty <UnicodeString> is answered.
Arguments:
aByteObject - <String | ByteArray>
repair - <Boolean> if true, then repair invalid utf16 using U+FFFD replacement characters.
Answers:
<UnicodeString>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-16BE
utf16LE:
Answer the unicode string constructed from @aByteObject which should be UTF-16 little-endian encoded data.
If @aByteObject is empty, then an empty <UnicodeString> is answered.
@Note - use #utf16LE:repair: true if you are not sure if @aByteObject is valid little-endian utf16 and you would like to
have replacement characters inserted.
Examples:
self assert: [(UnicodeString utf16LE: #[97 0]) = 'a'].
Arguments:
aByteObject - <String | ByteArray>
Answers:
<UnicodeString>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-16LE
utf16LE:repair:
Answer the unicode string constructed from @aByteObject which should be UTF-16 little-endian encoded data.
If @repair is true, then invalid utf16 will be replaced with the unicode replacement character U+FFFD.
If @aByteObject is empty, then an empty <UnicodeString> is answered.
Arguments:
aByteObject - <String | ByteArray>
repair - <Boolean> if true, then repair invalid utf16 using U+FFFD replacement characters.
Answers:
<UnicodeString>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-16LE
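The repair variants have no example of their own, so here is a minimal sketch for the little-endian case. The input bytes are made up for illustration: 'a' followed by a lone high surrogate, which is not valid UTF-16 without a matching low surrogate.

```smalltalk
| bytes str |
"'a' (16r0061) followed by a lone high surrogate (16rD834), little-endian"
bytes := #[97 0 52 216].
str := UnicodeString utf16LE: bytes repair: true.
"The valid code unit survives..."
self assert: [str first name = 'LATIN SMALL LETTER A'].
"...and the unpaired surrogate is replaced with U+FFFD"
self assert: [str last name = 'REPLACEMENT CHARACTER'].
```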
utf32:
Answer the unicode string constructed from @aByteObject which should be UTF-32 platform-endian encoded data.
@aByteObject is validated before any attempt is made to create a unicode string from its data.
@Note - use #utf32:repair: true if you are not sure if @aByteObject is valid utf32 and you would like to
have replacement characters inserted.
Examples:
self assert: [(UnicodeString utf32: 'a' utf32 contents) = 'a'].
Arguments:
aByteObject - <String | ByteArray> or byte shaped object
Answers:
<UnicodeString>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-32
utf32:repair:
Answer the unicode string constructed from @aByteObject which should be UTF-32 platform-endian encoded data.
If @repair is true, then invalid utf32 will be replaced with the unicode replacement character U+FFFD.
Arguments:
aByteObject - <String | ByteArray>
repair - <Boolean> if true, then repair invalid utf32 using U+FFFD replacement characters.
Answers:
<UnicodeString>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-32
utf32BE:
Answer the unicode string constructed from @aByteObject which should be UTF-32 big-endian encoded data.
@Note - use #utf32BE:repair: true if you are not sure if @aByteObject is valid big-endian utf32 and you would like to
have replacement characters inserted.
Examples:
self assert: [(UnicodeString utf32BE: #[0 0 0 97]) = 'a'].
Arguments:
aByteObject - <String | ByteArray>
Answers:
<UnicodeString>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-32BE
utf32BE:repair:
Answer the unicode string constructed from @aByteObject which should be UTF-32 big-endian encoded data.
If @repair is true, then invalid utf32 will be replaced with the unicode replacement character U+FFFD.
Arguments:
aByteObject - <String | ByteArray>
repair - <Boolean> if true, then repair invalid utf32 using U+FFFD replacement characters.
Answers:
<UnicodeString>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-32BE
utf32LE:
Answer the unicode string constructed from @aByteObject which should be UTF-32 little-endian encoded data.
@Note - use #utf32LE:repair: true if you are not sure if @aByteObject is valid little-endian utf32 and you would like to
have replacement characters inserted.
Examples:
self assert: [(UnicodeString utf32LE: #[97 0 0 0]) = 'a'].
Arguments:
aByteObject - <String | ByteArray>
Answers:
<UnicodeString>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-32LE
utf32LE:repair:
Answer the unicode string constructed from @aByteObject which should be UTF-32 little-endian encoded data.
If @repair is true, then invalid utf32 will be replaced with the unicode replacement character U+FFFD.
Arguments:
aByteObject - <String | ByteArray>
repair - <Boolean> if true, then repair invalid utf32 using U+FFFD replacement characters.
Answers:
<UnicodeString>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-32LE
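A corresponding sketch for UTF-32 repair. The input is illustrative: 'a' followed by the value 16rD834, which lies in the surrogate range and is therefore not a valid Unicode scalar in UTF-32.

```smalltalk
| bytes str |
"'a' (16r00000061) followed by 16r0000D834, little-endian"
bytes := #[97 0 0 0 52 216 0 0].
str := UnicodeString utf32LE: bytes repair: true.
"The valid scalar survives..."
self assert: [str first name = 'LATIN SMALL LETTER A'].
"...and the surrogate value is replaced with U+FFFD"
self assert: [str last name = 'REPLACEMENT CHARACTER'].
```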
utf8:
Answer the unicode string constructed from @aByteObject which should be UTF-8 encoded data.
@aByteObject is validated before any attempt is made to create a unicode string from its data.
An exception is raised if @aByteObject is not valid utf8.
@Note - use #utf8:repair: true if you are not sure if @aByteObject is valid utf8 and you would like to
have replacement characters inserted.
Examples:
self assert: [(UnicodeString utf8: #[97]) = 'a']
Arguments:
aByteObject - <String | ByteArray>
Answers:
<UnicodeString>
Raises:
<Exception> EsPrimErrValueOutOfRange if invalid utf-8
utf8:repair:
Answer the unicode string constructed from @aByteObject which should be UTF-8 encoded data.
@aByteObject is validated before any attempt is made to create a unicode string from its data.
If @repair is true, then invalid utf8 will be replaced with the unicode replacement character U+FFFD.
Example:
| input |
'Hello World with garbage utf8 characters between Hello and World'.
input := #[72 101 108 108 111 32 16rF0 16r90 16r80 87 111 114 108 100].
self assert: [[UnicodeString utf8: input repair: true. true] on: Exception do: [:ex | ex exitWith: false]].
Arguments:
aByteObject - <String | ByteArray>
repair - <Boolean> if true, then repair invalid utf8 by replacing them with U+FFFD replacement character.
Answers:
<UnicodeString>
Raises:
<Exception> EsPrimErrValueOutOfRange if @aByteObject contains invalid UTF-8
and @repair is false
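To contrast the strict and repairing UTF-8 constructors, the sketch below feeds both the same invalid input. It relies on the documented behavior that #utf8: raises an exception on invalid data, and reuses the exception-handling pattern from the example above.

```smalltalk
| input strictFailed repaired |
"'H', an invalid byte (253), then 'llo'"
input := #[72 253 108 108 111].
"Strict: #utf8: is documented to raise on invalid input"
strictFailed := [UnicodeString utf8: input. false]
	on: Exception
	do: [:ex | ex exitWith: true].
self assert: [strictFailed].
"Lenient: #utf8:repair: substitutes U+FFFD for the invalid byte"
repaired := UnicodeString utf8: input repair: true.
self assert: [repaired second name = 'REPLACEMENT CHARACTER'].
```

Preferring the strict form at trust boundaries and the repairing form for display-only data is one reasonable policy; the choice depends on the application.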
value:
Create a new unicode string from @anObject.
Escaped Strings:
If @anObject is a String or UnicodeString, then the following escapes
will be parsed:
Escapes:
\x53 7-bit character code (exactly 2 digits, up to 0x7F)
\u{1F600} 24-bit Unicode character code (up to 6 digits)
\n Newline (This is the Lf character)
\r Carriage return (This is the Cr character)
\t Tab
\\ Backslash
\0 Nul
Examples:
#'From unicode scalar value'.
self assert: [(UnicodeString value: 16r1F600) unicodeScalars first value = 16r1F600].
#'From unicode scalar object'.
self assert: [(UnicodeString value: 16r1F600 asUnicodeScalar) unicodeScalars first = 16r1F600 asUnicodeScalar].
#'From grapheme object'.
self assert: [(UnicodeString value: 16r1F600 asGrapheme) first = 16r1F600 asGrapheme].
#'From a character object'.
self assert: [(UnicodeString value: $a) first asciiValue = $a value].
#'From array of unicode scalar object/values' .
self assert: [(UnicodeString value: { 16r65. 16r301 asUnicodeScalar }) utf32 contents asArray = #(16r65 16r301)].
#'From a string or unicode string containing possible escapes'.
self assert: [(UnicodeString value: '\x53malltalk') = 'Smalltalk'].
self assert: [ | str1 str2 |
str1 := UnicodeString value: 'Line1\nLine2\r\t\x54abbedLine3\r\nEmoji: \u{1F600}'.
str2 := ('Line1' , String lf , 'Line2' , String cr , Character tab asString, 'TabbedLine3', String crlf , 'Emoji: ' , 16r1F600 asGrapheme utf8 contents) utf8AsUnicodeString.
str1 = str2].
Arguments:
anObject - <Integer> unicode code point
<UnicodeScalar> unicode scalar
<Grapheme> unicode grapheme
<Array> of <Integer | UnicodeScalar> array of unicode scalars
<UnicodeString> unicode string containing escape characters
(Compat) <Character> Smalltalk code-page character (requires conversion if outside ascii range)
(Compat) <String> Smalltalk code-page string containing escape characters (requires conversion if outside ascii range)
Answers:
<UnicodeString>
Instance Methods
,
Answer a copy of the receiver with each element
of the argument, aCollection, added in #do: order.
Fail if aCollection is not a kind of Collection.
Example:
self assert: [('Small' asUnicodeString , 'talk' asUnicodeString) = 'Smalltalk']
Arguments:
aCollection - <Collection>
Answers:
<UnicodeString>
<
Answer a Boolean which is true if the receiver is ordered before @aUnicodeString and false otherwise.
Both the receiver and @aUnicodeString will be guaranteed to have the same normalization
form before the comparison is made.
This uses locale-independent machine-sorting for stable sorts.
Argument:
aUnicodeString - <UnicodeString | String>
Answers:
<Boolean>
Raises:
Fail if @aUnicodeString is not a String or UnicodeString.
<=
Answer a Boolean which is true if the receiver is equal or ordered before @aUnicodeString and false otherwise.
Both the receiver and @aUnicodeString will be guaranteed to have the same normalization
form before the comparison is made.
This uses locale-independent machine-sorting for stable sorts.
Argument:
aUnicodeString - <UnicodeString | String>
Answers:
<Boolean>
Raises:
Fail if @aUnicodeString is not a String or UnicodeString.
=
Answer a Boolean which is true when the receiver and the argument
anObject are equivalent, and false otherwise.
Equality of 2 UnicodeStrings is based on Unicode canonical equivalence.
Example:
'UTF-8 for LATIN SMALL LETTER E WITH ACUTE - NFC Form'.
self assert: [#[195 169] asUnicodeString first name = 'LATIN SMALL LETTER E WITH ACUTE'].
'UTF-8 for LATIN SMALL LETTER E and COMBINING ACUTE ACCENT - NFD Form'.
self assert: [#[101 204 129] asUnicodeString first name = 'LATIN SMALL LETTER E,COMBINING ACUTE ACCENT'].
'These are = according to Unicode canonical equivalence'.
self assert: [#[195 169] asUnicodeString = #[101 204 129] asUnicodeString].
Arguments:
anObject - <Object>
Answers:
<Boolean>
>
Answer a Boolean which is true if the receiver is ordered after @aUnicodeString and false otherwise.
Both the receiver and @aUnicodeString will be guaranteed to have the same normalization
form before the comparison is made.
This uses locale-independent machine-sorting for stable sorts.
Argument:
aUnicodeString - <UnicodeString | String>
Answers:
<Boolean>
Raises:
Fail if @aUnicodeString is not a String or UnicodeString.
>=
Answer a Boolean which is true if the receiver is equal or ordered after @aUnicodeString and false otherwise.
Both the receiver and @aUnicodeString will be guaranteed to have the same normalization
form before the comparison is made.
This uses locale-independent machine-sorting for stable sorts.
Argument:
aUnicodeString - <UnicodeString | String>
Answers:
<Boolean>
Raises:
Fail if @aUnicodeString is not a String or UnicodeString.
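The ordering operators above have no examples, so here is a hedged sketch. The first assertions use simple ASCII strings. The last one is an inference from the statement that both sides share a normalization form before comparison: canonically equivalent strings should then be neither before nor after each other.

```smalltalk
| composed decomposed |
"Machine ordering, independent of locale"
self assert: ['abc' asUnicodeString < 'abd' asUnicodeString].
self assert: ['abd' asUnicodeString > 'abc' asUnicodeString].
"A <String> argument is also accepted"
self assert: ['abc' asUnicodeString <= 'abc'].
"Assumption: NFC and NFD spellings of the same text order as equal"
composed := UnicodeString value: 16rE9.
decomposed := UnicodeString value: #(16r65 16r301).
self assert: [composed <= decomposed and: [composed >= decomposed]].
```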
add:
Answer aGrapheme having added aGrapheme to the end of receiver.
Example:
| str |
str := UnicodeString new.
str add: $S asGrapheme.
self assert: [str = 'S'].
str add: $T.
self assert: [str = 'ST']
Arguments:
aGrapheme - <Grapheme>
Answers:
<Grapheme> grapheme or converted grapheme
Raises:
Exception if aGrapheme is not a grapheme or character
addAll:
Answer aUnicodeString having added each element of aUnicodeString to the
end of receiver.
Example:
| str |
str := UnicodeString new.
str
addAll: 'Small' asUnicodeString;
addAll: 'talk'.
self assert: [str = 'Smalltalk']
Arguments:
aUnicodeString - <UnicodeString>
Answers:
<UnicodeString>
Raises:
Fail if aUnicodeString is not a type of Collection.
addLineDelimiters
Answer a string which contains the same characters as the
receiver, but with each occurrence of the backslash character
replaced by the line terminator character(s).
Example:
self assert: ['abc\def\ghi' asUnicodeString addLineDelimiters lines = { 'abc'. 'def'. 'ghi' }]
Answers:
<UnicodeString>
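The backslash substitution above can be sketched in a few lines. This is an illustrative Python sketch, not VAST code; `add_line_delimiters` is a hypothetical name, and using the platform line separator as the "line terminator character(s)" is an assumption.

```python
import os


def add_line_delimiters(text: str, terminator: str = os.linesep) -> str:
    """Replace every backslash with the line terminator (assumed: platform separator)."""
    return text.replace("\\", terminator)
```

Splitting the result into lines yields the three segments regardless of which terminator the platform uses.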
asEsString
Convert the receiver to an <EsString>.
This is done by attempting to represent the receiver in a single-byte
locale-sensitive string.
Example:
self assert: ['Smalltalk' asUnicodeString asEsString class == String].
self assert: ['Smalltalk' asUnicodeString asEsString = 'Smalltalk'].
self assert: [('Small' , 16r1F440 asUnicodeString , 'talk') asEsString isNil].
Answers:
<String> or nil if conversion is not possible
asLowercase
Answers a lowercased version of this unicode string.
Examples:
self assert: ['A' asUnicodeString asLowercase = 'a' asUnicodeString].
self assert: [16r0130 asUnicodeString asLowercase = (UnicodeString value: { 16r0069 asUnicodeScalar. 16r0307 asUnicodeScalar. })].
Answers:
<UnicodeString>
asNFC
Answer an NFC normalized copy of this unicode string.
Examples:
'LATIN SMALL LETTER E, COMBINING ACUTE ACCENT -> LATIN SMALL LETTER E WITH ACUTE'.
self assert: [(UnicodeString value: #(16r65 16r301)) asNFC unicodeScalars first value = 16rE9]
Answers:
<UnicodeString>
asNFD
Answer an NFD normalized copy of this unicode string.
Examples:
'LATIN SMALL LETTER E WITH ACUTE -> LATIN SMALL LETTER E, COMBINING ACUTE ACCENT'.
self assert: [16rE9 asUnicodeString asNFD unicodeScalars contents = { 16r65 asUnicodeScalar. 16r301 asUnicodeScalar}]
Answers:
<UnicodeString>
asNFKC
Answer an NFKC normalized copy of this unicode string.
Examples:
'SUPERSCRIPT TWO -> DIGIT TWO'.
self assert: [16rB2 asUnicodeString asNFKC = 16r32 asUnicodeString]
Answers:
<UnicodeString>
asNFKD
Answer an NFKD normalized copy of this unicode string.
Examples:
'SUPERSCRIPT TWO -> DIGIT TWO'.
self assert: [16rB2 asGrapheme asNFKD = 16r32 asGrapheme]
Answers:
<UnicodeString>
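The four normalization forms above can be compared side by side. The following Python sketch is not VAST code; it uses the standard `unicodedata` module to reproduce the scalar values shown in the examples, and `scalar_values` is a hypothetical helper.

```python
import unicodedata
from typing import List


def scalar_values(text: str) -> List[int]:
    """List the Unicode scalar values that make up the text."""
    return [ord(ch) for ch in text]


nfc = unicodedata.normalize("NFC", "\u0065\u0301")  # compose e + combining acute
nfd = unicodedata.normalize("NFD", "\u00e9")        # decompose e with acute
nfkc = unicodedata.normalize("NFKC", "\u00b2")      # compatibility mapping, then compose
nfkd = unicodedata.normalize("NFKD", "\u00b2")      # compatibility mapping, decomposed
```

NFC and NFD preserve canonical equivalence only, while the K forms also fold compatibility characters such as SUPERSCRIPT TWO.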
asParameter
Answer the receiver as a parameter for an OS call.
Answers:
<EsUnicodeStorage> UTF-8 bytes with null-terminator
asPSZ
Answer the receiver as a null terminated OS string.
Copy-On-Write:
If the receiver is sharing storage, the COW-barrier is triggered
and an unshared storage will be created.
If receiver is read-only, then make a copy of the receiver (which will not be read-only)
and try again.
Example:
self assert: ['a' asUnicodeString asPSZ isKindOf: EsUnicodeStorage].
self assert: ['a' asUnicodeString asPSZ endsWithSubCollection: #[0]].
Answers:
<EsUnicodeStorage> UTF-8 bytes with null-terminator
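To show what "UTF-8 bytes with null-terminator" looks like to a C callee, here is a small Python sketch using `ctypes`. It is not VAST code and says nothing about how VAST manages the storage; `as_psz` is a hypothetical name.

```python
import ctypes


def as_psz(text: str) -> ctypes.Array:
    """Build a C char buffer holding the UTF-8 bytes followed by a NUL byte."""
    return ctypes.create_string_buffer(text.encode("utf-8"))
```

The buffer's raw contents are the encoded bytes plus one trailing zero byte, which is the layout a C function taking `char *` expects.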
asSBString
COMPATIBILITY API: Convert the receiver to a single byte string if possible using the current
code page, otherwise answer nil
Example:
self assert: ['Smalltalk' asUnicodeString asSBString class == String].
self assert: ['Smalltalk' asUnicodeString asSBString = 'Smalltalk'].
self assert: [('Small' , 16r1F440 asUnicodeString , 'talk') asSBString isNil].
Answers:
<String> if reducible
<UndefinedObject> nil
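The "convert if possible, otherwise answer nil" contract can be sketched as follows. This Python sketch is not VAST code; `as_single_byte` is a hypothetical name and Latin-1 stands in for "the current code page", which is an assumption.

```python
from typing import Optional


def as_single_byte(text: str, code_page: str = "latin-1") -> Optional[bytes]:
    """Answer the single-byte encoding, or None when a character has no mapping."""
    try:
        return text.encode(code_page)
    except UnicodeEncodeError:
        return None
```

A string containing U+1F440 cannot be represented in any single-byte code page, so the conversion answers None, mirroring the nil result in the example above.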
asSBString:
Convert the receiver to a single byte string in the supplied @codePage if possible,
otherwise answer nil.
@Note: The primitive will answer a new String if the receiver is ASCII.
Otherwise, the primitive signals a failure, which triggers code page conversion.
Example:
| utf8CodePage |
utf8CodePage := EsAbstractCodePageConverter current class utf8CodePage.
self assert: [('Smalltalk' asUnicodeString asSBString: utf8CodePage) = 'Smalltalk'].
Arguments:
codePage - <Object>
Answers:
<String> if reducible
<UndefinedObject> nil if not reducible to a single-byte representation in the @codePage
asStream
Answer a Unicode adapted ReadWriteStream on the receiver.
Example:
| rwStream |
rwStream := 'Smalltalk' asUnicodeString asStream.
rwStream position: 3.
rwStream next: 2 put: $z asGrapheme.
self assert: [(rwStream reset; upToEnd) = 'Smazztalk']
Answers:
<ReadWriteStream>
asString
A UnicodeString is considered a 'string' in the sense that it is a sequenceable collection of 'characters',
where a 'character' is defined as a user-perceived character according to the rules in
Unicode Standard Annex #29. It also shares the API of a String object.
Answers:
<UnicodeString>
asSymbol
Answer a symbol representation of the receiver. Currently, only
single-byte symbols are handled.
asTitlecase
Answers a titlecased version of this string.
This will titlecase the first grapheme only; every other grapheme
will be left unmodified.
Examples:
self assert: [$a asUnicodeString asTitlecase = 'A' asUnicodeString].
self assert: [16rFB01 asUnicodeString asTitlecase = (UnicodeString value: { 16r46 asUnicodeScalar. 16r69 asUnicodeScalar. })].
Answers:
<UnicodeString>
asTitlecase:
Answers a titlecased version of this string.
If @lowercaseRemaining is true, then all graphemes after the first one
will be lowercased, if false then they are left unmodified.
Examples:
self assert: [('abc DEF' asUnicodeString asTitlecase: false) = 'Abc DEF'].
self assert: [('abc DEF' asUnicodeString asTitlecase: true) = 'Abc def'].
Arguments:
lowercaseRemaining - <Boolean>
Answers:
<UnicodeString>
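The effect of @lowercaseRemaining can be sketched as follows. This Python sketch is not VAST code; `titlecase_first` is a hypothetical name, and it treats the first code point as the first grapheme, which is a simplification of the grapheme-based behavior described above.

```python
def titlecase_first(text: str, lowercase_remaining: bool) -> str:
    """Titlecase the first code point; optionally lowercase everything after it."""
    if not text:
        return text
    head, tail = text[0], text[1:]
    return head.title() + (tail.lower() if lowercase_remaining else tail)
```

Because full case mappings are used, a single ligature such as U+FB01 expands to two scalars when titlecased.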
asUnicodeString
Answer the receiver.
Answers:
<UnicodeString>
asUppercase
Answers an uppercased version of this unicode string.
Examples:
self assert: ['a' asUnicodeString asUppercase = 'A' asUnicodeString].
self assert: [16rDF asUnicodeString asUppercase = (UnicodeString value: { 16r53 asUnicodeScalar. 16r53 asUnicodeScalar. })].
Answers:
<UnicodeString>
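Case conversion can change the number of scalars in a string, as the U+0130 and U+00DF examples show. The following Python sketch is not VAST code; it reproduces those two mappings with the built-in full case mappings, and `scalar_values` is a hypothetical helper.

```python
from typing import List


def scalar_values(text: str) -> List[int]:
    """List the Unicode scalar values that make up the text."""
    return [ord(ch) for ch in text]


lowered = "\u0130".lower()  # LATIN CAPITAL LETTER I WITH DOT ABOVE
uppered = "\u00df".upper()  # LATIN SMALL LETTER SHARP S
```

Code that preallocates output by input length should therefore not assume case conversion preserves size.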
asUtf16
Answer a <Utf16> that contains the utf-16 platform-endian encoded bytes of the receiver.
Example:
| utf16 |
'EARTH GLOBE EUROPE-AFRICA'.
utf16 := 16r1F30D asUnicodeString asUtf16.
System bigEndian ifFalse: [
self assert: [utf16 first = 55356].
self assert: [utf16 second = 57101]].
self assert: [(UnicodeString utf16: utf16) first name = 'EARTH GLOBE EUROPE-AFRICA']
Answers:
<Utf16>
asUtf16Bytes
Answer a <ByteArray> that contains the utf-16 platform-endian encoded bytes of the receiver.
Example:
| utf16 |
'EARTH GLOBE EUROPE-AFRICA'.
utf16 := 16r1F30D asUnicodeString asUtf16Bytes.
System bigEndian ifFalse: [
self assert: [utf16 = #[60 216 13 223]]].
self assert: [(UnicodeString utf16: utf16) first name = 'EARTH GLOBE EUROPE-AFRICA']
Answers:
<ByteArray>
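The values 55356 and 57101 in the examples above are the UTF-16 surrogate pair for U+1F30D. The following Python sketch, which is not VAST code, shows the arithmetic; `utf16_code_units` is a hypothetical name.

```python
from typing import List


def utf16_code_units(scalar: int) -> List[int]:
    """Encode one Unicode scalar value as one or two UTF-16 code units."""
    if not 0 <= scalar <= 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if scalar < 0x10000:
        return [scalar]
    offset = scalar - 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10 bits.
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


units = utf16_code_units(0x1F30D)
little_endian = b"".join(unit.to_bytes(2, "little") for unit in units)
```

On a little-endian platform each code unit is stored low byte first, which gives the byte sequence 60 216 13 223 shown for asUtf16Bytes.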
asUtf32
Answer a <Utf32> that contains the utf-32 platform-endian encoded bytes of the receiver.
Example:
| utf32 |
'EARTH GLOBE EUROPE-AFRICA'.
utf32 := 16r1F30D asUnicodeString asUtf32.
System bigEndian ifFalse: [self assert: [utf32 first = 127757]].
self assert: [(UnicodeString utf32: utf32) first name = 'EARTH GLOBE EUROPE-AFRICA']
Answers:
<Utf32>
asUtf32Bytes
Answer a <ByteArray> that contains the utf-32 platform-endian encoded bytes of the receiver.
Example:
| utf32 |
'EARTH GLOBE EUROPE-AFRICA'.
utf32 := 16r1F30D asUnicodeString asUtf32Bytes.
System bigEndian ifFalse: [self assert: [utf32 = #[13 243 1 0]]].
self assert: [(UnicodeString utf32: utf32) first name = 'EARTH GLOBE EUROPE-AFRICA']
Answers:
<ByteArray>
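The byte order of the UTF-32 result depends on the platform, which is why the assertions above are guarded by an endianness check. The following Python sketch is not VAST code; `utf32_bytes` is a hypothetical name.

```python
import sys


def utf32_bytes(text: str, byteorder: str = sys.byteorder) -> bytes:
    """Encode text as UTF-32 without a byte order mark, in the given byte order."""
    codec = "utf-32-le" if byteorder == "little" else "utf-32-be"
    return text.encode(codec)
```

Each scalar occupies exactly four bytes, so U+1F30D (decimal 127757) becomes 13 243 1 0 in little-endian order and 0 1 243 13 in big-endian order.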
asUtf8
Answer a <Utf8> that contains the utf-8 encoded bytes of the receiver.
Example:
| utf8 |
'EARTH GLOBE EUROPE-AFRICA'.
utf8 := 16r1F30D asUnicodeString asUtf8.
self assert: [utf8 asByteArray = #[240 159 140 141]].
self assert: [(UnicodeString utf8: utf8) first name = 'EARTH GLOBE EUROPE-AFRICA']
Answers:
<Utf8>
asUtf8Bytes
Answer a <ByteArray> that contains the utf-8 encoded bytes of the receiver.
Example:
| utf8 |
'EARTH GLOBE EUROPE-AFRICA'.
utf8 := 16r1F30D asUnicodeString asUtf8Bytes.
self assert: [utf8 asByteArray = #[240 159 140 141]].
self assert: [(UnicodeString utf8: utf8) first name = 'EARTH GLOBE EUROPE-AFRICA']
Answers:
<ByteArray>
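The four bytes 240 159 140 141 in the examples above follow from the UTF-8 bit layout. The following Python sketch is not VAST code; `utf8_encode_scalar` is a hypothetical name that encodes a single scalar by hand.

```python
def utf8_encode_scalar(scalar: int) -> bytes:
    """Encode one Unicode scalar value as 1 to 4 UTF-8 bytes."""
    if not 0 <= scalar <= 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if scalar < 0x80:
        return bytes([scalar])
    if scalar < 0x800:
        return bytes([0xC0 | (scalar >> 6), 0x80 | (scalar & 0x3F)])
    if scalar < 0x10000:
        return bytes([
            0xE0 | (scalar >> 12),
            0x80 | ((scalar >> 6) & 0x3F),
            0x80 | (scalar & 0x3F),
        ])
    return bytes([
        0xF0 | (scalar >> 18),
        0x80 | ((scalar >> 12) & 0x3F),
        0x80 | ((scalar >> 6) & 0x3F),
        0x80 | (scalar & 0x3F),
    ])
```

Unlike UTF-16 and UTF-32, the result does not depend on platform byte order, so the UTF-8 examples need no endianness guard.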
at:
Answer the <Grapheme> at index, @anInteger in the <UnicodeString>,
receiver.
Performance Note:
While this can be an O(1) operation, it is not guaranteed unless the contents
of the receiver are single-scalar ascii (i.e. Ascii - No CrLf). Do not use this for
element retrieval in a loop, instead use iteration methods or 'views'.
Example:
'O(1) because it is single-scalar ascii'.
self assert: [('Smalltalk' asUnicodeString at: 9) = $k asGrapheme].
'O(n) because it is multi-scalar ascii (due to CrLf)'.
self assert: [(('Small' , String crlf , 'talk') asUnicodeString at: 10) = $k asGrapheme].
'O(n) because it contains non-ascii characters'.
self assert: [(('Small' , 233 asUnicodeString , 'talk') asUnicodeString at: 10) = $k asGrapheme].
Arguments:
anInteger - <Integer> 1-based index
Answers:
<Grapheme>
Raises:
<Exception> - Fail if anInteger is not an Integer. Fail if anInteger
exceeds the size of the receiver.
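The performance note above follows from graphemes having variable width: a CrLf pair or a base letter plus combining marks counts as one element, so an index cannot be turned into a storage offset without scanning. The following Python sketch is not VAST code and is far simpler than the Unicode Standard Annex #29 rules; `simple_graphemes` and `grapheme_at` are hypothetical names, and only CrLf and combining marks are handled.

```python
import unicodedata
from typing import List


def simple_graphemes(text: str) -> List[str]:
    """Group code points into clusters (simplified: CR LF pairs and combining marks only)."""
    clusters: List[str] = []
    for ch in text:
        joins_previous = bool(clusters) and (
            (ch == "\n" and clusters[-1] == "\r")  # keep CR LF together
            or unicodedata.combining(ch) != 0      # attach combining marks to the base
        )
        if joins_previous:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def grapheme_at(text: str, index: int) -> str:
    """Answer the cluster at a 1-based index; requires a linear scan."""
    return simple_graphemes(text)[index - 1]
```

With this grouping, 'Small', CrLf, 'talk' has 10 elements rather than 11, which is why index 10 answers `k` in the example above.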
at:put:
Replace the grapheme in the receiver at @anInteger with @aGrapheme. Answer @aGrapheme.
This message sets one of the receiver’s elements based on @anInteger. The @aGrapheme is stored at
@anInteger in the receiver’s elements, replacing any previously stored object. Subsequent retrievals at
this @anInteger will answer @aGrapheme.
Performance Note:
While this can be an O(1) operation, it is not guaranteed unless the contents
of the receiver are single-scalar ascii (i.e. Ascii - No CrLf).
Example:
'O(1) because it is single-scalar ascii'.
self assert: [ | str |
str := 'Smalltalk' asUnicodeString.
str at: 9 put: $z asGrapheme.
str = 'Smalltalz' asUnicodeString].
'O(n) because it is multi-scalar ascii (due to CrLf)'.
self assert: [ | str |
str := ('Small' , String crlf , 'talk') asUnicodeString.
str at: 6 put: $z asGrapheme.
str = 'Smallztalk' asUnicodeString].
'O(n) because it contains non-ascii characters'.
self assert: [ | str |
str := ('Small' , 233 asUnicodeString , 'talk') asUnicodeString.
str at: 6 put: $z asGrapheme.
str = 'Smallztalk' asUnicodeString].
Arguments:
anInteger - <Integer> 1-based index
aGrapheme - <Grapheme> replacement
Answers:
<Grapheme>
Raises:
Fail if @anInteger is not an Integer.
Fail if @anInteger < 1.
Fail if @anInteger > the receiver’s size.
Fail if @aGrapheme does not conform to any element type restrictions of the receiver.
basicAt:
Answer the <Grapheme> at index, @anInteger in the <UnicodeString>,
receiver.
Performance Note:
While this can be an O(1) operation, it is not guaranteed unless the contents
of the receiver are single-scalar ascii (i.e. Ascii - No CrLf). Do not use this for
element retrieval in a loop, instead use iteration methods or 'views'.
Example:
'O(1) because it is single-scalar ascii'.
self assert: [('Smalltalk' asUnicodeString basicAt: 9) = $k asGrapheme].
'O(n) because it is multi-scalar ascii (due to CrLf)'.
self assert: [(('Small' , String crlf , 'talk') asUnicodeString basicAt: 10) = $k asGrapheme].
'O(n) because it contains non-ascii characters'.
self assert: [(('Small' , 233 asUnicodeString , 'talk') asUnicodeString basicAt: 10) = $k asGrapheme].
Arguments:
anInteger - <Integer> 1-based index
Answers:
<Grapheme>
Raises:
<Exception> - Fail if anInteger is not an Integer. Fail if anInteger
exceeds the size of the receiver.
basicAt:put:
Replace the grapheme in the receiver at @anInteger with @aGrapheme. Answer @aGrapheme.
This message sets one of the receiver’s elements based on @anInteger. The @aGrapheme is stored at
@anInteger in the receiver’s elements, replacing any previously stored object. Subsequent retrievals at
this @anInteger will answer @aGrapheme.
Performance Note:
While this can be an O(1) operation, it is not guaranteed unless the contents
of the receiver are single-scalar ascii (i.e. Ascii - No CrLf).
Example:
'O(1) because it is single-scalar ascii'.
self assert: [ | str |
str := 'Smalltalk' asUnicodeString.
str basicAt: 9 put: $z asGrapheme.
str = 'Smalltalz' asUnicodeString].
'O(n) because it is multi-scalar ascii (due to CrLf)'.
self assert: [ | str |
str := ('Small' , String crlf , 'talk') asUnicodeString.
str basicAt: 6 put: $z asGrapheme.
str = 'Smallztalk' asUnicodeString].
'O(n) because it contains non-ascii characters'.
self assert: [ | str |
str := ('Small' , 233 asUnicodeString , 'talk') asUnicodeString.
str basicAt: 6 put: $z asGrapheme.
str = 'Smallztalk' asUnicodeString].
Arguments:
anInteger - <Integer> 1-based index
aGrapheme - <Grapheme> replacement
Answers:
<Grapheme>
Raises:
Fail if @anInteger is not an Integer.
Fail if @anInteger < 1.
Fail if @anInteger > the receiver’s size.
Fail if @aGrapheme does not conform to any element type restrictions of the receiver.
bindWith:
Answer the string formatted under the control of the receiver. The receiver
is a unicode string that contains field descriptors used to insert arguments into
the format string.
Each conversion specification is introduced by the percent character ($%).
After the $% character, the following single character can appear:
[1]
Specifies the argument string to be used in place of the field descriptor.
e.g. argString1 replaces all occurrences of %1.
[0]
Causes a line delimiter to be inserted into the formatted string.
[%]
Print a $% character; no argument is converted.
Arguments:
argString1 - <UnicodeString | String>
Answers:
<UnicodeString>
Raises:
Error cases handled:
1. Argument array is too small -- attempt recovery
2. Replacement value is not a string or a character -- attempt recovery
3. Malformed message string -- signal error
bindWith:with:
Answer the string formatted under the control of the receiver. The receiver
is a unicode string that contains field descriptors used to insert arguments into
the format string.
Each conversion specification is introduced by the percent character ($%).
After the $% character, the following single character can appear:
[1-2]
Specifies the argument string to be used in place of the field descriptor.
e.g. argString1 replaces all occurrences of %1.
argString2 replaces all occurrences of %2.
[0]
Causes a line delimiter to be inserted into the formatted string.
[%]
Print a $% character; no argument is converted.
Arguments:
argString1 - <UnicodeString | String>
argString2 - <UnicodeString | String>
Answers:
<UnicodeString>
Raises:
Error cases handled:
1. Argument array is too small -- attempt recovery
2. Replacement value is not a string or a character -- attempt recovery
3. Malformed message string -- signal error
bindWith:with:with:
Answer the string formatted under the control of the receiver. The receiver
is a unicode string that contains field descriptors used to insert arguments into
the format string.
Each conversion specification is introduced by the percent character ($%).
After the $% character, the following single character can appear:
[1-3]
Specifies the argument string to be used in place of the field descriptor.
e.g. argString1 replaces all occurrences of %1.
argString2 replaces all occurrences of %2.
argString3 replaces all occurrences of %3.
[0]
Causes a line delimiter to be inserted into the formatted string.
[%]
Print a $% character; no argument is converted.
Arguments:
argString1 - <UnicodeString | String>
argString2 - <UnicodeString | String>
argString3 - <UnicodeString | String>
Answers:
<UnicodeString>
Raises:
Error cases handled:
1. Argument array is too small -- attempt recovery
2. Replacement value is not a string or a character -- attempt recovery
3. Malformed message string -- signal error
bindWith:with:with:with:
Answer the string formatted under the control of the receiver. The receiver
is a unicode string that contains field descriptors used to insert arguments into
the format string.
Each conversion specification is introduced by the percent character ($%).
After the $% character, the following single character can appear:
[1-4]
Specifies the argument string to be used in place of the field descriptor.
e.g. argString1 replaces all occurrences of %1.
argString2 replaces all occurrences of %2.
argString3 replaces all occurrences of %3.
argString4 replaces all occurrences of %4.
[0]
Causes a line delimiter to be inserted into the formatted string.
[%]
Print a $% character; no argument is converted.
Arguments:
argString1 - <UnicodeString | String>
argString2 - <UnicodeString | String>
argString3 - <UnicodeString | String>
argString4 - <UnicodeString | String>
Answers:
<UnicodeString>
Raises:
Error cases handled:
1. Argument array is too small -- attempt recovery
2. Replacement value is not a string or a character -- attempt recovery
3. Malformed message string -- signal error
bindWith:with:with:with:with:
Answer the string formatted under the control of the receiver. The receiver
is a unicode string that contains field descriptors used to insert arguments into
the format string.
Each conversion specification is introduced by the percent character ($%).
After the $% character, the following single character can appear:
[1-5]
Specifies the argument string to be used in place of the field descriptor.
e.g. argString1 replaces all occurrences of %1.
argString2 replaces all occurrences of %2.
argString3 replaces all occurrences of %3.
argString4 replaces all occurrences of %4.
argString5 replaces all occurrences of %5.
[0]
Causes a line delimiter to be inserted into the formatted string.
[%]
Print a $% character; no argument is converted.
Arguments:
argString1 - <UnicodeString | String>
argString2 - <UnicodeString | String>
argString3 - <UnicodeString | String>
argString4 - <UnicodeString | String>
argString5 - <UnicodeString | String>
Answers:
<UnicodeString>
Raises:
Error cases handled:
1. Argument array is too small -- attempt recovery
2. Replacement value is not a string or a character -- attempt recovery
3. Malformed message string -- signal error
bindWith:with:with:with:with:with:
Answer the string formatted under the control of the receiver. The receiver
is a unicode string that contains field descriptors used to insert arguments into
the format string.
Each conversion specification is introduced by the percent character ($%).
After the $% character, the following single character can appear:
[1-6]
Specifies the argument string to be used in place of the field descriptor.
e.g. argString1 replaces all occurrences of %1.
argString2 replaces all occurrences of %2.
argString3 replaces all occurrences of %3.
argString4 replaces all occurrences of %4.
argString5 replaces all occurrences of %5.
argString6 replaces all occurrences of %6.
[0]
Causes a line delimiter to be inserted into the formatted string.
[%]
Print a $% character; no argument is converted.
Arguments:
argString1 - <UnicodeString | String>
argString2 - <UnicodeString | String>
argString3 - <UnicodeString | String>
argString4 - <UnicodeString | String>
argString5 - <UnicodeString | String>
argString6 - <UnicodeString | String>
Answers:
<UnicodeString>
Raises:
Error cases handled:
1. Argument array is too small -- attempt recovery
2. Replacement value is not a string or a character -- attempt recovery
3. Malformed message string -- signal error
bindWithArguments:
Answer the string formatted under the control of the receiver. The receiver
is a unicode string that contains field descriptors used to insert arguments into
the format string.
Each conversion specification is introduced by the percent character ($%).
After the $% character, the following single character can appear:
[1-9]
Specifies the argument string to be used in place of the field descriptor.
e.g. (anArrayOfArguments at: 1) replaces all occurrences of %1.
.
.
.
(anArrayOfArguments at: 9) replaces all occurrences of %9.
[0]
Causes a line delimiter to be inserted into the formatted string.
[%]
Print a $% character; no argument is converted.
Arguments:
anArrayOfArguments - <Array>
Answers:
<UnicodeString>
Raises:
Error cases handled:
1. Argument array is too small -- attempt recovery
2. Replacement value is not a string or a character -- attempt recovery
3. Malformed message string -- signal error
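The field descriptor rules shared by the bindWith: family can be sketched as a small expander. This Python sketch is not VAST code; `bind_with_arguments` is a hypothetical name. Two choices are assumptions because the source only says "attempt recovery": a descriptor with no matching argument is left in place, and non-string arguments are converted with `str`. The line delimiter is assumed to be a line feed.

```python
from typing import Sequence


def bind_with_arguments(template: str, arguments: Sequence[object]) -> str:
    """Expand %1..%9, %0 (line delimiter) and %% in template."""
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch != "%":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(template):
            raise ValueError("malformed format string: trailing %")
        spec = template[i + 1]
        if spec == "%":
            out.append("%")
        elif spec == "0":
            out.append("\n")  # assumed line delimiter
        elif spec in "123456789":
            position = int(spec)
            if position <= len(arguments):
                out.append(str(arguments[position - 1]))
            else:
                out.append("%" + spec)  # assumed recovery: keep the descriptor
        else:
            raise ValueError("malformed format string: %" + spec)
        i += 2
    return "".join(out)
```

For instance, `bind_with_arguments("%1 is %2%%%0", ["load", "50"])` expands both arguments, emits a literal percent sign, and ends with a line feed.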
capitalized
Return a copy of the receiver with the first letter capitalized.
Examples:
self assert: ['abcd' asUnicodeString capitalized = 'Abcd' asUnicodeString].
self assert: [16rFB01 asUnicodeString capitalized = (UnicodeString value: { 16r46 asUnicodeScalar. 16r69 asUnicodeScalar. })].
Answers:
<UnicodeString>
collect:
Answer a Collection that is created by iteratively
evaluating the one argument block, aBlock using each element of
the receiver as an argument.
Example:
| str |
str := 'abc' asUnicodeString collect: [:g | (g value + 1) asGrapheme].
self assert: [str = 'bcd']
Arguments:
aBlock - <Block> 1-arg block that evaluates each element
Answers:
<UnicodeString>
Raises:
Fail if aBlock is not a one argument block.
compareTo:
Orders the receiver relative to @aUnicodeString.
Both the receiver and @aUnicodeString are guaranteed to have the same normalization
form before the comparison is made.
Example:
| str |
'e with acute'.
str := 233 asUnicodeString.
self assert: [str asNFC unicodeScalars size ~= str asNFD unicodeScalars size].
self assert: [str asNFC = str asNFD]
Arguments:
aUnicodeString - <UnicodeString | String>
Answers:
<Integer> -1 The receiver is less than @aUnicodeString
0 The receiver is equal to @aUnicodeString
1 The receiver is greater than @aUnicodeString
Raises:
Fail if @aUnicodeString is not a String or UnicodeString.
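The three-way result described above can be sketched as a normalization step followed by a locale-independent comparison. This Python sketch is not VAST code; `compare_to` is a hypothetical name, and NFC as the common form is an assumption, since the source only says both sides share the same normalization form.

```python
import unicodedata


def compare_to(left: str, right: str) -> int:
    """Answer -1, 0 or 1 after normalizing both operands to NFC (assumed form)."""
    if not isinstance(right, str):
        raise TypeError("argument must be a string")
    a = unicodedata.normalize("NFC", left)
    b = unicodedata.normalize("NFC", right)
    # Python compares strings by code point, with no locale involvement.
    return (a > b) - (a < b)
```

Precomposed and decomposed spellings of the same text compare as 0, while unrelated strings order by code point.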
copyAndGrowBy:
Answer a copy of the receiver whose growth capacity has been
increased by the amount anInteger.
Copy-On-Write:
This API only participates in copy-on-write IF @anInteger is 0.
We should think of this API as copy-then-grow. The 'grow' would normally trigger
the write barrier and then create a unique copy. After this call, the receiver and the
copy will not share storage and the copy will have the increased capacity.
Example:
| str copy |
str := UnicodeString new: 1.
self assert: [str reservedStorage = 1].
copy := str copyAndGrowBy: 100.
self assert: [str ownsStorage & copy ownsStorage].
self assert: [str reservedStorage = 1].
self assert: [copy reservedStorage = 100].
Arguments:
anInteger - <Integer>
Answers:
<UnicodeString>
Raises:
Fail if anInteger is not an Integer.
copyFrom:
Answer a copy of a subset of the receiver,
starting from the element at Integer start
until the index of the last element.
Copy-On-Write:
This API uses copy-on-write semantics. The resulting copy
shares storage with the receiver until a subsequent
write by either the receiver or the copy.
Example:
| str copy |
str := 'Smalltalk' asUnicodeString.
self assert: [(str copyFrom: 1) = str].
self assert: [(str copyFrom: 6) = 'talk'].
self assert: [(str copyFrom: -1) isEmpty].
self assert: [(str copyFrom: 1000) isEmpty].
Arguments:
start - <Integer>
Answers:
<UnicodeString>
Raises:
Fail if start is not an Integer
copyFrom:to:
Answer a copy of a subset of the receiver,
starting from the element at Integer start
until the element at Integer index stop.
Copy-On-Write:
This API uses copy-on-write semantics. The resulting copy
shares storage with the receiver until a subsequent
write by either the receiver or the copy.
Example:
| str copy |
str := 'Smalltalk' asUnicodeString.
self assert: [(str copyFrom: 1 to: str size) = str].
self assert: [(str copyFrom: 2 to: 2) = 'm'].
self assert: [(str copyFrom: 6 to: 9) = 'talk'].
self assert: [(str copyFrom: 1000 to: 3) isEmpty].
self assert: [[str copyFrom: 1000 to: 1005. false] on: Exception do: [:ex | ex exitWith: true]].
self assert: [[str copyFrom: -1 to: 9. false] on: Exception do: [:ex | ex exitWith: true]].
Arguments:
start - <Integer> 1-based index
stop - <Integer> 1-based stop index (inclusive)
Answers:
<UnicodeString>
Raises:
Fail if start is not an Integer.
Fail if stop is not an Integer.
copyReplaceAll:with:
Answer a copy of the receiver in which all sequences of elements
contained within the SequenceableCollection oldSubCollection have been
replaced by the elements in the SequenceableCollection newSubCollection.
Copy-On-Write:
This API does not participate in copy-on-write semantics.
Copying with element replacement prevents this.
Example:
self assert: [('Smalltalk Smalltalk' copyReplaceAll: 'll' with: 'zz') = 'Smazztalk Smazztalk'].
self assert: [('Smalltalk' copyReplaceAll: 'll' with: 'aaaaa') = 'Smaaaaaatalk'].
self assert: [('Smalltalk' copyReplaceAll: 'zz' with: 'aa') = 'Smalltalk'].
Arguments:
oldSubCollection - <SequenceableCollection>
newSubCollection - <SequenceableCollection>
Answers:
<UnicodeString> copy
Raises:
Fail if oldSubCollection is not a SequenceableCollection.
Fail if newSubCollection is not a SequenceableCollection.
copyReplaceFrom:to:with:
Answer a SequenceableCollection which is a copy of the receiver
in which the elements between the Integer index start and the
Integer index stop have each been replaced by the elements
of the SequenceableCollection replacementCollection.
This method can be used to perform insertion, replacement and to append.
start and stop are first adjusted to be within the bounds of the
receiver.
If stop >= start, then the elements at those indices are replaced
with replacementCollection.
If stop < start, then this is an insertion. Insertion will occur before
the element at the index represented by
start.
If start > receiver size, this means append after the last
element.
Copy-On-Write:
This API does not participate in copy-on-write semantics.
Copying with element replacement prevents this.
Example:
self assert: [('this' asUnicodeString copyReplaceFrom: 3 to: 4 with: 'e') = 'the'].
self assert: [('this' asUnicodeString copyReplaceFrom: 3 to: 4 with: 'eee') = 'theee'].
self assert: [('Java' asUnicodeString copyReplaceFrom: 1 to: 4 with: 'Smalltalk') = 'Smalltalk']
Arguments:
start - <Integer> 1-based index
stop - <Integer> 1-based index (inclusive)
replacementCollection - <SequenceableCollection>
Answers:
<UnicodeString>
Raises:
Fail if start is not an Integer.
Fail if stop is not an Integer.
copyReplaceFrom:to:withObject:
Answer a copy of the receiver in which the elements of the
receiver between start and stop inclusive have been replaced
with replacementElement.
Copy-On-Write:
This API does not participate in copy-on-write semantics.
Copying with element replacement prevents this.
Example:
| str |
str := 'Smalltalk' asUnicodeString.
self assert: [(str copyReplaceFrom: 4 to: 5 withObject: $k asGrapheme) = 'Smakktalk'].
Arguments:
start - <Integer> 1-based index
stop - <Integer> 1-based index (inclusive)
replacementElement - <implementors of #asGrapheme>
Answers:
<UnicodeString>
copyWithout:
Answer a copy of the receiver in which all elements that are
equivalent to oldElement have been omitted.
Copy-On-Write:
This API does not participate in copy-on-write semantics.
Copying without elements would produce a non-contiguous
view.
Example:
self assert: [('Smalltalk' copyWithout: $l) = 'Smatak']
Arguments:
oldElement - <Grapheme> or compatible
Answers:
<UnicodeString>
do:
Iteratively evaluate the one argument block, aBlock using
each element of the receiver, in order.
Example:
| str col |
str := 'Smalltalk' asUnicodeString.
col := (str graphemes contentsInto: Array) asOrderedCollection.
str do: [:grapheme |self assert: [grapheme = col removeFirst]].
self assert: [col isEmpty]
Arguments:
aBlock - <Block> 1-arg block that evaluates each element
Raises:
Fail if aBlock is not a one argument block.
doWithIndex:
Iteratively evaluate the two argument block, aBlock using
each element of the receiver, in order, and the element
index.
Example:
| str col |
str := 'Smalltalk' asUnicodeString.
col := (str graphemes contentsInto: Array) asOrderedCollection.
str doWithIndex: [:grapheme :i |self assert: [grapheme = (col at: i)]].
Arguments:
aBlock - <Block> 2-arg block that evaluates each element and index
Raises:
Fail if aBlock is not a two argument block.
ensureNFC
Ensure the receiver is NFC normalized.
Example:
| str |
str := UnicodeString with: 233 asGrapheme asNFD.
self assert: [str isNFC not].
str ensureNFC.
self assert: [str isNFC].
Answers:
<UnicodeString> self
ensureNFD
Ensure the receiver is NFD normalized.
Example:
| str |
str := UnicodeString with: 233 asGrapheme asNFC.
self assert: [str isNFD not].
str ensureNFD.
self assert: [str isNFD].
Answers:
<UnicodeString> self
ensureNFKC
Ensure the receiver is NFKC normalized.
Example:
| str |
str := UnicodeString with: 233 asGrapheme asNFKD.
self assert: [str isNFKC not].
str ensureNFKC.
self assert: [str isNFKC].
Answers:
<UnicodeString> self
ensureNFKD
Ensure the receiver is NFKD normalized.
Example:
| str |
str := UnicodeString with: 233 asGrapheme asNFKC.
self assert: [str isNFKD not].
str ensureNFKD.
self assert: [str isNFKD].
Answers:
<UnicodeString> self
escaped
Answer a copy of the receiver that has been escaped using the following
rules.
Escaped Strings:
Tab is escaped as `\t`.
Carriage return is escaped as `\r`.
Line feed is escaped as `\n`.
Backslash is escaped as `\\`.
Any other character in the 'printable ASCII' range `16r20` .. `16r7E` inclusive is not escaped.
All other characters are given hexadecimal Unicode escapes `\u{NNNNNN}`, where
`NNNNNN` is the uppercase hexadecimal representation of the code point.
Example:
| str |
str := ('Small\talk' , String crlf) asUnicodeString.
self assert: [str escaped = 'Small\\talk\r\n'].
str add: (16r1F37A asGrapheme).
self assert: [str escaped = 'Small\\talk\r\n\u{1F37A}'].
Answers:
<UnicodeString>
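The escaping rules above can be sketched directly. This Python sketch is not VAST code; `escaped` here is a hypothetical stand-in that works per code point rather than per grapheme, and emitting the hexadecimal digits without zero padding is an assumption based on the `\u{1F37A}` example.

```python
def escaped(text: str) -> str:
    """Escape tab, CR, LF and backslash; hex-escape anything outside printable ASCII."""
    simple = {"\t": "\\t", "\r": "\\r", "\n": "\\n", "\\": "\\\\"}
    out = []
    for ch in text:
        if ch in simple:
            out.append(simple[ch])
        elif 0x20 <= ord(ch) <= 0x7E:
            out.append(ch)
        else:
            out.append("\\u{%X}" % ord(ch))  # uppercase hex, no padding (assumed)
    return "".join(out)
```

The named escapes are checked before the printable range because backslash itself lies inside `16r20` .. `16r7E`.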
from:to:doWithIndex:
Iteratively evaluate the two argument block, aBlock using
the elements from start to stop in the receiver and the
element index.
Example:
| str str2 |
str := 'Smalltalk' asUnicodeString.
str2 := UnicodeString new.
str from: 6 to: 9 doWithIndex: [:grapheme :i | str2 add: grapheme].
self assert: str2 = 'talk'.
Arguments:
start - <Integer> 1-based index
stop - <Integer> 1-based index (inclusive)
aBlock - <Block> 2-arg block that evaluates each element and index
Raises:
Fail if aBlock is not a two argument block.
graphemes
Answer the grapheme view of the receiver.
A grapheme <Grapheme> is a collection of one or more Unicode code points that
form a 'user-perceived character'. This is done using Unicode's boundary algorithms
to combine code points into 'extended grapheme clusters'. Each element in the view
will be a grapheme <Grapheme>.
Copy-On-Write:
This view is copy-on-write enabled. A consistent view of the elements
that existed at the time of creation is provided, even if an outside mutator is making
modifications to this UnicodeString.
Example:
| view |
'Balloon - U+1F388'.
view := 16r1F388 asUnicodeString graphemes.
self assert: [view size = 1].
self assert: [view first name = 'BALLOON'].
Answers:
<GraphemeView>
includes:
Answer a Boolean which is true if the receiver contains an element
that is equivalent to the Object anElement, and false otherwise.
Example:
self assert: ['Smalltalk' asUnicodeString includes: $k asGrapheme].
self assert: ['Smalltalk' asUnicodeString includes: $k].
self assert: [('Smalltalk' asUnicodeString includes: $z) not].
Arguments:
anElement - <Grapheme | Character>
Answers:
<Boolean>
includesSubstring:
Answers whether @aString is a substring of the receiver (case sensitive).
Example:
| str |
str := 'Smalltalk' asUnicodeString.
self assert: [str includesSubstring: 'talk'].
self assert: [(str includesSubstring: 'tAlK') not].
Arguments:
aString - <String>
Answers:
<Boolean>
includesSubstring:caseSensitive:
Checks whether @aString is a substring of the receiver.
@caseSensitive determines if the search is case sensitive or not.
Example:
| str |
str := 'Smalltalk' asUnicodeString.
self assert: [str includesSubstring: 'talk' caseSensitive: true].
self assert: [(str includesSubstring: 'tAlK' caseSensitive: true) not].
self assert: [(str includesSubstring: 'zzz' caseSensitive: false) not].
Arguments:
aString - <String>
caseSensitive - <Boolean>
Answers:
<Boolean>
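The effect of the @caseSensitive flag can be sketched as follows. This Python sketch is not VAST code; `includes_substring` is a hypothetical name, and using Unicode case folding for the case-insensitive path is an assumption, since the source does not say how case is ignored.

```python
def includes_substring(text: str, substring: str, case_sensitive: bool) -> bool:
    """Substring test; the case-insensitive path compares case-folded text (assumed)."""
    if case_sensitive:
        return substring in text
    return substring.casefold() in text.casefold()
```

Case folding is used instead of lowercasing because it is the mapping Unicode defines for caseless matching.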
indexOf:matchCase:startingAt:
Answer a collection of indices specifying the subsequence of
the receiver which matches pattern, or an empty collection
if there is no match. Consider case if flag is true. Start
searching at index start.
Example:
| str |
str := 'Smalltalk' asUnicodeString.
self assert: [(str indexOf: 'l' matchCase: false startingAt: 1) = (4 to: 4)].
self assert: [(str indexOf: 'l' matchCase: false startingAt: 5) = (5 to: 5)].
self assert: [(str indexOf: 'l' matchCase: false startingAt: 6) = (8 to: 8)].
self assert: [(str indexOf: 'l*' matchCase: false startingAt: 1) = (4 to: 9)].
self assert: [(str indexOf: 'L*' matchCase: true startingAt: 1) isEmpty].
Arguments:
pattern - <UnicodeString | String>
flag - <Boolean>
start - <Integer>
Answers:
<Array> of <Integer> indices
indexOfSubCollection:startingAt:ifAbsent:
Answer the index of the first element of the receiver which is the start of a subsequence which
matches @targetSequence. Start searching at index @start in the receiver. Answer the result of
evaluating @exceptionHandler with no parameters if no such subsequence is found.
Each subsequence of the receiver starting at index @start is checked for a match with
@targetSequence. To match, each element of a subsequence of the receiver must be equivalent
to the corresponding element of @targetSequence. Answer the index of the first element which
begins a matching subsequence; no further subsequences are considered. Answer the result of
evaluating @exceptionHandler with no parameters if no such subsequence is found or if
@targetSequence is empty.
The elements are traversed in the order specified by the #do: message for the receiver.
Example:
| str |
str := 'Smalltalk' asUnicodeString.
self assert: [(str indexOfSubCollection: 'talk' startingAt: 1 ifAbsent: [0]) = 6].
self assert: [(str indexOfSubCollection: 'talking' startingAt: 1 ifAbsent: [0]) = 0].
Arguments:
targetSequence - <SequenceableCollection>
start - <Integer> 1-based index
exceptionHandler - <Block> 0-arg absent handler
Answers:
<Integer> 1-based index
Raises:
targetSequence is not a <SequenceableCollection>
start < 1 (ANSI)
start > the receiver's size (ANSI)
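The effect of @start and of an empty @targetSequence can be sketched as follows. This is an illustration based on the description above, not additional documented behavior:

```smalltalk
| str |
str := 'Smalltalk' asUnicodeString.
"Searching from index 6 skips the 'l' graphemes at indices 4 and 5."
self assert: [(str indexOfSubCollection: 'l' startingAt: 6 ifAbsent: [0]) = 8].
"An empty target never matches, so the handler block supplies the answer."
self assert: [(str indexOfSubCollection: '' startingAt: 1 ifAbsent: [-1]) = -1]
```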
isAscii
Answers true if the receiver is within the ASCII range, false otherwise
Example:
self assert: ['Smalltalk' asUnicodeString isAscii].
self assert: [('Smalltalk' asUnicodeString add: 233 asGrapheme; yourself) isAscii not]
Answers:
<Boolean>
isEmpty
Answer true if this string is empty, false otherwise.
Example:
self assert: [UnicodeString new isEmpty].
self assert: ['a' asUnicodeString isEmpty not].
Answers:
<Boolean>
isNFC
Answer true if the receiver is NFC normalized, false otherwise.
Example:
self assert: ['Smalltalk' asUnicodeString isNFC].
self assert: [(UnicodeString with: (233 asGrapheme asNFC)) isNFC].
self assert: [(UnicodeString with: (233 asGrapheme asNFD)) isNFC not].
Answers:
<Boolean>
isNFD
Answer true if the receiver is NFD normalized, false otherwise.
Example:
self assert: ['Smalltalk' asUnicodeString isNFD].
self assert: [(UnicodeString with: (233 asGrapheme asNFD)) isNFD].
self assert: [(UnicodeString with: (233 asGrapheme asNFC)) isNFD not].
Answers:
<Boolean>
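The relationship between the NFC and NFD tests can be sketched on a whole string rather than a single grapheme. The sketch assumes that 233 asUnicodeString holds the precomposed scalar U+00E9, as the names example suggests:

```smalltalk
| nfc nfd |
"U+00E9 as a single precomposed scalar."
nfc := 233 asUnicodeString.
"The same text decomposed into 'e' followed by a combining acute accent."
nfd := nfc asNFD.
self assert: [nfc isNFC].
self assert: [nfd isNFD].
self assert: [nfd isNFC not].
"Both forms are one grapheme, but the scalar counts differ."
self assert: [nfc size = 1].
self assert: [nfd size = 1].
self assert: [nfc unicodeScalars size = 1].
self assert: [nfd unicodeScalars size = 2]
```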
isNFKC
Answer true if the receiver is NFKC normalized, false otherwise.
Example:
self assert: ['Smalltalk' asUnicodeString isNFKC].
self assert: [(UnicodeString with: (233 asGrapheme asNFKC)) isNFKC].
self assert: [(UnicodeString with: (233 asGrapheme asNFKD)) isNFKC not].
Answers:
<Boolean>
isNFKD
Answer true if the receiver is NFKD normalized, false otherwise.
Example:
self assert: ['Smalltalk' asUnicodeString isNFKD].
self assert: [(UnicodeString with: (233 asGrapheme asNFKD)) isNFKD].
self assert: [(UnicodeString with: (233 asGrapheme asNFKC)) isNFKD not].
Answers:
<Boolean>
isSmalltalkIdentifier
Answer true if the receiver is a valid Smalltalk identifier as described in the ANSI Smalltalk Standard; otherwise answer false.
identifier ::= letter (letter | digit)*
letter ::= uppercaseAlphabetic | lowercaseAlphabetic | nonCaseLetter
uppercaseAlphabetic ::= 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z'
lowercaseAlphabetic ::= 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z'
nonCaseLetter ::= '_'
digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
Answers:
<Boolean>
isString
Answer true if the receiver is a logical String. This
includes UnicodeString, String, DBString, and Symbol.
Answers:
<Boolean>
isUnicodeString
Answer true if the receiver is a unicode string, false otherwise.
Answers:
<Boolean>
join:
Append the elements of the argument, @aCollection, separating them by the receiver.
Examples:
self assert: [('*' asUnicodeString join: #('WWWWW' 'W EW' 'zzzz')) = 'WWWWW*W EW*zzzz']
Arguments:
aCollection - <Collection>
Answers:
<String | Symbol>
lines
Answer an array of lines composing this receiver without the line ending delimiters.
Example:
| str |
str := ('Hello' , String cr , 'fellow' , String lf , 'Smalltalker' , String crlf , 's') asUnicodeString.
self assert: [str lines = { 'Hello'. 'fellow'. 'Smalltalker'. 's' }].
Answers:
<Array> of lines
linesDo:
Execute aBlock with each line in this string.
The terminating line delimiters CR, LF or CRLF pairs
are not included in what is passed to aBlock
Example:
| str lines |
str := ('Hello' , String cr , 'fellow' , String lf , 'Smalltalker' , String crlf , 's') asUnicodeString.
lines := OrderedCollection new.
str linesDo: [:line | lines add: line].
self assert: [lines asArray = { 'Hello'. 'fellow'. 'Smalltalker'. 's' }].
Arguments:
aBlock - <Block> 1-arg block that is evaluated for each line
Raises:
Fail if aBlock is not a one argument block.
match:
Answer whether the receiver (which may contain wildcard characters) matches the
<readableString> argument, ignoring case differences.
Note that the pattern matching characters are *, matching any sequence of characters
and # matching any single character. The latter differs from the usual ? for historical
reasons.
Example:
self assert: ['#malltalk' match: 'Smalltalk'].
self assert: ['sm*al*' match: 'Smalltalk'].
Arguments:
aString - <EsString> string to apply pattern match against
Answers:
<Boolean> true if this pattern matches aString, false otherwise
match:ignoreCase:
Answer whether the receiver (which may contain wildcard characters) matches the
<readableString> argument, aString, ignoring or respecting case differences depending
on the <Boolean> argument, aBoolean.
Note that the pattern matching characters are *, matching any sequence of characters
and # matching any single character. The latter differs from the usual ? for historical
reasons.
Example:
self assert: ['#Malltalk' match: 'Smalltalk' ignoreCase: true].
self assert: [('#Malltalk' match: 'Smalltalk' ignoreCase: false) not].
Arguments:
aString - <EsString> string to apply pattern match against
aBoolean - <Boolean> ignoreCase option
Answers:
<Boolean> true if this pattern matches aString, false otherwise
matchPatternFrom:in:from:ignoreCase:
Answer whether the receiver matches aString (starting
at patternStart in the receiver and sourceStart in aString).
The receiver may contain wildcards. Any differences in case between individual
characters are ignored if the <Boolean> argument, aBoolean, is true.
Example:
self assert: ['#Malltalk' matchPatternFrom: 1 in: 'Smalltalk' from: 1 ignoreCase: true].
self assert: [('#Malltalk' matchPatternFrom: 1 in: 'Smalltalk' from: 1 ignoreCase: false) not].
Arguments:
patternStart - <Integer> location to start pattern
aString - <EsString> string to apply pattern match against
sourceStart - <Integer> location to start matching from in aString
aBoolean - <Boolean> ignoreCase option
Answers:
<Boolean> true if this pattern matches aString, false otherwise
names
Answer the names of all the unicode scalars in the receiver
as an array.
Example:
self assert: ['a' asUnicodeString names = { 'LATIN SMALL LETTER A' }].
self assert: [233 asUnicodeString names = { 'LATIN SMALL LETTER E WITH ACUTE' }].
self assert: [233 asUnicodeString asNFD names = { 'LATIN SMALL LETTER E'. 'COMBINING ACUTE ACCENT' }]
Answers:
<Array>
nullTerminated
Answer a new string containing a zero at the end.
In order to remain backward compatible with String,
we will actually write an explicit Null (0) at the end
even though increasing the capacity (usable length) of
the UnicodeString by 1 is always good enough.
NOTE: For FFI, it is highly suggested to use asPSZ if you
can...which will just use the current UnicodeString and
the storage *may* be increased by 1 to accommodate a
NULL.
Example:
| str |
str := 'a' asUnicodeString.
self assert: [str utf8 asByteArray = #[97]].
self assert: [str nullTerminated utf8 asByteArray = #[97 0]]
Answers:
<UnicodeString>
occurrencesOf:
Answer an Integer indicating how many of the receiver's elements
are equivalent to anObject.
Example:
self assert: [('Smalltalk' asUnicodeString occurrencesOf: $l) = 3].
self assert: [('Smalltalk' asUnicodeString occurrencesOf: $z) = 0].
Arguments:
anObject - <Grapheme | Character>
Answers:
<Integer> number of occurrences
ownsStorage
Answer true if the backing storage is exclusively owned
by this unicode string and has never been shared, false otherwise.
Copy-On-Write:
A feature of copy-on-write is that the backing storage can be shared
with other unicode strings for storage savings.
Example:
| str copy |
str := 'Smalltalk' asUnicodeString.
self assert: [str ownsStorage].
copy := str copy.
self assert: [str ownsStorage not & copy ownsStorage not].
'Trigger Copy-On-Write Barrier of copy'.
copy at: 6 put: $w asGrapheme.
'This creates unique storage for copy'.
self assert: [copy ownsStorage & (copy = 'Smallwalk')].
'Original str is unaware that last sharer went away'.
self assert: [str sharesStorage & (str = 'Smalltalk')].
Answers:
<Boolean>
reject:
Answer a Collection that is created by iteratively evaluating the
one-argument block, aBlock, for each element of the receiver,
and adding the element to the new collection
only if aBlock evaluates to the Boolean false.
Example:
| str |
str := '!Smalltalk!' asUnicodeString reject: [:g | g isAsciiPunctuation].
self assert: [str = 'Smalltalk']
Arguments:
aBlock - <Block>
Answers:
<UnicodeString>
Raises:
Fail if aBlock is not a one-argument Block.
Fail if the result of aBlock is not a Boolean.
replaceFrom:to:with:startingAt:
Replace the elements of the receiver between Integer indices
@start and @stop with the elements of the argument @repString
starting at Integer index @repStart. Answer self.
Example:
| str |
str := 'aaaaaaaaaa' asUnicodeString.
str replaceFrom: 1 to: 1 with: 'b' startingAt: 1.
self assert: [str = 'baaaaaaaaa'].
str replaceFrom: 2 to: 3 with: 'dcc' startingAt: 2.
self assert: [str = 'bccaaaaaaa'].
str replaceFrom: 4 to: 10 with: 'ddddddd' startingAt: 1.
self assert: [str = 'bccddddddd']
Arguments:
start - <Integer> 1-based index
stop - <Integer> 1-based index
repString - <SequenceableCollection>
repStart - <Integer> 1-based index
Answers:
<UnicodeString> self
Raises:
Fail if @start is not an Integer.
Fail if @stop is not an Integer.
Fail if @repStart is not an Integer.
Fail if @start is < 1.
Fail if @start is > size of the receiver.
Fail if @stop is < 1.
Fail if @repStart is < 1.
Fail if @repStart is > size of @repString.
Fail if @repString is not a String or UnicodeString.
reservedStorage
Answer the amount of reserved space (reported in ascii character units) remaining in the backing storage.
Example:
| str |
str := UnicodeString new: 0.
self assert: [str reservedStorage = 0].
str reservedStorage: 100.
self assert: [str reservedStorage = 100]
Answers:
<Integer> number of bytes
reservedStorage:
Reserve enough capacity in the backing storage to store @reservedCapacity worth of ASCII characters.
This is an optimization method that can be used when the caller knows the string will be growing rapidly.
Sizing this appropriately can prevent unnecessary storage growth.
Example:
| str |
str := UnicodeString new: 0.
self assert: [str reservedStorage = 0].
self assert: [(str reservedStorage: 1000) = 1000].
self assert: [(str reservedStorage: 2) = 1000]
Arguments:
reservedCapacity - <Integer>
Answers:
<Integer> The actual amount of reserved capacity available
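A typical use is to reserve capacity once before a run of appends. The sketch below is illustrative: the figure of 1000 is arbitrary, and it makes no claim about how much reserved space remains afterwards:

```smalltalk
| str |
str := UnicodeString new.
"Reserve room for 1000 ASCII characters before the appends begin."
str reservedStorage: 1000.
1 to: 1000 do: [:i | str add: $a].
self assert: [str size = 1000].
self assert: [str isAscii]
```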
reverse
Answer an object conforming to the same protocols as the receiver, but with its
elements arranged in reverse order and Unicode correct.
Example:
| str |
'a, GRINNING FACE, c'.
str := UnicodeString new.
str add: $a; add: 16r1F600 asGrapheme; add: $c.
self assert: [(str reverse graphemes contentsInto: Array) = { $c asGrapheme. 16r1F600 asGrapheme. $a asGrapheme }]
Answers:
<UnicodeString> reversed
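"Unicode correct" here means that reversal works on graphemes, not on individual scalars. A hedged sketch using a combining mark, assuming that equality between two <UnicodeString>s compares their contents as in the other examples:

```smalltalk
| str reversed |
"'a' followed by 'e' + COMBINING ACUTE ACCENT (two scalars, one grapheme)."
str := UnicodeString escaped: 'ae\u{301}'.
reversed := str reverse.
self assert: [str size = 2].
self assert: [reversed size = 2].
"The accent stays attached to the 'e' instead of moving onto the 'a'."
self assert: [reversed = (UnicodeString escaped: 'e\u{301}a')]
```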
reverseDo:
Iteratively evaluate the one argument block, aBlock using each
element of the receiver, in reverse order and Unicode correct.
Example:
| str col |
str := 'Smalltalk' asUnicodeString.
col := (str graphemes contentsInto: Array) asOrderedCollection.
str reverseDo: [:g | self assert: [g = col removeLast]].
self assert: col isEmpty
Arguments:
aBlock - <Block> 1-arg block that evaluates each element
Raises:
Fail if aBlock is not a one argument block.
sameAs:
Answer a Boolean which is true if the receiver collates precisely
with aString and false otherwise. The collation sequence ignores the
case of the Characters.
Example:
| str |
self assert: ['a' asUnicodeString sameAs: 'a' asUnicodeString].
self assert: ['a' asUnicodeString sameAs: 'A' asUnicodeString].
'Comparing e with accent in NFC and NFD forms is still the same'.
self assert: [16rE9 asUnicodeString sameAs: (UnicodeString escaped: 'e\u{301}')]
Arguments:
aString - <UnicodeString | String>
Answers:
<Boolean>
Raises:
Fail if aString is not a String.
select:
Answer a Collection that is created by iteratively evaluating the
one-argument block, aBlock, for each element of the receiver,
and adding the element to the new collection
only if aBlock evaluates to the Boolean true.
Example:
| str |
str := '!Smalltalk!' asUnicodeString select: [:g | g isAsciiPunctuation].
self assert: [str = '!!']
Arguments:
aBlock - <Block>
Answers:
<UnicodeString>
Raises:
Fail if aBlock is not a one-argument Block.
Fail if the result of aBlock is not a Boolean.
sharesStorage
Answer true if the backing storage of this unicode string is (or was)
being shared with other UnicodeString objects, false otherwise.
Copy-On-Write:
A feature of copy-on-write is that the backing storage can be shared
with other unicode strings for storage savings.
Example:
| str copy |
str := 'Smalltalk' asUnicodeString.
self assert: [str sharesStorage not].
copy := str copy.
self assert: [str sharesStorage & copy sharesStorage].
'Trigger Copy-On-Write Barrier of copy'.
copy at: 6 put: $w asGrapheme.
'This creates unique storage for copy'.
self assert: [copy sharesStorage not & (copy = 'Smallwalk')].
'Original str is unaware that last sharer went away'.
self assert: [str sharesStorage & (str = 'Smalltalk')].
Answers:
<Boolean>
shrinkStorage
Ensure there is no extra capacity (room to grow) in the backing storage.
This is an optimization method that can be used when the caller wants to conserve memory.
Ideally, this should be a candidate for usage when it is known the receiver will not have any
further modifications that would impact the size of the string.
Example:
| str |
str := UnicodeString new: 1000.
self assert: [str reservedStorage = 1000].
str shrinkStorage.
self assert: [str reservedStorage = 0]
Answers:
<UnicodeString> receiver
size
The size of the unicode string is expressed in <Grapheme>s.
Performance Note:
While this can be an O(1) operation, it is not guaranteed unless the contents
of the receiver are single-scalar ascii (i.e. Ascii - No CrLf).
If the contents are ASCII with CrLf or non-ASCII, then the grapheme breaking rules defined in Unicode Standard Annex #29
are used, making this an O(n) operation.
Example:
'O(1) because it is single-scalar ascii'.
self assert: ['Smalltalk' asUnicodeString size = 9].
'O(n) because it is multi-scalar ascii (due to CrLf).'.
'Notice the size is 10, not 11. CrLf is a single Grapheme'.
self assert: [(('Small' , String crlf , 'talk') asUnicodeString size) = 10].
'O(n) because it contains non-ascii characters in NFC normalized form'.
self assert: [(('Small' , 233 asUnicodeString , 'talk') asUnicodeString ensureNFC size) = 10].
self assert: [(('Small' , 233 asUnicodeString , 'talk') asUnicodeString ensureNFC unicodeScalars size) = 10].
'O(n) because it contains non-ascii characters in NFD normalized form'.
'The NFC and NFD counts for unicode scalars are different for the same string. But notice that Graphemes are consistent'.
self assert: [(('Small' , 233 asUnicodeString , 'talk') asUnicodeString ensureNFD size) = 10].
self assert: [(('Small' , 233 asUnicodeString , 'talk') asUnicodeString ensureNFD unicodeScalars size) = 11].
Answers:
<Integer>
subStrings
Answer an array which contains the words in the receiver.
Examples:
self assert: [('A multiline ' , String crlf , 'string') asUnicodeString subStrings = { 'A'. 'multiline'. 'string' }]
Answers:
<Array>
subStrings:
Answer an array containing the substrings in the receiver
separated by the elements of @separators.
Answer an array of strings. Each element represents a group
of characters separated by any of the characters in @separators.
Implementation Notes:
The CLDT protocol says @separators is a single Character while
the ANSI protocol says it is a collection of Characters. This
implementation supports both protocols.
Consecutive separators are treated as a single separation point.
Leading or trailing separators are ignored.
Examples:
self assert: [(('A multiline ' , String crlf , 'string') asUnicodeString subStrings: Grapheme crlf) = { 'A multiline '. 'string' }]
Arguments:
separators - <SequenceableCollection>
Answers:
<Array>
Raises:
If @separators contains anything other than Characters.
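The rules for consecutive, leading, and trailing separators can be sketched as follows. This assumes a <String> of separator characters is accepted, per the ANSI form mentioned in the implementation notes:

```smalltalk
| parts |
"Leading, trailing, and doubled commas produce no empty substrings."
parts := ',a,,b,' asUnicodeString subStrings: ','.
self assert: [parts size = 2].
self assert: [parts = { 'a'. 'b' }]
```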
trimBlanks
Answer a Unicode string containing the receiver string with leading and
trailing space characters removed.
Examples:
self assert: [' Smalltalk ' asUnicodeString trimBlanks = 'Smalltalk']
Answers:
<UnicodeString>
trimNull
Answer an instance of the receiver that is not
terminated by an ASCII Null. If the receiver is not
null terminated, do not create a new instance.
Example:
self assert: [UnicodeString new trimNull isEmpty].
self assert: [UnicodeString new nullTerminated trimNull isEmpty].
self assert: ['a' asUnicodeString nullTerminated trimNull = 'a']
Answers:
<UnicodeString>
trimSeparators
Answer a String containing the receiver string with leading and
trailing separators removed.
Examples:
| str |
str := UnicodeString new.
str
add: Grapheme space;
add: Grapheme cr;
add: Grapheme lf;
addAll: 'Smalltalk';
add: Grapheme space;
add: Grapheme space;
add: Grapheme crlf.
self assert: [str trimSeparators = 'Smalltalk']
Answers:
<UnicodeString>
unicodeScalars
Answer the unicode scalar view of the receiver.
A unicode scalar <UnicodeScalar> represents a 'unicode scalar value', which is similar to,
but not the same as, a 'unicode code point' as it will never represent high/low-surrogate
code points reserved for UTF-16 encoding.
Copy-On-Write:
This view is copy-on-write enabled. A consistent view of the elements
that existed at the time of creation is provided, even if an outside mutator is making
modifications to this UnicodeString.
Example:
| view |
view := 'Hello' asUnicodeString unicodeScalars.
self assert: [view size = 'Hello' size].
self assert: [view contents = ('Hello' asArray collect: [:c | c asUnicodeScalar])].
self assert: [view asByteArray = 'Hello' asByteArray].
self assert: [view next = $H asUnicodeScalar]
Answers:
<UnicodeScalarView>
unshareStorage
If the backing storage of the receiver is marked as shared, then
invoke the copy-on-write barrier to make a unique copy of the
backing storage to ensure storage ownership. This is effectively
a no-op if the receiver already owns the backing storage.
Example:
| str str2 |
str := 'Smalltalk' asUnicodeString.
self assert: [str ownsStorage].
str2 := str copy.
self assert: [str sharesStorage & str2 sharesStorage].
str unshareStorage.
self assert: [str sharesStorage not & str2 sharesStorage].
str2 unshareStorage.
self assert: [str sharesStorage not & str2 sharesStorage not]
utf16
Answer the utf16 platform-endian view of the receiver.
Each element in this view is a UTF-16 code unit. UTF-16 is a 16-bit
encoded form of unicode scalar values.
Copy-On-Write:
This view is copy-on-write enabled. A consistent view of the elements
that existed at the time of creation is provided, even if an outside mutator is making
modifications to this UnicodeString.
Example:
| view |
'MUSICAL NOTE - U+1F3B5'.
view := 16r1F3B5 asUnicodeString utf16.
self assert: [view size = 2].
self assert: [view next = 55356].
self assert: [view next = 57269].
self assert: [view atEnd]
Answers:
<Utf16View>
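The two code units in the example are the UTF-16 surrogate pair for U+1F3B5. The arithmetic behind them can be reproduced with plain integer operations. This is a worked illustration of the standard UTF-16 encoding, not of how the view is implemented:

```smalltalk
| scalar offset high low |
scalar := 16r1F3B5.
"Scalars above U+FFFF are offset by 16r10000, leaving a 20-bit value."
offset := scalar - 16r10000.
"The upper 10 bits select the high surrogate, the lower 10 bits the low surrogate."
high := 16rD800 + (offset bitShift: -10).
low := 16rDC00 + (offset bitAnd: 16r3FF).
self assert: [high = 55356].
self assert: [low = 57269]
```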
utf16BE
Answer the utf16 big-endian view of the receiver.
Each element in this view is a UTF-16 code unit. UTF-16 is a 16-bit
encoded form of unicode scalar values.
Copy-On-Write:
This view is copy-on-write enabled. A consistent view of the elements
that existed at the time of creation is provided, even if an outside mutator is making
modifications to this UnicodeString.
Example:
| view |
'MUSICAL NOTE - U+1F3B5'.
view := 16r1F3B5 asUnicodeString utf16BE.
self assert: [view size = 2].
self assert: [view next = 15576].
self assert: [view next = 46559].
self assert: [view atEnd]
Answers:
<Utf16BigEndianView>
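The values in the example are the utf16 example's code units with their two bytes exchanged. That is consistent with big-endian code units being read as integers on a little-endian host, though this reading is an assumption and is not stated above. The byte swap itself can be checked directly:

```smalltalk
| swap |
"Exchange the two bytes of a 16-bit code unit."
swap := [:unit | ((unit bitAnd: 16rFF) bitShift: 8) bitOr: (unit bitShift: -8)].
self assert: [(swap value: 55356) = 15576].
self assert: [(swap value: 57269) = 46559]
```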
utf16LE
Answer the utf16 little-endian view of the receiver.
Each element in this view is a UTF-16 code unit. UTF-16 is a 16-bit
encoded form of unicode scalar values.
Copy-On-Write:
This view is copy-on-write enabled. A consistent view of the elements
that existed at the time of creation is provided, even if an outside mutator is making
modifications to this UnicodeString.
Example:
| view |
'MUSICAL NOTE - U+1F3B5'.
view := 16r1F3B5 asUnicodeString utf16LE.
self assert: [view size = 2].
self assert: [view next = 55356].
self assert: [view next = 57269].
self assert: [view atEnd]
Answers:
<Utf16LittleEndianView>
utf32
Answer the utf32 platform-endian view of the receiver.
Each element in this view is a UTF-32 code unit. UTF-32 is a 32-bit
encoded form of unicode scalar values.
Copy-On-Write:
This view is copy-on-write enabled. A consistent view of the elements
that existed at the time of creation is provided, even if an outside mutator is making
modifications to this UnicodeString.
Example:
| view |
'MUSICAL NOTE - U+1F3B5'.
view := 16r1F3B5 asUnicodeString utf32.
self assert: [view size = 1].
self assert: [view next = 16r1F3B5].
self assert: [view atEnd]
Answers:
<Utf32View>
utf32BE
Answer the utf32 big-endian view of the receiver.
Each element in this view is a UTF-32 code unit. UTF-32 is a 32-bit
encoded form of unicode scalar values.
Copy-On-Write:
This view is copy-on-write enabled. A consistent view of the elements
that existed at the time of creation is provided, even if an outside mutator is making
modifications to this UnicodeString.
Example:
| view |
'MUSICAL NOTE - U+1F3B5'.
view := 16r1F3B5 asUnicodeString utf32BE.
self assert: [view size = 1].
self assert: [view next = 3052601600].
self assert: [view asByteArray = #[0 1 243 181]].
self assert: [view atEnd]
Answers:
<Utf32View>
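The integer 3052601600 in the example follows from the byte sequence #[0 1 243 181]. The sketch below reads those four bytes in both orders. It illustrates the arithmetic only and assumes nothing about the view's internals:

```smalltalk
| bytes bigEndian littleEndian |
bytes := #[0 1 243 181].
"Most significant byte first recovers the scalar value."
bigEndian := bytes inject: 0 into: [:acc :b | (acc bitShift: 8) + b].
"Least significant byte first gives the integer shown in the example."
littleEndian := bytes reverse inject: 0 into: [:acc :b | (acc bitShift: 8) + b].
self assert: [bigEndian = 16r1F3B5].
self assert: [littleEndian = 3052601600]
```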
utf32LE
Answer the utf32 little-endian view of the receiver.
Each element in this view is a UTF-32 code unit. UTF-32 is a 32-bit
encoded form of unicode scalar values.
Copy-On-Write:
This view is copy-on-write enabled. A consistent view of the elements
that existed at the time of creation is provided, even if an outside mutator is making
modifications to this UnicodeString.
Example:
| view |
'MUSICAL NOTE - U+1F3B5'.
view := 16r1F3B5 asUnicodeString utf32LE.
self assert: [view size = 1].
self assert: [view next = 16r1F3B5].
self assert: [view atEnd]
Answers:
<Utf32View>
utf8
Answer the utf8 view of the receiver.
Each element in this view is a UTF-8 code unit. UTF-8 is an 8-bit
encoded form of unicode scalar values.
Copy-On-Write:
This view is copy-on-write enabled. A consistent view of the elements
that existed at the time of creation is provided, even if an outside mutator is making
modifications to this UnicodeString.
Example:
| view |
'LATIN SMALL LETTER E WITH ACUTE'.
view := 233 asUnicodeString utf8.
self assert: [view size = 2].
self assert: [view next = 195].
self assert: [view next = 169].
self assert: [view atEnd]
Answers:
<Utf8View>
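The two code units 195 and 169 in the example follow from the standard two-byte UTF-8 layout. A worked illustration with plain integer arithmetic, independent of the view's implementation:

```smalltalk
| scalar lead trail |
scalar := 233.
"Two-byte UTF-8 form: 110xxxxx 10xxxxxx, valid for scalars U+0080 through U+07FF."
lead := 16rC0 bitOr: (scalar bitShift: -6).
trail := 16r80 bitOr: (scalar bitAnd: 16r3F).
self assert: [lead = 195].
self assert: [trail = 169]
```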