=encoding utf8

=head1 TITLE

DRAFT: Synopsis 32: Setting Library - Str

=head1 AUTHORS

 Rod Adams
 Larry Wall
 Aaron Sherman
 Mark Stosberg
 Carl Mäsak
 Moritz Lenz
 Tim Nelson

=head1 VERSION

 Created: 19 Mar 2009 extracted from S29-functions.pod
 Last Modified: 17 Apr 2009
 Version: 3

The document is a draft.

If you read the HTML version, it is generated from the Pod in the pugs
repository under /docs/Perl6/Spec/S32-setting-library/Str.pod so edit it
there in the SVN repository if you would like to make changes.

=head1 Str

General notes about strings:

A Str can exist at several Unicode levels at once. Which level you
interact with typically depends on what your current lexical context has
declared the "working Unicode level" to be. Default is C.

[Default can't be C because we don't go into "language" mode unless
there's a specific language declaration saying either exactly what
language we're going into or, in the absence of that, how to find the
exact language somewhere in the environment.]

Attempting to use a string at a level higher than it can support is
handled without warning. The current highest supported level of the
string is simply mapped Char for Char to the new higher level. However,
attempting to stuff something of a higher level into a lower-level
string is an error (for example, attempting to store Kanji in a Byte
string). An explicit conversion function must be used to tell it how you
want it encoded.

Attempting to use a string at a level lower than what it supports is not
allowed.

If a function takes a C<Str> and returns a C<Str>, the returned C<Str>
will support the same levels as the input, unless specified otherwise.

The following are all provided by the C<Str> role:

=over

=item p5chop

 our Char multi method p5chop ( Str $string is rw: ) is export(:P5)
 my Char multi p5chop ( Str *@strings is rw ) is export(:P5)

Trims the last character from C<$string>, and returns it. Called with a
list, it chops each item in turn, and returns the last character
chopped.

=item chop

 our Str multi method chop ( Str $string: ) is export

Returns string with one Char removed from the end.

=item p5chomp

 our Int multi method p5chomp ( Str $string is rw: ) is export(:P5)
 my Int multi p5chomp ( Str *@strings is rw ) is export(:P5)

Related to C<p5chop>, but only removes trailing chars that match C. In
either case, it returns the number of chars removed.

=item chomp

 our Str multi method chomp ( Str $string: ) is export

Returns string with one newline removed from the end. An arbitrary
terminator can be removed if the input filehandle has marked the string
for where the "newline" begins. (Presumably this is stored as a property
of the string.) Otherwise a standard newline is removed.

Note: Most users should just let their I/O handles autochomp instead.
(Autochomping is the default.)

=item lc

 our Str multi method lc ( Str $string: ) is export

Returns the input string after converting each uppercase character to
its lowercase form.

=item lcfirst

 our Str multi method lcfirst ( Str $string: ) is export

Like C<lc>, but only affects the first character.

=item uc

 our Str multi method uc ( Str $string: ) is export

Returns the input string after converting each lowercase character to
its uppercase form. This is not a Unicode "titlecase" operation, but a
full "uppercase".

=item ucfirst

 our Str multi method ucfirst ( Str $string: ) is export

Performs a Unicode "titlecase" operation on the first character of the
string.

=item normalize

 our Str multi method normalize ( Str $string: Bool :$canonical = Bool::True, Bool :$recompose = Bool::False ) is export

Performs a Unicode "normalization" operation on the string. This
involves decomposing the string into its most basic combining elements,
and potentially re-composing it. Full detail on the process of
decomposing and re-composing strings in a normalized form is covered in
Sections 3.7 (Decomposition) and 3.11 (Canonical Ordering Behavior) of
the Unicode Standard, version 4.0.

Additional named parameters are reserved for future Unicode expansion.

For everyday use there are aliases that map to the I document's names
for the various modes of normalization:

 our Str multi method nfd ( Str $string: ) is export {
     $string.normalize(:canonical, :!recompose);
 }
 our Str multi method nfc ( Str $string: ) is export {
     $string.normalize(:canonical, :recompose);
 }
 our Str multi method nfkd ( Str $string: ) is export {
     $string.normalize(:!canonical, :!recompose);
 }
 our Str multi method nfkc ( Str $string: ) is export {
     $string.normalize(:!canonical, :recompose);
 }

Decomposing a string can be used to compare Unicode strings in a binary
form, provided that they use the same encoding. Without decomposing
first, two Unicode strings may contain the same text, but not the same
byte-for-byte data, even in the same encoding. The decomposition of a
string is performed according to tables in the Unicode standard, and
should be compatible with decompositions performed by any system.

The C<:canonical> flag controls the use of "compatibility
decompositions". For example, in canonical mode, "ﬁ" (U+FB01, the fi
ligature) is left unaffected because it is not a composition. However,
in compatibility mode, it will be replaced with the two letters "fi".
Decomposed sequences will be ordered in a canonical way in either mode.

The C<:recompose> flag controls the re-composition of decomposed forms.
That is, a combining sequence will be re-composed into the canonical
composite where possible.

These de-compositions and re-compositions are performed recursively,
until there is no further work to be done.

Note that this function is really only applicable when dealing with
codepoint strings. Grapheme strings are normally processed at a higher
abstraction level that is independent of normalization, and are lazily
normalized into the desired normalization when transferred to lexical
scopes or handles that care.
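
The flag combinations above can be tried out in any environment that implements the Unicode normalization forms. The following is a minimal sketch in Python, not the Perl 6 API itself: it uses Python's standard C<unicodedata> module, and the helper name C<normalize> and its keyword arguments are chosen here only to mirror the C<:canonical> and C<:recompose> flags.

```python
import unicodedata


def normalize(text: str, canonical: bool = True, recompose: bool = False) -> str:
    """Map the (canonical, recompose) flag pair onto a normalization form name.

    Hypothetical helper: canonical/recompose mirror the flags described above;
    the defaults (True, False) select NFD, as in the draft signature.
    """
    form = "NF" + ("" if canonical else "K") + ("C" if recompose else "D")
    return unicodedata.normalize(form, text)


composed = "\u00e9"      # e-acute as one precomposed codepoint
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI

# Same text, different codepoints until both sides are normalized.
assert composed != decomposed
assert normalize(composed) == normalize(decomposed) == decomposed
assert normalize(decomposed, recompose=True) == composed

# Canonical mode leaves the ligature alone; compatibility mode expands it.
assert normalize(ligature) == ligature
assert normalize(ligature, canonical=False) == "fi"
```

Comparing the decomposed forms is what makes the two spellings of the accented letter compare equal; comparing the raw strings does not.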

=item samecase

 our Str multi method samecase ( Str $string: Str $pattern ) is export

Has the effect of making the case of the string match the case pattern
in C<$pattern>. (Used by s:ii/// internally, see L.)

=item samemark

 our Str multi method samemark ( Str $string: Str $pattern ) is export

Has the effect of making the marks of the string match the marking
pattern in C<$pattern>. (Used by s:mm/// internally, see L.)

=item capitalize

 our Str multi method capitalize ( Str $string: ) is export

Has the effect of first doing an C on the entire string, then performing
a C on it.

=item length

This word is banned in Perl 6. You must specify units.

=item chars

 our Int multi method chars ( Str $string: ) is export

Returns the number of characters in the string in the current (lexically
scoped) idea of what a normal character is, usually graphemes.

=item graphs

 our Int multi method graphs ( Str $string: ) is export

Returns the number of graphemes in the string in a language-independent
way.

=item codes

 our Int multi method codes ( Str $string: $nf = $?NF) is export

Returns the number of codepoints in the string if it were canonicalized
the specified way. Do not confuse codepoints with UTF-16 code units:
characters above U+FFFF count as a single codepoint.

=item bytes

 our Int multi method bytes ( Str $string: $enc = $?ENC, :$nf = $?NF) is export

Returns the number of bytes in the string if it were encoded in the
specified way. Note the inequality:

 .bytes("UTF-16","C") >= .codes("C") * 2

This is caused by the possibility of surrogate pairs, which are counted
as one codepoint. However, this problem does not arise for UTF-32:

 .bytes("UTF-32","C") == .codes("C") * 4

=item encode

 our Buf multi method encode($encoding = $?ENC, $nf = $?NF)

Returns a C<Buf> which represents the original string in the given
encoding and normal form. The actual return type is as specific as
possible, so C<$str.encode('UTF-8')> returns an C object,
C<$str.encode('ISO-8859-1')> a C.
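
The relationship between codepoint counts and byte counts described under C<codes> and C<bytes> can be checked with a short sketch. This is Python rather than Perl 6, and the helper names C<codes> and C<byte_count> are hypothetical stand-ins for the methods above. It also assumes that the byte count excludes a byte-order mark, which the text does not specify, so the BOM-free C<utf-16-le> and C<utf-32-le> codecs are used.

```python
import unicodedata


def codes(text: str, nf: str = "NFC") -> int:
    """Number of codepoints after normalizing to the given form."""
    return len(unicodedata.normalize(nf, text))


def byte_count(text: str, enc: str = "utf-8", nf: str = "NFC") -> int:
    """Number of bytes after normalizing and then encoding (no BOM assumed)."""
    return len(unicodedata.normalize(nf, text).encode(enc))


# 'a', precomposed e-acute, and MUSICAL SYMBOL G CLEF (U+1D11E, above U+FFFF).
sample = "a\u00e9\U0001d11e"

# The normalization form changes the codepoint count.
assert codes(sample, "NFC") == 3
assert codes(sample, "NFD") == 4   # e-acute splits into 'e' + combining accent

# UTF-16 needs a surrogate pair (4 bytes) for the clef: 2 + 2 + 4 = 8 bytes.
assert byte_count(sample, "utf-16-le", "NFC") == 8
assert byte_count(sample, "utf-16-le", "NFC") >= codes(sample, "NFC") * 2

# UTF-32 always uses exactly 4 bytes per codepoint.
assert byte_count(sample, "utf-32-le", "NFC") == codes(sample, "NFC") * 4
```

The UTF-16 comparison is a strict inequality for this sample only because it contains a character outside the Basic Multilingual Plane; for text without such characters the two sides are equal.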

=item index

 our StrPos multi method index( Str $string: Str $substring, StrPos $pos = StrPos(0) ) is export
 our StrPos multi method index( Str $string: Str $substring, Int $pos ) is export

C<index> searches for the first occurrence of C<$substring> in
C<$string>, starting at C<$pos>. If C<$pos> is an C<Int>, it is taken to
be in the units of the calling scope, which defaults to "graphemes".

The value returned is always a C<StrPos> object. If the substring is
found, then the C<StrPos> represents the position of the first character
of the substring. If the substring is not found, a bare C<StrPos>
containing no position is returned. This prototype C<StrPos> evaluates
to false because it's really a kind of undefined value. Do not evaluate
it as a number, because instead of returning -1 it will return 0 and
issue a warning.

=item pack

 our buf8 multi pack( *@items where { all(@items) ~~ Pair } )
 our buf8 multi pack( Str $template, *@items )

C<pack> takes a list of pairs and formats the values according to the
specification of the keys. Alternately, it takes a string C<$template>
and formats the rest of its arguments according to the specifications in
the template string. The result is a sequence of bytes.

Templates are strings of the form:

 grammar Str::PackTemplate {
  regex TOP { ^