# Copyright (C) 2003-2007, The Perl Foundation.

=pod

=head0 Grammars and Regexes

TODO: This chapter is outdated in some ways

  * It should be explained when we use "rule" and when "regex", and
    what a "subrule" is.
  * The interpolation rules are outdated.
  * Some of the assertion syntax has changed, and some forms mean
    something different now.
  * Modifiers: explain the :ratchet modifier.
  * The match object needs more explanation.

Perl 6 "regular expressions" are so far beyond the formal definition of
regular expressions that we don't use that name anymore, but simply
stick to the abbreviation I<regex>. Perl 6 regexes bring the full power
of recursive descent parsing to the core of Perl, but are comfortably
useful even if you don't know anything about recursive descent parsing.
In the usual case, all you'll ever need to know is that regexes are
patterns for matching text.

=head1 Using Regexes

Regexes are a language within a language, with their own syntax and
conventions. At the highest level, though, they're just another set of
Perl constructs. So the first thing to learn about regexes is the Perl
"glue" code for creating and using them.

=head2 Immediate Matches

The simplest way to create and use a regex is an immediate match. A
regex defined with the C<m//> operator always matches immediately.
Substitutions, defined with the C<s///> operator, also match
immediately. A regex defined with the C<//> operator matches immediately
when it's in void, boolean, string, or numeric context, or when it is
the argument of the smart-match operator (C<~~>).
X<~ (tilde);~~ (smart match) operator>

  if $string ~~ m/ \w+ /      {...}
  if $string ~~ s/ \w+ /word/ {...}
  if $string ~~ / \w+ /       {...}

You can substitute other delimiters, like C<#...#>, C<[...]>, and
C<{...}> for the standard C</.../>, though C<?...?> and C<(...)> are not
valid delimiters:

  $string ~~ s/\w+/word/
  $string ~~ s[\w+][word]    # The same
  $string ~~ s{\w+}{word}    # The same
  $string ~~ s#\w+#word#     # The same
  $string ~~ s(\w+)(word)    # Wrong!
  $string ~~ s?\w+?word?     # Wrong!

Modifiers now come in front of the pattern, so multiple substitutions on
the same string look like this:

  $string ~~ s:g/\w+/word/

Also, if you use brackets on the first part of a substitution, the
second part can be specified as a pseudoassignment:

  $string ~~ s[\w+] = 'word';

This form also allows assignment operators, so if you want to add one to
all the numbers within a string, you can say:

  $string ~~ s:g[\d+] += 1;

If you want to do some processing on the match, you can call a function
to prepare the replacement text too:

  $string ~~ s:g[\d+] = build_replacement()

=head2 Deferred Matches

Sometimes you want a little more flexibility than an immediate match.
The C<rx//> operator defines an anonymous regex that can be executed
later.

  $digits = rx/\d+/;

The simple C<//> operator also defines an anonymous regex in all
contexts other than void, boolean, string, or numeric, or as an argument
of C<~~>.

  $digits = /\d+/;   # store regex

You can use the unary context forcing operators, C<+>, C<?>, and C<~>,
to force the C<//> operator to match immediately in a context where it
ordinarily wouldn't. For a boolean value of success or failure, force
boolean context with C<?//>. For a count of matches, force numeric
context with C<+//>. For the matched string value, force string context
with C<~//>.

  $truth  = ?/\d+/;         # match $_ and return success
  $count  = +/(\d+\s+)*/;   # match $_ and return count
  $string = ~/^\w+/;        # match $_ and return string

Another option for deferred matches is a C<regex> block. The C<regex>
keyword defines a named or anonymous regex, in much the same way that
C<sub> declares a subroutine or C<method> declares a method. But the
code within the block of a C<regex> is regex syntax, not Perl syntax.

  $digits = regex {\d+};
  regex digits {\d+}

There are two more keywords that define regexes similarly to C<regex>,
but imply slightly different behavior.
C<token> introduces a regex that does not backtrack, because it implies
the C<:ratchet> modifier (more details on that below; for now it's
enough to know that it matches simple regexes faster). C<rule> is the
same as C<token>, except that whitespace in the regex also matches
optional whitespace in the string, because it implies both the
C<:ratchet> and the C<:sigspace> modifiers.

To match a named or anonymous regex, call it as a subregex within
another regex. Subregexes, whether they're named regexes or a variable
containing an anonymous regex, are enclosed in assertion delimiters
C<< <...> >>. You can read more about assertions in "Assertions" later
in this chapter.

  $string ~~ /\d+/;
  # same as
  $string ~~ /<$digits>/;
  $string ~~ /<digits>/;

Table 7-1 summarizes the basic Perl syntax for defining regexes.

=begin table picture Rules

=headrow

=row

=cell Syntax

=cell Meaning

=bodyrows

=row

=cell C<m/.../>

=cell Match a pattern (immediate execution).

=row

=cell C<s/.../.../>

=cell Perform a substitution (immediate execution).

=row

=cell C<rx/.../>

=cell Define an anonymous regex (deferred execution).

=row

=cell C</.../>

=cell Immediately match or define an anonymous regex, depending on the
context.

=row

=cell C<regex {...}>

=cell Define an anonymous regex.

=row

=cell C<regex name {...}>

=cell Define a named regex.

=end table

=head2 Grammars

A grammar is a collection of regexes, in much the same way that a class
is a collection of methods. In fact, grammars are classes; they're just
classes that inherit from a universal base class. This means that
grammars can inherit from other grammars, and that they define a
namespace for their regexes.

  grammar Hitchhikers {
      token name { Zaphod | Ford | Arthur }
      token id   { \d+ }
      ...
  }

Any regex in the current grammar or in one of its parents can be called
directly, but a regex from an external grammar needs to have its package
specified:

  if $newsrelease ~~ / <Hitchhikers.name> / {
      send_alert($1);
  }

If you want to match against the entire grammar, you can define a regex
C<TOP> in that grammar.

  grammar Hitchhikers {
      regex TOP { }
      ...
  }

  $roster ~~ Hitchhikers;    # Calls Hitchhikers.TOP by default

Grammars are especially useful for complex text or data parsing. In
fact, overloading rules in the Perl 6 grammar itself is one way to
change how a program is parsed. Instead of writing complex custom source
filters, as was necessary in Perl 5, we can overload the rules in the
Perl 6 grammar class to change the very syntax of Perl 6 on the fly.

=head1 Building Blocks

Every language has a set of basic components (words or parts of words)
and a set of syntax rules for combining them. The "words" in regexes are
literal characters (or symbols), some metacharacters (or metasymbols),
and escape sequences, while the combining syntax includes other
metacharacters, quantifiers, bracketing characters, and assertions.

=head2 Metacharacters

The "word"-like metacharacters are C<.>, C<^>, C<^^>, C<$>, C<$$>. The
C<.> matches any single character, even a newline character. Actually,
Perl 6 has the notion of a Unicode level, which determines whether
string manipulation happens on the byte, codepoint, or grapheme level.
C<.> matches a character in the current level, which defaults to
grapheme. The Unicode level can be adjusted with a pragma or with
modifiers. We'll talk more about modifiers in "Modifiers" later in this
chapter. The C<^> and C<$> metacharacters are zero-width matches on the
beginning and end of a string. They each have doubled alternates C<^^>
and C<$$> that match at the beginning and end of every line within a
string.

The C<|>, C<&>, C<\>, C<#>, and C<:=> metacharacters are all syntax
structure elements. The C<|> is an alternation between two options. The
C<&> matches two patterns simultaneously (the patterns must be the same
length). The C<\> turns literal characters into metacharacters (the
escape sequences) or turns metacharacters into literal characters. The
C<#> marks a comment to the end of the line.
Whitespace insensitivity (the old C</x> modifier) is on by default, so
you can start a comment at any point on any line in a regex. Just make
sure you don't comment out the symbol that terminates the regex. The
C<:=> X<: (colon);:= (binding);in regex> binds a hypothetical variable
to the result of a subregex or grouped pattern. Hypotheticals are
covered in "Hypothetical Variables" later in this chapter.

The metacharacters C<()>, C<[]>, C<{}> and C<< <> >> are bracketing
pairs. The pairs always have to be balanced within the regex, unless
they are literal characters (escaped with a C<\>). The brackets C<()>
and C<[]> group patterns to match as a single atom. They're often used
to capture a result, mark the boundaries of an alternation, or mark a
group of patterns with a quantifier, among other things. Parentheses
C<()> are capturing and square brackets C<[]> are non-capturing. The
C<{}> brackets define a section of Perl code (a closure) within a regex.
These closures are always a successful zero-width match, unless the code
explicitly calls the C<fail> function. The C<< <...> >> brackets mark
assertions, which handle a variety of constructs including character
classes and user-defined quantifiers. Assertions are covered in
"Assertions" later in this chapter.

Table 7-2 summarizes the basic set of metacharacters.

=begin table picture Metacharacters

=headrow

=row

=cell Symbol

=cell Meaning

=bodyrows

=row

=cell C<.>

=cell Match any single character, including a newline.
X<. (dot);. match single character (regexes)>

=row

=cell C<^>

=cell Match the beginning of a string.
X<^ (caret);^ beginning of string (regexes)>

=row

=cell C<$>

=cell Match the end of a string.
X<$ (dollar sign);$ end of string (regexes)>

=row

=cell C<^^>

=cell Match the beginning of a line.
X<^ (caret);^^ beginning of line (regexes)>

=row

=cell C<$$>

=cell Match the end of a line.
X<$ (dollar sign);$$ end of line (regexes)>

=row

=cell C<|>

=cell Match alternate patterns (OR).
=row

=cell C<&>

=cell Match multiple patterns (AND).

=row

=cell C<\>

=cell Escape a metacharacter to get a literal character, or escape a
literal character to get a metacharacter.
X<\ (backslash);\ escape sequences (regexes)>
X<\ (backslash);\ to escape metacharacters (regexes)>

=row

=cell C<#>

=cell Mark a comment (to the end of the line).

=row

=cell C<:=>

=cell Bind the result of a match to a hypothetical variable.
X<: (colon);:= (binding);in regexes>

=row

=cell C<(...)>

=cell Group patterns and capture the result.

=row

=cell C<[...]>

=cell Group patterns without capturing.

=row

=cell C<{...}>

=cell Execute a closure (Perl 6 code) within a regex.

=row

=cell C<< <...> >>

=cell Match an assertion.

=row

=cell C<< <( >>

=cell Start of the capture.

=row

=cell C<< )> >>

=cell End of the capture.

=end table

=head2 Escape Sequences

X<\ (backslash);\ escape sequences (regexes)>
The escape sequences are literal characters acting as metacharacters,
marked with the C<\> escape. Some escape sequences represent single
characters that are difficult to represent literally, like C<\t> for
tab, or C<\x[...]> for a character specified by a hexadecimal number.
Some represent limited character classes, like C<\d> for digits or
C<\w> for word characters. Some represent zero-width positions in a
match, like C<\b> for a word boundary. With all the escape sequences
that use brackets, C<()>, C<{}>, and C<< <> >> work in place of C<[]>.

Note that since an ordinary variable now interpolates as a literal
string by default, the C<\Q> escape is rarely needed. An interpolated
array is interpreted as an alternation of all array elements.

Table 7-3 shows the escape sequences for regexes.
=begin table picture Escape sequences

=headrow

=row

=cell Escape

=cell Meaning

=bodyrows

=row

=cell C<'...'>

=cell Treat everything between the quotes literally, except the
backslash C<\> and single quotes C<'>.

=row

=cell C<"...">

=cell Like C<'...'>, but backslash escape sequences and variable
interpolation are enabled.

=row

=cell C<\0[...]>

=cell Match a character given in octal (brackets optional).

=row

=cell C<\b>

=cell Match a word boundary.

=row

=cell C<\B>

=cell Match when not on a word boundary.

=row

=cell C<\c[...]>

=cell Match a named character or control character.

=row

=cell C<\C[...]>

=cell Match any character except the bracketed named or control
character.

=row

=cell C<\d>

=cell Match a digit.

=row

=cell C<\D>

=cell Match a non-digit.

=row

=cell C<\e>

=cell Match an escape character.

=row

=cell C<\E>

=cell Match anything but an escape character.

=row

=cell C<\f>

=cell Match the form feed character.

=row

=cell C<\F>

=cell Match anything but a form feed.

=row

=cell C<\n>

=cell Match a (logical) newline.

=row

=cell C<\N>

=cell Match anything but a (logical) newline.

=row

=cell C<\h>

=cell Match horizontal whitespace.

=row

=cell C<\H>

=cell Match anything but horizontal whitespace.

=row

=cell C<\L[...]>

=cell Everything within the brackets is lowercase.

=row

=cell C<\Q[...]>

=cell All metacharacters within the brackets match as literal
characters.

=row

=cell C<\r>

=cell Match a return.

=row

=cell C<\R>

=cell Match anything but a return.

=row

=cell C<\s>

=cell Match any whitespace character.

=row

=cell C<\S>

=cell Match anything but whitespace.

=row

=cell C<\t>

=cell Match a tab.

=row

=cell C<\T>

=cell Match anything but a tab.

=row

=cell C<\U[...]>

=cell Everything within the brackets is uppercase.

=row

=cell C<\v>

=cell Match vertical whitespace.

=row

=cell C<\V>

=cell Match anything but vertical whitespace.

=row

=cell C<\w>

=cell Match a word character (Unicode alphanumeric plus "_").

=row

=cell C<\W>

=cell Match anything but a word character.
=row

=cell C<\x[...]>

=cell Match a character given in hexadecimal (brackets optional).

=row

=cell C<\X[...]>

=cell Match anything but the character given in hexadecimal (brackets
optional).

=end table

=head2 Quantifiers

Quantifiers specify the number of times an atom (a single character,
metacharacter, escape sequence, grouped pattern, assertion, etc.) will
match.

X<. (dot);.. (range);quantifier (regexes)>
X<. (dot);... (infinite range);quantifier (regexes)>
The numeric quantifiers use the C<**> operator followed by the number of
desired matches. For a range of matches you can use a closure that
returns a range (a range of C<2..4> matches an atom two to four times,
and C<2..*> matches it two or more times). Each quantifier has a minimal
alternate form, marked with a trailing C<?>, that matches the shortest
possible sequence first. Table 7-4 shows the built-in quantifiers.

=begin table picture Quantifiers

=headrow

=row

=cell Maximal

=cell Minimal

=cell Meaning

=bodyrows

=row

=cell C<*>

=cell C<*?>

=cell Match 0 or more times.

=row

=cell C<+>

=cell C<+?>

=cell Match 1 or more times.

=row

=cell C<?>

=cell C<??>

=cell Match 0 or 1 times.

=row

=cell C<**n>

=cell C<**?n>

=cell Match exactly R<n> times.

=row

=cell C<**n..m>

=cell C<**?n..m>

=cell Match at least R<n> and no more than R<m> times.

=row

=cell C<**n..*>

=cell C<**?n..*>

=cell Match at least R<n> times.

=end table

=head2 Assertions

In general, an assertion simply states that some condition or state is
true, and the match fails when that assertion is false. Many different
constructs with many different purposes use assertion syntax.

Assertions match named and anonymous regexes, arrays or hashes
containing anonymous regexes, and subroutines or closures that return
anonymous regexes. You have to enclose a variable in assertion
delimiters to get it to interpolate as an anonymous rule or rules. A
bare scalar in a pattern interpolates as a literal string, while a
scalar variable in assertion brackets interpolates as an anonymous rule.
A bare array in a pattern matches as a series of alternate literal
strings, while an array in assertion brackets interpolates as a series
of alternate anonymous rules. In the simplest case, a bare hash in a
pattern matches a word (C<\w+>) and tries to find that word as one of
its keys, while a hash in assertion brackets does the same, but then
also matches the associated value as an anonymous rule.

A bare closure in a pattern always matches (unless it calls C<fail>),
but a closure in assertion brackets C<< <{...}> >> must return an
anonymous rule, which is immediately matched.

An assertion with parentheses C<< <(...)> >> is similar to a bare
closure in a pattern in that it allows you to include straight Perl code
within a rule. The difference is that C<< <(...)> >> evaluates the
return value of the closure in boolean context. The match succeeds if
the return value is true and fails if the return value is false.

Assertions match character classes, both named and enumerated. A named
rule character class is often more accurate than an enumerated character
class. For example, C<< <[a-zA-Z]> >> is commonly used to match
alphabetic characters, but generally what's really needed is the
built-in rule C<< <alpha> >>, which matches the full set of Unicode
alphabetic characters.

Table 7-5 shows the syntax for assertions.

=begin table picture Assertions

=headrow

=row

=cell Syntax

=cell Meaning

=bodyrows

=row

=cell C<< <...> >>

=cell Generic assertion delimiter.

=row

=cell C<< <!...> >>

=cell Negate any assertion.

=row

=cell C<< <.rule> >>

=cell Match a named rule, without capturing.

=row

=cell C<< <rule> >>

=cell Match a named rule or character class.

=row

=cell C<< <[...]> >>, C<< <+[...]> >>

=cell Match an enumerated character class.

=row

=cell C<< <-...> >>

=cell Complement a character class (named or enumerated).

=row

=cell C<< <$scalar> >>

=cell Match an anonymous rule.

=row

=cell C<< <@array> >>

=cell Match a series of anonymous rules as alternates.

=row

=cell C<< <%hash> >>

=cell Match a key from the hash, then its value (which is an anonymous
rule).
=row

=cell C<< >>

=cell Call a named rule with arguments.

=row

=cell C<< <{...}> >>

=cell Match an anonymous rule returned by a closure.

=row

=cell C<< <.> >>

=cell Match any logical grapheme, including combining character
sequences.

=end table

=head1 Modifiers

X<: (colon);: modifier delimiter in rules>
Modifiers alter the meaning of the pattern syntax. The standard position
for modifiers is at the beginning of the rule, right after the C<m>,
C<s>, or C<rx>, or after the name in a named rule. Modifiers cannot
attach to the outside of a bare C</.../>. For example:

  m:i /marvin/                               # case insensitive
  rule names :i { marvin | ford | arthur }

Multiple modifiers can be chained, and short and long names can be
mixed:

  m:s :i :g/ zaphod /
  m:sigspace :i :global / zaphod /

Modifiers can be negated with the C<:!pair> notation, so C<:!i> forces
case-sensitive matching.

Most of the modifiers can also go inside the rule, attached to the rule
delimiters or to grouping delimiters. Internal modifiers are lexically
scoped to their enclosing delimiters, so you get a temporary alteration
of the pattern:

  m/:s I saw [:i zaphod] /   # only 'zaphod' is case insensitive

The repetition modifiers (C<:R<N>x>, C<:R<N>th>, C<:global>, and
C<:exhaustive>) and the continue modifier (C<:cont>) can't be lexically
scoped, because they alter the return value of the entire rule.

The C<:R<N>x> modifier matches the rule a counted number of times. If
the modifier expects more matches than the string has, the match fails.
It has an alternate form C<:x(R<N>)> that can take a variable in place
of the number.

The C<:global> modifier matches as many times as possible. The
C<:exhaustive> modifier also matches as many times as possible, but in
as many different ways as possible.

The C<:R<N>th> modifier preserves one result from a particular counted
match. If the rule matches fewer times than the modifier expects, the
match fails. It has several alternate forms. One form, C<:nth(R<N>)>,
can take a variable in place of the number.
The other forms, C<:R<N>st>, C<:R<N>nd>, and C<:R<N>rd>, are for cases
where it's more natural to write C<:1st>, C<:2nd>, C<:3rd> than it is to
write C<:1th>, C<:2th>, C<:3th>. Either way is valid, so pick the one
that's most comfortable for you.

By default, rules ignore literal whitespace within the pattern. The
C<:s> or C<:sigspace> modifier makes rules sensitive to literal
whitespace, but in an intelligent way. Any cluster of literal whitespace
acts like an explicit C<\s+> when it separates two identifiers and
C<\s*> everywhere else. More specifically, any literal whitespace in the
regex is translated to an implicit call to C<< <.ws> >>, where the
C<ws> rule matches as described above, but can also be overridden by the
user.

There are no modifiers to alter whether the matched string is treated as
a single line or multiple lines. That's why the "beginning of string"
and "end of string" metasymbols have "beginning of line" and "end of
line" counterparts.

Table 7-6 shows the current list of modifiers.

=begin table picture Modifiers

=headrow

=row

=cell Short

=cell Long

=cell Meaning

=bodyrows

=row

=cell C<:i>

=cell C<:ignorecase>

=cell Case-insensitive match.

=row

=cell C<:a>

=cell C<:ignoreaccent>

=cell Ignore accents and other markings on characters.

=row

=cell C<:c($pos)>

=cell C<:continue($pos)>

=cell Match at position C<$pos> or later. If C<$pos> is omitted, start
where the previous match left off.

=row

=cell C<:p>

=cell C<:pos>

=cell Match anchored at position C<$pos>. If C<$pos> is omitted, start
where the previous match left off.

=row

=cell C<:s>

=cell C<:sigspace>

=cell Literal whitespace in the pattern matches as C<\s+> or C<\s*>.

=row

=cell

=cell C<:>R<N>C<x>/C<:x(>R<N>C<)>

=cell Match the pattern R<N> times.

=row

=cell

=cell C<:>R<N>C<th>/C<:nth(>R<N>C<)>

=cell Match the R<N>th occurrence of a pattern.

=row

=cell C<:g>

=cell C<:global>

=cell Match the pattern as many times as possible, but only
possibilities that don't overlap.
=row

=cell C<:ov>

=cell C<:overlap>

=cell Match the pattern as many times as possible, and allow overlapping
matches, but only one match per starting position.

=row

=cell C<:ex>

=cell C<:exhaustive>

=cell Match every possible occurrence of a pattern, even overlapping
possibilities.

=row

=cell

=cell C<:bytes>

=cell C<.> is a byte.

=row

=cell

=cell C<:codes>

=cell C<.> is a Unicode codepoint.

=row

=cell

=cell C<:graphs>

=cell C<.> is a Unicode grapheme.

=row

=cell

=cell C<:chars>

=cell C<.> matches whatever the current Unicode level corresponds to
(this is the default).

=row

=cell

=cell C<:ratchet>

=cell Imply a C<:> after each atom (see "Backtracking Control" below).

=row

=cell C<:P5>

=cell C<:Perl5>

=cell The pattern uses Perl 5 regex syntax.

=end table

=head2 Substitution Modifiers

Special modifiers are available for substitutions that do not make sense
on normal matches.

The C<:samecase> modifier (short form C<:ii>) implies the C<:ignorecase>
modifier, but also carries the case information over on a
character-by-character basis.

  my $s = 'The Quick Brown Fox';
  $s ~~ s:ii/brown/blue/;
  say $s;             # The Quick Blue Fox

If the C<:sigspace> modifier is also present, a slightly more
intelligent algorithm is used. If the source string follows one of the
case patterns in the table below, that pattern is recognized and applied
to the substitution string.

  $_ = 'All Words Capitalized';
  s:s:ii/.*/other words/;
  .say;               # Other Words

There's also a shortcut spelling for this modifier combination, so the
example above could have been written more compactly.
=begin table picture Case patterns for the :samecase modifier

=headrow

=row

=cell Pattern

=cell Corresponding code

=bodyrows

=row

=cell ALL UPPERCASE

=cell C<.uc>

=row

=cell all lowercase

=cell C<.lc>

=row

=cell Every Word Capitalized

=cell C<.lc.capitalize>

=row

=cell First letter upper, rest lower

=cell C<.lc.ucfirst>

=row

=cell fIRST LETTER LOWER, REST UPPER

=cell C<.uc.lcfirst>

=end table

A similar modifier is C<:sameaccent> (short C<:aa>). Instead of carrying
case information, it carries accent and marking information.

  my $stuff = 'Möhre';
  $stuff ~~ s:aa/o/a/;
  say $stuff;         # Mähre

The third substitution modifier is C<:samespace>, short C<:ss>. It
preserves whitespace that is matched by implicit C<< <.ws> >> rules:

  my $s = "Some white\t\n spaces";
  $s ~~ s:ss/\w+ \w+ \w+/Completely different text/;
  # $s is now "Completely different\t\n text"

=head1 Built-in Rules

A number of named rules are provided by default, including a complete
set of POSIX-style classes and Unicode property classes. The list isn't
fully defined yet, but Table 7-7 shows a few you're likely to see.

The C<< <null> >> rule matches a zero-width string (so it's always true)
and C<< <prior> >> matches whatever the most recent successful rule
matched. These replace the two behaviors of the Perl 5 null pattern
C<//>, which is no longer valid syntax for rules.

=begin table picture Built-in rules

=headrow

=row

=cell Rule

=cell Meaning

=bodyrows

=row

=cell C<< <alpha> >>

=cell Match a Unicode alphabetic character.

=row

=cell C<< <digit> >>

=cell Match a Unicode digit.

=row

=cell C<< <sp> >>

=cell Match a single space character (the same as C<\s>).

=row

=cell C<< <ws> >>

=cell Match any whitespace (the same as C<\s+>).

=row

=cell C<< <null> >>

=cell Match the null string.

=row

=cell C<< <prior> >>

=cell Match the same thing as the previous match.

=row

=cell C<< <before ...> >>

=cell Zero-width lookahead. Assert that you're I<before> a pattern.

=row

=cell C<< <after ...> >>

=cell Zero-width lookbehind. Assert that you're I<after> a pattern.
=end table

=head1 Backtracking Control

Backtracking is triggered whenever part of the pattern fails to match.
You can also explicitly trigger backtracking by calling the C<fail>
function within a closure. Table 7-8 shows some metacharacters and
built-in rules relevant to backtracking.

=begin table picture Backtracking controls

=headrow

=row

=cell Operator

=cell Meaning

=bodyrows

=row

=cell C<:>

=cell Don't retry the previous atom, fail to the next earlier atom.
X<: (colon);: fail to atom before last (rules)>

=row

=cell C<::>

=cell Don't backtrack over this point, fail out of the closest enclosing
group (C<(...)>, C<[...]>, or the rule delimiters).
X<: (colon);:: fail out of group (rules)>

=row

=cell C<:::>

=cell Don't backtrack over this point, fail out of the current rule or
subrule.
X<: (colon);::: fail out of rule (rules)>

=row

=cell C<< <commit> >>

=cell Don't backtrack over this point, fail out of the entire match
(even from within a subrule).

=row

=cell C<< <cut> >>

=cell Like C<< <commit> >>, but also cuts the string matched. The
current matching position at this point becomes the new beginning of the
string.

=end table

The C<:ratchet> modifier, which is implied by regexes declared with the
C<token> or C<rule> keyword, disables backtracking in the subrule, which
is the same as adding a C<:> after every atom.

=head1 The Match Object

A regex match produces a I<match> object, which contains all information
about the match, including start and end position, matched string, and
all captures. The match object is returned from a regex match, and is
also stored in the special variable C<$/>.

  my $match = 'Zaphod Beeblebrox' ~~ m/\w+/;
  say $match;         # prints Zaphod

In string context it evaluates to the text of the matched part of the
string. The table "Properties of the Match object" below summarizes the
properties of the match object.

The variables C<$0>, C<$1>, C<$2> etc. are aliases to C<$/[0]>,
C<$/[1]>, C<$/[2]>, and C<< $<name> >> is an alias to C<< $/<name> >>.
Likewise an empty C<@()> is the same as C<@($/)>, and C<%()> stands for
C<%($/)>.

Match variables can also store a different scalar object. A closure in a
regex can store such an object by calling C<make>, and the object can be
retrieved by forcing scalar context with C<$( $/ )>:

  regex herd :i :s {
      (\d+) (\w+)s?
      {
          make Herd.new(
              animal => $1.capitalize,
              count  => $0,
          );
      }
  }

  'Yesterday we saw 4 mooses' ~~ m/ <herd> /;
  # now $($<herd>) contains the new Herd object

This can be used to build object trees directly from regex matches.

=begin table picture Properties of the Match object

=headrow

=row

=cell Property

=cell Description

=bodyrows

=row

=cell C<?$/>

=cell True if the match was successful.

=row

=cell C<$/.text>

=cell The matched part of the string.

=row

=cell C<$/.from>

=cell Start position of the match.

=row

=cell C<$/.to>

=cell End position of the match.

=row

=cell C<@( $/ )>

=cell List of all positional captures.

=row

=cell C<%( $/ )>

=cell Hash of all named captures.

=row

=cell C<$/[$n]>

=cell C<$n>th positional capture.

=row

=cell C<< $/<name> >>

=cell Access to a particular named capture.

=end table

Capture variables are always match objects, and contain the information
of their respective submatches.

  m/ ( a ( geek ) ( passes ) ) ( many tests ) /
     |   |        |            |
     |   |        |            +-- $/[1]
     |   |        +-- $/[0][1]
     |   +-- $/[0][0]
     +-- $/[0]

If a capturing group is quantified, it automatically becomes an array of
match objects. Subsequent matches are not renumbered:

  '12 45 books' ~~ m:s/ ( \d+ )+ (\w+) /;
  say $0[0];          # 12
  say $0[1];          # 45
  say $1;             # books

When a subrule is called with the C<< <subrule> >> syntax, it produces a
named capture called C<subrule>. That name can be changed with the
C<< <newname=subrule> >> syntax.

  token identifier { \w+ }
  token number     { \d+ }
  $_ = '24 hours';
  if m:s/ <number> <unit=identifier> / {
      say "Number: $<number>. Unit: $<unit>";
  }

These variables are also available in the regex itself:

  "Zaphod saw Zaphod" ~~ m:s/ <name> \w+ $/<name> /;

=cut

# vim: sw=3 ts=3 expandtab ft=pod tw=72