Regular expression grouping, Minimal and maximal matching – Crunch CRiSP File Editor 6 User Manual

[A-Za-z]+ matches the final word in a sentence. The '.' matches the full-stop after it. The expression [ ]@
matches zero or more spaces which may separate the full-stop and the first letter of the next sentence.

Regular Expression Grouping: {..}

The regular expression grouping characters serve two purposes: to alter the precedence with which the
regular expression is parsed, and to define groupings of regular expressions for use by the
translation mechanism.

By and large, the regular expressions:

xyz and {xyz}

are equivalent. The main use of the brackets is to delimit the expression that the operators @, +, and |
apply to. For example:

{hello}@
{cat}|{dog}
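
The effect of bracketing on precedence can be tried out in any regular expression engine. The following is a minimal sketch using Python's re module rather than CRiSP itself. It assumes the usual Unix-style stand-ins, ( ) for { } and * for @, so the patterns are approximations of the CRiSP ones and not CRiSP syntax.

```python
import re

# Python's re syntax is used as a stand-in for CRiSP's (an assumption):
#   CRiSP {hello}@  ~  Python (hello)*
#   CRiSP hello@    ~  Python hello*

# Without grouping, the repeat applies only to the final 'o'.
ungrouped = re.fullmatch(r"hello*", "hellooo")

# With grouping, the repeat applies to the whole word.
grouped = re.fullmatch(r"(hello)*", "hellohello")

# The ungrouped pattern cannot match a repeated word.
ungrouped_on_repeat = re.fullmatch(r"hello*", "hellohello")

# Alternation between two grouped words.
either = [bool(re.fullmatch(r"(cat)|(dog)", w)) for w in ("cat", "dog", "cow")]
```

Here the ungrouped pattern accepts "hellooo" but rejects "hellohello", while the grouped pattern accepts the repeated word.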

The other use for the bracket operators is to define a sub-part of a regular expression for use in translation.
Each occurrence of brackets is defined as a grouping. The first occurrence of {..} is group 1, the next is
group 2. By grouping parts of a regular expression, translations can be made which swap fields around.

For example, say we have a piece of C code which defines a table as follows:

"string1", number1,
"string2", number2,
..

If we need to swap the fields around so that we have the numbers first on the line, and the strings following
them, then the regular expression search pattern can be defined as:

<[ \t]@{"[^"]@",}[ \t]@{[0-9]+,}

This breaks down as follows: <[ \t]@ matches the spaces and tabs at the beginning of the line. {"[^"]@",}
matches the string field (a quote followed by zero or more non-quote characters, terminated by a quote and a
comma). This is the first group. [ \t]@ matches the zero or more spaces or tabs between the columns.
{[0-9]+,} is the second grouping and matches the number followed by a comma.

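If the translation replacement pattern (see (translate)) is defined as follows:

\t\1\t\0

then this effects the field swap. The sequence \N, where N is in the range 0-9, means insert the matched
group designated by N. For the swap to work, \0 must refer to the first grouping (the string) and \1 to
the second (the number); that is, the \N sequences count from 0.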
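
The same field swap can be sketched outside CRiSP. The example below uses Python's re module and assumes the following mapping from the CRiSP pattern: ^ for <, * for @, and ( ) for { }. The function name swap_fields and the sample line are hypothetical, and the tab before each field is an illustrative choice. Note that Python's replacement references count from 1, so \2 is the number and \1 is the string.

```python
import re

# Python approximation of the CRiSP search pattern (assumed mapping):
#   <  -> ^      @ -> *      {..} -> (..)
pattern = re.compile(r'^[ \t]*("[^"]*",)[ \t]*([0-9]+,)')

def swap_fields(line: str) -> str:
    """Hypothetical helper: put the number field before the string field."""
    # Python numbers groups from 1: \1 is the string, \2 is the number.
    return pattern.sub(r"\t\2\t\1", line, count=1)

swapped = swap_fields('    "string1", 42,')
```

With the sample line above, swapped holds a tab, the number field, a tab, and then the string field.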

Minimal and Maximal Matching

All Unix regular expression parsers use the '*' and '+' operators to mean repeat the previous expression zero
or more, or one or more times respectively. CRiSP uses the '@' and '+' operators for the same effect.

However, when matching repeated groups, all Unix parsers will always try to match the longest string. Under
Unix, if we have the string:

abbbbbbbc

and issue the search pattern:

ab*

then the b* part will match all 7 b's between the 'a' and 'c'. By default, CRiSP performs a shortest match.
This means that the regular expression:

ab@

will match the 'a' followed by a zero-length sequence of b's. For pure searches, the difference hardly ever
matters, but when translations are performed the difference is very important. In the above example, using
the following translation from 'vi' will result in the following string:

s/ab*/X/p                  Xc

This is what happens with CRiSP:

translate("ab@", "X")      Xbbbbbbbc

This is simply because the Unix parsers try to match the longest string, whereas CRiSP tries to match the
shortest string.
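
The difference between the two behaviours can be reproduced in an engine that supports both. The sketch below uses Python's re module, where * takes the longest match and the lazy form *? takes the shortest. The lazy quantifier is used here only as a stand-in for CRiSP's default behaviour; it is not CRiSP syntax.

```python
import re

text = "abbbbbbbc"

# Longest (maximal) match, as in the vi example: 'a' plus all seven b's.
longest = re.sub(r"ab*", "X", text, count=1)

# Shortest (minimal) match via Python's lazy quantifier '*?':
# 'a' plus zero b's, mirroring the CRiSP default described above.
shortest = re.sub(r"ab*?", "X", text, count=1)
```

The first substitution yields Xc and the second yields Xbbbbbbbc, matching the two results shown above.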

CRiSP provides the ability to modify this default behaviour. This is called minimal/maximal matching and