| Regular expressions for text processing on strings, pattern matching, scanners.
Text processing predicates:
- match/2/3
- match_all/3
- substitute/4
- substitute_all/4
- split/2/3
- chop/2
- get_line/1/2
Regular Expressions:
Regular expressions are sequences of simple characters
and meta characters. All characters which are not
meta characters stand for themselves.
Examples:
a stands for the character "a"
ab stands for the character sequence "ab"
%% stands for the character "%"
meta characters:
| disjunction
(...) bracketed expression
. any single character
^ first position of a string
$ last position of a string
%d digit
%D not a digit
%s white space
%w character of a word (letter or digit or _)
%W not a character of a word
%< start of a word
i.e. current character is %w and previous is not %w
%> end of a word
i.e. current character is not %w and previous is %w
case sensitivity
%i all following characters are case insensitive
%I all following characters are case sensitive
classes of characters
[...] one of the listed characters
[^...] not one of the listed characters
The character "-" is used within a class to define a range,
e.g. [a-zA-Z0-9_] is the same as %w.
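For readers more familiar with backslash-style syntax, the escapes above can be related to Python's re module. The following is only a rough sketch, not part of this library: the names PERCENT_TO_PYTHON, translate and find_all are hypothetical, %i/%I are not handled, and the two word-boundary translations are derived from the "i.e." definitions given above.

```python
import re

# Hypothetical mapping from the %-escapes described above to Python `re` syntax.
PERCENT_TO_PYTHON = {
    "%": "%",               # %% is a literal percent sign
    "d": r"\d",             # digit
    "D": r"\D",             # not a digit
    "s": r"\s",             # white space
    "w": r"\w",             # word character
    "W": r"\W",             # not a word character
    "<": r"(?<!\w)(?=\w)",  # start of a word: next is %w, previous is not
    ">": r"(?<=\w)(?!\w)",  # end of a word: previous is %w, next is not
}


def translate(pattern: str) -> str:
    """Rewrite %-escapes into Python equivalents; copy everything else unchanged."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt not in PERCENT_TO_PYTHON:
                raise ValueError(f"unsupported escape %{nxt}")
            out.append(PERCENT_TO_PYTHON[nxt])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def find_all(pattern: str, text: str) -> list:
    """Return all matches of a %-style pattern; re.ASCII keeps \\w equal to [a-zA-Z0-9_]."""
    return re.findall(translate(pattern), text, flags=re.ASCII)


print(translate("%<(%w+)"))                    # (?<!\w)(?=\w)(\w+)
print(find_all("%d+", "one123four"))           # ['123']
print(find_all("%<(%w+)", " one two three "))  # ['one', 'two', 'three']
```

Python reports a single capturing group as a flat list of strings; the result shapes of this library's own predicates are described in the sections that follow.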
Grouping and lookahead
(?:...) simple grouping (non-capturing, no side effects)
(?=...) positive lookahead,
true if ... follows, but ... is not consumed
(?!...) negative lookahead,
true if ... does not follow, nothing is consumed
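Python's re module happens to spell these three constructs the same way, so their behaviour can be illustrated there. This is an analogy under that assumption, not output from this library.

```python
import re

# (?=...) positive lookahead: "foo" matches only when "bar" follows,
# and "bar" itself is not consumed (the match ends at offset 3).
positive = re.search(r"foo(?=bar)", "foobar")

# (?!...) negative lookahead: "foo" matches only when "bar" does not follow.
negative = re.findall(r"foo(?!bar)", "foobar foobaz")

# (?:...) groups without capturing: only "(c)" produces a captured term.
grouped = re.match(r"(?:ab)+(c)", "ababc")

print(positive.group(0), positive.end())  # foo 3
print(negative)                           # ['foo']
print(grouped.groups())                   # ('c',)
```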
Quantifiers
Quantifiers are given as suffixes.
The following quantifiers are defined:
* any number of
+ at least once
? none or one
{min,max} "min" to "max" times
{min,} at least "min" times
{n} exactly "n" times
Quantifiers are greedy: they match the longest possible
string. To match the shortest possible string instead,
append a "?" to the quantifier, e.g. *? or +?.
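The difference between the longest and the shortest match can be sketched with Python's re module, which uses the same quantifier suffixes. The variable names are illustrative only, and this shows Python's behaviour, not output from this library.

```python
import re

# Longest match: ".*" takes as much as it can while still leaving a final "a".
greedy = re.search(r"(.*)a", "barbara").group(1)

# Shortest match: ".*?" stops at the first position where "a" can follow.
lazy = re.search(r"(.*?)a", "barbara").group(1)

# The same rule applies to counted quantifiers.
bounded = re.search(r"a{2,3}", "aaaa").group(0)
bounded_lazy = re.search(r"a{2,3}?", "aaaa").group(0)

print(greedy)        # barbar
print(lazy)          # b
print(bounded)       # aaa
print(bounded_lazy)  # aa
```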
Note: to include a backslash in an atom, write the backslash
twice ("escape of an atom").
Note: to use a literal percent sign in an expression, write the
percent sign twice ("escape of the regular expression").
Examples:
match('(%d+)', 'one123four', L) => L = ['123']
match('(.*)a', barbara, L) => L = [barbar]
match('(.*?)a', barbara, L) => L = [b]
match_all('%<(%w+)', ' one two three ', L)
=> L = [[one],[two],[three]]
split('one two three', L) => L = [one,two,three]
chop(' one two ', L) => L = 'one two'
split('%s*:%s*', 'one : two :three', L)
=> L = [one,two,three]
substitute_all('a(.)', barbara, '%1a', L)
=> L = brabraa
e.g. to exchange the positions of two words:
substitute('(%w+) (%w+)', 'one two', '%2 %1', L)
=> L = 'two one'
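The substitution and splitting examples above have close analogues in Python's re module, where back-references are written \1...\9 instead of %1...%9. The sketch below assumes that substitute/4 replaces only the first match and substitute_all/4 replaces every match; that reading comes from the predicate names and is not confirmed here.

```python
import re

# Replace every match: swap each "a" with the character that follows it.
all_swapped = re.sub(r"a(.)", r"\1a", "barbara")

# Replace only the first match (assumed to correspond to substitute/4).
first_swapped = re.sub(r"a(.)", r"\1a", "barbara", count=1)

# Exchange two words by referencing groups 2 and 1 in the replacement.
words_swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "one two", count=1)

# Split on a separator pattern: a colon with optional surrounding white space.
fields = re.split(r"\s*:\s*", "one : two :three")

print(all_swapped)    # brabraa
print(first_swapped)  # brabara
print(words_swapped)  # two one
print(fields)         # ['one', 'two', 'three']
```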
Bracketed Units
Each bracketed expression is assigned to a term. Brackets are numbered
by their opening bracket, counted from left to right: the text matched
inside the i-th opening bracket is assigned to the i-th term.
Bracketed terms can be referenced in substitute/4 and
substitute_all/4 with %1...%9 in the substitute expression.
For match/3 and match_all/3 they are returned in the corresponding
order. If a bracketed expression was not evaluated, e.g. because
it appeared in an unused branch of a disjunction, its result is ''.
Examples:
match_all('((%w+)|(%s+))', 'a few tokens', L).
=> L = [[a,a,''],[' ','',' '],[few,few,''],[' ','',' '],[tokens,tokens,'']]
match_all('%<(%w+)', 'a few tokens', L).
=> L = [[a],[few],[tokens]]
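The numbering rule and the '' placeholder can be mimicked in Python's re module, which reports an unevaluated group as None. The helper match_all_groups below is hypothetical and only approximates the result shape described above; it is not this library's implementation.

```python
import re


def match_all_groups(pattern: str, text: str) -> list:
    """Collect the groups of every match, replacing unevaluated groups with ''."""
    return [
        ["" if g is None else g for g in m.groups()]
        for m in re.finditer(pattern, text)
    ]


# Groups are numbered by their opening bracket, left to right:
# group 1 is the outer bracket, groups 2 and 3 are the two alternatives.
tokens = match_all_groups(r"((\w+)|(\s+))", "a few tokens")
for row in tokens:
    print(row)
# ['a', 'a', '']
# [' ', '', ' ']
# ['few', 'few', '']
# [' ', '', ' ']
# ['tokens', 'tokens', '']

# A single bracket yields one-element rows.
words = match_all_groups(r"(?<!\w)(?=\w)(\w+)", "a few tokens")
print(words)  # [['a'], ['few'], ['tokens']]
```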
|