Text Processing and Regular Expressions

Regular expressions for text processing on strings: pattern matching and scanners. The text processing predicates are listed at the end of this page.

Regular Expressions:

Regular expressions are sequences of simple characters and meta characters. All characters which are not meta characters stand for themselves.

Examples:

	a		stands for the character "a"
	ab		stands for the character sequence "ab"
	%%		stands for the character "%"

meta characters:

	|		disjunction
	(...)		bracketed expression
	.		any single character
	^		first position of a string
	$		last position of a string
	%d		digit
	%D		not a digit
	%s		white space
	%w		character of a word (letter or digit or _)
	%W		not a character of a word
	%<		start of a word 
			i.e. current character is %w and previous is not %w
	%>		end of a word
			i.e. current character is not %w and previous is %w
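
For readers who know backslash-style regular expressions, the table above can be read as a mapping. The following is only a sketch in Python, used here as a widely available stand-in and not as the library documented on this page. The function name to_python_re and the mapping table are hypothetical, and the treatment of %< and %> as a word boundary plus a lookaround is an assumption derived from the "i.e." descriptions above. %i and %I are not handled.

```python
import re

# Hypothetical mapping from the %-escapes listed above to Python `re` syntax.
PERCENT_ESCAPES = {
    "%": "%",             # %% is a literal percent sign
    "d": r"\d",           # digit
    "D": r"\D",           # not a digit
    "s": r"\s",           # white space
    "w": r"\w",           # character of a word
    "W": r"\W",           # not a character of a word
    "<": r"\b(?=\w)",     # start of a word: boundary, next character is %w
    ">": r"\b(?<=\w)",    # end of a word: boundary, previous character is %w
}


def to_python_re(pattern: str) -> str:
    """Translate a %-style pattern into a Python `re` pattern (sketch only)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%" and i + 1 < len(pattern) and pattern[i + 1] in PERCENT_ESCAPES:
            out.append(PERCENT_ESCAPES[pattern[i + 1]])
            i += 2
        elif ch == "\\":
            # Assumption: a backslash has no special meaning in the %-syntax.
            out.append("\\\\")
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# re.ASCII restricts \w and \d to [a-zA-Z0-9_] and [0-9], as described above.
digits = re.search(to_python_re("(%d+)"), "one123four", re.ASCII).group(1)
words = re.findall(to_python_re("%<(%w+)"), " one  two  three  ", re.ASCII)
```

Here digits is '123' and words is ['one', 'two', 'three']. Python's findall returns a flat list when the pattern has a single bracket.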

case sensitivity

	%i		all following characters are matched case insensitively
	%I		all following characters are matched case sensitively
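
As a rough analogue of switching case sensitivity inside an expression, Python's `re` module offers the scoped form (?i:...). The sketch below uses Python only as a stand-in. The assumption is that a part of a pattern following %i and ending at %I behaves like the part enclosed in (?i:...).

```python
import re

# "ab" must match exactly; "cd" may be in any case.
pattern = re.compile(r"ab(?i:cd)")

matches_mixed = pattern.fullmatch("abCD") is not None    # True
matches_upper = pattern.fullmatch("ABcd") is not None    # False
```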

classes of characters

	[...]		one of the listed characters
	[^...]		not one of the listed characters
	The character "-" is used within a class to define a range,
	e.g. [a-zA-Z0-9_] is the same as %w.

Grouping and lookahead

	(?:...)		simple grouping (no side effects, i.e. nothing is assigned to a term)
	(?=...)		positive lookahead,
			true if ... follows, but ... is not consumed
	(?!...)		negative lookahead,
			true if ... does not follow, nothing is consumed
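
The three constructs above use the same notation in Python's `re` module, so their effect can be tried out there. This is a sketch in Python as a stand-in, with a made-up sample text. It assumes the constructs behave as in other regular-expression dialects.

```python
import re

text = "price: 42 USD, 17 EUR"   # illustrative sample text

# (?=...) positive lookahead: digits that are followed by " USD";
# " USD" is checked but not consumed, so it is not part of the match.
usd = re.findall(r"\d+(?= USD)", text)

# (?!...) negative lookahead: whole numbers that are NOT followed by " USD".
not_usd = re.findall(r"\b\d+\b(?! USD)", text)

# (?:...) groups without producing a result; only (c) is returned.
groups = re.match(r"(?:a|b)+(c)", "abac").groups()
```

Here usd is ['42'], not_usd is ['17'], and groups is ('c',).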

Quantifiers

	Quantifiers are given as suffixes.
	The following quantifiers are defined:

	*		any number of times
	+		at least once
	?		none or one
	{min,max}	"min" to "max" times
	{min,}		at least "min" times
	{n}		exactly "n" times

Quantifiers are eager (greedy): they try to match the longest possible string. To find the shortest possible matching string, append a "?" to the quantifier.
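
The difference between the eager and the shortest-match form can be reproduced with Python's `re` module, which uses the same quantifier notation. This is a sketch with Python as a stand-in. re.search is assumed to correspond to an unanchored match.

```python
import re

# Eager: (.*) takes as much as possible before the last "a".
eager = re.search(r"(.*)a", "barbara").group(1)

# Shortest match: (.*?) stops before the first "a".
shortest = re.search(r"(.*?)a", "barbara").group(1)

# The same applies to counted quantifiers.
eager_count = re.search(r"a{2,3}", "aaaa").group(0)
shortest_count = re.search(r"a{2,3}?", "aaaa").group(0)
```

Here eager is 'barbar' and shortest is 'b'. eager_count is 'aaa' and shortest_count is 'aa'.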

Note: to put a backslash into an atom, the backslash must be written twice ("escape of an atom").

Note: to use a literal percent mark in an expression, the percent mark must be written twice ("escape of the regular expression").

Examples:

	match('(%d+)', 'one123four', L)	=>	L = ['123']
	match('(.*)a', barbara, L)		=>  	L = [barbar]
	match('(.*?)a', barbara, L)		=>  	L = [b]
	match_all('%<(%w+)', ' one  two  three  ', L)	
						=> 	L = [[one],[two],[three]]
	split('one two three', L)		=>  	L = [one,two,three]
	chop('  one two  ', L)		=>  	L = 'one two'
	split('%s*:%s*', 'one   :  two  :three', L)
			=>  L  = [one,two,three]

	substitute_all('a(.)', barbara, '%1a', L)	=>	L = brabraa

e.g. to exchange the position of two words:

	substitute('(%w+) (%w+)', 'one two', '%2 %1', L)	=>	L = 'two one'
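
The splitting, chopping and substitution examples above have close counterparts in Python, shown here as a sketch with Python as a stand-in. References to bracketed units are written \1, \2 in Python instead of %1, %2. That substitute/4 replaces only the first occurrence (count=1 below) is an assumption based on its name.

```python
import re

# Splitting on white space and on a regular expression.
by_space = "one two three".split()
by_colon = re.split(r"\s*:\s*", "one   :  two  :three")

# Removing leading and trailing white space.
chopped = "  one two  ".strip()

# Replace every "a" plus following character by that character plus "a".
all_swapped = re.sub(r"a(.)", r"\1a", "barbara")

# Replace the first occurrence only.
first_swapped = re.sub(r"a(.)", r"\1a", "barbara", count=1)

# Exchange the position of two words.
words_swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "one two", count=1)
```

Here all_swapped is 'brabraa' and words_swapped is 'two one', as in the examples above. first_swapped is 'brabara'.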

Bracketed Units

Whatever is bracketed is assigned to a term: the text matched by the i-th opening bracket, counted from left to right, is assigned to the i-th term. Bracketed terms can be referenced in the substitute expression of substitute/4 and substitute_all/4 with %1...%9. For match/3 and match_all/3 they are returned in the corresponding order. If a bracketed expression was not evaluated, e.g. because it appeared in a branch of a disjunction that was not taken, its result is ''.

Examples:

match_all('((%w+)|(%s+))', 'a few tokens', L).
	=> L = [[a,a,''],[' ','',' '],[few,few,''],[' ','',' '],[tokens,tokens,'']]

match_all('%<(%w+)', 'a few tokens', L). => L = [[a],[few],[tokens]]
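
Python's `re` module numbers brackets in the same left-to-right order, so the first example can be reproduced there. This is a sketch with Python as a stand-in. re.findall is assumed to play the role of match_all/3, and it also yields '' for a bracket that was not evaluated.

```python
import re

pattern = re.compile(r"((\w+)|(\s+))")

# One row per match; each row holds the results of brackets 1, 2 and 3.
rows = [list(groups) for groups in pattern.findall("a few tokens")]

# With a single match, an unevaluated bracket is None unless a default is given.
single = re.search(r"(a)|(b)", "b").groups(default="")
```

Here rows is [['a', 'a', ''], [' ', '', ' '], ['few', 'few', ''], [' ', '', ' '], ['tokens', 'tokens', '']] and single is ('', 'b').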


Text processing predicates:

match/2/3
match_all/3
substitute/4
substitute_all/4
chop/2
split/2/3
get_line/1/2
