UTRegExp.h File Reference

Functions
Status_t	evaluate_reg_exp (const utf8 *expression, const utf8 *match, bool force_start_match, String_t *pre, String_t **substring_array, int substring_count, bool force_end_match, String_t *post)
Status_t	evaluate_reg_exp (const utf8 *expression, const utf8 *match, bool force_start_match=true, String_t *pre=NULL, bool force_end_match=true, String_t *post=NULL,...)
Status_t	evaluate_reg_exp (stringliteral *expression, const utf8 *match, bool force_start_match=true, String_t *pre=NULL, bool force_end_match=true, String_t *post=NULL,...)
Status_t	validate_reg_exp (const utf8 *expression, int *expression_fail_point_chars, const utf8 *match, int *match_fail_pos_chars, bool force_start_match=true, String_t *pre=NULL, bool force_end_match=true, String_t *post=NULL, String_t **substring_array=NULL, int substring_count=0)

Function Documentation

Status_t evaluate_reg_exp	(	const utf8 *	expression,
		const utf8 *	match,
		bool	force_start_match,
		String_t *	pre,
		String_t **	substring_array,
		int	substring_count,
		bool	force_end_match,
		String_t *	post
	)

Evaluates a regular expression.

If force_start_match is true, the string must match the expression from its first character. In that case, pre must be NULL. If force_start_match is false, characters in the string before the match point can either be returned through pre or discarded if pre is NULL.

If force_end_match is true, the match must end exactly at the end of the string. In that case, post must be NULL. If force_end_match is false, characters in the string after the expression match can either be returned through post or discarded if post is NULL.

The expression string is a mix of plain characters and escape sequences. For example, with force_start_match and force_end_match both false, if match is "This is a string" and expression is "is a" then pre will be "This " and post will be " string". If the expression were "/sis a/s", where /s matches any whitespace character, pre would be "This" and post would be "string". Note that the pattern match "absorbed" the spaces.
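
The pre/post split in the first example can be modelled for the plain-character case with an ordinary substring search. The sketch below is illustrative only: split_on_literal is a hypothetical helper written with std::string, not part of UTRegExp.h, and it assumes force_start_match and force_end_match are both false.

```cpp
#include <string>

// Hypothetical helper (not part of UTRegExp.h): locates a plain-character
// pattern and reports the text before and after it. As with the documented
// pre/post parameters, either output pointer may be null to discard that part.
bool split_on_literal(const std::string& match, const std::string& literal,
                      std::string* pre, std::string* post) {
    const std::size_t pos = match.find(literal);
    if (pos == std::string::npos) {
        return false;  // the pattern does not occur in the string
    }
    if (pre != nullptr) {
        *pre = match.substr(0, pos);
    }
    if (post != nullptr) {
        *post = match.substr(pos + literal.size());
    }
    return true;
}
```

Called with "This is a string" and "is a", this yields "This " and " string", the same split described above.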

In a regular text portion of a regular expression, the characters / ] , $ ) must be escaped if they are to be treated as regular characters to clarify that they are not a part of one of the constructs documented below. The escape sequences are // /] /, /$ /). The & character sometimes needs to be escaped as /& to be treated as a regular character when used in a context where it could be misinterpreted as an aggregate "and not" clause (see below).

The supported escape sequences are:

    "/s"   whitespace (space, tab, carriage return, linefeed)
    "/d"   numeric digit (0-9)
    "/h"   hexadecimal character (0-9,a-f)
    "/H"   hexadecimal character (0-9,A-F)
    "/ih"  hexadecimal character (0-9,a-f,A-F)
    "/c"   character (a-z)
    "/C"   character (A-Z)
    "/ic"  character (a-z,A-Z)
    "/U"   character (a-z,A-Z) plus any UTF8 sequences representing non-ASCII unicode
    "/nU"  character (0-9,a-z,A-Z), plus any UTF8 sequences representing non-ASCII unicode
    "/t"   token character (0-9,a-z,A-Z,_)
    "/T"   token start character (a-z,A-Z,_)
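
The table above can be read as a set of per-character predicates. The following sketch expresses it that way; in_class is a hypothetical function, and treating every byte at or above 0x80 as part of a non-ASCII UTF8 sequence is an assumption made for illustration, not a statement about how the library decodes UTF8.

```cpp
#include <string>

// Hypothetical classifier (not part of UTRegExp.h): returns true if byte `ch`
// belongs to the class named by `type`, the text after the slash ("d", "ih").
bool in_class(const std::string& type, unsigned char ch) {
    const bool digit = ch >= '0' && ch <= '9';
    const bool lower = ch >= 'a' && ch <= 'z';
    const bool upper = ch >= 'A' && ch <= 'Z';
    const bool hex_lower = ch >= 'a' && ch <= 'f';
    const bool hex_upper = ch >= 'A' && ch <= 'F';
    // Assumption: any byte >= 0x80 belongs to a non-ASCII UTF8 sequence.
    const bool non_ascii = ch >= 0x80;

    if (type == "s")  return ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n';
    if (type == "d")  return digit;
    if (type == "h")  return digit || hex_lower;
    if (type == "H")  return digit || hex_upper;
    if (type == "ih") return digit || hex_lower || hex_upper;
    if (type == "c")  return lower;
    if (type == "C")  return upper;
    if (type == "ic") return lower || upper;
    if (type == "U")  return lower || upper || non_ascii;
    if (type == "nU") return digit || lower || upper || non_ascii;
    if (type == "t")  return digit || lower || upper || ch == '_';
    if (type == "T")  return lower || upper || ch == '_';
    return false;  // unknown type code
}
```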

Typically, more than one character is needed in a given matching cluster. At other times, one needs to match zero or one, or zero or more, characters. That is accomplished with count control sequences, which can follow the slash in any of the escape sequences.

The fundamental layout of a complete escape sequence is /CNT where C is an optional count specifier, N is an optional ! for negation, and T is the type, either as documented above (e.g. s in /s) or an aggregation (see below).

The supported count specifiers are:

    "/+"   one or more instances of T or !T
    "/*"   zero or more instances of T or !T
    "/?"   zero or one instances of T or !T
    "/n"   exactly n instances of T or !T, where n is a decimal integer
    "/n+"  n or more instances of T or !T, where n is a decimal integer
    "/m-n" m to n instances of T or !T, where m and n are decimal integers

Count specifier examples:

    "/+d"    one or more numeric digits
    "/*s"    zero or more whitespace characters
    "/?C"    0 or 1 letters in the A-Z range
    "/4ih"   exactly four hexadecimal characters, case insensitive
    "/0-3c"  0 to 3 letters in the a-z range
    "/5+t"   5 or more token characters

Simple negation examples:

    "/!c"     anything but a character (a-z)
    "/+!d"    one or more of anything which is not a numeric digit
    "/0-1!C"  0 or 1 of anything which is not a letter in the A-Z range
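
To make the /CNT layout concrete, the sketch below splits the text following the slash into its count, negation, and type parts. EscapeSpec and parse_escape are hypothetical names, INT_MAX is used here to stand for "no upper bound", and the type is returned unvalidated; none of this is taken from the library's implementation.

```cpp
#include <cctype>
#include <climits>
#include <string>

// Hypothetical result type (not part of UTRegExp.h).
struct EscapeSpec {
    int min_count = 1;     // with no count specifier, exactly one instance
    int max_count = 1;     // INT_MAX stands for "no upper bound"
    bool negated = false;  // true when '!' precedes the type
    std::string type;      // remaining text, e.g. "d", "ih", "[+,-]"
};

// Parses the text after the slash of a single escape sequence, such as
// "0-3c" or "+!d". Returns false if the count is malformed or no type remains.
bool parse_escape(const std::string& body, EscapeSpec* out) {
    EscapeSpec spec;
    std::size_t i = 0;
    auto is_digit_at = [&](std::size_t k) {
        return k < body.size() &&
               std::isdigit(static_cast<unsigned char>(body[k])) != 0;
    };
    auto read_int = [&]() {
        int value = 0;
        while (is_digit_at(i)) value = value * 10 + (body[i++] - '0');
        return value;
    };

    if (i < body.size() && (body[i] == '+' || body[i] == '*' || body[i] == '?')) {
        spec.min_count = (body[i] == '+') ? 1 : 0;
        spec.max_count = (body[i] == '?') ? 1 : INT_MAX;
        ++i;
    } else if (is_digit_at(i)) {
        spec.min_count = spec.max_count = read_int();  // "/n"
        if (i < body.size() && body[i] == '+') {       // "/n+"
            spec.max_count = INT_MAX;
            ++i;
        } else if (i < body.size() && body[i] == '-') {  // "/m-n"
            ++i;
            if (!is_digit_at(i)) return false;
            spec.max_count = read_int();
            if (spec.max_count < spec.min_count) return false;
        }
    }
    if (i < body.size() && body[i] == '!') {
        spec.negated = true;
        ++i;
    }
    if (i >= body.size()) return false;  // a type is mandatory
    spec.type = body.substr(i);
    *out = spec;
    return true;
}
```

Because a count always begins with a digit or one of + * ?, type codes that begin with a letter, such as nU, are not mistaken for counts.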

Aggregation uses a [,] sequence. There can be any number of commas, including zero. Each comma-separated element is one of the following constructs:

    1. An individual character, for example "/[a,e,i,o,u]"
    2. A character sequence, for example "/[coders,rule]"
    3. An escape sequence
    4. A character sequence interspersed with escape sequences, examples are:
       Expression               Matches
       "0x/4ih"                 "0x12aB"
       "0x/4ih + /+d"           "0x12aB + 3"
                                "0x12aB + 123789"
       "0x/4ih/*s/[+,-]/*s/+d"  "0x12aB-3"
                                "0x12aB + 123789"

If a comma in an aggregate is followed by &!, it carries the implication of "and not". If any of the previous comma-separated sequences constituted a match, but the text also matches the &! pattern, then any matches based on the previous comma-separated options will be disregarded. Examples are:

    "/+[/c,&!f]"              one or more instances of the letters a to z, but not f.
    "/[/+[/t,/U],&!/[nope]]"  any word comprised of token characters or unicode characters above the
                              ASCII range (above ASCII because token characters include a-z, A-Z),
                              as long as the word is not "nope".

The regular expression algorithm processes the expression from start to finish, accepting the longest match for each element and then moving on. It is not a "best fit" algorithm. Therefore, if one were to match, for example, "xyz" with the regular expression "/+cz", the /+c portion (one or more lowercase characters) would match the entire string, leaving the trailing z with nothing to match against. To match in the desired manner for this example, with the counted portion matching "xy" and the literal z matching "z", the regular expression would need to be "/+[/c,&!z]z".
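
The consequence of committing to the longest run without backtracking can be reproduced in a few lines. The sketch below hard-codes the "xyz" scenario: consume_lowercase and match_run_then_z are hypothetical functions, and the excluded parameter stands in for an "and not" clause in the style of "/+[/c,&!f]". It models the described behaviour only and is not the library's algorithm.

```cpp
#include <string>

// Consumes the longest run of lowercase letters starting at `pos` and returns
// the index just past it. If `excluded` is non-zero, that letter ends the run
// (modelling an "and not" clause).
std::size_t consume_lowercase(const std::string& text, std::size_t pos,
                              char excluded) {
    std::size_t end = pos;
    while (end < text.size() && text[end] >= 'a' && text[end] <= 'z' &&
           (excluded == '\0' || text[end] != excluded)) {
        ++end;
    }
    return end;
}

// Matches "one or more lowercase letters, then a literal z" against the whole
// string. The run is taken greedily and never shortened afterwards.
bool match_run_then_z(const std::string& text, char excluded) {
    const std::size_t end = consume_lowercase(text, 0, excluded);
    if (end == 0) {
        return false;  // "one or more" needs at least one character
    }
    // Exactly one character must remain, and it must be the literal z.
    return end + 1 == text.size() && text[end] == 'z';
}
```

With no exclusion, the run swallows all of "xyz" and the literal z has nothing left to match, so the call fails. Excluding z stops the run after "xy" and the match succeeds.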

When one of the control characters needs to be treated as a regular character where it could otherwise be interpreted in one of the roles described above, it can itself be specified by an escape sequence:

    "//"  the actual / character
    "/]"  the actual ] character
    "/,"  the actual , character
    "/$"  the actual $ character
    "/)"  the actual ) character
    "/&"  the actual & character
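
When arbitrary text has to be embedded in an expression as plain characters, the escapes listed above can be applied mechanically. The sketch below does so; escape_literal is a hypothetical helper, and escaping & unconditionally (rather than only where it could be misread as "and not") is an assumption based on "/&" being listed as the actual & character.

```cpp
#include <string>

// Hypothetical helper (not part of UTRegExp.h): prefixes each documented
// control character with '/' so the text is treated as plain characters.
std::string escape_literal(const std::string& text) {
    // The six characters with documented escape sequences.
    static const std::string kControl = "/],$)&";
    std::string out;
    out.reserve(text.size());
    for (char ch : text) {
        if (kControl.find(ch) != std::string::npos) {
            out.push_back('/');
        }
        out.push_back(ch);
    }
    return out;
}
```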

Substring extraction is accomplished using a $() sequence outside the context of any escape sequence described above. Once a regular expression string has been created, it can be partitioned into substrings to be extracted during pattern matching. Take this example:

    "0x/4ih/*s/[+,-]/*s/+d"

It can be partitioned into substrings for extraction as follows:

    "0x$(/4ih)/*s$(/[+,-])/*s$(/+d)"

Then, given the match "value: (0x12aB + 123789) == x", the substrings would be:

    pre:          "value: ("
    substring[0]: "12aB"
    substring[1]: "+"
    substring[2]: "123789"
    post:         ") == x"
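
For readers more familiar with conventional regular expressions, the same extraction can be reproduced with std::regex capture groups, where the match prefix and suffix play the roles of pre and post. This is an analogy only: the pattern is a hand translation of the expression above, Extraction and extract_hex_expression are hypothetical names, and std::regex backtracks whereas the algorithm documented here does not.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical result type mirroring pre, substring[], and post.
struct Extraction {
    bool matched = false;
    std::string pre;
    std::vector<std::string> substrings;
    std::string post;
};

// std::regex analogue of "0x$(/4ih)/*s$(/[+,-])/*s$(/+d)". The whitespace
// class lists only space, tab, carriage return, and linefeed, as /s does.
Extraction extract_hex_expression(const std::string& text) {
    static const std::regex kPattern(
        "0x([0-9a-fA-F]{4})[ \t\r\n]*([+-])[ \t\r\n]*([0-9]+)");
    Extraction result;
    std::smatch m;
    if (!std::regex_search(text, m, kPattern)) {
        return result;  // matched stays false
    }
    result.matched = true;
    result.pre = m.prefix().str();
    for (std::size_t i = 1; i < m.size(); ++i) {
        result.substrings.push_back(m[i].str());  // group 0 is the whole match
    }
    result.post = m.suffix().str();
    return result;
}
```

Applied to "value: (0x12aB + 123789) == x", this produces the same five strings as listed above.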

The varargs versions of this function take a variable number of String_t pointers in place of the substring_array pointer.

Returns false if the pattern did not match, true if it did, and eERR_bad_regexp if the regular expression was invalid.

Status_t evaluate_reg_exp	(	const utf8 *	expression,
		const utf8 *	match,
		bool	force_start_match = true,
		String_t *	pre = NULL,
		bool	force_end_match = true,
		String_t *	post = NULL,
			...
	)

Evaluates a regular expression, varargs variant. See the non-varargs version for documentation.

Status_t evaluate_reg_exp	(	stringliteral *	expression,
		const utf8 *	match,
		bool	force_start_match = true,
		String_t *	pre = NULL,
		bool	force_end_match = true,
		String_t *	post = NULL,
			...
	)

Evaluates a regular expression, string literal expression and varargs variant. See the non-varargs version for documentation.

Status_t validate_reg_exp	(	const utf8 *	expression,
		int *	expression_fail_point_chars,
		const utf8 *	match,
		int *	match_fail_pos_chars,
		bool	force_start_match = true,
		String_t *	pre = NULL,
		bool	force_end_match = true,
		String_t *	post = NULL,
		String_t **	substring_array = NULL,
		int	substring_count = 0
	)

Validates a regular expression to determine whether and where a failure occurred. The match parameter can be NULL, in which case only the regular expression itself is checked. The failure point can then be used, for example, to highlight the error for the user when the expression was user-specified.
