UTRegExp.h File Reference


Functions

Status_t evaluate_reg_exp (const utf8 *expression, const utf8 *match, bool force_start_match, String_t *pre, String_t **substring_array, int substring_count, bool force_end_match, String_t *post)
Status_t evaluate_reg_exp (const utf8 *expression, const utf8 *match, bool force_start_match=true, String_t *pre=NULL, bool force_end_match=true, String_t *post=NULL,...)
Status_t evaluate_reg_exp (stringliteral *expression, const utf8 *match, bool force_start_match=true, String_t *pre=NULL, bool force_end_match=true, String_t *post=NULL,...)
Status_t validate_reg_exp (const utf8 *expression, int *expression_fail_point_chars, const utf8 *match, int *match_fail_pos_chars, bool force_start_match=true, String_t *pre=NULL, bool force_end_match=true, String_t *post=NULL, String_t **substring_array=NULL, int substring_count=0)

Function Documentation

Status_t evaluate_reg_exp ( const utf8 *  expression,
const utf8 *  match,
bool  force_start_match,
String_t *  pre,
String_t **  substring_array,
int  substring_count,
bool  force_end_match,
String_t *  post 
)

Evaluates a regular expression.

If force_start_match is true, the string must match the expression from its first character. In that case, pre must be NULL. If force_start_match is false, characters in the string before the match point can either be returned through pre or discarded if pre is NULL.

If force_end_match is true, the match must extend to the last character of the string. In that case, post must be NULL. If force_end_match is false, characters in the string after the expression match can either be returned through post or discarded if post is NULL.

The expression string is a mix of plain characters and escape sequences. For example, with force_start_match and force_end_match both false, if match is "This is a string" and expression is "is a", then pre will be "This " and post will be " string". If the regular expression were "/sis a/s", where /s matches any whitespace character, pre would be "This" and post would be "string". Note that the pattern match "absorbed" the spaces.
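The pre/post split described above can be illustrated without the UT library. The following is a minimal sketch that uses std::regex (ECMAScript syntax, not the UT syntax) as a stand-in; SplitResult and split_around_match are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// The three pieces a successful search splits the input into.
struct SplitResult {
    bool matched = false;
    std::string pre;           // text before the match (the role of `pre`)
    std::string matched_text;  // text consumed by the pattern
    std::string post;          // text after the match (the role of `post`)
};

// Searches `text` for `pattern`. The pattern is ECMAScript syntax, used here
// only as an analogue of an unanchored evaluate_reg_exp call.
SplitResult split_around_match(const std::string& text, const std::string& pattern) {
    SplitResult r;
    std::smatch m;
    if (std::regex_search(text, m, std::regex(pattern))) {
        r.matched = true;
        r.pre = m.prefix().str();
        r.matched_text = m.str();
        r.post = m.suffix().str();
    }
    return r;
}
```

With the text "This is a string", the plain pattern "is a" leaves the surrounding spaces in pre and post, while "\\sis a\\s" (the ECMAScript counterpart of "/sis a/s") absorbs them into the match.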

In a regular text portion of a regular expression, the characters / ] , $ ) must be escaped if they are to be treated as regular characters to clarify that they are not a part of one of the constructs documented below. The escape sequences are // /] /, /$ /). The & character sometimes needs to be escaped as /& to be treated as a regular character when used in a context where it could be misinterpreted as an aggregate "and not" clause (see below).

The supported escape sequences are:

    "/s"   whitespace (space, tab, carriage return, linefeed)
    "/d"   numeric digit (0-9)
    "/h"   hexadecimal character (0-9,a-f)
    "/H"   hexadecimal character (0-9,A-F)
    "/ih"  hexadecimal character (0-9,a-f,A-F)
    "/c"   character (a-z)
    "/C"   character (A-Z)
    "/ic"  character (a-z,A-Z)
    "/U"   character (a-z,A-Z) plus any UTF8 sequences representing non-ASCII unicode
    "/nU"  character (0-9,a-z,A-Z), plus any UTF8 sequences representing non-ASCII unicode
    "/t"   token character (0-9,a-z,A-Z,_)
    "/T"   token start character (a-z,A-Z,_)
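
The character classes listed above can be read as simple per-character predicates. The sketch below is an assumption about how such a table might look, not the library's implementation; in_class is a hypothetical helper, and the non-ASCII case is approximated by testing a single byte for the high bit rather than decoding a full UTF8 sequence.

```cpp
#include <cassert>
#include <string>

// Returns true if byte `ch` belongs to the class named by `type`
// (the part after the slash, e.g. "s", "ih", "nU").
// Assumption: "non-ASCII UTF8" is approximated by a high-bit test on one byte.
bool in_class(const std::string& type, unsigned char ch) {
    const bool digit = ch >= '0' && ch <= '9';
    const bool lower = ch >= 'a' && ch <= 'z';
    const bool upper = ch >= 'A' && ch <= 'Z';
    const bool hex_lower = ch >= 'a' && ch <= 'f';
    const bool hex_upper = ch >= 'A' && ch <= 'F';
    const bool non_ascii = ch >= 0x80;

    if (type == "s")  return ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n';
    if (type == "d")  return digit;
    if (type == "h")  return digit || hex_lower;
    if (type == "H")  return digit || hex_upper;
    if (type == "ih") return digit || hex_lower || hex_upper;
    if (type == "c")  return lower;
    if (type == "C")  return upper;
    if (type == "ic") return lower || upper;
    if (type == "U")  return lower || upper || non_ascii;
    if (type == "nU") return digit || lower || upper || non_ascii;
    if (type == "t")  return digit || lower || upper || ch == '_';
    if (type == "T")  return lower || upper || ch == '_';
    return false;  // unknown class name
}
```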
    

Typically, a matching cluster needs more than one character, and sometimes it needs to match zero or one, or zero or more, characters. This is accomplished with count control sequences, which can follow the slash in any of the escape sequences.

The fundamental layout of a complete escape sequence is /CNT, where C is an optional count specifier, N is an optional ! for negation, and T is the type, either as documented above (e.g. s in /s) or an aggregation (see below).
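
The /CNT layout can be sketched as a small parser that splits the text after the slash into a count range, a negation flag, and the type. This is an illustrative assumption about the decomposition, not the library's code; EscapeSpec and parse_escape are hypothetical names, and the count forms handled are the ones this reference documents (+, *, ?, n, n+, m-n).

```cpp
#include <cassert>
#include <cctype>
#include <climits>
#include <string>

struct EscapeSpec {
    int min_count = 1;     // default: exactly one instance
    int max_count = 1;     // INT_MAX stands for "no upper bound"
    bool negated = false;  // true if '!' was present
    std::string type;      // remainder, e.g. "d", "ih", or an aggregate
};

// Parses the text that follows the leading slash, e.g. "0-3c" or "+!d".
EscapeSpec parse_escape(const std::string& body) {
    EscapeSpec spec;
    std::size_t i = 0;

    // Reads a run of decimal digits starting at position i.
    auto read_int = [&]() {
        int value = 0;
        while (i < body.size() && std::isdigit(static_cast<unsigned char>(body[i]))) {
            value = value * 10 + (body[i] - '0');
            ++i;
        }
        return value;
    };

    if (i < body.size()) {
        const char c = body[i];
        if (c == '+')      { spec.max_count = INT_MAX; ++i; }                      // one or more
        else if (c == '*') { spec.min_count = 0; spec.max_count = INT_MAX; ++i; }  // zero or more
        else if (c == '?') { spec.min_count = 0; ++i; }                            // zero or one
        else if (std::isdigit(static_cast<unsigned char>(c))) {
            spec.min_count = spec.max_count = read_int();                          // exactly n
            if (i < body.size() && body[i] == '+') { spec.max_count = INT_MAX; ++i; }          // n or more
            else if (i < body.size() && body[i] == '-') { ++i; spec.max_count = read_int(); }  // m to n
        }
    }
    if (i < body.size() && body[i] == '!') { spec.negated = true; ++i; }
    spec.type = body.substr(i);
    return spec;
}
```

For instance, parse_escape("0-1!C") yields the range 0 to 1, negated, with type "C".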

The supported count specifiers are:

    "/+"   one or more instances of T or !T
    "/*"   zero or more instances of T or !T
    "/?"   zero or one instances of T or !T
    "/n"   exactly n instances of T or !T, where n is a decimal integer
    "/n+"  n or more instances of T or !T, where n is a decimal integer
    "/m-n" m to n instances of T or !T, where m and n are decimal integers
    

Count specifier examples:

    "/+d"    one or more numeric digits
    "/*s"    zero or more whitespace characters
    "/?C"    0 or 1 letters in the A-Z range
    "/4ih"   exactly four hexadecimal characters, case insensitive
    "/0-3c"  0 to 3 letters in the a-z range
    "/5+t"   5 or more token characters
    

Simple negation examples:

    "/!c"     anything but a character (a-z)
    "/+!d"    one or more of anything which is not a numeric digit
    "/0-1!C"  0 or 1 of anything which is not a letter in the A-Z range
    

Aggregation uses a [,] sequence. There can be any number of commas, including zero. Each comma-separated element is one of the following constructs:

    1. An individual character, for example "/[a,e,i,o,u]"
    2. A character sequence, for example "/[coders,rule]"
    3. An escape sequence
    4. A character sequence interspersed with escape sequences, examples are:
       Expression               Matches
       "0x/4ih"                 "0x12aB"
       "0x/4ih + /+d"           "0x12aB + 3"
                                "0x12aB + 123789"
       "0x/4ih/*s/[+,-]/*s/+d"  "0x12aB-3"
                                "0x12aB + 123789"
    

If a comma in an aggregate is followed by &!, it carries the implication of "and not". If any of the preceding comma-separated sequences matched, but the text also matches the &! pattern, then the matches from the preceding options are disregarded. Examples are:

    "/+[/c,&!f]"              one or more instances of the letters a to z, but not f.
    "/[/+[/t,/U],&!/[nope]]"  any word comprised of token characters or unicode characters above the
                              ASCII range (above ASCII because token characters include a-z, A-Z),
                              as long as the word is not "nope".
    

The regular expression algorithm processes the expression from start to finish, accepting the longest match for each element and then moving on. It is not a "best fit" algorithm. Therefore, if one were to match, for example, "xyz" with the regular expression "/+cz", the /+c portion (one or more lowercase characters) would match the entire string, leaving the trailing z with nothing to match against. To match in the desired manner for this example, with the counted portion matching "xy" and the literal z matching "z", the regular expression would need to be "/+[/c,&!z]z".
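
The longest-match behaviour can be imitated with a hand-written matcher that never gives characters back. The sketch below hard-codes the "xyz" example (a run of lowercase letters followed by a literal z); consume_lowercase and greedy_match_then_z are hypothetical helpers written for this illustration and are not part of the library.

```cpp
#include <cassert>
#include <string>

// Consumes as many a-z characters as possible starting at `pos`, optionally
// refusing one excluded letter, and returns how many were consumed.
// There is no backtracking: the count returned is final.
std::size_t consume_lowercase(const std::string& text, std::size_t pos, char excluded = '\0') {
    std::size_t n = 0;
    while (pos + n < text.size()) {
        const char ch = text[pos + n];
        if (ch < 'a' || ch > 'z' || ch == excluded) break;
        ++n;
    }
    return n;
}

// Matches "<one or more lowercase letters><literal z>" against the whole
// string, taking the longest run first and never shortening it afterwards.
bool greedy_match_then_z(const std::string& text, char excluded = '\0') {
    const std::size_t run = consume_lowercase(text, 0, excluded);
    if (run == 0) return false;  // "one or more" needs at least one character
    return text.size() == run + 1 && text[run] == 'z';
}
```

Without an exclusion, the run swallows all of "xyz" and the literal z has nothing left, so the match fails. Passing 'z' as the excluded letter stops the run after "xy", which is the effect of the "and not" clause.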

When one of the control characters needs to be treated as a regular character where it could otherwise be interpreted in one of the roles described above, it can itself be specified by an escape sequence:

    "//"  the actual / character
    "/]"  the actual ] character
    "/,"  the actual , character
    "/$"  the actual $ character
    "/)"  the actual ) character
    "/&"  the actual & character
    

Substring extraction is accomplished using a $() sequence outside the context of any escape sequence described above. Once a regular expression string has been created, it can be partitioned into substrings to be extracted during pattern matching. Take this example:

    "0x/4ih/*s/[+,-]/*s/+d"
    

It can be partitioned into substrings for extraction as follows:

    "0x$(/4ih)/*s$(/[+,-])/*s$(/+d)"
    

Then, given the match "value: (0x12aB + 123789) == x", the substrings would be:

    pre:          "value: ("
    substring[0]: "12aB"
    substring[1]: "+"
    substring[2]: "123789"
    post:         ") == x"
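
The extraction result above can be reproduced with std::regex as a stand-in for the UT engine. The ECMAScript pattern below is a rough translation of "0x$(/4ih)/*s$(/[+,-])/*s$(/+d)" made for this sketch; Extraction and extract_hex_sum are hypothetical names, and capture groups play the role of the $() sequences.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

struct Extraction {
    bool matched = false;
    std::string pre;                      // text before the match
    std::vector<std::string> substrings;  // one entry per capture group
    std::string post;                     // text after the match
};

Extraction extract_hex_sum(const std::string& text) {
    // Four hex digits, optional whitespace, a sign, optional whitespace, digits.
    static const std::regex pattern(R"(0x([0-9a-fA-F]{4})\s*([+-])\s*(\d+))");
    Extraction out;
    std::smatch m;
    if (std::regex_search(text, m, pattern)) {
        out.matched = true;
        out.pre = m.prefix().str();
        for (std::size_t i = 1; i < m.size(); ++i) {  // group 0 is the whole match
            out.substrings.push_back(m[i].str());
        }
        out.post = m.suffix().str();
    }
    return out;
}
```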
    

The varargs versions of this function take a variable number of String_t pointers in place of the substring_array pointer.

Returns false if the pattern did not match, true if it did, and eERR_bad_regexp if the regular expression was invalid.

Status_t evaluate_reg_exp ( const utf8 *  expression,
const utf8 *  match,
bool  force_start_match = true,
String_t *  pre = NULL,
bool  force_end_match = true,
String_t *  post = NULL,
  ... 
)

Evaluates a regular expression, varargs variant. See the non-varargs version for documentation.

Status_t evaluate_reg_exp ( stringliteral *  expression,
const utf8 *  match,
bool  force_start_match = true,
String_t *  pre = NULL,
bool  force_end_match = true,
String_t *  post = NULL,
  ... 
)

Evaluates a regular expression, string literal expression and varargs variant. See the non-varargs version for documentation.

Status_t validate_reg_exp ( const utf8 *  expression,
int *  expression_fail_point_chars,
const utf8 *  match,
int *  match_fail_pos_chars,
bool  force_start_match = true,
String_t *  pre = NULL,
bool  force_end_match = true,
String_t *  post = NULL,
String_t **  substring_array = NULL,
int  substring_count = 0 
)

Validates a regular expression to determine whether and where a failure occurred. The match parameter can be NULL, in which case only the regular expression itself is checked. The failure point is typically used to highlight the error to the user when the expression was user-specified.
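
One way to highlight a reported failure point is to print a caret under the offending position. The sketch below assumes the value returned through expression_fail_point_chars is a zero-based character offset and that the expression is plain ASCII; neither is stated in this reference, and mark_failure is a hypothetical helper, not a library function.

```cpp
#include <cassert>
#include <string>

// Builds a two-line message: the text, then a caret under `fail_pos`.
// Assumption: `fail_pos` is a zero-based character offset and each
// character occupies one column (ASCII only).
std::string mark_failure(const std::string& text, int fail_pos) {
    if (fail_pos < 0) fail_pos = 0;
    if (static_cast<std::size_t>(fail_pos) > text.size()) {
        fail_pos = static_cast<int>(text.size());  // clamp to end of text
    }
    return text + "\n" + std::string(static_cast<std::size_t>(fail_pos), ' ') + "^";
}
```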


Generated on Tue Dec 14 22:35:06 2010 for UT library by  doxygen 1.6.1