Regular expressions regex
0001 Jun 1

http://www.youtube.com/watch?v=EkluES9Rvak
http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html
http://www.bluebox.net/about/blog/2013/02/using-regular-expressions-in-ruby-part-1-of-3/

Regular Expression Language - Quick Reference http://msdn.microsoft.com/en-us/library/vstudio/az24scfc%28v=vs.100%29.aspx
.NET Framework Regular Expressions http://msdn.microsoft.com/en-us/library/hs600312.aspx

indexing
match method
\G  match must start where the previous match stopped


\Z
\z
/i
/x
/g

unrolling loop technique
anchor everything

quantifiers
 lazy
 *?  +?
    minimum effort for minimum return
 greedy
 *  +
    maximum effort for maximum return
 possessive
 *+  ++
    like greedy, but never gives back what it has matched

lookaround
 lookahead
 lookbehind

capture groups

(?=foo)    Lookahead. Asserts that what immediately follows the current position in the string is foo.
(?<=foo)   Lookbehind. Asserts that what immediately precedes the current position in the string is foo.
(?!foo)    Negative lookahead. Asserts that what immediately follows the current position in the string is not foo.
(?<!foo)   Negative lookbehind. Asserts that what immediately precedes the current position in the string is not foo.
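
A small sketch of the four lookaround assertions, written in Python's `re` module as an assumption (the notes here mostly follow the .NET reference, but these four constructs are spelled the same way). The sample text and variable names are made up for illustration.

```python
import re

text = "price: $100, cost: 200 EUR"  # hypothetical sample input

# (?=...) lookahead: digits that are immediately followed by " EUR"
followed_by_eur = re.findall(r"\d+(?= EUR)", text)

# (?<=...) lookbehind: digits that are immediately preceded by "$"
preceded_by_dollar = re.findall(r"(?<=\$)\d+", text)

# (?!...) negative lookahead: whole numbers NOT followed by " EUR"
not_followed_by_eur = re.findall(r"\b\d+\b(?! EUR)", text)

# (?<!...) negative lookbehind: whole numbers NOT preceded by "$"
not_preceded_by_dollar = re.findall(r"(?<!\$)\b\d+", text)

print(followed_by_eur, preceded_by_dollar)
print(not_followed_by_eur, not_preceded_by_dollar)
```

None of the assertions consume text, which is why the "$" and " EUR" never show up in the results.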

backreference

    \number  Backreference. For example, (\w)\1 finds doubled word characters.
    \k<name>  Named backreference. For example, (?<char>\w)\k<char> finds doubled word characters.
            The expression (?<43>\w)\43 does the same. You can use single quotes instead of angle brackets;
            for example, \k'char'.
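
The doubled-character example can be tried out roughly like this. The sketch assumes Python's `re` module, where the named forms are spelled `(?P<char>...)` and `(?P=char)` instead of the .NET `(?<char>...)` and `\k<char>` shown above; the test word is arbitrary.

```python
import re

word = "bookkeeper"  # hypothetical sample input

# \1 refers back to whatever group 1 captured: finds doubled word characters.
doubled = [m.group(0) for m in re.finditer(r"(\w)\1", word)]

# Named backreference, Python spelling of (?<char>\w)\k<char>.
doubled_named = [m.group(0) for m in re.finditer(r"(?P<char>\w)(?P=char)", word)]

print(doubled)
print(doubled_named)
```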
capture groups

substitutions

    $number  Substitutes the last substring matched by group number number (decimal).
    ${name}  Substitutes the last substring matched by a (?<name> ) group.
    $$  Substitutes a single "$" literal.
    $&  Substitutes a copy of the entire match itself.
    $`  Substitutes all the text of the input string before the match.
    $'  Substitutes all the text of the input string after the match.
    $+  Substitutes the last group captured.
    $_  Substitutes the entire input string
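
The `$`-style substitutions above are the .NET spelling. As a rough illustration of the same idea in another flavor, here is a sketch with Python's `re.sub`, which (as far as this example goes) uses `\1` for `$1`, `\g<name>` for `${name}` and `\g<0>` for `$&`. The sample name is invented.

```python
import re

text = "John Smith"  # hypothetical sample input

# \2 \1 (in .NET: $2 $1) swaps the two numbered groups.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", text)

# \g<name> (in .NET: ${name}) substitutes a named group.
named = re.sub(r"(?P<first>\w+) (?P<last>\w+)", r"\g<last>, \g<first>", text)

# \g<0> (in .NET: $&) substitutes a copy of the entire match.
bracketed = re.sub(r"\w+", r"[\g<0>]", text)

print(swapped)
print(named)
print(bracketed)
```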

quantifiers lazy

    *?  Specifies the first match that consumes as few repeats as possible (equivalent to lazy *).
    +?  Specifies as few repeats as possible, but at least one (equivalent to lazy +).
    ??  Specifies zero repeats if possible, or one (lazy ?).
    {n}?  Equivalent to {n} (lazy {n}).
    {n,}?  Specifies as few repeats as possible, but at least n (lazy {n,}).
    {n,m}? Specifies as few repeats as possible between n and m (lazy {n,m}).

atomic zero-width assertions

    ^  Specifies that the match must occur at the beginning of the string or the beginning of the line.
    $  Specifies that the match must occur at the end of the string, before \n at the end of the string, or at the end of the line.
    \A  Specifies that the match must occur at the beginning of the string (ignores the Multiline option).
    \Z  Specifies that the match must occur at the end of the string or before \n at the end of the string (ignores the Multiline option).
    \z  Specifies that the match must occur at the end of the string (ignores the Multiline option).
    \G  Specifies that the match must occur at the point where the previous match ended.
    \b  Specifies that the match must occur on a boundary between \w (alphanumeric) and \W (nonalphanumeric) characters.
    \B  Specifies that the match must not occur on a \b boundary.
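
A quick way to see the anchors in action, sketched in Python as an assumption. Not every anchor listed above carries over: Python's `re` has no `\G`, and its `\Z` matches only at the very end of the string (like the .NET `\z`), so the sketch sticks to `^`, `\A`, `\b` and `\B`.

```python
import re

text = "first line\nsecond line\n"  # hypothetical sample input

# Without the multiline option, ^ matches only at the very start of the string.
starts_default = re.findall(r"^\w+", text)

# With the multiline option, ^ also matches right after each newline.
starts_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

# \A always means start of string and ignores the multiline option.
starts_absolute = re.findall(r"\A\w+", text, flags=re.MULTILINE)

# \b word boundary: "cat" as a whole word only.
whole_word = re.findall(r"\bcat\b", "cat concat catalog cat.")

# \B: "cat" only where it does NOT start at a word boundary.
inside_word = re.findall(r"\Bcat", "cat concat catalog")

print(starts_default, starts_multiline, starts_absolute)
print(whole_word, inside_word)
```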

grouping constructs

    ( )  Captures the matched substring. Captures using () are numbered automatically based on the order of the opening parenthesis, starting from one. The first capture, capture element number zero, is the text matched by the whole regular expression pattern.
    (?<name> )  Captures the matched substring into a group name or number name. The string used for name must not contain any punctuation and it cannot begin with a number. You can use single quotes instead of angle brackets; for example, (?'name').
    (?<name1-name2> )  Balancing group definition. Deletes the definition of the previously defined group name2 and stores in group name1 the interval between the previously defined name2 group and the current group. If no group name2 is defined, the match backtracks.
    Because deleting the last definition of name2 reveals the previous definition of name2, this construct allows the stack of captures for group name2 to be used as a counter for keeping track of nested constructs such as parentheses. In this construct, name1 is optional.
    You can use single quotes instead of angle brackets; for example, (?'name1-name2').
    (?: )  Noncapturing group.
    (?imnsx-imnsx: )  Applies or disables the specified options within the subexpression. For example, (?i-s: ) turns on case insensitivity and disables single-line mode. For more information, see Regular Expression Options.
    (?= ) Zero-width positive lookahead assertion. Continues match only if the subexpression matches at this position on the right. For example, \w+(?=\d) matches a word followed by a digit, without matching the digit. This construct does not backtrack.
    (?! )  Zero-width negative lookahead assertion. Continues match only if the subexpression does not match at this position on the right. For example, \b(?!un)\w+\b matches words that do not begin with un.
    (?<= ) Zero-width positive lookbehind assertion. Continues match only if the subexpression matches at this position on the left. For example, (?<=19)99 matches instances of 99 that follow 19. This construct does not backtrack.
    (?<! ) Zero-width negative lookbehind assertion. Continues match only if the subexpression does not match at the position on the left.
    (?> ) Nonbacktracking subexpression (also known as a "greedy" subexpression). The subexpression is fully matched once, and then does not participate piecemeal in backtracking.
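
To make the numbering rules concrete, here is a minimal sketch of numbered, named and noncapturing groups. It assumes Python's `re`, where a named group is written `(?P<name> )` instead of `(?<name> )`; balancing groups are .NET-specific and are not covered. The date string is an invented example.

```python
import re

date = "2013-02-14"  # hypothetical sample input

# ( ) groups are numbered by opening parenthesis; group 0 is the whole match.
m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", date)
numbered = (m.group(0), m.group(1), m.group(2), m.group(3))

# Named groups, Python spelling of (?<year> ) and friends.
m = re.fullmatch(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", date)
named = m.groupdict()

# (?: ) groups for repetition without capturing, so only the year is captured.
m = re.fullmatch(r"(\d{4})(?:-\d{2}){2}", date)
only_year = m.groups()

print(numbered)
print(named)
print(only_year)
```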

Basic expressions

*               Specifies zero or more matches; for example, \w* or (abc)*. Equivalent to {0,}.
+               Specifies one or more matches; for example, \w+ or (abc)+. Equivalent to {1,}.
?               Specifies zero or one matches; for example, \w? or (abc)?. Equivalent to {0,1}.
.               any character except new line
^               inside [], any character except the listed ones
(?: )  ( )      group

Escapes

\               escapes a metacharacter
\b  Matches a backspace if in a [] character class. In a regular expression, \b denotes a word boundary.
\t  Matches a tab.
\r  Matches a return.
\v  Matches a vertical tab.
\f  Matches a form feed.
\n  Matches a new line.
\e  Matches an escape.
\040  Matches an ASCII character as octal (up to three digits)
\x20  Matches an ASCII character using hexadecimal representation (exactly two digits).
\cC  Matches an ASCII control character; for example, \cC is control-C.
\u0020 Matches a Unicode character using hexadecimal representation (exactly four digits).

Character classes

[]              any of specified symbols
[ABCDEF]        any symbol from A-F
[A-F]           any symbol from A-F
[^A-F]          any symbol except A-F

^   beginning of text (or of a line in multiline mode)
$   end of text (or of a line in multiline mode)


\1...\9 match the text matched by the corresponding group
    \(te[sx]t\)_\1   test_test OR text_text

counted repetitions

\{min,max\}
    X\{0,2\}L   L, XL, XXL
    [0-9]\{5,\} at least 5 digits
    [0-9]\{5\}  exactly 5 digits
{n}  Specifies exactly n matches; for example, (pizza){2}.
{n,}  Specifies at least n matches; for example, (abc){2,}.
{n,m}  Specifies at least n, but no more than m, matches.
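
The counted repetitions above can be checked with a few lines. This sketch assumes Python's `re`, which uses the unescaped `{min,max}` form; the escaped `\{min,max\}` form belongs to basic POSIX-style expressions. The list of sizes is illustrative.

```python
import re

sizes = ["L", "XL", "XXL", "XXXL"]  # hypothetical sample input

# X{0,2}L accepts zero, one or two X before L, so XXXL is rejected.
valid_sizes = [s for s in sizes if re.fullmatch(r"X{0,2}L", s)]

# {5,} means at least five; {5} means exactly five.
at_least_five = bool(re.fullmatch(r"[0-9]{5,}", "1234567"))
exactly_five = bool(re.fullmatch(r"[0-9]{5}", "1234567"))

# {n} applies to a whole group as well.
double_pizza = bool(re.fullmatch(r"(pizza){2}", "pizzapizza"))

print(valid_sizes, at_least_five, exactly_five, double_pizza)
```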

alternation constructs

|  Matches any one of the terms separated by the | (vertical bar) character; for example, cat|dog|tiger. The leftmost successful match wins.
(?(expression)yes|no)  Matches the "yes" part if the expression matches at this point; otherwise, matches the "no" part. The "no" part can be omitted.
(?(name)yes|no)  Matches the "yes" part if the named capture string has a match; otherwise, matches the "no" part.
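
A short sketch of alternation and of the group-based conditional, assuming Python's `re`. Python supports the `(?(name)yes|no)` form (tested by group name or number) but, to my knowledge, not the `(?(expression)yes|no)` form, so only the former is shown. The helper name `is_balanced` is hypothetical.

```python
import re

# Plain alternation: any one of the terms.
animals = re.findall(r"cat|dog|tiger", "a dog chased a cat")

# The leftmost alternative that succeeds wins, even if a later one is longer.
leftmost = re.search(r"cat|category", "category").group()

# Conditional: require a closing ">" only if group 1 matched an opening "<".
def is_balanced(s):
    return re.fullmatch(r"(<)?\w+(?(1)>)", s) is not None

print(animals, leftmost)
print(is_balanced("<tag>"), is_balanced("tag"), is_balanced("<tag"), is_balanced("tag>"))
```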

Examples

    file\.txt   match file.txt and nothing else
    C:\\    match C:\
    RealNumberPattern = @"(\d+((\.|,)\d+)?)";
    IntegerNumberPattern = @"(\d+)";
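
The patterns above can be exercised as follows. This is a sketch in Python rather than the C#-style verbatim strings used in the notes; the constant names are simply the originals in Python naming style, and the inputs are made up.

```python
import re

# Same patterns as above, written as Python raw strings.
REAL_NUMBER_PATTERN = r"(\d+((\.|,)\d+)?)"
INTEGER_NUMBER_PATTERN = r"(\d+)"

# Accepts "." or "," as the decimal separator; the fraction is optional.
real_results = {
    s: bool(re.fullmatch(REAL_NUMBER_PATTERN, s))
    for s in ["3.14", "3,14", "42", "3."]
}
integer_ok = bool(re.fullmatch(INTEGER_NUMBER_PATTERN, "42"))

# The escaped dot matches only a literal "."; an unescaped one matches anything.
escaped_dot = bool(re.fullmatch(r"file\.txt", "fileXtxt"))
unescaped_dot = bool(re.fullmatch(r"file.txt", "fileXtxt"))

# C:\\ in the pattern matches the three characters C:\
drive = bool(re.fullmatch(r"C:\\", "C:\\"))

print(real_results, integer_ok, escaped_dot, unescaped_dot, drive)
```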

Extended expressions

\(\)    becomes ()
\{,\}   becomes {,}
No back references \1...\9
? equals to {0,1}   match at most one
+ equals to {1,}    match at least one
|   match either side
[:alpha:] [:digit:] [:alnum:] [:upper:] [:lower:] [:xdigit:] [:space:] [:blank:]
[= =] [. .]   equivalence class (any variation of a symbol) and collating symbol
[[=A=][=C=]]    match any variation of A or C, e.g. Å, Ç

Perl expressions

\d      match digit
\D      match not digit
\w      match letter, digit, underscore
\W      match non-word symbol
\b      word boundary (beginning or end of a word)
\s      match space tab new line
\S      match non-white space
\p{name}  Matches any character in the named character class specified by {name}.
\P{name}  Matches text not included in groups and block ranges specified in {name}.
\p{L} \p{Letter}    match Unicode letter
\p{Lu}      match uppercase letter
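\p{N}       match any Unicode number (\p{Nd} for decimal digits)
\p{Z}       match Unicode separator (space, line, or paragraph separator)
\P{L}       match any character that is not a Unicode letter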
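
A minimal sketch of the shorthand classes, assuming Python's `re`. Note the limitation: the standard `re` module does not support `\p{...}` Unicode properties, so only `\d`, `\w`, `\s` and `\S` are shown here. The sample text is invented.

```python
import re

text = "Order 66, item_7: 3 units\t(ok)"  # hypothetical sample input

# \d matches digits.
digits = re.findall(r"\d+", text)

# \w matches letters, digits and underscore, so "item_7" stays in one piece.
words = re.findall(r"\w+", text)

# \S matches anything that is not whitespace, punctuation included.
chunks = re.findall(r"\S+", text)

# \s matches space, tab and new line, which makes it handy for splitting.
fields = re.split(r"\s+", "a b\tc\nd")

print(digits)
print(words)
print(chunks)
print(fields)
```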

(?:)    non-capturing group
(?=)    check what is ahead
(?!)    check what is not ahead

(?flags)    change matching behaviour from this point to the end of the pattern
    i   case insensitive
    m   makes ^ and $ match at line terminators
    s   makes . match line terminators
(?flags:)   change matching behaviour for non-capturing group only
    (?i:<HTML>)     match <HTML> <hTml>
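
The inline flags can be tried out with a sketch like the one below. It assumes Python's `re`, which accepts the scoped `(?i:...)` form (Python 3.6 and later) and requires global flags such as `(?i)` to appear at the very start of the pattern. Inputs are illustrative.

```python
import re

# Scoped flag: case insensitivity applies inside the group only.
html_any_case = re.findall(r"(?i:<HTML>)", "<HTML> <hTml> <body>")

# Global i flag at the start of the pattern.
all_html = re.findall(r"(?i)html", "HTML html Html")

# m flag: ^ and $ match at the start and end of every line.
lines = re.findall(r"(?m)^\w+$", "one\ntwo")

# s flag: . also matches a line terminator.
dot_without_s = re.fullmatch(r"a.b", "a\nb") is not None
dot_with_s = re.fullmatch(r"(?s)a.b", "a\nb") is not None

print(html_any_case, all_html, lines, dot_without_s, dot_with_s)
```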


*? ?? +? {,}?   match lazily
    <.*?>   in <b>bold</b>  matches just <b>
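
The difference between the greedy and the lazy form of that pattern, as a minimal Python sketch (Python being an assumption here; the behaviour is the same in most backtracking flavors):

```python
import re

html = "<b>bold</b>"

# Greedy: .* grabs as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.*>", html).group()

# Lazy: .*? grabs as little as it can, so the match stops at the first ">".
lazy = re.search(r"<.*?>", html).group()

# Repeated lazy matching picks up each tag separately.
all_tags = re.findall(r"<.*?>", html)

print(greedy)
print(lazy)
print(all_tags)
```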
