TWiki . TWiki . RegularExpression

Regular expressions (REs), unlike simple queries, allow you to search for text which matches a particular pattern.

REs are similar to (but more powerful than) the "wildcards" used in the command-line interfaces of operating systems such as Unix and MS-DOS. REs are used by sophisticated search engines, as well as by many Unix-based languages and tools (e.g., awk, grep, lex, perl, and sed).

Examples

compan(y|ies)      Search for company, companies
(peter|paul)       Search for peter, paul
bug*               Matches bu followed by zero or more g. A search finds bug, bugs, bugfix (but also bus)
[Bb]ag             Search for Bag, bag
b[aiueo]g          Second letter is a vowel. Matches bag, bug, big
b.g                Second character is any character. Matches bag, bug, and also b&g
[a-zA-Z]           Matches any one letter (not a digit or a symbol)
[^0-9a-zA-Z]       Matches any one character that is neither a digit nor a letter (e.g. a symbol)
[A-Z][A-Z]*        Matches one or more uppercase letters
[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]
                   US social security number, e.g. 123-45-6789
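
The example patterns can be tried out with a few lines of code. The following is a minimal sketch that assumes Python's re module as a convenient stand-in; the engine behind a particular search form may differ in details. The test strings and the helper name check are made up for illustration. Whole-string matching (re.fullmatch) is used for most patterns, and an unanchored search (re.search) for bug* to show that it hits more than one might expect.

```python
import re

# Each pattern from the examples, with strings it should and should not match.
cases = {
    r"compan(y|ies)": (["company", "companies"], ["compan"]),
    r"(peter|paul)": (["peter", "paul"], ["mary"]),
    r"[Bb]ag": (["Bag", "bag"], ["BAG"]),
    r"b[aiueo]g": (["bag", "bug", "big"], ["b&g"]),
    r"b.g": (["bag", "b&g"], ["bg"]),
    r"[A-Z][A-Z]*": (["A", "ABC"], ["", "abc"]),
    r"[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]": (["123-45-6789"], ["123-456-789"]),
}

def check(pattern, text):
    """Return True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# For every pattern: one list of results for the "good" strings, one for the "bad" ones.
results = {
    pattern: ([check(pattern, s) for s in good], [check(pattern, s) for s in bad])
    for pattern, (good, bad) in cases.items()
}

# bug* means "bu" followed by zero or more "g", so an unanchored search
# also hits words such as "bus".
bug_hits = [w for w in ["bug", "bugs", "bugfix", "bus", "bag"] if re.search(r"bug*", w)]
print(bug_hits)  # ['bug', 'bugs', 'bugfix', 'bus']
```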

Here is stuff for our UNIX freaks:
(copied from 'man grep')

     \c   A backslash (\) followed by any special character is  a
          one-character  regular expression that matches the spe-
          cial character itself.  The special characters are:

               +    `.', `*', `[',  and  `\'  (period,  asterisk,
                    left  square  bracket, and backslash, respec-
                    tively), which  are  always  special,  except
                    when they appear within square brackets ([]).

               +    `^' (caret or circumflex), which  is  special
                    at the beginning of an entire regular expres-
                    sion, or when it immediately follows the left
                    of a pair of square brackets ([]).

               +    `$' (currency symbol), which is special at  the
                    end of an entire regular expression.

     .    A `.' (period) is a  one-character  regular  expression
          that matches any character except NEWLINE.
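
The difference between a special character and its backslash-escaped form can be seen in a small experiment. This is only a sketch, again assuming Python's re module rather than grep itself; the sample strings are invented.

```python
import re

# Unescaped, "." matches any single character except a newline.
dot_any = [bool(re.fullmatch(r"b.g", s)) for s in ["bag", "b.g", "b\ng"]]

# Escaped with a backslash, "\." matches only a literal period.
dot_literal = [bool(re.fullmatch(r"b\.g", s)) for s in ["bag", "b.g"]]

# The same applies to the other special characters: "*", "[" and "\" itself.
star_literal = bool(re.fullmatch(r"a\*b", "a*b"))
bracket_literal = bool(re.fullmatch(r"\[x", "[x"))
backslash_literal = bool(re.fullmatch(r"a\\b", "a\\b"))

print(dot_any)      # [True, True, False]
print(dot_literal)  # [False, True]
```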
 
     [string]
          A non-empty string of  characters  enclosed  in  square
          brackets  is  a  one-character  regular expression that
          matches any one character in that string.  If, however,
          the  first  character of the string is a `^' (a circum-
          flex or caret), the  one-character  regular  expression
          matches  any character except NEWLINE and the remaining
          characters in the string.  The  `^'  has  this  special
          meaning only if it occurs first in the string.  The `-'
          (minus) may be used to indicate a range of  consecutive
          ASCII  characters;  for example, [0-9] is equivalent to
          [0123456789].  The `-' loses this special meaning if it
          occurs  first (after an initial `^', if any) or last in
          the string.  The `]' (right square  bracket)  does  not
          terminate  such a string when it is the first character
          within it (after an initial  `^',  if  any);  that  is,
          []a-f]  matches either `]' (a right square bracket) or
          one of the letters a through  f  inclusive.   The  four
          characters  `.', `*', `[', and `\' stand for themselves
          within such a string of characters.
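
The bracket rules above can be checked one character at a time. The sketch below assumes Python's re module, which agrees with the description for the cases shown but is not identical to grep in every respect (for example, a backslash stays special inside brackets in Python). The helper name matches is hypothetical.

```python
import re

def matches(pattern, chars):
    """Return the characters from chars that the one-character pattern matches."""
    return [c for c in chars if re.fullmatch(pattern, c)]

# A range: [0-9] behaves like [0123456789].
digits = matches(r"[0-9]", "a5-9")
same_digits = matches(r"[0123456789]", "a5-9")

# A leading "^" negates the set.
non_digits = matches(r"[^0-9]", "a5-9")

# A "-" placed last is an ordinary character.
with_minus = matches(r"[0-9-]", "a5-9")

# A "]" placed first does not close the set.
bracket_or_a_to_f = matches(r"[]a-f]", "]cgz")

# "." and "*" stand for themselves inside brackets.
specials = matches(r"[.*]", ".*x")

print(digits, non_digits, with_minus, bracket_or_a_to_f, specials)
```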

     The following rules may be used to construct regular expres-
     sions:

     *    A one-character regular expression followed by `*'  (an
          asterisk)  is a regular expression that matches zero or
          more occurrences of the one-character  regular  expres-
          sion.   If  there  is  any choice, the longest leftmost
          string that permits a match is chosen.

     ^    A circumflex or caret (^) at the beginning of an entire
          regular  expression  constrains that regular expression
          to match an initial segment of a line.

     $    A currency symbol ($) at the end of an  entire  regular
          expression  constrains that regular expression to match
          a final segment of a line.
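
A short illustration of the three rules above, assuming Python's re module and invented sample lines. Python picks matches by backtracking rather than by a strict longest-leftmost rule, but for a single `*' the result is the same.

```python
import re

# "*" takes as many repetitions as it can at the leftmost starting position.
longest = re.search(r"ab*", "xabbbby").group()

# Zero occurrences are allowed, so "ab*" also matches a bare "a".
zero = re.search(r"ab*", "xay").group()

lines = ["error: disk full", "no error", "disk error"]

# "^" anchors the match to the start of the line.
starts = [line for line in lines if re.search(r"^error", line)]

# "$" anchors the match to the end of the line.
ends = [line for line in lines if re.search(r"error$", line)]

print(longest, zero)  # abbbb a
print(starts)         # ['error: disk full']
print(ends)           # ['no error', 'disk error']
```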

     *    A regular expression (not just a
          one-character regular expression) followed by
          `*' (an asterisk) is a regular expression that
          matches zero or more occurrences of that
          regular expression.  If there is any choice,
          the longest leftmost string that permits a
          match is chosen.

     +    A regular expression followed by `+' (a plus
          sign) is a regular expression that matches
          one or more occurrences of that regular
          expression.  If there is any choice, the
          longest leftmost string that permits a match
          is chosen.

     ?    A regular expression followed by `?' (a
          question mark) is a regular expression that
          matches zero or one occurrence of that
          regular expression.  If there is any choice,
          the longest leftmost string that permits a
          match is chosen.

     |    Alternation:    two    regular    expressions
          separated  by  `|'  or NEWLINE match either a
          match for  the  first  or  a  match  for  the
          second.

     ()   A regular expression enclosed in  parentheses
          matches a match for the regular expression.
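
The extended operators can be compared side by side. This is a sketch assuming Python's re module, with made-up words as test data; it shows that `+', `?' and `*' can follow a parenthesised group as well as a single character.

```python
import re

# "+" requires at least one occurrence of the preceding expression.
plus = [bool(re.fullmatch(r"go+gle", s)) for s in ["ggle", "gogle", "google"]]

# "?" makes the preceding expression optional.
optional = [bool(re.fullmatch(r"colou?r", s)) for s in ["color", "colour", "colouur"]]

# "|" chooses between alternatives; parentheses limit its scope.
alt = [bool(re.fullmatch(r"compan(y|ies)", s)) for s in ["company", "companies", "companyies"]]

# A closure after parentheses repeats the whole group, not just one character.
grouped = [bool(re.fullmatch(r"(ab)+", s)) for s in ["ab", "abab", "abb"]]

print(plus)      # [False, True, True]
print(optional)  # [True, True, False]
print(alt)       # [True, True, False]
print(grouped)   # [True, True, False]
```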

     The order of precedence of operators at the same parenthesis
     level  is  `[ ]'  (character  classes),  then  `*'  `+'  `?'
     (closures), then concatenation, then `|' (alternation) and
     NEWLINE.
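
The precedence order is easiest to see by comparing a pattern with its explicitly parenthesised variants. The sketch below assumes Python's re module, which follows the same order for these operators; the test strings are invented.

```python
import re

# Closure binds tighter than concatenation: "ab*" means a(b*), not (ab)*.
closure = [bool(re.fullmatch(r"ab*", s)) for s in ["abbb", "abab"]]
grouped_closure = [bool(re.fullmatch(r"(ab)*", s)) for s in ["abbb", "abab"]]

# Concatenation binds tighter than alternation: "ab|cd" means (ab)|(cd), not a(b|c)d.
words = ["ab", "cd", "abd", "acd"]
alternation = [bool(re.fullmatch(r"ab|cd", s)) for s in words]
grouped_alt = [bool(re.fullmatch(r"a(b|c)d", s)) for s in words]

print(closure, grouped_closure)  # [True, False] [False, True]
print(alternation)               # [True, True, False, False]
print(grouped_alt)               # [False, False, True, True]
```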

----- Revision r1.2 - 23 Aug 2000 - 06:58 GMT - PeterThoeny
Copyright © 1999-2003 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.