Regular Expressions

Revision as of 20:26, 31 January 2013 by Lchrisman (Talk | contribs) (Documented \R -- the general newline)


Regular expressions are a concise and powerful, but cryptic, formalism for identifying patterns of text to match. They can be quite useful for parsing text files that have minor variability in their formats. They play a prominent role in several programming languages, most notably Perl and Python.

Starting with release 4.2, Analytica provides very powerful (Perl-compatible) regular expression processing within several of its built-in text functions, notably FindInText, SplitText, and TextReplace. Each of these functions takes a pattern, which is interpreted as a regular expression when you also specify an optional parameter: re:True. For example:

{To find the position of a seven-letter word:}
FindInText("\b\w{7}\b","Now is the time for all good men to come to the aid of their country",re:1) → 62
{Split on any word that contains a repeated letter:}
SplitText("When in the course of human events, it becomes necessary for ...","[^\w]*\b\w*(\w)\w*\1\w*\b[^\w]*",re:1)→
        ["When in the course of human", "it", "", "for ..."]
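
For readers who want to experiment outside Analytica, the first example above can be approximated in Python. This is only a sketch: Python's re module is not the PCRE library that Analytica uses, although this particular pattern behaves the same way in both, and the helper name find_in_text is hypothetical. Analytica positions are 1-based, so the sketch adds 1 to Python's 0-based offset.

```python
import re

def find_in_text(pattern, text):
    """Hypothetical stand-in for FindInText(pattern, text, re:1).

    Returns the 1-based position of the first match, or 0 if there is none.
    """
    m = re.search(pattern, text)
    return m.start() + 1 if m else 0

sentence = "Now is the time for all good men to come to the aid of their country"
position = find_in_text(r"\b\w{7}\b", sentence)
print(position)  # 62
```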

Basics of Regular Expressions

Regular expressions consist of basic uninterpreted characters (such as the letters and digits), and several special characters that are interpreted. A simple sequence of non-special characters, like "this", is a simple regular expression that matches when that precise sequence of characters occurs anywhere within the subject text.

The power of regular expressions comes from the special sequences that can be used to specify large classes of matching patterns. For example, the dot character means match any character, so the regular expression "t..s" matches anywhere a "t" is followed by any two characters and then by "s", such as "this", "ttts", "t as", etc. If you want to match only a literal dot in the subject text, precede the dot with a backslash, e.g., "t.\." matches "th." and "ts." but not "ths". In general, to use any of the special characters as a literal, precede it with a backslash.
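
The dot and the escaped dot can be tried out with a short Python sketch. Python's re module stands in for Analytica's matcher here, which is an assumption of the example; for these simple patterns the two behave alike.

```python
import re

# "t..s": a "t", any two characters, then an "s".
for subject in ["this", "ttts", "t as"]:
    assert re.search(r"t..s", subject)

# "t.\.": a "t", any one character, then a literal dot.
escaped_dot = re.compile(r"t.\.")
matches = {s: bool(escaped_dot.search(s)) for s in ["th.", "ts.", "ths"]}
print(matches)  # {'th.': True, 'ts.': True, 'ths': False}
```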

Other special characters are these:

  \        general escape character with several uses
  ^        assert start of string (or line, in multiline mode)
  $        assert end of string (or line, in multiline mode)
  .        match any character except newline (by default)
  [        start character class definition
  |        start of alternative branch
  (        start subpattern
  )        end subpattern
  ?        extends the meaning of (
           also 0 or 1 quantifier
           also quantifier minimizer
  *        0 or more quantifier
  +        1 or more quantifier
           also "possessive quantifier"
  {        start min/max quantifier
  \Q...\E  Treat all characters between \Q and \E as literals

Part of a pattern that is in square brackets is called a "character class". In a character class the only metacharacters are:

  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX syntax)
  ]      terminates the character class

You can refer to several non-printing characters using the following sequences:

  \a        alarm, that is, the BEL character (hex 07)
  \cx       "control-x", where x is any character
  \e        escape (hex 1B)
  \f        formfeed (hex 0C)
  \n        linefeed (hex 0A)
  \r        carriage return (hex 0D)
  \t        tab (hex 09)
  \R        any newline character, equivalent to (?>\r\n|\n|\x0b|\f|\r|\x85)
  \ddd      character with octal code ddd, or backreference
  \xhh      character with hex code hh
  \x{hhh..} character with hex code hhh..

Several character groups have special escape sequences, including:

  \w     Match a "word" character (letters, digits, and "_")
  \W     Match a non-"word" character
  \s     Match a whitespace character
  \S     Match a non-whitespace character
  \d     Match a digit character
  \D     Match a non-digit character

And several escape characters match particular points within text that correspond to a position but not to an actual character:

  \b     matches at a word boundary
  \B     matches when not at a word boundary
  \A     matches at the start of the subject
  \Z     matches at the end of the subject
          also matches before a newline at the end of the subject
  \z     matches only at the end of the subject
  \G     matches at the first matching position in the subject

The full specification of regular expression patterns supported is described at Pcre Patterns Man Page.

Multi-line matching

By default, a regular expression can match across multiple lines in the source text. The caret (^) and dollar ($) match only at the very start and very end of the entire text, not at the start or end of each individual line. The \r and \n patterns can, of course, be used to match line breaks. In most cases within Analytica, lines are terminated with \r, but if you aren't sure which type of line break you are dealing with, you can always use (?:\r\n|\r|\n), or the \R escape listed above.
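
As a minimal sketch of handling an unknown line-break convention, the following Python splits text on CRLF, CR or LF. Python's re module is used as a stand-in for Analytica's matcher, and the sample text is invented for illustration.

```python
import re

# The two-character alternative comes first, so that CRLF counts as one break, not two.
LINE_BREAK = re.compile(r"\r\n|\r|\n")

mixed = "first\rsecond\r\nthird\nfourth"
lines = LINE_BREAK.split(mixed)
print(lines)  # ['first', 'second', 'third', 'fourth']
```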

You can instruct the matcher to operate in multi-line mode, in which the text is treated as a sequence of separate lines. In this mode, caret (^) matches at the start of each line and dollar ($) matches at the end of each line. To use this mode, begin the regular expression with (?m).
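
The effect of the (?m) prefix can be seen in this Python sketch. One assumption to keep in mind: Python's re module treats only LF (\n) as a line break in this mode, so the sample text uses LF, whereas text inside Analytica usually uses CR.

```python
import re

text = "alpha one\nbeta two\ngamma three"

# Default mode: ^ matches only at the very start of the text.
default_hits = re.findall(r"^\w+", text)

# Multi-line mode: ^ matches at the start of every line.
multiline_hits = re.findall(r"(?m)^\w+", text)

# Multi-line mode: $ matches at the end of every line.
line_ends = re.findall(r"(?m)\w+$", text)

print(default_hits)    # ['alpha']
print(multiline_hits)  # ['alpha', 'beta', 'gamma']
print(line_ends)       # ['one', 'two', 'three']
```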

In theory (according to the PCRE library documentation), you should be able to control which newline character combinations are recognized as the beginning and end of a line. To indicate that any newline character combination should be recognized, start the regular expression with (*ANY), as in: "(*ANY)^\w\d{5}" (which would match a line within the text beginning with a word character followed by 5 digits). The (*ANY) prefix considers any standard new-line combination (CR, LF, CRLF) to denote a line break. We haven't seen this work in practice, so it may not actually have an effect.

Three conventions exist for new lines in text file formats. CR was the standard on the classic Mac OS. LF is standard on Unix. CRLF (two characters) is the standard in Windows. Analytica's functions like ReadTextFile typically convert to just CR. Excel on Windows (and in CSV files) may use CR for new rows and LF for new lines within a single cell. So, depending on where your data comes from, you may sometimes want to use multi-line mode with only a particular new-line character or combination recognized. The (*ANY) prefix recognizes any of these standard conventions as denoting a newline. (*CR) recognizes only CR, (*LF) recognizes only LF, and (*CRLF) recognizes only the CRLF combination. Note that each of these is a prefix that belongs at the very start of the regular expression; the character combination (*CR) would not appear elsewhere within it. In the PCRE documentation these prefixes only select the newline convention, so if ^ and $ still match only at the ends of the whole text, try adding (?m) after the prefix, as in "(*CR)(?m)^\w+".

Finding Patterns in Text

The FindInText function, with several optional parameters, can be used to find patterns in text.

FindInText(pattern, text, caseInsensitive, re, return, subPattern)

  • pattern: the regular expression
  • text: the subject text being searched
  • caseInsensitive: When set to 1, matches 'a' to 'A', etc. Matches are case-sensitive by default.
  • re: Must be non-zero for pattern to be interpreted as a regular expression.
  • return: Specifies what information should be returned, as follows:
    • 'P' (or 'Position'): The position in the subject text where the matched pattern was found, or zero if not found.
    • 'L' (or 'Length'): The length of the match in the subject text.
    • 'S' (or 'SubPattern'): The subtext matched by the pattern
    • '#' (or '#SubPatterns'): The number of subpatterns in the regular expression.
  • subPattern: Which subpattern to return information on. See below.

When using FindInText, you have four different options for what information can be returned. By default, the position of the match (or zero if there is no match) is returned, but alternatively you can have it return the length of the match or the actual text that was matched. For example:

FindInText("[an]+", "A banana in a cabana", re:1, return:'S') → "anana"

If you want to obtain multiple items of information (such as the position, length and matching text) all at the same time, without repeating the match, pass an array to the return parameter.
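
The idea of collecting several pieces of information from a single search can be sketched in Python, where one match object supplies all of them. The helper name find_details and the dictionary keys are hypothetical, and the values returned for a failed match are placeholders chosen for the sketch, not documented Analytica behavior.

```python
import re

def find_details(pattern, text):
    """Hypothetical helper: 1-based position, length and matched text from one search."""
    m = re.search(pattern, text)
    if m is None:
        return {"P": 0, "L": 0, "S": ""}  # placeholder values for "not found"
    return {"P": m.start() + 1, "L": len(m.group()), "S": m.group()}

details = find_details(r"[an]+", "A banana in a cabana")
print(details)  # {'P': 4, 'L': 5, 'S': 'anana'}
```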

Subpatterns

You can group subpatterns in a regular expression using parentheses. You can then extract the text matched by a particular subpattern by specifying which one you are interested in with the subPattern parameter. The zeroth subpattern always corresponds to the full pattern, and from there grouped expressions are numbered in depth-first order, that is, in the order in which their opening parentheses appear from left to right. You can also use parentheses to form a group whose contents are not retained by writing (?:...).

For example:

Index I := 0..4;
FindInText("([\w_]+)\s*:\s*((\d*,){4})(\d*),", "NodeInfo: 1,1,1,1,1,1,0,,0,", re:1, return:'S', subPattern:I)
→
0 "NodeInfo: 1,1,1,1,1,"
1 "NodeInfo"
2 "1,1,1,1,"
3 "1,"
4 "1"

You can see here that subPattern:4 in this example extracts the 5th number in the comma-separated list.
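
The same numbering can be checked with Python's re module, used here as a stand-in for Analytica's matcher (an assumption of the sketch). Group 0 is the full match, and a group inside a repetition, such as group 3, holds only its last iteration.

```python
import re

pattern = r"([\w_]+)\s*:\s*((\d*,){4})(\d*),"
m = re.search(pattern, "NodeInfo: 1,1,1,1,1,1,0,,0,")

# Collect subpatterns 0 through 4, mirroring the index I := 0..4.
groups = [m.group(i) for i in range(5)]
for number, matched_text in enumerate(groups):
    print(number, repr(matched_text))
```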

To figure out how many subPatterns are present, you can set the return parameter to '#'. If return contains only '#' (i.e., it isn't an array with other 'P', 'L' or 'S' elements), it will determine the number of subPatterns in the regular expression without actually executing a matching search. Thus, if you wanted to pass an index to subPattern, you can figure out how long to make the index before executing the match.
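
In Python the analogous count is available on a compiled pattern, without running a search. The sketch below is an illustration of the idea rather than of Analytica's implementation.

```python
import re

compiled = re.compile(r"([\w_]+)\s*:\s*((\d*,){4})(\d*),")

# Number of capturing groups, determined at compile time (no matching performed).
subpattern_count = compiled.groups

# An index covering the full match (0) plus every subpattern.
subpattern_index = list(range(subpattern_count + 1))
print(subpattern_count, subpattern_index)  # 4 [0, 1, 2, 3, 4]
```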

There can be many groupings, and the number and order of groups may change as you debug your regular expression, so using numbered subpatterns is not always the best choice. You can instead use named subpatterns. The syntax for naming a group is: (?<name>...), or (?'name'...) or (?P<name>...). When you have named a subpattern, you can extract its value by passing the textual name to the subPattern parameter.

FindInText("([\w_]+)\s*:\s*((\d*,){4})(?<border>\d*),", 
               "NodeInfo: 1,1,1,1,1,1,0,,0,", re:1, 
               return:'S', subPattern:'border') → "1"
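
For comparison, here is the same named subpattern in Python. Note the assumption: Python's re module accepts only the (?P<name>...) spelling of the three syntaxes listed above.

```python
import re

pattern = r"([\w_]+)\s*:\s*((\d*,){4})(?P<border>\d*),"
m = re.search(pattern, "NodeInfo: 1,1,1,1,1,1,0,,0,")

border = m.group("border")
print(border)  # 1

# A named group is still numbered, so it can be reached either way.
same_by_number = m.group(4)
```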

Duplicate Subpatterns

Cases frequently arise in which a subpattern has two or more alternative syntaxes, so that two subpatterns within the regular expression need the same name; usually these are disjunctive, meaning that only one of them can match. For example, in a standard Excel-compatible CSV format, a cell with no comma or new-line characters does not need to be quoted, but if the cell contains a comma, quotes must be placed around it. For example, a line of a CSV file might be:

San Jose,"1,006,102",10,Chuck Reed,"SJ,San José,SJC"

This CSV entry has 5 items separated by commas, but two items have internal commas and thus are quoted. Thus, each item matches one of two possible regular expressions, either: ([^,]+) or "(.*?)". Notice that the parentheses in the second case do not include the quotes, since we do not wish to include them in the subpattern. To match either, we form a disjunction, but since both branches refer to the same item, we name both with the same subpattern name:

(?:"(?<city>.+?)"|(?<city>[^,]+?)),\s*(?:"(?<pop>.+?)"|(?<pop>[^,]+))

Because the two subpatterns named city are disjunctive, only one of them will match. So, when you request the subpattern "city", you'll get the one which matched (the second in the example). Similarly, only one "pop" subpattern will match, in this example the first, so you'll get info for the one that actually matched.
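
Python's re module does not accept duplicate group names, so the following sketch emulates the idea with distinct names (city_q for the quoted branch, city_u for the unquoted one, and likewise for pop) and a hypothetical helper, first_matched, that returns whichever branch took part in the match. Each pair of alternatives is wrapped in a non-capturing group (?:...) so that the | applies only to that item.

```python
import re

CSV_PREFIX = re.compile(
    r'(?:"(?P<city_q>.+?)"|(?P<city_u>[^,]+?)),\s*'
    r'(?:"(?P<pop_q>.+?)"|(?P<pop_u>[^,]+))'
)

def first_matched(m, *names):
    """Return the text of whichever named group took part in the match."""
    for name in names:
        if m.group(name) is not None:
            return m.group(name)
    return None

line = 'San Jose,"1,006,102",10,Chuck Reed,"SJ,San José,SJC"'
m = CSV_PREFIX.match(line)
city = first_matched(m, "city_q", "city_u")
pop = first_matched(m, "pop_q", "pop_u")
print(city, "|", pop)  # San Jose | 1,006,102
```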

You could have multiple matches to a subpattern (either named or numbered), as occurs with the regular expression "b(a)*c" applied to "dbaaacd". There is a limitation here in that you can only get the data for one of the repeated matches, the last one.

Splitting on a Pattern

You can provide a regular expression as the separator to the SplitText function. This makes it possible to split text into parts in such a way that allows multiple types of separators, variable length separators, or uncertainty about what the separator will be.

For example, to split on any punctuation character:

 SplitText( text, "[\.\?,!]", re:1 )

Or to split on any run of whitespace, so that consecutive spaces don't produce empty items:

 SplitText(text, "\s+", re:1 )

Notice that the parameter re:1 must be specified to cause the separator to be interpreted as a regular expression.
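
Both separators can be tried with Python's re.split, used here as an approximation of SplitText (an assumption of the sketch; details such as whether a trailing empty item is returned may differ in Analytica). The sample texts are invented for illustration.

```python
import re

sentence = "Wait! Is it ready? Yes, it is."

# Split on any one of the listed punctuation characters.
pieces = re.split(r"[.?,!]", sentence)
print(pieces)  # ['Wait', ' Is it ready', ' Yes', ' it is', '']

# Split on runs of whitespace, so repeated spaces and tabs count as one separator.
words = re.split(r"\s+", "one   two \t three")
print(words)  # ['one', 'two', 'three']
```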

Substitutions

The TextReplace function accepts a regular expression as its pattern when the re:1 parameter is specified.

TextReplace(text,pattern,subst,all,caseInsensitive,re)

  • text : the subject text being matched to
  • pattern: The regular expression
  • subst : the text to be substituted for the subtext that matches pattern
  • all : 0=replace only the first occurrence (default), 1=replace every occurrence
  • caseInsensitive: 1='A' matches 'a', etc. CaseSensitive by default.
  • re: Must be set to 1 for regular expressions

It is recommended that you use a named-parameter calling syntax for the optional parameters. Here are some examples:

TextReplace("3.141592654", "1|5|9", "0", re:1 ) → "3.041592654"
TextReplace("3.141592654", "1|5|9", "0", re:1, all:1 ) → "3.040002604"
TextReplace("3.141592654", "(1|5|9)+", "0", re:1, all:1 ) → "3.0402604"
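
The difference between replacing the first match, every match, and every run of matches can be sketched with Python's re.sub on an invented subject text. One caveat about the stand-in: re.sub replaces every occurrence unless count is given, which is the opposite of the default for all described above.

```python
import re

subject = "banana bread"

# First occurrence only (comparable to leaving all at its default).
first_only = re.sub(r"a|e", "*", subject, count=1)

# Every occurrence (comparable to all:1).
every = re.sub(r"a|e", "*", subject)

# Every run of consecutive matches collapses to a single replacement.
runs = re.sub(r"(?:a|e)+", "*", subject)

print(first_only)  # b*nana bread
print(every)       # b*n*n* br**d
print(runs)        # b*n*n* br*d
```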

SubPattern Substitutions

When regular expressions are used, the subst parameter may refer to subPattern groupings that appear in the pattern parameter. The matching text for those is substituted accordingly. \0 denotes the full text matched by the full regular expression, \1 is the first subpattern, \2 the second, up to \9.

You can also refer to named subpatterns using <name> in the subst parameter. Again, the subtext matching the corresponding named subpattern is substituted. Some examples:

TextReplace("3.141592", "(\d)", "\1\1", re:1 ) → "33.141592"
TextReplace("3.141592", "(\d)", "\1\1", re:1, all:1 ) → "33.114411559922"
TextReplace("time", "(.)(.)(.)(.)", "\4\3\2\1", re:1, all:1 ) → "emit"
TextReplace("543,632","(?<x>\d+),(?<y>\d+)", "<y>,<x>", re:1, all:1) → "632,543"
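
The examples with all:1 have close Python counterparts, shown here as a sketch. The replacement syntax is an assumption specific to Python: numbered references are written \1 to \9 as above, but a named subpattern is written \g<name> rather than <name>.

```python
import re

# Double every digit.
doubled = re.sub(r"(\d)", r"\1\1", "3.141592")

# Reverse four characters by reordering the numbered subpatterns.
reversed_word = re.sub(r"(.)(.)(.)(.)", r"\4\3\2\1", "time")

# Swap two named subpatterns.
swapped = re.sub(r"(?P<x>\d+),(?P<y>\d+)", r"\g<y>,\g<x>", "543,632")

print(doubled, reversed_word, swapped)  # 33.114411559922 emit 632,543
```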

Credits

Analytica makes use of the Perl Compatible Regular Expression library, written by Philip Hazel (email: ph10 at cam.ac.uk) of the University of Cambridge Computing Service, Cambridge, England. Copyright (c) 1997-2008 University of Cambridge All rights reserved.

The library is included in Analytica under the "BSD" license published with the PCRE release 7.8 distributable.
