Regular Expression Basics
Regular expressions, sometimes referred to as regex, grep, or pattern matching, can be a very powerful tool and a tremendous time-saver with a broad range of applications. As an extended form of find-and-replace, a regular expression lets you do things such as perform client-side validation of email addresses and phone numbers, search multiple documents for strings and patterns you wish to change or remove, or extract a list of links from source code. Regex is supported by most languages and tools, but because implementations vary, this article covers the basic principles that are commonly shared.
Literals and Metacharacters
If you've seen a regular expression before and thought it looked like alien space-algebra, it does, but have no fear - you'll be fluent in alien space-algebra in no time! To make the most of the power of regex, you need to be familiar with a few classifications of characters. Literals are normal text characters and can include whitespace (tabs, spaces, newlines, etc.). Unless modified by a metacharacter, a literal will match itself on a one-for-one basis. Metacharacters are characters with a special meaning; their power lies in how they are arranged and interpreted as wildcards, and they can be used in nested groups or other combinations. To find a literal instance of a metacharacter, for instance a caret (^) or a backslash, escape it with a backslash (\).
Below is a list of some metacharacters and character classes for a quick glance - each will be explained in further detail with examples. Keep in mind that a "match" can be as simple as a single character or as complex as a sequence of literals and metacharacters in nested and compounded combinations.
Metacharacter | Match |
---|---|
\ | the escape character - used to find an instance of a metacharacter like a period, brackets, etc. |
. (period) | match any character except newline |
x | match any instance of x |
[^x] | match any character except x |
[x] | match any instance of x in the bracketed range - [abxyz] will match any instance of a, b, x, y, or z |
(pipe) | an OR operator, written as a vertical bar - placed between x and y, it will match an instance of x or y |
() | used to group sequences of characters or matches |
{} | used to define numeric quantifiers |
{x} | match must occur exactly x times |
{x,} | match must occur at least x times |
{x,y} | match must occur at least x times, but no more than y times |
? | preceding match is optional or one only, same as {0,1} |
* | find 0 or more of preceding match, same as {0,} |
+ | find 1 or more of preceding match, same as {1,} |
^ | match the beginning of the line |
$ | match the end of a line |
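
The first few rows of the table can be tried out directly. The following is a minimal sketch using Python's built-in `re` module, which is only one of many regex implementations and may differ in details from the tool you use; the sample text and variable names are made up for this illustration.

```python
import re

# Made-up sample text for this illustration.
text = "cat cot cut c.t c^t"

# A literal matches itself one-for-one.
literal = re.findall(r"cat", text)

# The period matches any single character except a newline.
any_char = re.findall(r"c.t", text)

# Escaping a metacharacter with a backslash finds the character itself.
escaped_dot = re.findall(r"c\.t", text)
escaped_caret = re.findall(r"c\^t", text)

# A bracketed class matches any one of the listed characters.
in_class = re.findall(r"c[ao]t", text)

# A caret at the start of a class matches any character NOT listed.
not_in_class = re.findall(r"c[^ao]t", text)

# The pipe is an OR between two alternatives.
either = re.findall(r"cat|cut", text)

for name, result in [
    ("literal", literal),
    ("any_char", any_char),
    ("escaped_dot", escaped_dot),
    ("escaped_caret", escaped_caret),
    ("in_class", in_class),
    ("not_in_class", not_in_class),
    ("either", either),
]:
    print(f"{name:14} {result}")
```

Note how `c.t` finds all five samples while `c\.t` finds only the one containing an actual period.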
Detailed descriptions of regex operators
Within these descriptions, x is used as a placeholder for examples - x can be an actual x or it can be an entire sequence like href="http://www.evolt.org", <DIV>, or ((\.\.)?/[a-z]+\.jpg).
. - Matches any one character except newline and is generally used with quantifiers, which will be explained below. For instance, .{3} would find any run of three characters, such as a three-letter word.
x - Matches any instance of x and can include specific character sets or ranges, for instance, [wxyz] would match any instance of w, x, y, or z, but not wz, yx, or other combinations of the given character set, unless it was followed by a quantifier.
[^x] - Matches any character that is not x and can also be used with a range. For example, <[^abel]+> would match one or more characters that are not a, b, e, or l, and which are surrounded by < and >, thus it would match <font> but not <table>.
[x] - Matches any character in the given range. Examples of a range would be the expression [0-9], which would find a single digit, or [a-z], which would find a single lower case character. You can combine ranges as well - [A-Za-z0-9] will find a single upper or lower case character or digit. Ranges are simply written one after another with no separator, such as [0-35-8], which would find any digit that isn't 4 or 9. A comma or space inside the brackets is treated as one more literal character to match.
| - The OR operator can be used at the character level or combined in sequences. x|y will find instances of x or y and you aren't limited to just two objects - w|x|y|z is perfectly valid. Inside square brackets the pipe is just another literal character, so use parentheses rather than brackets to limit the scope of the alternatives.
() - Parentheses are used to group operators much like basic algebra and are also used to delineate a backreference, which is the way you can do replaces with matches. (Backreferences get their own section below). A simple example would look something like: www\.([a-z]+)\.com which will find www.anycharactersathroughzhere.com.
{} - Curly brackets (or braces) are used to define numeric quantifiers, which allow you to specify the optional, minimum, or maximum number of occurrences in the match. x{3} would find exactly 3 occurrences of x. x{3,} matches on at least 3 occurrences of x. x{3,5} matches at least 3 occurrences of x and no more than 5.
? - The preceding match is optional: it may occur once or not at all. An example would be: ((\.\.)?/[a-z]+\.jpg) which matches a path to an image file ending in .jpg and could start with a ../ or just a /. A path starting with ./ or ../../ would not match that expression as a whole, although an unanchored search could still match the part of it that begins at the last ../ or /.
* - Matches the preceding character or group 0 or more times. Note that this is not the same as the use of the ? listed above. z* can match no z, z, or for those readers who have already fallen asleep, zzzzzzzzzzzzzzzzzzzzzzz.
+ - Matches the preceding character or group 1 or more times. In comparison to the previous example, z+ would have to match at least z or zz or zzz and so on.
^ - Used to force a match to the beginning of a line. Note that this is not the same as a character exclusion such as [^xyz], which would match any character that is not x, y, or z. ^Hello would match at the beginning of a line such as Hello Chris and would not match Chris said Hello.
$ - Used to force a match to the end of a line. end$ would match at the end of a line such as This is the end and would not match end this article already!
The various operators and metacharacters listed above are pretty standard across most implementations of regex. POSIX class names and character class shorthands are shortcuts to specify character types like digits, whitespace, and so on.
POSIX (Portable Operating System Interface) classes should be more consistent across languages and applications but there may not be an exact parallel between certain class shorthands and POSIX classes, and either class type may not always be fully supported. If they are supported, POSIX classes can be useful since they have a little more precision when it comes to things like whitespace and other non-alphanumeric characters. Where supported, a POSIX class is written inside a bracketed range, e.g. [[:digit:]].
POSIX Class | Match |
---|---|
[:alnum:] | alphabetic and numeric characters |
[:alpha:] | alphabetic characters |
[:blank:] | space and tab |
[:cntrl:] | control characters |
[:digit:] | digits |
[:graph:] | visible characters (anything except spaces and control characters) |
[:lower:] | lowercase alphabetic characters |
[:print:] | any printable characters |
[:punct:] | punctuation characters |
[:space:] | all whitespace characters (includes [:blank:], newline, carriage return) |
[:upper:] | uppercase alphabetic characters |
[:xdigit:] | digits allowed in a hexadecimal number (i.e. 0-9, a-f, A-F) |
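
Not every implementation understands these class names. Python's built-in `re` module, for example, does not, so the sketch below spells a few of them out as explicit ASCII ranges before compiling. The dictionary `POSIX_EQUIVALENTS`, the helper `expand_posix`, and the sample strings are hypothetical names and data invented for this illustration, and the ranges cover ASCII only.

```python
import re

# ASCII-only stand-ins for a few POSIX classes. The dictionary name and the
# selection of classes are choices made for this sketch, not a standard API.
POSIX_EQUIVALENTS = {
    "[:alpha:]": "A-Za-z",
    "[:digit:]": "0-9",
    "[:alnum:]": "A-Za-z0-9",
    "[:upper:]": "A-Z",
    "[:lower:]": "a-z",
    "[:blank:]": " \\t",
    "[:xdigit:]": "0-9A-Fa-f",
}


def expand_posix(pattern):
    """Replace POSIX class names in a pattern with explicit ranges."""
    for name, ranges in POSIX_EQUIVALENTS.items():
        pattern = pattern.replace(name, ranges)
    return pattern


# "#[[:xdigit:]]{6}" becomes "#[0-9A-Fa-f]{6}" before it is handed to re.
hex_color = re.compile(expand_posix("#[[:xdigit:]]{6}"))
colors = hex_color.findall("color: #c3d4ff; size: +5")
print(expand_posix("#[[:xdigit:]]{6}"))
print(colors)
```

A plain text substitution like this is deliberately simple; it assumes the class names only ever appear inside a bracketed range.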
Character class | Match |
---|---|
\d | matches a digit, same as [0-9] |
\D | matches a non-digit, same as [^0-9] |
\s | matches a whitespace character (space, tab, newline, etc.) |
\S | matches a non-whitespace character |
\w | matches a word character (a letter, digit, or underscore) |
\W | matches a non-word character |
\b | matches a word-boundary (NOTE: within a class, matches a backspace) |
\B | matches a non-word-boundary |
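
The quantifiers, anchors, and shorthand classes described above can be checked with a few short tests. This is a sketch in Python's `re` module under the assumption that it behaves like the implementations discussed here for these basic operators; the helper `matches_whole`, the pattern colou?r, and the sample strings other than the Hello and end lines are made up for the illustration.

```python
import re


def matches_whole(pattern, text):
    """Return True if the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None


# Numeric quantifiers: x{3,5} accepts three to five x characters.
# Tested against strings of 2, 3, 4, 5 and 6 x's.
counts = [matches_whole(r"x{3,5}", "x" * n) for n in range(2, 7)]

# ? makes the preceding match optional; * allows zero; + needs at least one.
optional = [matches_whole(r"colou?r", w) for w in ("color", "colour")]
star_on_empty = matches_whole(r"z*", "")
plus_on_empty = matches_whole(r"z+", "")

# Anchors tie a match to the start (^) or end ($) of a line.
starts = [bool(re.search(r"^Hello", s))
          for s in ("Hello Chris", "Chris said Hello")]
ends = [bool(re.search(r"end$", s))
        for s in ("This is the end", "end this article already!")]

# Shorthand classes: \d for digits, \w for word characters,
# \b for a word boundary.
digits = re.findall(r"\d+", "Call 555-1234 ext. 42")
words = re.findall(r"\w+", "alien space-algebra")
whole_cat = re.findall(r"\bcat\b", "cat concat cats cat")

print(counts, optional, star_on_empty, plus_on_empty)
print(starts, ends)
print(digits, words, whole_cat)
```

The `\bcat\b` search skips concat and cats because neither has a word boundary on both sides of cat.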
Think dif{2}erently
Many Macintosh applications can easily handle regular expressions, but that's not what I mean here. The philosophy of regex is one of surgical precision and extreme logic, and you have to play by the rules. Like doing a complex database query, you have to know exactly what you want and exactly how to get it or you'll end up with either way more data than you need or not enough information. The concepts of AND, OR, wildcards, and the liberal use of parentheses all come into play with regex. You have to carefully create an expression that meets your needs but is neither too restrictive nor too inclusive or the dark side of regular expressions will rear its ugly head.
A warning about "greediness"
With true power comes an unhealthy dose of greed. Regular expressions are very greedy. They may seem nice and friendly, but they'll take all they can get. What this means is that a quantifier such as * or + will match as much as it can instead of stopping at the earliest possible match. It assumes you want the "whole thing", which is why you need to create a surgical strike of an expression. You can take care of a broken toe by amputating above the knee, but then where does that leave you? (Hopping mad, probably).
A great example of regex greediness is the expression:
<a href=".*">.*</a>
At first glance, it appears this expression will find an anchor tag (having no extra attributes) with an href containing just about any URL, followed by ">, then anything in the link text, then the closing </a>. You could use this to get a list of all the links in a web page. Sounds useful and looks mostly harmless, right? What you end up with is something like this:
<a href="http://sample.url.here">Click this!</a>. Some text goes <a href="../text.htm">here</a>. Maybe several paragraphs go here. More text goes <a href="/less/is/more.htm">here</a>. Another big block of text, text, and more text. <a href="end.htm">The End</a>
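
This result is easy to reproduce. The sketch below runs the greedy expression over a shortened version of the sample line using Python's `re` module and, for contrast, runs the same search with each .* swapped for a negated character class. The variable names and the shortened sample are choices made for this illustration, and other tools may report matches differently.

```python
import re

# A shortened version of the sample line, kept on one line because
# the period does not match a newline.
page = ('<a href="http://sample.url.here">Click this!</a>. Some text goes '
        '<a href="../text.htm">here</a>. More text goes '
        '<a href="/less/is/more.htm">here</a>. '
        '<a href="end.htm">The End</a>')

# Greedy: each .* grabs as much as it can, so the match runs from the
# first <a href=" all the way to the last </a>.
greedy = re.findall(r'<a href=".*">.*</a>', page)

# Narrowed: [^"]+ cannot run past the closing quote and [^<]+ cannot run
# past the next tag, so each link is matched on its own.
narrow = re.findall(r'<a href="[^"]+">[^<]+</a>', page)

print(len(greedy), "greedy match, covering the whole line:", greedy[0] == page)
print(len(narrow), "narrow matches:")
for link in narrow:
    print("  ", link)
```

The greedy search returns a single match equal to the entire line, while the narrowed search returns the four links separately.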
The reason you get a whole block of text mixed with links as a single match instead of a simple list containing each link is that the sub-expression .* is where the greed kicks in. The .* really does mean "match anything" so it merrily goes along until it can't match anything else, which carries it up to the very last </a> it can find, grabbing everything in between along the way. It started at the toe and went straight to the thigh, without even thinking about slowing down at the knee.
Here's where we put a splint on the toe instead of amputating the whole leg. Break down the parts of this expression:
<a href="[^"]+">[^<]+</a>
You start with the <a href=" and then you see [^"]+">. If you've been following along with the rules, you know that this means find at least one of any character except a double-quote, then find the first instance of a double-quote, then a >. The same principle applies to the next part - [^<]+</a> finds at least one of any character except a <, then matches the first literal instance of </a>. Search with this expression and you get a nice short list of complete links. Conquer the greed! A clear understanding of the rules of regex and the various operators is paramount, and it will take patience as well as experimentation with your logic to learn to tune an expression to yield exactly what you need.
Backreferences
Using a backreference is how you finally get to witness the real power of regular expressions. Extracting a list of links from a page of source is useful, but nowhere near as useful as being able to do something with that data. Parentheses can be used to "remember" a subexpression, and a backreference in the form of \digit is how you refer to that particular group. Groups are counted by their opening parenthesis from left to right within the expression, so the first group has a backreference of \1, the second has a backreference of \2, and so on. You can use the memory-like functionality of a backreference in a replace string.
A good example of this uses the href expression from above. You can get a list of complete links from some source with the expression <a href="[^"]+">[^<]+</a>. Let's say you need to find all external links on a web site and remove the anchor tags, but leave the link text intact, and we'll assume for this example that none of your local links start with http://. You would add parentheses to your expression like this:
<a href="http://[^"]+">([^<]+)</a>
You would then perform a find with this expression and simply replace with \1. The parentheses "memorize" the link text and the \1 calls it into the replace, leaving you with just the link text, e.g. some text about <a href="http://www.evolt.org">evolt</a> results in some text about evolt.
A more interesting example might be a transposition using more than one backreference. Pretend you have a text list of web site users in the form of LastName, FirstName and you want a list of names in a FirstName LastName format. The expression ([^,]+),\s(.+) would find Spruck, Chris, since ([^,]+), matches one or more characters that aren't a comma, followed by a comma, then \s matches a space, then (.+) finds one or more characters again. Notice where I placed both sets of parentheses. To change Spruck, Chris to the preferred format, you would replace that with \2 \1 (the two backreferences separated by a literal space, since \s is a search shorthand and is generally not understood in a replace string), yielding Chris Spruck.
When you're doing replaces, it's very important that you test your expressions on backup copies of files, or even a dummy test file of your own creation, so if your expression is off by a parenthesis or something else, you haven't ruined your files permanently. Once you know your expression works on a sample, then go ahead and work on all your files. If you do run an expression that gives you unintended results, you can probably run another one to correct the mishap. Don't ask how I know this.
Sometimes it may also be useful to run more than one expression over the same set of data to make it easier to catch every last bit that you need with a second expression. For instance, you might want to add quotes to all your tag attributes if some are unquoted, then run another expression that somehow operates based on the quotes.
A few practical examples
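
As a hands-on sketch, the two backreference replacements from the previous section and the three examples in this section might be written in Python roughly as follows. The function names and sample strings are hypothetical, and Python's `re` module is only one implementation. Three details are adjustments made for this sketch: the doubled-word pattern ends with \b so that "the then" is not treated as a repeat, its replacement is a space followed by \1 so the space before the word is kept, and the FONT pattern uses [^>]* before the closing > so that a tag with no attributes also matches.

```python
import re


def strip_external_links(html):
    """Replace each http:// link with its link text (group 1)."""
    return re.sub(r'<a href="http://[^"]+">([^<]+)</a>', r'\1', html)


def first_name_first(line):
    """Turn 'LastName, FirstName' into 'FirstName LastName'."""
    return re.sub(r'([^,]+),\s(.+)', r'\2 \1', line)


def find_ips(log):
    """Collect dotted runs of four one-to-three digit numbers."""
    # finditer is used because findall would return only the captured group.
    return [m.group(0) for m in re.finditer(r'(\d{1,3}\.){3}\d{1,3}', log)]


def drop_doubled_words(text):
    """Remove the second copy of an immediately repeated word."""
    return re.sub(r'\s([A-Za-z]+)\s\1\b', r' \1', text)


# Group 6 is the text between the opening and closing font tags.
FONT_TAG = re.compile(
    r"""<(FONT|font)([ ]([a-zA-Z]+)=("|')[^"']+("|'))*[^>]*>([^<]+)(</FONT>|</font>)"""
)


def strip_font_tags(html):
    """Keep only the text that sat inside each font tag."""
    return FONT_TAG.sub(r'\6', html)


print(strip_external_links(
    'some text about <a href="http://www.evolt.org">evolt</a>'))
print(first_name_first('Spruck, Chris'))
print(find_ips('206.159.10.1 GET /a\n10.0.0.7 GET /b'))
print(drop_doubled_words('Rate this article high high, please!'))
print(strip_font_tags('<FONT face="Arial" size="+5">Hello</FONT> world'))
```

Each function is a thin wrapper around `re.sub` or `re.finditer`, so the expressions themselves stay easy to compare with the ones in the text.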
Get a list of IP addresses from a server log:
(\d{1,3}\.){3}\d{1,3} - This expression will find three instances of a one to three digit number followed by a period, then one to three more digits, e.g. 206.159.10.1. It does not check that each number is 255 or less.
Find doubled words in text such as "Rate this article high high, please!":
\s([A-Za-z]+)\s\1 - This expression will match a space, followed by a word of any length (which is captured by the parentheses for a backreference), then a space again. The backreference, \1, then picks up the second instance of the same word. You could then replace the match with a space followed by \1, which will remove the second instance of the word while keeping the space in front of the first.
Remove FONT tags from your web pages:
<(FONT|font)([ ]([a-zA-Z]+)=("|')[^"']+("|'))*[^>]*>([^<]+)(</FONT>|</font>)
and replace with the backreference \6 - This expression looks quite complicated, but I wanted to show an example with some more involved logic. A simpler example that finds the same string will follow this explanation. <(FONT|font) accounts for an upper or lower case tag. ([ ]([a-zA-Z]+)= matches a space followed by any attribute name and an =. The next subexpression, ("|')[^"']+("|'), finds the leading double or single quote on the attribute(s), then any attribute value that's not a double or single quote, i.e. Arial, +5, #c3d4ff, etc., then the closing double or single quote. Notice that the subexpression for the entire attribute is enclosed in parentheses and followed by an asterisk - ([ ]([a-zA-Z]+)=("|')[^"']+("|'))*. This allows you to find a tag with either no attributes or any number that may exist. [^>]*> then matches anything else up to the first > (similar to the "greediness" example above). The backreference is defined next as ([^<]+), which will capture any text between the opening and closing font tags, and is referred to as \6 because it's the sixth parenthetical group in the entire expression. Then (</FONT>|</font>) accounts for the closing font tag in either case.
<(FONT|font)[^>]*>[^<]*(</FONT>|</font>) is a simpler example that finds the same strings as the expression explained above. To keep the text in a replace, you would wrap [^<]* in parentheses and replace with \2. The difference is that it is much less picky about what is between the font tags, so if you have inconsistent tag syntax, it will probably capture the various instances you may have. On the other hand, if you have any extra junk characters in your search data, you may catch things you didn't intend, which is why you should test your expressions ahead of time.
A brief history of the 31 Flavors
There are a number of applications and languages that support regular expressions, but unfortunately, not all of them support regex in quite the same way. Although regular expressions had their origins in neurophysiology in the 1940s and were developed by theoretical mathematicians in the 1950s and 1960s, the evolution and subsequent divergence of regex implementations was due to the independent development of various Unix tools such as grep, awk, sed, Emacs, and others. [1]
Today, it's probably safe to say that Perl has the most robust regex engine in common use. Other languages and applications that have some form of regex support or pattern matching (and this by no means is a complete list) include: JavaScript, VBScript, PHP, Python, Tcl, Java, C, Macromedia Dreamweaver/Ultradev, ColdFusion and ColdFusion Studio, BBEdit, NoteTabPro, TextPad, UltraEdit, the XML Schema and XPath Recommendations, the various Unix tools used for text processing and their clones, and just about any modern application with a Find function.
Conclusion
Regular expressions are a powerful tool to keep in your web belt. They can appear daunting, but by learning a few simple rules, you can save yourself hours of doing manual find-and-replaces the slow, boring way.
I'll close with what may be the world's first (and undoubtedly the world's worst) regular expression joke:
What did one regex say to the other?
.+
Other Resources
[1] Mastering Regular Expressions - Friedl
www.regexlib.com
www.webreference.com/js/column5/
All the regular expressions in this article were tested using ColdFusion Studio 4.5.2, so you may encounter slight differences in different applications or languages. Thanks to Sean Palmer for some expression testing.