Regular Expression Basics

Posted on 22 Mar 2002

in Site Development

by Chris Spruck (sprocket)

Regular expressions, sometimes referred to as regex, grep, or pattern matching, are a powerful tool and a tremendous time-saver with a broad range of applications. They work as an extended form of find-and-replace: you can use a regular expression to perform client-side validation of email addresses and phone numbers, search multiple documents for strings and patterns you wish to change or remove, or extract a list of links from source code. Regex is supported by most languages and tools, but because implementations vary, this article covers the basic principles that are commonly shared.

Literals and Metacharacters

If you've seen a regular expression before and thought it looked like alien space-algebra, it does, but have no fear - you'll be fluent in alien space-algebra in no time! To make the most of the power of regex, you need to be familiar with a few classifications of characters. Literals are normal text characters and can include whitespace (tabs, spaces, newlines, etc.). Unless modified by a metacharacter, a literal will match itself on a one-for-one basis. Metacharacters' power lies in how they are arranged and interpreted as wildcards. Metacharacters can be escaped with a backslash (\) to find instances of themselves, for instance, if you need to find a caret (^) or a backslash, as well as used in nested groups or other combinations.
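
As a rough illustration of escaping, here is a minimal sketch using Python's re module (the choice of Python and the sample string are assumptions made for this example; other tools behave similarly but not always identically):

```python
import re

# A sample string containing two carets and one backslash
# (the doubled backslash in the Python source is a single character).
text = "2^10 is written as 2\\^10 in some notations"

# Unescaped, ^ is an anchor; escaped, it matches a literal caret.
carets = re.findall(r"\^", text)

# A literal backslash is matched by escaping the backslash itself.
backslashes = re.findall(r"\\", text)

print(carets)       # ['^', '^']
print(backslashes)  # ['\\']
```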

Below is a list of some metacharacters and character classes for a quick glance - each will be explained in further detail with examples. Keep in mind that a "match" can be as simple as a single character or as complex as a sequence of literals and metacharacters in nested and compounded combinations.

Metacharacter	Match
\	the escape character - used to find an instance of a metacharacter like a period, brackets, etc.
. (period)	match any character except newline
x	match any instance of x
[^x]	match any character except x
[x]	match any one of the characters in the brackets - [abxyz] will match any instance of a, b, x, y, or z
| (pipe)	an OR operator - x|y will match an instance of x or y
()	used to group sequences of characters or matches
{}	used to define numeric quantifiers
{x}	match must occur exactly x times
{x,}	match must occur at least x times
{x,y}	match must occur at least x times, but no more than y times
?	preceding match is optional or one only, same as {0,1}
*	find 0 or more of preceding match, same as {0,}
+	find 1 or more of preceding match, same as {1,}
^	match the beginning of the line
$	match the end of a line

Detailed descriptions of regex operators

Within these descriptions, x is used as a placeholder for examples - x can be an actual x or it can be an entire sequence like href="http://www.evolt.org", <DIV>, or ((\.\.)?/[a-z]+\.jpg).

. - Matches any one character except newline and is generally used with quantifiers, which will be explained below. For instance, .{3} would match any three consecutive characters, whether or not they are letters or form a word.

x - Matches any instance of x. A character set or range matches one character at a time, for instance, [wxyz] would match any single instance of w, x, y, or z, but not wz, yx, or other combinations of the given character set as a unit, unless it was followed by a quantifier.

[^x] - Matches any character that is not x and can also be used with a set or range. For example, <[^abel]+> would match one or more characters that are not a, b, e, or l, and which are surrounded by < and >, thus it would match <font> but not <table>.

[x] - Matches any character in the given range. Examples of a range would be the expression [0-9], which would find a single digit, or [a-z], which would find a single lower case character. You can combine ranges as well - [A-Za-z0-9] will find a single upper or lower case character or digit. Ranges are combined simply by writing them next to each other, such as [0-35-8], which would find any digit that isn't 4 or 9. Don't separate them with commas or spaces, since those would be treated as literal characters in the set.
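
The set, negated-set, and range behavior described above can be checked with a short sketch in Python's re module (Python is an assumption here, as are the sample strings):

```python
import re

# A set matches one character at a time, so "wz" yields two matches.
single_chars = re.findall(r"[wxyz]", "wz yx")

# A negated set between angle brackets: <font> qualifies, <table> does not.
tags_found = re.findall(r"<[^abel]+>", "<font> <table>")

# Two ranges written side by side: digits 0-3 and 5-8.
digits_found = re.findall(r"[0-35-8]", "0123456789")

print(single_chars)  # ['w', 'z', 'y', 'x']
print(tags_found)    # ['<font>']
print(digits_found)  # ['0', '1', '2', '3', '5', '6', '7', '8']
```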

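| - The OR operator can be used at the character level or combined in sequences. x|y will find instances of x or y and you aren't limited to just two objects - w|x|y|z is perfectly valid. Use parentheses to limit how far the alternatives reach, as in gr(a|e)y. Inside square brackets a pipe is usually just a literal character, so [x|y] would match x, y, or a pipe.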
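
Here is a small sketch of the OR operator in Python's re module (an assumed environment; the sample strings are made up). It also shows that, in this engine, a pipe placed inside square brackets is treated as an ordinary character:

```python
import re

# Alternation between single characters.
single = re.findall(r"w|x|y|z", "a w b z")

# Alternation limited by parentheses; group(0) is the whole match.
grouped = [m.group(0) for m in re.finditer(r"gr(a|e)y", "gray or grey")]

# Inside brackets the pipe is a literal, so it is matched too.
in_brackets = re.findall(r"[x|y]", "x|y")

print(single)       # ['w', 'z']
print(grouped)      # ['gray', 'grey']
print(in_brackets)  # ['x', '|', 'y']
```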

() - Parentheses are used to group operators much like basic algebra and are also used to delineate a backreference, which is the way you can do replaces with matches. (Backreferences get their own section below.) A simple example would look something like: www\.([a-z]+)\.com which will find www.anycharactersathroughzhere.com.

{} - Curly brackets (or braces) are used to define numeric quantifiers, which allow you to specify the optional, minimum, or maximum number of occurrences in the match. x{3} would find exactly 3 occurrences of x. x{3,} matches at least 3 occurrences of x. x{3,5} matches at least 3 occurrences of x and no more than 5.
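
The three quantifier forms can be compared side by side in a sketch like the following, written for Python's re module (an assumption). The \b word-boundary markers are added only so that each run of x is tested as a whole word:

```python
import re

text = "x xx xxx xxxx xxxxxx"

exactly_three = re.findall(r"\bx{3}\b", text)
at_least_three = re.findall(r"\bx{3,}\b", text)
three_to_five = re.findall(r"\bx{3,5}\b", text)

print(exactly_three)   # ['xxx']
print(at_least_three)  # ['xxx', 'xxxx', 'xxxxxx']
print(three_to_five)   # ['xxx', 'xxxx']
```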

? - The preceding match is optional: it may occur once or not at all. An example would be: ((\.\.)?/[a-z]+\.jpg) which matches a path to an image file ending in .jpg and could start with a ../ or just a /. A path starting with ./ or ../../ would not be matched as a whole by that particular expression, although an unanchored search could still pick up the trailing part of it.

* - Matches the preceding character or group 0 or more times. Note that this is not the same as the use of the ? listed above. z* can match no z, z, or for those readers who have already fallen asleep, zzzzzzzzzzzzzzzzzzzzzzz.

+ - Matches the preceding character or group 1 or more times. In comparison to the previous example, z+ would have to match at least z or zz or zzz and so on.
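
To see ?, *, and + in action, here is a minimal sketch in Python's re module (an assumed environment; the file names are invented). It uses fullmatch so that the whole string has to fit the pattern, which is the situation the descriptions above have in mind:

```python
import re

path_pattern = re.compile(r"(\.\.)?/[a-z]+\.jpg")

# The optional group allows "../" or nothing in front of the slash.
parent_dir = bool(path_pattern.fullmatch("../images.jpg"))
root_dir = bool(path_pattern.fullmatch("/images.jpg"))
current_dir = bool(path_pattern.fullmatch("./images.jpg"))
two_levels_up = bool(path_pattern.fullmatch("../../images.jpg"))

# * accepts zero occurrences, + needs at least one.
star_on_empty = bool(re.fullmatch(r"z*", ""))
plus_on_empty = bool(re.fullmatch(r"z+", ""))
plus_on_zzz = bool(re.fullmatch(r"z+", "zzz"))

print(parent_dir, root_dir, current_dir, two_levels_up)  # True True False False
print(star_on_empty, plus_on_empty, plus_on_zzz)         # True False True
```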

^ - Used to force a match to the beginning of a line. Note that this is not the same as a character exclusion such as [^xyz], which would match any characters that are not x, y, or z. ^Hello would match at the beginning of a line such as Hello Chris and would not match Chris said Hello.

$ - Used to force a match to the end of a line. The $ goes after the text it anchors, so end$ would match at the end of a line such as This is the end and would not match end this article already!
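
A quick sketch of both anchors using Python's re module (an assumption). One engine-specific detail: in Python, ^ and $ anchor to the start and end of the whole string unless the re.MULTILINE flag is given, in which case they work line by line:

```python
import re

starts_with = bool(re.search(r"^Hello", "Hello Chris"))
not_at_start = bool(re.search(r"^Hello", "Chris said Hello"))
ends_with = bool(re.search(r"end$", "This is the end"))
not_at_end = bool(re.search(r"end$", "end this article already!"))

# With two lines of text, the flag decides whether ^ sees the second line.
two_lines = "Chris said Hello\nHello Chris"
default_mode = re.findall(r"^Hello", two_lines)
multiline_mode = re.findall(r"^Hello", two_lines, re.MULTILINE)

print(starts_with, not_at_start)  # True False
print(ends_with, not_at_end)      # True False
print(default_mode)               # []
print(multiline_mode)             # ['Hello']
```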

The various operators and metacharacters listed above are pretty standard across most implementations of regex. POSIX class names and character class shorthands are shortcuts to specify character types like digits, whitespace, and so on. POSIX (Portable Operating System Interface) classes should be more consistent across languages and applications, but there may not be an exact parallel between certain class shorthands and POSIX classes, and either class type may not always be fully supported. Where they are supported, POSIX classes are written inside a bracketed set, for example [[:digit:]], and can be useful since they have a little more precision when it comes to things like whitespace and other non-alphanumeric characters.

POSIX Class	Match
[:alnum:]	alphabetic and numeric characters
[:alpha:]	alphabetic characters
[:blank:]	space and tab
[:cntrl:]	control characters
[:digit:]	digits
[:graph:]	visible characters (excludes spaces and control characters)
[:lower:]	lowercase alphabetic characters
[:print:]	any printable characters
[:punct:]	punctuation characters
[:space:]	all whitespace characters (includes [:blank:], newline, carriage return)
[:upper:]	uppercase alphabetic characters
[:xdigit:]	digits allowed in a hexadecimal number (i.e. 0-9, a-f, A-F)

Character class	Match
\d	matches a digit, same as [0-9]
\D	matches a non-digit, same as [^0-9]
\s	matches a whitespace character (space, tab, newline, etc.)
\S	matches a non-whitespace character
\w	matches a word character (a letter, digit, or underscore)
\W	matches a non-word character
\b	matches a word-boundary (NOTE: within a class, matches a backspace)
\B	matches a non-word-boundary
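
The shorthand classes can be tried out with a sketch like this one in Python's re module (an assumption; the sample text is invented). Only the backslash shorthands are shown, because Python's re module does not implement the POSIX class names:

```python
import re

text = "Call 555-0199 before 5pm"

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of word characters
pieces = re.split(r"\s+", text)     # split on whitespace

# \b keeps "cat" from matching inside "concat" or "cats".
whole_word = re.findall(r"\bcat\b", "cat concat cats")

print(digits)      # ['555', '0199', '5']
print(words)       # ['Call', '555', '0199', 'before', '5pm']
print(pieces)      # ['Call', '555-0199', 'before', '5pm']
print(whole_word)  # ['cat']
```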

Think dif{2}erently

Many Macintosh applications can easily handle regular expressions, but that's not what I mean here. The philosophy of regex is one of surgical precision and extreme logic, and you have to play by the rules. Like doing a complex database query, you have to know exactly what you want and exactly how to get it or you'll end up with either way more data than you need or not enough information. The concepts of AND, OR, wildcards, and the liberal use of parentheses all come into play with regex. You have to carefully create an expression that meets your needs but is neither too restrictive nor too inclusive or the dark side of regular expressions will rear its ugly head.

A warning about "greediness"

With true power comes an unhealthy dose of greed. Regular expressions are very greedy. They may seem nice and friendly, but they'll take all they can get. What this means is that a quantifier such as * or + will match as much as it can while still letting the overall expression succeed, rather than stopping at the earliest possible match. It assumes you want the "whole thing", which is why you need to create a surgical strike of an expression. You can take care of a broken toe by amputating above the knee, but then where does that leave you? (Hopping mad, probably.)

A great example of regex greediness is the expression:

<a href=".*">.*</a>

At first glance, it appears this expression will find an anchor tag (having no extra attributes) with an href containing just about any URL, followed by ">, then anything in the link text, then the closing </a>. You could use this to get a list of all the links in a web page. Sounds useful and looks mostly harmless, right? What you end up with is something like this:

<a href="http://sample.url.here">Click this!</a>. Some text goes
<a href="../text.htm">here</a>. Maybe several paragraphs go here.
More text goes <a href="/less/is/more.htm">here</a>. Another big
block of text, text, and more text. <a href="end.htm">The End</a>

The reason you get a whole block of text mixed with links as a single match instead of a simple list containing each link is because the sub-expression .* is where the greed kicks in. The .* really does mean "match anything" (other than a newline), so it merrily goes along until it can't match anything else, which matches up to the very last </a> it can find and grabs everything in between along the way. It started at the toe and went straight to the thigh, without even thinking about slowing down at the knee.

Here's where we put a splint on the toe instead of amputating the whole leg. Break down the parts of this expression:

<a href="[^"]+">[^<]+</a>

You start with the <a href=" and then you see [^"]+">. If you've been following along with the rules, you know that this means find at least one of any character except a double-quote, then find the first instance of a double-quote, then a >. The same principle applies to the next part - [^<]+</a> finds at least one of any character except a <, then matches the first literal instance of </a>. Search with this expression and you get a nice short list of complete anchor tags. Conquer the greed! A clear understanding of the rules of regex and the various operators is paramount and it will take patience as well as experimentation with your logic to learn to tune an expression to yield exactly what you need.
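
The difference between the greedy expression and the more surgical one can be sketched in Python's re module (an assumption), using a shortened, single-line version of the sample text from this section:

```python
import re

html = (
    '<a href="http://sample.url.here">Click this!</a>. Some text goes '
    '<a href="../text.htm">here</a>. Maybe several paragraphs go here.'
)

# Greedy: .* runs to the last possible "> and the last possible </a>.
greedy = re.findall(r'<a href=".*">.*</a>', html)

# Negated sets stop at the first closing quote and the first <.
precise = re.findall(r'<a href="[^"]+">[^<]+</a>', html)

print(len(greedy))  # 1 (one oversized match spanning both links)
print(precise)
# ['<a href="http://sample.url.here">Click this!</a>', '<a href="../text.htm">here</a>']
```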

Backreferences

Using a backreference is how you finally get to witness the real power of regular expressions. Extracting a list of links from a page of source is useful, but nowhere near as useful as being able to do something with that data. Parentheses can be used to "remember" a subexpression, and a backreference in the form of \digit is how you refer to that particular group. Parentheses are counted from left to right within the expression, so the group with the first opening parenthesis has a backreference of \1, the second has a backreference of \2, and so on. You can use the memory-like functionality of a backreference in a replace string.

A good example of this uses the href expression from above. You can get a list of complete links from some source with the expression <a href="[^"]+">[^<]+</a>. Let's say you need to find all external links on a web site and remove the anchor tag, but leave the link text intact, and we'll assume for this example that none of your local links start with http://. You would add parentheses to your expression like this:

<a href="http://[^"]+">([^<]+)</a>

You would then perform a find with this expression and simply replace with \1. The parentheses "memorize" the link text and the \1 calls it into the replace, leaving you with just the link text, e.g. some text about <a href="http://www.evolt.org">evolt</a> results in some text about evolt.
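
As a sketch of that replace in Python's re module (an assumption; the local link in the sample is invented to show that it is left alone):

```python
import re

source = (
    'some text about <a href="http://www.evolt.org">evolt</a> '
    'and <a href="/local.htm">a local page</a>'
)

# Group 1 captures the link text; \1 puts it back in place of the whole tag.
pattern = r'<a href="http://[^"]+">([^<]+)</a>'
result = re.sub(pattern, r"\1", source)

print(result)
# some text about evolt and <a href="/local.htm">a local page</a>
```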

A more interesting example might be a transposition using more than one backreference. Pretend you have a text list of web site users in the form of LastName, FirstName and you want a list of names in a FirstName LastName format. The expression ([^,]+),\s(.+) would find Spruck, Chris, since ([^,]+), matches one or more characters that aren't a comma, followed by a comma, then \s matches a space, then (.+) finds one or more characters again. Notice where I placed both sets of parentheses. To change Spruck, Chris to the preferred format, you would replace that with \2 \1 (the second group, a literal space, then the first group), yielding Chris Spruck.
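
The transposition might look like this in Python's re module (an assumption; the second name in the list is a made-up placeholder). The replacement string uses a literal space between the two backreferences:

```python
import re

names = ["Spruck, Chris", "Doe, Jane"]

# Group 1 is the last name, group 2 the first name.
swapped = [re.sub(r"([^,]+),\s(.+)", r"\2 \1", name) for name in names]

print(swapped)  # ['Chris Spruck', 'Jane Doe']
```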

When you're doing replaces, it's very important that you test your expressions on backup copies of files, or even a dummy test file of your own creation, so if your expression is off by a parenthesis or something else, you haven't ruined your files permanently. Once you know your expression works on a sample, then go ahead and work on all your files. If you do run an expression that gives you unintended results, you can probably run another one to correct the mishap. Don't ask how I know this.

Sometimes it may also be useful to run more than one expression over the same set of data to make it easier to catch every last bit that you need with a second expression. For instance, you might want to add quotes to all your tag attributes if some are unquoted, then run another expression that somehow operates based on the quotes.

A few practical examples

Get a list of IP addresses from a server log:

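(\d{1,3}\.){3}\d{1,3} - This expression will find three instances of a one to three digit number followed by a period, then one to three more digits, e.g. 206.159.10.1. It only checks the shape of the address, so an out-of-range value such as 999.999.999.999 would match as well.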
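
Here is a minimal sketch of pulling addresses out of log text with Python's re module (an assumption). The log lines are a hypothetical format invented for this example, and the extra range check is an addition that goes beyond the expression itself:

```python
import re

ip_pattern = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

# Hypothetical log lines; the last one has an out-of-range first part.
log = (
    "206.159.10.1 GET /index.html\n"
    "10.0.0.7 GET /about.html\n"
    "999.1.1.1 GET /"
)

# group(0) is the whole match; the parenthesized group only holds
# the last repetition of "digits plus period".
found = [m.group(0) for m in ip_pattern.finditer(log)]

# Optional follow-up check that every part is 255 or less.
valid = [ip for ip in found if all(int(part) <= 255 for part in ip.split("."))]

print(found)  # ['206.159.10.1', '10.0.0.7', '999.1.1.1']
print(valid)  # ['206.159.10.1', '10.0.0.7']
```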

Find doubled words in text such as "Rate this article high high, please!":

\s([A-Za-z]+)\s\1 - This expression will match a space, followed by a word of any length (which is captured by the parentheses for later recall as a backreference), then a space again. The backreference, \1, then picks up the second instance of the same word. You could then simply replace the match with a space followed by \1, which will remove the second instance of the word while keeping the space in front of the first.
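
A sketch of the doubled-word cleanup in Python's re module (an assumption). The second pattern is a stricter variant, not from the article, that uses \b so the backreference cannot match just the beginning of a longer word such as "the then":

```python
import re

sentence = "Rate this article high high, please!"

# Replace " high high" with " high", keeping the leading space.
fixed = re.sub(r"\s([A-Za-z]+)\s\1", r" \1", sentence)

# Variant with word boundaries; here the space is outside the match.
strict = r"\b([A-Za-z]+)\s\1\b"
fixed_strict = re.sub(strict, r"\1", sentence)
untouched = re.sub(strict, r"\1", "the then")

print(fixed)         # Rate this article high, please!
print(fixed_strict)  # Rate this article high, please!
print(untouched)     # the then
```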

Remove FONT tags from your web pages:

<(FONT|font)([ ]([a-zA-Z]+)=("|')[^"']+("|'))*[^>]*>([^<]+)(</FONT>|</font>) and replace with the backreference \6 - This expression looks quite complicated, but I wanted to show an example with some more involved logic. A simpler example that finds the same string will follow this explanation. <(FONT|font) accounts for an upper or lower case tag. ([ ]([a-zA-Z]+)= matches a space followed by any attribute name and an =. The next subexpression, ("|')[^"']+("|'), finds the leading double or single quote on the attribute(s), then any attribute value that doesn't contain a double or single quote, i.e. Arial, +5, #c3d4ff, etc., then the closing double or single quote. Notice that the subexpression for the entire attribute is enclosed in parentheses and followed by an asterisk - ([ ]([a-zA-Z]+)=("|')[^"']+("|'))*. This allows you to find a tag with either no attributes or any number that may exist. [^>]*> then matches anything else up to the first > (similar to the "greediness" example above); the asterisk lets it match nothing at all, so a bare <font> tag is still found. The backreference is defined next as ([^<]+), which will capture any text between the opening and closing font tags, and is referred to as \6 because it's the sixth parenthetical group in the entire expression. Then (</FONT>|</font>) accounts for the closing font tag in either case.

<(FONT|font)[^>]*>[^<]*(</FONT>|</font>) is a simpler example that finds the same strings as the expression explained above. To use it for the same replace, wrap the [^<]* in parentheses and replace with \2. The difference is that it is much less picky about what is between the angle brackets of the font tag, so if you have inconsistent tag syntax, it will probably capture the various instances you may have. On the other hand, if you have any extra junk characters in your search data, you may catch things you didn't intend, which is why you should test your expressions ahead of time.
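
The simpler font-tag expression can be tried with Python's re module (an assumption; the sample markup is invented). In this sketch the text between the tags is wrapped in parentheses so that it becomes group 2 and can be kept in the replace:

```python
import re

page = '<FONT face="Arial" size="+5">Big text</FONT> and <font>plain</font>'

# Group 1: tag name, group 2: text between the tags, group 3: closing tag.
pattern = r"<(FONT|font)[^>]*>([^<]*)(</FONT>|</font>)"
stripped = re.sub(pattern, r"\2", page)

print(stripped)  # Big text and plain
```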

A brief history of the 31 Flavors

There are a number of applications and languages that support regular expressions, but unfortunately, not all of them support regex in quite the same way. Although regular expressions had their origins in neurophysiology in the 1940s and were developed by theoretical mathematicians in the 1950s and 1960s, the evolution and subsequent divergence of regex implementations was due to the independent development of various Unix tools such as grep, awk, sed, Emacs, and others. [1]

Today, it's probably safe to say that Perl has the most robust regex engine in common use. Other languages and applications that have some form of regex support or pattern matching (and this by no means is a complete list) include: JavaScript, VBScript, PHP, Python, Tcl, Java, C, Macromedia Dreamweaver/Ultradev, ColdFusion and ColdFusion Studio, BBEdit, NoteTabPro, TextPad, UltraEdit, the XML Schema and XPath Recommendations, the various Unix tools used for text processing and their clones, and just about any modern application with a Find function.

Conclusion

Regular expressions are a powerful tool to keep in your web belt. They can appear daunting, but by learning a few simple rules, you can save yourself hours of manual find-and-replaces done the slow, boring way.

I'll close with what may be the world's first (and undoubtedly the world's worst) regular expression joke:

What did one regex say to the other?

Other Resources

[1] Mastering Regular Expressions - Friedl

www.regexlib.com

www.webreference.com/js/column5/

All the regular expressions in this article were tested using ColdFusion Studio 4.5.2, so you may encounter slight differences in different applications or languages. Thanks to Sean Palmer for some expression testing.

Chris' favorite regular expression is a smirk with an optional wink. He lives in Atlanta, Georgia and dreams of being back on the coast. He probably needs more info in his bio. (He almost never refers to himself in the third person.)
