The Robservatory

Robservations on everything…

 

The little I know about regex…and where to learn more

First off, regex is shorthand for a regular expression. And what, exactly, is a regular expression? According to the linked Wikipedia page, a regular expression is…

…in theoretical computer science and formal language theory, a sequence of characters that define a search pattern. Usually this pattern is then used by string searching algorithms for “find” or “find and replace” operations on strings.

That’s a mouthful, but what it means is that you can write some really bizarre looking code that will transform text from one form to another form. And if you know just a bit of regex, and where to go to look up what you don’t know, then you can use regex to do many useful things.

For example, consider this filename on a scanned-to-PDF receipt:

The Party Place [party supplies] - 02-06-2017

Perhaps you’d prefer it if the date came first, in year-month-day order, so that your receipts were ordered by date, like this:

2017-02-06 - The Party Place [party supplies]

Sure, you could manually rename this one file, but what if you have 500 receipts that you need to rename? Enter regular expressions—they’ll let you do this text manipulation, and many more. What follows is a very brief summary of my knowledge of regex, along with pointers to sites where I go when (very often) the problem I need to solve is beyond my regex skill level.

First off, where can you use regex? In many places; in my case, my need to learn some regex came from our own app Name Mangler, which supports it in renaming operations. I also use regex in BBEdit (and the free TextWrangler), where it’s available via a “grep” checkbox in the Find dialog.

In addition, you can use regex with any number of Terminal programs, including sed, awk, perl, and ruby.

While the syntax may be somewhat…OK, incredibly…obtuse, there’s no doubt that knowing some regex can help you manipulate text in ways that would otherwise be really time consuming. Consider the filename example from above; you can rearrange the filename into the preferred order with these regex structures:

Find: (.*)\ -\ ([\d]{2})-([\d]{2})-([\d]{4})
Replace: $4-$2-$3 - $1
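As a rough illustration outside Name Mangler, the same find and replace can be sketched in Python. Python is an assumption here (the post itself doesn't use it), and so is the helper name `reorder`. Two details differ from the syntax above: Python's `re` module writes the replacement variables as `\1`, `\2`, and so on rather than `$1`, `$2`, and in its default mode spaces don't need a backslash. The square brackets around `\d` are dropped because `[\d]` and `\d` match the same thing.

```python
import re

# Same pattern as the "Find" line, written for Python's re module.
FIND = r"(.*) - (\d{2})-(\d{2})-(\d{4})"
# Python spells the replacement variables \1..\4 instead of $1..$4.
REPLACE = r"\4-\2-\3 - \1"


def reorder(name: str) -> str:
    """Move a trailing MM-DD-YYYY date to the front as YYYY-MM-DD."""
    return re.sub(FIND, REPLACE, name)


print(reorder("The Party Place [party supplies] - 02-06-2017"))
# 2017-02-06 - The Party Place [party supplies]
```

A name that doesn't fit the pattern comes back unchanged, since `re.sub` only rewrites text that matches.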

Just how exactly does all that gibberish work to transform a filename? Follow me now as I attempt to break it down into plain language; regex pros, please note that I’m sure I’ll get some of this wrong, so feel free to correct me.

A regex generally proceeds through the text from left to right (there are ways to find relative to the end), so let’s look at each piece of the find structure in that order.

(.*)

First, about those parentheses. A set of parentheses in regex means “put whatever matches the regex inside the parentheses into a variable for future use (typically in the ‘replace’ step).” The variables are assigned in order, with the first set of parentheses becoming $1, the next $2, etc.

The .* says to match any character (the dot) zero or more times (the star). This literally means “match anything, and put it in $1.” You might think you’d wind up with the entire filename in $1, and you’d be right…except there’s more regex that limits what it matches…

\ -\ 

This bit finds the “space hyphen space” that separates the name from the date in the original filename. This also limits the first bit of the regex’s “match anything” structure: because the rest of the pattern still has to match, the (.*) can only take everything up to the “space hyphen space” that sits in front of the date. So we wind up with the name portion of the filename in $1, not the entire string.
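To see that limiting effect in isolation, here is a small Python sketch (Python and the second sample filename are assumptions for illustration, not something from Name Mangler). It runs `(.*)` by itself, then with the separator added after it.

```python
import re

name = "The Party Place [party supplies] - 02-06-2017"

# On its own, (.*) swallows the whole filename.
alone = re.match(r"(.*)", name).group(1)

# With " - " after it, (.*) has to give characters back until the
# separator can match, so group 1 stops just before it.
limited = re.match(r"(.*) - ", name).group(1)

# If the separator appears twice, (.*) is greedy and keeps everything
# up to the LAST " - " (hypothetical filename).
two_seps = re.match(r"(.*) - ", "Joe's - Downtown - 02-06-2017").group(1)

print(alone)     # The Party Place [party supplies] - 02-06-2017
print(limited)   # The Party Place [party supplies]
print(two_seps)  # Joe's - Downtown
```

The last case is why a name that itself contains a “space hyphen space” still ends up whole in the first variable.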

There are no parentheses around this part because I don’t need to keep it; I just need to find it so I can then move on to the next part of the filename.

Note: To get ultra-geeky for a sec, this bit is written in Name Mangler’s syntax, which is based on PCRE in free-spacing mode, which requires spaces to be escaped; hence the backslash in front of each of the two spaces in the string. Other forms of regex don’t require the spaces to be escaped.
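Python's `re` module has a comparable free-spacing mode, `re.VERBOSE`, so the effect can be sketched there. This is only an analogy to Name Mangler's PCRE behavior, not a description of how Name Mangler itself works; the variable names are made up for the example.

```python
import re

name = "The Party Place [party supplies] - 02-06-2017"

# Free-spacing mode: unescaped whitespace and # comments in the pattern
# are ignored, so each literal space needs a backslash in front of it.
spaced = re.compile(
    r"""
    (.*)         # group 1: the name
    \ -\         # literal space, hyphen, space
    (\d{2})-     # group 2: month
    (\d{2})-     # group 3: day
    (\d{4})      # group 4: year
    """,
    re.VERBOSE,
)

# Same pattern with plain spaces: in free-spacing mode the spaces vanish,
# leaving "(.*)-(\d{2})-..." which no longer fits this filename.
unescaped = re.compile(r"(.*) - (\d{2})-(\d{2})-(\d{4})", re.VERBOSE)

# Without the flag, plain spaces are ordinary literal characters.
default_mode = re.compile(r"(.*) - (\d{2})-(\d{2})-(\d{4})")

print(spaced.match(name).groups())
print(unescaped.match(name))          # None
print(default_mode.match(name) is not None)  # True
```

The upside of free-spacing mode is visible in the first pattern: each piece can sit on its own line with a comment next to it.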

([\d]{2})-

Another set of parentheses, so these results will become $2. The [\d] is a shorthand character class which means “a digit between 0 and 9.” The {2} says “find two of the previous thing in a row,” that is, two digits. Finally, the trailing - is the hyphen that separates the month from the day—it’s not in the parentheses because I don’t need to keep it. The net result of this step is that the month is stored in $2.

([\d]{2})-

Identical to the previous bit, except that this will be stored in $3, as it’s the third set of parentheses. This pulls the day out and stores it, again dropping the hyphen.

([\d]{4})

This is very similar to the previous two steps, but now, we find four digits—the year—and it gets stored in $4.

So after executing this find, we have four variables with these values:

  • $1 – The Party Place [party supplies]
  • $2 – 02
  • $3 – 06
  • $4 – 2017
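Those four values can be checked with a short Python sketch (Python is an assumption here; its match object numbers the groups 1 through 4 in the same left-to-right order as the parentheses).

```python
import re

name = "The Party Place [party supplies] - 02-06-2017"
match = re.match(r"(.*) - (\d{2})-(\d{2})-(\d{4})", name)

# groups() returns the captured text of each set of parentheses, in order.
captured = match.groups()

for number, value in enumerate(captured, start=1):
    print(f"${number} = {value}")
# $1 = The Party Place [party supplies]
# $2 = 02
# $3 = 06
# $4 = 2017
```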

With those values in the variables, the replace bit is now relatively self-explanatory:

$4-$2-$3 - $1

Year first, then month, day, and filename, with hyphens and spaces inserted where needed. As you can see, regex can do some powerful manipulation on strings, whether those strings are in a file or are a series of filenames. I’ve only scratched the surface here, because honestly, that’s about how deep my knowledge goes.
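For the 500-receipts case, one cautious approach is to build a list of proposed renames first and look it over before touching any files. The sketch below is an assumption-heavy illustration in Python, not how Name Mangler does it: the function name `plan_renames`, the sample filenames, and the optional file-extension group at the end of the pattern are all made up for the example. It only works on strings and never renames anything on disk.

```python
import re

# The pattern from above, plus an optional extension such as ".pdf"
# (the extension group is an addition for this sketch).
RECEIPT = re.compile(r"(.*) - (\d{2})-(\d{2})-(\d{4})(\.[^.]+)?")


def plan_renames(names):
    """Return (old, new) pairs for names that fit the receipt pattern.

    Names that don't fit are skipped. Nothing is renamed on disk.
    """
    plan = []
    for old in names:
        m = RECEIPT.fullmatch(old)
        if m is None:
            continue
        title, month, day, year, ext = m.groups()
        new = f"{year}-{month}-{day} - {title}{ext or ''}"
        plan.append((old, new))
    return plan


sample = [
    "The Party Place [party supplies] - 02-06-2017.pdf",
    "notes.txt",
    "Cafe - 11-30-2016.pdf",
]
for old, new in plan_renames(sample):
    print(f"{old}  ->  {new}")
```

Printing the plan first gives a chance to spot a bad match before any file is actually renamed.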

When I reach my regex knowledge limits, here are the sites where I go for more help.

Regular-Expressions.info

This is the “big one,” a massive site with tons of detail on everything regex. There are examples, definitions, rules, exceptions, and so much more. It’s not a light read, but when I’m stuck, I can almost always find what I’m looking for here.

The Wikipedia regex page

There’s not as much detail here, but there are some nice tables showing syntax for various regex constructs. There’s also a brief history of regex, if such things interest you.

Online regex tester

I use this site—there are many others—to test my regex builds against the text I want them to act on. As a relative regex neophyte, I spend a lot of time on this site, checking my expressions before putting them to use.

It has a couple of nice features that I really find useful. First, if you hover over your regex, pop-ups explain what each little section does (this also works when hovering over the replacement regex).

Second, if you hover over the test string, a pop-up shows exactly what your regex has captured.

There may be other sites that do this, too, but this is the first one I found, and I stopped looking at that point.

MSDN regex quick reference

Microsoft’s developer site has a nice regex reference, with lots of easy-to-read tables explaining various regex constructs.

More learning links

I haven’t used all of these extensively, but I browse them on occasion…

For a good overview of regex in general, this intro on Stack Overflow is well written and relatively easy to follow.

If you’re looking for a hands-on tutorial, I’ve recently discovered the RegexOne tutorials. There are 16 tutorials and nine sample problems, and they do an excellent job at moving from really simple stuff to some really complex stuff. Each lesson has a little “solve this” bit at the bottom, and if you view the solution, it’s fully explained so you can see how it works.

Regex syntax is complicated, but you don’t need to know a lot of it to do good stuff. Just use it cautiously. If you’re working on filenames, always make sure they’re backed up before you start, just in case. Use the linked references to learn more, and use one of the online testers to ensure that what you think you’re doing is what you’ll actually be doing.

Addendum: After this went live, someone pointed me to Patterns, a $3 App Store app that lets you build and test regex. I haven’t bought it yet, mainly because there are a few recent reports of crashes. If/when I do buy it, I’ll post a review.

3 Comments

  1. Zed Shaw used to have a great “learn the hard way” regex tutorial. I guess he’s taken it down while working on a book release. https://learncodethehardway.org/regex/

    I feel the best way to learn regex, is the simplest. Start with simple substitutions and finds with grep/vim/sed/etc. Next thing you know, you’ll be cursing the apps that don’t have regex builtin.

  2. Great explanation for someone with zero knowledge of regex, Rob. Thank you! I felt you were writing to me, clarifying your Feb 24 comment. Now I’m looking for a problem to solve using regex.


The Robservatory © 2017 Built from the Frontier theme