Regular Expressions Demystified – A Mini DSL for Regex in Python

I’ve always had a hard time with regular expressions: I know that they are useful, but I use them so rarely that I cannot get a hold of all the syntax.

So, now is the time to write an article for myself to remember all the stuff.

What are Regular Expressions aka RegEx?

A regular expression (RegEx) is a sequence of characters that describes a search pattern for text.

Say you have an input string which contains spaces, a tab and a line break:

input_string = "   \tJoernBoegeholz  \n"

You will certainly agree that it would not be a good idea to use this string as, e.g., a username. If a username is needed to log in to a system, a user will not remember whether they accidentally typed a whitespace character into the form field. So we have to remove the spaces, the tab and the line break.

output = input_string.replace(" ", "")
output = output.replace("\t", "")
output = output.replace("\n", "")

This is a bit messy. With RegEx we can use the “\s” metacharacter instead:

import re
output = re.sub(r"\s", "", input_string)

From the Python Doc:

“When the UNICODE flag is not specified, it matches any whitespace character, this is equivalent to the set [ \t\n\r\f\v].”

Please take this just as an example; in production code you would use “strip()” to remove leading and trailing whitespace.
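To make the difference between the three approaches visible, here is a minimal sketch. The variable names (replaced, substituted, stripped, inner) are only illustrative and do not come from any library.

```python
import re

input_string = "   \tJoernBoegeholz  \n"

# Chained replace calls: one call per whitespace character we thought of.
replaced = input_string.replace(" ", "").replace("\t", "").replace("\n", "")

# A single substitution: \s covers spaces, tabs, line breaks and more.
substituted = re.sub(r"\s", "", input_string)

# strip() only touches leading and trailing whitespace.
stripped = input_string.strip()

print(repr(replaced), repr(substituted), repr(stripped))

# The approaches differ once whitespace appears inside the string.
inner = " Joern Boegeholz "
inner_substituted = re.sub(r"\s", "", inner)  # 'JoernBoegeholz'
inner_stripped = inner.strip()                # 'Joern Boegeholz'
print(repr(inner_substituted), repr(inner_stripped))
```

For the input above all three give the same result. They only diverge when whitespace sits in the middle of the string, which strip() leaves alone.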

OK, here is the catch: I cannot remember the metacharacters. That makes working with RegEx cumbersome for me. To make a virtue of necessity, I started to write a domain-specific language to make working with regex in Python easier.

First Step

Each metacharacter is represented by a constant.

ANY_CHAR = '.'
DIGIT = r'\d'
NON_DIGIT = r'\D'
WHITESPACE = r'\s'
NON_WHITESPACE = r'\S'
ALPHA = '[a-zA-Z]'
ALPHANUM = r'\w'  # letters, digits and the underscore
NON_ALPHANUM = r'\W'
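Because the constants are plain strings, they can be concatenated and passed to the re module directly. A minimal sketch, repeating three of the constants so that it runs on its own. The raw-string prefix avoids warnings about invalid escape sequences, and the sample inputs are made up for illustration.

```python
import re

DIGIT = r'\d'
ALPHA = '[a-zA-Z]'
WHITESPACE = r'\s'

# One digit, one whitespace character, one letter.
pattern = DIGIT + WHITESPACE + ALPHA

print(pattern)  # \d\s[a-zA-Z]

matches_digit_first = re.fullmatch(pattern, "7 x") is not None   # True
matches_letter_first = re.fullmatch(pattern, "x 7") is not None  # False
print(matches_digit_first, matches_letter_first)
```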

Second Step

We wrap the quantifiers in convenience functions.

def zero_or_more(string):
    return string + '*'

def zero_or_once(string):
    return string + '?'

def one_or_more(string):
    return string + '+'
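One thing to keep in mind: a quantifier binds only to the element directly before it, so these helpers behave as expected for a single metacharacter but not for a longer string. The sketch below shows the effect. The group helper is hypothetical and not part of the article's code; it wraps its argument in a non-capturing group, which is one possible way around the problem.

```python
import re

def one_or_more(string):
    return string + '+'

# Hypothetical helper (an assumption, not from the article):
# wrap a sub-pattern in a non-capturing group.
def group(string):
    return '(?:' + string + ')'

plain = one_or_more('ab')           # ab+     -> 'a' followed by one or more 'b'
grouped = one_or_more(group('ab'))  # (?:ab)+ -> one or more repetitions of 'ab'

print(plain, grouped)
print(re.fullmatch(plain, 'abab') is not None)    # False
print(re.fullmatch(plain, 'abbb') is not None)    # True
print(re.fullmatch(grouped, 'abab') is not None)  # True
```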

Third Step

As syntactic sugar we introduce a class which encapsulates the pattern:

class Pattern:

    def __init__(self):
        self.pattern = ''

    def starts_with(self, start_str):
        self.pattern += start_str
        return self

    def followed_by(self, next_string):
        self.pattern += next_string
        return self

    def __str__(self):
        return self.pattern

    def __repr__(self):
        # self._regex is never defined; return the pattern built so far
        return self.pattern
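To use the finished pattern for matching, it can be converted with str() and handed to re.compile. The following is a minimal sketch that repeats a trimmed-down copy of the class and two constants so that it runs on its own. The sample inputs are invented.

```python
import re

DIGIT = r'\d'
NON_DIGIT = r'\D'

def one_or_more(string):
    return string + '+'

class Pattern:
    def __init__(self):
        self.pattern = ''

    def starts_with(self, start_str):
        self.pattern += start_str
        return self

    def followed_by(self, next_string):
        self.pattern += next_string
        return self

    def __str__(self):
        return self.pattern

# Each method returns self, so the calls can be chained.
pattern = Pattern().starts_with(DIGIT).followed_by(one_or_more(NON_DIGIT))
regex = re.compile(str(pattern))

print(str(pattern))                          # \d\D+
print(regex.fullmatch("1abc") is not None)   # True
print(regex.fullmatch("12") is not None)     # False
# starts_with only concatenates; it adds no ^ anchor,
# so search() also finds the pattern in the middle of a string.
print(regex.search("xx1abc") is not None)    # True
```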

Result

Instead of writing

pattern = r"\d\D+\s{2,4}"

you can now write

pattern = Pattern()
pattern.starts_with(DIGIT)\
    .followed_by(one_or_more(NON_DIGIT))\
    .followed_by(between(2, 4, WHITESPACE))
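The between function used here is not among the helpers shown in the steps above. A minimal sketch of how it might look, assuming the argument order (minimum, maximum, string) that the call between(2, 4, WHITESPACE) suggests; the version in the actual package may differ.

```python
import re

DIGIT = r'\d'
NON_DIGIT = r'\D'
WHITESPACE = r'\s'

def one_or_more(string):
    return string + '+'

# Assumed signature, inferred from the call between(2, 4, WHITESPACE).
def between(minimum, maximum, string):
    return string + '{' + str(minimum) + ',' + str(maximum) + '}'

built = DIGIT + one_or_more(NON_DIGIT) + between(2, 4, WHITESPACE)

print(built)                                      # \d\D+\s{2,4}
print(re.fullmatch(built, "1ab   ") is not None)  # True
print(re.fullmatch(built, "1ab ") is not None)    # False, only one trailing space
```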

which is more human readable.

My first PyPI package

After using

pip install <module_name>

for a couple of years, I wanted to know how to upload a new package to PyPI, the “Python Package Index”. This great tutorial helped me a lot with the first steps.

At the moment it’s a pet project, but if you are interested, you can install the code via pip install easy_pattern.

Links

PyPI

GitHub
