Regular Expressions Demystified – A Mini DSL for Regex in Python

I’ve always had a hard time with regular expressions: I know that they are useful, but I use them so rarely that I can never remember all the syntax.

So now is the time to write an article for myself to remember all this stuff.

What are Regular Expressions aka RegEx?

Regular expressions, or RegEx for short, are sequences of characters that describe a pattern to search for in text.

Say you have an input string which contains spaces, a tab and a line break:

input_string = "   \tJoernBoegeholz  \n"

You will certainly agree that it would not be a good idea to use this string as, e.g., a username. If a username is needed to log in to a system, users will not remember that they accidentally typed a whitespace character into the form field. So we have to remove the spaces, tabs and line breaks.

output = input_string.replace(" ", "")
output = output.replace("\t", "")
output = output.replace("\n", "")

This is a bit messy. With RegEx we can use the “\s” metacharacter instead:

import re
output = re.sub(r"\s", "", input_string)
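
To check that both approaches really give the same result, here is a small self-contained sketch. It writes the pattern as a raw string (r"\s") so that Python does not try to interpret the backslash itself; the variable names replaced and substituted are just picked for this example.

```python
import re

input_string = "   \tJoernBoegeholz  \n"

# Variant 1: one str.replace call per whitespace character
replaced = input_string.replace(" ", "").replace("\t", "").replace("\n", "")

# Variant 2: a single substitution; \s covers spaces, tabs and line breaks
substituted = re.sub(r"\s", "", input_string)

print(replaced == substituted)
print(repr(substituted))
```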

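From the Python documentation (the wording below is from the Python 2 docs; in Python 3, “\s” also matches Unicode whitespace for str patterns):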

“When the UNICODE flag is not specified, it matches any whitespace character, this is equivalent to the set [ \t\n\r\f\v].”

Please take this just as an example; in production code you would use “strip()” to remove leading and trailing whitespace.

OK, here is the catch: I cannot remember the metacharacters. That makes working with RegEx cumbersome for me. To make a virtue of necessity, I started to write a domain-specific language (DSL) to make working with regex in Python easier.
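
For comparison, a minimal sketch of the strip() variant. Note that strip() only touches the beginning and the end of the string, so it is not a drop-in replacement for the substitution above when whitespace can appear in the middle.

```python
input_string = "   \tJoernBoegeholz  \n"

# strip() without arguments removes leading and trailing whitespace
username = input_string.strip()

# whitespace inside the string is left untouched
inner = "  Joern Boegeholz  ".strip()

print(repr(username))
print(repr(inner))
```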

First Step

Every metacharacter or character class is represented by a constant:

ANY_CHAR = '.'
DIGIT = r'\d'
ALPHA = '[a-zA-Z]'
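
Because the constants are plain strings, they can be combined with ordinary string concatenation. The following sketch shows this; the name letter_digit_any and the test strings are made up for this example.

```python
import re

ANY_CHAR = '.'
DIGIT = r'\d'
ALPHA = '[a-zA-Z]'

# Concatenating the constants builds the pattern r'[a-zA-Z]\d.'
letter_digit_any = ALPHA + DIGIT + ANY_CHAR

# fullmatch only succeeds if the whole string fits the pattern
matches = re.fullmatch(letter_digit_any, "a1!") is not None
no_match = re.fullmatch(letter_digit_any, "11!") is not None

print(letter_digit_any)
print(matches, no_match)
```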

Second Step

We wrap the quantifiers in convenience functions:

def zero_or_more(string):
    return string + '*'

def zero_or_once(string):
    return string + '?'

def one_or_more(string):
    return string + '+'
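
One thing to keep in mind: a quantifier only applies to the element directly in front of it. The sketch below shows the difference, using a hypothetical helper group() that is not part of the code above and wraps its argument in a non-capturing group.

```python
import re

DIGIT = r'\d'

def one_or_more(string):
    return string + '+'

def group(string):
    # Hypothetical helper for this example: a non-capturing group makes
    # a following quantifier apply to the whole wrapped pattern
    return '(?:' + string + ')'

digits = one_or_more(DIGIT)          # one or more digits
ungrouped = one_or_more('ab')        # 'a' followed by one or more 'b'
grouped = one_or_more(group('ab'))   # one or more repetitions of 'ab'

print(bool(re.fullmatch(digits, '2024')))     # True
print(bool(re.fullmatch(ungrouped, 'abab')))  # False
print(bool(re.fullmatch(grouped, 'abab')))    # True
```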

Third Step

As syntactic sugar we introduce a class which encapsulates the pattern:

class Pattern:

    def __init__(self):
        self.pattern = ''

    def starts_with(self, start_str):
        self.pattern += start_str
        return self

    def followed_by(self, next_string):
        self.pattern += next_string
        return self

    def __str__(self):
        return self.pattern

    def __repr__(self):
        return self.pattern
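
A short usage sketch for the class. The builder only collects a string, so it has to be converted with str() before it is handed to the re module; the sample text "room b7" is made up for this example.

```python
import re

class Pattern:
    """Minimal copy of the builder above, repeated so the example runs on its own."""

    def __init__(self):
        self.pattern = ''

    def starts_with(self, start_str):
        self.pattern += start_str
        return self

    def followed_by(self, next_string):
        self.pattern += next_string
        return self

    def __str__(self):
        return self.pattern

DIGIT = r'\d'
ALPHA = '[a-zA-Z]'

# Each method returns self, so the calls can be chained
p = Pattern().starts_with(ALPHA).followed_by(DIGIT)

# re functions expect a string or a compiled pattern, hence str(p)
found = re.search(str(p), "room b7")

print(str(p))
print(found.group())
```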


Instead of writing

pattern = r"\d\D+\s{2,4}"

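you can now write

pattern = (Pattern()
    .starts_with(DIGIT)
    .followed_by(one_or_more(NON_DIGIT))
    .followed_by(between(2, 4, WHITESPACE)))

which is more human readable. Here NON_DIGIT and WHITESPACE stand for constants holding “\D” and “\s”, and between() for a function that appends a “{min,max}” quantifier, all built like the pieces from the first two steps.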
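
To try the DSL version end to end, here is a self-contained sketch. NON_DIGIT, WHITESPACE and the body of between() are not shown in this article, so their definitions below are assumptions, inferred from the raw pattern the chain is supposed to produce.

```python
import re

DIGIT = r'\d'
NON_DIGIT = r'\D'    # assumed constant name and value
WHITESPACE = r'\s'   # assumed constant value

def one_or_more(string):
    return string + '+'

def between(minimum, maximum, string):
    # Assumed implementation, inferred from the call between(2, 4, WHITESPACE)
    return string + '{' + str(minimum) + ',' + str(maximum) + '}'

class Pattern:
    def __init__(self):
        self.pattern = ''

    def starts_with(self, start_str):
        self.pattern += start_str
        return self

    def followed_by(self, next_string):
        self.pattern += next_string
        return self

    def __str__(self):
        return self.pattern

# The parentheses allow the chain to span several lines
pattern = (Pattern()
    .starts_with(DIGIT)
    .followed_by(one_or_more(NON_DIGIT))
    .followed_by(between(2, 4, WHITESPACE)))

# The builder yields the same string as the handwritten pattern
same = str(pattern) == r"\d\D+\s{2,4}"

# Three spaces after "up" satisfy \s{2,4}; a single space does not
hit = re.search(str(pattern), "7up   now")
miss = re.search(str(pattern), "7up now")

print(same, hit is not None, miss is None)
```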

My first PyPI package

After using

pip install <module_name>

for a couple of years, I wanted to know how to upload a new package to PyPI, the “Python Package Index”. This great tutorial helped me a lot with the first steps.

At the moment it’s a pet project, but if you are interested, you can install the code via pip install easy_pattern.




