Regular Expressions Demystified - A Mini DSL for Regex in Python

Motivation

Every junior developer needs a few pet projects to try out techniques they are not yet familiar with.

Because I’ve always had a hard time with regular expressions (I know they are useful, but I use them so rarely that I can never keep the syntax in my head), I started a little project to make RegEx easier to use.

What are Regular Expressions aka RegEx?

A regular expression (RegEx) is a sequence of characters that describes a pattern to search for in text.

Say you have an input string which contains spaces, a tab and a line break:

input_string = " \tJoernBoegeholz \n"

You will certainly agree that it is not a good idea to use this string as, say, a username. If a username is needed to log in to a system, users will not remember that they accidentally typed a whitespace character into the form field. So we have to remove the spaces, the tab and the line break.

output = input_string.replace(" ", "") 
output = output.replace("\t", "") 
output = output.replace("\n", "")

This is a bit messy. With RegEx we can use the “\s” metacharacter instead:

import re
output = re.sub(r"\s", "", input_string)  # raw string keeps the backslash intact

From the Python documentation:

“When the UNICODE flag is not specified, it matches any whitespace character, this is equivalent to the set [ \t\n\r\f\v].”

Please take this just as an example; in production code you would use “strip()” to remove leading and trailing whitespace.
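
To make the difference concrete, here is a small sketch comparing the two approaches. The input string with a space inside the name is my own variation for illustration, not the one used above.

```python
import re

# Assumed example input: same as before, but with a space inside the name.
input_string = " \tJoern Boegeholz \n"

# strip() only trims the ends; the inner space survives.
stripped = input_string.strip()

# re.sub with \s removes every whitespace character, including the inner one.
collapsed = re.sub(r"\s", "", input_string)

print(repr(stripped))
print(repr(collapsed))
```

Which of the two you want depends on whether whitespace inside the value is allowed.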

OK, here is the catch: I cannot remember the meta-characters. That makes working with RegEx cumbersome for me.

First Step

All meta-characters are represented as constants.

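# Raw strings (r'...') keep the backslashes from being read as string escapes.
ANY_CHAR = '.'
DIGIT = r'\d'
NON_DIGIT = r'\D'
WHITESPACE = r'\s'
NON_WHITESPACE = r'\S'
ALPHA = '[a-zA-Z]'
ALPHANUM = r'\w'  # note: \w also matches the underscore
NON_ALPHANUM = r'\W'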
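
The constants are plain strings, so they can be handed straight to the functions of the “re” module. A minimal sketch, with a sample text that I made up for illustration:

```python
import re

# A few of the constants from above, repeated so the snippet runs on its own.
DIGIT = r'\d'
WHITESPACE = r'\s'
ALPHA = '[a-zA-Z]'
ALPHANUM = r'\w'

text = "Room 42 B"  # made-up sample text

digits = re.findall(DIGIT, text)     # every single digit
letters = re.findall(ALPHA, text)    # every single ASCII letter
parts = re.split(WHITESPACE, text)   # split on each whitespace character

# \w is slightly wider than "alphanumeric": it also accepts the underscore.
underscore_matches = re.fullmatch(ALPHANUM, "_") is not None

print(digits, letters, parts, underscore_matches)
```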

Second Step

We wrap the quantifiers (the “multipliers” such as *, ? and +) in convenience functions.

def zero_or_more(string):
    return string + '*'

def zero_or_once(string):
    return string + '?'

def one_or_more(string):
    return string + '+'
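
One thing to watch out for: these functions simply append the quantifier, so with a multi-character argument it applies only to the last character. The sketch below shows the effect. The “group” helper is hypothetical, something I added for illustration, and not part of the code above.

```python
import re

def one_or_more(string):
    return string + '+'

def group(string):
    # Hypothetical helper: wrap the argument in a non-capturing group
    # so that a quantifier applies to the whole sequence.
    return '(?:' + string + ')'

naive = one_or_more('ab')            # 'ab+'  : an 'a' followed by one or more 'b'
grouped = one_or_more(group('ab'))   # '(?:ab)+' : one or more repetitions of 'ab'

print(re.fullmatch(naive, 'abab'))    # no match
print(re.fullmatch(naive, 'abbb'))    # match
print(re.fullmatch(grouped, 'abab'))  # match
```

For single metacharacters such as DIGIT or WHITESPACE the plain functions are enough, because those already behave as one unit.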

Third Step

As syntactic sugar we introduce a class which encapsulates the pattern:

class Pattern:

    def __init__(self):
        self.pattern = ''

    def starts_with(self, start_str):
        self.pattern += start_str
        return self

    def followed_by(self, next_string):
        self.pattern += next_string
        return self

    def __str__(self):
        return self.pattern

    def __repr__(self):
        return self.pattern
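
Because “__str__” returns the assembled pattern, the object can be passed on to the “re” module via “str()”. The following is a minimal sketch with a trimmed copy of the class and made-up test strings. It also shows that “starts_with” as written here does not add a “^” anchor, so whether the match is tied to the start of the text depends on the “re” function used.

```python
import re

DIGIT = r'\d'
ALPHA = '[a-zA-Z]'

class Pattern:
    # Trimmed copy of the class above so the snippet runs on its own.
    def __init__(self):
        self.pattern = ''

    def starts_with(self, start_str):
        self.pattern += start_str
        return self

    def followed_by(self, next_string):
        self.pattern += next_string
        return self

    def __str__(self):
        return self.pattern

p = Pattern().starts_with(DIGIT).followed_by(ALPHA)
regex = re.compile(str(p))      # compiles the string r'\d[a-zA-Z]'

print(regex.match("7x!"))       # match() is anchored at the start: finds '7x'
print(regex.match("ab7x"))      # no match at the start of the text
print(regex.search("ab7x"))     # search() scans the whole text and finds '7x'
```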

Result

Instead of writing

pattern = r"\d\D+\s{2,4}"

you can now write

pattern = Pattern()
pattern.starts_with(DIGIT)\
    .followed_by(one_or_more(NON_DIGIT))\
    .followed_by(between(2, 4, WHITESPACE))

which is more human readable.
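
The “between” function used in the example is not shown in the steps above. A plausible sketch, assuming the argument order (minimum, maximum, string) from the call and the same append style as the other quantifier functions, could look like this. The packaged version may well differ.

```python
import re

DIGIT = r'\d'
NON_DIGIT = r'\D'
WHITESPACE = r'\s'

def one_or_more(string):
    return string + '+'

def between(minimum, maximum, string):
    # Assumed implementation: append the {min,max} quantifier to the argument.
    return string + '{' + str(minimum) + ',' + str(maximum) + '}'

# Same building blocks as in the example, joined by plain concatenation.
pattern = DIGIT + one_or_more(NON_DIGIT) + between(2, 4, WHITESPACE)

print(pattern)
print(re.fullmatch(pattern, "1ab   "))  # match: ends in at least two whitespace characters
print(re.fullmatch(pattern, "1ab "))    # no match: only one trailing whitespace character
```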

My first PyPI package

After using

pip install <module_name>

for a couple of years, I wanted to know how to upload a new package to PyPI, the “Python Package Index”, so I’ve written another tutorial:

Distributing your own package on PyPI

At the moment it’s a pet project, but if you are interested, you can install the code via

pip install easy_pattern

Links

PyPI

GitHub
