Even more Beautiful Code (C → Haskell)

Edit

The best thing about blogging haskell is its so easy to understand that other people show you how to make your code better- edits have been made to make the code more concise. Thanks Neil, dan, and ssylvan on reddit.

Intro

In the book Beautiful Code some code is more beautiful than other code. But there is little doubt that the opening chapter presents a regular expression matcher in a beautifully succinct fashion. As the author Brian Kernighan states: “I don’t know of another piece of code that does so much in so few lines while providing such a rich source of insight and further ideas.”

The example is written in C. I wondered if a more beautiful version could be created using a newer programming language. Kernighan points out that it takes more code to produce the equivalent in Java, which is not suprising. But he also points to the strengths of C. “Pointers … are used to create compact expressions that naturally express the extracting of individual characters and advancing to the next character. Array indexing of substrings can achieve the same effect, but in this code, pointers do a better job …” But we can maintatin the advantages of pointers and work at a slightly higher level by using lists. Since this code also uses recursion and pattern matching, Haskell is a very natural replacement.

The following is a translation of the regular expression C code which can be seen at the oreilly book site by clicking on “implementation” under “Chapter 1: A Regular Expression Matcher”

This regular expression engine supports ’.’ to match any character, ’*’ to match multiples of a character, and ’^’ and ’$’ for begginning and ending anchoring.

> module BasicRegex (match, matchShell) where
> import Data.List( tails )
> 
> match :: [Char] -> [Char] -> Bool
> match ('^':regex) text = matchhere regex text
> match regex text = any (matchhere regex) (tails text)
>
> matchhere :: [Char] -> [Char] -> Bool
> matchhere [] _ = True
> matchhere (c:'*':regex) text = matchstar c regex text
> matchhere regex [] = regex == ['$']
> matchhere (r:regex) (t:text) = charMatch r t && matchhere regex text
> 
> -- non-greedy star matching
> matchstar :: Char -> [Char] -> [Char] -> Bool
> matchstar c regex [] = matchhere regex []
> matchstar c regex (t:text) =
>   matchhere regex (t:text) || charMatch c t && matchstar c regex text
>
> charMatch :: Char -> Char -> Bool
> charMatch c t = c == '.' || c == t

There it is, a very basic regular expression matcher weighing in at 20 16 lines of code.

When I first saw the C code, my reaction was “wow, they just implemented basic regex functionality in 30 lines of C.” Looking at the Haskell version, it is much easier to just say “of course it is an implementation of regex functionality in 20 lines of haskell.”

Kernighan also shows a greedy star matcher, which can be seen here by clicking on the “Alternatives” section. Although curiously, unlike the rest of the code it does not use recursion. Perhaps this is because of a greater risk of stack overflow. But if it it translated it into Haskell using pure recursion, more insight can be gained into how similiar it is to the non-greedy matchstar

> -- Greedy star matcher
> matchstarG c regex [] = matchhere regex []
> matchstarG c regex (t:text) = charMatch c t && matchstarG c regex text ||
>                               matchhere regex (t:text)

Now it is obvious that the difference is whether one first tries to match the star (greedy) or tries to match the rest of the regex (nongreedy).

From here we can make some small modifications to complete some of the additional excercises suggested by Kernighan.

Add some more metacharacters ’+’ and ’?’

> matchhere_ (r:'+':regex) (t:text) =
>   charMatch r t && matchstar r regex text
> matchhere_ (r:'?':regex) (t:text) =
>   (charMatch r t && matchhere regex text) || matchhere regex (t:text)

Add escaping for metacharacters

This won’t allow escaping of a dot, but works for the rest.

> -- escape any charcter, put this at the top of matchhere definitions!
> matchhere_ ('\\':regex) (t:text) = matchhere regex text

Modify the regex to behave like a shell (file globbing)

In shell globbing, there is implicit anchoring, ’*’ matches any string, and ’?’ matches any charcter We can accomplish this by mapping shell metacharcters to our previous metacharters.

> isMeta c = any (== c) "*+?."
> anchorize str = ('^':str ++ ['$'])
>
> matchShell :: [Char] -> [Char] -> Bool
> matchShell shell text       = match (anchorize (shellConvert shell)) text
>
> shellConvert :: [Char] -> [Char]
> shellConvert []             = []
> shellConvert ('\\':s:shell) =  '\\':s:(shellConvert shell)
> shellConvert ('*':shell)    = '.':'*':(shellConvert shell)
> shellConvert ('?':shell)    =     '.':(shellConvert shell)
> shellConvert (s:shell)
>   | isMeta s  = '\\':s:(shellConvert shell)
>   | otherwise = s:(shellConvert shell)

Implementing character ranges (like ‘[1-9]’), proper escaping and alternation

The fact that we have not yet escaped the ’.’ character should be a clue as to how hard it is to parse the regular expression while matching. More complex regular expression are very difficult to parse by hand like we have been doing, and real regex engines always compile the regular expression before doing any matching. Infact, after toying around with this code, I cannot call it beautiful anymore, because it is not at all extendable. For me, code is beautiful when it can be easily understood, and easily changed and extended. You can find a simple haskell regex engine that supports some character ranges and alternation here. The author uses haskell’s parsing library.

Conclusions

The code that Kernighan believed to contain a great deal of compact wisdom has benn made more compact. But converting C code to Haskell did not give a huge reduction in characters or lines of code for this simple exercise. The conversion cut the lines of code in half. It also give more declaritive code that is safer and much more readable and understandable. More of these advantages will be realized as the regex engine is made more advanced.

We have certainly given up some performance, although I haven’t benchmarked this implementation. It is possible to get some performance back without changing the code (but removing support for unicode) by using the ByteString library, which would be more equivalent to characters in C.

There is something to learn about design here. Interpreting the regular expression and matching the text with it are two distinct activities that must be seperated to make the code more flexible.

Code

The body of this article is a literate Haskell program. You can copy the text, then paste it in a file with the extension .lhs, then execute it using a haskell compiler or interpreter like GHCi.

Attached is code for testing the implementation

© Greg Weber. All original code snippets are placed in the public domain.
Written on Mon Feb 4 20:00:22 2008.
Tagged as C, haskell, regex.
Send author an email.