Project Overview
During my work at VuNet Systems, I frequently worked with Regular Expressions, which sparked my curiosity about how regex engines actually work under the hood.
To understand the theory behind regular languages and parsing algorithms, I self-studied online courses on Theory of Computation and Compilers. Armed with this knowledge, I built a complete regex engine from scratch in Python—without using Python’s built-in re module.
The engine supports all fundamental regex syntax including quantifiers (both greedy and lazy), character classes, capture groups, anchors, and alternation. It implements a recursive descent parser to convert regex patterns into an Abstract Syntax Tree (AST), followed by a backtracking matcher for pattern matching. The architecture roughly follows Java’s java.util.regex module.
Key Features
- Complete Parsing Pipeline: Recursive descent parser that converts regex patterns into an AST representation
- Advanced Quantifiers: Support for
*,+,?,{n,m}with both greedy and lazy variants - Character Classes: Including ranges
[a-z], negation[^abc], and shortcuts like\d,\w,\s - Capture Groups: Regular
(...), non-capturing(?:...), and named groups(?<name>...) - Anchors: Position assertions with
^,$,\b,\B
Tech Stack
- Language: Python
Limitations & Future Work
While functional for educational purposes, the engine lacks some advanced features like lookahead/lookbehind assertions and backreferences. Future enhancements could include NFA/DFA compilation for improved performance and an interactive regex debugger.
GitHub Repository
Full source code and documentation: miniRegex on GitHub