
Inverse parentheses

Have you ever noticed that lots of programming languages let you use parentheses to group operands, but none use them to ungroup them? No? Well, let’s pretend this is a normal thing to be thinking about, and see what we can do about it.

Grouping with parentheses is relatively easy to add to a language grammar. The rule that accepts atomic things like 37 simply needs to also accept an opening paren,[1] at which point it will recursively parse an entire expression, and then eat a closing paren.

def parse_atom(lex):
    r = next(lex)                  # tokens are (type, text) pairs
    if r[0] == 'integer':
        return int(r[1])
    elif r[0] == '(':
        expr = parse_expr(lex)     # recurse for the whole inner expression
        s = next(lex)
        if s[0] != ')':            # the recursion stops at the close paren
            raise ParseError("missing close paren")
        return expr
    else:
        raise ParseError(f"unexpected {r[0]}")

Anti-grouping isn’t quite as straightforward. Our parser can’t follow the structure of the parentheses, because then it wouldn’t be following the structure of the expression—the whole point is that these are dissimilar.

I don’t know if it’s possible to write a pure parser that does this. But purity is overrated anyway. I decided to take inspiration from another language with a weird grouping system.

Python

Did you know that Python’s grammar has braces? You just don’t type them. The tokeniser[3] keeps track of the indentation level and inserts special tokens when it changes. The parser itself doesn’t need to worry about counting whitespace; it just sees blocks of statements bracketed by INDENT and DEDENT,[4] which are easy to parse.

As it happens, Python’s tokeniser also knows when it’s inside a parenthesised expression. Indentation inside parens is not significant, and this is implemented by tracking the paren nesting depth and suppressing INDENT and DEDENT while it’s non-zero.
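You can watch this suppression happen with Python’s own tokenize module: the indented body of an if-statement is bracketed by INDENT and DEDENT, while a continuation line inside parentheses, despite being indented, produces neither.

```python
import io
import tokenize

# The if-block gets INDENT/DEDENT; the indented continuation line inside
# the parentheses produces only an NL (a non-logical newline) instead.
src = "x = (1 +\n       2)\nif x:\n    y = 3\n"
names = [tokenize.tok_name[tok.type]
         for tok in tokenize.generate_tokens(io.StringIO(src).readline)]
print(names.count("INDENT"), names.count("DEDENT"))  # 1 1
```

Exactly one INDENT/DEDENT pair comes out, and it belongs to the if-block; the tokeniser never even considers the indentation inside the parens.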

What if we used the same trick? Instead of trying to do all this in the parser somehow, the tokeniser could track its nesting depth, and emit a “friendliness” score for each token. Then we can simply parse operators in ascending order of friendliness.

In this model 1 + (2 * 3) will yield the following token stream:

1
+ (0)
(
2
* (-1)
3
)

We’ll leave the parentheses in the token stream, but all the parser needs to do with them is generate a syntax error if it finds one in the wrong place. Grouping will be handled entirely by the precedence levels embedded in the token stream.[5]
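A minimal sketch of such a tokeniser might look like this (my own toy code, not the post’s actual implementation): it tracks the nesting depth and stamps each operator with a friendliness of minus the current depth, producing exactly the stream shown above.

```python
import re

def tokenise(text):
    """Yield tokens for integers, + and *, keeping parens in the stream.

    Each operator token carries a friendliness score equal to minus the
    paren nesting depth at the point where the operator appears.
    """
    depth = 0
    for m in re.finditer(r"\d+|[+*()]", text):
        tok = m.group()
        if tok == '(':
            depth += 1
            yield ('(',)
        elif tok == ')':
            depth -= 1
            yield (')',)
        elif tok.isdigit():
            yield ('integer', int(tok))
        else:
            yield ('op', tok, -depth)   # friendliness = -nesting depth

list(tokenise("1 + (2 * 3)"))
# [('integer', 1), ('op', '+', 0), ('(',),
#  ('integer', 2), ('op', '*', -1), ('integer', 3), (')',)]
```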

A not-so-infinite climb

The tokeniser hack solves our parsing problem, but it creates another one: our language now has infinitely many precedence levels. I don’t feel like trying to handle that with hand-rolled recursive descent, but a rummage through school textbooks suggests a precedence climbing parser is what we need. It deals with operators in the order it meets them, so having infinitely many possible precedences won’t bother it.

I hacked this together and it’s appropriately silly:

> (1 + 2) * 3
1 + 2 * 3
> 1 + (2 * 3)
(1 + 2) * 3
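To make the climb concrete, here is a toy reconstruction (a guess at the shape of the implementation, not the real code): friendliness is used directly as the precedence, so the parser never needs to know how many levels exist. It takes a pre-tokenised stream, hard-coded here for 1 + (2 * 3), and, as in the rest of this post, it ignores natural operator precedence.

```python
# Token stream for "1 + (2 * 3)", as a depth-tracking tokeniser would
# produce it: operators carry minus the nesting depth as friendliness.
TOKS = [('integer', 1), ('op', '+', 0), ('(',),
        ('integer', 2), ('op', '*', -1), ('integer', 3), (')',)]

class Tokens:
    """A trivially peekable token stream."""
    def __init__(self, toks):
        self.toks, self.i = list(toks), 0
    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else ('end',)
    def next(self):
        t = self.peek()
        self.i += 1
        return t

def parse_atom(ts):
    while ts.peek()[0] == '(':      # open parens are legal before an operand
        ts.next()
    t = ts.next()
    if t[0] != 'integer':
        raise SyntaxError(f"unexpected {t[0]}")
    return t[1]

def parse_expr(ts, min_friend=float('-inf')):
    """Precedence climbing, with friendliness standing in for precedence."""
    lhs = parse_atom(ts)
    while True:
        while ts.peek()[0] == ')':  # close parens are legal after an operand
            ts.next()
        t = ts.peek()
        if t[0] != 'op' or t[2] < min_friend:
            return lhs
        ts.next()
        # Left-associative: the recursive call only accepts operators
        # strictly friendlier than this one.
        lhs = (t[1], lhs, parse_expr(ts, t[2] + 1))

def parse(toks):
    ts = Tokens(toks)
    tree = parse_expr(ts)
    if ts.peek()[0] != 'end':
        raise SyntaxError("paren in the wrong place")
    return tree

parse(TOKS)   # ('*', ('+', 1, 2), 3) — i.e. (1 + 2) * 3, as in the REPL above
```

Any paren skipped in the two “legal” positions is harmless; a paren anywhere else either trips the unexpected-token check or survives to the end of the stream, where parse rejects it.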

Something I particularly enjoy about the implementation I landed on is that if you increase friendliness instead of decreasing it, you end up with an ordinary[6] parser. It’s also a good platform for other questionable syntactic innovations, like a language with no parentheses at all, using whitespace to weaken binding.

Future work

While we’ve achieved a lot here today,[7] we’ve also raised some important new questions. For instance, is it always necessary to double-parenthesise expressions in more complex cases?

> ((1 * 2)) + (3 * 4)
1 * ((2 + 3) * 4)

And is it possible to have an anti-grouping parser that gives an involution when hooked up to an ordinary printer?

These are promising avenues for deeper study, and I’d love to hear from anyone who chooses to take them on.


  1. Does it bother anyone else that “parentheses” (meaning round brackets) doesn’t seem to have a satisfactory[2] singular form? 

  2. To me, “parenthesis” sounds like a pair of parentheses, possibly together with the text they bracket, rather than a single paren. 

  3. Also called a lexer or occasionally a scanner. I’m going to stick with “tokeniser” because it feels the most like a real word. 

  4. Sadly these are just internal names and not spellings, so you can’t use them to write exciting one-liners. 

  5. I have ignored the order of operations in this post to keep things simple, but my implementation does handle it. Operators are looked up in a table and their “natural” precedence is paired with their friendliness, so in fact the precedences of + and * from the example would be (0,0) and (-1,1) respectively. 

  6. Outwardly. 

  7. Actually early 2024, but it took me until today to get around to writing it up. 
