Arc Forum | "Hygiene" via raw objects in macroexpansions

Arc Forum

"Hygiene" via raw objects in macroexpansions

5 points by waterhouse 2855 days ago | 7 comments

I see a use for the general idea of macros where the expansion uses some functions (or macros or whatever) that you might not even know had definitions, and where it doesn't make a difference if the macro call appears inside a 'let that rebinds the name of one of the functions the macroexpansion uses. This seems to be where some people go for "hygienic macros".

But I see what seems like a very simple way to do it: First, make sure that function objects evaluate to themselves[1], as do macros, and special forms like if.[2] Second, arrange for macroexpansions to contain, not the names of the objects they use, but the objects themselves. For illustration, here is the definition of 'for in arc.arc:

  (mac for (v init max . body)
    (w/uniq (gi gm)
      `(with (,v nil ,gi ,init ,gm (+ ,max 1))
         (loop (assign ,v ,gi) (< ,v ,gm) (assign ,v (+ ,v 1))
           ,@body))))

It becomes something like this:

  (mac for2 (v init max . body)
    (w/uniq (gi gm)
      `(,with (,v nil ,gi ,init ,gm (,+ ,max 1))
         (,loop (,assign ,v ,gi) (,< ,v ,gm) (,assign ,v (,+ ,v 1))
           ,@body))))

Two things become clear. First, quasiquoting seems kind of stupid because almost every term has a comma or comma-at, and, actually, the only terms without commas are the constants nil and 1, which could be comma'd. I'm thinking about what the best syntax for this use case would be; likely quasiquote but where all the bare terms are evaluated and you need an explicit ' to turn that off. Raw list operations certainly look worse.

The second thing is that "assign", being a special form in Arc, doesn't actually have a definition, so if I want to even macroexpand it for demonstration, I have to give it one. Thus:

  arc> (macex1 '(for2 i 1 10 prn.i))
  Error: "_assign: undefined;\n cannot reference undefined identifier"
  arc> (= assign 'ASSIGN-OBJECT)
  ASSIGN-OBJECT
  arc> (macex1 '(for2 i 1 10 prn.i))
  ; after manual prettifying...
  
  (#(tagged mac #<procedure: with>)
      (i nil gs3342 1 gs3343 (#<procedure:+> 10 1))
    (#(tagged mac #<procedure: loop>)
        (ASSIGN-OBJECT i gs3342)
        (#<procedure:<> i gs3343)
        (ASSIGN-OBJECT i (#<procedure:+> i 1))
      prn.i))

The macroexpansion contains no references to the symbol '+, and so if the whole expression were '(let + - (for i 1 10 prn.i)), we'd expect the same behavior as without the let.[3]

Now, this does look worse than the macroexpansion of arc.arc's for:

  (with (i nil gs1722 1 gs1723 (+ 10 1))
    (loop
        (assign i gs1722)
        (< i gs1723)
        (assign i (+ i 1))
      prn.i))

But that could be addressed with better printing style. Try this:

  (#M:with (i nil gs3342 1 gs3343 (#F:+ 10 1))
    (#M:loop
        (#S:assign i gs3342)
        (#F:< i gs3343)
        (#S:assign i (#F:+ i 1))
      prn.i))

It does make it more difficult for macroexpansions to be written in a machine-readable way (such that passing it to 'read yields an equivalent structure). Though, if you wanted to give a meaning to the above, "#M:<name>" could be a reader macro that returns "the object that the global variable 'name refers to (and possibly throw an exception if it's not a macro)". Also, in Common Lisp, reading gensyms like #:G366 back in causes their identities to diverge (each instance of #:G366 it reads is a new uninterned symbol), so I guess 'read-able macroexpansions are probably not a real use case anyway.

Using this approach has other implications. The global references are resolved at macroexpansion time instead of runtime. For an interpreter of the type I describe, which handles first-class macros as a matter of course, this is no change at all. But for a compiler that hopes to expand the macros once, ahead of time (e.g. when the function whose body uses the macro is defined), it's trickier. What happens when one of those global references gets redefined? The dumbest approach is to re-expand all macros when that happens; I think some compiled systems explicitly say "macros might get expanded multiple times, don't do anything stupid". If some function modifies a global variable—let's say, a stack represented as a list—repeatedly, then... It might make sense for the macro-function to explicitly quote that global variable, so that it remains a variable reference in the macroexpansion. And maybe such a compiled system could warn if there is compiled code that, deterministically, redefines a variable that a macro-function references.

Anyway, that's my thinking so far. Questions:

1. Have people done this already?

2. What do you all think of this?

[1] "What happens when a Common Lisp macro returns a list whose car is a function? (Not the name of a function, mind you, but an actual function.) What happens is what you'd expect, in every implementation I've used. But the spec doesn't say anything about this." http://www.paulgraham.com/ilc03.html

[2] At least from the perspective of an interpreter, it seems to me that 'if and 'fn should be bound to special objects, and that the interpreter, when evaluating a form like '(if x y z), should evaluate the car of the expression, discover that it evaluates to the if-object, and use that to decide how it handles the rest of the form. The expression should be equivalent to, e.g., '((car (list if)) x y z). Likewise, '(<an expression evaluating to the for macro> i 1 10 prn.i) should print the numbers from 1 to 10. First-class macros and special forms, in other words. The common case of macros with global names can be handled by an inlining compiler; as for unexpectedly getting a macro in the functional position at runtime, if it's really needed for optimizations, at worst you could have a runtime flag that turns that into an exception, and therefore the compiler is allowed to assume that dead variables won't end up being referenced by a surprise macroexpansion.

[3] The interpreter will run (eval '(for i 1 10 prn.i) env), where env is an environment containing the "+ -> -" binding; then it'll take the car, look up 'for, obtain the macro it's bound to, then call the macro-function with the arglist '(i 1 10 prn.i), after which it'll evaluate the macro-function's body in its saved lexenv (which is probably nil) augmented by bindings for the arguments 'v, 'init, 'max, and 'body.

4 points by rocketnia 2854 days ago | link

This came up during ar development, although I don't remember who brought it up. I think aw was about to implement it but was concerned that it would be annoying at the REPL to have all the program's bindings be looked up eagerly (aka for them to not be "hackable"). It could be annoying for mutually recursive macros, too, although those are already tricky enough that it probably wasn't a big concern at the time.

I remember recommending an upgrade: Instead of generating:

  `(,my-func ,a ,b)

Generate something that preserves the late binding of mutable global variables, like this:

  `((',(fn () my-func)) ,a ,b)

I seem to remember aw wasn't convinced this cruft was going to be worth whatever hygiene it gained, even after I recommended a syntactic upgrade:

  `(,late.my-func ,a ,b)

Since aw values concision a lot, perhaps the issue was that most programmers would surely just write ,my-func in the hope that it would be rare (or even incorrect) to ever want to rebind it, and then other programmers would suffer from that decision.

But Pauan followed a design path like this in Nulan and/or Arc/Nu. In Pauan's languages... actually, I think I remember a couple of approaches, and I don't remember which ones were real.

One approach I think I remember is that `(my-func a b (another-func c) d) inserted mutable boxes for my-func and another-func into the result, but it would just implicitly unquote a, b, c, and d, because variables at the beginning of a list are likely to be mutable globals and variables in other positions are likely to refer to gensyms.

There might have even been an auto-gensym system in that quasiquote operator at some point.

---

I liked this approach a lot at the time. That was the same time I was working on Penknife. I was trying to avoid name collision with first-class namespaces in Penknife, but I still wanted macros to work, and I was trying to take a very similar approach. (I couldn't take exactly the same approach because my macroexpansion results were strings; instead, I think I allowed regions of the macroexpansion result to be annotated with a first-class namespace to use for name lookup.)

When Penknife's compile times were abysmally long, I started to realize that even if I found an optimization, this was going to be a problem with macros in general. Anyone can write an inefficient macro, and anyone who can put up with it can build a lot of stuff on top of it that other users won't be able to appreciate. So I started to require separate compilation in my language designs.

With separate compilation in mind as a factor for the language design, it no longer made sense to put unserializable values into the compiled code. Instead, in Penknife's case, I devised a system of namespace paths to replace them. The namespaces were still first-class values, but one thing you could do with a Penknife macro was get hold of its first-class definition-site namespace, so the macroexpanded code could refer to variables in distant namespaces by specifying a chain of names of macros to look them up from. And this kept things hackable, too, since you could mutate a macro's definition-time namespace explicitly (not that many programs or REPL sessions would bother to do that).

Not long after, I set a particular goal to make a language (Era) where the built-in functionality was indistinguishable from libraries. Builtins aren't hackable from within the language, so the module system needs to make it possible (and ideally easy) for people to write non-hackable libraries.

(Technically they only need to be non-hackable to people who don't know the source code, because once you know the source code, you know you're not dealing with builtins. I intend to take advantage of this to make the language hackable after all, but it's going to take essentially a theorem prover in the module system before it's useful to tell the module system you have the source code of a module, as opposed to just using that source code to compile a new module of your own behind the module system's back.)

Anyhow, this means I haven't put hackability at the forefront for a long time.

I think the embedding-first-class-values approach will work, and I think late binding is workable (by using late.my-func) and there's a workable variation of that late binding approach to enable separate compilation too (by using namespace paths made out of chains of macro names). So I like it, but I just have this stuff to recommend for it to make it really tick. :)

---

By the way, it's good to see you! I wondered how you were doing.

-----

3 points by waterhouse 2840 days ago | link

With interpreter semantics, in which a macro gets expanded anew every time a function is called, the late binding comes for free. ;-) Then, if you want the runtime performance that comes from compilation, you optimize for the case where people are not redefining functions, and invalidate old compilation results when they do. I think that rough plan should be doable, though I haven't gotten around to implementing enough of a system to say how well it works. But I think that's the only way to get anything close to good performance in Javascript VMs (not that they expand macros, but I expect they inline function calls and such, which requires similar assumptions about global definitions), and it seems to have been done.

For separate compilation, it does seem clear that what gets serialized will be references like "the object [probably a function] named 'foo in module bar", and structures (s-expressions or otherwise) containing such references. Given that compilation implies macroexpansion, you do have to assume (or verify) that the macros from other modules are what they used to be—and that non-macros (used in functional position at least) are still non-macros. If you have a full-blown Makefile kind of build system, then by default I suppose every file's output depends on the contents of every other file that it uses; or, as an optimization, depends merely on the exact set and definitions of macros exposed from those files. (In the C++ system I encounter at work, code is separated into .cpp and .h files, and editing a .h file causes the recompilation of every .cpp file that recursively depends on it, but editing a .cpp file only causes its own recompilation. If you wanted to imitate that, I guess you'd put macros into a distinctively named set of files, and forbid exportable macros anywhere else.)

---

Thanks! I've sold out and have been working for a medium-sized company doing mostly C++ and bash (the latter is unbelievably useful) for the past 3.5 years. I make intermittent progress on the side doing other things.

-----

4 points by aw 2846 days ago | link

Especially for modules, I do like using function and macro values in macro expansions.

Consider a module which contains a macro such as `for`. I can import the macro into my module simply by copying the macro value into my module, e.g.: `(= for other-module!for)`. By using function and macro values in the macro expansion (as you do in `for2`), the functions and macros that the imported macro uses don't also need to be copied into my module.

I haven't worried myself about how most terms end up getting a comma prefix, but I agree there could be an alternative syntax where injecting the value of a symbol (instead of the symbol itself) could be the default.

To handle special forms, a solution is to come up with a globally unique random prefix long enough that no one could generate it accidentally (e.g. "xVrP8JItk2Ot"), and then rename the special forms `assign`, `fn`, etc. to `xVrP8JItk2Ot-assign`, `xVrP8JItk2Ot-fn` etc; and then add `assign` and `fn` macros etc. that expand into the prefixed special form. Now `assign`, `fn`, etc. are macros that can be used as values in a macro expansion.

(Making `fn` a macro also has the advantage that features such as default arguments can be implemented in Arc).

I was at first concerned that using such a random prefix was ugly, however eventually I realized that no one ever needs to see the prefixed versions :-)

> (aka for them to not be "hackable")

I was indeed curious about maximizing hackability as a general principle... but with modules even if a macro such as `for` is imported into my default namespace and thus expansions such as `with` wouldn't use my version, in practice it's easy enough to create a new module with my version `with` etc. and simply load the `for` source into the new module. (Or simply reload the `for` source into my default module).

    > (eval (list + 1 2))
    3

This is unusual in Lisp implementations, and I was quite pleased when I discovered that Racket supported this.

    (#M:with (i nil gs3342 1 gs3343 (#F:+ 10 1))

This is a nifty idea... hmm, is there a reason to use different prefixes for macros and functions? We could have a read syntax which evaluated to the value of the symbol (whatever it is), and that would work for both functions and macros.

-----

2 points by waterhouse 2840 days ago | link

> To handle special forms, a solution is to come up with a globally unique random prefix long enough that no one could generate it accidentally (e.g. "xVrP8JItk2Ot"), and then rename the special forms `assign`, `fn`, etc. to `xVrP8JItk2Ot-assign`, `xVrP8JItk2Ot-fn` etc;

Would this prefix be present in the source files of the language implementation? Or generated at runtime, or something else? If the latter, it seems like gensyms are the right approach. If the former, what's the use case—someone writing programs that assume someone might have redefined "assign" and want to work with the builtin? (If it's just hacking arc3.1 to do what we want, I think you can create more or less arbitrary unique objects (vectors, cons cells, a new kind of struct), give them bindings in the base Arc environment, and replace e.g. '(eq? (xcar s) 'quote) with '(eq? (xcar s) the-quote-object).)

Speaking of gensyms, I had a plan recently. I like PG's approach of just sequentially named variables, but (a) my ideal language would probably expose its interned-symbol table, and it'd therefore be more orthogonal and simpler for it to directly expose the "create a symbol without interning it" function, so uninterned symbols should be available for free; (b) it is possible that people will name a variable "gs1", so that's another reason to use true gensyms; (c) many macros use gensyms, but I might like to bootstrap a system that likes macros the expansion of which incurs zero side effects, and incrementing a gensym counter is a side effect. For (c) there might be another approach, but I came up with this one: Have the printer assign sequential numbers to each uninterned symbol it encounters (with a weak hash table). This has the nice effect that, when you do print macroexpansions, the numbers you see will begin at 1, rather than in the hundreds or thousands and varying depending on how many libraries have been loaded.

> (Making `fn` a macro also has the advantage that features such as default arguments can be implemented in Arc).

Yes, that is nice. I actually did this in my Lisp in Summer Projects submission[1] many years ago.

> is there a reason to use different prefixes for macros and functions? We could have a read syntax which evaluated to the value of the symbol (whatever it is), and that would work for both functions and macros.

Eh, no strong reason. I'd thought about doing a reverse variable lookup for every value and printing the variable name if it existed, but it makes less sense if e.g. the value 10 gets printed as #V:<some global variable that happened to be bound to 10>. Also, I figured an IDE-type setup (such as DrRacket) might use different colors or fonts to distinguish macros/functions/special forms/other. Last, you do need a way to print lambdas that don't have a global name (and might not even have a local name), and I'm guessing you'd want that to appear as #f:<some attempt at indicating line number / REPL input>, looking analogous to but different from the global case.

[1] https://github.com/waterhouse/emiya : "[O]ptional arguments are implemented by redefining "fn" (Arc's equivalent of "lambda") as a macro that expands to a call to "underlying-fn", which is bound to the fn special object--again, by the user. Then everything is recompiled, so that this redefinition does not cost anything at runtime; some care is needed here, to avoid an infinite loop, if a function used to recompile "fn" itself uses "fn"."

-----

3 points by waterhouse 2855 days ago | link

By the way, "what you'd expect" is apparently not what I expected:

  ; SBCL
  * (defmacro plus (&rest args) (cons #'+ args))
  PLUS
  * (plus 1 2 3)
  ; in: PLUS 1
  ;     (#<FUNCTION +> 1 2 3)
  ;
  ; caught ERROR:
  ;   illegal function call
  
  ; CLISP
  [1]> (defmacro plus (&rest args) (cons #'+ args))
  PLUS
  [2]> (plus 1 2 3)
  *** - EVAL: #<SYSTEM-FUNCTION +> is not a function name; try using a symbol instead
  
  ; Racket
  > (eval (list '+ 1 2))
  3
  > (eval (list + 1 2))
  3

-----

3 points by akkartik 2854 days ago | link

Check out John Shutt's thesis on Kernel: https://web.cs.wpi.edu/~jshutt/kernel.html. He takes your observation that quasiquotation is pointless to the limit, dropping it entirely and constructing his macros out of cons and friends.

-----

2 points by waterhouse 2840 days ago | link

Thanks for the pointer. He has an entire chapter on hygiene... And then there is this:

"There is no need to provide quotation here because, having failed to enforce the prohibition against embedding combiners in a macro expansion, we don’t need to embed their unevaluated names in the expansion."

It's nice that his primitive $vau grabs the current lexenv, as this enables another kind of macro-like behavior I've thought might be useful: taking subexpressions, fully macroexpanding them, and doing something with the result (e.g. determining whether a variable is ever used, and omitting a computation if not). I don't know how that would mesh with the interpreter's semantics, though...

I'll probably have to read and brood on this further.

-----