Multi character matching, and ssyntaxes
3 points by shader 5404 days ago | 19 comments
How hard would it be to make the primitives in arc that match a character in a string to match more than one at once or a string?

i.e.:

  arc> (pos "cd" "abcde")
  2
The reason I'm suggesting this is that a) it seems useful and fully backwards compatible, and b) it would allow ssyntaxes to be multi-character, which would also give arc full symbol macros instead of just single-character ones.

Also, while on the topic of ssyntaxes, I was testing some of them in my repl today and

  a!(b c)
  pr."test"
etc. exited the repl, or printed #<eof>. Why is that? I could understand if it said something like "Bad syntax at: pr."test"", but why would it exit? And how hard would it be to get those to work? After all, pr.3 works, so why not pr."3" or pr!(a b)?

I'm sure I don't understand the nuances, but it would be cool if those would work.
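
For illustration only, here is a rough Python model of the behaviour being asked for. It is Python rather than Arc and is not how arc's pos is implemented; the function name simply mirrors the one in the post, and since Python has no separate character type, a single character is just a length-1 string.

```python
def pos(needle, seq):
    """Return the index of the first occurrence of needle in seq, or None.

    needle may be a single character or a longer string; both cases are
    handled by comparing slices of the same length as needle.
    """
    n = len(needle)
    for i in range(len(seq) - n + 1):
        if seq[i:i + n] == needle:
            return i
    return None


print(pos("c", "abcde"))   # 2
print(pos("cd", "abcde"))  # 2
print(pos("xy", "abcde"))  # None
```

Because a one-character needle is just the n = 1 case, existing single-character calls keep returning the same index, which is the backwards compatibility mentioned above.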



2 points by rntz 5404 days ago | link

'posmatch already does this.

    arc> (posmatch "cd" "abcde")
    2
Doing symbol macros by hijacking the ssyntax-expansion process seems hackish, and I'm not sure it would work. Multi-character ssyntaxes could be interesting, though.

In arc3, pr."foo" and a!(b c) complain about bad ssyntax, so that bug at least has been squashed. Making them do "the right thing" would be pretty cool but would require some hacking of the reader. At the moment, ssyntaxes aren't expanded by the reader but by the arc compiler, which compiles symbols with ssyntax in them into special forms. So, for example:

    arc> 'foo.bar
    foo.bar
    arc> '|foo.bar|
    foo.bar
If you wanted to handle ssyntaxes that weren't just special symbols correctly, you'd need to hack on the reader, which would change the above behavior to:

    arc> 'foo.bar
    (foo bar)
    arc> '|foo.bar|
    foo.bar
This might break some code, although probably not any code distributed with arc itself.

-----

2 points by shader 5403 days ago | link

The idea of having symbol macros based on ssyntax was given to me by absz in http://www.arclanguage.org/item?id=9019

It doesn't seem too bad, hackwise, though I think it might have issues with layering new definitions on top of old ones if there isn't a way to remove them from the list.

What code would having that improved reader break?

-----

1 point by rntz 5403 days ago | link

Any code that relied on the ability to represent the symbol whose name is "foo.bar" (or any symbol containing a dot) as 'foo.bar rather than needing to wrap it in pipes as '|foo.bar|.

-----

1 point by absz 5404 days ago | link

Anarki, at least, supports multi-character ssyntaxes, as I committed a .. ssyntax for range. The pos thing is still a good idea.

As you saw, bad ssyntax crashes because empty parts come out as #<eof>; why that happens, I can't tell you. But I can tell you why you can't write things like a!(b c) or pr."test". The reason is that ssyntax (symbol syntax) isn't directly part of the read procedure. Structures like (tab!frob xs.ix) are read in as-is; after this, every symbol is traversed and possibly expanded, producing something like ((tab 'frob) (xs ix)). Thus, a!(b c) is read in as the two objects a! and (b c), and a! is then expanded to (a '#<eof>); similarly, pr."test" is read in as pr. "test", and then becomes (pr #<eof>) "test". Granted, those #<eof>s shouldn't be there (a syntax error should probably be produced instead), but that's what's going on.
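
The two-phase process described above (read first, expand symbols afterwards) can be modelled with a small Python sketch. This is a toy, not arc's reader or compiler: tokenize, expand_ssyntax and the EOF placeholder are hypothetical names, only the . and ! ssyntaxes are handled, and the input to expand_ssyntax is assumed to be a symbol (not a number such as 3.14).

```python
import re

EOF = "#<eof>"  # stand-in for the end-of-file object mentioned in the thread


def tokenize(src):
    """Toy reader: split source into string literals, parens and symbols."""
    return re.findall(r'"[^"]*"|[()]|[^\s()"]+', src)


def expand_ssyntax(sym):
    """Expand an already-read symbol such as 'tab!frob' or 'xs.ix'."""
    parts = re.split(r"([.!])", sym)  # 'a.b!c' -> ['a', '.', 'b', '!', 'c']
    if len(parts) == 1:
        return sym  # no ssyntax in this symbol
    expr = parts[0]
    for sep, name in zip(parts[1::2], parts[2::2]):
        arg = name if name else EOF  # an empty part has nothing to read
        if sep == "!":
            arg = ["quote", arg]
        expr = [expr, arg]
    return expr


print(tokenize('a!(b c)'))         # ['a!', '(', 'b', 'c', ')']
print(tokenize('pr."test"'))       # ['pr.', '"test"']
print(expand_ssyntax("tab!frob"))  # ['tab', ['quote', 'frob']]
print(expand_ssyntax("a!"))        # ['a', ['quote', '#<eof>']]
print(expand_ssyntax("pr."))       # ['pr', '#<eof>']
```

The point of the sketch is that the reader has already split a! from (b c) before any expansion happens, so the expander only ever sees the dangling symbol.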

-----

1 point by shader 5403 days ago | link

So, maybe an arc implementation of the reader is in order? Then we could have arc level implementation of reader macros too, instead of just ssyntaxes and symbol macros.

-----

1 point by conanite 5403 days ago | link

So far, we have lib/arc-read.pack and lib/parser.arc, both in anarki, as well as lib/parsecomb.arc. Not only do you have your reader, but you have a choice of them too!

arc-read seems to be designed for easy extensibility although I haven't tried it at all. (If I had looked carefully before embarking upon parser.arc, I might have saved myself a lot of trouble).

parser.arc doesn't support the full range of scheme numbers, nor does it support |foo| symbol syntax yet, apart from ||. I'm working on a new version that will hopefully be faster - welder depends on the tokeniser for syntax highlighting.

parsecomb.arc, if I understand correctly, isn't an arc parser but a library for building parsers.

In any case, having an implementation of 'read in arc makes a lot of sense, and it's in the todo at the top of arc.arc

-----

2 points by Adlai 5404 days ago | link

Another idea along this line is to enable code like this:

  arc> (let str "Hello world!"
         (= (str "world") "arc")
         str)
  "Hello arc!"

-----

1 point by bogomipz 5403 days ago | link

Nice idea! This would eliminate both pos and posmatch.

Call the string with an index to get the character at that position, call it with a character or string to find the index.

Although, given your example, it looks like (str "world") should return a range.

-----

1 point by shader 5403 days ago | link

That doesn't sound too hard to do. Assignment and position on strings are handled by string-set! and string-ref. If those were modified to accept a string as input instead of just a numerical index, then Adlai's code would work.

Maybe we should just make two scheme functions str-set! and str-ref and use those instead, as opposed to over-writing the original functions.

This sounds like a good spot for the redef macro ;)

Anyway, because position matching and assignment are handled separately, (= (str "world") "foo") could still work even without (str "world") returning a range.

-----

1 point by bogomipz 5403 days ago | link

Yes, there just seems to be a dilemma of whether (str "world") should return an index or a range. If Arc had multiple return values, it could return the start and end indices, and a client that only uses the start index would just ignore the second value :)

-----

2 points by Adlai 5403 days ago | link

The return value should correspond to what was being searched for.

In other words, searching for one character should return an index, while searching for a substring should return a range.

There are thus four operations which would ideally be possible through ("abc" x):

  arc> (= str "hello arc!")
  "hello arc!"
  arc> (str "arc")
  6..8     ; or some other way of representing a range
  arc> (str #\!)
  9
  arc> (str 5)
  #\space
  arc> (str 4..7)   ; same as previous comment
  "o ar"
A way to take advantage of multiple values, if they were available, could be something like this:

  arc> (str #\l)
  2
  3
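
As a rough model of the four operations listed above, here is a Python sketch of a callable string wrapper. It is only an illustration of the proposed semantics, not Arc code: the class name Str is hypothetical, a Python range stands in for the 6..8 notation (so range(6, 9) means indices 6 through 8), and a length-1 string plays the role of a character because Python has no separate character type.

```python
class Str:
    """Wrap a string so that calling it dispatches on the argument type."""

    def __init__(self, text):
        self.text = text

    def __call__(self, x):
        if isinstance(x, int):
            return self.text[x]                # index -> character
        if isinstance(x, range):
            return self.text[x.start:x.stop]   # range -> substring (step 1 assumed)
        if isinstance(x, str):
            i = self.text.find(x)
            if i < 0:
                return None                    # not found
            if len(x) == 1:
                return i                       # character -> index
            return range(i, i + len(x))        # substring -> range
        raise TypeError("unsupported argument: %r" % (x,))


s = Str("hello arc!")
print(s("arc"))        # range(6, 9)
print(s("!"))          # 9
print(s(5) == " ")     # True
print(s(range(4, 8)))  # o ar
```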

-----

1 point by conanite 5402 days ago | link

Just curious - wouldn't it suffice to return the index of the beginning of the matched string when running a substring search?

  arc> (str "arc")
  6
After all, you already know the length of "arc", so you don't really need a range result.

Otherwise, these are great ways to occupy the "semantic space" of string in fn position.

-----

1 point by shader 5402 days ago | link

I agree with you. I don't think that returning a range is necessary.

Even if call position and assignment weren't handled separately, it would still be possible to work off of the length of the argument and the index, without needing a range.

The question is whether or not pg agrees with us enough to add it to arc3 ;)

-----

1 point by conanite 5402 days ago | link

If there are 100 arc coders in the world, there are probably 100 versions of arc also. The question is whether you want it in your arc :)

-----

1 point by shader 5402 days ago | link

True. And I do. Unfortunately, I'm busy working on several other things at once right now. If you want to start working on it, be my guest. Hopefully I'll be able to share what I've been doing soon.

-----

1 point by Adlai 5402 days ago | link

I guess that (str "world") could just return an index, because (= (str "world") "arc") has access to the entire call, and can thus calculate

  (+ (str "world") (len "world"))
to figure out what the tail of the string should be after a substitution.
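
A minimal Python sketch of that calculation, assuming only the first match is replaced. The function name replace_first is hypothetical, and because Python strings are immutable the sketch returns a new string instead of modifying the original in place.

```python
def replace_first(text, old, new):
    """Replace the first occurrence of old in text using only index + length."""
    start = text.find(old)
    if start < 0:
        return text                  # nothing to substitute
    tail_start = start + len(old)    # same as (+ (str "world") (len "world"))
    return text[:start] + new + text[tail_start:]


print(replace_first("Hello world!", "world", "arc"))  # Hello arc!
```

The start index plus the length of the search string is all that is needed to find the tail, so no range value has to be returned by the lookup.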

-----

1 point by shader 5403 days ago | link

Well, scheme supports multiple values, so it shouldn't be too hard to get them into arc, right?

-----

1 point by conanite 5402 days ago | link

arc supports multiple values via destructuring

  (let (a b c) (fn-returning-list-of-3-things)
    ...)
In the particular case of returning multiple indexes into a string though, you don't usually know in advance how many matches there will be, so destructuring isn't an option.
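
The same distinction shows up in a Python sketch (an analogy only, with hypothetical function names): unpacking works when the number of results is fixed, while a search that may match any number of times is easier to consume as a list.

```python
def three_things():
    """Always returns exactly three values, so unpacking is safe."""
    return ["a", "b", "c"]


first, second, third = three_things()


def all_positions(ch, text):
    """Return every index at which ch occurs; the count is not known ahead of time."""
    return [i for i, c in enumerate(text) if c == ch]


print(all_positions("l", "hello arc!"))  # [2, 3]
print(all_positions("z", "hello arc!"))  # []
```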

-----

1 point by Adlai 5402 days ago | link

Multiple return values from a form are allocated on the stack, not on the heap. I don't 100% understand what that means, though...

One practical consequence is that you don't have to deal with later multiple values if you don't want to, but when values are returned as a list, you have to deal with them.

-----