Arc Forum | Does Arc need the character type?

8 points by cchooper 6329 days ago | 12 comments

Arc currently has a character type and a string type, with some functions working on only one or the other. For example:

  (string '(#\a #\b))
  => "ab"

but

  (string '("a" "b"))
  => ERROR!

Instead you have to write:

  (apply string '("a" "b"))

You could patch string to work with strings, but the deeper question is: couldn't Arc just get rid of the character type, and use strings everywhere? Character types make sense in languages like C, where "a" takes up one more byte than 'a', but in a higher-level programming language, do we really need to make that distinction? What benefit do characters give you?

Getting rid of characters would simplify the language and make programs shorter. You'd no longer need to convert characters into strings or vice versa. I think it would be a simple change too (although I'm not going to code it as it would break Anarki's compatibility with Arc). Just make every function that returns a character return a string of length one instead:

  ("bar" 1)
  => "a"

Every function that takes a character argument can take a string, and then drop all but the first character or try and do something intelligent with the whole string:

  (string '("f" "oo"))
  => "foo"
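
A minimal sketch of that behaviour, written in Python rather than Arc so it can be run as-is. The name `arc_string` is hypothetical and only mimics the idea; it is not how Arc or Anarki actually implement `string`.

```python
def arc_string(*args):
    """Concatenate every argument into one string, flattening nested lists.

    Hypothetical stand-in for a patched Arc `string`: with no character
    type, a "character" is simply a string of length one.
    """
    parts = []
    for arg in args:
        if isinstance(arg, (list, tuple)):
            # Recurse so that nested lists of strings are flattened too.
            parts.append(arc_string(*arg))
        else:
            parts.append(str(arg))
    return "".join(parts)


print(arc_string(["a", "b"]))   # list of length-one strings
print(arc_string(["f", "oo"]))  # mixed lengths need no special case
print(arc_string("f", "oo"))    # same result as the `apply` form
```

With this shape, the list form and the `apply` form give the same answer, so the caller never has to care whether an element is one character long or many.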

Are characters an onion in the varnish, or do they serve a useful purpose? I can't think of one, can anyone else?

P.S. Characters would still have to be used under-the-hood to implement strings. The question is whether they need to be exposed to the user, or whether they're just an implementation detail.

P.P.S. Might I point out that Lisp doesn't even have a true boolean type. Nil and T are symbols, but have served perfectly well for 40 years as booleans too. Are characters really more essential than booleans?

7 points by sross 6328 days ago | link

Lisp (or at least Common Lisp) does have a very well defined `true` boolean type.

   * (typep t 'boolean) => t
   * (typep nil 'boolean) => t
   * (typep 3 'boolean) => nil

The fact that t and nil are also symbols (and nil the empty list) is irrelevant.

As for the character issue, the idea is a nice one and works well for graphic characters, but issues start arising when you look outside that particular set, to characters like #\Null, #\Return and #\Newline. Hairier still are characters like #\Control-\x or #\Meta-\m (not portable, btw), which have no reasonable representation as a single-character string.

Personally I'm all for a distinct data type which allows extra information to be attached to it but within a language which uniformly allows singleton strings to designate characters and vice-versa.

-----

4 points by cchooper 6328 days ago | link

The CL boolean type is kind of how I'd like characters to work. Characters would be characters, but they'd also be length one strings, same as how T is a boolean, but it's also a symbol.

I agree that special characters require a different representation (using "\n" and so on all the time is not nice). Perhaps though they could still be strings, but just written with a different syntax?

-----

5 points by sacado 6328 days ago | link

Yep. As '(), () and nil are equivalent, maybe #\x and "x" should be equivalent too (and that would work with #\null too : it is the only way to read it (I think), but it would be used as a 1-char-long string).

-----

3 points by sacado 6328 days ago | link

At first sight, I totally agree with you. Characters seem even less useful since, Arc strings being Unicode, you can't use characters to manipulate raw bytes of data.

-----

3 points by Jesin 6328 days ago | link

Yes, the choice to have a separate character type surprised me. Just do it the way Python does it (or the way Python 3 is going to do it).
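
For reference, a short sketch of what "the way Python does it" looks like in practice: there is no separate character type, and indexing or iterating a string produces length-one strings.

```python
s = "bar"

ch = s[1]  # indexing yields a str, not a separate char type
print(type(ch).__name__, repr(ch))

# Iterating over a string yields length-one strings as well.
print([c for c in s])

# A length-one string concatenates like any other string.
print(ch + "bc")
```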

-----

1 point by are 6328 days ago | link

I'm no expert, but with Unicode strings, doesn't the "character" abstraction break down anyway, when a "character" sometimes needs to have multiple glyphs? With "characters" removed, wouldn't a multi-glyph "character" just be a length 1+ string also? Or am I just talking out of my arse here?

-----

4 points by sacado 6328 days ago | link

If I'm right, there are many concepts behind Unicode :

A Unicode string is a sequence of codepoints (or characters).

A codepoint (or character) is a numeric id that can be represented in many ways (UTF-32 : always 4 bytes ; UTF-8 : only one byte for codepoints < 128 ; from 2 to 4 bytes for codepoints >= 128 ; ...).

A glyph is what you display on the screen : a Unicode string is displayed as a sequence of glyphs. A glyph can be a single character, or the combination of 2 or more characters.

e.g., 10 glyphs on the screen can be represented as 11 characters (the last two being composed into a single glyph). Depending on the underlying "physical" encoding, these 11 characters can occupy 44 bytes (UTF-32) (with O(1) access to substrings) or, say, 25 bytes (UTF-8) (with O(n) access to substrings), or ...
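
The arithmetic can be checked with a small Python sketch. The sample text is an assumption chosen for illustration: nine ASCII letters followed by "e" plus a combining accent, which gives 10 glyphs from 11 code points.

```python
# 10 glyphs on screen, 11 code points: the final "e" is followed by
# U+0301 COMBINING ACUTE ACCENT, and the pair renders as one glyph.
text = "abcdefghi" + "e\u0301"

code_points = len(text)                      # Python strings index by code point
utf32_bytes = len(text.encode("utf-32-le"))  # fixed 4 bytes per code point, no BOM
utf8_bytes = len(text.encode("utf-8"))       # 1 byte per ASCII code point, 2 for U+0301

print(code_points, utf32_bytes, utf8_bytes)
```

The UTF-32 size is always four times the code point count, while the UTF-8 size depends on which code points appear. This mostly-ASCII sample is much smaller in UTF-8 than text made of non-Latin characters would be.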

In a few words : characters have a meaning in Unicode, but they don't match well with bytes (or even with the physical representation) and, sometimes, with the way things are represented on the screen.

Correct me if I'm wrong.

-----

2 points by almkglor 6328 days ago | link

As far as I know a "glyph" has a one-to-one mapping to a character, where "glyph" means the on-screen symbol used to represent the character (not sure whether there exist multi-glyph single characters - although I do think that there are characters which when in some sequence end up being displayed in one glyph, even though they are logically separate characters).

Or do you really mean "octet" or byte, of which several are regularly used to represent a single character during a unicode transmission? In such a case.... define "string". Is a "string" a sequence of bytes, or a sequence of characters?

-----

3 points by olavk 6325 days ago | link

I believe that e.g. accented characters like é are implemented as a single glyph in fonts, but are composed of two unicode code points: the base character (e) and a modifier character (´).

This is complicated by the fact that Unicode also supports the combined character as a separate single code point, for backwards compatibility with legacy character sets. However, the decomposed (normalized) form is the recommended one.
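
The two representations can be inspected with Python's standard `unicodedata` module. This is only an illustrative sketch; note that the combining accent is U+0301, a different code point from the standalone ´ character.

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # base letter followed by a combining mark

for label, s in (("composed", composed), ("decomposed", decomposed)):
    names = [unicodedata.name(c) for c in s]
    print(label, len(s), names)

# A non-zero combining class marks a code point that attaches to the
# preceding base character instead of standing on its own.
print(unicodedata.combining("\u0301"))
```

Both strings display as the same glyph, yet they differ in length and compare as unequal code point by code point.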

-----

1 point by almkglor 6324 days ago | link

True. A bit of research also suggests that it would be better for both forms to be considered "equal" when comparing individual characters.
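
One way to get that kind of equality is to normalize both sides before comparing. A minimal Python sketch, where `same_text` is a hypothetical helper name and the choice of NFC as the default form is an assumption:

```python
import unicodedata


def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "\u00e9"     # é as one code point
decomposed = "e\u0301"  # e followed by U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)           # raw code point comparison
print(same_text(composed, decomposed))  # comparison after normalization
```

Either NFC or NFD works for this purpose, as long as both operands are normalized to the same form.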

-----

1 point by are 6324 days ago | link

> Or do you really mean "octet" or byte, of which several are regularly used to represent a single character during a unicode transmission? In such a case.... define "string". Is a "string" a sequence of bytes, or a sequence of characters?

I know I shouldn't have dipped my ignorant toe into Unicode waters :-)

Maybe a better question would be: If Arc got rid of the character datatype by collapsing strings-and-characters into strings-and-substrings, could you leave "how to represent a string" (chars vs. octets vs. bytes vs. code points vs. glyphs) out of the language spec altogether? Or would such a "clean" string abstraction conflict with having Unicode support (since Unicode is deeply encoding-specific)?

-----

1 point by olavk 6325 days ago | link

You need some way to get at the numerical code-point value of a character, to be able to implement string library functions like casing. You don't need a separate char data type though: the function could operate on a string and just return the code point of a character (given by index) as an int.
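
A rough sketch of that interface in Python, assuming hypothetical helper names `code_point` and `upcase_ascii`. Only ASCII letters are handled, to keep the casing logic visible.

```python
def code_point(s, i):
    """Return the code point of the character at index i as an int."""
    return ord(s[i])


def upcase_ascii(s):
    """Uppercase ASCII letters only, working purely on ints and strings."""
    out = []
    for i in range(len(s)):
        cp = code_point(s, i)
        if ord("a") <= cp <= ord("z"):
            cp -= 32  # 'a'..'z' sit 32 above 'A'..'Z' in ASCII
        out.append(chr(cp))
    return "".join(out)


print(code_point("bar", 1))
print(upcase_ascii("foo bar"))
```

No character type appears anywhere: the library function moves between strings and ints, and builds its result from length-one strings.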

-----