Unicode strings - an opportunity?

by Colin Adams (modified: 2007 Dec 16)

It seems to me that STRING_GENERAL isn't worth having - just take a look at some of the implementations, and you will see they are marked for 8-bit only.

I think it would be better to take the opportunity to abandon read-write strings altogether (confining STRING_8 to "legacy") and make STRING_32 unconnected with STRING_8.

Also, substring would then not need to take a copy, and so could be much faster. Indeed, we could consider wasting the initial byte, and so eliminating the cost of translating from 1-based addressing to 0-based addressing (1 byte wasted would not be very significant when every character consumes 4 bytes - I am assuming UTF-32 for the implementation of STRING_32).
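
To make the substring idea concrete, here is a minimal sketch, written in Python rather than Eiffel so that it can be run as-is. The class ReadOnlyString32 and its features item, substring and count are hypothetical names chosen for illustration; they are not part of EiffelBase. The sketch assumes one storage slot per code point, with slot 0 deliberately left unused.

```python
class ReadOnlyString32:
    """Hypothetical read-only string of code points; not an EiffelBase class."""

    __slots__ = ("_area", "_offset", "_count")

    def __init__(self, text=""):
        # Slot 0 is deliberately wasted, so character i (1-based) of a
        # freshly built string lives at _area[i] with no "- 1" adjustment.
        self._area = (0,) + tuple(ord(c) for c in text)
        self._offset = 0
        self._count = len(text)

    @property
    def count(self):
        return self._count

    def item(self, i):
        """Character at 1-based position i."""
        if not 1 <= i <= self._count:
            raise IndexError("index out of range")
        return chr(self._area[self._offset + i])

    def substring(self, start, end):
        """Characters start..end (1-based, inclusive), sharing storage."""
        if not (1 <= start <= end + 1 and end <= self._count):
            raise IndexError("substring bounds out of range")
        view = ReadOnlyString32.__new__(ReadOnlyString32)
        view._area = self._area  # shared; safe because nothing mutates it
        view._offset = self._offset + start - 1
        view._count = end - start + 1
        return view

    def __str__(self):
        first = self._offset + 1
        return "".join(map(chr, self._area[first:first + self._count]))


s = ReadOnlyString32("Unicode strings")
tail = s.substring(9, 15)
print(tail)                    # strings
print(tail._area is s._area)   # True
```

Taking a substring here only records a new offset and count, so its cost does not depend on the length of the text. One consequence of this design is that a short substring keeps the whole original area alive for as long as the substring itself is referenced.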

Comments
  • Peter Gummer (16 years ago 17/12/2007)

    STRING_32 read-only

    I think that's a great idea, Colin.

    But what about the fact that this would make STRING_32 an oddity in Eiffel's type system? All other reference types are writeable. Would Eiffel need a readonly keyword?

    Also, just to be sure that I understand your point about substring being more efficient, do you mean that each new substring object would be implemented by indexing into the area of the old string? Cunning!

    • Colin Adams (16 years ago 17/12/2007)

      Not all other reference types are writable, only those that provide mutating features.

      Yes, you understand my point about substring correctly, but describing it as cunning is a bit much.

      Colin Adams

  • Manu (16 years ago 20/12/2007)

    I think I've mentioned it quite a few times: STRING_GENERAL is indeed not worth it once there is no more legacy code using STRING_8. It was only created to offer a smooth transition path for those who were using STRING_8 (thus the many restrictions on the string containing only characters that can fit into a STRING_8 instance). So it is just a matter of time until it becomes obsolete.

    For substring, there is nothing that you cannot do with today's implementation. If we add a boolean flag to say whether a STRING object has changed or not, then we can easily implement substring the way you describe it. It could be called `aliased_substring'.

    For the starting index being 0 instead of 1, we can easily do it, but it would break some existing code using `area' directly instead of `to_c'. However, I'm not sure if it makes sense, as most of the operations in class STRING are already using 0-based indexing for efficiency. So here you would only optimize client code, but the drawback is that indexing from 0 is always messy, especially when the rest always starts at 1.

  • Peter Gummer (16 years ago 22/12/2007)

    Read-only strings

    Ignoring for the moment Colin's suggested optimisations, I think the most interesting suggestion he made was to make STRING_32 read-only. I'm convinced that read-write strings are a maintenance problem in many ways, not the least of which is the complexity they add to the interface of the STRING class. I agree with Colin: Eiffel should grab this opportunity to abolish read-write strings. I'd like to see what the interface of STRING_32 looks like without any commands!