Author Topic: Support of Unicode ?  (Read 290 times)

acbaile

  • Jr. Member
  • **
  • Posts: 89
    • View Profile
Support of Unicode ?
« on: September 22, 2018, 05:00:01 AM »
    I don't see keywords for Unicode symbols and strings. How do I use them?

    So many encodings exist, and so many difficulties and problems with them: surrogate pairs in UTF-16, so many service bits in UTF-8, and for what?

    Why not have one simple, unique encoding for it?


**************************************************************************************

    In my project I made the encoding as simple as possible, and extensible without limit. Every byte carries 7 payload bits, in network byte order. Every non-last byte of a symbol's sequence is marked by its highest bit set to 1; the last byte is marked by 0.


    For US-ASCII it is identical to UTF-8, but compared with UTF-8 it is more economical even for a two-byte symbol: it carries 14 bits of payload where UTF-8 carries only 11.
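The scheme described above can be sketched in C. This is a reconstruction from the description in this post, not the attached source; the function name sueuf_encode is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the encoding described above: 7 payload bits per byte,
 * most significant group first (network byte order), highest bit set
 * to 1 on every byte except the last one of the sequence. */
static size_t sueuf_encode(uint32_t cp, uint8_t out[5]) {
    uint8_t groups[5];
    size_t n = 0;
    do {                          /* split into 7-bit groups, low group first */
        groups[n++] = (uint8_t)(cp & 0x7F);
        cp >>= 7;
    } while (cp != 0);
    for (size_t i = 0; i < n; i++) {
        uint8_t b = groups[n - 1 - i];          /* reverse: big-endian order */
        out[i] = (i + 1 < n) ? (b | 0x80) : b;  /* mark non-last bytes */
    }
    return n;                                   /* bytes written */
}
```

For 'A' (U+0041) this emits the single byte 0x41, identical to ASCII; U+20AC fits in two bytes (14 payload bits), where UTF-8 needs three.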

    Source code (public domain) is attached to the post (I update it from time to time).

    I called it "SUEUF - Simple Unique Extensible Unicode Format". [SUEUF] has the sound of relief  :D .
« Last Edit: November 20, 2018, 01:57:50 PM by acbaile »

bas

  • Full Member
  • ***
  • Posts: 209
    • View Profile
Re: Support of Unicode ?
« Reply #1 on: September 24, 2018, 12:34:08 PM »
Support for Unicode is simply not in yet. Ideally it would unify all the different approaches used in C code.
But the discussion on what to include (e.g. only UTF-8, etc.) is a tricky one that has no neat solution, I fear.
UTF is not in yet because C2's goal domain is embedded drivers/kernels/operating systems, etc. There UTF
is not as important as it is in a generic systems language.

But I haven't really looked into a proper UTF solution yet. On the web, UTF-8 is definitely dominating.

acbaile

  • Jr. Member
  • **
  • Posts: 89
    • View Profile
Re: Support of Unicode ?
« Reply #2 on: September 25, 2018, 03:04:44 AM »
UTF is not in yet, because C2's goal domain is embedded drivers/kernels/operating systems, etc.
There UTF is not as important as it is in a generic systems language.

I hope it will be widely used as a general-purpose language. Developers should like it. So this is important for the future.

But the discussion on what to include (e.g. only UTF-8, etc.) is a tricky one that has no neat solution, I fear.

Yes, it's a difficult question. There are a lot of opinions, and nobody knows what is better. Difficulties are a source of errors. I hope one of the encoding standards becomes dominant (ideally the only one), to avoid these problems. It could be UTF-8, but there are questions about it.

But I haven't really looked into a proper UTF solution yet. On the web, UTF-8 is definitely dominating.

    The reason UTF-8 dominates the web is the small size of ASCII text in this encoding: it saves traffic. Unicode codes up to 0x7F (US-ASCII) are encoded in 1 byte; UTF-8 and US-ASCII are identical in this range. Higher codes take two or more bytes. Most traffic on the web is ASCII, so this saves space (compared with UTF-16, which always uses 2 bytes per symbol, or surrogate pairs at 4 bytes per symbol).

    What I cannot understand is why UTF-8 uses so many service bits. For a two-byte symbol it uses 5 service bits, when 2 bits are enough to carry this information. So my question is: why this redundancy, when the web needs to minimize volume?
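To make the bit counts concrete, here is a sketch encoding one two-byte symbol both ways; the 7-bit scheme is the one from the first post, and both helper names are illustrative:

```c
#include <stdint.h>

/* UTF-8 two-byte form, 110xxxxx 10xxxxxx: 5 marker bits, 11 payload bits. */
static void utf8_two_bytes(uint32_t cp, uint8_t out[2]) {
    out[0] = (uint8_t)(0xC0 | (cp >> 6));
    out[1] = (uint8_t)(0x80 | (cp & 0x3F));
}

/* The 7-bit scheme, 1xxxxxxx 0xxxxxxx: 2 marker bits, 14 payload bits. */
static void seven_bit_two_bytes(uint32_t cp, uint8_t out[2]) {
    out[0] = (uint8_t)(0x80 | (cp >> 7));
    out[1] = (uint8_t)(cp & 0x7F);
}
```

For 'é' (U+00E9), UTF-8 gives C3 A9 and the 7-bit scheme gives 81 69: the same character, but the extra payload means two 7-bit bytes can reach U+3FFF, while two UTF-8 bytes stop at U+07FF.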

    The encoding I proposed is simpler and more economical.
« Last Edit: November 18, 2018, 04:27:30 AM by acbaile »

acbaile

  • Jr. Member
  • **
  • Posts: 89
    • View Profile
Re: Support of Unicode ?
« Reply #3 on: September 25, 2018, 03:31:35 AM »
In essence, the task is simple. The Unicode consortium standardizes the codes of symbols: integer numbers, independent of any representation. They are unique.

There are many encoding representations :) . Why? What are these difficulties for? It looks like difficulty for difficulty's sake. A genesis of problems.

bas

  • Full Member
  • ***
  • Posts: 209
    • View Profile
Re: Support of Unicode ?
« Reply #4 on: September 28, 2018, 03:48:59 PM »
Lol.. The basic scenario of the actual encoding is indeed a standard. But how to integrate it into a language is another thing.
A programming language is a bag of choices: make it easy for some but harder for other developers; force something on
everyone or allow multiple solutions; there's never an actual best solution for everyone. With UTF (and other encodings)
the most important thing, I think, is how it interacts with the common C string (i.e. const char*). So how do you go from one to
the other, and so on. Also, how do you define string constants that are UTF-8? Maybe like u"My UTF-8 String" (for example).

acbaile

  • Jr. Member
  • **
  • Posts: 89
    • View Profile
Re: Support of Unicode ?
« Reply #5 on: September 29, 2018, 04:19:20 AM »
Lol.. The basic scenario of the actual encoding is indeed a standard.

Why LOL? That's not nice. It seems I am still suspected of being a covert C++ agent, of bringing destructive ideas into your project.

C2 is not a standard either. The idea comes first. Of course you will not include anything that is not already a dominant standard; I am talking about the idea.

P.S. Everybody makes mistakes at first. I am not afraid of my mistakes, because when I am afraid I cannot move forward. I listen to my errors calmly; we learn from our errors. But laughing is not the best way to tell somebody about his error. It is not courteous.
« Last Edit: September 29, 2018, 10:49:50 AM by acbaile »

acbaile

  • Jr. Member
  • **
  • Posts: 89
    • View Profile
Re: Support of Unicode ?
« Reply #6 on: September 29, 2018, 05:04:24 AM »
A programming language is a bag of choices: make it easy for some but harder for other developers; force something on
everyone or allow multiple solutions; there's never an actual best solution for everyone.

I advise you to target a general-purpose language. We need this simplicity.

acbaile

  • Jr. Member
  • **
  • Posts: 89
    • View Profile
Re: Support of Unicode ?
« Reply #7 on: September 29, 2018, 05:26:27 AM »
With UTF (and other encodings) the most important thing, I think, is how it interacts with the common C string (i.e. const char*). So how do you go from one to the other, and so on.

Library conversion. Or could it be something else?

In the case of UTF-8, a (char *) containing only US-ASCII (codes up to 0x7F) can be converted to a (utf_8 *) by a simple type-cast.

Other encodings? Do embedded systems really have a lot of non-standard encodings?
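The cast mentioned above is only safe when the string really is pure ASCII. A minimal checked version, where the utf_8 typedef and the function name are illustrative only:

```c
#include <stdint.h>

typedef uint8_t utf_8; /* hypothetical element type for UTF-8 data */

/* An ASCII-only C string is already valid UTF-8, so the "conversion"
 * is just a cast once every byte is verified to be <= 0x7F. */
static const utf_8 *ascii_as_utf8(const char *s) {
    for (const char *p = s; *p != '\0'; p++) {
        if ((unsigned char)*p > 0x7F) {
            return 0; /* not pure ASCII: a real conversion is needed */
        }
    }
    return (const utf_8 *)s;
}
```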

Also how do you define string constants that are UTF-8? maybe like u"My UTF-8 String" (for example)

The Microsoft compiler uses something similar:

"string" - char *
L"string" - wchar_t * - UTF-16 LE (main unicode format in Microsoft products)
u8"string" - UTF-8
U"string" - UTF-32
etc

This is not a recommendation (just to prevent your suspicions), only an example. The keyword "auto" is a bad keyword there.

https://msdn.microsoft.com/en-us/library/69ze775t.aspx

"u" is a good choice.
« Last Edit: September 29, 2018, 05:30:49 AM by acbaile »

acbaile

  • Jr. Member
  • **
  • Posts: 89
    • View Profile
Re: Support of Unicode ?
« Reply #8 on: September 29, 2018, 11:20:47 AM »
I advise you to have target a common purpose language. We need this simplicity.

Whether you want it or not, it will be.

lerno

  • Full Member
  • ***
  • Posts: 218
    • View Profile
Re: Support of Unicode ?
« Reply #9 on: October 17, 2018, 11:26:54 PM »
Most languages go for UTF-8. Everything else is standardizing on UTF-8 as well. Since (forward) string search is easy in UTF-8, you get about 95% of everything you want with UTF-8.

For any text handling that needs more optimized treatment, simply offer a conversion to other formats as a byte array. C libraries can handle everything else.
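The "easy forward search" point rests on UTF-8 being self-synchronizing: continuation bytes always match the pattern 10xxxxxx, so you can jump to the next code point without decoding anything. A minimal sketch (the function name is illustrative):

```c
/* Advance to the start of the next code point in a UTF-8 string.
 * Continuation bytes have the bit pattern 10xxxxxx; skip past them.
 * Stops at the terminating '\0' as well, since 0 does not match. */
static const char *utf8_next(const char *p) {
    p++;
    while ((*p & 0xC0) == 0x80) {
        p++;
    }
    return p;
}
```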

lerno

  • Full Member
  • ***
  • Posts: 218
    • View Profile
Re: Support of Unicode ?
« Reply #10 on: October 29, 2018, 08:32:42 AM »
Personally, I think we should have first class strings in the language, and then C style strings are created using c"Hello World".

acbaile

  • Jr. Member
  • **
  • Posts: 89
    • View Profile
Re: Support of Unicode ?
« Reply #11 on: November 18, 2018, 03:20:51 AM »
... how do you define string constants that are UTF-8? maybe like u"My UTF-8 String" (for example)

    Maybe introduce distinct descriptive types for the different encodings, for certainty in naming? For example:

    utf8 * a;

    A UTF-8 symbol has variable length (1-4 bytes; the original design allowed up to 6). What do we do with sizeof() for this type? How do we allocate space for one symbol? Can it exist only behind a pointer?

    In the beginning that will be sufficient. But later, when C2 is used widely, it will be possible to introduce the following types in the same naming scheme:

    utf16le *   ms_wchar_t; /* Microsoft wchar_t */
    utf32le * obsd_wchar_t; /* OpenBSD (Intel, little-endian) wchar_t */

    This will give certainty, unambiguity, and a clear understanding of the code. When a programmer sees a function declaration:

    bool concatenate(utf16le * destination, const reguint length, const utf16le * source);

    he will know exactly which string arguments the function takes. For different encodings, different functions (that is not a problem for a library).

*****************************************************************************************************

    How do we define string literals in code? In this naming scheme it looks a little bit cumbersome:

    utf16le * string = utf16le"some text";
    utf16le * string = (utf16le *)"some text";

    And, for the most used formats, there can be a short one-letter form:

    utf8 * string = u"some text";

*****************************************************************************************************

    If the SUEUF format is accepted, it could be written like:

    sueuf * string = sueuf"some text";
    sueuf * string = e"some text"; /* I prefer "e" instead of "s", because "s" sounds like [ass] :) */

    In the letter "e" you can see a laughing face ;)

acbaile

  • Jr. Member
  • **
  • Posts: 89
    • View Profile
Re: Support of Unicode ?
« Reply #12 on: November 18, 2018, 04:58:29 AM »
    UTF-8 stores the "length of sequence" information in the first byte, so when searching you can simply jump to the next symbol. But extracting that length needs additional instructions. The field has variable length, so you need a bit-search instruction (if the target processor has one).

    Octet count - significant bits - pattern:

    1 octet  -  7 bits - 0xxxxxxx
    2 octets - 11 bits - 110xxxxx 10xxxxxx
    3 octets - 16 bits - 1110xxxx 10xxxxxx 10xxxxxx
    4 octets - 21 bits - 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

    This adds complexity to the algorithm. In my code I search a SUEUF string with a simple memory search, followed by a check of whether the match lies on a symbol border. I have no concrete measurements (yet), but I suspect that UTF-8 gives no advantage in performance.
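The length extraction described above can be sketched as counting the leading 1-bits of the lead byte. This assumes the GCC/Clang __builtin_clz intrinsic as the "bit-search instruction"; the function name is illustrative, and continuation bytes are not valid lead bytes here:

```c
#include <stdint.h>

/* Sequence length (in bytes) from a UTF-8 lead byte: the number of
 * leading 1-bits, or 1 for a plain ASCII byte. */
static int utf8_seq_len(uint8_t lead) {
    if (lead < 0x80) {
        return 1; /* 0xxxxxxx: single-byte symbol */
    }
    /* Shift the byte to the top of a 32-bit word, invert it, and count
     * leading zeros: this counts the leading ones of the lead byte. */
    return __builtin_clz(~((uint32_t)lead << 24));
}
```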

    So it seems lerno's argument about convenient search does not hold. I see only complexity.

    I updated the starting post of this topic: http://www.c2lang.org/forum/index.php?topic=84.msg261#msg261
« Last Edit: November 20, 2018, 01:50:39 PM by acbaile »

lerno

  • Full Member
  • ***
  • Posts: 218
    • View Profile
Re: Support of Unicode ?
« Reply #13 on: November 18, 2018, 04:14:46 PM »
Swift recently switched to an internal representation of UTF-8 and reported that it turned out to be faster than the alternatives, even for languages where UTF-16/32 is considered "better".

https://forums.swift.org/t/string-s-abi-and-utf-8/17676

bas

  • Full Member
  • ***
  • Posts: 209
    • View Profile
Re: Support of Unicode ?
« Reply #14 on: November 30, 2018, 01:42:52 PM »
In all my years of programming, I've only had to do i18n twice. Both were C++ projects; in C, never. In C a basic string is just a char* with 8-bit characters. That works for 99.9% of the cases. What would be nice is if we could add special strings as an optional feature: if not used, no overhead, etc.