C2 forum

General Category => Ideas => Topic started by: chqrlie on August 30, 2015, 12:27:27 AM

Title: Unsigned char
Post by: chqrlie on August 30, 2015, 12:27:27 AM
Hello everyone,

I'm very pleased with Bas's initiative to simplify and extend the C language, fixing some of its shortcomings while keeping its spirit.

Choosing fixed-size signed and unsigned integers and IEEE floating-point types as the base types is quite pragmatic.  Interestingly, it does not prevent users from defining some of the standard C type names such as float, double, int, short and long as aliases.

Regarding the char type, I believe it should be an alias for uint8 instead of int8 as proposed currently. 

I completely agree that char should be an 8 bit type, but making it signed is a big source of problems.

Most current C compilers offer a switch to select between signed and unsigned char for the naked char type, but sadly default to signed char.  This choice is probably motivated by compatibility with historic code, but it is inconsistent with the rest of the C specification:

getc() for instance returns an int with a value set of 0..UCHAR_MAX plus EOF: comparing the return value of getc() to a char variable or even a character literal will fail for non-ASCII characters and might even mistakenly match EOF.

The macros defined in <ctype.h> also take an int with the same set of values: passing a char to these macros is incorrect for non-ASCII values.  The glibc implementation performs dirty tricks to prevent erratic behaviour by using tables with 384 entries where 257 would suffice.

Making char unsigned seems the only consistent choice for Unicode too: code points are positive as well, in the range 0..0x10FFFF.

Conversely, I fail to see any advantage in making char signed by default.  Historic code is not an issue, and programmers should only use the char type for actual character strings, and uint8 or int8 for variables and arrays of bytes.  int8 and uint8 are standard types in C2, so char need not be the default choice for raw bytes.

Lastly, I think char*, int8* and uint8* should be incompatible pointer types.  I'm not sure if that's the case with the current specification.
Title: Re: Unsigned char
Post by: bas on August 30, 2015, 11:43:05 AM

Good arguments! Can't argue with that ;)
I'll change the char to uint8.

Currently char is equal to int8 (and will become equal to uint8), so char* and uint8* will be completely
equal then (just different syntax). Do you mean you want to make char* and uint8* incompatible,
or char* and int8*? (after changing char to unsigned)
Title: Re: Unsigned char
Post by: DerSaidin on August 30, 2015, 02:59:58 PM
I agree with char being uint8.


Should they be compatible just because they happen to be the same size?
I think there is a reasonable argument to separate a number from a character.

I think it doesn't matter too much because explicit casting is still available if that is really the intention.
Title: Re: Unsigned char
Post by: chqrlie on August 30, 2015, 10:56:06 PM
Of course int8*, uint8* and int* are all different types and coercing one into another should require an explicit cast.

With the current proposal, char being an alias to uint8, char* and uint8* are really the same type.  This comes as no surprise to C programmers used to the typedef semantics.  Yet it would be nice to have a way to prevent this in some cases, while preserving the scalar nature of these types.

My point boils down to 2 separate questions:
- should type aliases convert transparently, as they do in C?
- should pointers to type aliases be distinguishable or not?  E.g.: should we require a cast to pass a char* to a function expecting a uint8*?  They have the same sizes and value sets, but are different semantic beasts: converting a char to a string would yield a different representation from the conversion of a uint8 with the same value.  You do not want to have overloading in C2, so this example is contrived, but you get the idea.
Title: Re: Unsigned char
Post by: bas on September 11, 2015, 08:49:10 AM
I'm currently implementing the proposed change (char == uint8). The next interesting question I
ran into is this: the following piece of code should still work, right?
Code: [Select]
const char* text = "Hello World";
char c = 'a';

So changing *char* would imply that the type of string literals also changes to uint8, and that a
character literal ('a') would also have the type uint8.
I don't think the impact is very big, but I'm asking here to avoid overlooking something important...
Title: Re: Unsigned char
Post by: chqrlie on September 13, 2015, 02:10:08 PM
The impact should be very limited.  You can still produce C code that uses the char type for strings if you pass the appropriate flag to the C compiler, such as -funsigned-char for gcc and clang.

Incidentally, character literals have type int in C, i.e. sizeof('a') == sizeof(int).  This is different from C++ where character literals have type char, arguably a more natural choice and definitely needed to select the correct overloaded function in cases such as:
Code: [Select]
    cout << 0x41;  // will output 65 on stdout.
    cout << 'A';   // will output A on stdout.

The C2 documentation does not specify whether 'a' has type char or int.  I'm having trouble compiling simple test code that uses the sizeof operator, and I haven't been able to tell from the C++ source code how you treat sizeof, and more specifically sizeof('a').
Title: Re: Unsigned char
Post by: bas on September 14, 2015, 08:09:13 PM
The problems I was expecting were assigning -1 as special value to char to indicate some special case, like:
Code: [Select]

char c = -1;
if (c != -1) {
} // else ignore

Changing char to unsigned would force this type of code to use some other special value (like 0) or add the burden of using a second variable to indicate valid-ness.

The parsing of X in sizeof(X) is still very limited. One issue is that at parse time, the parser doesn't know whether X is a type or a symbol.
I also wanted to keep things simple, and since I have never really seen sizeof('a') or sizeof(10) in any production code, I wanted to remove this option. So only <type> or <symbol> would be allowed. Does this make sense?
Title: Re: Unsigned char
Post by: chqrlie on September 15, 2015, 12:52:07 AM
In your code snippet, using a special value for c still works the same if you use 255 instead of -1.  Yet it seems a tad sloppy to use a valid char value to indicate some special case: the standard idiom for this is to make c an int and use a value that is not a valid char value.  EOF comes to mind (namely -1 in your example), which is precisely the reason why char should *not* be int8.

Regarding the parsing of sizeof(X), of course it poses a problem to restrict the syntax in unexpected ways without a meaningful error or warning message.  It violates the principle of least surprise.  I for instance wasted almost an hour trying to compile simple test code and digging through the compiler source code and then into the modified clang parser... not a very good trip.  It is quite common to see sizeof used to get the size of a struct member - sizeof(a.b) - or that of a dereferenced pointer - sizeof(*p) -, so you need to accept an expression in sizeof(X).  I also tried the operator syntax sizeof 'a', which is not ambiguous, but is not accepted either.

I have never used sizeof('a') in production code, but in test code and it is actually a simple way to check if the compiler is invoked in C or C++ mode without the preprocessor:

Code: [Select]
if (sizeof('a') == 1) {
    /* C++ mode */
} else {
    /* C mode */
}

Title: Re: Unsigned char
Post by: bas on September 18, 2015, 08:39:02 AM
I realize that one of the missing entries in the documentation is how C2 handles constants (or literals). For example:
Code: [Select]
uint8 a = 200;
int8 b = 100 + 28;
int8 c = 28 + 'a';
int8 d = 28 + 'z';
In C2 the value 200 is not converted into int or long; instead, C2 checks whether the *value* fits the LHS. This also
works for constant expressions such as 100 + 28. So assigning to c is ok ('a' is 97, so the total is 125), while assigning
to d raises an error (since the total, 150, is > 127). That's how it is with constants.

It's very unfortunate that you ran into the issue with the parsing of sizeof(). There are only a few of those gaps
left in the parser, but there's always Murphy's law of course...

For checking things like sizeof('a') or sizeof(char*), static asserts would be best I think. This is not in C2 yet, but
shouldn't be that hard to implement.

Title: Re: Unsigned char
Post by: chqrlie on February 05, 2016, 04:03:52 PM
There should be 3 distinct 8-bit types (at least):

* `int8`: an 8 bit signed integer, translated to C as `int8_t`, equivalent to `signed char`.
* `uint8` or `byte`: an 8 bit unsigned integer, translated to `uint8_t` or `unsigned char`.
* `char`: an unsigned 8 bit byte, part of a UTF-8 encoded null terminated string.
   This latter type translates to C as `char`, and the C compiler should be configured with `-funsigned-char` to treat `char` as unsigned by default.  I already pointed out the inconsistencies attached to the type `char` being signed by default on many platforms, mostly for historical reasons.

For me, `char` and `uint8` are distinct types.  They have the same representation and arithmetic properties, but should be used appropriately depending on context and semantics.  For example, a string literal is a `char[]`, not a `byte[]`, a `uint8[]`, nor an `int8[]`.  Character constants should probably be of type `char` instead of `int` as they are in C.

Another question is whether to use `int8` and `uint8` or the simpler alternatives `i8` and `u8`, and similarly `i16`, `i32` and `i64`...
Title: Re: Unsigned char
Post by: bas on February 10, 2016, 11:04:28 AM
good discussions!

I think everyone agrees on the int8/uint8 stuff.
The question is which type to give X in:
Code: [Select]
X hello = "Hello World";
X is currently 'char' and, in my opinion, should be used for strings. That said, we could make it unsigned since that
would be better (discussed already). However.... I recently read another book about secure programming in C
(I can post the exact title for anyone interested) that explained why signed variables are used so often in C, even
for counting things: a negative value (often -1) can then be used for error handling. Quite obvious of course,
but when you think about it, it really is pervasive. So:

Code: [Select]
int number = calculateNumber();
if (number == -1) {
   // error
}
The same goes for strings of course. The -1 value of char can be used to mark a special case.

You propose that char (or at least char*) be used for strings, but how about the case of single chars?
Code: [Select]
char c = 'C';
Would signed/unsigned be better here?

Your suggestion of changing 'int8' / 'uint8' to 'i8' and 'u8' is quite funny, since that is actually what C2 started with.
It is also what Rust uses, btw. I think there are even some discussions about that on this forum. We looked at
a lot of code with either the long or the short version and found the longer version more 'compatible' (at least for
developers) with C. I do think that once you get used to them, i8/u8 are easier on the eyes.

Since C2 will be used for the same domain of problems as C is now, UTF support is not really high on the agenda.
It's not a general-purpose language like Go and Rust.