BString and UTF8
The BString class seems strange to me. Although it only handles data of type char, it promises to handle UTF-8 strings. But wherever a method takes a char parameter, I wonder how to apply it to, e.g., a two-byte UTF-8 character. Or is there also a char16-based BString alternative?

Comments
Re: BString and UTF8
I will give an example:
This will not work as expected.
It would be good to rewrite the BString class (internally still UTF-8) so that, in a compatible way, it also accepts a kind of metachar instead of plain char parameters. That would let it support multibyte characters, which the compiler seems to translate correctly into multibyte integer values.
If you send me the BString source code, I would be glad to do that rewrite.
Re: BString and UTF8
Hi Octopus!
I'm not sure how and what you'd like to improve with BString. There's a guide on how to get the source, there's also the SVN browser to have a look first. You may also want to join the haiku-developer mailing list.
That said, your above example works, if you use
s += "€"; or s << "€";
Regards,
Humdinger
Re: BString and UTF8
Of course, I am about to join some of the developer channels.
And of course you could append a string instead of a (multibyte) char. But there is a general problem: you have to know whether a char is a multibyte char or not. If you only have constants, this can be done. If you live in a country like the USA, where every letter is exactly one UTF-8 byte, life is very easy. In Germany, for example, there are the letters "äöüÄÖÜ߀". Other countries have even more. The good thing is that the compiler treats chars like '€' as a UTF-8 multibyte metachar (an integer). It would be no big problem to handle those metachars correctly wherever BString methods take a char parameter. And it would look more natural, because it is what one intends to do. Currently such BString parameters are truncated to a meaningless single byte without any warning about the loss of information.
P.S.: Another example:
This will output 2 as the found position, which is erroneously one byte past the position where 'ä' is stored. This happens because the char parameter of the FindFirst method is truncated without warning.
PPS: Thus multibyte characters might also be found at very different places, because only 8 bits are compared.
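A minimal sketch of the truncation described above. It uses std::string as a stand-in for BString so that it compiles outside Haiku, and it assumes that the compiler packs the two UTF-8 bytes of 'ä' (C3 A4) into the integer 0xC3A4. The string "Bär" and the helper names are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for BString::FindFirst(char): a plain byte-wise search.
// Returns the byte offset of the first match, or -1 if there is none.
int find_first_byte(const std::string& s, char c)
{
    const std::size_t pos = s.find(c);
    return pos == std::string::npos ? -1 : static_cast<int>(pos);
}

void demo_truncation()
{
    // "Bär" in UTF-8 is the byte sequence 42 C3 A4 72; 'ä' starts at offset 1.
    const std::string s = "B\xC3\xA4r";

    // Assumed integer value of the constant 'ä': its two bytes packed together.
    const int packed = 0xC3A4;

    // Passing that integer where a char is expected keeps only the low byte.
    const char truncated = static_cast<char>(packed & 0xFF);

    // The search matches the second byte of 'ä', so it reports 2 instead of 1.
    assert(find_first_byte(s, truncated) == 2);
}
```

The same low byte can also occur inside an unrelated multibyte character, which is why a match may turn up in a completely different place.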
Re: BString and UTF8
Well, developers in, e.g., the USA or UK have it easy. All traditional chars are identical to their UTF-8 counterparts. Here in Germany, for example, there are "äöüÄÖÜ߀". If you have them as constants, you really can avoid using them as a char and replace such calls with an equivalent that takes a char string parameter. But this is not natural. Moreover, it leads to a lot of unwanted errors that are not understood (at first). Such a parameter, e.g. 'ä', is truncated to its last byte and may cause wrong results, (mostly) without warning. E.g.
Re: BString and UTF8
Don't use chars. Use string:
Re: BString and UTF8
I come from Italy, and for me "èéòàùì" are chars, NOT strings, so it is perfectly natural for me to write:
BString s = "Bèrenstraße";
int p = s.FindFirst('è');
printf("%d", p); // SHOULD Out: 1
int32 FindFirst (char c) const // This does not work with UTF-8 chars? It should...
Find the first occurrence of the given character.
int32 FindFirst (const char *string, int32 fromOffset) const
Find the first occurrence of the given string, starting from the given offset.
int32 FindFirst (const BString &string, int32 fromOffset) const
Find the first occurrence of the given BString, starting from the given offset.
int32 FindFirst (const char *string) const // Why must I pass a "string" for a char (è)?
Find the first occurrence of the given string.
I don't know if it can be that easy. char is in reality a synonym for "byte" (that is, 8 bits), so we need a method that accepts an int (or, if we want to be exotic, a dedicated type such as wchar = int32)...
But will the trick of using 'ì' to denote a "UTF-8 char" work?
Redefining char as int32 would, I fear, be nearly impossible for compatibility reasons...
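For comparison, a minimal sketch of the string-based search on the same word. std::string::find stands in for BString::FindFirst(const char *) here (an assumption, so that it runs outside Haiku), the helper names are hypothetical, and the UTF-8 bytes are written as explicit escapes so the result does not depend on the source file's encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Search for a UTF-8 character passed as a string rather than as a char.
// Returns the byte offset of the first match, or -1 if there is none.
int find_first_utf8(const std::string& haystack, const std::string& needle)
{
    const std::size_t pos = haystack.find(needle);
    return pos == std::string::npos ? -1 : static_cast<int>(pos);
}

void demo_string_search()
{
    // "Bèrenstraße": 'è' is C3 A8 and 'ß' is C3 9F. The literal is split before
    // the final "e" so that it is not read as part of the \x9F escape.
    const std::string word = "B\xC3\xA8renstra\xC3\x9F" "e";

    assert(find_first_utf8(word, "\xC3\xA8") == 1);    // 'è' right after 'B'
    assert(find_first_utf8(word, "\xC3\x9F") == 10);   // byte offset of 'ß'
}
```

Note that the result is a byte offset: 'ß' is the tenth character (index 9) but sits at byte 10, because 'è' occupies two bytes.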
Re: BString and UTF8
This is a C++ Linux equivalent... and it works, no tricks: ì is "a char" and is found at position 4... I did not have to use a string (that is, the char * = "ì" trick!).
So Haiku's FindFirst(char c) should work in an analogous way...
Re: BString and UTF8
Not at all... 3 would be right, because indexing starts at 0.
PS.: Moreover variable toFind is not able to hold a multibyte value completely.
Don't use chars. Use string: ...
What sense does a UTF-8 based BString class make if its methods cannot handle multibyte chars as parameters? Then it would be more consistent to forget about UTF-8. Instead, the class methods should be opened up to handle metachar parameters. That is what I have volunteered to do.
Re: BString and UTF8
I'm just a newbie dabbler, so I don't yet understand what's the problem in using "é" instead of 'é'. You may want to discuss your plans at the haiku-developer mailing list, that's where the Haiku devs hang out. Then you can provide a patch and create a ticket for it on the bug tracker.
Regards,
Humdinger
Re: BString and UTF8
Any cooperative task in Haiku development seems too complicated for me. I still do not understand how things are organized here. Maybe a native English speaker would find it easier. I presume a short video showing, with a small example, how to interact might help some fans to contribute. For now I will stop posting my problems on this site and quietly keep watching how Haiku proceeds.
Re: BString and UTF8
It's really not that complicated.
Regards,
Humdinger
edit: PS: Don't mistake these forum chats with the official development process. That is really done on the dev mailing list.
Re: BString and UTF8
Do you realise the difference between 'ä' and "ä" in memory? The first item occupies sizeof(char) bytes, which is one byte. The second item is a zero-terminated byte array. Also, the reason your code worked under Linux was a coincidence that depends on the local user's code page. On a typical US system, it would give different results than with a Western European code page. When you realise the difference, you've taken your first step towards understanding Unicode.
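The size difference can be checked directly. This is only a sketch: the function names are hypothetical, and the UTF-8 bytes of "ä" are written as explicit escapes so the result does not depend on the source file's encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// "ä" in UTF-8: the two bytes C3 A4, followed by the terminating NUL.
const char kAUmlaut[] = "\xC3\xA4";

// Bytes occupied by the string literal, including the terminator.
std::size_t string_storage_bytes() { return sizeof(kAUmlaut); }

// Bytes of actual character data, as reported by strlen.
std::size_t string_payload_bytes() { return std::strlen(kAUmlaut); }

void demo_sizes()
{
    assert(sizeof(char) == 1);             // a char holds exactly one byte
    assert(string_payload_bytes() == 2);   // 'ä' needs two bytes in UTF-8
    assert(string_storage_bytes() == 3);   // two bytes plus the NUL
}
```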
BString was created at a time when std::string was under-featured. Today it's a different story. But then again, the entire codepage/unicode text handling facility in C based languages is also a mess. This is the price of legacy.
Re: BString and UTF8
... BString was created at a time when std::string was under-featured. Today it's a different story. But then again, the entire codepage/unicode text handling facility in C based languages is also a mess. This is the price of legacy.
That is why I suggested extending the BString class in a compatible way. Wherever plain char parameters are expected, metachar (int32) parameters should be supported. I had volunteered to do this. But it does not make much sense to invest such effort where the problem to be solved is not seen.
P.S.: 'ä' needs two bytes, '€' will need three. They cannot be represented by char, but by metachar (int32).
Re: BString and UTF8
I had volunteered to do this. But it does not make much sense to perform such efforts, where the to be solved problem is not seen.
Maybe you missed what I added to my last post: "Don't mistake these forum chats with the official development process. That is really done on the dev mailing list." If you really want to work on the issue, you should consider taking it to the official list.
Regards,
Humdinger
Re: BString and UTF8
Well, I have subscribed to two mailing lists, but that does not enable me to post anything there. Logging on to those mailing lists is not possible, because passwords or authorization strings are missing.
Re: BString and UTF8
That's odd. You just have to subscribe with the email address you're going to use when posting to the list, and confirm once the email sent to you by freelists.org. After that, all mails will be delivered to the account you specified, and posting should be a simple mail to haiku-development@freelists.org.
Regards,
Humdinger
Re: BString and UTF8
Unfortunately you cannot extend it in a compatible manner.
If you change parameter types then the symbols will mangle differently and anything compiled against the old one will die with symbol not found errors.
I would suggest creating a separate unicode supporting string and integrating that, as BString must remain binary compatible with the BeOS one and has limited usefulness.
Re: BString and UTF8
I believe that the BString class can be extended in a compatible manner to handle UTF-8. I believe it can be done without changing the parameter types of the existing methods, so that both binary compatibility and the behavior of the existing methods are preserved. The one thing that my implementation won't do is locale-sensitive linguistic comparisons. For that, a Collator class needs to be added to the Locale Kit.
I'm working on an implementation and am planning on writing some articles that demonstrate the new methods that will be added.
Re: BString and UTF8
Thanks from all non-English speakers and users of èòàèùé€ chars!
Re: BString and UTF8
I've been working on adding full support in the BString class for UTF-8, and wanted to give an update on this issue.
First the bad news, then an explanation, then some good news.
Using national characters with single quotes (apostrophe) will not work with the BString. The problem is not the design or implementation of the BString. The problem is caused by a 'feature' of the C/C++ language.
The C/C++ language does not know about UTF-8. So, you can't natively handle UTF-8 in C/C++. From a purely technical point of view, using "char*" with UTF-8 strings is incorrect. In C/C++, a "char" is a "single byte character". A UTF-8 string is an array of "unsigned bytes", where a single character can be between one and four bytes in length. A more accurate type for a pointer to a UTF-8 string is "uint8*".
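As a rough sketch of the "one to four bytes" rule, the length of a UTF-8 sequence can be read off its first byte. The function name is hypothetical and this is not part of BString.

```cpp
#include <cassert>
#include <cstdint>

// Number of bytes in the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` is a continuation byte or cannot start a sequence.
int utf8_sequence_length(std::uint8_t lead)
{
    if (lead < 0x80) return 1;                    // 0xxxxxxx: 7-bit ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2;   // 110xxxxx (C0 and C1 would be overlong)
    if (lead >= 0xE0 && lead <= 0xEF) return 3;   // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF4) return 4;   // 11110xxx, capped at U+10FFFF
    return 0;                                     // 10xxxxxx or an invalid lead byte
}

void demo_sequence_length()
{
    assert(utf8_sequence_length(0x41) == 1);   // 'A'
    assert(utf8_sequence_length(0xC3) == 2);   // first byte of 'ä' (C3 A4)
    assert(utf8_sequence_length(0xE2) == 3);   // first byte of '€' (E2 82 AC)
    assert(utf8_sequence_length(0xA4) == 0);   // second byte of 'ä', not a lead byte
}
```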
The '€' character constant is three bytes in length in UTF-8. The C/C++ language does not consider it a "char". So, what type is a character constant that uses single quotes and is two to four bytes? I tested this issue with several C++ compilers to see what type they considered a character constant that was between two and four bytes. One treated it as an "int". Another treated it as a "long". The C++ compilers used for Haiku treat it as an "unsigned long". Another compiler treated it as an "unsigned long", but also allowed character constants up to 8 bytes. That compiler treated a character constant that was between 5 and 8 bytes as an "unsigned long long".
So where does that leave us?
If you want to use "operator<<" with characters in Haiku, you will have to use double quotes (if the character is not a 7-bit ASCII char).
When Be, Inc. designed BeOS, they used this 'feature' of C/C++ to their advantage. The "message constant" parameter to a BMessage is a "character constant" that can be between one and four bytes in length. One of the implications of this is that you can use national characters in a "message constant" for the BMessage system. You, of course, need to make sure that the value does not exceed four bytes in length. (The compiler will flag a compile error if you do that.)
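A sketch of how such a constant maps to a 32-bit value. The helper name is hypothetical, and the packing order (first byte most significant) is an assumption that matches what GCC does for multi-character constants; the value is implementation-defined, so other compilers may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: pack four bytes into one 32-bit value,
// with the first byte in the most significant position.
constexpr std::uint32_t four_char_code(char a, char b, char c, char d)
{
    return (static_cast<std::uint32_t>(static_cast<unsigned char>(a)) << 24)
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(b)) << 16)
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(c)) << 8)
         |  static_cast<std::uint32_t>(static_cast<unsigned char>(d));
}

void demo_four_char_code()
{
    // Four ASCII characters (assuming an ASCII execution character set).
    assert(four_char_code('M', 'S', 'G', '1') == 0x4D534731u);

    // The three UTF-8 bytes of '€' (E2 82 AC) fit as well, with one byte to spare.
    assert(four_char_code('\0', '\xE2', '\x82', '\xAC') == 0x00E282ACu);
}
```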
Now for the good news.
My update of the BString class to enable full UTF-8 support is in good shape. I'm focusing on several aspects:
1) Ensure that the BString can only contain valid UTF-8 strings.
2) Add full support for both "code point" and "UTF-8 character" functions.
3) Enhance the Locale kit to complete the ability to fully handle UTF-8 characters for any language.
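To illustrate what point 1 involves, here is a minimal validity check for UTF-8. It is a sketch under my own assumptions, not the code from the planned BString update, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Returns true if `s` is well-formed UTF-8: no stray continuation bytes, no
// truncated or overlong sequences, no surrogates, nothing above U+10FFFF.
bool is_valid_utf8(const std::string& s)
{
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const std::uint32_t kMinimum[] = {0, 0, 0x80, 0x800, 0x10000};

    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const std::uint8_t lead = static_cast<std::uint8_t>(s[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else return false;                      // continuation or invalid lead byte

        if (len > n - i) return false;          // sequence is cut off
        for (std::size_t k = 1; k < len; ++k) {
            const std::uint8_t b = static_cast<std::uint8_t>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < kMinimum[len]) return false;           // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                // beyond Unicode range
        i += len;
    }
    return true;
}

void demo_validation()
{
    assert(is_valid_utf8("\xE2\x82\xAC"));   // '€'
    assert(!is_valid_utf8("\xA4"));          // lone continuation byte
    assert(!is_valid_utf8("\xC0\xAF"));      // overlong encoding of '/'
}
```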
Beyond the code changes, I am also going to provide:
1) Updated API documentation for the "Haiku Book".
2) An article that is an introductory description of UTF-8, internationalization (i18n), collation, code points vs characters, normalization, etc... This document will also give code examples showing how to use these classes to provide full i18n support. (I would like this to be a chapter in the "overview" section of the Haiku Book.)
I'm targeting the first part of this for the end of November, 2011, with the remainder rolling out in 2012.
Notes:
A) There is often a misconception that in Unicode a "code point" and a "character" are the same thing. They are not. There are times when a character consists of a single code point, but in other cases, a character encompasses two or more code points. The BString class needs to understand (and support) both code points and characters. (New functions in the BString class will provide an easy way for applications to get the correct behavior.)
B) One of the main reasons for only allowing valid UTF-8 strings in the BString class is that not doing so would expose a security vulnerability. "String" exploits are a common target for computer viruses. Closing this security vulnerability is important.
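The distinction in note A can be seen with 'é', which exists both as one precomposed code point (U+00E9) and as 'e' followed by a combining acute accent (U+0301). The sketch below only counts code points; the function name is hypothetical, and grouping code points into user-perceived characters would need Unicode segmentation rules that are not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts Unicode code points by counting every byte that is not a
// continuation byte (10xxxxxx). Assumes the input is already valid UTF-8.
std::size_t count_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}

void demo_code_points()
{
    const std::string precomposed = "\xC3\xA9";    // U+00E9
    const std::string decomposed  = "e\xCC\x81";   // U+0065 followed by U+0301

    // Both render as the single character 'é', yet they differ in code points.
    assert(count_code_points(precomposed) == 1);
    assert(count_code_points(decomposed) == 2);
}
```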
Re: BString and UTF8
Hello,
any news about your unicode-aware version of the BString lib?
I would be interested in using it in a non-Haiku context.
Is there already a version released?
Are the license terms the same as with the original BString ?
Best regards, TE
Re: BString and UTF8
I was going to try and explain this, but mibrid beat me to it, so: what he said. Using '€' just won't work; you have to use "€", because '' indicates a single-byte char and € isn't one. This isn't a problem with BString(); it is a fact set by the compiler.
Re: BString and UTF8
Well, a long time ago I raised the problem with UTF8 and offered to work out a solution for it. In several attempts to offer and explain this on mailing lists, however, I only ever hit a brick wall. Maybe it was my English. So from now on I will only use German here; maybe I will be understood better ...
One should see clearly that "€" is something different from '€'. And if you need the latter, a pointer to the former does not help. What is needed instead are real solutions, because not everyone lives in Anglo-Saxon countries, which simply do not want to see our problems with "äöüÄÖÜ߀" in UTF8.
Everyone complains that there are so few developers for Haiku. In fact, though, people here are also masters at driving away potential helpers.
Posting in German may also help to show that the world does not consist only of English-speaking users and developers. Originally, BeOS surely intended to be recognized internationally (especially in Japan). But then (if that still holds for Haiku as well) one should also deal seriously with local problems such as that of umlauts in UTF8, and not just give the cold shoulder.
Re: BString and UTF8
Please in English :)
Hate to miss if a solution was given ;)
Re: BString and UTF8
Octopus, as you are saying, Haiku is an international project. But that doesn't mean everybody can speak his or her own language. There needs to be a common language for obvious reasons and that is English, whether you like it or not. This is the case not only in Haiku but also in *every* other major open source project I know of. Just ignoring that fact and posting in German won't help at all. And by the way, being German myself, I find your behavior a bit arrogant and embarrassing, to say the least.
Also, if I am correct, there are in fact a few German developers in Haiku. If you contact the developers via the mailing list, as was suggested multiple times to you, or file a bug in the bug tracker, I am pretty sure your help would be very much appreciated. Just check out the code, write a patch, and discuss it on the mailing list or attach it to a bug ticket. Reading this thread I don't see, where anybody wanted to scare you away. People tried to make clear to you, that this is not the right place to discuss the topic, because there are hardly any developers here.
I am not speaking on behalf of the Haiku project, I am not a contributor, not a developer, I just observe its development. I have contributed to other open source projects before though and had a rough start, too. If your mails were ignored, it could be that just nobody had time at that moment or they got lost between other more important things. Sometimes it helps to just politely ask again. Nobody in the open source "world" wants to scare you away, but neither does it revolve around you. And the language in this "scene" is English, cope with it.
Re: BString and UTF8
Of course "€" is something different from '€': "€" is clearly defined in the C++ standard as a string, while '€' is, according to the standard, undefined and compiler-dependent. That '€' works on other operating systems is probably a matter of luck.
Relying on compiler quirks in BString would be dangerous; BString could then suddenly stop working after a switch or even an update of the compiler.
The behavior you describe is unfortunate, but sadly it is a property of C/C++ itself, and BString is not responsible for it. The only sensible way to use '€' would be to switch to a programming language that supports UTF8 directly; within C/C++ one has to put up with "€", like it or not.
Re: BString and UTF8
There is no need to put up with it. I have proven that with my own routines. The compiler is innocent. What is called for is an improvement of the library. I had offered to write it. Nobody (of those who matter) was interested, and so I was left to run into a wall. At some point I gave up offering my help here like sour beer. Fortunately there are plenty of opportunities to contribute one's creativity elsewhere.
Re: BString and UTF8
Hi Octopus.
Do you have a link to the mailing list thread in which you presented your ideas/routines? Here is the archive.
Regards,
Humdinger