NetHack Development Team Polls Community For Advice On Unicode
An anonymous reader writes: After years of relative silence, the development team behind the classic roguelike game NetHack has posted a question: going forward, what internal representation should the NetHack core use for Unicode characters? UTF8? UTF32? Something else? (See also: NH4 blog, reddit. Also, yes, I have verified that the question authentically comes from the NetHack dev team.)
The answer is... (Score:5, Insightful)
utf-8
Re:The answer is... (Score:5, Informative)
For storing a single character: UCS-4 (aka UTF-32), and that's without possible combining character decoration. For everything else, UTF-8 internally, no matter what the system locale is.
wchar_t is always damage, it shouldn't be used except in wrappers that do actual I/O: you need such wrappers as standard-compliant functions are buggy to the level of uselessness on Windows and you need SomeWindowsInventedFunctionW() for everything if you want Unicode.
And why UTF-8 not UCS-4 for strings? UTF-8 takes slightly longer code:
while (int l = utf8towc(&c, s))
{
s += l;
do_something(c);
}
vs UCS-4's simpler:
for (; *s; s++)
{
do_something(*s);
}
but UCS-4 blows up most of your strings by a factor of 4, and makes viewing stuff in a debugger really cumbersome.
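For illustration, a minimal sketch of a decode helper along the lines of the utf8towc() used above (this is a purely illustrative stand-alone version, not Crawl's actual code; it skips overlong/out-of-range checks):

/* Illustrative decoder: reads one UTF-8 sequence from s, stores the code point
 * in *c, and returns the number of bytes consumed (0 at the end of the string).
 * Invalid lead bytes are passed through as-is. */
int utf8towc(unsigned int *c, const unsigned char *s)
{
    if (s[0] == 0)
        return 0;
    if (s[0] < 0x80) {                                     /* ASCII fast path */
        *c = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {  /* 2-byte sequence */
        *c = (s[0] & 0x1Fu) << 6 | (s[1] & 0x3Fu);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80
                              && (s[2] & 0xC0) == 0x80) {  /* 3-byte sequence */
        *c = (s[0] & 0x0Fu) << 12 | (s[1] & 0x3Fu) << 6 | (s[2] & 0x3Fu);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0 && (s[1] & 0xC0) == 0x80
                              && (s[2] & 0xC0) == 0x80
                              && (s[3] & 0xC0) == 0x80) {  /* 4-byte sequence */
        *c = (s[0] & 0x07u) << 18 | (s[1] & 0x3Fu) << 12
           | (s[2] & 0x3Fu) << 6  | (s[3] & 0x3Fu);
        return 4;
    }
    *c = s[0];                                             /* invalid byte: pass through */
    return 1;
}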
My credentials: I'm the guy who added Unicode support to Dungeon Crawl.
Re: (Score:2)
Came here to say pretty much the same thing.
UTF-8 is pretty easy to work with from a memory management perspective and will make it easier when upgrading an established ASCII base.
Re: (Score:2)
More importantly, (Score:3, Insightful)
who cares? This only affects naming your character and displaying stuff on the map.
Re:More importantly, (Score:5, Funny)
Don't you want to name your fruit U+1F4A9? (can't write this as a literal because Slashdot)
Re: (Score:1)
Better to not support it at all (Score:2)
What use are those characters anyway? You don't need funny accents on letters to play Nethack. 7 bits should be enough for any character set! Hardcore hackers who want a workaround can just use LaTeX codes.
Re:Better to not support it at all (Score:5, Funny)
What use are those characters anyway? You don't need funny accents on letters to play Nethack.
For more terrifying monster types, of course. You haven't really battled a Chinese dragon until you've done it using the original Han character set.
utf-32/ucs-4 (Score:2)
Considering the length of their release cycle, it seems to be a safe choice.
It's not like the difference between 1/2/4 bytes would make much of a performance difference for an application like NetHack.
Using utf-32 internally would save them from some of the silliness that alternatives like utf-8 bring with them.
Re: (Score:3)
i don't see a real argument here. "considering the length". how long is it?
Check the game history [wikia.com]. Literally decades between major releases.
"some of the silliness". what silliness is this exactly? external storage of utf-32 requires that one deal with an endian character set. every time any text is touched, you'll get to endian convert.
Everybody has already settled on the little-endian presentation.
isn't that awesome? utf-8 does not have this issue. and one can almost always treat utf8 as a byte stream. except in the rare case where one needs to know where character boundaries are. for example, to map the character to a font. the fast path is the common path (ascii), and just requires a single test ((c&0x80) == 0).
With UCS-4 you do not even need any tests.
Extracting a character - trivial.
Length of string - trivial.
Normalization - much simpler than the utf-8.
The sad reality is that the libraries I have seen actually implement utf-8 handling by using utf-32 internally. You can't avoid it: Unicode is specified in code points, which as you point out are already as good as 32 bits long.
Re:utf-32/ucs-4 (Score:5, Informative)
Please, don't use the Wikia NetHack Wiki. It is outdated, ad-ridden, and has been abandoned by the community, but Wikia doesn't allow a wiki to be deleted.
The current NetHack wiki is at http://nethackwiki.com/ [nethackwiki.com] .
Re:utf-32/ucs-4 (Score:5, Informative)
Extracting a character - trivial. Length of string - trivial.
I don't think it's quite as simple as you think. UTF-8 is a variable-length encoding, but UTF-32 is too when you consider grapheme clusters.
When you extract characters and determine length, are you only talking about code points (not very useful) or are you taking into consideration combining characters to account for actual visible glyphs that most people would consider to be a character?
The overwhelming majority of apps are only doing trivial operations -- string concatenation and shuffling bits to some API to display text. For these apps, choice of encoding really does not matter. NetHack is very likely in this category.
Anything more and you'll have to deal with variable-length data for both UTF-8 and UTF-32. So it doesn't really matter. Choose whichever uses less storage space.
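For a concrete example of one user-visible character that is variable-length in every encoding, take a decomposed accented letter (NFC happens to collapse this particular one to U+00E9, but as noted elsewhere in the thread not every combination has a precomposed code point; the variable names here are just for illustration):

#include <stdint.h>

/* "e" (U+0065) followed by COMBINING ACUTE ACCENT (U+0301): one grapheme cluster */
uint32_t decomposed_utf32[] = { 0x0065, 0x0301, 0 };   /* two code points even in UTF-32 */
const char decomposed_utf8[] = "\x65\xCC\x81";         /* three bytes in UTF-8 */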
Re: (Score:3, Insightful)
Let me answer with a koan: 'What is the real length of a soft hyphen?'
Re: (Score:2)
It is the same problem as with the fancy acute/agrave/etc special symbols.
And the special white-space/no-space characters. And the special writing direction change characters.
They are generally removed during normalization/conversion into canonical presentation.
The thing is, after the normalization, which is needed for any Unicode text anyway, UCS-4 becomes a plain array of characters. But UTF-8 - still not.
Re: (Score:2)
The thing is, after the normalization, which is needed for any Unicode text anyway, UCS-4 becomes a plain array of characters. But UTF-8 - still not.
It becomes a plain array of codepoints. Some things still require multiple codepoints to represent, though they're relatively rare. The main advantage of UTF-32 is that if you're only broken for the things that are multiple codepoints, then most people won't notice or care. If you're broken for things that require multibyte UTF-8 characters, then a lot of people will notice.
Re: (Score:1)
So basically what you (and others) are saying, is that since there are some edge cases foreseen in the standard, nobody should try to make life easier even by a bit?
Combining characters (and the rest of the crap) pretty much never occur in real life. Only in some sadistic test case for the Unicode libraries, probably.
The main purpose of Unicode, why both users and developers want it, is to represent as many characters as possible with the least hassle possible. And that's pretty much what everybody's shooting for.
Re: (Score:1)
So, based on your few months of experience with Unicode, which apparently gives you a headache, you are pushing for them to implement an easier short-term solution that you admit won't work in some cases.
Re: (Score:2)
So what do you propose?
Go with utf-8, which doesn't alleviate any of the problems but adds its own?
Besides, I doubt very much that anybody is going to use any of the fancy characters in NetHack.
Re: (Score:2)
Combining characters (and the rest of the crap) pretty much never occur in real life.
Depends on the scripts and languages. It's fairly common in Indic scripts
Re: (Score:2)
Re: (Score:2)
Its obvious you have little real experience with unicode, because saying 'just convert to utf-32' just papers over the problems without solving them.
Indeed, I've only scratched the surface. And that alone gave me headaches for months.
UTF-32 units are code points, not characters, and there are many multi-code-point (variable length) characters in utf-32.
For example?
Re: (Score:1)
"However, not all abstract characters are encoded as a single Unicode character, and some abstract characters may be represented in Unicode by a sequence of two or more characters. For example, a Latin small letter "i" with an ogonek, a dot above, and an acute accent, which is required in Lithuanian, is represented by the character sequence U+012F, U+0307, U+0301."
Re: (Score:1)
A google search tells me that E with dot below and acute accent, used in Yoruba language, has no precomposed codepoint.
Re:utf-32/ucs-4 (Score:5, Informative)
Characters in Thai are rendered in display-order, and not logical order.
So, for example (mina would be imna), and requires reordering for sorting.
Characters in many Indic languages are still all syllable based.
So, consonants and vowels are encoded separately, and fully interact as a logical graphical character.
Sinhala:
0dc1 0dca 200d 0dbb 0dd3
ZHA VIRAMA ZWJ RA VOWEL-SIGN-II
Combine to form a single displayable character. (Sri)
If you omit the Zero-Width-Joiner, then it displays as two characters, "Sa'" and "Ri."
So, the rendering and display are dependent on the entire grapheme, which is the normal unit of display and truncation.
Otherwise one will be cropping portions of a character on display, and rendering either gibberish/mojibake or unrelated characters/syllables.
Malayalam:
0d15 0d4d 0d38 0d3e
KA VIRAMA SA AA
One displayable character.
If you display code-point by code-point, the grapheme displayed would change 4 times.
KA
K'
KSA
KSAA
Re: (Score:2)
Characters in Thai are rendered in display-order, and not logical order. [...]
Ha! Not relevant to me, actually. But very informative. Thanks.
Overall, most customers are aware of the problems (and in my experience understand them better than I do). The simple handling I had in my software worked and was sufficient.
The Thai language specifically is a cool example. Why not relevant? My company refused to do Thai localization. (And thanks to you now I know fully why.) To do the localization we were told that we have to buy a special Thai language library. The library costs huge money. When we told cus
Re: (Score:2)
Pay indeed.... The only thing you need to pay for is somebody to QA your app who can read the foreign language. And maybe for half a clue about learning this stuff, because apparently you're too lazy to read all the free material on the web that explains this stuff.
I imagine the problem for many software companies is that localization is probably an afterthought unless they start out in Asia, and even then Thai is probably not a priority for them so while they'll certainly handle Unicode, they might not handle all of it.
Re: (Score:2)
Even in the real world, a surprising number of languages contain characters that still have no single-point normalized forms. But the most widely-known case of multi-point characters doesn't correspond to any real-world language at all: look up "Zalgo" for more information on this.
Re: (Score:2)
> Everybody has already settled on the little-endian presentation.
What makes you think this? There are plenty of old Motorola architecture based systems still in legacy environment use, preserved for stable scientific or business computing environments. NASA has a great deal of it still in use, because they've been forced to keep old earthbound hardware in use to support old spacebound mission hardware. And there is a significant amount of new, bi-endian hardware being produced now,
I'm afraid I have quite
Re: (Score:2)
Everybody has already settled on the little-endian presentation.
What makes you think this? There are plenty of old Motorola architecture based systems still in legacy environment use, preserved for stable scientific or business computing environments.
Man, I come from the BE world. You do not need to tell me that there is still an abundance of BE hardware.
And there is a significant amount of new, bi-endian hardware being produced now,
Most modern CPUs I had to deal with, except the Intel, are bi-endian. BUT. Most (by model number) are used in BE mode. (But since ARM also has settled on the LE, now it is effectively a LE world.)
Yet.
1st. The endianness of the CPU is not related to the endianness of a data exchange format.
2nd. The endianness of the data exchange format does not relate to the internal presentation of the data in t
Re: (Score:2)
> For external conversions, all what matters that the internal format can be easily converted into the widely used encodings.
And this is the difficulty. It's not the _ease_. It's the consistency, predictability, and portability. Many external displays of Unicode content have varied between platforms in alarming ways, especially due to mishandled character displays which the programmer has little control over. It may have gotten better since my last go-around with it, but even simple layout issues like co
Re: (Score:3)
UTF-8 is designed to be treated as a byte stream - even when detecting character boundaries. If a byte is >0x7F and <0xC0, then it is not a character boundary. If you want to be really strict, filter out the invalid bytes (0xC0, 0xC1, >0xF4), then everything else is a character boundary.
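A rough sketch of how cheap that boundary test is in practice (this counts code points, not grapheme clusters, and assumes the input is well-formed UTF-8):

#include <stddef.h>

/* count code points by counting bytes that are not continuation bytes (10xxxxxx) */
size_t utf8_codepoint_count(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)   /* this byte starts a code point */
            n++;
    return n;
}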
Use utf if you must, for character names, only. (Score:5, Interesting)
I started playing nethack before it was nethack, it was just hack. (I may well hold the record for longest time playing without an ascension, but that is beside the point.) I have played other roguelikes and keep coming back to nethack because it is the only one that keeps that same feel for me. It has had the same overall look my entire life. While the expanded character set in UTF would allow for significantly more characters to be used in drawing the map, and designating each monster with a different character, I beg of you not to do so. Keep the overall look the same, (or allow it as a compile time option at the very least) and just use UTF for the character name.
For which implementation of UTF to use, I'd go with utf8 as it seems to have the widest adoption, or 32 because that will probably allow you the longest time before having to think about this again. I would avoid the middle ground.
Re: (Score:1)
Adding Unicode for names would be nice but it also would probably introduce a ton of bugs in the process making the game less stable again. Plus, using the same character for different monsters is *part of the game*. If you get lazy and don't look if the G is a gnome vs gargoyle or something, the mistake is supposed to cost you.
Re: (Score:2)
Adding Unicode for names would be nice but it also would probably introduce a ton of bugs in the process making the game less stable again. Plus, using the same character for different monsters is *part of the game*. If you get lazy and don't look if the G is a gnome vs gargoyle or something, the mistake is supposed to cost you.
Thanks for reminding me why I don't play Nethack - briefly I was tempted.
Re: (Score:3)
Don't worry, that would never happen.
G's are only ever gnomes (of differing ranks), but a g might be a gargoyle, flying gargoyle, or gremlin.
I hope that clears things up. And for god's sake, don't genocide G's if you're playing as a gnome.
Re: (Score:2)
Well, clearly *that's* an exception. I'd forgotten because in that case, I'd normally a a u's (.
Re: (Score:2)
For which implementation of UTF to use, I'd go with utf8 as it seems to have the widest adoption, or 32 because that will probably allow you the longest time before having to think about this again. I would avoid the middle ground.
UTF-8, while originally only defined to 31 bits and now defined to 21 bits, actually has room to trivially extend up to 43 bits. One could say it's more future-proof than UTF-32. Not that it really matters -- we're only using 17 bits right now so I doubt we'll ever get past 21. Maybe when we encounter intelligent alien life.
Re: (Score:3)
(I may well hold the record for longest time playing without an ascension)
I think I started Hack around '92, and finally ascended in 2009. I've been trying to ascend about once a year since then.
Fonts missing in action (Score:2, Informative)
First off, UTF-32 is least likely to cause bugs, since all chars are the same length and it is thus possible to determine memory usage simply by multiplying the char count by 4. So, if you're gonna do unicode, and you don't like your code to be buggy, this is the way to do it.
That said, unicode is a travesty. Unlike ascii, there is no such thing as a complete unicode font that implements all of unicode's code points. Unicode only defines how any implemented chars should be numbered, but doesn't actually require yo
Re: (Score:2)
The font issue is a silly thing to worry about. The same thing can be said of ASCII and of Windows-1252. I'm sure lots of early fonts, and probably even some you find today, that claim to support all glyphs in Windows-1252 are missing the Euro sign at codepoint 0x80, because it was added later on. Even for a small character set restricted to 256 max characters, as you can see, things change over time, and fonts don't always keep up.
Re: (Score:2)
Shouldn't the font system just solve this for me in the case of display use? Sure, for typography you probably don't want magical mystery substitutions, but why can't the system figure out which of my fonts is most similar to the font I'm using and sub in missing glyphs?
Re: (Score:2)
I'm pretty sure most font systems already DO do this. In fact, this was the reason I rooted my Android phone - I wanted to change the font-fallback order so that certain Kanji would display with a Japanese font instead of Chinese one. An example is http://jisho.org/kanji/details... [jisho.org] which is drawn completely different in Chinese fonts, to the point where Japanese readers would not know the symbol, yet both are supposed to be represented by the same codepoint, because they're the same character.
But anyway, fo
Re: (Score:2)
Usually not the font systems themselves, as the font system API needs to be designed to let you use fonts in the way that suits your application, and not have random substitutions happen behind your back (though the font system provides the API functions to figure out what a good substitution font will be). But higher level UI libraries, like GTK, Qt, MFC, Windows Forms, Core Text, Skia etc will do it.
Fonts missing in action (Score:2)
First off, UTF-32 is least likely to cause bugs, since all chars are the same length and it is thus possible to determine memory usage simply by multiplying the char count by 4.
The memory usage of UTF-8 is also at most char count multiplied by 4. The 5- and 6-byte sequences were declared invalid when Unicode was restricted to have no character above U+10FFFF.
Re: (Score:3, Informative)
Terminology needs to be fixed.
All code points are 4 bytes.
All characters (defined as a single conceptual and graphical display unit) range from 1 to 6 code points (so, 4-24 bytes).
Sinhala:
0dc1 0dca 200d 0dbb 0dd3
ZHA VIRAMA ZWJ RA VOWEL-SIGN-II
Combine to form a single displayable character (Sri). (Kind of a fancy item, but different from the version without the ZWJ, which would display two graphemes: S' and RII.)
And Lithuanian:
"However, not all abstract characters are encoded as a single Unicode character, and some abstr
UTF-8 (Score:5, Insightful)
The answer is UTF-8. It's pretty much going to be the de-facto character set now. It has backwards compatibility with ASCII, and can easily be extended in the future to support possible U+200000 - U+7FFFFFFF codepoints, as the original UTF-8 specification used to include that anyway.
An important point is to not mess things up and end up with CESU-8 like MySQL did. There are completely valid 4-byte UTF-8 characters, so don't create some special alternate UTF-8 by artificially capping UTF-8 at a max of 3 bytes per character.
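For instance, anything above U+FFFF legitimately takes four bytes in UTF-8. A sketch of the encoding step (this assumes the code point has already been validated as a Unicode scalar value no greater than U+10FFFF):

/* encode one already-validated code point into buf, return the number of bytes written */
int utf8_encode(unsigned long cp, unsigned char *buf)
{
    if (cp < 0x80)    { buf[0] = (unsigned char)cp; return 1; }
    if (cp < 0x800)   { buf[0] = 0xC0 | (cp >> 6);
                        buf[1] = 0x80 | (cp & 0x3F); return 2; }
    if (cp < 0x10000) { buf[0] = 0xE0 | (cp >> 12);
                        buf[1] = 0x80 | ((cp >> 6) & 0x3F);
                        buf[2] = 0x80 | (cp & 0x3F); return 3; }
    buf[0] = 0xF0 | (cp >> 18);
    buf[1] = 0x80 | ((cp >> 12) & 0x3F);
    buf[2] = 0x80 | ((cp >> 6) & 0x3F);
    buf[3] = 0x80 | (cp & 0x3F);
    return 4;
}
/* e.g. U+1F4A9 (the fruit name suggested above) encodes to F0 9F 92 A9,
   a perfectly valid 4-byte sequence that a 3-byte cap like MySQL's old utf8 rejects */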
Re: (Score:1)
Hell, there are 5-byte UTF-8 characters too (how would we represent UTF-32 characters in UTF-8 otherwise)?
The nicest thing would of course be if the world could hurry up and switch to English so we could just have ASCII for most everything, and UTF-32 for museums and whatnot that needed to store Linear-B or ancient hieroglyphs ;-] [*]
[*] Written by a non-English person, in English, using a US keyboard. It's all good, bros.
Re: (Score:2)
The official spec limits UTF-8 to 10FFFF to help it play nice with UTF-16, so no 5 or 6 byte sequence is valid anymore. There aren't any characters defined above 10FFFF yet anyway. But in the future, if those ranges are defined, it would be easy to have programs using UTF-8 utilize those characters. If you use UTF-16 like Windows, you'd be out of luck though.
Re: (Score:2)
If you use UTF-16 like Windows, you'd be out of luck though.
UTF-16 doesn't have the problem. UCS-2 (which Windows still mostly uses, even where it pretends to use UTF-16) does. UTF-16 combines the worst of both worlds: a space-inefficient variable length encoding.
Re: (Score:2)
UTF-16 is terrible, yes, but Windows does support it. I'm sure naive programmers create bad code by assuming UCS-2 and all characters being 2 bytes, but surrogate pairs like Emoticons U+1F600 - U+1F64F work just fine.
And by "out of luck" I was referring to possible future codepoints above U+10FFF. UTF-16 can only support up to that by using surrogate pairs. It does not have any way to represent higher codepoints, where as UTF-8 can easily be extended with 5 and 6 byte sequences.
Re: (Score:1)
Re: (Score:2)
The problem with that is that there are certain thoughts and concepts that can't be expressed in English.
Are you suggesting that one of the most promiscuous languages on the planet wouldn't just add a butchered version of the original word if that were truly the case? Words mean whatever we want them to mean - they're arbitrary combinations of symbols, kind of like unicode. :)
UTF-8 (Score:5, Interesting)
UTF-8 is easily adopted by C based software like Nethack because null-terminated string logic works unmodified; a UTF-8 string has no embedded nulls to trip up any code that measures string length by searching for a zero byte. For the most part things should "just work." UTF-16 and 32 strings have zero bytes embedded among characters, so you have to audit every bit of code to ensure compatibility.
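A small illustration of the point (the byte values below are just the standard UTF-8 and UTF-16LE encodings of the same word; the variable names are made up):

/* "naive" with a diaeresis on the i (U+00EF) in UTF-8: no zero bytes before the
   terminator, so strlen() and friends keep working; note strlen() returns 6 (bytes),
   not 5 (characters) */
const char n8[] = "na\xC3\xAFve";

/* the same word in UTF-16LE: every ASCII character carries an embedded zero byte,
   so byte-oriented C string code would stop after the first byte */
const unsigned char n16[] = { 'n', 0, 'a', 0, 0xEF, 0, 'v', 0, 'e', 0, 0, 0 };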
Re: (Score:1)
UTF-8 is easily adopted by C based software like Nethack because null-terminated string logic works unmodified; a UTF-8 string has no embedded nulls to trip up any code that measures string length by searching for a zero byte. For the most part things should "just work." UTF-16 and 32 strings have zero bytes embedded among characters, so you have to audit every bit of code to ensure compatibility.
Short of someone deliberately using a mismatched non-character fixed width type like uint8_t to check for a zero terminator in exactly 8-bit wide units or something equally brain-dead, the scenario above can't happen since the character width checked for is set by the type. Flipping the type from char to char16_t and char32_t should "just work".
This is not to say that there aren't problems that can occur (the blind assumption that sizeof(string) == countof(string) and referencing memory off of that,
Re: (Score:2)
Incorrect. UTF-8 works for 8-bit chars.
Here's the truth for C:
sizeof(char) <= sizeof(short int) <= s
Go with the majority (Score:5, Insightful)
In my experience, if you are upgrading legacy code that assumed straightforward ascii then utf8 is the
way to go. It was invented for the purpose by someone very smart (Ken Thompson). If there were a 'Neatest Hacks of All Time' competition utf8 would be my nomination.
The only real issues I've encountered are the usual ones of comparisons between equivalent characters and defining collating order. These stop being a problem (or more precisely 'your' problem) once you abandon the idea of rolling your own and use a decent utf8 string library.
Language (Score:1)
Re: (Score:1)
UTF-8 Already Works to Name Your Pet (Score:1)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Yeah but how many wand charges does it take to engrave your pet's name? Huh?
That's why I always name my pet Elbereth...
Unicode-related poll on Slashdot?! (Score:2)
I don't know if I'm supposed to marvel at the submitter's sarcastic nerve or laugh at the irony.
I think I'm gonna do both.
NetHack Development Team Not Dead (Score:1)
All of my fucking YES.
It wasn't just a rumor.
Re: (Score:2)
All of my fucking YES.
It wasn't just a rumor.
Just switch to a different game or a variant. You'll be happier.
Re: (Score:3)
Re: (Score:2)
It wasn't me who submitted this story. I would have done if I thought there was a chance of it being accepted - more opinions are always good and Slashdot has lots of technically-inclined users who likely have relevant opinions - but it seemed a little offtopic.
I did write the blog post for the purpose of being linked to news aggregators so that people would have more than a bare post from the devteam to introduce the issues, though.
Fuck unicode (Score:2)
Unicode is a clusterfuck. 7 bits is good enough for anyone.
The one true encoding (Score:2)
The answer is always UTF-8. It doesn't matter what project, or country, or language. Anything other than UTF-8 will cause completely avoidable problems. I wish more programmers would learn this rule, as it would make all our jobs easier.
Re: (Score:2)
for ease of use and storage efficiency with flexibility, yeah, UTF-8 is always best.
For certain types of work with specific performance characteristics, however, not so. That's usually the problem.
Re: (Score:2)
Very few places are going to be dealing in UTF-32 just because of the performance.
And certainly not Nethack.
In all my projects, I use UTF-8. Any performance hit is so far off the radar, it's just not worth worrying about.
Re: (Score:2)
Absolutely. I was just saying that it was basically the only time anything other than UTF8 matters (especially since in the time when it matters, switching from one to the other is HELL).
My wife used to work on a faceted search system made to handle a few petabytes of data... the difference was pretty huge.
Since I personally never did something like that, I never had issues just using UTF8 :)
UTF-8 (Score:1)
Re: (Score:2)
libuncursed exists because ncursesw is so bad. libuncursed is not great, but it's simple. Just the sort of band-aid we need on *nix until someone rewrites ncursesw.
Re: (Score:2)
I wrote libuncursed specifically for NH4 (but intended to work for other roguelikes too), because curses solves the wrong problem nowadays (the problem of "how to talk to an obscure terminal from the 1980s that uses nonstandard terminal codes", rather than the problem of "how to talk to a modern-day terminal emulator that's incompatible with xterm but nonetheless claims to be compatible"). I wrote more about the problems here [nethack4.org].
Vanilla/mainline NetHack doesn't use libuncursed or curses, but rather a homerolled
It's the library, not the encoding (Score:1)
What the hell? (Score:2)
What the web was built on (Score:4, Funny)
Everyone knows (or should know) that the web was built on WTF-8 which explains a lot.
Re: (Score:3, Funny)
So this is why Windows uses UTF-16? (Score:2)
Re: (Score:2)
Re: (Score:1)
Plan for the future, man.. UTF-64
Re: (Score:1)
short-term thinker!
UTF-128 FTW!
Re: (Score:2)
128 bits should be enough for anyone.
Re: (Score:2)
I have worked with some people who would consider this :)
Actually a while back I found someone was passing around instructions on how to setup some software that needed a random key for a symmetric cipher. It used a 256 bit block cipher so it needed a 256 bit key.
The instructions being passed around were clearly cut and pasted from a web site (they might have even had the url) but they remembered that we had key policies for other things and so they changed the dd command to make a 1024 bit key....because
Re: (Score:2)
UTF-32 would save memory in some cases (Score:2)
A UTF-8 string would require a pointer to it, which on a 64-bit system is 8 bytes, plus the overhead of dynamic allocation (typically 8 bytes). But if you only needed a single character, then UTF-32 could accomplish that in 4 bytes. Effectively, that makes UTF-32 one quarter the size of a typical UTF-8 implementation, when operating with the constraint that there is a single character per data structure/item/tile/object/whatever.
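In map-cell terms the trade-off looks roughly like this (the struct and field names are hypothetical, not NetHack's actual data structures):

#include <stdint.h>

struct cell_utf32 {
    uint32_t glyph;      /* one code point stored inline: a flat 4 bytes */
    /* ... other per-cell data ... */
};

struct cell_utf8 {
    char *glyph;         /* 8-byte pointer on a 64-bit system, plus a heap
                            allocation (and its bookkeeping) per cell */
    /* ... other per-cell data ... */
};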
Re: (Score:2)
As mentioned above this idea fails when combining characters are needed. This is the advantage of UTF-8 since you are forced to deal with variable length characters anyway. Support for combining chars won't be overlooked in most cases.
Re: (Score:2)
Yes, your tiled data would be limited to situations where the NFC (Normalization Form Canonical Decomposition/Composition) is a single code point. It's extremely difficult to find exceptions that are valid in an extant natural language, but they do exist.
If the tiles themselves require multiple code points, where the number of code points is greater than about 3, then UTF-8+pointer is more compact than a fixed UTF-32 array (by my rough napkin estimations). A fixed UTF-8 array doesn't really of
Re: (Score:2)
when operating with the constraint that there is a single unicode code point per data structure/item/tile/object/whatever.
Fixed that for you.
So you'd support rare chinese characters but exclude unusual letter/diacritic combinations.
Re: (Score:3)
Re: Short of memory? (Score:1)
Every codepoint, not character. Big difference. No normalization form guarantees one character per codepoint. Well, except Perl's NFG, but that requires dynamic mapping.
Re: (Score:2)
Re: Short of memory? (Score:5, Informative)
Re: (Score:2)
Re: Short of memory? (Score:4, Insightful)
What does "character" mean?
Something represented by one unicode codepoint? (making your statement a tautology)
Grapheme cluster? (what most users would consider a character)
A position in the character grid of a console?
Which brings us to the real question: to what extent do you want to support unicode? Do you care about
* Grapheme clusters that take multiple code points to represent? (letters with multiple diacritics, unusual letter/diacritic combinations etc)
* Right to left languages? (hebrew, arabic etc)
* Languages where characters merge together such that computer output looks more like handwriting than type? (see above)
* Languages where "fixed" width fonts use two different widths giving "single width" and "double width" characters? (chineese, japanese, korean)
* Characters outside of the basic multilingual plane? (rare Chinese characters, dead languages, made up languages, rare mathematical symbols)
Once you have worked through that design decision it will help you make others. What you find is that "length in unicode code points" and "unicode code point n" really aren't much more useful than "length in utf-k code units" and "utf-k code unit n". Either is fine for sanity checking string length or iterating through a string looking for a delimiter. Neither is much use for anything more than that unless you are doing a very limited implementation.
UTF-32 seems enticing initially but turns out to be fairly pointless, by the time you get to caring about non-BMP characters you are probably also going to be caring about combining characters etc and it will massively increase the size of the vast majority of text.
UTF-8 vs UTF-16 is something of a tossup. UTF-16 lets you get away with treating each unit of the string as one "character" for much longer, which may be considered either a blessing (because you don't care about the cases where it doesn't work) or a curse (because you realise your assumptions were wrong much later, after basing much more code on them). UTF-8 is smaller for text with lots of latin characters, UTF-16 is smaller for text with lots of CJK characters. UTF-8 is the usual choice on *nix systems and internet protocols. UTF-16 is the encoding chosen by Windows and Java.
Re: (Score:2)
Is there any easy way to tell where one grapheme cluster ends, and another begins? With UTF-8, it's easy to count the bits to see where one codepoint begins and ends, I hope there is something equally simple for grapheme clusters. Or perhaps it's all complicated and is different for each language?
As I understand it, it comes down to table lookups. The details of full unicode support are unfortunately not trivial and there's a reason libraries like ICU are as big as they are.
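For the curious, a sketch of what that looks like with ICU's C break-iterator API (this assumes the text has already been converted to ICU's UTF-16 UChar form, and trims most error handling):

#include <unicode/ubrk.h>

/* walk the grapheme-cluster boundaries of a UTF-16 buffer of length len */
void walk_graphemes(const UChar *text, int32_t len)
{
    UErrorCode status = U_ZERO_ERROR;
    UBreakIterator *bi = ubrk_open(UBRK_CHARACTER, "en", text, len, &status);
    if (U_FAILURE(status))
        return;

    int32_t start = ubrk_first(bi);
    for (int32_t end = ubrk_next(bi); end != UBRK_DONE;
         start = end, end = ubrk_next(bi)) {
        /* text[start] .. text[end - 1] is one user-perceived character */
    }
    ubrk_close(bi);
}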
Also, if I do accidentally split a grapheme cluster in two (while respecting codepoint boundaries), what will happen? If I attempt to display the two strings, can I expect a sensible result, or will the result be garbage?
As I understand it normally the base character is first and then things added to it follow.
So if you cut the end off a string and cut in the middle of a cluster then the last character may be missing some bits but the string is likely to be otherwise OK.
If you cut the start off a string and cut in the middle of a cluster things get
Re: (Score:3)
Definitely a Canadian
http://en.wikipedia.org/wiki/A... [wikipedia.org]
Re: (Score:2)
No, it's not about translations (Source: I'm a NetHack fork developer who's somewhat involved in this DevTeam revival thing).
Translations of games like NetHack are inherently hard. You can't use the standard approaches as the program assembles the sentences out of several parts and usually, e.g. with gettext, you translate whole sentences. But here, we have dynamic sentences where this approach can't work.
For my German translation NetHack-De, I used latin-1 internally (so I could continue to use the char* s