
NetHack Development Team Polls Community For Advice On Unicode

An anonymous reader writes: After years of relative silence, the development team behind the classic roguelike game NetHack has posted a question: going forward, what internal representation should the NetHack core use for Unicode characters? UTF8? UTF32? Something else? (See also: NH4 blog, reddit. Also, yes, I have verified that the question authentically comes from the NetHack dev team.)

  • The answer is... (Score:5, Insightful)

    by Anonymous Coward on Sunday January 11, 2015 @11:13AM (#48787165)

    utf-8

    • Re:The answer is... (Score:5, Informative)

      by KiloByte ( 825081 ) on Sunday January 11, 2015 @01:49PM (#48788141)

      For storing a single character: UCS-4 (aka UTF-32), and that's without possible combining character decoration. For everything else, UTF-8 internally, no matter what the system locale is.

      wchar_t is always damage, it shouldn't be used except in wrappers that do actual I/O: you need such wrappers as standard-compliant functions are buggy to the level of uselessness on Windows and you need SomeWindowsInventedFunctionW() for everything if you want Unicode.

      And why UTF-8 not UCS-4 for strings? UTF-8 takes slightly longer code:
      while (int l = utf8towc(&c, s))
      {
              s += l;
              do_something(c);
      }

      vs UCS-4's simpler:
      for (; *s; s++)
      {
              do_something(*s);
      }

      but UCS-4 blows up most of your strings by a factor of 4, and makes viewing stuff in a debugger really cumbersome.

      My credentials: I'm the guy who added Unicode support to Dungeon Crawl.
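For readers who want to see what sits behind a call like utf8towc(), here is a minimal sketch in C. It is not Dungeon Crawl's actual function: utf8_decode and utf8_count are hypothetical names, and returning U+FFFD while consuming one byte on malformed input is an assumption made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not Crawl's utf8towc: decodes one code point from s
   into *out and returns the number of bytes consumed (0 at the terminating
   NUL). Malformed input yields U+FFFD and consumes a single byte. */
static int utf8_decode(uint32_t *out, const char *s)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_for_len[] = { 0, 0, 0x80, 0x800, 0x10000 };
    const unsigned char *p = (const unsigned char *)s;
    uint32_t c;
    int len, i;

    if (p[0] == 0) { *out = 0; return 0; }
    if (p[0] < 0x80) { *out = p[0]; return 1; }

    if ((p[0] & 0xE0) == 0xC0)      { c = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { c = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { c = p[0] & 0x07; len = 4; }
    else { *out = 0xFFFD; return 1; }   /* stray continuation or invalid lead */

    for (i = 1; i < len; i++) {
        /* A NUL or any non-continuation byte ends the sequence early. */
        if ((p[i] & 0xC0) != 0x80) { *out = 0xFFFD; return 1; }
        c = (c << 6) | (uint32_t)(p[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if (c < min_for_len[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) {
        *out = 0xFFFD;
        return 1;
    }
    *out = c;
    return len;
}

/* Counts code points with the same loop shape as in the comment above. */
static size_t utf8_count(const char *s)
{
    uint32_t c;
    size_t n = 0;
    int l;

    while ((l = utf8_decode(&c, s)) != 0) {
        s += l;
        n++;
    }
    return n;
}
```

The extra work compared with the UCS-4 loop is all inside the decoder; the calling loop stays about as short as the original post suggests.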

      • Came here to say pretty much the same thing.

        UTF-8 is pretty easy to work with from a memory management perspective and will make it easier when upgrading an established ASCII base.

    • I just read the source and agree. Unless they think it's just so much fun to completely reimplement character handling on 20 platforms...or unless they intend to port entirely to Qt, UTF-8 is their only option.
  • More importantly, (Score:3, Insightful)

    by Anonymous Coward on Sunday January 11, 2015 @11:17AM (#48787183)

    who cares? This only affects naming your character and displaying stuff on the map.

  • What use are those characters anyway? You don't need funny accents on letters to play Nethack. 7 bits should be enough for any character set! Hardcore hackers who want a workaround can just use LaTeX codes.

  • Considering the length of their release cycle, seems to be a safe choice.

    It's not like the difference between 1/2/4 bytes would make much of a performance difference for an application like NetHack.

    Using the utf-32 internally would save them from some of the silliness the alternatives like utf-8 bring with them.

  • by Little Brother ( 122447 ) <kg4wwn@qsl.net> on Sunday January 11, 2015 @11:31AM (#48787233) Journal

    I started playing nethack before it was nethack, it was just hack. (I may well hold the record for longest time playing without an ascension, but that is beside the point.) I have played other roguelikes and keep coming back to nethack because it is the only one that keeps that same feel for me. It has had the same overall look my entire life. While the expanded character set in UTF would allow for significantly more characters to be used in drawing the map, and designating each monster with a different character, I beg of you not to do so. Keep the overall look the same, (or allow it as a compile time option at the very least) and just use UTF for the character name.

    For which implementation of UTF to use, I'd go with utf8 as it seems to have the widest adoption, or 32 because that will probably allow you the longest time before having to think about this again. I would avoid the middle ground.

    • by Anonymous Coward

      Adding Unicode for names would be nice but it also would probably introduce a ton of bugs in the process making the game less stable again. Plus, using the same character for different monsters is *part of the game*. If you get lazy and don't look if the G is a gnome vs gargoyle or something, the mistake is supposed to cost you.

      • by lgw ( 121541 )

        Adding Unicode for names would be nice but it also would probably introduce a ton of bugs in the process making the game less stable again. Plus, using the same character for different monsters is *part of the game*. If you get lazy and don't look if the G is a gnome vs gargoyle or something, the mistake is supposed to cost you.

        Thanks for reminding me why I don't play Nethack - briefly I was tempted.

        • Don't worry, that would never happen.

          G's are only ever gnomes (of differing ranks), but a g might be a gargoyle, flying gargoyle, or gremlin.

          I hope that clears things up. And for god's sake, don't genocide G's if you're playing as a gnome.

    • For which implementation of UTF to use, I'd go with utf8 as it seems to have the widest adoption, or 32 because that will probably allow you the longest time before having to think about this again. I would avoid the middle ground.

      UTF-8, while originally only defined to 31 bits and now defined to 21 bits, actually has room to trivially extend up to 43 bits. One could say it's more future-proof than UTF-32. Not that it really matters -- we're only using 17 bits right now so I doubt we'll ever get past 21. Maybe when we encounter intelligent alien life.

    • (I may well hold the record for longest time playing without an ascension)

      I think I started Hack around '92, and finally ascended in 2009. I've been trying to ascend about once a year since then.

  • by Anonymous Coward

    First off, UTF-32 is least likely to cause bugs, since all chars are the same length and thus possible to determine memory usage simply by multiplying char count by 4. So, if you're gonna do unicode, and you don't like your code to be buggy, this is the way to do it.

    That said, unicode is a travesty. Unlike ascii, there is no such thing as a complete unicode font that implements all of unicode's code points. Unicode only defines how any implemented chars should be numbered, but doesn't actually require yo

    • by Ark42 ( 522144 )

      The font issue is a silly thing to worry about. The same thing can be said of ASCII and Windows-1252. I'm sure lots of early fonts, and probably even some you find today, that claim to support all glyphs in Windows-1252, are missing the Euro sign at codepoint 0x80, because it was added later on. Even for a small character set restricted to 256 max characters, as you can see, things change over time, and fonts don't always keep up.

      • Shouldn't the font system just solve this for me in the case of display use? Sure, for typography you probably don't want magical mystery substitutions, but why can't the system figure out which of my fonts is most similar to the font I'm using and sub in missing glyphs?

        • by Ark42 ( 522144 )

          I'm pretty sure most font systems already DO do this. In fact, this was the reason I rooted my Android phone - I wanted to change the font-fallback order so that certain Kanji would display with a Japanese font instead of a Chinese one. An example is http://jisho.org/kanji/details... [jisho.org] which is drawn completely differently in Chinese fonts, to the point where Japanese readers would not know the symbol, yet both are supposed to be represented by the same codepoint, because they're the same character.
          But anyway, fo

          • by jrumney ( 197329 )

            I'm pretty sure most font systems already DO do this.

            Usually not the font systems themselves, as the font system API needs to be designed to let you use fonts in the way that suits your application, and not have random substitutions happen behind your back (though the font system provides the API functions to figure out what a good substitution font will be). But higher level UI libraries, like GTK, Qt, MFC, Windows Forms, Core Text, Skia etc. will do it.

    • First off, UTF-32 is least likely to cause bugs, since all chars are the same length and thus possible to determine memory usage simply by multiplying char count by 4.

      The memory usage of UTF-8 is also at most char count multiplied by 4. The 5- and 6-byte sequences were declared invalid when Unicode was restricted to have no character above U+10FFFF.
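A small sketch of that bound in C, assuming the RFC 3629 restriction to U+10FFFF. The function names are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one code point as UTF-8 under the U+10FFFF limit.
   Returns 0 for values that cannot be encoded (surrogates, out of range). */
static size_t utf8_encoded_len(uint32_t cp)
{
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates are not characters */
    if (cp < 0x10000) return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

/* Worst-case buffer size for n code points plus the terminating NUL. */
static size_t utf8_max_buffer(size_t n_codepoints)
{
    return n_codepoints * 4 + 1;
}

/* Exact UTF-8 byte count (without NUL) for an array of n code points. */
static size_t utf8_exact_len(const uint32_t *cps, size_t n)
{
    size_t total = 0, i;

    for (i = 0; i < n; i++)
        total += utf8_encoded_len(cps[i]);
    return total;
}
```

So the "multiply by 4" sizing rule works for both encodings; with UTF-8 it is an upper bound rather than the exact size.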

      • Re: (Score:3, Informative)

        by IcyWolfy ( 514669 )

        Terminology needs to be fixed.

        All code points are 4 bytes (in UTF-32).
        All characters (defined as a single conceptual and graphical display unit) range from 1 to 6 code points (so, 4-24 bytes).

        Sinhala:
        0dc1 0dca 200d 0dbb 0dd3
        ZHA VIRAMA ZWJ RA VOWEL-SIGN-II

        Combine to form a single displayable character (Sri). (Kind of a fancy item, but different from the sequence without the ZWJ, which would display two graphemes: S' and RII.)

        And Lithuanian:
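To make the distinction concrete, here is a hedged sketch in C that counts bytes and code points for the Sinhala sequence above when it is stored as UTF-8. utf8_codepoints is a hypothetical helper that assumes well-formed input. Counting what the user sees as one character needs grapheme segmentation rules, which this sketch does not attempt; a library such as ICU is the usual route for that.

```c
#include <stddef.h>
#include <string.h>

/* Counts code points in a well-formed, NUL-terminated UTF-8 string by
   counting every byte that is not a continuation byte (10xxxxxx). */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* U+0DC1 U+0DCA U+200D U+0DBB U+0DD3 encoded as UTF-8, 3 bytes each. */
static const char sinhala_sri[] =
    "\xE0\xB7\x81" "\xE0\xB7\x8A" "\xE2\x80\x8D" "\xE0\xB6\xBB" "\xE0\xB7\x93";
```

For this string strlen() reports 15 bytes and utf8_codepoints() reports 5 code points, while a reader sees a single unit on screen.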
        "However, not all abstract characters are encoded as a single Unicode character, and some abstr

  • UTF-8 (Score:5, Insightful)

    by Ark42 ( 522144 ) <`ten.erawtfossuehprom' `ta' `todhsals'> on Sunday January 11, 2015 @11:39AM (#48787273) Homepage

    The answer is UTF-8. It's pretty much going to be the de-facto character set now. It has backwards compatibility with ASCII, and can easily be extended in the future to support possible U+200000 - U+7FFFFFFF codepoints, as the original UTF-8 specification used to include that anyway.

    An important point is to not mess things up and end up with CESU-8 like MySQL did. There are completely valid 4-byte UTF-8 characters, so don't think of it as some special alternate UTF-8 by artificially capping UTF-8 at a max of 3 bytes per character.

    • by Anonymous Coward

      Hell, there's 5-byte UTF-8 characters too (how would we represent UTF-32 characters in UTF-8 otherwise)..

      The nicest thing would of course be if the world could hurry up and switch to English so we could just have ASCII for most everything, and UTF-32 for museums and whatnot that needed to store Linear-B or ancient hieroglyphs ;-] [*]

      [*] Written by a non-English person, in English, using a US keyboard. It's all good, bros.

      • by Ark42 ( 522144 )

        The official spec limits UTF-8 to 10FFFF to help it play nice with UTF-16, so no 5 or 6 byte sequence is valid anymore. There aren't any characters defined above 10FFFF yet anyway. But in the future, if those ranges are defined, it would be easy to have programs using UTF-8 utilize those characters. If you use UTF-16 like Windows, you'd be out of luck though.

        • If you use UTF-16 like Windows, you'd be out of luck though.

          UTF-16 doesn't have the problem. UCS-2 (which Windows still mostly uses, even where it pretends to use UTF-16) does. UTF-16 combines the worst of both worlds: a space-inefficient variable length encoding.

          • by Ark42 ( 522144 )

            UTF-16 is terrible, yes, but Windows does support it. I'm sure naive programmers create bad code by assuming UCS-2 and all characters being 2 bytes, but surrogate pairs like Emoticons U+1F600 - U+1F64F work just fine.

            And by "out of luck" I was referring to possible future codepoints above U+10FFFF. UTF-16 can only support up to that by using surrogate pairs. It does not have any way to represent higher codepoints, whereas UTF-8 can easily be extended with 5 and 6 byte sequences.
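The U+10FFFF ceiling falls straight out of the surrogate arithmetic. A minimal sketch in C, with utf16_encode as a hypothetical name chosen for this example:

```c
#include <stdint.h>

/* Encodes one code point as UTF-16. Returns the number of 16-bit units
   written to out (1 or 2), or 0 if the value cannot be represented. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* reserved for surrogates */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* BMP: one unit, as in UCS-2 */
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;                       /* 20 payload bits cannot reach this */

    cp -= 0x10000;                      /* now a 20-bit value */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

Two surrogates carry 10 bits each, so the largest value a pair can express is 0x10000 + 0xFFFFF = 0x10FFFF. The emoticon U+1F600 mentioned above comes out as the pair D83D DE00.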

  • UTF-8 (Score:5, Interesting)

    by Anonymous Coward on Sunday January 11, 2015 @11:43AM (#48787289)

    UTF-8 is easily adopted by C based software like Nethack because null-terminated string logic works unmodified; a UTF-8 string has no embedded nulls to trip up any code that measures string length by searching for a zero byte. For the most part things should "just work." UTF-16 and 32 strings have zero bytes embedded among characters, so you have to audit every bit of code to ensure compatibility.
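A quick way to see the point about embedded zero bytes, sketched in C. count_zero_bytes is a hypothetical helper, and the sample strings are assumptions made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Counts zero bytes among the first n bytes of buf. */
static size_t count_zero_bytes(const void *buf, size_t n)
{
    const unsigned char *p = (const unsigned char *)buf;
    size_t zeros = 0, i;

    for (i = 0; i < n; i++) {
        if (p[i] == 0)
            zeros++;
    }
    return zeros;
}

/* "héllo" as UTF-8: 6 bytes of text plus the terminating NUL. */
static const char hello_utf8[] = "h\xC3\xA9" "llo";

/* The same text as UTF-32 code units, zero-terminated. */
static const uint32_t hello_utf32[] = { 'h', 0xE9, 'l', 'l', 'o', 0 };
```

The UTF-8 text contains no zero bytes before its terminator, so strlen(hello_utf8) gives 6. Each of the five UTF-32 units holds three zero bytes, so any byte-oriented routine handed that buffer stops almost immediately (after 0 or 1 bytes, depending on endianness).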

    • by Anonymous Coward

      UTF-8 is easily adopted by C based software like Nethack because null-terminated string logic works unmodified; a UTF-8 string has no embedded nulls to trip up any code that measures string length by searching for a zero byte. For the most part things should "just work." UTF-16 and 32 strings have zero bytes embedded among characters, so you have to audit every bit of code to ensure compatibility.

      Short of someone deliberately using a mismatched non-character fixed width type like uint8_t to check for a zero terminator in exactly 8-bit wide units or something equally brain-dead, the scenario above can't happen since the character width checked for is set by the type. Flipping the type from char to char16_t and char32_t should "just work".

      This is not to say that there aren't any problems that can occur (the blind assumption that sizeof(string) == countof(string) and referencing memory off of that,

    • by tlhIngan ( 30335 )

      UTF-8 is easily adopted by C based software like Nethack because null-terminated string logic works unmodified; a UTF-8 string has no embedded nulls to trip up any code that measures string length by searching for a zero byte. For the most part things should "just work." UTF-16 and 32 strings have zero bytes embedded among characters, so you have to audit every bit of code to ensure compatibility.

      Incorrect. UTF-9 works for 8-bit chars.

      Here's the truth for C:

      sizeof(char) <= sizeof(short int) <= s

  • by namgge ( 777284 ) on Sunday January 11, 2015 @11:54AM (#48787347)

    In my experience, if you are upgrading legacy code that assumed straightforward ascii then utf8 is the way to go. It was invented for the purpose by someone very smart (Ken Thompson). If there were a 'Neatest Hacks of All Time' competition utf8 would be my nomination.

    The only real issues I've encountered are the usual ones of comparisons between equivalent characters and defining collating order. These stop being a problem (or more precisely 'your' problem) once you abandon the idea of rolling your own and use a decent utf8 string library.

  • Use what your programming language supports. If it supports Unicode, use UTF8 as it saves space. UTF32 isn't "one character per 32 bits" so it's no easier than UTF8.
  • by Anonymous Coward
    I'm not sure why they need to do anything, I can successfully name my pet in Nethack 3.4.3 using UTF-8 characters.
  • I don't know if I'm supposed to marvel at the submitter's sarcastic nerve or laugh at the irony.

    I think I'm gonna do both.

  • All of my fucking YES.

    It wasn't just a rumor.

    • All of my fucking YES.

      It wasn't just a rumor.

      Just switch to a different game or a variant. You'll be happier.

  • Unicode is a clusterfuck. 7 bits is good enough for anyone.

  • The answer is always UTF-8. It doesn't matter what project, or country, or language. Anything other than UTF-8 will cause completely avoidable problems. I wish more programmers would learn this rule, as it would make all our jobs easier.

    • by Shados ( 741919 )

      for ease of use and storage efficiency with flexibility, yeah, UTF-8 is always best.

      For certain types of work with specific performance characteristics, however, not so. That's usually the problem.

      • by ledow ( 319597 )

        Very few places are going to be dealing in UTF-32 just because of the performance.

        And certainly not Nethack.

        In all my projects, I use UTF-8. Any performance hit is so far off the radar, it's just not worth worrying about.

        • by Shados ( 741919 )

          Absolutely. I was just saying that it was basically the only time anything other than UTF8 matters (especially since in the time when it matters, switching from one to the other is HELL).

          My wife used to work on a faceted search system made to handle a few petabytes of data... the difference was pretty huge.

          Since I personally never did something like that, I never had issues just using UTF8 :)

  • I am a bit concerned about the statement on "libuncursed", which does not seem to be in many distros. To me it seems the change is being made to cater to non UN*X systems and hoping to move away from curses. So given the way I read the articles, I would prefer UTF-8 and try to use 'standard' libs. This way data is easily moved between different system types and the change will still support older under-powered hardware.
    • libuncursed exists because ncursesw is so bad. libuncursed is not great, but it's simple. Just the sort of band-aid we need on *nix until someone rewrites ncursesw.

    • by ais523 ( 1172701 )

      I wrote libuncursed specifically for NH4 (but intended to work for other roguelikes too), because curses solves the wrong problem nowadays (the problem of "how to talk to an obscure terminal from the 1980s that uses nonstandard terminal codes", rather than the problem of "how to talk to a modern-day terminal emulator that's incompatible with xterm but nonetheless claims to be compatible"). I wrote more about the problems here [nethack4.org].

      Vanilla/mainline NetHack doesn't use libuncursed or curses, but rather a homerolled

  • Wrong question. What text-handling (string-handling) library do you want to use? Then use whatever that library supports. If still in doubt, go for UTF-8. Then you will be less tempted to think you can handle things yourself in C code. If you do think so, you will invariably write buggy code because you don't know enough about the issues.
  • It's still the same strings as far as the end user is concerned. A UTF-16 encoded string looks the same to the end user as a UTF-8 encoded string. But given that the codebase is legacy the only sensible choice is UTF-8.
  • by RogueWarrior65 ( 678876 ) on Sunday January 11, 2015 @09:40PM (#48790427)

    Everyone knows (or should know) that the web was built on WTF-8 which explains a lot.