
Whose Bug Is This Anyway?

Posted by Soulskill
from the it's-nobody's-fault-and-everybody's-angry dept.
An anonymous reader writes "Patrick Wyatt, one of the developers behind the original Warcraft and StarCraft games, as well as Diablo and Guild Wars, has a post about some of the bug hunting he's done throughout his career. He covers familiar topics — crunch time leading to stupid mistakes and finding bugs in compilers rather than game code — and shares a story about finding a way to diagnose hardware failure for players of Guild Wars. Quoting: '[Mike O'Brien] wrote a module ("OsStress") which would allocate a block of memory, perform calculations in that memory block, and then compare the results of the calculation to a table of known answers. He encoded this stress-test into the main game loop so that the computer would perform this verification step about 30-50 times per second. On a properly functioning computer this stress test should never fail, but surprisingly we discovered that on about 1% of the computers being used to play Guild Wars it did fail! One percent might not sound like a big deal, but when one million gamers play the game on any given day that means 10,000 would have at least one crash bug. Our programming team could spend weeks researching the bugs for just one day at that rate!'"
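The stress test described in the quote can be sketched roughly as follows. This is only an illustration of the idea (allocate, compute, compare against known answers), not the actual OsStress code; the function name `os_stress_step`, the sum-of-squares workload, and the table sizes are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical table of known answers: the sum of i*i for i in [0, count). */
struct stress_case {
    size_t   count;
    uint64_t expected;
};

static const struct stress_case stress_cases[] = {
    {   16,      1240u },
    {  256,   5559680u },
    { 1024, 357389824u },
};

/* Runs one verification step.
   Returns 1 on a match, 0 on a mismatch, -1 if the allocation failed. */
int os_stress_step(size_t case_index)
{
    assert(case_index < sizeof stress_cases / sizeof stress_cases[0]);
    const struct stress_case *c = &stress_cases[case_index];

    /* volatile keeps the compiler from folding the whole test away,
       so the values really travel through memory. */
    volatile uint32_t *block = malloc(c->count * sizeof *block);
    if (block == NULL)
        return -1;

    for (size_t i = 0; i < c->count; i++)
        block[i] = (uint32_t)(i * i);

    uint64_t sum = 0;
    for (size_t i = 0; i < c->count; i++)
        sum += block[i];

    free((void *)block);
    return sum == c->expected;
}
```

A game loop could call `os_stress_step` once per frame and report any 0 result as suspected hardware trouble rather than as a game bug.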

  • by AaronLS (1804210) on Tuesday December 18, 2012 @09:46PM (#42332753)

    "The defect rate on hardware is so low you don't need to"

    I think the point of the article is to cast significant doubt on statements like this.

  • by Okian Warrior (537106) on Tuesday December 18, 2012 @10:32PM (#42332971) Homepage Journal

    We deal with this type of bug all the time in safety-certified systems (medical apps, aircraft, &c).

    Most of the time an embedded program doesn't use up 100% of the CPU time. What can you do in the idle moments?

    Each module supplies a function "xxxBIT" (where "BIT" stands for "Built In Test") which checks the module variables for consistency.

    The serial driver (SerialBIT) checks that the buffer pointers still point within the buffer, checks that the serial port registers haven't changed, and so on.

    The memory manager knows the last-used static address for the program (ie - the end of .data), and fills all unused memory with a pattern. In its spare time (MemoryBIT) it checks to make sure the unused memory still has the pattern. This finds all sorts of "thrown pointer" errors. (Checking all of memory takes a long time, so MemoryBIT only checked 1K each call.)
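    A minimal sketch of that incremental check, assuming a caller-supplied region of unused memory. The names `memory_bit_init` and `memory_bit_step`, the 0xA5 fill byte, and the struct layout are hypothetical, chosen for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FILL_PATTERN 0xA5u   /* assumed fill byte */
#define BIT_CHUNK    1024u   /* bytes checked per call ("1K each call") */

struct memory_bit {
    uint8_t *base;   /* start of the unused region */
    size_t   size;   /* length of the region in bytes */
    size_t   next;   /* offset where the next call resumes */
};

/* Fill the unused region with the pattern and reset the scan position. */
void memory_bit_init(struct memory_bit *mb, uint8_t *base, size_t size)
{
    assert(mb != NULL && base != NULL && size > 0);
    mb->base = base;
    mb->size = size;
    mb->next = 0;
    for (size_t i = 0; i < size; i++)
        base[i] = FILL_PATTERN;
}

/* Check the next chunk. Returns 1 if the pattern is intact, 0 if some
   byte in the chunk was overwritten. Wraps around at the end. */
int memory_bit_step(struct memory_bit *mb)
{
    assert(mb != NULL && mb->next < mb->size);

    size_t end = mb->next + BIT_CHUNK;
    if (end > mb->size)
        end = mb->size;

    int ok = 1;
    for (size_t i = mb->next; i < end; i++)
        if (mb->base[i] != FILL_PATTERN)
            ok = 0;

    mb->next = (end == mb->size) ? 0 : end;
    return ok;
}
```

    Calling `memory_bit_step` from the idle loop walks the whole region a chunk at a time, so a stray write is noticed within one full pass.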

    The stack pointer was checked - we put a pattern at the end of the stack, and if it ever changed we knew something went recursive or used too much stack.

    The EEPROM was checksummed periodically.

    Every module had a BIT function, and we checked every imaginable error in the processor's spare time - over and over, continuously.

    Also, every function began with a set of ASSERTs that check the arguments for validity. These were active in the released code. The extra time spent was only significant in a handful of functions, so we removed the ASSERTs only in those cases. Overall the extra time spent was negligible.
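    One way to keep argument checks in released code with the standard library alone is to undefine NDEBUG before including <assert.h>. The sketch below assumes that approach; the `ring` buffer and `ring_put` are hypothetical stand-ins for a driver function, and a certified system would more likely use its own ASSERT macro that logs before halting.

```c
/* Assumption: the release build may define NDEBUG. Undefining it here
   keeps the argument checks below compiled in regardless. */
#undef NDEBUG
#include <assert.h>
#include <stddef.h>

/* Hypothetical ring buffer, e.g. for a serial driver. */
struct ring {
    unsigned char data[64];
    size_t head;    /* next write index */
    size_t count;   /* number of bytes stored */
};

void ring_put(struct ring *r, unsigned char byte)
{
    /* Validate the arguments and the module state before doing any work. */
    assert(r != NULL);
    assert(r->head < sizeof r->data);
    assert(r->count < sizeof r->data);   /* caller must not overfill */

    r->data[r->head] = byte;
    r->head = (r->head + 1) % sizeof r->data;
    r->count++;
}
```

    The three checks cost a few compares per call, which fits the claim that the overhead only matters in a handful of hot functions.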

    The overall effect was a very "stiff" program - one that would either work completely or wouldn't work at all. In particular, it wouldn't give erroneous or misleading results: showing a blank screen is better than showing bad information, or even showing a frozen screen.

    (Situation specific: Blank screen is OK for aircraft, but not medical. You can still detect errors, log the problem, and alert the user.)

    Everyone says to only use error checking during development, and remove it on released code. I don't see it that way - done right, error checking has negligible impact, and coupled with good error logging it can turbocharge your bug-fixing.

  • Re:OsStress (Score:5, Interesting)

    by Anonymous Coward on Tuesday December 18, 2012 @10:33PM (#42332985)

    Then again, it might not be overclocking after all [msdn.com].

    More relevantly, Microsoft has access to an enormous wealth of data about hardware failures from Windows Error Reporting. This paper [microsoft.com] has some fascinating data in it:

    - Machines with at least 30 days of accumulated CPU time over an 8 month period had a 1 in 190 chance of crashing due to a CPU subsystem fault
    - Machines that crashed once had a probability of 1 in 3.3 of crashing a second time
    - The probability of a hard disk failure in the first 5 days of uptime is 1 in 470
    - Once you've had one hard disk failure, the probability of a second failure is 1 in 3.4
    - Once you've had two failures, the probability of a third failure is 1 in 1.9

    Conclusion: When you get a hard disk failure, replace the drive immediately.

  • by mykepredko (40154) on Tuesday December 18, 2012 @10:45PM (#42333039) Homepage

    I can't remember the exact code sequence, but in a loop, I had the statement:

    if (i = 1) {

    Where "i" was the loop counter.

    Most of the time the code worked properly, because other conditions would take program execution out of the loop, but every once in a while the loop would continue indefinitely.

    I finally decided to look at the assembly code and discovered that the conditional statement was setting the loop counter to 1, which kept the loop from terminating.

    I'm proud to say that my solution to preventing this from happening is to never place a literal last in a condition, instead it always goes first like:

    if (1 = i) {

    So the compiler can flag the error.
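    For comparison, here is the intended test written literal-first with the comparison operator. The function `count_ones` is made up for the example.

```c
#include <assert.h>

/* Counts how many loop iterations see the counter equal to 1. */
int count_ones(int limit)
{
    assert(limit >= 0);

    int hits = 0;
    for (int i = 0; i < limit; i++) {
        /* Typing `1 = i` here by mistake is a compile error, because a
           literal cannot be assigned to. Typing `i = 1` would compile
           and reset the counter on every pass. */
        if (1 == i) {
            hits++;
        }
    }
    return hits;
}
```

    GCC also warns about an assignment used as a condition when -Wall is enabled (the -Wparentheses warning), which catches the same slip without reordering the operands.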

    I'm still amazed at how rarely this trick is taught in programming classes and how many programmers this mistake still trips up.

    myke

  • by Okian Warrior (537106) on Tuesday December 18, 2012 @10:57PM (#42333089) Homepage Journal

    If you suspect the compiler is generating invalid machine code, try to make a minimal test case for it. If you succeed, file a bug report and add that test case; the compiler developers will appreciate it. If you don't succeed in finding a minimal test case that triggers the same issue, it's likely not a compiler bug but an issue in your program in some place where you weren't expecting it.

    Yeah, right. Let's see how that works out in practice.

    I go to the home page of the project with bug in hand (including sample code). Where do I log the problem?

    I have to register with your site. One more external agent gets my E-mail, or I have to take pains to manage multiple E-mails to avoid spam. (I don't want to be part of your community! I just thought you wanted to make your product better.)

    Once registered, I'm subscribed to your newsletter. (My temp E-mail has been getting status updates from the GCC crowd for years. My mail reader does something funky with the subject line, so responding with "unsubscribe" doesn't work for me.)

    Once entered, my E-mail and/or name is publicly available on the bug report for the next millennium. In plain text in the bug report, and sometimes in the publicly-accessible changelog - naked for the world to see (CPAN is especially flagrant).

    Sometimes the authors think it's the user's problem (no, really? This program causes gcc to core dump. How can that be *my* fault?) Sometimes the authors interpret the spec differently from everyone else (Opera - I'm looking at you). Sometimes you're just ignored, sometimes they say "We're rewriting the core system, see if it's still there at the next release", and sometimes they say "it's fixed in the next release, should be available in 6 months".

    What you really do is figure out the sequence of events that causes the problem, change the code to do the same thing in a different way (which *doesn't* trigger the error), and get on with your life. I've given up reporting bugs. It's a waste of time.

    That's how you deal with compiler bugs: figure out how to get around them and get on with your work.

    No, I'm not bitter...

  • by TubeSteak (669689) on Tuesday December 18, 2012 @11:18PM (#42333229) Journal

    Especially the just-makes-it-past-warranty crap that's sold these days.

    Actually, to get 95% of your product past the warranty period, you have to overengineer because, statistically, some of your product will fail earlier than you expect.

    So if you have a 3 year warranty, you better be engineering for 4+ years or you're going to spend a lot on replacements for the near end of the bathtub curve.

    I've had an unfortunate amount of experience with made-in-China crap that's ended up being replaced a few times within the warranty period.

  • by DigiShaman (671371) on Tuesday December 18, 2012 @11:38PM (#42333339) Homepage

    I think it's a crying shame that the PC industry hasn't forced ECC as a mandatory standard. Servers and workstations have it, and with memory as cheap as it is to fab, there's absolutely -zero- excuse not to use ECC!!! With transistors as small and densely packed as they are, errors will occur. I'll go a step further and even recommend ECC throughout the entire motherboard bridge buses. End-to-end error correction should be a requirement!

  • More error checking (Score:5, Interesting)

    by Okian Warrior (537106) on Wednesday December 19, 2012 @12:21AM (#42333621) Homepage Journal

    My previous was modded up, so here's some more checks.

    During boot, the system would execute a representative sample of CPU instructions, in order to test that the CPU wasn't damaged. Every mode of memory storage (ptr, ptr++, --ptr), add, subtract, multiply, divide, increment &c.

    During boot, all memory was checked - not a burn-in test, just a quick check for integrity. The system wrote 0x00, 0xFF, 0xA5, and 0x5A and read each value back. This checked for wires shorted to ground/VCC, and wires shorted together.
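    A rough sketch of such a quick check over a byte-addressed region, assuming the four patterns named above are written in turn. The function name `quick_memory_check` is hypothetical, and the test destroys whatever the region held, so it only suits memory that is not yet in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes each pattern over the region and reads it back.
   Returns the index of the first byte that failed, or len if all passed. */
size_t quick_memory_check(volatile uint8_t *mem, size_t len)
{
    /* 0x00 and 0xFF expose lines stuck high or low; 0xA5 and 0x5A
       alternate neighbouring bits to expose lines shorted together. */
    static const uint8_t patterns[] = { 0x00, 0xFF, 0xA5, 0x5A };

    assert(mem != NULL);

    for (size_t p = 0; p < sizeof patterns; p++) {
        for (size_t i = 0; i < len; i++)
            mem[i] = patterns[p];
        for (size_t i = 0; i < len; i++)
            if (mem[i] != patterns[p])
                return i;
    }
    return len;
}
```

    The volatile qualifier forces every write and read to reach memory instead of being optimized into a no-op.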

    During boot, the .bss segment was filled with a pattern, and as a rule, all programs were required to initialize all of their static variables. Each routine had an xxxINIT function which was called at boot. You could never assume a static variable was initialized to zero - this caught a lot of "uninitialized variable" errors.

    (This allowed us to reboot specific subsystems without rebooting the whole system. Call the SerialINIT function, and don't worry about reinitializing that section's static vars.)

    The program code was checksummed (1K at a time) continuously.

    When filling memory, what pattern should you use? The theory was that any program using an uninitialized variable would crash immediately because of the pattern. 0xA5 is a good choice:

    1) It's not 0, 1, or -1, which are common program constants.
    2) It's not a printable character
    3) It's a *really big* number (negative or unsigned), so array indexing should fail
    4) It's not a valid floating point or double
    5) Being odd, it's not a valid pointer
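    A few of those properties are easy to confirm in code. The sketch below fills a 32-bit word with a byte and looks at it as unsigned, as signed, and as a character; the helper `inspect_fill` and its struct are invented for the example, and the floating-point claim is left unchecked here.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/* How a 32-bit word filled with one byte value looks through different types. */
struct fill_view {
    uint32_t as_unsigned;
    int32_t  as_signed;
    int      printable;   /* nonzero if the fill byte is a printable char */
    int      odd;         /* nonzero if the word is an odd number */
};

struct fill_view inspect_fill(uint8_t fill)
{
    struct fill_view v;
    uint32_t word;

    memset(&word, fill, sizeof word);
    v.as_unsigned = word;
    memcpy(&v.as_signed, &word, sizeof word);   /* reinterpret the same bits */
    v.printable = isprint(fill) != 0;           /* default "C" locale assumed */
    v.odd = (word & 1u) != 0;
    return v;
}
```

    With 0xA5 the word comes out as 0xA5A5A5A5: a large unsigned value, a negative signed value, odd, and not printable in the default locale.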

    Whenever we use enums, we always start the first one at a different number; ie:

    enum Day { Sat = 100, Sun, Mon, ... };
    enum Month { Jan = 200, Feb, Mar, ... };

    Note that the enums for Day aren't the same as Month, so if the program inadvertently stores one in the other, the program will crash. Also, the enums aren't small integers (ie - 0, 1, 2), which are used for lots of things in other places. Storing a zero in a Day will cause an error.

    (This was easy to implement. Just grep for "enum" in the code, and ensure that each one starts on a different "hundred" (ie - one starts at 100, one starts at 200, and so on).)
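    Spelled out in full, the scheme might look like the sketch below. The range-check helpers and `is_weekend` are hypothetical additions that show where a mixed-up value would be caught.

```c
#include <assert.h>

/* Each enum starts on its own "hundred", so values never overlap. */
enum Day   { Sat = 100, Sun, Mon, Tue, Wed, Thu, Fri };
enum Month { Jan = 200, Feb, Mar, Apr, May, Jun,
             Jul, Aug, Sep, Oct, Nov, Dec };

int is_valid_day(int value)   { return value >= Sat && value <= Fri; }
int is_valid_month(int value) { return value >= Jan && value <= Dec; }

/* Hypothetical consumer: a Month, or a stray 0, passed in as a Day
   trips the assertion instead of being silently misread. */
int is_weekend(enum Day d)
{
    assert(is_valid_day(d));
    return d == Sat || d == Sun;
}
```

    Because C converts freely between enums and integers, the compiler will not always object to the mix-up, so the range check does the catching at run time.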

    The nice thing about safety cert is that the hardware engineer was completely into it as well. If there was any way for the CPU to test the hardware, he'd put it into the design.

    You could loopback the serial port (ARINC on aircraft) to see if the transmitter hardware was working, you could switch the A/D converters to a voltage reference, he put resistors in the control switches so that we could test for broken wires, and so on.

    (Recent Australian driver couldn't get his vehicle out of cruise-control because the on/off control wasn't working. He also couldn't turn the engine off (modern vehicle) nor shift to neutral (shift-by-wire). Hilarity ensued. Vehicle CPU should abort cruise control if it doesn't see a periodic heartbeat from the steering-wheel computer. But, I digress...)

    If you're interested in software safety systems, look up the Therac-25 some time. Particularly, the analysis of the software bugs. Had the system been peppered with ASSERTs, no deaths would have occurred.

    P.S. - If you happen to be building a safety cert system, I'm available to answer questions.

  • by Animats (122034) on Wednesday December 19, 2012 @12:49AM (#42333751) Homepage

    "The defect rate on hardware is so low you don't need to"

    I think the point of the article is to cast significant doubt on statements like this.

    Right. Google assumes their server hardware (which is cheap, not good) is flaky, and designs their software to deal with that. I've heard a Google engineer say that if they sort a terabyte twice, they get two different results.

  • by tlhIngan (30335) <(ten.frow) (ta) (todhsals)> on Wednesday December 19, 2012 @02:00AM (#42334051)

    By "serious", I mean the compiler itself doesn't crash, issues no warnings or errors, but generates incorrect code. Maybe I've just been lucky. (Or maybe QA just never found them ;-)

    I saw this once - took me weeks to solve it. Basically I had a flash driver that would occasionally erase the boot block (bad!). It was odd because we had protected the boot block both in the higher level OS as well as the code itself.

    Well, it happened and I ended up tracing through the assembly code - it turned out the optimizer worked a bit TOO well - it completely optimized out a macro call used to translate between parameters (the function to erase the block required a sector number. The OS called with the block number, so a simple multiplication was needed to convert). End result, the checks worked fine, but because the multiplication never happened, it erased the wrong block. (The erase code erased the block a sector belonged to - so sectors 0, 1, ... NUM_SECTORS_PER_BLOCK-1 erased the first block).

    A little #pragma to disable optimizations on that one function and the bug was fixed.

  • by perpenso (1613749) on Wednesday December 19, 2012 @02:02AM (#42334061)

    While at a large game company I wrote the code that collected CPU make and model, video make and model, amount of RAM, OS version, etc. Basically the type of info you see under minimum system requirements. The CPUID instruction can return a vendor string indicating who made the CPU. Intel CPUs return "GenuineIntel". On very rare and often transient occasions the reported string had a misspelling, and the misspellings generally indicated a single-bit error. Whether the cause was an overclocked CPU generating subtle errors, bad RAM, a bad power supply, or something else, I can't say. All I really know is that outside of the CPU manufacturing facility things do go wrong in hardware. The article is consistent with various things I have seen.
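    Whether a misspelled vendor string is exactly one bit away from the expected one can be tested by XOR-ing the two strings and counting set bits. The sketch below works on plain strings and does not call CPUID; both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of differing bits between two byte strings of the same length. */
int bit_distance(const char *a, const char *b, size_t len)
{
    assert(a != NULL && b != NULL);

    int bits = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char diff = (unsigned char)a[i] ^ (unsigned char)b[i];
        while (diff) {
            bits += diff & 1u;
            diff >>= 1;
        }
    }
    return bits;
}

/* Returns 1 if `reported` is exactly one bit flip away from `expected`. */
int looks_like_single_bit_error(const char *reported, const char *expected)
{
    size_t len = strlen(expected);
    if (strlen(reported) != len)
        return 0;
    return bit_distance(reported, expected, len) == 1;
}
```

    For instance, "GenuineIntem" differs from "GenuineIntel" only in the lowest bit of the last character ('l' is 0x6C, 'm' is 0x6D), so it would be classified as a single-bit error.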

  • by Josh Coalson (538042) on Wednesday December 19, 2012 @04:12AM (#42334611) Homepage

    I used to get bug reports for FLAC caused by this very same problem.

    FLAC has a verify mode when encoding which, in parallel, decodes the encoded output and compares it against the original input to make sure they're identical. Every once in a while I'd get a report that there were verification failures, implying FLAC had a bug.

    If it were actually a FLAC bug, the error would be repeatable* (same error in the same place) because the algorithm is deterministic, but upon rerunning the exact same command the users would get no error, or (rarely) an error in a different place. Then they'd run some other hardware checker and find the real problem.

    Turns out FLAC encoding is also a nice little hardware stressor.

    (* Pedants: yes, there could be some pseudo-random memory corruption, etc but that never turned out to be the case. PS I love valgrind.)

  • by FireFury03 (653718) <slashdot&nexusuk,org> on Wednesday December 19, 2012 @08:22AM (#42335531) Homepage

    Look up "bathtub curve" sometime.

    This is exactly why I cringe when I hear people saying "we need to replace that hardware because it's been running for a few years now so might fail soon" - the chances of your brand new hardware going pop are often far higher than the tired old hardware. Eventually the old kit will of course die, but in my experience that is far further into the future than most people imagine.

    I've not quite figured out the optimal hardware replacement frequency, but I tend to think that for servers (excluding the hard drives) the time you want to replace it is largely when it is no longer powerful enough to do what you want, rather than because it's a bit old and creaky and you're worried it might break.

    Hard drives, on the other hand, seem to break with reasonable frequency whatever their age, so usually I just run them (in a RAID) until they either give up, or SMART tells me they are reallocating large numbers of sectors, rather than trying to preemptively replace them.

  • by UnknownSoldier (67820) on Wednesday December 19, 2012 @01:06PM (#42337575)

    It used to; I don't know about current-gen RTSes. But back around 2000, an RTS would typically run in a lock-step model. We used fixed-point to guarantee each machine was doing the _exact_ same 3D math due to the imprecision of the FPU. ANY discrepancy and your game state was boned. I believe at the time this decision was due to network implementation -- I don't know the exact reason though since I was doing rendering / optimizations.
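    As an illustration of the kind of math involved, here is a small 16.16 fixed-point sketch. The format and the names `fixed_t`, `fixed_mul`, and `fixed_div` are assumptions for the example, not taken from any particular game. Everything is plain integer arithmetic, so every machine computes bit-identical results.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_t;                 /* 16.16: 16 integer bits, 16 fraction bits */
#define FIXED_ONE ((fixed_t)1 << 16)     /* the value 1.0 */

fixed_t fixed_from_int(int32_t n)
{
    assert(n >= -32768 && n <= 32767);   /* must fit in the integer part */
    return (fixed_t)(n * 65536);
}

/* Multiply in 64 bits, then drop the extra 16 fraction bits. Division
   truncates toward zero, which C99 defines, unlike a right shift of a
   negative value. The result is assumed to fit in 32 bits. */
fixed_t fixed_mul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * b) / 65536);
}

/* Pre-scale the dividend so the quotient keeps its 16 fraction bits. */
fixed_t fixed_div(fixed_t a, fixed_t b)
{
    assert(b != 0);
    return (fixed_t)(((int64_t)a * 65536) / b);
}
```

    Multiplying 0.5 by 0.5 this way gives exactly FIXED_ONE / 4 on every platform, which is the property a lock-step simulation depends on.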

    You also have to keep in mind the context. Back in 2000 AMD's FPU was beating the pants off Intel's (by as much as 1000%, depending on the operation!!). With Intel having such a slow FPU, you didn't rely on it unless you had to. Also, using C's 64-bit 'double' was prohibited for two reasons:

    a) the PS2 emulated it IN SOFTWARE !
    b) it was horrendously SLOW compared to 32-bit floats.

    Game programmers stayed as far away as possible from floats (and especially doubles!) as long as (reasonably) possible. For FPS you were forced to go the float route because while Intel hid the latency of the INT-to-FLOAT casts it was just easier to stay entirely in the float domain. That also opened the door for some clever optimizations like Carmack did with over-lapping the FPU and INT units but that was the rare case.

    On PS3 you take a HUGE Load-Hit-Store penalty if you try doing the naive INT32-to-FLOAT32 cast, so fixed point has fallen out of favor for performance reasons.
