Whose Bug Is This Anyway?

An anonymous reader writes "Patrick Wyatt, one of the developers behind the original Warcraft and StarCraft games, as well as Diablo and Guild Wars, has a post about some of the bug hunting he's done throughout his career. He covers familiar topics — crunch time leading to stupid mistakes and finding bugs in compilers rather than game code — and shares a story about finding a way to diagnose hardware failure for players of Guild Wars. Quoting: '[Mike O'Brien] wrote a module ("OsStress") which would allocate a block of memory, perform calculations in that memory block, and then compare the results of the calculation to a table of known answers. He encoded this stress-test into the main game loop so that the computer would perform this verification step about 30-50 times per second. On a properly functioning computer this stress test should never fail, but surprisingly we discovered that on about 1% of the computers being used to play Guild Wars it did fail! One percent might not sound like a big deal, but when one million gamers play the game on any given day that means 10,000 would have at least one crash bug. Our programming team could spend weeks researching the bugs for just one day at that rate!'"
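For a concrete picture of what such a check can look like, here is a minimal sketch in the spirit of the OsStress module described above. The real Guild Wars code is not public, so the function names, constants, and block size below are invented; the point is only the pattern: run a deterministic integer workload over a scratch buffer every frame and compare it with a result captured at startup, which on healthy hardware should never differ.

    /* A hypothetical OsStress-style self-test (names and sizes invented).
     * Write a known pattern into a scratch block, fold it into an integer
     * checksum, and compare against a result captured at startup.  On a
     * properly functioning machine the comparison never fails.            */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define STRESS_WORDS 4096

    static uint32_t stress_pass(uint32_t *block)
    {
        uint32_t acc = 2166136261u;                 /* FNV-style seed      */
        for (uint32_t i = 0; i < STRESS_WORDS; i++) {
            block[i] = i * 2654435761u;             /* deterministic fill  */
            acc = (acc ^ block[i]) * 16777619u;     /* mix the value back  */
        }
        return acc;
    }

    int main(void)
    {
        uint32_t *block = malloc(STRESS_WORDS * sizeof *block);
        if (!block)
            return 1;

        const uint32_t expected = stress_pass(block);   /* "known answer"  */

        /* In the game this would run 30-50 times per second in the main
         * loop; here a plain loop stands in for it.                       */
        for (int frame = 0; frame < 1000; frame++) {
            if (stress_pass(block) != expected) {
                fprintf(stderr, "stress check failed on frame %d: "
                                "suspect RAM, heat, or overclocking\n", frame);
                free(block);
                return 2;
            }
        }
        free(block);
        puts("stress test passed");
        return 0;
    }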
  • The memory thing... (Score:5, Informative)

    by Loopy ( 41728 ) on Tuesday December 18, 2012 @09:15PM (#42332529) Journal

    ...is pretty much what those of us who build our own systems do any time we upgrade components (RAM/CPU/motherboard) or experience unexplained errors. It's similar to running the Prime95 torture tests overnight, which also check calculations in memory against known data sets for expected values.

    Good stuff for those that don't already have a knack for QA.

  • OsStress (Score:5, Informative)

    by larry bagina ( 561269 ) on Tuesday December 18, 2012 @09:20PM (#42332567) Journal
    Microsoft found similar [msdn.com] impossible bugs when overclocking was involved.
  • Re:OsStress (Score:2, Informative)

    by Anonymous Coward on Tuesday December 18, 2012 @09:45PM (#42332739)

    That's not too surprising. For instance, if you try to read from memory too fast, the data you read may not be what was actually stored at that location. Some bits may be correct, some may not. Sometimes the incorrect values reflect the data that was on the bus the previous cycle, i.e. there has not been enough time for the change to propagate through. This can easily make the value apparently read one that should not be possible. This is why overclocking is not a good idea for mission-critical systems, although of course it can be fun to push a system a bit harder to get better performance in non-critical applications.
    John

  • by DMUTPeregrine ( 612791 ) on Tuesday December 18, 2012 @09:47PM (#42332757) Journal
    Unless you're trying to overclock.
    Admittedly that's a small percentage of the populace, even among people who build their own systems.
  • by godrik ( 1287354 ) on Tuesday December 18, 2012 @09:53PM (#42332781)

    I actually believe it. I'm sure they thought about floating-point precision problems, but most likely they only used integers; that's what Prime95 and memtest do. Integer and memory operations uncover the most common hardware failures. I've encountered many computers whose hardware turned out to be faulty when stressed, and I'm sure Guild Wars was stressful.

  • by Runaway1956 ( 1322357 ) on Tuesday December 18, 2012 @10:24PM (#42332949) Homepage Journal

    " Either it's DOA or runs forever."

    Nonsense. I bought 8 gig of memory about 4 years ago for an Opteron rig. That computer recently started having serious problems, with corrupted data and crashing. I looked at all the other components first, then finally ran memory tests. Memtest failed immediately. I removed three modules and ran memtest again; it failed immediately. I swapped in another module; memtest ran for a while, then failed. The other two modules proved to be good, so I am now running that aging Opteron with 4 gig of memory.

    Yeah, yeah, yeah - I realize a single person's anecdotal evidence doesn't carry much weight. I wonder what the statistics are though? As AaronLS already pointed out, these tests seem to indicate that my situation isn't very unusual. Components age and wear out.

  • by Alwin Henseler ( 640539 ) on Tuesday December 18, 2012 @10:39PM (#42333015)

    The defect rate on hardware is so low you don't need to - buy your stuff from Newegg, assemble, and install. Either it's DOA or runs forever.

    Look up "bathtub curve" sometime. Even well-built, perfectly working gear is aging, aging usually translates into "reduced performance / reliability", and any electronic part will fail sometime. Possibly gradually. Especially the just-makes-it-past-warranty crap that's sold these days. And there may be instabilities / incompatibilities that only show under very specific conditions (like when a system is pushed really hard).

    That's ignoring things like ambient temperature variations, CPU coolers clogging with dust over the years, sporadic contact problems on connectors, or the odd cosmic ray that nukes a bit in RAM (yes that happens [wikipedia.org], too). A lot of things must come together to have (and keep) a reliable working computer, so a lot of things can go wrong and put an end to that.

  • by scheme ( 19778 ) on Tuesday December 18, 2012 @10:52PM (#42333061)

    Yeah, yeah, yeah - I realize a single person's anecdotal evidence doesn't carry much weight. I wonder what the statistics are though? As AaronLS already pointed out, these tests seem to indicate that my situation isn't very unusual. Components age and wear out.

    Check out "A study of DRAM failures in the field" from the supercomputing 2012 proceedings. They have some interesting stats based on 5 million DIMM days of operation.

  • Re:OsStress (Score:2, Informative)

    by Anonymous Coward on Tuesday December 18, 2012 @11:06PM (#42333145)

    We all realize that when Intel bakes a bunch of processors, they come out all the same, and then Intel labels some as highspeed, some as middle, and some as low. They are then sold for different prices. However, they are the exact same CPU.

    This is not 100% correct. When Intel or other microprocessor fabricators make these things, they do use the same "mold" to "stamp out" all the processors in large batches based on the same design, but they don't get the exact same result each time. There are little differences from chip to chip: on this chip some transistors ended up a few atoms closer together than the optimum distance, so that part of the processor now heats up more when in use; on that chip someone coughed* during the process and smeared the result, so it's totally unusable; on another chip part of the cache memory is fubared and has to be disabled.

    The end result is that they have a chip and have to test it to see how well it performs, because of all these variables in the manufacturing process. One chip might be 100% reliable and stay under the desired temperature at clock speed A, while another chip, due to its unique manufacturing imperfections, has problems at clock speed A (it runs too hot, needs too much voltage, or makes calculation errors) but works just fine when lowered to clock speed B. I believe they call this process "binning", and it's the main thing that separates the chips into different speeds and capabilities.

    It is, however, a known practice that chip manufacturers will sometimes take a processor that is perfectly fine at clock speed A and label it as a slower clock speed B part, because they are running low on clock speed B parts and it makes better financial sense to sell it as such than to lower the price on their clock speed A parts. Sometimes it's more than clock speed: sometimes it's the intentional disabling of capabilities to match their budget models, like disabling some of the on-board cache memory or some of the (working) cores.

    What it comes down to is that it costs the manufacturer exactly the same to make all the different speed grades in a given family, but the chips don't all come out the same. The worst ones go to the low end, the better-quality ones to the high and expensive end, and sometimes a perfectly good high-quality chip is sold as a low-end one because they need to produce and ship more low-end parts. If you get one of those, consider yourself lucky and overclock the shit out of it. All processors can be overclocked to some degree, because the manufacturer picks an official speed at which operation is 100% stable and error-free with a normal (not aftermarket) cooling solution for the lifetime of the warranty. You just sometimes get lucky and have a processor that overclocks easily because it could have been labeled and sold as a higher-speed part to begin with. This is never guaranteed, however.

    *Yeah, it's more complicated than that: they aren't pressing molds, they're using complicated look-it-up-and-read-if-you-are-really-interested techniques to make the things so amazingly small.

  • by safetyinnumbers ( 1770570 ) on Tuesday December 18, 2012 @11:15PM (#42333213)
    That's known as "Yoda style" [codinghorror.com]
  • by mykepredko ( 40154 ) on Tuesday December 18, 2012 @11:20PM (#42333241) Homepage

    The statement:

    if (i = 1) {

    is equivalent to:

    i = i;
    if (i) { // Always true because i = 1 and i != 0

    myke

  • by Sir_Sri ( 199544 ) on Tuesday December 18, 2012 @11:42PM (#42333371)

    Even if you have a small calculation failure rate, it's not practical for an end user to recognize that as a partial hardware failure rather than a software bug.

    From the perspective of the average user, yes, it either works or it doesn't. If you use something big (like WoW or Guild Wars or the like) and they can diagnose it for you, then you might have an argument. But even then, that 1% could be overclocking or, as the author of TFA says, heat or PSU undersupply issues. That's not 'defective' hardware, that's temperamental hardware or the user doing it wrong. And because it's rare, it's not necessarily serious; most users can handle the odd application crash in something like an MMO once every few days.

    It does mean a bug hunter needs to know what is happening though.

  • by richardcavell ( 694686 ) <richardcavell@mail.com> on Tuesday December 18, 2012 @11:46PM (#42333413) Journal
    I just want to correct this, not to prove how smart I am but because there are novice programmers out there who will learn from this case. The statement:

    if (i = 1) {

    is equivalent to:

    i = 1; /* correction */
    if (i) {
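
    To make the pitfall concrete, here is a small compilable sketch (an editor's illustration, not part of the original comment; it assumes GCC or Clang with -Wall, whose -Wparentheses warning flags the accidental assignment):

    #include <stdio.h>

    int main(void)
    {
        int i = 0;

        if (i = 1) {        /* assignment, not comparison; -Wall warns here */
            puts("always taken, because i was just set to 1");
        }

        if (i == 1) {       /* what was intended */
            puts("correct comparison");
        }

        /* "Yoda style": putting the constant first turns the typo into a
         * hard compile error, since a literal cannot be assigned to.      */
        if (1 == i) {
            puts("Yoda-style comparison");
        }
        /* if (1 = i) { ... }   would not compile: "lvalue required"       */

        return 0;
    }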
  • by Anonymous Coward on Wednesday December 19, 2012 @03:16AM (#42334399)

    You're not kidding. I recently discovered a bug in glibc that causes cos() to return values ridiculously outside the range of -1 to +1 when used along with fesetround() on 64-bit systems. After I submitted the bug report (and, indeed, they posted my email address online), someone posted a link to an older bug report, from five years earlier, about a similar issue with exp() and cosh(). That report ends with something like "well, I fixed it for exp(), cosh() and sinh(). If any other functions have a similar issue, someone should file a separate bug report."

    The bug had been open for five years before it was closed... and I'm not sure they even fixed it for cos(), nor do I know if they're going to, because I can't find the glibc bug tracker anymore (how I found it the first time, I have no idea). Lesson learned: don't expect glibc to know how to do math. So I just changed my code so it no longer uses fesetround().

    Sadly, I was only using it in order to get GCC to stop using glibc and use my FPU directly instead, as glibc's math is slow, but unfortunately the same tricks don't work on 64-bit systems, where glibc always seems to be used. The result is that simply by using the tricks listed in my blog post [ecstaticlyrics.com] and recompiling for 32-bit, your math code runs twice as fast as it does when compiled for 64-bit. However, the same tricks cause the math functions to return wildly incorrect results when compiled for 64-bit.
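
    For anyone who wants to check their own toolchain, a range sanity check along these lines is easy to write (an editor's sketch, not the original reporter's test case; whether it actually trips the old glibc bug depends on the glibc version and platform, and it assumes a platform that defines all four standard rounding-mode macros):

    /* Verify that cos() stays inside [-1, 1] under each rounding mode.
     * fesetround() is C99, declared in <fenv.h>; link with -lm.          */
    #include <fenv.h>
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const int   modes[] = { FE_TONEAREST, FE_UPWARD, FE_DOWNWARD, FE_TOWARDZERO };
        const char *names[] = { "to nearest", "upward", "downward", "toward zero" };
        int failures = 0;

        for (int m = 0; m < 4; m++) {
            fesetround(modes[m]);
            for (double x = 0.0; x < 1000.0; x += 0.1) {
                double c = cos(x);
                if (c < -1.0 || c > 1.0) {   /* mathematically impossible */
                    printf("cos(%g) = %g in mode %s\n", x, c, names[m]);
                    failures++;
                }
            }
        }
        fesetround(FE_TONEAREST);            /* restore the default mode  */
        printf("%d out-of-range results\n", failures);
        return failures != 0;
    }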

  • Re:OsStress (Score:4, Informative)

    by epine ( 68316 ) on Wednesday December 19, 2012 @05:48AM (#42334983)

    Nope! It's the same processor. Sure, some come out different, but oftentimes there are loads of perfectly good processors that get underclocked for marketing reasons only.

    Even when the day arrives that we achieve molecular assembly, for two devices identically assembled with atom-for-atom correspondence there will likely still be enough variation in molecular or crystalline conformation to classify the two devices, at the margin, as "not quite the same".

    Binning levels are determined by the weakest transistor out of billions, the one with a gate thickness three deviations below the mean, and a junction length a deviation above. There is probably some facility for defective block substitution at the level of on-chip SRAM (cache memory), and maybe you can laser out an entirely defective core or two.

    As production ramps, Intel has a rough model of how the binning will play out, but this is a constantly moving target. Meanwhile, marketing is making promises to the channel on prices and volumes at the various tiers. There's no sane way to do this without sometimes shifting chips down a grade from the highest level of validation in order to meet your promises at all levels despite ripples in actual production.

    Intel is also concerned--for good reason--about dishonest remarking in the channel. There's huge profit in it, and it comes mainly at the expense of Intel's reputation. Multiplier locks help to discourage this kind of shady business practice. So yeah, a few chips do get locked into a speed grade less than the chip could feasibly achieve. This is all common sense from gizzard to gullet. What's your point, then?

    If they were an engineering firm, they'd sell one product at one price and be done with it.

    Where do you even find so many stupid engineers? The College of Engineering for Engineers Who Think Statistics Is One Big Cosmic Joke, presided over by the Edwin J. Goodwin [wikipedia.org] Chair of Defining Pi as Equal to 22/7?

  • by mcgrew ( 92797 ) * on Wednesday December 19, 2012 @12:06PM (#42337047) Homepage Journal

    My experience goes along with this. A few times I've had dual-boot computers constantly crashing on the Windows side, so I was blaming MS for their buggy software -- until the flaky hardware that made Windows flaky failed completely. Turns out that Linux is simply far more hardware fault-tolerant than Windows, rather than Windows being a bug-ridden piece of shit.

  • by Lonewolf666 ( 259450 ) on Wednesday December 19, 2012 @12:31PM (#42337275)

    Intel also charges you extra for ECC (only in server processors and mainboards), while AMD supports it in their better desktop processors. You still have to check if the mainboard does support it, though.

    A quick online price check shows that for 8 GByte of DDR3 RAM (2 sticks), you might have to pay 20 Euros more for the ECC variety compared to non-ECC from the same vendor. The more limited choice in mainboards might end up costing you another 10-20 Euros, so let's say +40 Euros to get your AMD PC with ECC RAM.

    On the Intel side, it is more like +50 Euros for a small Xeon instead of a matching i5, +100 Euros for an ECC-capable board and the same +20 for the RAM as with AMD. That makes about +170 Euros to get an Intel with ECC RAM, and was the main reason why my current PC is still an AMD...

  • by mgbastard ( 612419 ) on Wednesday December 19, 2012 @12:34PM (#42337301)
    Without the paywall: The study was performed on the Jaguar Supercomputer at Oak Ridge National Laboratories http://softerrors.info/selse/images/selse_2012/Papers/selse2012_submission_4.pdf [softerrors.info]
