Whose Bug Is This Anyway? 241
An anonymous reader writes "Patrick Wyatt, one of the developers behind the original Warcraft and StarCraft games, as well as Diablo and Guild Wars, has a post about some of the bug hunting he's done throughout his career. He covers familiar topics — crunch time leading to stupid mistakes and finding bugs in compilers rather than game code — and shares a story about finding a way to diagnose hardware failure for players of Guild Wars. Quoting: '[Mike O'Brien] wrote a module ("OsStress") which would allocate a block of memory, perform calculations in that memory block, and then compare the results of the calculation to a table of known answers. He encoded this stress-test into the main game loop so that the computer would perform this verification step about 30-50 times per second. On a properly functioning computer this stress test should never fail, but surprisingly we discovered that on about 1% of the computers being used to play Guild Wars it did fail! One percent might not sound like a big deal, but when one million gamers play the game on any given day that means 10,000 would have at least one crash bug. Our programming team could spend weeks researching the bugs for just one day at that rate!'"
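The idea behind OsStress is easy to sketch. The following is hypothetical code, not ArenaNet's actual module: fill a scratch block with a deterministic pattern, accumulate a checksum while writing, then re-read and recompute. On sound hardware the two always match.

```c
#include <stdint.h>
#include <stdlib.h>

/* One pass of an OsStress-style check. Returns 0 on success, 1 if the
 * hardware miscomputed (flaky RAM, cache, or an overclocked CPU), and
 * -1 if the block could not be allocated (can't test, not a failure). */
int os_stress_pass(void) {
    enum { N = 4096 };
    uint32_t *block = malloc(N * sizeof *block);
    if (block == NULL)
        return -1;

    uint32_t expected = 0;
    for (uint32_t i = 0; i < N; i++) {
        block[i] = i * 2654435761u;    /* deterministic fill pattern */
        expected ^= block[i];          /* checksum computed on write */
    }

    uint32_t actual = 0;
    for (uint32_t i = 0; i < N; i++)
        actual ^= block[i];            /* re-read through memory     */

    free(block);
    return actual == expected ? 0 : 1;
}
```

Guild Wars ran a check like this 30-50 times per second inside the main game loop; any mismatch was attributed to failing hardware rather than a game bug.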
The memory thing... (Score:5, Informative)
...is pretty much what those of us who build our own systems do any time we upgrade components (RAM/CPU/MB) or experience unexplained errors. It's similar to running the Prime95 torture tests overnight, which also check calculations in memory against known data sets for expected values.
Good stuff for those who don't already have a knack for QA.
Re:The memory thing... (Score:5, Interesting)
"The defect rate on hardware is so low you don't need to"
I think the point of the article is to cast significant doubt on statements like this.
Re:The memory thing... (Score:4, Informative)
Even if you have a small calculation failure rate, it's not practical for an end user to recognize that as a partial hardware failure rather than a software bug.
From the perspective of the average user, yes, it either works or it doesn't. If you use something big (like WoW/Guild Wars or the like) and they can diagnose it for you, then you might have an argument. But even then, 1% could be overclocking or, as the author of TFA says, heat or PSU undersupply issues. That's not 'defective' hardware, that's temperamental hardware or the user doing it wrong. And because it's rare it's not necessarily serious; most users can handle the odd application crash in something like an MMO once every few days.
It does mean a bug hunter needs to know what is happening though.
Re:The memory thing... (Score:4, Interesting)
"The defect rate on hardware is so low you don't need to" I think the point of the article is to cast significant doubt on statements like this.
Right. Google assumes their server hardware (which is cheap, not good) is flaky, and designs their software to deal with that. I've heard a Google engineer say that if they sort a terabyte twice, they get two different results.
Re:The memory thing... (Score:5, Informative)
Admittedly that's a small percentage of the populace, even among people who build their own systems.
Re:The memory thing... (Score:4, Insightful)
Doubtful (Score:3)
Higher-end "gamer" motherboards come with default overclock settings; it doesn't require anything more than leaving the default settings as they are for the motherboard to attempt optimal settings rather than purely the settings your CPU/memory themselves report. Going even further is also pretty easy, requiring little more than selecting performance mode from a nice graphical screen.
Yes, extreme overclocking is still for enthusiasts, but running your hardware slightly faster than recommended by your CPU/Mem
Re:The memory thing... (Score:4, Informative)
" Either it's DOA or runs forever."
Nonsense. I bought 8 GB of memory about 4 years ago for an Opteron rig. That computer recently started having serious problems, with corrupted data and crashing. I looked at all the other components first, then finally ran memory tests. Memtest failed immediately. I removed three modules and ran Memtest again; it failed immediately. Replaced with another module; Memtest ran for a while, then failed. The other two modules proved to be good, so I am now running that aging Opteron with 4 GB of memory.
Yeah, yeah, yeah - I realize a single person's anecdotal evidence doesn't carry much weight. I wonder what the statistics are though? As AaronLS already pointed out, these tests seem to indicate that my situation isn't very unusual. Components age and wear out.
Re:The memory thing... (Score:5, Informative)
Yeah, yeah, yeah - I realize a single person's anecdotal evidence doesn't carry much weight. I wonder what the statistics are though? As AaronLS already pointed out, these tests seem to indicate that my situation isn't very unusual. Components age and wear out.
Check out "A study of DRAM failures in the field" from the supercomputing 2012 proceedings. They have some interesting stats based on 5 million DIMM days of operation.
Re:The memory thing... (Score:4, Informative)
Re:The memory thing... (Score:4, Insightful)
Nah, that's pretty typical. In fact RAM is the only component other than HDDs to have a statistically significant AFR in my datacenter. At the peak I had a bit over 200 servers, and we'd have a DIMM go bad about once every other month (so say 6 of 1,200 DIMMs per year). Heck, with my ProLiants the fans and PSUs were more reliable, as we've only lost a handful of each over the last 6 years.
Re: (Score:3)
Re:The memory thing... (Score:5, Informative)
The defect rate on hardware is so low you don't need to - buy your stuff from Newegg, assemble, and install. Either it's DOA or runs forever.
Look up "bathtub curve" sometime. Even well-built, perfectly working gear is aging, aging usually translates into "reduced performance / reliability", and any electronic part will fail sometime. Possibly gradually. Especially the just-makes-it-past-warranty crap that's sold these days. And there may be instabilities / incompatibilities that only show under very specific conditions (like when a system is pushed really hard).
That's ignoring things like ambient temperature variations, CPU coolers clogging with dust over the years, sporadic contact problems on connectors, or the odd cosmic ray that nukes a bit in RAM (yes that happens [wikipedia.org], too). A lot of things must come together to have (and keep) a reliable working computer, so a lot of things can go wrong and put an end to that.
Re:The memory thing... (Score:4, Interesting)
Especially the just-makes-it-past-warranty crap that's sold these days.
Actually, to get 95% of your product past the warranty period, you have to overengineer because, statistically, some of your product will fail earlier than you expect.
So if you have a 3 year warranty, you better be engineering for 4+ years or you're going to spend a lot on replacements for the near end of the bathtub curve.
I've had an unfortunate amount of experience with made in china crap that's ended up being replaced a few times within the warranty period.
Comment removed (Score:5, Interesting)
Re: (Score:2)
I don't think any modern bus would survive without ECC. Heck, Intel even does ECC on the cache lines inside the processor these days (a feature brought down from the Itanium to the Xeons for the Nehalem generation; check out RAS for more info). The more interesting idea to me is T10-DIF, which allows ECC from disk to application and back; I'm kind of surprised it hasn't taken off.
Re:The memory thing... (Score:4, Informative)
Intel also charges you extra for ECC (only in server processors and mainboards), while AMD supports it in their better desktop processors. You still have to check if the mainboard does support it, though.
A quick online price check shows that for 8 GByte DDR3 RAM (2 sticks), you might have to pay 20 Euros more for the ECC variety, compared to non-ECC from the same vendor. The more limited choice in mainboards might cost another 10-20 Euros, so let's say +40 Euros to get your AMD PC with ECC RAM.
On the Intel side, it is more like +50 Euros for a small Xeon instead of a matching i5, +100 Euros for an ECC-capable board and the same +20 for the RAM as with AMD. That makes about +170 Euros to get an Intel with ECC RAM, and was the main reason why my current PC is still an AMD...
Re: (Score:2)
Zero?
Adding a bit-per-byte for ECC means multiplying the number of bits required by 1.125.
Not zero.
*ahem*
Re:The memory thing... (Score:4, Insightful)
I've been hearing this for the entirety of my worldly awareness (several decades), and the song remains the same.
Eventually, I'd hoped that folks would realize that they were unlucky or were just buying garbage, instead of the insipidly assuming that such-and-such widget was so perfectly constructed and planned that it failed within hours/days of the warranty expiring -- just as designed.
The truth is that no matter what the nature of the item, or the term of the limited warranty: Given sufficient quantity, some of them are going to fail mere seconds after the warranty is gone.
Such as it is.
We all want everything we buy to work perfectly and last forever, but nothing ever does. It should be no surprise that this is not the result of any conspiracy, but just life. Things wear out. (Even DIMMs.)
Re: (Score:3, Insightful)
Open the computer and blow it out with a leaf blower every six months. Solves 80% of your boot problems; no need to reinstall or re-seat components.
Re:The memory thing... (Score:4, Interesting)
Look up "bathtub curve" sometime.
This is exactly why I cringe when I hear people saying "we need to replace that hardware because it's been running for a few years now so might fail soon" - the chances of your brand new hardware going pop are often far higher than the tired old hardware's. Eventually the old kit will of course die, but in my experience that is far further into the future than most people imagine.
I've not quite figured out the optimal hardware replacement frequency, but I tend to think that for servers (excluding the hard drives) the time to replace one is largely when it is no longer powerful enough to do what you want, rather than because it's a bit old and creaky and you're worried it might break.
Hard drives, on the other hand, seem to break with reasonable frequency whatever their age, so usually I just run them (in a RAID) until they either give up, or SMART tells me they are reallocating large numbers of sectors, rather than trying to preemptively replace them.
Re:The memory thing... (Score:4, Informative)
My experience goes along with this. A few times I've had dual-boot computers constantly crashing on the Windows side, so I was blaming MS for their buggy software -- until the flaky hardware that made Windows flaky failed completely. Turns out that Linux is simply far more hardware fault-tolerant than Windows, rather than Windows being a bug-ridden piece of shit.
Re: (Score:2)
That's not true. There was a recent paper looking at memory defects and causes on the Jaguar supercomputer, and memory errors were moderately common. More surprisingly, there were cases where a single DIMM going bad would cause errors for all the DIMMs on that channel.
So, memory does go bad and it does that more frequently than you'd expect.
Re: (Score:2)
Wait its possible?! (Score:5, Funny)
You mean all those times when my code was 'fine' and I gave up, it really could have been the compiler or a memory problem?
Shit, I'm a much better programmer than I realized.
Re: (Score:2)
Re: (Score:3)
Re: (Score:3)
Wow you had really crappy computer installations to work with. In my labs, gcc was in /usr/bin and I didn't have any write permission to that directory at all to mess things up. gcc and make just always worked for me.
Re: (Score:3, Insightful)
Welcome to planet Earth. If your species expects competence in its dealings with humans you should have done more research before landing. Didn't you get those episodes of I Love Lucy we kept sending you?
Re:Wait its possible?! (Score:5, Insightful)
I've been programming professionally for over 20 years, mostly in C/C++ (MSVC, GCC, and recently Clang, and others back in the olden days). I've seen maybe two serious compiler bugs in the past 10 years. They used to be common.
On the other hand, I can't count how many times I've seen coders insist there must be a compiler bug when after investigation, the compiler had done exactly what it should according to the standard (or according to the compiler vendor's documentation when the compiler intentionally deviated from the standard).
By "serious", I mean the compiler itself doesn't crash and issues no warnings or errors, but generates incorrect code. Maybe I've just been lucky. (Or maybe QA just never found them.)
Oh, and btw, yes I realize you were joking (and I found it funny.)
Even better! (Score:5, Funny)
the compiler had done exactly what it should according to the standard...
That's even better - it means that you've found a bug in the standard! ;-)
Re: (Score:3)
I wish D [dlang.org] would gain some momentum.
Re:Wait its possible?! (Score:4, Interesting)
I saw this once - took me weeks to solve it. Basically I had a flash driver that would occasionally erase the boot block (bad!). It was odd because we had protected the boot block both in the higher level OS as well as the code itself.
Well, it happened, and I ended up tracing through the assembly code - it turned out the optimizer worked a bit TOO well: it completely optimized out a macro call used to translate between parameters. (The function to erase the block required a sector number; the OS called with the block number, so a simple multiplication was needed to convert.) End result: the checks worked fine, but because the multiplication never happened, it erased the wrong block. (The erase code erased the block a sector belonged to - so sectors 0, 1, ... NUM_SECTORS_PER_BLOCK-1 erased the first block.)
A little #pragma to disable optimizations on that one function and the bug was fixed.
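The shape of that bug can be reconstructed in a few lines. The names and sizes below are illustrative, not the original driver's code:

```c
/* Hypothetical reconstruction of the flash-driver arithmetic described
 * above; SECTORS_PER_BLOCK and all names are made up for illustration. */
#define SECTORS_PER_BLOCK 4

/* The conversion the optimizer dropped: block number -> first sector
 * of that block. With the multiply gone, a block number was passed
 * straight through as a sector number. */
#define BLOCK_TO_SECTOR(blk) ((blk) * SECTORS_PER_BLOCK)

/* The erase routine erased the whole block containing a given sector. */
static int block_containing(int sector) {
    return sector / SECTORS_PER_BLOCK;
}
```

With the conversion intact, erasing block 1 goes through sector 4, and the erase lands back on block 1. With the multiply optimized away, "sector" 1 maps to block 0 - the boot block.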
OsStress (Score:5, Informative)
Re: (Score:2, Informative)
That's not too surprising. For instance, if you try to read too fast from memory, the data you read may not be what was actually in the memory location. Some bits may be correct, some may not. Sometimes the incorrect values may relate to the data that was on the bus last cycle, e.g. there has not been enough time for the change to propagate through. This can easily lead to the data apparently read being a value that should not be possible. This is why overclocking is not a good idea for mission critical systems.
Re: (Score:2, Insightful)
We all realize that when Intel bakes a bunch of processors, they come out all the same, and then Intel labels some as highspeed, some as middle, and some as low. They are then sold for different prices. However, they are the exact same CPU.
Overclocking isn't the issue, because the CPUs are the same. The problem arises when aggressive overclocking is done by ignorant hobbyists or money-grubbing computer retailers. They overclock the computer to where it crashes, and then back off just a little bit. "The
Re:OsStress (Score:5, Insightful)
Bullshit. While Intel does occasionally bin processors into lower speeds to fulfill quotas and such, oftentimes those processors are binned lower because they can't pass the QA process at their full speed. But they can pass the QA process when running at a lower speed. These processors were meant to be the same as the more expensive line, but due to minor defects can't run stably or reliably at the higher speed. Or at least not reliably enough for Intel to sell them at full speed.
Which is a large part of why some processors in the same batch can handle it when others can't.
As much as I hate Intel, I think we can at least acknowledge that they are oftentimes doing this with good reason.
Re: (Score:2)
Can't really test an overclocked CPU ... (Score:2)
Re: (Score:3)
The typical method used by hobbyists was: overclock step by step until it crashes. Then, back off one step. You are now at the "optimal" speed - i.e. the fastest and therefore best speed. Games crash? Must be software bugs!
And one step back from an obvious crash may be in the subtle errors region, where CPU failures can't be easily distinguished from software bugs. For example, the subtle error can simply be an erroneous answer, 2+2=5 sort of stuff. If that erroneous answer is part of the calculation of where to draw something on the screen, the error may be of no consequence; one pixel off may be imperceptible. However, if that erroneous answer is ultimately part of the calculation of an array index, then being one index off may
Re:OsStress (Score:4, Informative)
When the day arrives that we achieve molecular assembly, even then, for two devices identically assembled with atom-for-atom correspondence, there will likely be enough variation in molecular or crystalline conformation remaining to classify the two devices at the margin as "not quite the same".
Binning levels are determined by the weakest transistor out of billions, the one with a gate thickness three deviations below the mean, and a junction length a deviation above. There is probably some facility for defective block substitution at the level of on-chip SRAM (cache memory), and maybe you can laser out an entirely defective core or two.
As production ramps, Intel has a rough model of how the binning will play out, but this is a constantly moving target. Meanwhile, marketing is making promises to the channel on prices and volumes at the various tiers. There's no sane way to do this without sometimes shifting chips down a grade from the highest level of validation in order to meet your promises at all levels despite ripples experienced in actual production.
Intel is also concerned--for good reason--about dishonest remarking in the channel. There's huge profit in it, and it comes mainly at the expense of Intel's reputation. Multiplier locks help to discourage this kind of shady business practice. So yeah, a few chips do get locked into a speed grade less than the chip could feasibly achieve. This is all common sense from gizzard to gullet. What's your point, then?
Where do you even find so many stupid engineers? The College of Engineering for Engineers Who Think Statistics Is One Big Cosmic Joke, presided over by the Edwin J. Goodwin [wikipedia.org] Chair of Defining Pi as Equal to 22/7?
Re: (Score:2, Informative)
We all realize that when Intel bakes a bunch of processors, they come out all the same, and then Intel labels some as highspeed, some as middle, and some as low. They are then sold for different prices. However, they are the exact same CPU.
This is not 100% correct. When Intel or other fabricators of microprocessors make the things, they do use the same "mold" to "stamp out" all the processors in large bunches based on the same design; however, they don't get the exact same result each time. There are little differences from chip to chip: on this chip some transistors ended up a few atoms closer together than the optimum distance, so this part of the processor will now heat up more when in use; or on this chip someone coughed* during th
Re:OsStress (Score:5, Interesting)
Then again, it might not be overclocking after all [msdn.com].
More relevantly, Microsoft has access to an enormous wealth of data about hardware failures from Windows Error Reporting. This paper [microsoft.com] has some fascinating data in it:
- Machines with at least 30 days of accumulated CPU time over an 8 month period had a 1 in 190 chance of crashing due to a CPU subsystem fault
- Machines that crashed once had a probability of 1 in 3.3 of crashing a second time
- The probability of a hard disk failure in the first 5 days of uptime is 1 in 470
- Once you've had one hard disk failure, the probability of a second failure is 1 in 3.4
- Once you've had two failures, the probability of a third failure is 1 in 1.9
Conclusion: When you get a hard disk failure, replace the drive immediately.
Caution: (Score:5, Funny)
Re: (Score:2)
Its them damn cosmic rays, I tell ya.
The death of Moore's law, they will be.
Re: (Score:2)
Its them damn cosmic rays, I tell ya.
The death of Moore's law, they will be.
Or the reason semiconductor houses switch from conventional (bulk CMOS) processes to Silicon-on-Insulator [wikipedia.org]. Many SOI processes are rad hardened by default.
Re: (Score:3)
Many SOI processes are rad hardened by default.
Rad hard usually means that they are not damaged by radiation, e.g. you can stick them close to an LHC beam as part of a detector and the massive radiation dose they receive will not cause the device to permanently cease functioning (or at least it will last longer before it fails). On the other hand, cosmic rays which slow down and stop in material can cause a large amount of local ionization. This can be enough to flip the state of a memory bit, which can cause crashes. As devices get smaller, the charge needed to flip a bit gets smaller too.
How to deal with compiler bugs (Score:5, Insightful)
If you suspect the compiler is generating invalid machine code, try to make a minimal test case for it. If you succeed, file a bug report and add that test case; the compiler developers will appreciate it. If you don't succeed in finding a minimal test case that triggers the same issue, it's likely not a compiler bug but an issue in your program in some place where you weren't expecting it.
How to lose time and sanity (Score:5, Interesting)
If you suspect the compiler is generating invalid machine code, try to make a minimal test case for it. If you succeed, file a bug report and add that test case; the compiler developers will appreciate it. If you don't succeed in finding a minimal test case that triggers the same issue, it's likely not a compiler bug but an issue in your program in some place where you weren't expecting it.
Yeah, right. Let's see how that works out in practice.
I go to the home page of the project with bug in hand (including sample code). Where do I log the problem?
I have to register with your site. One more external agent gets my E-mail, or I have to take pains to manage multiple E-mails to avoid spam. (I don't want to be part of your community! I just thought you wanted to make your product better.)
Once registered, I'm subscribed to your newsletter. (My temp E-mail has been getting status updates from the GCC crowd for years. My mail reader does something funky with the subject line, so responding with "unsubscribe" doesn't work for me.)
Once entered, my E-mail and/or name is publicly available on the bug report for the next millennium. In plain text in the bug report, and sometimes in the publicly-accessible changelog - naked for the world to see (CPAN is especially fragrant).
Sometimes the authors think it's the user's problem (no, really? This program causes gcc to core dump. How can that be *my* fault?). Sometimes the authors interpret the spec differently from everyone else (Opera - I'm looking at you). Sometimes you're just ignored, sometimes they say "We're rewriting the core system, see if it's still there at the next release", and sometimes they say "it's fixed in the next release, which should be available in 6 months".
What you really do is figure out the sequence of events that causes the problem, change the code to do the same thing in a different way (which *doesn't* trigger the error), and get on with your life. I've given up reporting bugs. It's a waste of time.
That's how you deal with compiler bugs: figure out how to get around them and get on with your work.
No, I'm not bitter...
Re: (Score:2)
Once entered, my E-mail and/or name is publicly available on the bug report for the next millennium. In plain text in the bug report, and sometimes in the publicly-accessible changelog - naked for the world to see (CPAN is especially fragrant).
Well, at least it smells nice.
Re: (Score:3)
...I have to register with your site. One more external agent gets my E-mail, or I have to take pains to manage multiple E-mails to avoid spam. (I don't want to be part of your community! I just thought you wanted to make your product better.)...
Let me help with one aspect.
If your email address is:
your_address@gmail.com
then you supply
your_address+domain.name@gmail.com
And if you don't use gmail, then maybe your email supplier does something similar. Or you should learn procmail if you're still managing your own.
p.s. It looks like your www.o...r.com domain/host is down.
Re:How to lose time and sanity (Score:4, Insightful)
He wrote a bug report, but it was ignored.
Re: (Score:2)
Compilers (Score:5, Funny)
For a skilled developer, I can't believe he didn't consider Dev/Test/Prod build environments running different compiler versions to be an issue (obviously, until it was an issue).
That's Development Cycle 101.
12 hours a day for weeks on end (Score:3)
I can't believe you don't understand that the brain doesn't work 100% reliably when you force it past the breaking point like this. It's Work 101.
QA fail (Score:2)
Worse, the article hints at a bigger problem:
"We had "pushed" a new build out to end-users, and now none of them could play the game!"
Which I read as: developers write & debug code, that code goes through a build server which builds it & combines it with game data etc., and the result of that is pushed to users. The obvious step missing here: make sure the exact same stuff you're pushing to users is working & tested thoroughly before release. Seems like a gaping Quality Assurance fail right there, forget differences between developer and production systems.
Skip that step and you're implic
Re: (Score:2)
Yep, seen it all (Score:5, Insightful)
I've had compilers miscompile my code, assemblers mis-assemble it, and even in a few cases CPUs mis-execute it consistently (look up CPU6 and MSP430). Random crashes due to bad memory/CPU... yep. But on very rare occasions, I find that the bug is indeed in my own code, so I check there first.
Re: (Score:2)
Typical for safety cert programs (Score:5, Interesting)
We deal with this type of bug all the time in safety-certified systems (medical apps, aircraft, &c).
Most of the time an embedded program doesn't use up 100% of the CPU time. What can you do in the idle moments?
Each module supplies a function "xxxBIT" (where "BIT" stands for "Built In Test") which checks the module variables for consistency.
The serial driver (SerialBIT) checks that the buffer pointers still point within the buffer, checks that the serial port registers haven't changed, and so on.
The memory manager knows the last-used static address for the program (i.e. the end of .data), and fills all unused memory with a pattern. In its spare time (MemoryBIT) it checks to make sure the unused memory still has the pattern. This finds all sorts of "thrown pointer" errors. (Checking all of memory takes a long time, so MemoryBIT only checked 1K each call.)
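A minimal sketch of that incremental check (illustrative only; real BIT code would get the region boundaries from the linker rather than as parameters):

```c
#include <stddef.h>
#include <stdint.h>

#define FILL_PATTERN 0xA5   /* unused RAM is pre-filled with this byte */

/* Verify the next `chunk` bytes of the unused region, advancing a
 * cursor so each call stays cheap; wraps around at the end. Returns
 * the number of corrupted bytes found (0 on a healthy system with no
 * thrown pointers). */
size_t memory_bit(const uint8_t *region, size_t region_len,
                  size_t *cursor, size_t chunk) {
    size_t bad = 0;
    for (size_t i = 0; i < chunk; i++) {
        if (region[*cursor] != FILL_PATTERN)
            bad++;
        *cursor = (*cursor + 1) % region_len;
    }
    return bad;
}
```

Called with a small `chunk` from the idle loop, this spreads a full sweep of unused memory across many frames instead of stalling on one long pass.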
The stack pointer was checked - we put a pattern at the end of the stack, and if it ever changed we knew something went recursive or used too much stack.
The EEPROM was checksummed periodically.
Every module had a BIT function, and we checked every imaginable error in the processor's spare time - over and over, continuously.
Also, every function began with a set of ASSERTs that check the arguments for validity. These were active in the released code. The extra time spent was only significant in a handful of functions, so we removed the ASSERTs only in those cases. Overall the extra time spent was negligible.
The overall effect was a very "stiff" program - one that would either work completely or wouldn't work at all. In particular, it wouldn't give erroneous or misleading results: showing a blank screen is better than showing bad information, or even showing a frozen screen.
(Situation specific: Blank screen is OK for aircraft, but not medical. You can still detect errors, log the problem, and alert the user.)
Everyone says to only use error checking during development, and remove it from released code. I don't see it that way - done right, error checking has negligible impact, and coupled with good error logging it can turbocharge your bug-fixing.
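One way to keep ASSERTs in shipped code is to log the failure and keep going instead of aborting. A hypothetical sketch - the macro name and the log-and-continue policy are assumptions, not the poster's actual system:

```c
#include <stdio.h>

static int assert_failures = 0;  /* would feed the error log in real use */

/* Release-mode ASSERT: record the failed condition with file and line
 * instead of compiling the check out. A real system would persist this
 * for later bug-fixing rather than just counting. */
#define ASSERT(cond)                                              \
    do {                                                          \
        if (!(cond)) {                                            \
            fprintf(stderr, "ASSERT failed: %s (%s:%d)\n",        \
                    #cond, __FILE__, __LINE__);                   \
            assert_failures++;                                    \
        }                                                         \
    } while (0)

/* Example: argument validity checked on entry, as described above.
 * The check fires and is logged, but the function still returns a
 * sane clamped value. */
static int clamp_percent(int v) {
    ASSERT(v >= 0 && v <= 100);
    if (v < 0)   return 0;
    if (v > 100) return 100;
    return v;
}
```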
More error checking (Score:5, Interesting)
My previous was modded up, so here's some more checks.
During boot, the system would execute a representative sample of CPU instructions, in order to test that the CPU wasn't damaged. Every mode of memory storage (ptr, ptr++, --ptr), add, subtract, multiply, divide, increment &c.
During boot, all memory was checked - not a burn-in test, just a quick check for integrity. The system wrote 0x00, 0xFF, 0xA5, 0x5A and read each value back. This checked for wires shorted to ground/VCC, and wires shorted together.
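That boot check can be sketched like so (illustrative; a real test runs before .data/.bss are live, typically driven from registers or a reserved region):

```c
#include <stddef.h>
#include <stdint.h>

/* Quick boot-time RAM integrity check: 0x00/0xFF catch data lines stuck
 * at ground or VCC; the alternating-bit pair 0xA5/0x5A catches adjacent
 * lines shorted together. Returns 0 if the region passes, -1 on fault. */
int quick_ram_check(volatile uint8_t *mem, size_t len) {
    static const uint8_t patterns[] = { 0x00, 0xFF, 0xA5, 0x5A };
    for (size_t p = 0; p < sizeof patterns; p++) {
        for (size_t i = 0; i < len; i++)
            mem[i] = patterns[p];            /* fill the whole region */
        for (size_t i = 0; i < len; i++)
            if (mem[i] != patterns[p])       /* then verify it        */
                return -1;
    }
    return 0;
}
```

The `volatile` qualifier matters here: without it, an optimizer is free to assume the value just written is still there and skip the read-back entirely.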
During boot, the .bss segment was filled with a pattern, and as a rule, all programs were required to initialize all of their static variables. Each routine had an xxxINIT function which was called at boot. You could never assume a static variable was initialized to zero - this caught a lot of "uninitialized variable" errors.
(This allowed us to reboot specific subsystems without rebooting the whole system. Call the SerialINIT function, and don't worry about reinitializing that section's static vars.)
The program code was checksummed (1K at a time) continuously.
When filling memory, what pattern should you use? The theory was that any program using an uninitialized variable would crash immediately because of the pattern. 0xA5 is a good choice:
1) It's not 0, 1, or -1, which are common program constants.
2) It's not a printable character
3) It's a *really big* number (negative or unsigned), so array indexing should fail
4) It's not a valid floating point or double
5) Being odd, it's not a valid pointer
Whenever we use enums, we always start the first one at a different number; ie:
enum Day { Sat = 100, Sun, Mon, ... };
enum Month { Jan = 200, Feb, Mar, ... };
Note that the enums for Day aren't the same as Month, so if the program inadvertently stores one in the other, the program will crash. Also, the enums aren't small integers (ie - 0, 1, 2), which are used for lots of things in other places. Storing a zero in a Day will cause an error.
(This was easy to implement. Just grep for "enum" in the code, and ensure that each one starts on a different "hundred" (ie - one starts at 100, one starts at 200, and so on).)
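A sketch of the convention, with the range check it enables (the validity helpers are my addition for illustration, not the poster's code):

```c
/* Each enum starts at its own "hundred", so values from different
 * enums - and common integers like 0 or 1, or an 0xA5 fill byte
 * (= 165) - never fall inside another enum's valid range. */
enum Day   { Sat = 100, Sun, Mon, Tue, Wed, Thu, Fri };
enum Month { Jan = 200, Feb, Mar, Apr, May, Jun,
             Jul, Aug, Sep, Oct, Nov, Dec };

/* A stored Day accidentally holding a Month, a zero, or fill-pattern
 * garbage fails these checks immediately instead of silently
 * misbehaving. */
int is_valid_day(int v)   { return v >= Sat && v <= Fri; }
int is_valid_month(int v) { return v >= Jan && v <= Dec; }
```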
The nice thing about safety cert is that the hardware engineer was completely into it as well. If there was any way for the CPU to test the hardware, he'd put it into the design.
You could loopback the serial port (ARINC on aircraft) to see if the transmitter hardware was working, you could switch the A/D converters to a voltage reference, he put resistors in the control switches so that we could test for broken wires, and so on.
(Recent Australian driver couldn't get his vehicle out of cruise-control because the on/off control wasn't working. He also couldn't turn the engine off (modern vehicle) nor shift to neutral (shift-by-wire). Hilarity ensued. Vehicle CPU should abort cruise control if it doesn't see a periodic heartbeat from the steering-wheel computer. But, I digress...)
If you're interested in software safety systems, look up the Therac-25 sometime. Particularly the analysis of the software bugs. Had the system been peppered with ASSERTs, no deaths would have occurred.
P.S. - If you happen to be building a safety cert system, I'm available to answer questions.
Cruise control (Score:3)
The actual error was more nuanced than I had room to describe in the comment.
Car systems pass messages to each other using an internal bus. In this particular incident, one of the systems failed in a way that made it continuously spew out messages, using all the bandwidth so that no other system could get a message through.
The brakes will normally abort cruise control, but there's no direct wire to the engine computer - it's all messages passed around on the bus. Similar with the On/Off button (newer cars
Re: (Score:2)
> Everyone says to only use error checking during development, and remove it on released code. I don't see it that way - done right, error checking has negligible impact,
That depends on the _type_ of app. In the games industry a _debug_ build runs TOO slow to even be practical. You are forced to run optimized code if you want any hope of going above 1 fps.
TANSTAAFL. Error checking costs. If I was doing software where somebody's life depended on it -- hell yeah, you're spot on! But for a "game" yo
Stress testing: most critical Overclocking step! (Score:3)
This is why stress testing is so important. The system may seem stable at overclocked speeds but only while it is lightly or even moderately loaded, and not every error will result in a kernel panic. The hardest errors to get stable are often the subtle ones that cause cascades elsewhere, minutes or hours after the load finished.
I start by getting it stable enough to pass memtest86+ tests 5 and 7 at (or as close as possible to) my target frequencies/dividers. This is pretty easy to do nowadays, but it's a good sanity-check starting point before booting the OS, and it minimizes gross misconfigurations that cause filesystem corruption.
Then I run prime95, then LINPACK, then y-cruncher, then loops of a few 3DMark versions. Sometimes I run the number crunchers simultaneously across all cores, first configured to stress the CPU/cache, then with large sets to stress RAM (but not swap! in fact turn swap off for this). The minimum time for all of this really should be 12 hours; 24 is best, or more if you're paranoid. A variety of loads over this time is important because the synthetic ones are often highly repetitious, and this can sometimes fail to expose problems despite the load the system's under. The 3DMark runs (or pick a scriptable util of your choice) stress bus I/O as well as all the really cranky and picky gfx driver code. As a unique stressor, I use a Quake 3 map compile that eats most of the RAM and pegs the CPU for hours; q3map2 is a bitch and it usually finds those subtle 'non-fatal' hardware errors if they exist.
If the boot survives without an application or kernel crash (or other wonky behavior), I run a few games in timedemo loops. In the old days this was Quake 1/2/3, but these days I stick with games like Metro 2033 which have their own bench utilities. These tests are still valid even if your intended use is 'workstation' class work and you don't game much but still want to squeeze as much performance as you can from your hardware. I do both with mine and have had great success with this method.
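The OsStress idea from the article that prompted all this can be sketched roughly like so (a simplified, assumed version; ArenaNet's actual module isn't public):

```c
#include <stdlib.h>

/* Write a deterministic pattern into freshly allocated memory, then
   read it back and recompute: on healthy hardware the comparison can
   never fail, so any mismatch implicates the machine, not the game.
   Returns 1 on pass, 0 on a detected hardware fault. */
int os_stress_pass(void) {
    enum { N = 4096 };
    unsigned int *buf = malloc(N * sizeof *buf);
    if (!buf) return 1;            /* can't test; don't report a false failure */

    for (unsigned int i = 0; i < N; i++)
        buf[i] = i * 2654435761u + 12345u;   /* deterministic pattern */

    int ok = 1;
    for (unsigned int i = 0; i < N; i++)
        if (buf[i] != i * 2654435761u + 12345u)
            ok = 0;                /* memory or ALU disagreed with itself */

    free(buf);
    return ok;
}
```

Run from the main loop a few dozen times a second, a check like this turns "mystery crash reports" into a concrete "your hardware is failing" diagnosis.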
Don't forget to do out of the box testing (Score:2)
Don't forget to do out of the box testing / testing for stuff that you may not think of off hand.
Reminded me of my first C application (Score:4, Interesting)
I can't remember the exact code sequence, but in a loop, I had the statement:
if (i = 1) {
Where "i" was the loop counter.
Most of the time the code would work properly, as other branches would take program execution elsewhere, but every once in a while the loop would continue indefinitely.
I finally decided to look at the assembly code and discovered that in the conditional statement, I was setting the loop counter to 1 which was keeping it from executing.
I'm proud to say that my solution for preventing this is to never place a literal last in a condition; instead it always goes first, like:
if (1 = i) {
So the compiler can flag the error.
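A minimal sketch of the habit being described (the helper name is mine, purely for illustration):

```c
/* The accident:   if (i = 1)   compiles, assigns 1, and always tests true.
   The safeguard:  if (1 == i)  because the typo "if (1 = i)" is a compile
                   error ("lvalue required"), so it can never slip through. */
int yoda_check(int i) {
    if (1 == i)    /* constant first: a dropped '=' here won't compile */
        return 1;
    return 0;
}
```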
I'm still amazed at how rarely this trick is taught in programming classes and how many programmers it still trips up.
myke
another trick: stop mixing up testing + assignment (Score:2)
anything that both assigns and tests a loop index is by definition a fucking accident waiting to happen. its like driving a car without wearing a seatbelt and then deciding the 'solution' is to put the steering wheel in the back seat instead of the front.
Re: (Score:2)
I agree, but
if (i = 1) {
is a perfectly valid "C" statement (Java, with its boolean-only conditions, would actually reject it) - there was no intention of putting an assignment in a conditional statement.
Modern compilers now issue warnings on statements like this, but at the time no warning was given.
myke
Re: (Score:3)
The GP's tip is about avoiding accidentally writing the assignment operator "=" when you intended the equality operator "==". If you follow the GP's habit there can be no such accidents, because it's illegal to assign a value to a constant. The coding tip itself has been circulating for at least the 23 years I've personally known about it; in those days compilers did not warn you about assignments within conditionals: the compiler saw valid syntax and assumed you knew what you were doing.
Re: (Score:2)
The reference, like, do I.
myke
Re: (Score:2)
Which is why I always compile with -Wall -Werror on gcc. I get "warning: suggest parentheses around assignment used as truth value [-Wparentheses]" for code which looks like that. I consider code that generates compiler warnings a bad sign, and always make it a point to clean them up before considering any code suitable. I don't know why this isn't as widely done as it should be.
Confusing poing in parent article (Score:2)
Hiya,
I just noticed the confusion when I put in "keeping it from executing" when I meant to say "keeping it from exiting [the loop]".
Sorry about that,
myke
Re: (Score:2)
Should I even note that I misspelled "point"?
myke
Re: (Score:2)
if ( 1 == i ) {
Re: (Score:2)
Ick, that's what real compilers (e.g., gcc) are for: good warning messages (such as "suggest parentheses around assignment used as truth value"), and better yet, -Werror. "if (1 == i)" is completely unnatural (for an English speaker anyway), which makes it more likely you'll forget to write 1 == i than forget to double the equals sign. I too used to make the same mistake when I first started with C (having come from Pascal; that was fun: := became =, = became ==), but I quickly learned to double check my tes
Re: (Score:2, Informative)
The statement:
if (i = 1) {
is equivalent to:
i = i; // Always true because i = 1 and i != 0
if (i) {
myke
Re:Reminded me of my first C application (Score:5, Informative)
if (i = 1) {
is equivalent to:
i = 1;
if (i) {
Bugs A Noy (Score:2)
In my own coding, I tend to *gasp* make mistakes. Sometimes, really, really dumb ones.
One of the biggest problems with my coding is that I am often the only real coder looking at it. Even my FOSS work seldom gets reviewed by other coders.
I can't say enough about peer review. I wish I had more of it. It can really suck, as one thing that geeks LOVE to do is cut down other geeks. However, they are sometimes right, and should be heard.
Negative feedback makes the product better. Positive feedback makes the producer feel
Yes, hardware errors happen! (Score:2)
It's just a matter of whether you realize it or not.
The blatant ones cause an application or OS crash. But depending on what got corrupted, it might just cause a momentary application glitch, or even cause an alteration in the contents of a file that you won't notice for weeks... if ever.
When I build PCs, they get an overnight Memtest run at a minimum. Most of the time I also use ECC RAM to protect against random flipped bits and DIMMs that fail after being in use for a while.
I've seen this before... (Score:5, Interesting)
FLAC has a verify mode when encoding which, in parallel, decodes the encoded output and compares it against the original input to make sure they're identical. Every once in a while I'd get a report that there were verification failures, implying FLAC had a bug.
If it were actually a FLAC bug, the error would be repeatable* (same error in the same place) because the algorithm is deterministic, but upon rerunning the exact same command the users would get no error, or (rarely) an error in a different place. Then they'd run some other hardware checker and find the real problem.
Turns out FLAC encoding is also a nice little hardware stressor.
(* Pedants: yes, there could be some pseudo-random memory corruption, etc but that never turned out to be the case. PS I love valgrind.)
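The verify-on-encode pattern can be sketched with a toy reversible transform standing in for the codec (this is not FLAC's actual API; the XOR "codec" and function names are purely illustrative):

```c
#include <string.h>
#include <stddef.h>

/* Toy reversible "codec": XOR with a constant. A real codec would be
   a lossless compressor, but the verify logic is the same. */
void toy_encode(const unsigned char *in, unsigned char *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = (unsigned char)(in[i] ^ 0x5A);
}
void toy_decode(const unsigned char *in, unsigned char *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = (unsigned char)(in[i] ^ 0x5A);
}

/* Encode, immediately decode the result, and compare with the input.
   With a deterministic lossless codec on sound hardware this can never
   fail, so a mismatch points at the machine rather than the software.
   Returns 1 if verified, 0 otherwise. */
int verify_encode(const unsigned char *in, unsigned char *out, size_t n) {
    unsigned char check[256];
    if (n > sizeof check) return 0;   /* keep the sketch simple */
    toy_encode(in, out, n);
    toy_decode(out, check, n);
    return memcmp(in, check, n) == 0;
}
```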
Re: (Score:3)
While hardware could definitely be at fault, have you considered the possibility of a race condition causing the error? Race conditions may occur very infrequently and can be incredibly difficult to discover. I'm not trying to disparage the FLAC co
Re:stress test (Score:5, Funny)
In my field, I have a bunch of grass, a few shrubs and even a small tree. Lots of rodents and birds. If a computer can survive two weeks sitting in my field and still power on, you have a damned good system. If not, you're left with people wondering why you left your computer in my field for two weeks.
Re:stress test (Score:5, Funny)
He didn't say anything about a computer: "In my field, if YOU can survive"... scary...
Re: (Score:2)
Funny though, I like what you did there.
Re: (Score:2)
So do you suggest Guild Wars incorporate a GCC testsuite run in parallel with the game?
except if GCC is wrong (Score:2)
which, well, it can be.
Re:I don't believe 1% of computers give wrong answ (Score:5, Insightful)
You don't have any idea what you're talking about, and that's why you don't understand what he's talking about.
Re:I don't believe 1% of computers give wrong answ (Score:5, Informative)
I actually believe it. I'm sure they thought of the floating-point precision problem, but most likely they only used integers. That's what Prime95 and memtest are doing. Integer and memory operations uncover the most common hardware failures. I've encountered many computers with faulty hardware under stress, and I'm sure Guild Wars was stressful.
Re: (Score:3)
> I actually believe it. I'm sure they thought of the floating-point precision problem.
I can believe it. Ten years ago, on one of the PC games I worked on, there were significant floating-point differences between Intel and AMD. Fortunately it was an RTS, so we could get away with fixed-point. If we had been forced to deal with floats, it would have been a hassle to keep them "in sync."
Floating-point is an approximation anyways, so IMHO
a) the server should be making the authoritative decision(s), an
Re: (Score:3)
Fortunately it was an RTS so we could get away with fixed-point.
Does it really vary by genre? For a game world the size of Liechtenstein, a 32-bit fixed-point length gives precision down to 10 microns or so. And even in a vast open world, you start to get glitches like the far lands in Minecraft [minecraftwiki.net] if you stray more than 12.5 million units from the origin.
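A quick back-of-envelope check of that precision figure (the ~25 km span for Liechtenstein's length is my assumption):

```c
/* Map a world span onto 2^32 fixed-point steps and report the size of
   one step in microns. */
double microns_per_step(double span_m) {
    return span_m / 4294967296.0 * 1e6;   /* metres per step -> microns */
}

/* microns_per_step(25000.0) is about 5.8, consistent with the parent's
   "10 microns or so". */
```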
Re:I don't believe 1% of computers give wrong answ (Score:4, Interesting)
It used to, I don't know about current gen RTS's. But back ~2000 RTS typically you would run in a lock-step model. We used fixed-point to guarantee each machine was doing the _exact_ same 3D math due to the imprecision of the FPU. ANY discrepancy and your game state was boned. I believe at the time this decision was due to network implementation -- I don't know the exact reason though since I was doing rendering / optimizations.
You also have to keep in mind the context. Back in 2000, AMD's FPU was beating the pants off Intel's (depending on the operation, by as much as 1000%!). With Intel having such a slow FPU you didn't rely on it unless you had to. Also, using C's 64-bit 'double' was prohibited for two reasons:
a) the PS2 emulated it IN SOFTWARE !
b) it was horrendously SLOW compared to 32-bit floats.
Game programmers stayed as far away from floats (and especially doubles!) as long as reasonably possible. For FPS you were forced to go the float route because, while Intel hid the latency of the INT-to-FLOAT casts, it was just easier to stay entirely in the float domain. That also opened the door for some clever optimizations like Carmack did with overlapping the FPU and INT units, but that was the rare case.
On PS3 you take a HUGE Load-Hit-Store penalty if you try doing the naive INT32-to-FLOAT32 cast, so fixed point has fallen out of favor for performance reasons.
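A minimal 16.16 fixed-point sketch of the kind of deterministic math being described (the format and helper names are illustrative, not from any shipped engine):

```c
#include <stdint.h>

/* 16.16 fixed point: 16 integer bits, 16 fractional bits. Pure integer
   math gives bit-identical results on every CPU, unlike the x87-era
   FPU, which is what lockstep RTS networking needed. */
typedef int32_t fixed;
#define FX_ONE 65536

fixed fx_from_int(int v)       { return (fixed)(v * FX_ONE); }
fixed fx_mul(fixed a, fixed b) { return (fixed)(((int64_t)a * b) >> 16); }
fixed fx_div(fixed a, fixed b) { return (fixed)(((int64_t)a << 16) / b); }
```

The widening to `int64_t` before the multiply/divide is the important detail: it keeps the intermediate result from overflowing before the scale is shifted back out.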
Re:I don't believe 1% of computers give wrong answ (Score:5, Insightful)
I think this is bull. I just don't believe 1% of computers give wrong answers
1% of all computers? Probably not.
1% of gamers' computers, in an era when PC gaming technology was progressing very quickly, and so gamers were often running overclocked (or otherwise poorly set up) hardware? Sounds plausible enough.
Re: (Score:2)
I won't go into the specific reasons you mention, but it is perfectly possible to write code that has a known, fully deterministic result. After all, compilers produce machine code, and the bulk of that is integer operations, which have exactly defined behavior with zero room for interpretation (when it comes to digital logic like CPUs, "defined" means deterministic). Maybe there are exceptions (like floating point? don't count on it), maybe for some types of operations you need to sidestep a compiler and code some
Re:I don't believe 1% of computers give wrong answ (Score:5, Insightful)
He said 1% of computers that were used to play Guild Wars gave wrong answers. Gaming PCs are more likely to be overclocked too far, have under-dimensioned power supplies or overheating issues than the average PC. 1% doesn't sound unrealistically high to me.
Re: (Score:3)
He said 1% of computers that were used to play Guild Wars gave wrong answers. Gaming PCs are more likely to be overclocked too far, have under-dimensioned power supplies or overheating issues than the average PC. 1% doesn't sound unrealistically high to me.
Guild Wars runs okay on a crappy netbook, let alone anyone's PC. If you need to OC to play Guild Wars then you are using a Pentium III.
Just saying.
I've fixed a lot of PCs. In fact, I pride myself on usually being able to figure out what is wrong with a computer, software or hardware. I've seen some funky hardware in my time, and seen a lot of hardware go bad.
I bet more than 1% of computers are problematic and the owners don't know it. The last chunk of memory could be bad, but if they do
Re: (Score:2)
Faulty hardware is a common problem in clusters. We still have a 5-year-old cluster at work that is falling to pieces due to faulty processors, faulty memory, faulty power supplies and faulty hard drives. Some errors are just weird. Ever seen a machine that has errors in memtest86 but stops having them once you swap two DIMMs? I see that from time to time, and I don't even spend much time dealing directly with hardware.
Re:I don't believe 1% of computers give wrong answ (Score:4, Interesting)
Re: (Score:3)