Re: think you could stay on topic?
By Peter Gerassimoff on Wednesday, December 4, 2002 9:55 PM EST
Electromigration is an exponential effect. The life is cut in half for every 0.03V or so on a 0.18u process with a certain gate thickness. Given a shrink to 0.13u, which is about 70.7% in all three dimensions, the effect is now a halving of life every 0.021V, or about 31 times for a 0.1V increase. So a stock CPU at designed specifications might last 10 years or more. Add 0.1V to the Vcore and you have just shrunk that to 1 to 2 years (some low-level destruction may be due to other factors which are more linear with respect to voltage or temperature). Add another 0.1V and you are down to a few months (now almost all due to electromigration). Add another 0.1V and it is now a few days.
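The arithmetic above can be checked with a rough sketch. It assumes a pure "life halves every X volts" model and that the halving voltage scales linearly with the shrink (0.03V x 0.707, roughly 0.021V); the function and variable names are made up for illustration. With these rounded inputs the factor per 0.1V comes out in the mid-to-high 20s, and a halving every 0.02V would give exactly 32, so "about 31 times" sits inside the rounding error of the inputs. The pure exponential also falls faster than the lifetimes quoted above, which are looser estimates that fold in other failure mechanisms.

```python
def life_reduction_factor(overvolt, halving_voltage):
    """Factor by which expected life shrinks if it halves every `halving_voltage` volts."""
    return 2.0 ** (overvolt / halving_voltage)


def remaining_life_years(stock_years, overvolt, halving_voltage):
    """Expected life after raising Vcore by `overvolt` volts (idealized model)."""
    return stock_years / life_reduction_factor(overvolt, halving_voltage)


HALVING_V_018 = 0.03   # volts per halving on 0.18u, figure quoted in the post
SHRINK = 0.707         # linear shrink from 0.18u to 0.13u, figure quoted in the post

# Assumption: the halving voltage scales linearly with the shrink.
halving_v_013 = HALVING_V_018 * SHRINK

factor_per_100mv = life_reduction_factor(0.1, halving_v_013)
print(f"halving voltage on 0.13u: {halving_v_013:.4f} V")
print(f"life reduction per +0.1V: {factor_per_100mv:.1f}x")

# Hypothetical 10-year stock life, stepped up in 0.1V increments.
for overvolt in (0.0, 0.1, 0.2, 0.3):
    years = remaining_life_years(10.0, overvolt, halving_v_013)
    print(f"+{overvolt:.1f}V -> {years * 365:.1f} days")
```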
Now if you look at temperature effects, you must realize that the hot spot problem is more insidious than most people think. High performance cooling combined with overclocking can stress the die more than normal cooling combined with overclocking. This is due to a design decision by Intel to push clock speeds by monitoring certain areas of the chip that produce more than the average amount of waste heat, and to automatically reduce the work cycle of the CPU when these monitors are tripped. This is all well and dandy if you keep the CPU above a certain temperature (unlike all past CPUs, which could run down to absolute zero). The reason is that the hot spot monitors Intel designed assume a certain range of temperatures and the gradients that go with it, which puts the trip points a certain number of degrees above the maximum allowed average die temperature, and these trip points are then set into the die masks. Now make the average die temperature lower: the gradients increase and the necessary trip points become lower, but the die still uses the old ones. Once the average temperature is low enough (vapor chilling, dry ice, LN2, etc.), the needed trip points fall far enough below the preset ones (and the margins built into them) to allow instabilities that lead to thermal runaway and thus destruction of the CPU. And there is no way to stop it other than to back off the clock rate, which reduces these gradients and thus raises the needed trip points back into the safe range.
The effects of these two failure modes interact and are both cumulative. The longer you run the CPU under these conditions, the higher the chance that you reduce it to junk.
However with all of this in mind, what may be happening is that Intel is relaxing some of the margins to make higher yielding high performance CPUs. They have done this before and it has cost them (P3-1.13G comes quickest to mind). The P4 might be one of the first MPUs that really does have a low die temp limit. These overclockers may be giving the first real proof that Intel has lowered the margins enough that power users are beginning to see these higher order effects that go contrary to previous experience. And that could be worrying for their future.
Pete
http://www.aceshardware.com/forum?read=80051215
Sudden Overclocked Northwood Death Syndrome. Is It Strange That Overclocked CPUs Eventually Die?
Posted 12/06/02 at 8:30 pm by Anton
For the last couple of days I have been keeping an eye on discussions of the so-called Sudden Northwood Death Syndrome (S.N.D.S) at various websites and forums. Apparently, numerous Intel Pentium 4 “Northwood” processors malfunction and stop working after being run at higher clock speeds with increased core voltage. Since we always raise the Vcore when we overclock our CPUs, a lot of people were very surprised that a CPU can eventually die due to this rather simple tuning.
Although there is a widely known theory that a processor will never burn out if it runs cool, it seems that this time the idea does not hold. First of all, microprocessors are very complex nowadays and contain many separate blocks; for example, the temperature of the FPU can be lower than that of the ALU when running office applications. Secondly, manufacturing processes are becoming finer and finer, so their operating conditions must be specified very precisely, and any deviation can lead to incorrect operation or malfunction. Finally, computer hardware should not be treated as “toys for big boys”: you should handle it carefully, just as you did with your first i486-based PC. With the so-called Sudden Northwood Death Syndrome (S.N.D.S), all those who had forgotten these points have certainly been reminded of them.
Now let us look at it from another point of view. There are many explanations for why Intel Pentium 4 “Northwood” processors malfunction even though they run at an average temperature below 45 degrees Celsius, according to the thermal diodes located inside the CPU. Honestly speaking, only Intel engineers are in a position to suspect the real source of the problem and perhaps localise it, if that is possible. We can only make guesses, and some of them may turn out to be right. It seems that the roots of the issue lie in the manufacturing process itself as well as in the hot spots that appear inside the CPU when the processor operates under unusual conditions. With finer fabrication processes, the likelihood of the so-called electromigration effect increases drastically. Electromigration is generally considered to be the result of momentum transfer from the electrons, which move in the applied electric field, to the ions which make up the lattice of the interconnect material. When electrons are conducted through a metal, they interact with imperfections in the lattice and scatter. Scattering occurs whenever an atom is out of place for any reason. Thermal energy produces scattering by causing atoms to vibrate. This is the source of the resistance of metals. The higher the temperature, the further out of place the atoms are, the greater the scattering and the greater the resistivity. Moreover, according to people with knowledge of the matter, voltage influences the atomic vibration even more than temperature does; as a result, if you start to increase the voltage, your Pentium 4 CPU becomes much more likely to start working incorrectly and eventually die. So, even if the average temperature of the core is relatively low, there may be hot spots inside the CPU that can lead to electromigration.
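The temperature dependence of electromigration is commonly modelled with Black's equation, MTTF = A * j^(-n) * exp(Ea / (k * T)), where j is current density and T is absolute temperature. The sketch below uses it to compare a spot at the reported 45 degrees Celsius average with a hypothetical hotter spot. The activation energy, the exponent n and the 85 degree hot spot temperature are illustrative assumptions, not figures from Intel or from the discussions mentioned here.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K


def mttf_ratio(ref_temp_c, hot_temp_c, current_density_ratio=1.0,
               activation_energy_ev=0.7, exponent_n=2.0):
    """Return MTTF(reference) / MTTF(hot spot) according to Black's equation.

    current_density_ratio is j_hot / j_ref. The defaults for
    activation_energy_ev and exponent_n are assumed, illustrative values.
    """
    t_ref = ref_temp_c + 273.15   # convert to kelvin
    t_hot = hot_temp_c + 273.15
    thermal = math.exp(activation_energy_ev / BOLTZMANN_EV_PER_K
                       * (1.0 / t_ref - 1.0 / t_hot))
    return (current_density_ratio ** exponent_n) * thermal


# 45 C average (from the article) versus a hypothetical 85 C hot spot.
ratio = mttf_ratio(45.0, 85.0)
print(f"The hot spot wears out about {ratio:.0f}x faster than the average suggests")
```

The point of the comparison is that an average reading can look perfectly safe while a local hot spot ages far faster, because the temperature term is exponential.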
The problem was not so widespread before because semiconductor manufacturing processes were coarser, and electromigration was not at all common even at very high voltages.
You can now call it Sudden Overclocked Northwood Death Syndrome, but you should understand that from this point on you will hear about such effects more and more often. When Intel and AMD start to utilise the 90 nanometer process in 2003 and 2004 respectively, electromigration will be a big challenge for them because it is exponential and depends on the manufacturing process, voltage, heat, the quality of the material and some other factors. All of these factors influence the core speed, which in turn influences performance. So it all forms a cycle, and CPU developers have to find a compromise between speed and reliability. In fact, such a compromise seems to have been found for the current Northwood processors, but when you go beyond the recommendations, your CPU may burn out.
I wonder whether the same thing may happen to AMD’s Athlon XP processors, which are manufactured using a 0.13 micron process and are sometimes overclocked with the core voltage raised to 2.0V and above from the nominal 1.5 to 1.65V.
Useful links:
- Sudden Northwood Death Syndrome Discussion at X-bit Labs Discussion Board.
- Introduction in Electromigration
http://www.xbitlabs.com/news/story.html?id=1039224602
http://www.tweakpc.de/berichte/emig/emig.htm