Date: Thu, 8 Dec 1994 18:49:14 -0600
From: "Gregory D. Peterson"
To: farber@central.cis.upenn.edu
Subject: Forward of Pentium FDIV bug discussion

Here's an interesting post - the latest in a debate raging in comp.arch concerning the Pentium FDIV bug. I thought it might be good to forward it to the IP list.

greg

Article follows:

>From pratt@Sunburn.Stanford.EDU Thu Dec 8 18:44:17 CST 1994
Article: 35319 of comp.arch
Path: wuccrc!udel!gatech!howland.reston.ans.net!news.moneng.mei.com!uwm.edu!lll-winken.llnl.gov!unixhub!news.Stanford.EDU!Sunburn.Stanford.EDU!prat
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel
Subject: Why there is no worst FDIV bug---large vs. likely
Message-ID: <3c7tav$g0m@Radon.Stanford.EDU>
Date: 8 Dec 94 21:20:31 GMT
Organization: Computer Science Department, Stanford University.
Lines: 100
Xref: wuccrc comp.arch:35319
NNTP-Posting-Host: sunburn.stanford.edu

This is a brief note to highlight an important point that is getting lost in the volume of technical traffic (to say nothing of the nontechnical) on the severity of the FDIV bug.

There are two fundamental criteria for judging an arithmetic error: magnitude and frequency. (Those who follow my work on the duality of time and information, see my .sig http, will understand the sense of "fundamental" I mean here; physicists should identify magnitude with time and frequency with energy, forming a conjugate pair.) Any time you have two or more criteria for judging something, it becomes possible to have no worst case.

Magnitude. The worst FDIV bug with regard to magnitude is Tim Coe's pair 4195835/3145727, for which the Pentium gets 1.333739 instead of the correct 1.33382045. The relative error here is .999 times 2^-14. This is the largest error observed to date, and 2^-14 may well be the maximum possible error for this bug.

Frequency.
The rate at which errors are encountered, which for this bug is extremely dependent on the application, has to do with the number and distribution of occurrences of the bug in operand space. Thus one would not a priori expect that it made sense to talk about any single bug as relevant to frequency. Nevertheless the single pair 4.999999/14.999999, more memorable as 5/15 with a millionth shaved off each operand, yielding a 2^-16 error, does tell us something. It does not tell us about the total number of bugs---after all, without further information it could be the only bug. Rather, it tells us something about the likelihood of encountering an FDIV bug.

Suppose the 5/15 pair and the Coe pair were the only two bugs. While Tim's pair hurts four times as badly as mine, I think I can safely leave it to the reader to dream up plausible applications where my pair is encountered at least four times as often as Tim's, e.g. data obtained from a data logging device that uses an analog-to-digital converter and measures numbers that for some reason concentrate around integers, or data obtained from a decimal calculator that retains only six digits after the point.

An abstract way of putting this is to say that 4.999999/14.999999 has low *Kolmogorov complexity*. The Kolmogorov complexity of any finite bit pattern is the size in bits of the smallest Turing machine that, started on a blank tape, writes down that pattern. Even though 4.999999/14.999999 is as long as 4195835/3145727, it should have lower Kolmogorov complexity on a non-Pentium. (But moving the Coe pair to a Pentium decreases its Kolmogorov complexity in principle, if not in practice, because the Pentium can describe it as that pair of odd integers x,y maximizing x-(x/y)*y; Kolmogorov complexity ignores running time. In practice today's architectures trade things off to represent pairs of numbers rather more compactly than the above program, but this need not hold for all architectures.)
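
The magnitude-versus-frequency trade-off above can be checked with a little arithmetic. The Python sketch below recomputes the Coe pair's relative error from the two quotients quoted in the post, then compares the expected error of the two pairs. The encounter rates and the helper name expected_error are hypothetical, chosen only to illustrate the argument; they are not measurements of any real workload.

```python
from fractions import Fraction
from math import log2

# Coe pair: exact quotient versus the flawed result as quoted in the post.
# The quoted value is rounded to six decimals, so this is only approximate.
correct = Fraction(4195835, 3145727)
quoted_pentium = Fraction("1.333739")
coe_rel_err = (correct - quoted_pentium) / correct
coe_log2 = log2(float(coe_rel_err))  # close to -14, as the post states

# Error magnitudes as stated in the post.
COE_MAGNITUDE = 2.0 ** -14
PRATT_MAGNITUDE = 2.0 ** -16
severity_ratio = COE_MAGNITUDE / PRATT_MAGNITUDE  # exactly 4.0


def expected_error(magnitude: float, hits_per_million: float) -> float:
    """Hypothetical helper: average relative error per division performed."""
    return magnitude * hits_per_million / 1_000_000


# Hypothetical encounter rates (assumptions, not data): the 5/15 pair
# turns up five times as often as the Coe pair in some application.
coe_harm = expected_error(COE_MAGNITUDE, hits_per_million=1)
pratt_harm = expected_error(PRATT_MAGNITUDE, hits_per_million=5)

print(f"Coe relative error is about 2^{coe_log2:.3f}")
print(f"severity ratio = {severity_ratio}")
print(f"5/15 pair does more expected harm: {pratt_harm > coe_harm}")
```

Under these assumed rates the smaller but more frequent error dominates; with an encounter-rate ratio below four the Coe pair would dominate instead, so neither criterion alone settles which pair is worse.
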

This is the sense in which there is no worst pair. Instead there are two worst pairs, one unambiguously demonstrating how large the error can get, the other somewhat smaller in magnitude but more likely to be encountered in practice, depending heavily on the application.

This "two" is "up to isomorphism" as they say in algebra. The Coe pair has many siblings that do the same job: just scale either operand by a power of two. Likewise my pair has many siblings, which however are explicitly *not* obtainable by scaling, as I pointed out earlier; rather they are those bugs of similar structural simplicity enumerated in my table posted earlier. 5/15 is simply the most appealing (to me) of the 800 or so small (operands < 1000) fractions that are problematic in this sense, making them at least cousins to 5/15. Of these, 26 are siblings in that they have relative errors of at least 10^-5.

Had all the large bugs been exceedingly unlikely, and had all the likely bugs been of very low magnitude, I would not dispute so strenuously the emerging (this week) industry consensus that the FDIV bug is not serious other than politically for the computer supply side. This is however not the case: the 5/15 bug shows that customers can experience fairly large errors, a quarter of the largest possible, in fairly likely numbers.

It is important to bring this consideration to the attention of industry leaders, who in this week's news have been downplaying the significance of the bug. Until it is brought to their attention, one cannot accuse them of *deliberately* putting their own concerns ahead of their customers' by burying their collective heads in the sand in this way. The strongest accusation possible is that they are doing this, but unintentionally and without malice.

There is a chain of responsibility here. A whole industry does not listen to one person; rather the business community looks to the technical community as a whole for advice on the matter.
I therefore call on the academic community to reach a meeting of the minds on the severity of the problem so that it can present a united front to industry on this matter.

One way *you* can help here is to forward this message, not (necessarily) to management and the media, but to those of your technical colleagues who are capable of assessing the technical merits of the above arguments but who lack the time required to read Usenet, particularly those newsgroups carrying the highly contagious FDIV bug, which having infected the Pentium is now infecting many newsgroups.

--
Vaughan Pratt                   http://boole.stanford.edu/boole.html