File       | Function                 | My Time (cycles) | Intel Time (cycles)
sinfcosf.s | sine                     | 30, 75           | 38, 51, 64
           | cosine                   | 33, 78           | 36, 51, 64
tanf.s     | tangent                  | 34, 44           | 57, 61, 97
asinf.s    | arcsine                  | 31, 22, 28       | 74
acosf.s    | arccosine                | 31, 22, 28       | 73
atanf.s    | arctangent               | 37, 24           | 43
atan2f.s   | arctangent, two-argument | 28, 44           | 57, 62
These are single-precision trigonometric functions I designed and implemented in hand-coded assembly. They are fast, deliver faithfully rounded results, and conform to standards. I used Maple to design the approximating polynomials and other mathematics.
Argument reduction in the sine, cosine, and tangent routines is interesting. Because π is irrational, some very large floating-point numbers are very close to multiples of π/2. These numbers have very small sines or cosines. Determining the correct result requires finding a remainder that is many orders of magnitude smaller than the least significant bit in the input.
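The scale of the problem can be sketched in a few lines of Python. This is an illustration only, not the method the routines use; it reduces x = 10^22 (exactly representable in binary64) modulo π/2 twice, once with the double-precision value of π/2 and once with π/2 carried to 50 digits via the decimal module.

```python
# Illustration (not the routines' method): why reducing a large argument
# modulo pi/2 needs far more precision than the argument itself carries.
import math
from decimal import Decimal, getcontext

getcontext().prec = 50
PI = Decimal("3.1415926535897932384626433832795028841971693993751")
HALF_PI = PI / 2

x = Decimal(10) ** 22

# Naive reduction with the double-precision pi/2: its representation
# error (~6e-17), multiplied by ~6.4e21 periods, grows to ~4e5, so the
# remainder that comes out is essentially meaningless.
naive = math.fmod(1e22, math.pi / 2)

# Reduction with pi/2 to 50 digits: subtract the nearest integer
# multiple, and a meaningful remainder survives.
n = (x / HALF_PI).to_integral_value()       # nearest multiple of pi/2
accurate = float(x - n * HALF_PI)

print(f"naive remainder:    {naive}")
print(f"accurate remainder: {accurate}")    # about 0.55
```

The two remainders disagree completely; only the high-precision one yields the correct sine (sin(10^22) ≈ −0.8522, which is −cos of the accurate remainder since n mod 4 = 3).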
In order to produce a faithfully rounded result, the error in the remainder operation, the error in the polynomial, and the rounding errors in the polynomial evaluation must be, in total, less than the value of the least significant bit of the result expressed in single-precision.
A faithfully rounded result is one of the two floating-point numbers bracketing the exact mathematical answer. This criterion is used because determining the nearest floating-point number requires excessive time when the exact answer is very near the point midway between two floating-point numbers.
I tested each single-argument routine with each of the 4,294,967,296 input values.
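A harness for such a test can be sketched as follows (an illustration, not the original test code). With stride = 1 it would visit all 4,294,967,296 bit patterns; a large stride keeps the demonstration fast, and the "routine under test" here is simply double sin rounded to single precision.

```python
# Sketch of an exhaustive single-precision test harness: enumerate input
# bit patterns, skip non-finite values, and check each result for
# faithful rounding against a higher-precision reference.
import math
import struct

def bits_to_float32(b):
    return struct.unpack("<f", struct.pack("<I", b))[0]

def to_float32(x):
    return struct.unpack("<f", struct.pack("<f", x))[0]

def float32_neighbors(r):
    # Predecessor and successor of float32 r, via a monotonic bit encoding.
    (b,) = struct.unpack("<I", struct.pack("<f", r))
    key = (b ^ 0x80000000) if b < 0x80000000 else (0xFFFFFFFF - b)
    def back(k):
        return bits_to_float32((k ^ 0x80000000) if k >= 0x80000000
                               else (0xFFFFFFFF - k))
    return back(key - 1), back(key + 1)

def exhaustive_test(f, ref, stride):
    checked = skipped = failures = 0
    for b in range(0, 2**32, stride):
        x = bits_to_float32(b)
        if not math.isfinite(x):
            skipped += 1
            continue
        r, e = f(x), ref(x)
        pred, succ = float32_neighbors(r)
        # Faithful: r is exact, or the exact value lies strictly between
        # r's float32 neighbors (i.e., r is one of the two bracketing values).
        if not (r == e or pred < e < succ):
            failures += 1
        checked += 1
    return checked, skipped, failures

checked, skipped, failures = exhaustive_test(
    lambda x: to_float32(math.sin(x)), math.sin, stride=65537)
print(f"checked {checked}, skipped {skipped} non-finite, failures {failures}")
```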
The times shown are CPU cycles on an Intel Core 2, measured as the duration per call when called repeatedly. (This represents throughput. The latency, the duration from start of a call to when the result is available, is longer.) For comparison, times measured for Intel's Integrated Performance Primitives (IPP) are shown. A routine has several times because its behavior varies over its domain. These times are for normal numbers, not infinities, denormals, NaNs, or out-of-domain values.
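The repeated-call methodology is easy to reproduce in any language; a rough Python sketch (reporting nanoseconds per call rather than cycles, and dominated by interpreter call overhead, so it only illustrates the method):

```python
# Throughput-style measurement as described above: call the routine
# repeatedly and divide total elapsed time by the number of calls.
import math
import time

def ns_per_call(f, arg, calls=200_000):
    start = time.perf_counter_ns()
    for _ in range(calls):
        f(arg)
    return (time.perf_counter_ns() - start) / calls

print(f"math.sin: {ns_per_call(math.sin, 0.5):.1f} ns/call")
```

Latency would instead be measured by chaining each call's result into the next call's argument, so no two calls can overlap.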
The sine and cosine routines spike to 75 and 78 cycles for a small part of their input domains (four exponent values out of 254) because the table entries share some address bits with a stack location, causing the processor to delay execution until it determines there is no actual physical address collision.
The sine, cosine, and tangent routines use generated data in sinfcosfTable.s and tanfTable.s, but the generating code is not publicly available. (It is straightforward arithmetic but was not published by Apple.)
Construction of a High-Performance FFT (PDF). (Also available as Microsoft Word in a ZIP file.) This paper describes thoroughly and in detail the construction of a high-performance FFT. It is quite long, over 90 pages. Starting from the mathematical definition of an FFT, it transforms the math into a form suitable for an O(n log n)-time implementation, shows code for that, and then, in several stages, develops the code into a highly efficient design.
The design is discussed almost entirely in C but divides the work into subroutines that can clearly be implemented as high-performance code on an ordinary processor or on a single-instruction multiple-data processor. (That is, the routines perform parallel calculations on separate data, so it is clear that multiple instructions can be issued separately.)
You are welcome to copy the paper for the purpose of reviewing my work. If you would like to use the paper in your work, please read the license in the ZIP file.
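The paper's starting point can be shown in miniature (this is not the paper's code, just the textbook transformation it begins from): the defining DFT sum split into even- and odd-indexed halves, yielding a recursive O(n log n) radix-2 FFT.

```python
# Radix-2 decimation-in-time FFT, checked against the direct DFT sum.
import cmath

def fft(a):
    # Split into even/odd halves and combine with twiddle factors.
    n = len(a)                      # n must be a power of two
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def dft(a):
    # Direct O(n^2) evaluation of the defining sum, for comparison.
    n = len(a)
    return [sum(a[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n)) for k in range(n)]

data = [complex(j % 3, -j) for j in range(16)]
err = max(abs(p - q) for p, q in zip(fft(data), dft(data)))
print(f"max |fft - dft| = {err:.2e}")
```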
A radix-4 FFT butterfly subroutine (text file). In my work for EADS, I wrote a high-performance FFT. This code sample (published with permission) is the subroutine called FFT4_1WeightPerIteration in my FFT construction paper. Because this is both assembly language and specialized code, it might not be easily readable by many software engineers. There are two points to note.
First, if you have seen an FFT implementation before, search for the first occurrence of "1:". This is a label which starts the main loop. In some FFT implementations, the butterfly calculations are a mess. I sought to express the calculations cleanly and bring out the symmetry in the mathematics.
Second, if you have some experience with high-performance code or instruction scheduling, search for the second occurrence of "1:". Here begins a loop that performs the same calculations, but the instructions are reordered for high performance. (I apologize that the documentation for this reordering is in another file that I do not have permission to publish.) This loop runs in 31 CPU cycles per iteration.
(There are two versions of the loop because the first performs the calculations in a simple, serial, comprehensible manner. It works, and it is used for development and illustration, but it is not assembled into the production code.)
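For readers without assembly experience, the arithmetic one radix-4 butterfly performs can be written out in a few lines of Python (an illustration of the mathematics, not a transcription of the sample): it is a 4-point DFT arranged to expose its symmetry, costing eight complex additions and one multiplication by −j, which in code is just a swap of real and imaginary parts with a sign change.

```python
# One radix-4 butterfly: a 4-point DFT in symmetric form.
def radix4_butterfly(x0, x1, x2, x3):
    a, b = x0 + x2, x0 - x2
    c, d = x1 + x3, (x1 - x3) * -1j   # the only "multiplication" is by -j
    return a + c, b + d, a - c, b - d

# Check against the defining sums X_k = sum_n x_n * (-1j)**(n*k).
x = (1 + 2j, -3 + 1j, 0.5 - 4j, 2 + 0j)
direct = [sum(xn * (-1j) ** (n * k) for n, xn in enumerate(x))
          for k in range(4)]
err = max(abs(p - q) for p, q in zip(radix4_butterfly(*x), direct))
print(f"max error vs 4-point DFT: {err:.2e}")
```

In the full FFT, the twiddle weights are applied to x1, x2, and x3 before this combination step.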
Zero-Knowledge Proofs (PDF). These are slides for a talk about zero-knowledge proofs aimed at an audience of intelligent people but using mostly high-school mathematics. Zero-knowledge proofs are quite interesting: One person can prove to another they know a number (a password) without giving an eavesdropper any information about what the number is or how to forge a proof.
This was not prepared as a work project, but it shows some of my exposition style even though it does not include the narration that adds to the slides.
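The flavor of such a proof can be shown with a toy run of a Schnorr-style identification protocol, a standard zero-knowledge proof of knowledge of a discrete logarithm (the slides' own examples may differ); it needs nothing beyond modular exponentiation, and the numbers here are far too small to be secure.

```python
# Toy Schnorr-style identification: prove knowledge of x with y = g^x
# (mod p) without revealing x. Tiny parameters, illustration only.
import secrets

p = 2267                       # prime; p - 1 = 2 * 11 * 103
q = 103                        # prime order of the subgroup we work in
g = pow(2, (p - 1) // q, p)    # an element of order q mod p

x = 1 + secrets.randbelow(q - 1)   # prover's secret "password"
y = pow(g, x, p)                   # public key; recovering x from y is
                                   # the discrete-logarithm problem

# One round of the protocol:
r = secrets.randbelow(q)           # prover picks a fresh random nonce...
t = pow(g, r, p)                   # ...and sends the commitment t
c = secrets.randbelow(q)           # verifier sends a random challenge
s = (r + c * x) % q                # prover's answer; r masks x, so the
                                   # transcript (t, c, s) reveals nothing

ok = pow(g, s, p) == (t * pow(y, c, p)) % p   # verifier's check
print(f"proof verified: {ok}")
```

The check passes because g^s = g^(r + cx) = g^r · (g^x)^c = t · y^c, while an eavesdropper who could answer two different challenges for the same t could already compute x themselves.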
© Copyright 2003 by Eric Postpischil.