Binary GCD algorithm
From Wikipedia, the free encyclopedia
The binary GCD algorithm is an algorithm that computes the greatest common divisor of two nonnegative integers. It gains a measure of efficiency over the ancient Euclidean algorithm by replacing division with shifts, comparisons, and subtraction, which are cheaper when operating on the binary representation used by modern computers. This is particularly important on embedded platforms that have no direct processor support for division. While the algorithm was first published in modern times by Josef Stein in 1961, it may have been known in first-century China (Knuth, 1998).
Algorithm
The algorithm reduces the problem of finding the GCD by repeatedly applying these identities:
1. gcd(0, v) = v, because everything divides zero, and v is the largest number that divides v. Similarly, gcd(u, 0) = u. gcd(0, 0) is usually defined as 0.
2. If u and v are both even, then gcd(u, v) = 2·gcd(u/2, v/2), because 2 is a common divisor.
3. If u is even and v is odd, then gcd(u, v) = gcd(u/2, v), because 2 is not a common divisor. Similarly, if u is odd and v is even, then gcd(u, v) = gcd(u, v/2).
4. If u and v are both odd and u ≥ v, then gcd(u, v) = gcd((u − v)/2, v). If both are odd and u < v, then gcd(u, v) = gcd(u, (v − u)/2). These combine one step of the simple Euclidean algorithm, which uses subtraction at each step, with an application of step 3 above. The division by 2 yields an integer because the difference of two odd numbers is even.
5. Repeat steps 3–4 until u = v, or (one more step) until u = 0. In either case, the result is 2^k·v, where k is the number of common factors of 2 found in step 2.
Since this definition is tail-recursive, a loop can be used to replace the recursion.
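As an illustrative sketch (not from the original article), the identities translate almost directly into a recursive C function; the name gcd_recursive is invented for this example:

```c
#include <assert.h>

/* Illustrative recursive translation of identities 1-4. */
unsigned int gcd_recursive(unsigned int u, unsigned int v)
{
    if (u == 0) return v;                      /* identity 1 */
    if (v == 0) return u;
    if ((u & 1) == 0 && (v & 1) == 0)          /* identity 2: both even */
        return gcd_recursive(u >> 1, v >> 1) << 1;
    if ((u & 1) == 0)                          /* identity 3: u even, v odd */
        return gcd_recursive(u >> 1, v);
    if ((v & 1) == 0)                          /* identity 3: u odd, v even */
        return gcd_recursive(u, v >> 1);
    if (u >= v)                                /* identity 4: both odd */
        return gcd_recursive((u - v) >> 1, v);
    return gcd_recursive(u, (v - u) >> 1);
}
```

Note that the identity-2 case is not tail-recursive as written (the doubling happens after the recursive call returns), which is why loop-based implementations first strip the common factors of 2 and restore them at the end.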
The algorithm requires O((log₂ uv)²) worst-case time, or in other words time proportional to the square of the number of bits in u and v together. Although each step reduces at least one of the operands by at least a factor of 2, the subtract and shift operations do not take constant time for very large integers (although they are still quite fast in practice, requiring about one operation per word of the representation).
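As an informal illustration of the linear step bound (not a proof), one can count the subtract-and-shift steps and check that they never exceed the combined bit length of the inputs; the helper gcd_steps below is invented for this sketch:

```c
#include <assert.h>

/* Count every halving or subtract step the binary GCD performs.
   Each iteration removes at least one bit from u or v, so the
   count is at most bits(u) + bits(v). */
static int gcd_steps(unsigned int u, unsigned int v)
{
    int steps = 0;
    if (u == 0 || v == 0)
        return 0;
    while (((u | v) & 1) == 0) { u >>= 1; v >>= 1; ++steps; }
    while ((u & 1) == 0)       { u >>= 1; ++steps; }
    do {
        while ((v & 1) == 0)   { v >>= 1; ++steps; }
        if (u < v) {
            v -= u;
        } else {
            unsigned int diff = u - v;
            u = v;
            v = diff;
        }
        v >>= 1;
        ++steps;
    } while (v != 0);
    return steps;
}
```

On 32-bit inputs this never exceeds 64 steps, but each step still costs time proportional to the operand length for multi-word integers, which is where the quadratic worst case comes from.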
An extended version of binary GCD, analogous to the extended Euclidean algorithm, is given in (Knuth 1998, answer to exercise 39 of section 4.5.2, p. 646) along with pointers to other versions.
Implementation
Implementation in C
Following is an implementation of the algorithm in C, taking two non-negative arguments u and v. It first removes all common factors of 2 using identity 2, then computes the GCD of the remaining numbers using identities 3 and 4, and finally combines the two to form the answer.
unsigned int gcd(unsigned int u, unsigned int v)
{
    int shift;

    /* GCD(0, x) := x */
    if (u == 0 || v == 0)
        return u | v;

    /* Let shift := lg K, where K is the greatest power of 2
       dividing both u and v. */
    for (shift = 0; ((u | v) & 1) == 0; ++shift) {
        u >>= 1;
        v >>= 1;
    }

    while ((u & 1) == 0)
        u >>= 1;

    /* From here on, u is always odd. */
    do {
        while ((v & 1) == 0)  /* Loop X */
            v >>= 1;

        /* Now u and v are both odd, so diff(u, v) is even.
           Let u = min(u, v), v = diff(u, v) / 2. */
        if (u < v) {
            v -= u;
        } else {
            unsigned int diff = u - v;
            u = v;
            v = diff;
        }
        v >>= 1;
    } while (v != 0);

    return u << shift;
}
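As a quick usage check (not part of the original article), the routine can be exercised as follows; a compact copy of the function is included so the example is self-contained:

```c
#include <assert.h>

/* Self-contained copy of the binary GCD routine above. */
unsigned int gcd(unsigned int u, unsigned int v)
{
    int shift;
    if (u == 0 || v == 0)
        return u | v;
    for (shift = 0; ((u | v) & 1) == 0; ++shift) {
        u >>= 1;
        v >>= 1;
    }
    while ((u & 1) == 0)
        u >>= 1;
    do {
        while ((v & 1) == 0)
            v >>= 1;
        if (u < v) {
            v -= u;
        } else {
            unsigned int diff = u - v;
            u = v;
            v = diff;
        }
        v >>= 1;
    } while (v != 0);
    return u << shift;
}

/* Example calls:
   gcd(48, 18) == 6   (48 = 2^4 * 3, 18 = 2 * 3^2)
   gcd(0, 9)   == 9   (identity 1)
   gcd(1, 71)  == 1   (coprime inputs)                  */
```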
Implementation in assembly
This version of binary GCD in ARM assembly (using GNU Assembler syntax) highlights the benefit of branch predication. The loop to implement step 2 consists of 6 instructions, four of which are predicated. Step 3 consists of two loops, each 3 instructions long (one of the instructions being predicated); however, after the first iteration r0 is kept odd and need not be tested, and only one of the loops is executed. Finally, step 4 takes four instructions of which two are predicated.
gcd:
        @ Arguments arrive in registers r0 and r1
        mov     r3, #0              @ Initialize r3, the power of 2 in the result
remove_twos_loop:
        tst     r0, #1              @ Check least significant bit of r0. If zero...
        tsteq   r1, #1              @ ... check least significant bit of r1 too. If zero:
        addeq   r3, r3, #1          @ increment r3, marking one more common factor of 2;
        moveq   r0, r0, lsr #1      @ shift r0 right, removing the common factor of 2;
        moveq   r1, r1, lsr #1      @ shift r1 right, removing the common factor of 2; ...
        beq     remove_twos_loop    @ ... and check whether we can remove one more
check_two_r0:
        tst     r0, #1              @ Check least significant bit of r0
        moveq   r0, r0, lsr #1      @ If r0 is even, shift it right, dividing it by 2, and...
        beq     check_two_r0        @ ... check whether we can remove one more
check_two_r1:                       @ /* Loop X */
        tst     r1, #1              @ Check least significant bit of r1
        moveq   r1, r1, lsr #1      @ If r1 is even, shift it right by 1 and...
        beq     check_two_r1        @ ... check whether we can remove one more
        subs    r2, r0, r1          @ Set r2 = r0 - r1 and set flags
        rsblt   r2, r2, #0          @ Set r2 = |r0 - r1| (negate r2 if r0 < r1)
        movgt   r0, r1              @ Set r0 = min(r0, r1) (move r1 to r0 if r0 > r1)
        mov     r1, r2, lsr #1      @ Set r1 = |r0 - r1| / 2 (useless on last iteration)
        bne     check_two_r1        @ Uses flags from subs; was r0 = r1? If not, iterate
        @ r0 is one of the old r0 and r1, so it is odd: only r1 needs testing
        mov     r0, r0, asl r3      @ Put r0 << r3 in the result register
        bx      lr                  @ Return to caller
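The predicated subtract step can also be expressed branch-free in portable C. The following is an illustrative sketch, not from the original article: the helper name odd_step is invented here, and the code relies on the well-defined wrap-around of C unsigned arithmetic.

```c
#include <assert.h>

/* Branch-free sketch of the "both odd" step (identity 4): compute
   min(u, v) and |u - v| / 2 without branches, much as the ARM code
   does with conditional instructions. */
static void odd_step(unsigned int *u, unsigned int *v)
{
    unsigned int a = *u, b = *v;                 /* both assumed odd */
    unsigned int diff = a - b;                   /* wraps around if a < b */
    unsigned int mask = -(unsigned int)(a < b);  /* all ones iff a < b */
    *u = (a & mask) | (b & ~mask);               /* min(a, b) */
    *v = ((diff ^ mask) - mask) >> 1;            /* |a - b| / 2 */
}
```

Whether such branchless code actually beats a conditional branch depends on the branch predictor and the compiler; on ARM the predicated instructions achieve the same effect directly.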
Efficiency
Brigitte Vallée proved that, on average, binary GCD is about 60% more efficient than the Euclidean algorithm in terms of the number of bit operations.[1][2][3]
Real-life computers, however, operate on more than one bit at a time. The speed of binary GCD depends substantially on the details of the implementation, the CPU architecture, and the compiler optimizations. With implementations similar to the code above, the measured gain over the Euclidean algorithm has consistently been smaller in practice than the theory above suggests. For example, Knuth (1998) reports a 15% gain. According to a recent comparison, the best gain was about 60%, while on some popular architectures even good implementations of binary GCD were marginally slower than the Euclidean algorithm.[4]
One way to improve the efficiency of binary GCD is to use special CPU instructions that count leading or trailing zero bits in a number, allowing all trailing zero bits to be removed in a single step instead of one at a time.[5] Of course, this is possible only on platforms where such instructions are available, and the resulting assembly code is not portable.
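For example, on GCC and Clang the count-trailing-zeros intrinsic __builtin_ctz can replace the one-bit-at-a-time loops. The following is a sketch under that assumption (other compilers expose similar intrinsics, such as _BitScanForward on MSVC); the name gcd_ctz is invented for this example:

```c
#include <assert.h>

/* Binary GCD using a count-trailing-zeros intrinsic to strip all
   trailing zero bits in one step.  __builtin_ctz(0) is undefined,
   so zero operands are handled up front and the loop keeps v != 0. */
unsigned int gcd_ctz(unsigned int u, unsigned int v)
{
    int shift;
    if (u == 0) return v;
    if (v == 0) return u;
    shift = __builtin_ctz(u | v);     /* common factors of 2 */
    u >>= __builtin_ctz(u);           /* u is now odd */
    do {
        v >>= __builtin_ctz(v);       /* v is now odd too */
        if (u > v) {                  /* keep u = min(u, v) */
            unsigned int t = v;
            v = u;
            u = t;
        }
        v -= u;                       /* both odd: difference is even */
    } while (v != 0);
    return u << shift;
}
```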
A family of portable implementations in C and C++ was proposed recently that removes several bits at a time with the aid of lookup tables in the loop marked "Loop X".[6] The implementations are parametrized by the number of bits taken in one step. With a 256-byte lookup table (8 bits), the implementation turned out to be between 82.5% and 163% faster than the Euclidean algorithm, depending on the CPU and compiler. Even with a small 16-byte lookup table (4 bits), the gains are in the range of 54% to 116%.
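The underlying idea can be sketched with a hypothetical 16-entry (4-bit) table; this illustrates the technique only and is not the referenced code. The table maps the low four bits of v to the number of trailing zeros they contain, with a sentinel value of 4 for an all-zero nibble so the loop repeats:

```c
#include <assert.h>

/* trailing_zeros[i] = number of trailing zero bits in i (0..15);
   entry 0 is the sentinel 4, meaning "shift a whole nibble and retry". */
static const unsigned char trailing_zeros[16] =
    { 4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0 };

/* Remove all trailing zero bits of a nonzero v, up to 4 at a time,
   accumulating the shift count in *shift.  Sketch only. */
static unsigned int strip_twos(unsigned int v, int *shift)
{
    int s;
    do {
        s = trailing_zeros[v & 15];
        v >>= s;
        *shift += s;
    } while (s == 4);   /* whole nibble was zero: keep going */
    return v;
}
```

A binary GCD using strip_twos in place of the one-bit shift loops does at most one table lookup per four zero bits removed, which is where the reported speedups come from.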
Although this algorithm outperforms the traditional Euclidean algorithm, its asymptotic running time is the same, and it is considerably more complex to implement.
References
- Donald Knuth, The Art of Computer Programming, Volume 2: Seminumerical Algorithms (3rd Edition). Addison-Wesley.
- Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Problem 31-1, p. 902.
External links
- NIST Dictionary of Algorithms and Data Structures: binary GCD algorithm
- Notes on Programming by Alexander Stepanov
- Binary Euclid's Algorithm at Cut-the-Knot
- Analysis of the Binary Euclidean Algorithm (1976), a paper by Richard P. Brent, including a variant using left shifts