Big-Integers: Why They Exist and How They Work
A small introduction to big integers
In this first post, and in many more to come, I’ll be delving into the world of Big-Integers, also known as Multi-Precision Integers or Arbitrary Precision Integers. Many popular programming languages, such as Julia, JavaScript, Haskell, Ruby, Go and Python, provide them either as built-in data types or through the standard library.
In this first article, I’ll be going through some of their prominent use cases and how to represent them using C code.
When 64 bits are just not enough
Modern 64-bit CPUs have a limit on how large a number can be represented using a built-in primitive integer type. On a typical 64-bit processor implementing a 2’s complement machine, the largest unsigned integer that can be stored in a general-purpose register is the following:

2^64 - 1 = 18,446,744,073,709,551,615
Attempting to store a value larger than this results in unsigned integer overflow, causing the stored value to wrap around modulo 2^64, as per the properties of modular arithmetic.
CPUs from ISAs such as x86-64 and AArch64 implement modular (wrapping) arithmetic by default for integer operations across both their general-purpose and SIMD registers.
But why would anyone need a number larger than (approximately) eighteen and a half quintillion? Turns out there are quite a few use cases.
Modern Cryptography is the most common. Algorithms such as RSA and ECC (elliptic-curve cryptography) operate on numbers that are hundreds or thousands of bits wide. RSA moduli, for instance, are typically 2048 to 4096 bits long, since the security of these systems relies on the computational intractability of problems like integer factorization or the discrete logarithm problem, whose difficulty scales with the size of the numbers involved. The larger the numbers, the more infeasible it becomes for an attacker to break the scheme, whether by brute force or by well-optimized algorithms, in a practical amount of time.
In the field of Number Theory, encountering numbers several orders of magnitude larger than the 64-bit unsigned integer limit is not uncommon either. Tools such as Mathematica, Maple, and SageMath rely heavily on big-integer arithmetic as a foundation for symbolic computation, and such tools can be instrumental in investigating conjectures in number theory, like the one below:
313 (x^3 + y^3) = z^3

For those not too keen on mathematical lingo, it says that there exist no non-trivial integers x, y and z such that the equation above holds true.
However, in 1988, Noam Elkies found a solution to the equation, thus disproving the conjecture. Here’s some Python code you can execute to test it out yourself:
x = 355507307842882624593086325021133856149447336710120844428552934573043094018915289363
y = -354602746692986709129018423204648314355484458881941451025238387384142099383045862152
z = 474859390028874317081970642668844016410429280362746808383285167570721710561593528843
lhs = (x ** 3 + y ** 3) * 313
rhs = z ** 3
print(lhs == rhs)

Going a step further, one could try to find a solution purely in positive integers. That solution, however, is massive, with more than 21,000 decimal digits in each of x, y and z. I can only present a scientific-notation approximation of the actual solution below:
Thus, such Big-Integers are crucial to computational number theory and cryptography. Beyond that, hobbyist massively distributed computing projects such as GIMPS (the Great Internet Mersenne Prime Search) and PrimeGrid, which search for extremely large primes, rely on Big-Integers. There is even a cash prize of USD 250,000 for whoever finds a prime with at least a billion decimal digits!
Representing Big Integers in Code
Now that I’ve covered the trivial stuff, it’s time to get to the core question:
How do we represent integers that don’t fit in a machine register—and how do we implement arithmetic on them?
Choosing a base
The first design choice is the numeric base we’ll store the number in.
Humans default to base-10, but a CPU fundamentally works in base-2. So, for performance, we want a base that matches the machine’s “natural word size”, because that lets addition, subtraction, and multiplication use the CPU’s integer instructions directly.
On a 64-bit machine, the most convenient choice is base 2^64: each “digit” of a big-integer is itself a 64-bit word. In C, this can be written in the following way1:
#include <stdint.h>
#include <stddef.h>
typedef uint64_t u64; // arbitrary precision digit type
typedef size_t usize; // arbitrary precision size type
#define MAX_U64 UINT64_MAX // (2 ^ 64) - 1
// example number with 32 digits
#define NUM_SIZE (usize)32
u64 my_num[NUM_SIZE];

Choosing an endianness
Once we have this much, the next step is to decide how the digits are laid out in memory. This is the choice of “endianness”.
There are two possible layouts:
Little-Endian (The least significant digit first)
Big-Endian (The most significant digit first)
I am choosing to go with Little-Endian. Even though it seems counter-intuitive for humans, since we write the most significant digit first, I prefer this layout for the following reason:
In all arithmetic operations, we need to propagate carries and borrows from the least significant digit up to the most significant digit.
In this way, it becomes much easier for me to reason about the code.
What lies next
In the next post, we will get to writing one of the most basic operations on big integers — Addition.
1. Changed the typedef for size_t from u64 to usize on a suggestion from a reader, to increase clarity in the code.
