Fibdrv
fibdrv is a Linux kernel module that calculates the Fibonacci sequence. It is an assignment of CSIE3018 - Linux Kernel Internals, taught by Ching-Chun (Jim) Huang at National Cheng Kung University. In this assignment, I made it theoretically possible for fibdrv to calculate $F_{188795}$ and, using the relationship between the Fibonacci sequence and the golden ratio, made memory allocation deterministic. Additionally, I used Fast Doubling to reduce the amount of computation, and compared the computation times of the Karatsuba and Schönhage–Strassen algorithms.
Programs usually use previously generated values as the basis for the next calculation when computing random numbers; see Pseudo-Random Number Generators (PRNG). The Fibonacci recurrence, after modifying the initial values or applying modular arithmetic, can produce numbers that are not easily predictable. This method of generating random numbers is called the Lagged Fibonacci generator (LFG). See also: For The Love of Computing: The Lagged Fibonacci Generator — Where Nature Meet Random Numbers.
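For reference, here is a minimal user-space sketch of an additive lagged Fibonacci generator, $S_n = (S_{n-24} + S_{n-55}) \bmod 2^{64}$. The lags (24, 55) are a classic choice; the seeding scheme below is arbitrary and only for illustration.

```c
#include <stdint.h>
#include <stdio.h>

#define LAG_SHORT 24
#define LAG_LONG 55

static uint64_t state[LAG_LONG]; /* circular buffer of the last 55 outputs */
static int pos;                  /* index of the oldest entry, S(n-55)     */

static void lfg_seed(uint64_t seed)
{
    /* fill the history with a simple LCG so the lagged entries differ */
    for (int i = 0; i < LAG_LONG; i++)
        state[i] = seed = seed * 6364136223846793005ULL + 1442695040888963407ULL;
    pos = 0;
}

static uint64_t lfg_next(void)
{
    /* S(n) = S(n-24) + S(n-55) mod 2^64; S(n-24) sits 31 slots after pos */
    uint64_t v = state[(pos + LAG_LONG - LAG_SHORT) % LAG_LONG] + state[pos];
    state[pos] = v;              /* the oldest slot now holds the newest value */
    pos = (pos + 1) % LAG_LONG;
    return v;
}

int main(void)
{
    lfg_seed(42);
    for (int i = 0; i < 5; i++)
        printf("%llu\n", (unsigned long long) lfg_next());
    return 0;
}
```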
The expected goals are:
- Writing programs for the Linux kernel level
- Learning core APIs such as ktimer and copy_to_user
- Reviewing C language numerical systems and bitwise operations
- Improving numerical analysis and computation strategies
- Exploring Linux VFS
- Implementing automatic testing mechanisms
- Conducting performance analysis through perf
Homework instructions (in Chinese): https://hackmd.io/@sysprog/linux2022-fibdrv
GitHub repository: https://github.com/blueskyson/fibdrv
Replace long long with a custom struct BigN
This is the initial function that calculates the Fibonacci number $F_{k}$ in fibdrv.c:
static long long fib_sequence(long long k)
{
/* FIXME: C99 variable-length array (VLA) is not allowed in Linux kernel. */
long long f[k + 2];
f[0] = 0;
f[1] = 1;
for (int i = 2; i <= k; i++) {
f[i] = f[i - 1] + f[i - 2];
}
return f[k];
}
fib_sequence uses long long to hold the Fibonacci values, and it will overflow when reaching $F_{93}$, so long long must be replaced by a data structure that can store larger values. First, define struct BigN in client.c and fibdrv.c and give it the type name u128:
#define BIGN
#ifdef BIGN
typedef struct BigN {
unsigned long long upper, lower;
} u128;
#endif
Keep lower below $10^{19}$, so the value represented by u128 is $upper * 10^{19} + lower$. This way, carrying from lower to upper works like carrying in base $10^{19}$, and the range of representable values is from $0$ to about $1.8 * 10^{38}$, following this pattern:
decimal | upper | lower |
---|---|---|
0 | 0 | 0 |
1 | 0 | 1 |
2 | 0 | 2 |
$10^{19} - 1$ | 0 | 99…99 |
$10^{19}$ | 1 | 0 |
$10^{19} + 1$ | 1 | 1 |
$2 * 10^{19} - 1$ | 1 | 99…99 |
$2 * 10^{19}$ | 2 | 0 |
This allows us to easily use printf to print the number as follows:
printf("%llu%019llu", fib->upper, fib->lower);
The fib_sequence rewritten using u128 is as follows:
#ifdef BIGN
#define P10_UINT64 10000000000000000000ULL
static u128 fib_sequence_u128(long long k)
{
if (k <= 1LL) {
u128 ret = {.upper = 0, .lower = (unsigned long long) k};
return ret;
}
u128 a, b, c;
a.upper = 0;
a.lower = 0;
b.upper = 0;
b.lower = 1;
for (int i = 2; i <= k; i++) {
c.lower = a.lower + b.lower;
c.upper = a.upper + b.upper;
// if lower overflows or lower >= 10^19, then add a carry to upper
if ((c.lower < a.lower) || (c.lower >= P10_UINT64)) {
c.lower -= P10_UINT64;
c.upper += 1;
}
a = b;
b = c;
}
return c;
}
#endif
Next, I will modify the interface (fib_read) that calls this kernel driver. The original fib_read function directly returns the value of $F_k$ to user space, i.e., it tells client.c the value of $F_k$ through the return value of read(fd, buf, 1), whose type is ssize_t. Referring to Integer-Types, I found that ssize_t on my machine is 64 bits. Obviously, passing values of $F_{93}$ or higher through ssize_t will cause an overflow. Therefore, I should use the copy_to_user API to copy the data of $F_k$ into the buf of read(fd, buf, 1), and perform base conversion in client.c. The modified fib_read is as follows:
static ssize_t fib_read(struct file *file,
char *buf,
size_t size,
loff_t *offset)
{
#ifdef BIGN
u128 fib = fib_sequence_u128(*offset);
copy_to_user(buf, &fib, sizeof(fib));
return sizeof(fib);
#else
return (ssize_t) fib_sequence(*offset);
#endif
}
Next, rewrite client.c to read values from fibdrv:
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#define FIB_DEV "/dev/fibonacci"
#define BIGN
#ifdef BIGN
typedef struct BigN {
unsigned long long upper, lower;
} u128;
void print_fib_u128(int i, u128 *fib)
{
if (fib->upper) {
printf("Reading from " FIB_DEV
" at offset %d, returned the sequence "
"%llu%019llu.\n",
i, fib->upper, fib->lower);
} else {
printf("Reading from " FIB_DEV
" at offset %d, returned the sequence "
"%llu.\n",
i, fib->lower);
}
}
#endif
int main()
{
long long sz;
char buf[sizeof(u128)];
char write_buf[] = "testing writing";
int offset = 100; /* TODO: try test something bigger than the limit */
int fd = open(FIB_DEV, O_RDWR);
if (fd < 0) {
perror("Failed to open character device");
exit(1);
}
for (int i = 0; i <= offset; i++) {
sz = write(fd, write_buf, strlen(write_buf));
printf("Writing to " FIB_DEV ", returned the sequence %lld\n", sz);
}
for (int i = 0; i <= offset; i++) {
lseek(fd, i, SEEK_SET);
sz = read(fd, buf, sizeof(u128));
#ifdef BIGN
print_fib_u128(i, (u128 *) buf);
#else
printf("Reading from " FIB_DEV
" at offset %d, returned the sequence "
"%lld.\n",
i, sz);
#endif
}
for (int i = offset; i >= 0; i--) {
lseek(fd, i, SEEK_SET);
sz = read(fd, buf, sizeof(u128));
#ifdef BIGN
print_fib_u128(i, (u128 *) buf);
#else
printf("Reading from " FIB_DEV
" at offset %d, returned the sequence "
"%lld.\n",
i, sz);
#endif
}
close(fd);
return 0;
}
This BigN implementation can compute up to $F_{184}$, the largest Fibonacci number that fits within the roughly $1.8 * 10^{38}$ range of this representation.
The current implementation is commit 53f5ddf.
Extending struct BigN to Compute $F_{1000}$
Rewrite struct BigN, name it ubig, and expand upper and lower into an array cell[] as follows:
#define BIGN
#ifdef BIGN
#define BIGNSIZE 12
typedef struct BigN {
unsigned long long cell[BIGNSIZE];
} ubig;
#define init_ubig(x) for (int i = 0; i < BIGNSIZE; x.cell[i++] = 0)
#endif
The macro BIGNSIZE controls the length of the cell array. When BIGNSIZE is set to 12, ubig becomes an array of twelve 64-bit cells (768 bits in total) for storing Fibonacci values, capable of representing numbers from $0$ to about $1.8 * 10^{19 * 12}$. cell[0] stores the least significant part, and cell[11] stores the most significant part. Next, rewrite fib_sequence:
#ifdef BIGN
#define P10_UINT64 10000000000000000000ULL
static ubig fib_sequence_ubig(long long k)
{
if (k <= 1LL) {
ubig ret;
init_ubig(ret);
ret.cell[0] = (unsigned long long) k;
return ret;
}
ubig a, b, c;
init_ubig(a);
init_ubig(b);
b.cell[0] = 1;
for (int i = 2; i <= k; i++) {
for (int j = 0; j < BIGNSIZE; j++)
c.cell[j] = a.cell[j] + b.cell[j];
for (int j = 0; j < BIGNSIZE - 1; j++) {
if ((c.cell[j] < a.cell[j]) || (c.cell[j] >= P10_UINT64)) {
c.cell[j] -= P10_UINT64;
c.cell[j + 1] += 1;
}
}
a = b;
b = c;
}
return c;
}
#endif
The current implementation is commit d0dd8a4.
Improving struct BigN
Currently, each unsigned long long in struct BigN carries over when it reaches $10^{19}$, but when implementing the Fast Doubling algorithm, performing left shifts directly would produce wrong results because adjacent cells are not separated by a factor of $2^{64}$. Therefore, I simply let the cell array use plain unsigned binary representation, i.e., $upper * 2^{64} + lower$. The binary to base-10 conversion is done when printing to the terminal.
decimal | upper | lower |
---|---|---|
0 | 0 | 0 |
1 | 0 | 1 |
2 | 0 | 2 |
$2^{64}$ - 1 | 0 | 0xff…ff |
$2^{64}$ | 1 | 0 |
$2^{64} + 1$ | 1 | 1 |
$2^{65} - 1$ | 1 | 0xff…ff |
$2^{65}$ | 2 | 0 |
Here is the improved addition (ubig_add) under this representation:
#define BIGNSIZE 12
typedef struct BigN {
unsigned long long cell[BIGNSIZE];
} ubig;
static inline void init_ubig(ubig *x)
{
for (int i = 0; i < BIGNSIZE; x->cell[i++] = 0)
;
}
static inline void ubig_add(ubig *dest, ubig *a, ubig *b)
{
for (int i = 0; i < BIGNSIZE; i++)
dest->cell[i] = a->cell[i] + b->cell[i];
for (int i = 0; i < BIGNSIZE - 1; i++)
dest->cell[i + 1] += (dest->cell[i] < a->cell[i]);
}
Next, let’s apply ubig_add to modify fibdrv:
static ubig fib_sequence_ubig(long long k)
{
if (k <= 1LL) {
ubig ret;
init_ubig(&ret);
ret.cell[0] = (unsigned long long) k;
return ret;
}
ubig a, b, c;
init_ubig(&a);
init_ubig(&b);
b.cell[0] = 1ULL;
for (int i = 2; i <= k; i++) {
ubig_add(&c, &a, &b);
a = b;
b = c;
}
return c;
}
#define BUFFSIZE 500
static ssize_t fib_read(struct file *file,
char *buf,
size_t size,
loff_t *offset)
{
#ifdef BIGN
char fibbuf[BUFFSIZE];
ubig fib = fib_sequence_ubig(*offset);
int __offset = ubig_to_string(fibbuf, BUFFSIZE, &fib);
copy_to_user(buf, fibbuf + __offset, BUFFSIZE - __offset);
return (ssize_t) BUFFSIZE - __offset;
#else
return (ssize_t) fib_sequence(*offset);
#endif
}
The current implementation is commit b2749a6.
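fib_read above relies on ubig_to_string, which is not shown in this excerpt. Conceptually it converts the base-$2^{64}$ cells into a right-aligned decimal string and returns the offset of the first significant digit, which is how fib_read uses its return value. The double-and-add conversion below is only a sketch of one way to do that; the routine in the repository may differ.

```c
/* Sketch: convert ubig (base-2^64 cells) to a decimal string.
 * The digits end at buf[bufsize - 2]; buf[bufsize - 1] is '\0'.
 * Returns the index of the first significant digit. */
static int ubig_to_string(char *buf, size_t bufsize, ubig *a)
{
    memset(buf, '0', bufsize - 1);
    buf[bufsize - 1] = '\0';

    for (int i = BIGNSIZE - 1; i >= 0; i--) {
        for (unsigned long long mask = 0x8000000000000000ULL; mask; mask >>= 1) {
            /* decimal buffer = decimal buffer * 2 + current bit */
            int carry = (a->cell[i] & mask) != 0;
            for (int j = (int) bufsize - 2; j >= 0; j--) {
                int d = (buf[j] - '0') * 2 + carry;
                buf[j] = '0' + d % 10;
                carry = d / 10;
            }
        }
    }
    /* skip leading zeros but keep at least one digit */
    int off = 0;
    while (off < (int) bufsize - 2 && buf[off] == '0')
        off++;
    return off;
}
```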
Computing the Fibonacci Sequence Using Fast Doubling
First, implement subtraction (ubig_sub), left shift (ubig_lshift), and long multiplication (ubig_mul) for struct BigN:
static inline void ubig_sub(ubig *dest, ubig *a, ubig *b)
{
for (int i = 0; i < BIGNSIZE; i++)
dest->cell[i] = a->cell[i] - b->cell[i];
for (int i = 0; i < BIGNSIZE - 1; i++)
dest->cell[i + 1] -= (dest->cell[i] > a->cell[i]);
}
static inline void ubig_lshift(ubig *dest, ubig *a, int x)
{
init_ubig(dest);
// quotient and remainder of x being divided by 64
unsigned quotient = x >> 6, remainder = x & 0x3f;
for (int i = 0; i + quotient < BIGNSIZE; i++)
dest->cell[i + quotient] |= a->cell[i] << remainder;
if (remainder)
for (int i = 1; i + quotient < BIGNSIZE; i++)
dest->cell[i + quotient] |= a->cell[i - 1] >> (64 - remainder);
}
void ubig_mul(ubig *dest, ubig *a, ubig *b)
{
init_ubig(dest);
int index = BIGNSIZE - 1;
while (index >= 0 && !b->cell[index])
index--;
if (index == -1)
return;
for (int i = index; i >= 0; i--) {
int bit_index = (i << 6) + 63;
for (unsigned long long mask = 0x8000000000000000ULL; mask;
mask >>= 1) {
if (b->cell[i] & mask) {
ubig shift, tmp;
ubig_lshift(&shift, a, bit_index);
ubig_add(&tmp, dest, &shift);
*dest = tmp;
}
bit_index--;
}
}
}
Refer to the Fast Doubling algorithm:
\[\begin{split} \begin{bmatrix} F(2n+1) \\ F(2n) \end{bmatrix} &= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^{2n} \begin{bmatrix} F(1) \\ F(0) \end{bmatrix}\\ \\ &= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^n \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^n \begin{bmatrix} F(1) \\ F(0) \end{bmatrix}\\ \\ &= \begin{bmatrix} F(n+1) & F(n) \\ F(n) & F(n-1) \end{bmatrix} \begin{bmatrix} F(n+1) & F(n) \\ F(n) & F(n-1) \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix}\\ \\ &= \begin{bmatrix} F(n+1)^2 + F(n)^2\\ F(n)F(n+1) + F(n-1)F(n) \end{bmatrix} \end{split}\]
Now we have:
\[\begin{split} F(2k) &= F(k)[2F(k+1) - F(k)] \\ F(2k+1) &= F(k+1)^2+F(k)^2 \end{split}\]
Pseudo code:
Fast_Fib(n)
a = 0; b = 1; // m = 0
for i = (number of binary digits in n) to 1
t1 = a*(2*b - a);
t2 = b^2 + a^2;
a = t1; b = t2; // m *= 2
if (current binary digit == 1)
t1 = a + b; // m++
a = b; b = t1;
return a;
For example: Computing $F(10)$:
i | start | 4 | 3 | 2 | 1 | result |
---|---|---|---|---|---|---|
n | - | 1010 | 1010 | 1010 | 1010 | - |
F(m) | F(0) | F(0*2+1) | F(1*2) | F(2*2+1) | F(5*2) | F(10) |
a | 0 | 1 | 1 | 5 | 55 | 55 |
b | 1 | 1 | 2 | 8 | 89 | - |
Implementing fibdrv using Fast Doubling:
static ubig fib_sequence_ubig(long long k)
{
if (k <= 1LL) {
ubig ret;
init_ubig(&ret);
ret.cell[0] = (unsigned long long) k;
return ret;
}
ubig a, b;
init_ubig(&a);
init_ubig(&b);
b.cell[0] = 1ULL;
for (unsigned long long mask = 0x8000000000000000ULL >> __builtin_clzll(k);
mask; mask >>= 1) {
ubig tmp1, tmp2, t1, t2;
ubig_lshift(&tmp1, &b, 1); // tmp1 = 2*b
ubig_sub(&tmp2, &tmp1, &a); // tmp2 = 2*b - a
ubig_mul(&t1, &a, &tmp2); // t1 = a*(2*b - a)
ubig_mul(&tmp1, &a, &a); // tmp1 = a^2
ubig_mul(&tmp2, &b, &b); // tmp2 = b^2
ubig_add(&t2, &tmp1, &tmp2); // t2 = a^2 + b^2
a = t1, b = t2;
if (k & mask) {
ubig_add(&t1, &a, &b); // t1 = a + b
a = b, b = t1;
}
}
return a;
}
The current implementation is commit 473ff4a.
Performance Analysis of Addition and Fast Doubling
Isolating Specific CPUs
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
...
First, modify the GRUB configuration file to force isolate a CPU (CPU 11 in my case):
$ sudo vim /etc/default/grub
Edit GRUB_CMDLINE_LINUX="":
GRUB_CMDLINE_LINUX="isolcpus=11"
Update GRUB:
$ sudo update-grub
After rebooting, use taskset to check if the CPU has been isolated:
$ taskset -p 1
pid 1's current affinity mask: 7ff
The affinity mask 7ff covers CPUs 0-10 only, confirming that CPU 11 has been removed from the default scheduling set. Run the client program pinned to the isolated CPU:
$ sudo insmod fibdrv.ko
$ sudo taskset -c 11 ./client
Eliminating Other Interfering Factors
Disable address space layout randomization (ASLR):
$ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"
Set the scaling_governor to “performance”:
# performance.sh
for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
echo performance > ${i}
done
Disable turbo mode of Intel processors:
$ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"
Calculating Execution Time of fibdrv in Kernel Space
Use ktime_t and ktime_get() to measure the computation time of the Fibonacci sequence, then pass the elapsed time to user space through the return value of fib_read. The implementation is as follows:
#include <linux/ktime.h>
// ...
static ssize_t fib_read(struct file *file,
char *buf,
size_t size,
loff_t *offset)
{
#ifdef BIGN
ktime_t t = ktime_get();
char fibbuf[BUFFSIZE];
ubig fib = fib_sequence_ubig(*offset);
int __offset = ubig_to_string(fibbuf, BUFFSIZE, &fib);
copy_to_user(buf, fibbuf + __offset, BUFFSIZE - __offset);
s64 elapsed = ktime_to_ns(ktime_sub(ktime_get(), t));
return elapsed;
#else
return (ssize_t) fib_sequence(*offset);
#endif
}
Then write a perf-test.c that prints the execution time in user space, in kernel space, and for the kernel-to-user transfer. This program repeatedly calculates $F_0$ to $F_{1000}$ 100 times and computes the 10% trimmed mean.
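The exact perf-test.c lives in the repository; the sketch below is only a rough reconstruction under my own assumptions (the column order, SAMPLES, and BUFFSIZE are illustrative). It uses clock_gettime() for the user-space measurement and the return value of read(), which fib_read sets to the kernel-side elapsed nanoseconds, for the kernel measurement; their difference approximates the kernel-to-user cost.

```c
/* perf-test.c (sketch): 100 samples per n, 10% trimmed mean of each metric */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define FIB_DEV "/dev/fibonacci"
#define SAMPLES 100
#define BUFFSIZE 500

static int cmp_ll(const void *a, const void *b)
{
    long long x = *(const long long *) a, y = *(const long long *) b;
    return (x > y) - (x < y);
}

static long long trimmed_mean(long long *t, int n)
{
    qsort(t, n, sizeof(long long), cmp_ll);
    int lo = n / 10, hi = n - n / 10; /* drop the lowest and highest 10% */
    long long sum = 0;
    for (int i = lo; i < hi; i++)
        sum += t[i];
    return sum / (hi - lo);
}

int main(void)
{
    char buf[BUFFSIZE];
    long long kern[SAMPLES], user[SAMPLES];
    int fd = open(FIB_DEV, O_RDWR);
    if (fd < 0) {
        perror("Failed to open character device");
        return 1;
    }
    for (int n = 0; n <= 1000; n++) {
        for (int s = 0; s < SAMPLES; s++) {
            struct timespec t1, t2;
            lseek(fd, n, SEEK_SET);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            long long kt = read(fd, buf, BUFFSIZE); /* kernel ns from fib_read */
            clock_gettime(CLOCK_MONOTONIC, &t2);
            kern[s] = kt;
            user[s] = (t2.tv_sec - t1.tv_sec) * 1000000000LL +
                      (t2.tv_nsec - t1.tv_nsec);
        }
        long long k = trimmed_mean(kern, SAMPLES);
        long long u = trimmed_mean(user, SAMPLES);
        printf("%d %lld %lld %lld\n", n, k, u, u - k);
    }
    close(fd);
    return 0;
}
```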
Next, redirect the test results to fast_doubling.txt:
$ gcc perf-test.c -o a
$ sudo insmod fibdrv.ko
$ sudo taskset -c 11 ./a > fast_doubling.txt
$ sudo rmmod fibdrv.ko || true >/dev/null
The output will look like this:
0 232 543 311
1 51551 51835 284
2 52655 52997 342
3 52379 52663 284
4 52758 53041 283
...
Compare the user space time of Fast Doubling and Adding:
It can be observed that Fast Doubling outperforms Adding in terms of performance.
The current implementation is commit 48956a1.
One-on-One Review with Professor on 2022/04/02
- Estimate the significant digits before calculating decimal values.
- Abstraction: Abstract the implementation to reduce redundant code modifications and increase readability.
- Determinism: Prepare memory allocation before starting the calculation. If memory allocation fails, return a failure message immediately to avoid wasting calculation time.
- Reduce the data transfer between kernel and user space.
When dynamically allocating memory, consider the following issues:
- Allocating a large contiguous array of unsigned long long may fail.
- Multiplication: m digits * m digits yields up to 2m digits, so the destination memory must be allocated in advance (the operation cannot be done in place).
- Binary-to-decimal conversion also requires dynamically allocated memory.
- During the kernel-to-user transfer, the buffer allocated beforehand by the client (user) may be too small.
Abstracting the Interface and Implementation
I have abstracted the functions for big number addition, multiplication, initialization, and others as follows:
// ubig_xxx.h
ubig *new_ubig(int size);
void destroy_ubig(ubig *ptr);
void zero_ubig(ubig *x);
void ubig_assign(ubig *dest, ubig *src);
void ubig_add(ubig *dest, ubig *a, ubig *b);
void ubig_sub(ubig *dest, ubig *a, ubig *b);
void ubig_lshift(ubig *dest, ubig *a, int x);
void ubig_mul(ubig *dest, ubig *a, ubig *b, ubig *shift_buf, ubig *add_buf);
int ubig_to_string(char *buf, size_t bufsize, ubig *a);
ubig fib_sequence(long long k);
Different versions of the implementation are written in ubig_1.h, ubig_2.h, and so on. In fibdrv.c, implementations are switched by changing the #include "ubig_xxx.h" line.
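For example, the backend compiled into fibdrv.c is selected with a single include line (the header names are those mentioned in this post; this only illustrates the switching mechanism):

```c
/* fibdrv.c: pick exactly one big-number backend */
#include "ubig_2.h" /* or "ubig_1.h", and later "ubig_karatsuba.h", etc. */
```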
Estimating the Significant Digits of $F_n$
According to the relation to the golden ratio, the ratio of consecutive Fibonacci numbers approaches the golden ratio, i.e., $F_n * 1.61803… \approx F_{n+1}$.
Moreover, $\dfrac{1}{\log_{10} 1.61804} \approx 4.785$, which means that once the ratio of consecutive Fibonacci numbers is close to 1.61804, every four to five additional terms add one more decimal digit.
Let’s list some Fibonacci numbers and observe at which point $\dfrac{F_{k}}{F_{k-1}}$ becomes less than $1.61804$:
$k$ | $F_k$ | $F_k/F_{k-1}$ |
---|---|---|
13 | 233 | 1.6180555556 |
14 | 377 | 1.6180257511 |
15 | 610 | 1.6180371353 |
16 | 987 | 1.6180327869 |
17 | 1597 | 1.6180344478 |
From the listing, we can see that for $k \geq 14$, $\dfrac{F_k}{F_{k-1}}$ stays below 1.61804. Therefore, when $k \geq 14$, we can estimate the number of decimal digits of $F_k$ as $\lfloor \log_{10} 233+(k-13)\log_{10} 1.61804 \rfloor + 1$, which simplifies to $\lfloor 0.20899k + 0.6505 \rfloor$.
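The simplification follows from $\log_{10} 233 \approx 2.36736$ and $\log_{10} 1.61804 \approx 0.20899$; moving the trailing $+1$ inside the floor is valid because the argument is never an integer:
\[\lfloor \log_{10} 233+(k-13)\log_{10} 1.61804 \rfloor + 1 \approx \lfloor 2.36736 + 0.20899(k-13) \rfloor + 1 = \lfloor 0.20899k - 0.34951 + 1 \rfloor = \lfloor 0.20899k + 0.6505 \rfloor\]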
Assuming ubig uses $N$ unsigned long long numbers, $\left\lfloor \dfrac{\log_2 2^{64N}}{\log_2 10} \right\rfloor + 1$ gives the maximum number of decimal digits ubig can represent. Evaluating the constants, this becomes $\lfloor 19.26593N + 1\rfloor$.
We can dynamically choose $N$ based on $k$ by requiring $\lfloor 19.2661N + 1\rfloor \geq \lfloor 0.20899k + 0.6505 \rfloor$. Below is the function that estimates the required number of unsigned long long cells (since kernel modules do not allow floating-point arithmetic, the pre-computed constants are scaled by 100000 and the calculation is done in integers):
static inline int estimate_size(long long k) {
if (k <= 93)
return 1;
unsigned long long n = (k * 20899 - 34950) / 1926610;
return (int)n + 1;
}
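As a quick sanity check (my own arithmetic, not from the original post): for $k = 1000$, estimate_size returns $\lfloor (1000 \cdot 20899 - 34950)/1926610 \rfloor + 1 = 10 + 1 = 11$ cells, and 11 cells can hold $\lfloor 19.26593 \cdot 11 + 1 \rfloor = 212$ decimal digits, which covers the $\lfloor 0.20899 \cdot 1000 + 0.6505 \rfloor = 209$ digits of $F_{1000}$.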
Next, rewrite ubig so that the length of its unsigned long long array can be adjusted as needed:
typedef struct BigN {
int size;
unsigned long long *cell;
} ubig;
Allocate memory for ubig using kmalloc and release it with kfree:
ubig *new_ubig(int size)
{
ubig *ptr = kmalloc(sizeof(ubig), GFP_KERNEL);
if (!ptr)
return NULL;
unsigned long long *ullptr =
kmalloc(size * sizeof(unsigned long long), GFP_KERNEL);
if (!ullptr) {
kfree(ptr);
return NULL;
}
memset(ullptr, 0, size * sizeof(unsigned long long));
ptr->size = size;
ptr->cell = ullptr;
return ptr;
}
static inline void destroy_ubig(ubig *ptr)
{
if (ptr) {
if (ptr->cell)
kfree(ptr->cell);
kfree(ptr);
}
}
After switching to dynamic allocation with kmalloc, the number of unsigned long long cells is bounded by the maximum size kmalloc can allocate (128 KB); the implementation caps ubig at 2048 unsigned long long cells. The maximum number of decimal digits that can be represented is therefore $\lfloor 2048*19.26593 + 1\rfloor=39457$.
For Fibonacci numbers with at most 39457 decimal digits, the corresponding index satisfies $\lfloor 0.20899k + 0.6505 \rfloor \leq 39457 \Rightarrow k \leq 188795$. Therefore, the current implementation can compute up to $F_{188795}$.
The implementation up to this point is in ubig_2.h at commit 0ee4f89.
Introduce the Schönhage–Strassen Algorithm to Accelerate Multiplication
The concept of the Schönhage–Strassen algorithm is to split the large numbers $x$ and $y$ into smaller $n$-digit pieces $x_0$, $x_1$, … and $y_0$, $y_1$, …. Then, perform linear convolution on $(x_0, x_1, …)$ and $(y_0, y_1, …)$, and finally left-shift each element of the convolution according to its digit position and add them all up to obtain the product. For example:
Let $x = 26699$ and $y = 188$.
Splitting $x$ and $y$ into subsequences with 2 digits each yields the following sequences:
$(2, 66, 99),\ (1, 88)$
Perform linear convolution on these sequences:
2 66 99
x 1 88
-----------------------
176 5808 8712
2 66 99
-----------------------
2 242 5907 8712
The linear convolution of $(2, 66, 99)$ and $(1, 88)$ results in $(2, 242, 5907, 8712)$. Next, by left-shifting and adding the elements of the linear convolution, we can complete the multiplication of large numbers.
8712
5907
242
+ 2
--------------
5019412
In the multiplication implementation of ubig, I split the numbers at 32-bit boundaries and compute the linear convolution with 64-bit multiplications. To facilitate this, the element type of ubig is changed from unsigned long long to unsigned int.
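The revised struct presumably looks something like the following; the exact definition is not reproduced in the post, so treat this as an assumption:

```c
typedef struct BigN {
    int size;           /* number of 32-bit cells allocated               */
    unsigned int *cell; /* base-2^32 digits, cell[0] is least significant */
} ubig;
```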
// find the array index of MSB of a
// TODO: use binary search
static inline int ubig_msb_idx(ubig *a)
{
int msb_i = a->size - 1;
while (msb_i >= 0 && !a->cell[msb_i])
msb_i--;
return msb_i;
}
void ubig_mul(ubig *dest, ubig *a, ubig *b)
{
zero_ubig(dest);
// find the array index of the MSB of a, b
int msb_a = ubig_msb_idx(a);
int msb_b = ubig_msb_idx(b);
// a == 0 or b == 0 then dest = 0
if (msb_a < 0 || msb_b < 0)
return;
// calculate the length of vector
// after doing linear convolution
// i.e. column number
int length = msb_a + msb_b + 1;
/* do linear convolution */
unsigned long long carry = 0;
for (int i = 0; i < length; i++) {
unsigned long long col_sum = carry;
carry = 0;
int end = (i <= msb_b) ? i : msb_b;
int start = i - end;
for (int j = start, k = end; j <= end; j++, k--) {
unsigned long long product =
(unsigned long long) a->cell[k] * b->cell[j];
col_sum += product;
carry += (col_sum < product);
}
dest->cell[i] = (unsigned int) col_sum;
carry = (carry << 32) + (col_sum >> 32);
}
dest->cell[length] = carry;
}
The complete implementation is in ubig_schonhange_strassen.h at commit b21a89a.
Introduce the Karatsuba Algorithm to Accelerate Multiplication
The concept of Karatsuba is to split $x$ and $y$ at the $n$th digit boundary into halves $x_1$, $x_0$, $y_1$, $y_0$, and consider them as smaller numbers to multiply. Then, compensate for the lost digits of $x_1$ and $y_1$ by left-shifting. Using decimal digits as an example:
\[\text{Let} \ x = x_1 \cdot 10^n + x_0 \ \text{and} \ y = y_1 \cdot 10^n + y_0\]
Then, $x \cdot y$ can be expressed as:
\[\underbrace{x_1y_1}_{z_2} \cdot 10^{2n}+\underbrace{(x_1y_0+y_1x_0)}_{z_1} \cdot 10^n+\underbrace{x_0y_0}_{z_0}\]
Calculating $z_2$, $z_1$, and $z_0$ this way requires four multiplications. However, we can reduce it to three multiplications using the following technique.
By expanding $(x_1+x_0)(y_1+y_0)$, we get:
\[(x_1+x_0)(y_1+y_0)=\underbrace{x_1y_1}_{z_2}+\underbrace{x_1y_0+x_0y_1}_{z_1}+\underbrace{x_0y_0}_{z_0}\]
After rearranging, we can calculate $z_1$ from $(x_1+x_0)(y_1+y_0)$, $z_0$, and $z_2$:
$z_2=x_1y_1$
$z_0 = x_0y_0$
$z_1=(x_1+x_0)(y_1+y_0)-z_0-z_2$
Finally, calculating $z_2 \cdot 10^{2n}+z_1 \cdot 10^n+z_0$ gives the product of $x$ and $y$.
The above is the Karatsuba algorithm.
Next, I will demonstrate Karatsuba using two 8-bit numbers to yield a 16-bit product (assuming the processor only supports 4-bit multiplication):
\[x = 01001001_2 = 73_{10}, \quad y = 10000011_2 = 131_{10}\]
\[x = x_1 \cdot 10^4 + x_0 = 0100 \cdot 10^4 + 1001\]
\[y = y_1 \cdot 10^4 + y_0 = 1000 \cdot 10^4 + 0011\]
\[z_2 = x_1y_1 = 0100 \cdot 1000 = 00100000\]
\[z_0 = x_0y_0 = 1001 \cdot 0011 = 00011011\]
\[\begin{align*} z_1 &=(x_1+x_0)(y_1+y_0)-z_0-z_2 \\ &=(0100+1001)(1000+0011)-00100000-00011011 \\ &=10001111-00100000-00011011 \\ &=01010100 \end{align*}\]
\[\begin{align*} x \cdot y &=z_2 \cdot 10^8+z_1 \cdot 10^4+z_0 \\ &=0010010101011011_2 \\ &=9563_{10} \end{align*}\]
Here all the digits are binary, so $10^n$ denotes $2^n$, and multiplying by $10^n$ can be replaced by a left-shift operation.
Continuing from the previous example, when $x$ and $y$ exceed 8 bits, Karatsuba can be implemented using a divide-and-conquer approach. If the number of bits in $x_1$, $x_0$, $y_1$, and $y_0$ exceeds the processor’s multiplication capacity, they can be further divided into $x_{11}$, $x_{10}$, $x_{01}$, $x_{00}$, etc., and Karatsuba can be applied again. Here’s a demonstration using two 16-bit numbers to yield a 32-bit product (assuming the processor only supports 8-bit multiplication):
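That demonstration is not reproduced here; as an illustrative stand-in (my own construction, not the repository code), the C sketch below multiplies two 16-bit values using only 8-bit-by-8-bit products, where mul8x8() models the widest multiplication the processor is assumed to support.

```c
#include <stdint.h>
#include <stdio.h>

/* base case: the assumed hardware primitive, 8-bit x 8-bit -> 16-bit */
static uint16_t mul8x8(uint8_t a, uint8_t b)
{
    return (uint16_t) ((uint16_t) a * b);
}

/* one level of Karatsuba: 16-bit x 16-bit -> 32-bit with three mul8x8 calls */
static uint32_t karatsuba16(uint16_t x, uint16_t y)
{
    uint8_t x1 = x >> 8, x0 = x & 0xff;
    uint8_t y1 = y >> 8, y0 = y & 0xff;

    uint32_t z2 = mul8x8(x1, y1); /* high halves */
    uint32_t z0 = mul8x8(x0, y0); /* low halves  */

    /* (x1 + x0) and (y1 + y0) can be 9 bits wide, so handle the extra bit
     * separately while still using a single 8-bit multiplication. */
    uint16_t sx = (uint16_t) (x1 + x0), sy = (uint16_t) (y1 + y0);
    uint32_t m = mul8x8(sx & 0xff, sy & 0xff);
    if (sx & 0x100)
        m += (uint32_t) (sy & 0xff) << 8;
    if (sy & 0x100)
        m += (uint32_t) (sx & 0xff) << 8;
    if ((sx & 0x100) && (sy & 0x100))
        m += 1u << 16;
    uint32_t z1 = m - z2 - z0; /* x1*y0 + x0*y1, from three real products */

    return (z2 << 16) + (z1 << 8) + z0; /* shift-and-add, as in the text */
}

int main(void)
{
    uint16_t x = 0x1234, y = 0xABCD; /* arbitrary test values */
    printf("%u\n", (unsigned) karatsuba16(73, 131)); /* 9563, as in the example */
    printf("%u == %u\n", (unsigned) karatsuba16(x, y), (unsigned) ((uint32_t) x * y));
    return 0;
}
```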
The implementation of Karatsuba can be found in ubig_karatsuba.h at commit 9673a70.
Comparing Karatsuba and Schönhage–Strassen Algorithms
The following chart displays the execution time for calculating $F_0$, $F_{50}$, $F_{100}$, …, $F_{50000}$ (excluding base conversion). Karatsuba takes around 45 milliseconds to calculate $F_{50000}$, while Schönhage–Strassen takes approximately 1.8 milliseconds.
Next, let’s look at the time consumption of each algorithm when calculating smaller numbers:
From the chart, it can be observed that when calculating from $F_0$ to $F_{45}$, Adding is the fastest, while Karatsuba and Schönhage–Strassen are approximately equally fast from $F_{45}$ to $F_{95}$, and Schönhage–Strassen is the fastest for calculations beyond that.
Regarding the time complexity of the three approaches for calculating $F_N$:
- Adding: performs $N$ additions, each requiring $\log_{2^{32}}F_N$ iterations, giving $O(N \cdot \log_{2^{32}}F_N)$.
- Long multiplication (Fast Doubling): requires $\log_2 N$ iterations, each long multiplication taking $O(\log_2 F_N \cdot \log_{2^{32}}F_N)$, giving $O(\log_2 N \cdot \log_2 F_N \cdot \log_{2^{32}}F_N)$.
- Schönhage–Strassen (Fast Doubling): also requires $\log_2 N$ iterations, each multiplication taking $O(1+2+…+\log_{2^{32}}F_N+\log_{2^{32}}F_N-1+…+2+1)=O((\log_{2^{32}}F_N)^2)$, giving $O(\log_2 N \cdot (\log_{2^{32}}F_N)^2)$.