Skip to main content

Foreign Field Multiplication

This document is an original RFC explaining how we constrain foreign field multiplication (i.e. non-native field multiplication) in the Kimchi proof system. For a more recent RFC, see FFmul RFC.

Changelog

Author(s)DateDetails
Joseph Spadavecchia and Anais QuerolOctober 2022Design of foreign field multiplication in Kimchi

Overview

This gate constrains

ab=cmodfa \cdot b = c \mod f

where a,b,cFfa, b, c \in \mathbb{F_f}, a foreign field with modulus ff, using the native field Fn\mathbb{F_n} with prime modulus nn.

Approach

In foreign field multiplication the foreign field modulus ff could be bigger or smaller than the native field modulus nn. When the foreign field modulus is bigger, then we need to emulate foreign field multiplication by splitting the foreign field elements up into limbs that fit into the native field element size. When the foreign modulus is smaller everything can be constrained either in the native field or split up into limbs.

Since our projected use cases are when the foreign field modulus is bigger (more on this below) we optimize our design around this scenario. For this case, not only must we split foreign field elements up into limbs that fit within the native field, but we must also operate in a space bigger than the foreign field. This is because we check the multiplication of two foreign field elements by constraining its quotient and remainder. That is, renaming cc to rr, we must constrain

ab=qf+r,a \cdot b = q \cdot f + r,

where the maximum size of qq and rr is f1f - 1 and so we have

ab(f1)qf+(f1)rf21.\begin{aligned} a \cdot b &\le \underbrace{(f - 1)}_q \cdot f + \underbrace{(f - 1)}_r \\ &\le f^2 - 1. \end{aligned}

Thus, the maximum size of the multiplication is quadratic in the size of foreign field.

Naïve approach

Naïvely, this implies that we have elements of size f21f^2 - 1 that must split them up into limbs of size at most n1n - 1. For example, if the foreign field modulus is 256256 and the native field modulus is 255255 bits, then we'd need log2((2256)21)512\log_2((2^{256})^2 - 1) \approx 512 bits and, thus, require 512/2553512/255 \approx 3 native limbs. However, each limb cannot consume all 255255 bits of the native field element because we need space to perform arithmetic on the limbs themselves while constraining the foreign field multiplication. Therefore, we need to choose a limb size that leaves space for performing these computations.

Later in this document (see the section entitled "Choosing the limb configuration") we determine the optimal number of limbs that reduces the number of rows and gates required to constrain foreign field multiplication. This results in =88\ell = 88 bits as our optimal limb size. In the section about intermediate products we place some upperbounds on the number of bits required when constraining foreign field multiplication with limbs of size \ell thereby proving that the computations can fit within the native field size.

Observe that by combining the naïve approach above with a limb size of 8888 bits, we would require 512/886512/88 \approx 6 limbs for representing foreign field elements. Each limb is stored in a witness cell (a native field element). However, since each limb is necessarily smaller than the native field element size, it must be copied to the range-check gate to constrain its value. Since Kimchi only supports 7 copyable witness cells per row, this means that only one foreign field element can be stored per row. This means a single foreign field multiplication would consume at least 4 rows (just for the operands, quotient and remainder). This is not ideal because we want to limit the number of rows for improved performance.

Chinese remainder theorem

Fortunately, we can do much better than this, however, by leveraging the chinese remainder theorem (CRT for short) as we will now show. The idea is to reduce the number of bits required by constraining our multiplication modulo two coprime moduli: 2t2^t and nn. By constraining the multiplication with both moduli the CRT guarantees that the constraints hold with the bigger modulus 2tn2^t \cdot n.

The advantage of this approach is that constraining with the native modulus nn is very fast, allowing us to speed up our non-native computations. This practically reduces the costs to constraining with limbs over 2t2^t, where tt is something much smaller than 512512.

For this to work, we must select a value for tt that is big enough. That is, we select tt such that 2tn>f212^t \cdot n > f^2 - 1. As described later on in this document, by doing so we reduce the number of bits required for our use cases to 264264. With 8888 bit limbs this means that each foreign field element only consumes 33 witness elements, and that means the foreign field multiplication gate now only consumes 22 rows. The section entitled "Choosing tt" describes this in more detail.

Overall approach

Bringing it all together, our approach is to verify that

ab=qf+r\begin{aligned} a \cdot b = q \cdot f + r \end{aligned}

over Z+\mathbb{Z^+}. In order to do this efficiently we use the CRT, which means that the equation holds mod M=2tnM = 2^t \cdot n. For the equation to hold over the integers we must also check that each side of the equation is less than 2tn2^t \cdot n.

The overall steps are therefore

  1. Apply CRT to equation (1)
    • Check validity with binary modulus mod2t\mod 2^t
    • Check validity with native modulus modn\mod n
  2. Check each side of equation (1) is less than MM
    • ab<2tna \cdot b < 2^t \cdot n
    • qf+r<2tnq \cdot f + r < 2^t \cdot n

This then implies that

ab=cmodf.a \cdot b = c \mod f.

where c=rc = r. That is, aa multiplied with bb is equal to cc where a,b,cFfa,b,c \in \mathbb{F_f}. There is a lot more to each of these steps. That is the subject of the rest of this document.

Other strategies

Within our overall approach, aside from the CRT, we also use a number of other strategies to reduce the number and degree of constraints.

  • Avoiding borrows and carries
  • Intermediate product elimination
  • Combining multiplications

The rest of this document describes each part in detail.

Parameter selection

This section describes important parameters that we require and how they are computed.

  • Native field modulus nn
  • Foreign field modulus ff
  • Binary modulus 2t2^t
  • CRT modulus 2tn2^t \cdot n
  • Limb size in bits \ell

Choosing tt

Under the hood, we constrain ab=qf+rmod2tna \cdot b = q \cdot f + r \mod 2^t \cdot n. Since this is a multiplication in the foreign field ff, the maximum size of qq and rr are f1f - 1, so this means

ab(f1)f+(f1)f21.\begin{aligned} a \cdot b &\le (f - 1) \cdot f + (f - 1) \\ &\le f^2 - 1. \end{aligned}

Therefore, we need the modulus 2tn2^t \cdot n such that

2tn>f21,2^t \cdot n > f^2 - 1,

which is the same as saying, given ff and nn, we must select tt such that

2tnf2t2log2(f)log2(n).\begin{aligned} 2^t \cdot n &\ge f^2 \\ t &\ge 2\log_2(f) - \log_2(n). \end{aligned}

Thus, we have an lower bound on tt.

Instead of dynamically selecting tt for every nn and ff combination, we fix a tt that will work for the different selections of nn and ff relevant to our use cases.

To guide us, from above we know that

tmin=2log2(f)log2(n)t_{min} = 2\log_2(f) - \log_2(n)

and we know the field moduli for our immediate use cases.

vesta     = 2^254 + 45560315531506369815346746415080538113 (255 bits)
pallas = 2^254 + 45560315531419706090280762371685220353 (255 bits)
secp256k1 = 2^256 - 2^32 - 2^9 - 2^8 - 2^7 - 2^6 - 2^4 - 1 (256 bits)

So we can create a table

nnfftmint_{min}
vestasecp256k1258
pallassecp256k1258
vestapallas255
pallasvesta255

and know that to cover our use cases we need t258t \ge 258.

Next, given our native modulus nn and tt, we can compute the maximum foreign field modulus supported. Actually, we compute the maximum supported bit length of the foreign field modulus f=2mf=2^m.

2tnf2(2m)2=22mt+log2(n)>log2(22m)=2m\begin{aligned} 2^t \cdot n &\ge f^2 \\ &\ge (2^m)^2 = 2^{2m} \\ t + \log_2(n) &> \log_2(2^{2m}) = 2m \end{aligned}

So we get

mmax=t+log2(n)2.m_{max} = \frac{t + \log_2(n)}{2}.

With t=258,n=255t=258, n=255 we have

mmax=258+2552=256.5,\begin{aligned} m_{max} &= \frac{258 + 255}{2} = 256.5, \end{aligned}

which is not enough space to handle anything larger than 256 bit moduli. Instead, we will use t=264t=264, giving mmax=259m_{max} = 259 bits.

The above formula is useful for checking the maximum number of bits supported of the foreign field modulus, but it is not useful for computing the maximum foreign field modulus itself (because 2mmax2^{m_{max}} is too coarse). For these checks, we can compute our maximum foreign field modulus more precisely with

maxmod=2tn.max_{mod} = \lfloor \sqrt{2^t \cdot n} \rfloor.

The max prime foreign field modulus satisfying the above inequality for both Pallas and Vesta is 926336713898529563388567880069503262826888842373627227613104999999999999999607.

Choosing the limb configuration

Choosing the right limb size and the right number of limbs is a careful balance between the number of constraints (i.e. the polynomial degree) and the witness length (i.e. the number of rows). Because one limiting factor that we have in Kimchi is the 12-bit maximum for range check lookups, we could be tempted to introduce 12-bit limbs. However, this would mean having many more limbs, which would consume more witness elements and require significantly more rows. It would also increase the polynomial degree by increasing the number of constraints required for the intermediate products (more on this later).

We need to find a balance between the number of limbs and the size of the limbs. The limb configuration is dependent on the value of tt and our maximum foreign modulus (as described in the previous section). The larger the maximum foreign modulus, the more witness rows we will require to constrain the computation. In particular, each limb needs to be constrained by the range check gate and, thus, must be in a copyable (i.e. permuteable) witness cell. We have at most 7 copyable cells per row and gates can operate on at most 2 rows, meaning that we have an upperbound of at most 14 limbs per gate (or 7 limbs per row).

As stated above, we want the foreign field modulus to fit in as few rows as possible and we need to constrain operands a,ba, b, the quotient qq and remainder rr. Each of these will require cells for each limb. Thus, the number of cells required for these is

cells=4limbscells = 4 \cdot limbs

It is highly advantageous for performance to constrain foreign field multiplication with the minimal number of gates. This not only helps limit the number of rows, but also to keep the gate selector polynomial small. Because of all of this, we aim to constrain foreign field multiplication with a single gate (spanning at most 22 rows). As mentioned above, we have a maximum of 14 permuteable cells per gate, so we can compute the maximum number of limbs that fit within a single gate like this.

limbsmax=cells/4=14/4=3\begin{aligned} limbs_{max} &= \lfloor cells/4 \rfloor \\ &= \lfloor 14/4 \rfloor \\ &= 3 \\ \end{aligned}

Thus, the maximum number of limbs possible in a single gate configuration is 3.

Using limbsmax=3limbs_{max}=3 and t=264t=264 that covers our use cases (see the previous section), we are finally able to derive our limbs size

=tlimbsmax=264/3=88\begin{aligned} \ell &= \frac{t}{limbs_{max}} \\ &= 264/3 \\ &= 88 \end{aligned}

Therefore, our limb configuration is:

  • Limb size =88\ell = 88 bits
  • Number of limbs is 33

Avoiding borrows

When we constrain abqf=rmod2ta \cdot b - q \cdot f = r \mod 2^t we want to be as efficient as possible.

Observe that the expansion of abqfa \cdot b - q \cdot f into limbs would also have subtractions between limbs, requiring our constraints to account for borrowing. Dealing with this would create undesired complexity and increase the degree of our constraints.

In order to avoid the possibility of subtractions we instead use ab+qfa \cdot b + q \cdot f' where

f=fmod2t=2tf\begin{aligned} f' &= -f \mod 2^t \\ &= 2^t - f \end{aligned}

The negated modulus ff' becomes part of our gate coefficients and is not constrained because it is publicly auditable.

Using the substitution of the negated modulus, we now must constrain ab+qf=rmod2ta \cdot b + q \cdot f' = r \mod 2^t.

Observe that f<2tf' < 2^t since f<2tf < 2^t and that f>ff' > f when f<2t1f < 2^{t - 1}.

Intermediate products

This section explains how we expand our constraints into limbs and then eliminate a number of extra terms.

We must constrain ab+qf=rmod2ta \cdot b + q \cdot f' = r \mod 2^t on the limbs, rather than as a whole. As described above, each foreign field element xx is split into three 88-bit limbs: x0,x1,x2x_0, x_1, x_2, where x0x_0 contains the least significant bits and x2x_2 contains the most significant bits and so on.

Expanding the right-hand side into limbs we have

(a0+a12+a222)(b0+b12+b222)+(q0+q12+q222)(f0+f12+f222)=     a0b0+a0b12+a0b222    +a1b02+a1b122+a1b223    +a2b022+a2b123+a2b224+     q0f0+q0f12+q0f222    +q1f02+q1f122+q1f223    +q2f022+q2f123+q2f224=a0b0+q0f0+2(a0b1+a1b0+q0f1+q1f0)+22(a0b2+a2b0+q0f2+q2f0+a1b1+q1f1)+23(a1b2+a2b1+q1f2+q2f1)+24(a2b2+q2f2)\begin{aligned} &(a_0 + a_1 \cdot 2^{\ell} + a_2 \cdot 2^{2\ell}) \cdot (b_0 + b_1 \cdot 2^{\ell} + b_2 \cdot 2^{2\ell}) + (q_0 + q_1 \cdot 2^{\ell} + q_2 \cdot 2^{2\ell}) \cdot (f'_0 + f'_1 \cdot 2^{\ell} + f'_2 \cdot 2^{2\ell}) \\ &=\\ &~~~~~ a_0 \cdot b_0 + a_0 \cdot b_1 \cdot 2^{\ell} + a_0 \cdot b_2 \cdot 2^{2\ell} \\ &~~~~ + a_1 \cdot b_0 \cdot 2^{\ell} + a_1 \cdot b_1 \cdot 2^{2\ell} + a_1 \cdot b_2 \cdot 2^{3\ell} \\ &~~~~ + a_2 \cdot b_0 \cdot 2^{2\ell} + a_2 \cdot b_1 \cdot 2^{3\ell} + a_2 \cdot b_2 \cdot 2^{4\ell} \\ &+ \\ &~~~~~ q_0 \cdot f'_0 + q_0 \cdot f'_1 \cdot 2^{\ell} + q_0 \cdot f'_2 \cdot 2^{2\ell} \\ &~~~~ + q_1 \cdot f'_0 \cdot 2^{\ell} + q_1 \cdot f'_1 \cdot 2^{2\ell} + q_1 \cdot f'_2 \cdot 2^{3\ell} \\ &~~~~ + q_2 \cdot f'_0 \cdot 2^{2\ell} + q_2 \cdot f'_1 \cdot 2^{3\ell} + q_2 \cdot f'_2 \cdot 2^{4\ell} \\ &= \\ &a_0 \cdot b_0 + q_0 \cdot f'_0 \\ &+ 2^{\ell} \cdot (a_0 \cdot b_1 + a_1 \cdot b_0 + q_0 \cdot f'_1 + q_1 \cdot f'_0) \\ &+ 2^{2\ell} \cdot (a_0 \cdot b_2 + a_2 \cdot b_0 + q_0 \cdot f'_2 + q_2 \cdot f'_0 + a_1 \cdot b_1 + q_1 \cdot f'_1) \\ &+ 2^{3\ell} \cdot (a_1 \cdot b_2 + a_2 \cdot b_1 + q_1 \cdot f'_2 + q_2 \cdot f'_1) \\ &+ 2^{4\ell} \cdot (a_2 \cdot b_2 + q_2 \cdot f'_2) \\ \end{aligned}

Since t=3t = 3\ell, the terms scaled by 232^{3\ell} and 242^{4\ell} are a multiple of the binary modulus and, thus, congruent to zero mod2t\mod 2^t. They can be eliminated and we don't need to compute them. So we are left with 3 intermediate products that we call p0,p1,p2p_0, p_1, p_2:

TermScaleProduct
p0p_011a0b0+q0f0a_0b_0 + q_0f'_0
p1p_122^{\ell}a0b1+a1b0+q0f1+q1f0a_0b_1 + a_1b_0 + q_0f'_1 + q_1f'_0
p2p_2222^{2\ell}a0b2+a2b0+q0f2+q2f0+a1b1+q1f1a_0b_2 + a_2b_0 + q_0f'_2 + q_2f'_0 + a_1b_1 + q_1f'_1

So far, we have introduced these checked computations to our constraints

  1. Computation of p0,p1,p2p_0, p_1, p_2

Constraining mod2t\mod 2^t

Let's call p:=ab+qfmod2tp := ab + qf' \mod 2^t. Remember that our goal is to constrain that pr=0mod2tp - r = 0 \mod 2^t (recall that any more significant bits than the 264th are ignored in mod2t\mod 2^t). Decomposing that claim into limbs, that means

22(p2r2)+2(p1r1)+p0r0=0mod2t.\begin{aligned} 2^{2\ell}(p_2 - r_2) + 2^{\ell}(p_1 - r_1) + p_0 - r_0 = 0 \mod 2^t. \end{aligned}

We face two challenges

  • Since p0,p1,p2p_0, p_1, p_2 are at least 22^{\ell} bits each, the right side of the equation above does not fit in Fn\mathbb{F}_n
  • The subtraction of the remainder's limbs r0r_0 and r1r_1 could require borrowing

For the moment, let's not worry about the possibility of borrows and instead focus on the first problem.

Combining multiplications

The first problem is that our native field is too small to constrain 22(p2r2)+2(p1r1)+p0r0=0mod2t2^{2\ell}(p_2 - r_2) + 2^{\ell}(p_1 - r_1) + p_0 - r_0 = 0 \mod 2^t. We could split this up by multiplying aba \cdot b and qfq \cdot f' separately and create constraints that carefully track borrows and carries between limbs. However, a more efficient approach is combined the whole computation together and accumulate all the carries and borrows in order to reduce their number.

The trick is to assume a space large enough to hold the computation, view the outcome in binary and then split it up into chunks that fit in the native modulus.

To this end, it helps to know how many bits these intermediate products require. On the left side of the equation, p0p_0 is at most 2+12\ell + 1 bits. We can compute this by substituting the maximum possible binary values (all bits set to 1) into p0=a0b0+q0f0p_0 = a_0b_0 + q_0f'_0 like this

maxbits(p0)=log2((21)a0(21)b0+(21)q0(21)f0)=log2(2(222+1+1))=log2(22+12+2+2).\begin{aligned} \mathsf{maxbits}(p_0) &= \log_2(\underbrace{(2^{\ell} - 1)}_{a_{0}} \underbrace{(2^{\ell} - 1)}_{b_{0}} + \underbrace{(2^{\ell} - 1)}_{q_{0}} \underbrace{(2^{\ell} - 1)}_{f'_{0}}) \\ &= \log_2(2(2^{2\ell} - 2^{\ell + 1} + 1)) \\ &= \log_2(2^{2\ell + 1} - 2^{\ell + 2} + 2). \end{aligned}

So p0p_0 fits in 2+12\ell + 1 bits. Similarly, p1p_1 needs at most 2+22\ell + 2 bits and p2p_2 is at most 2+32\ell + 3 bits.

Termp0p_0p1p_1p2p_2
Bits2+12\ell + 12+22\ell + 22+32\ell + 3

The diagram below shows the right hand side of the zero-sum equality from equation (2). That is, the value prp - r. Let's look at how the different bits of p0,p1,p2,r0,r1p_0, p_1, p_2, r_0, r_1 and r2r_2 impact it.

0             L             2L            3L            4L
|-------------|-------------|-------------|-------------|-------------|
:
|--------------p0-----------:-| 2L + 1
:
|-------------:-p1------------| 2L + 2
p10➚ : p11➚
|----------------p2-------------| 2L + 3
:
|-----r0------| :
:
|-----r1------|
:
|-----r2------|
\__________________________/ \______________________________/
≈ h0 ≈ h1

Within our native field modulus we can fit up to 2+δ<log2(n)2\ell + \delta < \log_2(n) bits, for small values of δ\delta (but sufficient for our case). Thus, we can only constrain approximately half of prp - r at a time. In the diagram above the vertical line at 2L bisects prp - r into two 2\approx2\ell bit values: h0h_0 and h1h_1 (the exact definition of these values follow). Our goal is to constrain h0h_0 and h1h_1.

Computing the zero-sum halves: h0h_0 and h1h_1

Now we can derive how to compute h0h_0 and h1h_1 from pp and rr.

The direct approach would be to bisect both p0p_0 and p1p_1 and then define h0h_0 as just the sum of the 22\ell lower bits of p0p_0 and p1p_1 minus r0r_0 and r1r_1. Similarly h1h_1 would be just the sum of upper bits of p0,p1p_0, p_1 and p2p_2 minus r2r_2. However, each bisection requires constraints for the decomposition and range checks for the two halves. Thus, we would like to avoid bisections as they are expensive.

Ideally, if our pp's lined up on the 22\ell boundary, we would not need to bisect at all. However, we are unlucky and it seems like we must bisect both p0p_0 and p1p_1. Fortunately, we can at least avoid bisecting p0p_0 by allowing it to be summed into h0h_0 like this

h0=p0+2p10r02r1h_0 = p_0 + 2^{\ell}\cdot p_{10} - r_0 - 2^{\ell}\cdot r_1

Note that h0h_0 is actually greater than 22\ell bits in length. This may not only be because it contains p0p_0 whose length is 2+12\ell + 1, but also because adding p10p_{10} may cause an overflow. The maximum length of h0h_0 is computed by substituting in the maximum possible binary value of 212^{\ell} - 1 for the added terms and 00 for the subtracted terms of the above equation.

maxbits(h0)=log2((21)(21)+(21)(21)p0+2(21)p10)=log2(22+12+2+2+222)=log2(32252+2)\begin{aligned} \mathsf{maxbits}(h_0) &= \log_2(\underbrace{(2^{\ell} - 1)(2^{\ell} - 1) + (2^{\ell} - 1)(2^{\ell} - 1)}_{p_0} + 2^{\ell} \cdot \underbrace{(2^{\ell} - 1)}_{p_{10}}) \\ &= \log_2(2^{2\ell + 1} - 2^{\ell + 2} + 2 + 2^{2\ell} - 2^\ell) \\ &= \log_2( 3\cdot 2^{2\ell} - 5 \cdot 2^\ell +2 ) \\ \end{aligned}

which is 2+22\ell + 2 bits.

N.b. This computation assumes correct sizes values for r0r_0 and r1r_1, which we assure by range checks on the limbs.

Next, we compute h1h_1 as

h1=p11+p2r2h_1 = p_{11} + p_2 - r_2

The maximum size of h1h_1 is computed as

maxbits(h1)=maxbits(p11+p2)\begin{aligned} \mathsf{maxbits}(h_1) &= \mathsf{maxbits}(p_{11} + p_2) \end{aligned}

In order to obtain the maximum value of p11p_{11}, we define p11:=p12p_{11} := \frac{p_1}{2^\ell}. Since the maximum value of p1p_1 was 22+22+3+42^{2\ell+2}-2^{\ell+3}+4, then the maximum value of p11p_{11} is 2+282^{\ell+2}-8. For p2p_2, the maximum value was 622122+66\cdot 2^{2\ell} - 12 \cdot 2^\ell + 6, and thus:

maxbits(h1)=log2(2+28p11+622122+6p2)=log2(622822)\begin{aligned} \mathsf{maxbits}(h_1) &= log_2(\underbrace{2^{\ell+2}-8}_{p_{11}} + \underbrace{6\cdot 2^{2\ell} - 12 \cdot 2^\ell + 6}_{p_2}) \\ &= \log_2(6\cdot 2^{2\ell} - 8 \cdot 2^\ell - 2) \\ \end{aligned}

which is 2+32\ell + 3 bits.

Termh0h_0h1h_1
Bits2+22\ell + 22+32\ell + 3

Thus far we have the following constraints

  1. Composition of p10p_{10} and p11p_{11} result in p1p_1
  2. Range check p11[0,2+2)p_{11} \in [0, 2^{\ell + 2})
  3. Range check p10[0,2)p_{10} \in [0, 2^{\ell})

For the next step we would like to constrain h0h_0 and h1h_1 to zero. Unfortunately, we are not able to do this!

  • Firstly, as defined h0h_0 may not be zero because we have not bisected it precisely at 22\ell bits, but have allowed it to contain the full 2+22\ell + 2 bits. Recall that these two additional bits are because p0p_0 is at most 2+12\ell + 1 bits long, but also because adding p10p_{10} increases it to 2+22\ell + 2. These two additional bits are not part of the first 22\ell bits of prp - r and, thus, are not necessarily zero. That is, they are added from the second 22\ell bits (i.e. h1h_1).

  • Secondly, whilst the highest +3\ell + 3 bits of prp - r would wrap to zero mod2t\mod 2^t, when placed into the smaller 2+32\ell + 3 bit h1h_1 in the native field, this wrapping does not happen. Thus, h1h_1's +3\ell + 3 highest bits may be nonzero.

We can deal with this non-zero issue by computing carry witness values.

Computing carry witnesses values v0v_0 and v1v_1

Instead of constraining h0h_0 and h1h_1 to zero, there must be satisfying witness v0v_0 and v1v_1 such that the following constraints hold.

  1. There exists v0v_0 such that h0=v022h_0 = v_0 \cdot 2^{2\ell}
  2. There exists v1v_1 such that h1=v12v0h_1 = v_1 \cdot 2^{\ell} - v_0

Here v0v_0 is the last two bits of h0h_0's 2+22\ell + 2 bits, i.e., the result of adding the highest bit of p0p_0 and any possible carry bit from the operation of h0h_0. Similarly, v1v_1 corresponds to the highest +3\ell + 3 bits of h1h_1. It looks like this

0             L             2L            3L            4L
|-------------|-------------|-------------|-------------|-------------|
:
|--------------h0-----------:--| 2L + 2
: ↖v0
:-------------h1-------------| 2L + 3
: \____________/
: v1➚

Remember we only need to prove the first 33\ell bits of prp - r are zero, since everything is mod2t\mod 2^t and t=3t = 3\ell. It may not be clear how this approach proves the 33\ell bits are indeed zero because within h0h_0 and h1h_1 there are bits that are nonzero. The key observation is that these bits are too high for mod2t\mod 2^t.

By making the argument with v0v_0 and v1v_1 we are proving that h0h_0 is something where the 22\ell least significant bits are all zeros and that h1+v0h_1 + v_0 is something where the \ell are also zeros. Any nonzero bits after 33\ell do not matter, since everything is mod2t\mod 2^t.

All that remains is to range check v0v_0 and v1v_1

  1. Range check v0[0,22)v_0 \in [0, 2^2)
  2. Range check v1=[0,2+3)v_1 =\in [0, 2^{\ell + 3})

Subtractions

When unconstrained, computing u0=p0+2p10(r0+2r1)u_0 = p_0 + 2^{\ell} \cdot p_{10} - (r_0 + 2^{\ell} \cdot r_1) could require borrowing. Fortunately, we have the constraint that the 22\ell least significant bits of u0u_0 are 0 (i.e. u0=22v0u_0 = 2^{2\ell} \cdot v_0), which means borrowing cannot happen.

Borrowing is prevented because when the 22\ell least significant bits of the result are 0 it is not possible for the minuend to be less than the subtrahend. We can prove this by contradiction.

Let

  • x=p0+2p10x = p_0 + 2^{\ell} \cdot p_{10}
  • y=r0+2r1y = r_0 + 2^{\ell} \cdot r_1
  • the 22\ell least significant bits of xyx - y be 0

Suppose that borrowing occurs, that is, that x<yx < y. Recall that the length of xx is at most 2+22\ell + 2 bits. Therefore, since x<yx < y the top two bits of xx must be zero and so we have

xy=x2y,x - y = x_{2\ell} - y,

where x2x_{2\ell} denotes the 22\ell least significant bits of xx.

Recall also that the length of yy is 22\ell bits. We know this because limbs of the result are each constrained to be in [0,2)[0, 2^{\ell}). So the result of this subtraction is 22\ell bits. Since the 22\ell least significant bits of the subtraction are 0 this means that

xy=0x=y,\begin{aligned} x - y & = 0 \\ x &= y, \end{aligned}

which is a contradiction with x<yx < y.

Costs

Range checks should be the dominant cost, let's see how many we have.

Range check (3) requires two range checks for p11=p1112+p110p_{11} = p_{111} \cdot 2^\ell + p_{110}

  • a) p110[0,2)p_{110} \in [0, 2^\ell)
  • b) p111[0,22)p_{111} \in [0, 2^2)

Range check (8) requires two range checks and a decomposition check that is merged in (6).

  • a) v10[0,2)v_{10} \in [0, 2^{\ell})
  • b) v11[0,23)v_{11} \in [0, 2^3)

The range checks on p0,p1p_0, p_1 and p2p_2 follow from the range checks on a,ba,b and qq.

So we have 3.a, 3.b, 4, 7, 8.a, 8.b.

Range checkGate type(s)WitnessRows
7(v03)(v02)(v01)v0(v_0 - 3)(v_0 - 2)(v_0 - 1)v_0v0v_0< 1
3.a(p1113)(p1112)(p1111)p111(p_{111} - 3)(p_{111} - 2)(p_{111} - 1)p_{111}p111p_{111}< 1
8.bdegree-8 constraint or plookupv11v_{11}1
3.b, 4, 8.amulti-range-checkp10,p110,v10p_{10}, p_{110}, v_{10}4

So we have 1 multi-range-check, 1 single-range-check and 2 low-degree range checks. This consumes just over 5 rows.

Use CRT to constrain abqfr0modna \cdot b - q \cdot f - r \equiv 0 \mod n

Until now we have constrained the equation mod2t\mod 2^t, but remember that our application of the CRT means that we must also constrain the equation modn\mod n. We are leveraging the fact that if the identity holds for all moduli in M={n,2t}\mathcal{M} = \{n, 2^t\}, then it holds for lcm(M)=2tn=M\mathtt{lcm} (\mathcal{M}) = 2^t \cdot n = M.

Thus, we must check abqfr0modna \cdot b - q \cdot f - r \equiv 0 \mod n, which is over Fn\mathbb{F}_n.

This gives us equality mod2tn\mod 2^t \cdot n as long as the divisors are coprime. That is, as long as gcd(2t,n)=1\mathsf{gcd}(2^t, n) = 1. Since the native modulus nn is prime, this is true.

Thus, to perform this check is simple. We compute

an=amodnbn=bmodnqn=qmodnrn=rmodnfn=fmodn\begin{aligned} a_n &= a \mod n \\ b_n &= b \mod n \\ q_n &= q \mod n \\ r_n &= r \mod n \\ f_n &= f \mod n \end{aligned}

using our native field modulus with constraints like this

an=22a2+2a1+a0bn=22b2+2b1+b0qn=22q2+2q1+q0rn=22r2+2r1+r0fn=22f2+2f1+f0\begin{aligned} a_n &= 2^{2\ell} \cdot a_2 + 2^{\ell} \cdot a_1 + a_0 \\ b_n &= 2^{2\ell} \cdot b_2 + 2^{\ell} \cdot b_1 + b_0 \\ q_n &= 2^{2\ell} \cdot q_2 + 2^{\ell} \cdot q_1 + q_0 \\ r_n & = 2^{2\ell} \cdot r_2 + 2^{\ell} \cdot r_1 + r_0 \\ f_n &= 2^{2\ell} \cdot f_2 + 2^{\ell} \cdot f_1 + f_0 \\ \end{aligned}

and then constrain

anbnqnfnrn=0modn.a_n \cdot b_n - q_n \cdot f_n - r_n = 0 \mod n.

Note that we do not use the negated foreign field modulus here.

This requires a single constraint of the form

  1. anbnqnfn=rna_n \cdot b_n - q_n \cdot f_n = r_n

with all of the terms expanded into the limbs according the above equations. The values an,bn,qn,fna_n, b_n, q_n, f_n and rnr_n do not need to be in the witness.

Range check both sides of ab=qf+ra \cdot b = q \cdot f + r

Until now we have constrained that equation ab=qf+ra \cdot b = q \cdot f + r holds modulo 2t2^t and modulo nn, so by the CRT it holds modulo M=2tnM = 2^t \cdot n. Remember that, as outlined in the "Overview" section, we must prove our equation over the positive integers, rather than modM\mod M. By doing so, we assure that our solution is not equal to some qM+rq' \cdot M + r' where q,rFMq', r' \in F_{M}, in which case qq or rr would be invalid.

First, let's consider the right hand side qf+rq \cdot f + r. We have

qf+r<2tnq \cdot f + r < 2^t \cdot n

Recall that we have parameterized 2tnf22^t \cdot n \ge f^2, so if we can bound qq and rr such that

qf+r<f2q \cdot f + r < f^2

then we have achieved our goal. We know that qq must be less than ff, so that is our first check, leaving us with

(f1)f+r<f2r<f2(f1)f=f\begin{aligned} (f - 1) \cdot f + r &< f^2 \\ r &< f^2 - (f - 1) \cdot f = f \end{aligned}

Therefore, to check qf+r<2tnq \cdot f + r < 2^t \cdot n, we need to check

  • q<fq < f
  • r<fr < f

This should come at no surprise, since that is how we parameterized 2tn2^t \cdot n earlier on. Note that by checking q<fq < f we assure correctness, while checking r<fr < f assures our solution is unique (sometimes referred to as canonical).

Next, we must perform the same checks for the left hand side (i.e., ab<2tna \cdot b < 2^t \cdot n). Since aa and bb must be less than the foreign field modulus ff, this means checking

  • a<fa < f
  • b<fb < f

So we have

ab(f1)(f1)=f22f+1\begin{aligned} a \cdot b &\le (f - 1) \cdot (f - 1) = f^2 - 2f + 1 \\ \end{aligned}

Since 2tnf22^t \cdot n \ge f^2 we have

f22f+1<f22tn    ab<2tn\begin{aligned} &f^2 - 2f + 1 < f^2 \le 2^t \cdot n \\ &\implies a \cdot b < 2^t \cdot n \end{aligned}

Bound checks

To perform the above range checks we use the upper bound check method described in the upper bound check section in Foreign Field Addition.

The upper bound check is as follows. We must constrain 0q<f0 \le q < f over the positive integers, so

2tq+2t<f+2t2tfq+2tf<2t\begin{aligned} 2^t \le q &+ 2^t < f + 2^t \\ 2^t - f \le q &+ 2^t - f < 2^t \\ \end{aligned}

Remember f=2tff' = 2^t - f is our negated foreign field modulus. Thus, we have

fq+f<2t\begin{aligned} f' \le q &+ f' < 2^t \\ \end{aligned}

So to check q<tq < t we just need to compute q=q+fq' = q + f' and check fq<2tf' \le q' < 2^t

Observe that

0q    fq0 \le q \implies f' \le q'

and that

q<2t    q<fq' < 2^t \implies q < f

So we only need to check

  • 0q0 \le q
  • q<2tq' < 2^t

The first check is always true since we are operating over the positive integers and qZ+q \in \mathbb{Z^+}. Observe that the second check also constrains that q<2tq < 2^t, since f2259<2tf \le 2^{259} < 2^t and thus

q2tq+f2tq2t(2tf)=fq<2t\begin{aligned} q' &\le 2^t \\ q + f' &\le 2^t \\ q &\le 2^t - (2^t - f) = f\\ q &< 2^t \end{aligned}

Therefore, to constrain q<fq < f we only need constraints for

  • q=q+fq' = q + f'
  • q<2tq' < 2^t

and we don't require an additional multi-range-check for q<2tq < 2^t.

Cost of bound checks

This section analyzes the structure and costs of bounds checks for foreign field addition and foreign field multiplication.

Addition

In our foreign field addition design the operands aa and bb do not need to be less than ff. The field overflow bit o\mathcal{o} for foreign field addition is at most 1. That is, a+b=of+ra + b = \mathcal{o} \cdot f + r, where rr is allowed to be greater than ff. Therefore,

(f+a)+(f+b)=1f+(f+a+b)(f + a) + (f + b) = 1 \cdot f + (f + a + b)

These can be chained along kk times as desired. The final result

r=(f++fk+a1+b1+ak+bk)r = (\underbrace{f + \cdots + f}_{k} + a_1 + b_1 + \cdots a_k + b_k)

Since the bit length of rr increases logarithmically with the number of additions, in Kimchi we must only check that the final rr in the chain is less than ff to constrain the entire chain.

Security note: In order to defer the r<fr < f check to the end of any chain of additions, it is extremely important to consider the potential impact of wraparound in Fn\mathbb{F_n}. That is, we need to consider whether the addition of a large chain of elements greater than the foreign field modulus could wrap around. If this could happen then the r<fr < f check could fail to detect an invalid witness. Below we will show that this is not possible in Kimchi.

Recall that our foreign field elements are comprised of 3 limbs of 88-bits each that are each represented as native field elements in our proof system. In order to wrap around and circumvent the r<fr < f check, the highest limb would need to wrap around. This means that an attacker would need to perform about kn/2k \approx n/2^{\ell} additions of elements greater than then foreign field modulus. Since Kimchi's native moduli (Pallas and Vesta) are 255-bits, the attacker would need to provide a witness for about k2167k \approx 2^{167} additions. This length of witness is greater than Kimchi's maximum circuit (resp. witness) length. Thus, it is not possible for the attacker to generate a false proof by causing wraparound with a large chain of additions.

In summary, for foreign field addition in Kimchi it is sufficient to only bound check the last result rr' in a chain of additions (and subtractions)

  • Compute bound r=r+fr' = r + f' with addition gate (2 rows)
  • Range check r<2tr' < 2^t (4 rows)

Multiplication

In foreign field multiplication, the situation is unfortunately different, and we must check that each of a,b,qa, b, q and rr are less than ff. We cannot adopt the strategy from foreign field addition where the operands are allowed to be greater than ff because the bit length of rr would increases linearly with the number of multiplications. That is,

(a1+f)(a2+f)=1f+f2+(a1+a21)f+a1a2r(a_1 + f) \cdot (a_2 + f) = 1 \cdot f + \underbrace{f^2 + (a_1 + a_2 - 1) \cdot f + a_1 \cdot a_2}_{r}

and after a chain of kk multiplication we have

r=fk++a1akr = f^k + \ldots + a_1 \cdots a_k

where r>fkr > f^k quickly overflows our CRT modulus 2tn2^t \cdot n. For example, assuming our maximum foreign modulus of f=2259f = 2^{259} and either of Kimchi's native moduli (i.e. Pallas or Vesta), fk>2tnf^k > 2^t \cdot n for k>2k > 2. That is, an overflow is possible for a chain of greater than 1 foreign field multiplication. Thus, we must check a,b,qa, b, q and rr are less than ff for each multiplication.

Fortunately, in many situations the input operands may already be checked either as inputs or as outputs of previous operations, so they may not be required for each multiplication operation.

Thus, the qq' and rr' checks (or equivalently qq and rr) are our main focus because they must be done for every multiplication.

  • Compute bound q=q+fq' = q + f' with addition gate (2 rows)
  • Compute bound r=r+fr' = r + f' with addition gate (2 rows)
  • Range check q<2tq' < 2^t (4 rows)
  • Range check r<2tr' < 2^t (4 rows)

This costs 12 rows per multiplication. In a subsequent section, we will reduce it to 8 rows.

2-limb decomposition

Due to the limited number of permutable cells per gate, we do not have enough cells for copy constraining qq' and rr' (or qq and rr) to their respective range check gadgets. To address this issue, we must decompose qq' into 2 limbs instead of 3, like so

q=q01+22q2q' = q'_{01} + 2^{2\ell} \cdot q'_2

and

q01=q0+2q1q'_{01} = q'_0 + 2^{\ell} \cdot q'_1

Thus, qq' is decomposed into two limbs q01q'_{01} (at most 22\ell bits) and q2q'_2 (at most \ell bits).

Note that qq' must be range checked by a multi-range-check gadget. To do this the multi-range-check gadget must

  • Store a copy of the limbs q0,q1q'_0, q'_1 and q2q'_2 in its witness
  • Range check that they are \ell bit each
  • Constrain that q01=q0+2q1q'_{01} = q'_0 + 2^{\ell} \cdot q'_1 (this is done with a special mode of the multi-range-check gadget)
  • Have copy constraints for q01q'_{01} and q2q'_2 as outputs of the ForeignFieldMul gate and inputs to the multi-range-check gadget

Note that since the foreign field multiplication gate computes qq' from qq which is already in the witness and q01q'_{01} and q2q'_2 have copy constraints to a multi-range-check gadget that fully constrains their decomposition from qq', then the ForeignFieldMul gate does not need to store an additional copy of q0q'_0 and q1q'_1.

An optimization

Since the q<fq < f and r<fr < f checks must be done for each multiplication it makes sense to integrate them into the foreign field multiplication gate. By doing this we can save 4 rows per multiplication.

Doing this doesn't require adding a lot more witness data because the operands for the bound computations q=q+fq' = q + f' and r=r+fr' = r + f' are already present in the witness of the multiplication gate. We only need to store the bounds qq' and rr' in permutable witness cells so that they may be copied to multi-range-check gates to check they are each less than 2t2^t.

To constrain x+f=xx + f' = x', the equation we use is

x+2t=of+x,x + 2^t = \mathcal{o} \cdot f + x',

where xx is the original value, o=1\mathcal{o}=1 is the field overflow bit and xx' is the remainder and our desired addition result (e.g. the bound). Rearranging things we get

x+2tf=x,x + 2^t - f = x',

which is just

x+f=x,x + f' = x',

Recall from the section "Avoiding borrows" that ff' is often larger than ff. At first this seems like it could be a problem because in multiplication each operation must be less than ff. However, this is because the maximum size of the multiplication was quadratic in the size of ff (we use the CRT, which requires the bound that ab<2tna \cdot b < 2^t \cdot n). However, for addition the result is much smaller and we do not require the CRT nor the assumption that the operands are smaller than ff. Thus, we have plenty of space in \ell-bit limbs to perform our addition.

So, the equation we need to constrain is

x+f=x.x + f' = x'.

We can expand the left hand side into the 2 limb format in order to obtain 2 intermediate sums

s01=x01+f01s2=x2+f2\begin{aligned} s_{01} = x_{01} + f_{01}' \\ s_2 = x_2 + f'_2 \\ \end{aligned}

where x01x_{01} and f01f'_{01} are defined like this

x01=x0+2x1f01=f0+2f1\begin{aligned} x_{01} = x_0 + 2^{\ell} \cdot x_1 \\ f'_{01} = f'_0 + 2^{\ell} \cdot f'_1 \\ \end{aligned}

and xx and ff' are defined like this

x=x01+22x2f=f01+22f2\begin{aligned} x = x_{01} + 2^{2\ell} \cdot x_2 \\ f' = f'_{01} + 2^{2\ell} \cdot f'_2 \\ \end{aligned}

Going back to our intermediate sums, the maximum bit length of sum s01s_{01} is computed from the maximum bit lengths of x01x_{01} and f01f'_{01}

(21)+2(21)x01+(21)+2(21)f01=22+12,\underbrace{(2^{\ell} - 1) + 2^{\ell} \cdot (2^{\ell} - 1)}_{x_{01}} + \underbrace{(2^{\ell} - 1) + 2^{\ell} \cdot (2^{\ell} - 1)}_{f'_{01}} = 2^{2\ell+ 1} - 2,

which means s01s_{01} is at most 2+12\ell + 1 bits long.

Similarly, since x2x_2 and f2f'_2 are less than 22^{\ell}, the max value of s2s_2 is

(21)+(21)=2+12,(2^{\ell} - 1) + (2^{\ell} - 1) = 2^{\ell + 1} - 2,

which means s2s_2 is at most +1\ell + 1 bits long.

Thus, we must constrain

s01+22s2x0122x2=0mod2t.s_{01} + 2^{2\ell} \cdot s_2 - x'_{01} - 2^{2\ell} \cdot x'_2 = 0 \mod 2^t.

The accumulation of this into parts looks like this.

0             L             2L            3L=t          4L
|-------------|-------------|-------------|-------------|-------------|
:
|------------s01------------:-| 2L + 1
: ↖w01
|------s2-----:-| L + 1
: ↖w2
:
|------------x'01-----------|
:
|------x'2----|
:
\____________________________/
≈ z01 \_____________/
≈ z2

The two parts are computed with

z01=s01x01z2=s2x2.\begin{aligned} z_{01} &= s_{01} - x'_{01} \\ z_2 &= s_2 - x'_2. \end{aligned}

Therefore, there are two carry bits w01w_{01} and w2w_2 such that

z01=22w01z2+w01=2w2\begin{aligned} z_{01} &= 2^{2\ell} \cdot w_{01} \\ z_2 + w_{01} &= 2^{\ell} \cdot w_2 \end{aligned}

In this scheme x01,x2,w01x'_{01}, x'_2, w_{01} and w2w_2 are witness data, whereas s01s_{01} and s2s_2 are formed from a constrained computation of witness data x01,x2x_{01}, x_2 and constraint system public parameter ff'. Note that due to carrying, witness x01x'_{01} and x2x'_2 can be different than the values s01s_{01} and s2s_2 computed from the limbs.

Thus, each bound addition x+fx + f' requires the following witness data

  • x01,x2x_{01}, x_2
  • x01,x2x'_{01}, x'_2
  • w01,w2w_{01}, w_2

where ff' is baked into the gate coefficients. The following constraints are needed

  • 22w01=s01x012^{2\ell} \cdot w_{01} = s_{01} - x'_{01}
  • 2w2=s2+w01x22^{\ell} \cdot w_2 = s_2 + w_{01} - x'_2
  • x01[0,22)x'_{01} \in [0, 2^{2\ell})
  • x2[0,2)x'_2 \in [0, 2^{\ell})
  • w01[0,2)w_{01} \in [0, 2)
  • w2[0,2)w_2 \in [0, 2)

Due to the limited number of copyable witness cells per gate, we are currently only performing this optimization for qq.

The witness data is

  • q0,q1,q2q_0, q_1, q_2
  • q01,q2q'_{01}, q'_2
  • qcarry01,qcarry2q'_{carry01}, q'_{carry2}

The checks are

  1. q0[0,2)q_0 \in [0, 2^{\ell})
  2. q1[0,2)q_1 \in [0, 2^{\ell})
  3. q0=q0+f0q'_0 = q_0 + f'_0
  4. q1=q1+f1q'_1 = q_1 + f'_1
  5. s01=q0+2q1s_{01} = q'_0 + 2^{\ell} \cdot q'_1
  6. q01[0,22)q'_{01} \in [0, 2^{2\ell})
  7. q01=q0+2q1q'_{01} = q'_0 + 2^{\ell} \cdot q'_1
  8. qcarry01[0,2)q'_{carry01} \in [0, 2)
  9. 22qcarry01=s01q012^{2\ell} \cdot q'_{carry01} = s_{01} - q'_{01}
  10. q2[0,2)q_2 \in [0, 2^{\ell})
  11. s2=q2+f2s_2 = q_2 + f'_2
  12. qcarry2[0,2)q'_{carry2} \in [0, 2)
  13. 2qcarry2=s2+w01q22^{\ell} \cdot q'_{carry2} = s_2 + w_{01} - q'_2

Checks (1) - (5) assure that s01s_{01} is at most 2+12\ell + 1 bits. Whereas checks (10) - (11) assure that s2s_2 is at most +1\ell + 1 bits. Altogether they are comprise a single multi-range-check of q0,q1q_0, q_1 and q2q_2. However, as noted above, we do not have enough copyable cells to output q1,q2q_1, q_2 and q3q_3 to the multi-range-check gadget. Therefore, we adopt a strategy where the 2 limbs q01q'_{01} and q2q'_2 are output to the multi-range-check gadget where the decomposition of q0q'_0 and q2q'_2 into q01=p0+2p1q'_{01} = p_0 + 2^{\ell} \cdot p_1 is constrained and then q0,q1q'_0, q'_1 and q2q'_2 are range checked.

Although q1,q2q_1, q_2 and q3q_3 are not range checked directly, this is safe because, as shown in the "Bound checks" section, range-checking that q[0,2t)q' \in [0, 2^t) also constrains that q[0,2t)q \in [0, 2^t). Therefore, the updated checks are

  1. q0[0,2)q_0 \in [0, 2^{\ell}) multi-range-check
  2. q1[0,2)q_1 \in [0, 2^{\ell}) multi-range-check
  3. q0=q0+f0q'_0 = q_0 + f'_0 ForeignFieldMul
  4. q1=q1+f1q'_1 = q_1 + f'_1 ForeignFieldMul
  5. s01=q0+2q1s_{01} = q'_0 + 2^{\ell} \cdot q'_1 ForeignFieldMul
  6. q01=q0+2q1q'_{01} = q'_0 + 2^{\ell} \cdot q'_1 multi-range-check
  7. qcarry01[0,2)q'_{carry01} \in [0, 2) ForeignFieldMul
  8. 22qcarry01=s01q012^{2\ell} \cdot q'_{carry01} = s_{01} - q'_{01} ForeignFieldMul
  9. q2[0,2)q_2 \in [0, 2^{\ell}) multi-range-check
  10. s2=q2+f2s_2 = q_2 + f'_2 ForeignFieldMul
  11. qcarry2[0,2)q'_{carry2} \in [0, 2) ForeignFieldMul
  12. 2qcarry2=s2+qcarry01q22^{\ell} \cdot q'_{carry2} = s_2 + q'_{carry01} - q'_2 ForeignFieldMul

Note that we don't need to range-check q01q'_{01} is at most 2+12\ell + 1 bits because it is already implicitly constrained by the multi-range-check gadget constraining that q0,q1q'_0, q'_1 and q2q'_2 are each at most \ell bits and that q01=q0+2q1q'_{01} = q'_0 + 2^{\ell} \cdot q'_1. Furthermore, since constraining the decomposition is already part of the multi-range-check gadget, we do not need to do it here also.

To simplify things further, we can combine some of these checks. Recall our checked computations for the intermediate sums

s01=q01+f01s2=q2+f2\begin{aligned} s_{01} &= q_{01} + f'_{01} \\ s_2 &= q_2 + f'_2 \\ \end{aligned}

where q01=q0+2q1q_{01} = q_0 + 2^{\ell} \cdot q_1 and f01=f0+2f1f'_{01} = f'_0 + 2^{\ell} \cdot f'_1. These do not need to be separate constraints, but are instead part of existing ones.

Checks (10) and (11) can be combined into a single constraint 2qcarry2=(q2+f2)+qcarry01q22^{\ell} \cdot q'_{carry2} = (q_2 + f'_2) + q'_{carry01} - q'_2. Similarly, checks (3) - (5) and (8) can be combined into 22qcarry01=q01+f01q012^{2\ell} \cdot q'_{carry01} = q_{01} + f'_{01} - q'_{01} with q01q_{01} and f01f'_{01} further expanded. The reduced constraints are

  1. q0[0,2)q_0 \in [0, 2^{\ell}) multi-range-check
  2. q1[0,2)q_1 \in [0, 2^{\ell}) multi-range-check
  3. q01=q0+2q1q'_{01} = q'_0 + 2^{\ell} \cdot q'_1 multi-range-check
  4. qcarry01[0,2)q'_{carry01} \in [0, 2) ForeignFieldMul
  5. 22qcarry01=s01q012^{2\ell} \cdot q'_{carry01} = s_{01} - q'_{01} ForeignFieldMul
  6. q2[0,2)q_2 \in [0, 2^{\ell}) multi-range-check
  7. qcarry2[0,2)q'_{carry2} \in [0, 2) ForeignFieldMul
  8. 2qcarry2=s2+w01q22^{\ell} \cdot q'_{carry2} = s_2 + w_{01} - q'_2 ForeignFieldMul

Finally, there is one more optimization that we will exploit. This optimization relies on the observation that for bound addition the second carry bit qcarry2q'_{carry2} is always zero. This this may be obscure, so we will prove it by contradiction. To simplify our work we rename some variables by letting x0=q01x_0 = q_{01} and x1=q2x_1 = q_2. Thus, qcarry2q'_{carry2} being non-zero corresponds to a carry in x1+f1x_1 + f'_1.

Proof: To get a carry in the highest limbs x1+f1x_1 + f'_1 during bound addition, we need

2<x1+ϕ0+f121+ϕ0+f12^{\ell} < x_1 + \phi_0 + f'_1 \le 2^{\ell} - 1 + \phi_0 + f'_1

where 212^{\ell} - 1 is the maximum possible size of x1x_1 (before it overflows) and ϕ0\phi_0 is the overflow bit from the addition of the least significant limbs x0x_0 and f0f'_0. This means

2ϕ0f1<x1<22^{\ell} - \phi_0 - f'_1 < x_1 < 2^{\ell}

We cannot allow xx to overflow the foreign field, so we also have

x1<(fx0)/22x_1 < (f - x_0)/2^{2\ell}

Thus,

2ϕ0f1<(fx0)/22=f/22x0/222^{\ell} - \phi_0 - f'_1 < (f - x_0)/2^{2\ell} = f/2^{2\ell} - x_0/2^{2\ell}

Since x0/22=ϕ0x_0/2^{2\ell} = \phi_0 we have

2ϕ0f1<f/22ϕ02^{\ell} - \phi_0 - f'_1 < f/2^{2\ell} - \phi_0

so

2f1<f/222^{\ell} - f'_1 < f/2^{2\ell}

Notice that f/22=f1f/2^{2\ell} = f_1. Now we have

2f1<f1f1>2f12^{\ell} - f'_1 < f_1 \\ \Longleftrightarrow \\ f'_1 > 2^{\ell} - f_1

However, this is a contradiction with the definition of our negated foreign field modulus limb f1=2f1f'_1 = 2^{\ell} - f_1. \blacksquare

We have proven that qcarry2q'_{carry2} is always zero, so that allows use to simplify our constraints. We now have

  1. q0[0,2)q_0 \in [0, 2^{\ell}) multi-range-check
  2. q1[0,2)q_1 \in [0, 2^{\ell}) multi-range-check
  3. q01=q0+2q1q'_{01} = q'_0 + 2^{\ell} \cdot q'_1 multi-range-check
  4. qcarry01[0,2)q'_{carry01} \in [0, 2) ForeignFieldMul
  5. 22qcarry01=s01q012^{2\ell} \cdot q'_{carry01} = s_{01} - q'_{01} ForeignFieldMul
  6. q2[0,2)q_2 \in [0, 2^{\ell}) multi-range-check
  7. q2=s2+w01q'_2 = s_2 + w_{01} ForeignFieldMul

In other words, we have eliminated constraint (7) and removed qcarry2q'_{carry2} from the witness.

Since we already needed to range-check qq or qq', the total number of new constraints added is 4: 3 added to ForeignFieldMul and 1 added to multi-range-check gadget for constraining the decomposition of q01q'_{01}.

This saves 2 rows per multiplication.

Chaining multiplications

Due to the limited number of cells accessible to gates, we are not able to chain multiplications into multiplications. We can chain foreign field additions into foreign field multiplications, but currently do not support chaining multiplications into additions (though there is a way to do it).

Constraining the computation

Now we collect all of the checks that are required to constrain foreign field multiplication

1. Range constrain aa

If the aa operand has not been constrained to [0,f)[0, f) by any previous foreign field operations, then we constrain it like this

  • Compute bound a=a+fa' = a + f' with addition gate (2 rows) ForeignFieldAdd
  • Range check a[0,2t)a' \in [0, 2^t) (4 rows) multi-range-check

2. Range constrain bb

If the bb operand has not been constrained to [0,f)[0, f) by any previous foreign field operations, then we constrain it like this

  • Compute bound b=b+fb' = b + f' with addition gate (2 rows) ForeignFieldAdd
  • Range check b[0,2t)b' \in [0, 2^t) (4 rows) multi-range-check

3. Range constrain qq

The quotient qq is constrained to [0,f)[0, f) for each multiplication as part of the multiplication gate

  • Compute bound q=q+fq' = q + f' with ForeignFieldMul constraints
  • Range check q[0,2t)q' \in [0, 2^t) (4 rows) multi-range-check
  • Range check q0q \ge 0 ForeignFieldMul (implicit by storing q in witness)

4. Range constrain rr

The remainder rr is constrained to [0,f)[0, f) for each multiplication using an external addition gate.

  • Compute bound r=r+fr' = r + f' with addition gate (2 rows) ForeignFieldAdd
  • Range check r[0,2t)r' \in [0, 2^t) (4 rows) multi-range-check

5. Compute intermediate products

Compute and constrain the intermediate products p0,p1p_0, p_1 and p2p_2 as:

  • p0=a0b0+q0f0p_0 = a_0 \cdot b_0 + q_0 \cdot f'_0 ForeignFieldMul
  • p1=a0b1+a1b0+q0f1+q1f0p_1 = a_0 \cdot b_1 + a_1 \cdot b_0 + q_0 \cdot f'_1 + q_1 \cdot f'_0 ForeignFieldMul
  • p2=a0b2+a2b0+a1b1+q0f2+q2f0+q1f1p_2 = a_0 \cdot b_2 + a_2 \cdot b_0 + a_1 \cdot b_1 + q_0 \cdot f'_2 + q_2 \cdot f'_0 + q_1 \cdot f'_1 ForeignFieldMul

where each of them is about 22\ell-length elements.

6. Native modulus checked computations

Compute and constrain the native modulus values, which are used to check the constraints modulo nn in order to apply the CRT

  • an=22a2+2a1+a0modna_n = 2^{2\ell} \cdot a_2 + 2^{\ell} \cdot a_1 + a_0 \mod n
  • bn=22b2+2b1+b0modnb_n = 2^{2\ell} \cdot b_2 + 2^{\ell} \cdot b_1 + b_0 \mod n
  • qn=22q2+2q1+q0modnq_n = 2^{2\ell} \cdot q_2 + 2^{\ell} \cdot q_1 + q_0 \mod n
  • rn=22r2+2r1+r0modnr_n = 2^{2\ell} \cdot r_2 + 2^{\ell} \cdot r_1 + r_0 \mod n
  • fn=22f2+2f1+f0modnf_n = 2^{2\ell} \cdot f_2 + 2^{\ell} \cdot f_1 + f_0 \mod n

7. Decompose middle intermediate product

Check that p1=2p11+p10p_1 = 2^{\ell} \cdot p_{11} + p_{10}:

  • p1=2p11+p10p_1 = 2^\ell \cdot p_{11} + p_{10} ForeignFieldMul
  • Range check p10[0,2)p_{10} \in [0, 2^\ell) multi-range-check
  • Range check p11[0,2+2)p_{11} \in [0, 2^{\ell+2})
    • p11=p1112+p110p_{11} = p_{111} \cdot 2^\ell + p_{110} ForeignFieldMul
    • Range check p110[0,2)p_{110} \in [0, 2^\ell) multi-range-check
    • Range check p111[0,22)p_{111} \in [0, 2^2) with a degree-4 constraint ForeignFieldMul

8. Zero sum for multiplication

Now we have to constrain the zero sum

(p0r0)+288(p1r1)+2176(p2r2)=0mod2t(p_0 - r_0) + 2^{88}(p_1 - r_1) + 2^{176}(p_2 - r_2) = 0 \mod 2^t

We constrain the first and the second halves as

  • v022=p0+2p10r02r1v_0 \cdot 2^{2\ell} = p_0 + 2^\ell \cdot p_{10} - r_0 - 2^\ell \cdot r_1 ForeignFieldMul
  • v12=(p1112+p110)+p2r2+v0v_1 \cdot 2^{\ell} = (p_{111} \cdot 2^\ell + p_{110}) + p_2 - r_2 + v_0 ForeignFieldMul

And some more range checks

  • Check that v0[0,22)v_0 \in [0, 2^2) with a degree-4 constraint ForeignFieldMul
  • Check that v1[0,2+3)v_1 \in [0, 2^{\ell + 3})
    • Check v1=v11288+v10v_1 = v_{11} \cdot 2^{88} + v_{10} ForeignFieldMul
    • Check v11[0,23]v_{11} \in [0, 2^3] ForeignFieldMul
    • Check v10<2v_{10} < 2^\ell with range constraint multi-range-check

To check that v11[0,23)v_{11} \in [0, 2^3) (i.e. that v11v_{11} is at most 3 bits long) we first range-check v11[0,212)v_{11} \in [0, 2^{12}) with a 12-bit plookup. This means there can be no higher bits set beyond the 12-bits of v11v_{11}. Next, we scale v11v_{11} by 292^9 in order to move the highest 123=912 - 3 = 9 bits beyond the 1212th bit. Finally, we perform a 12-bit plookup on the resulting value. That is, we have

  • Check v11[0,212)v_{11} \in [0, 2^{12}) with a 12-bit plookup (to prevent any overflow)
  • Check scaledv11=29v11\mathsf{scaled}_{v_{11}} = 2^9 \cdot v_{11}
  • Check scaledv11\mathsf{scaled}_{v_{11}} is a 12-bit value with a 12-bit plookup

Kimchi's plookup implementation is extremely flexible and supports optional scaling of the lookup target value as part of the lookup operation. Thus, we do not require two witness elements, two lookup columns, nor the scaledv11=29v11\mathsf{scaled}_{v_{11}} = 2^9 \cdot v_{11} custom constraint. Instead we can just store v11v_{11} in the witness and define this column as a "joint lookup" comprised of one 12-bit plookup on the original cell value and another 12-bit plookup on the cell value scaled by 292^9, thus, yielding a 3-bit check. This eliminates one plookup column and reduces the total number of constraints.

9. Native modulus constraint

Using the checked native modulus computations we constrain that

anbnqnfnrn=0modn.a_n \cdot b_n - q_n \cdot f_n - r_n = 0 \mod n.

10. Compute intermediate sums

Compute and constrain the intermediate sums s01s_{01} and s2s_2 as:

  • s01=q01+f01s_{01} = q_{01} + f'_{01}
  • s2=q2+f2s_2 = q_2 + f_2'
  • q01=q0+2q1q_{01} = q_0 + 2^{\ell} \cdot q_1
  • f01=f0+2f1f'_{01} = f'_0 + 2^{\ell} \cdot f'_1

11. Decompose the lower quotient bound

Check that q01=q0+2q1q'_{01} = q'_0 + 2^{\ell} \cdot q'_1.

Done by (3) above with the multi-range-check on qq'

  • q01=q0+2q1q'_{01} = q'_0 + 2^{\ell} \cdot q'_1
  • Range check q0[0,2)q'_0 \in [0, 2^\ell)
  • Range check q1[0,2)q'_1 \in [0, 2^\ell)

12. Zero sum for quotient bound addition

We constrain that

s01q01+22(s2q2)=0mod2ts_{01} - q'_{01} + 2^{2\ell} \cdot (s_2 - q'_2) = 0 \mod 2^t

by constraining the two halves

  • 22qcarry01=s01q012^{2\ell} \cdot q'_{carry01} = s_{01} - q'_{01}
  • 2qcarry2=s2+qcarry01q22^{\ell} \cdot q'_{carry2} = s_2 + q'_{carry01} - q'_2

We also need a couple of small range checks

  • Check that qcarry01q'_{carry01} is boolean ForeignFieldMul
  • Check that q2q'_2 is boolean ForeignFieldMul

Layout

Based on the constraints above, we need the following 12 values copied from the range check gates.

a0, a1, a2, b0, b1, b2, q0, q1, q2, r0, r1, r2

Since we need 12 copied values for the constraints, they must span 2 rows.

The q<fq < f bound limbs q0,q1q'_0, q'_1 and q2q'_2 must be in copyable cells so they can be range-checked. Similarly, the limbs of the operands aa, bb and the result rr must all be in copyable cells. This leaves only 2 remaining copyable cells and, therefore, we cannot compute and output r=r+fr' = r + f'. It must be deferred to an external ForeignFieldAdd gate with the rr cells copied as an argument.

NB: the ff and ff' values are publicly visible in the gate coefficients.

CurrNext
ColumnForeignFieldMulZero
0a0a_0 (copy)r0r_0 (copy)
1a1a_1 (copy)r1r_1 (copy)
2a2a_2 (copy)r2r_2 (copy)
3b0b_0 (copy)q01q'_{01} (copy)
4b1b_1 (copy)q2q'_2 (copy)
5b2b_2 (copy)p10p_{10} (copy)
6v10v_{10} (copy)p110p_{110} (copy)
7v11v_{11} (scaled plookup)
8v0v_0
9q0q_0
10q1q_1
11q2q_2
12qcarry01q'_{carry01}
13p111p_{111}
14

Checked computations

As described above foreign field multiplication has the following intermediate computations

  1. p0=a0b0+q0f0p_0 = a_0b_0 + q_0f'_0
  2. p1=a0b1+a1b0+q0f1+q1f0p_1 = a_0b_1 + a_1b_0 + q_0f'_1 + q_1f'_0
  3. p2=a0b2+a2b0+a1b1+q0f2+q2f0+q1f1p_2 = a_0b_2 + a_2b_0 + a_1b_1 + q_0f'_2 + q_2f'_0 + q_1f'_1.

For the qq bound addition we must also compute

  1. s01=q01+f01s_{01} = q_{01} + f'_{01}
  2. s2=q2+f2s_2 = q_2 + f_2'
  3. q01=q0+2q1q_{01} = q_0 + 2^{\ell} \cdot q_1
  4. f01=f0+2f1f'_{01} = f'_0 + 2^{\ell} \cdot f'_1

Note the equivalence

s01=q01+f01=q0+2q1+f0+2f1=q0+f0+2(q1+f1)=q0+2q1\begin{aligned} s_{01} &= q_{01} + f'_{01} \\ &= q_0 + 2^{\ell} \cdot q_1 + f'_0 + 2^{\ell} \cdot f'_1 \\ &= q_0 + f'_0 + 2^{\ell} \cdot (q'_1 + f'_1) \\ &= q'_0 + 2^{\ell} \cdot q'_1 \end{aligned}

where q0=q0+f0q'_0 = q_0 + f'_0 and q1=q1+f1q'_1 = q_1 + f'_1 can be done with checked computations.

Next, for applying the CRT we compute

  1. an=22a2+2a1+a0modna_n = 2^{2\ell} \cdot a_2 + 2^{\ell} \cdot a_1 + a_0 \mod n
  2. bn=22b2+2b1+b0modnb_n = 2^{2\ell} \cdot b_2 + 2^{\ell} \cdot b_1 + b_0 \mod n
  3. qn=22q2+2q1+q0modnq_n = 2^{2\ell} \cdot q_2 + 2^{\ell} \cdot q_1 + q_0 \mod n
  4. rn=22r2+2r1+r0modnr_n = 2^{2\ell} \cdot r_2 + 2^{\ell} \cdot r_1 + r_0 \mod n
  5. fn=22f2+2f1+f0modnf_n = 2^{2\ell} \cdot f_2 + 2^{\ell} \cdot f_1 + f_0 \mod n

Checks

In total we require the following checks

  1. p111[0,22)p_{111} \in [0, 2^2)
  2. v10[0,2)v_{10} \in [0, 2^{\ell}) multi-range-check
  3. p110[0,2)p_{110} \in [0, 2^{\ell}) multi-range-check
  4. p10[0,2)p_{10} \in [0, 2^{\ell}) multi-range-check
  5. p11=2p111+p110p_{11} = 2^{\ell} \cdot p_{111} + p_{110}
  6. p1=2p11+p10p_1 = 2^{\ell} \cdot p_{11} + p_{10}
  7. v0[0,22)v_0 \in [0, 2^2)
  8. 22v0=p0+2p10r02r12^{2\ell} \cdot v_0 = p_0 + 2^{\ell} \cdot p_{10} - r_0 - 2^{\ell} \cdot r_1
  9. v11[0,23)v_{11} \in [0, 2^{3})
  10. v1=2v11+v10v_1 = 2^{\ell} \cdot v_{11} + v_{10}
  11. 2v1=v0+p11+p2r22^{\ell} \cdot v_1 = v_0 + p_{11} + p_2 - r_2
  12. anbnqnfn=rna_n \cdot b_n - q_n \cdot f_n = r_n
  13. q0[0,2)q'_0 \in [0, 2^{\ell}) multi-range-check
  14. q1[0,2)q'_1 \in [0, 2^{\ell}) multi-range-check
  15. q2[0,2)q'_2 \in [0, 2^{\ell}) multi-range-check
  16. q01=q0+2q1q'_{01} = q'_0 + 2^{\ell} \cdot q'_1 multi-range-check
  17. qcarry01[0,2)q'_{carry01} \in [0, 2)
  18. 22qcarry01=s01q012^{2\ell} \cdot q'_{carry01} = s_{01} - q'_{01}
  19. q2=s2+qcarry01q'_2 = s_2 + q'_{carry01}

Constraints

These checks can be condensed into the minimal number of constraints as follows.

First, we have the range-check corresponding to (1) as a degree-4 constraint

C1: p111(p1111)(p1112)(p1113)p_{111} \cdot (p_{111} - 1) \cdot (p_{111} - 2) \cdot (p_{111} - 3)

Checks (2) - (4) are all handled by a single multi-range-check gadget.

C2: multi-range-check v10,p10,p110v_{10}, p_{10}, p_{110}

Next (5) and (6) can be combined into

C3: 2(2p111+p110)+p10=p12^{\ell} \cdot (2^{\ell} \cdot p_{111} + p_{110}) + p_{10} = p_1

Now we have the range-check corresponding to (7) as another degree-4 constraint

C4: v0(v01)(v02)(v03)v_0 \cdot (v_0 - 1) \cdot (v_0 - 2) \cdot (v_0 - 3)

Next we have check (8)

C5: 22v0=p0+2p10r02r12^{2\ell} \cdot v_0 = p_0 + 2^{\ell} \cdot p_{10} - r_0 - 2^{\ell} \cdot r_1

Up next, check (9) is a 3-bit range check

C6: Plookup v11v_{11} (12-bit plookup scaled by 292^9)

Now checks (10) and (11) can be combined into

C7: 2(v112+v10)=p2+p11+v0r22^{\ell} \cdot (v_{11} \cdot 2^{\ell} + v_{10}) = p_2 + p_{11} + v_0 - r_2

Next, for our use of the CRT, we must constrain that ab=qf+rmodna \cdot b = q \cdot f + r \mod n. Thus, check (12) is

C8: anbnqnfn=rna_n \cdot b_n - q_n \cdot f_n = r_n

Next we must constrain the quotient bound addition.

Checks (13) - (16) are all combined into multi-range-check gadget

C9: multi-range-check q0,q1,q2q'_0, q'_1, q'_2 and q01=q0+2q1q'_{01} = q'_0 + 2^{\ell} \cdot q'_1.

Check (17) is a carry bit boolean check

C10: qcarry01(qcarry011)q'_{carry01} \cdot (q'_{carry01} - 1)

Next, check (18) is

C11: 22qcarry10=s01q012^{2\ell} \cdot q'_{carry10} = s_{01} - q'_{01}

Finally, check (19) is

C12: q2=s2+qcarry01q'_2 = s_2 + q'_{carry01}

The Zero gate has no constraints and is just used to hold values required by the ForeignFieldMul gate.

External checks

The following checks must be done with other gates to assure the soundness of the foreign field multiplication

  • Range check input
    • multi-range-check aa
    • multi-range-check bb
  • Range check witness data
    • multi-range-check qq' and check q01=q0+2q1q_{01}' = q'_0 + 2^{\ell} q'_1
    • multi-range-check \ell-bit limbs: p10,p110,p111p_{10}, p_{110}, p_{111}
  • Range check output
    • multi-range-check either rr or rr'
  • Compute and constrain rr' the bound on rr
    • ForeignFieldAdd r+2t=1f+rr + 2^t = 1 \cdot f + r'

Copy constraints must connect the above witness cells to their respective input cells within the corresponding external check gates witnesses.