Foreign Field Multiplication RFC
This document explains how we constrain foreign field multiplication (i.e. non-native field multiplication) in the Kimchi proof system.
|Joseph Spadavecchia and Anais Querol||October 2022||Design of foreign field multiplication in Kimchi|
This gate constrains
where , a foreign field with modulus , using the native field with prime modulus .
In foreign field multiplication the foreign field modulus could be bigger or smaller than the native field modulus . When the foreign field modulus is bigger, then we need to emulate foreign field multiplication by splitting the foreign field elements up into limbs that fit into the native field element size. When the foreign modulus is smaller everything can be constrained either in the native field or split up into limbs.
Since our projected use cases are when the foreign field modulus is bigger (more on this below) we optimize our design around this scenario. For this case, not only must we split foreign field elements up into limbs that fit within the native field, but we must also operate in a space bigger than the foreign field. This is because we check the multiplication of two foreign field elements by constraining its quotient and remainder. That is, renaming to , we must constrain
where the maximum size of and is and so we have
Thus, the maximum size of the multiplication is quadratic in the size of foreign field.
Naïvely, this implies that we have elements of size that must split them up into limbs of size at most . For example, if the foreign field modulus is and the native field modulus is bits, then we’d need bits and, thus, require native limbs. However, each limb cannot consume all bits of the native field element because we need space to perform arithmetic on the limbs themselves while constraining the foreign field multiplication. Therefore, we need to choose a limb size that leaves space for performing these computations.
Later in this document (see the section entitled “Choosing the limb configuration”) we determine the optimal number of limbs that reduces the number of rows and gates required to constrain foreign field multiplication. This results in bits as our optimal limb size. In the section about intermediate products we place some upperbounds on the number of bits required when constraining foreign field multiplication with limbs of size thereby proving that the computations can fit within the native field size.
Observe that by combining the naïve approach above with a limb size of bits, we would require limbs for representing foreign field elements. Each limb is stored in a witness cell (a native field element). However, since each limb is necessarily smaller than the native field element size, it must be copied to the range-check gate to constrain its value. Since Kimchi only supports 7 copyable witness cells per row, this means that only one foreign field element can be stored per row. This means a single foreign field multiplication would consume at least 4 rows (just for the operands, quotient and remainder). This is not ideal because we want to limit the number of rows for improved performance.
Chinese remainder theorem
Fortunately, we can do much better than this, however, by leveraging the chinese remainder theorem (CRT for short) as we will now show. The idea is to reduce the number of bits required by constraining our multiplication modulo two coprime moduli: and . By constraining the multiplication with both moduli the CRT guarantees that the constraints hold with the bigger modulus .
The advantage of this approach is that constraining with the native modulus is very fast, allowing us to speed up our non-native computations. This practically reduces the costs to constraining with limbs over , where is something much smaller than .
For this to work, we must select a value for that is big enough. That is, we select such that . As described later on in this document, by doing so we reduce the number of bits required for our use cases to . With bit limbs this means that each foreign field element only consumes witness elements, and that means the foreign field multiplication gate now only consumes rows. The section entitled “Choosing ” describes this in more detail.
Bringing it all together, our approach is to verify that
over . In order to do this efficiently we use the CRT, which means that the equation holds mod . For the equation to hold over the integers we must also check that each side of the equation is less than .
The overall steps are therefore
- Apply CRT to equation (1)
- Check validity with binary modulus
- Check validity with native modulus
- Check each side of equation (1) is less than
This then implies that
where . That is, multiplied with is equal to where . There is a lot more to each of these steps. That is the subject of the rest of this document.
Within our overall approach, aside from the CRT, we also use a number of other strategies to reduce the number and degree of constraints.
- Avoiding borrows and carries
- Intermediate product elimination
- Combining multiplications
The rest of this document describes each part in detail.
This section describes important parameters that we require and how they are computed.
- Native field modulus
- Foreign field modulus
- Binary modulus
- CRT modulus
- Limb size in bits
Under the hood, we constrain . Since this is a multiplication in the foreign field , the maximum size of and are , so this means
Therefore, we need the modulus such that
which is the same as saying, given and , we must select such that
Thus, we have an lower bound on .
Instead of dynamically selecting for every and combination, we fix a that will work for the different selections of and relevant to our use cases.
To guide us, from above we know that
and we know the field moduli for our immediate use cases.
vesta = 2^254 + 45560315531506369815346746415080538113 (255 bits) pallas = 2^254 + 45560315531419706090280762371685220353 (255 bits) secp256k1 = 2^256 - 2^32 - 2^9 - 2^8 - 2^7 - 2^6 - 2^4 - 1 (256 bits)
So we can create a table
and know that to cover our use cases we need .
Next, given our native modulus and , we can compute the maximum foreign field modulus supported. Actually, we compute the maximum supported bit length of the foreign field modulus .
So we get
With we have
which is not enough space to handle anything larger than 256 bit moduli. Instead, we will use , giving bits.
The above formula is useful for checking the maximum number of bits supported of the foreign field modulus, but it is not useful for computing the maximum foreign field modulus itself (because is too coarse). For these checks, we can compute our maximum foreign field modulus more precisely with
The max prime foreign field modulus satisfying the above inequality for both Pallas and Vesta is
Choosing the limb configuration
Choosing the right limb size and the right number of limbs is a careful balance between the number of constraints (i.e. the polynomial degree) and the witness length (i.e. the number of rows). Because one limiting factor that we have in Kimchi is the 12-bit maximum for range check lookups, we could be tempted to introduce 12-bit limbs. However, this would mean having many more limbs, which would consume more witness elements and require significantly more rows. It would also increase the polynomial degree by increasing the number of constraints required for the intermediate products (more on this later).
We need to find a balance between the number of limbs and the size of the limbs. The limb configuration is dependent on the value of and our maximum foreign modulus (as described in the previous section). The larger the maximum foreign modulus, the more witness rows we will require to constrain the computation. In particular, each limb needs to be constrained by the range check gate and, thus, must be in a copyable (i.e. permuteable) witness cell. We have at most 7 copyable cells per row and gates can operate on at most 2 rows, meaning that we have an upperbound of at most 14 limbs per gate (or 7 limbs per row).
As stated above, we want the foreign field modulus to fit in as few rows as possible and we need to constrain operands , the quotient and remainder . Each of these will require cells for each limb. Thus, the number of cells required for these is
It is highly advantageous for performance to constrain foreign field multiplication with the minimal number of gates. This not only helps limit the number of rows, but also to keep the gate selector polynomial small. Because of all of this, we aim to constrain foreign field multiplication with a single gate (spanning at most rows). As mentioned above, we have a maximum of 14 permuteable cells per gate, so we can compute the maximum number of limbs that fit within a single gate like this.
Thus, the maximum number of limbs possible in a single gate configuration is 3.
Using and that covers our use cases (see the previous section), we are finally able to derive our limbs size
Therefore, our limb configuration is:
- Limb size bits
- Number of limbs is
When we constrain we want to be as efficient as possible.
Observe that the expansion of into limbs would also have subtractions between limbs, requiring our constraints to account for borrowing. Dealing with this would create undesired complexity and increase the degree of our constraints.
In order to avoid the possibility of subtractions we instead use where
The negated modulus becomes part of our gate coefficients and is not constrained because it is publicly auditable.
Using the substitution of the negated modulus, we now must constrain .
Observe that since and that when .
This section explains how we expand our constraints into limbs and then eliminate a number of extra terms.
We must constrain on the limbs, rather than as a whole. As described above, each foreign field element is split into three 88-bit limbs: , where contains the least significant bits and contains the most significant bits and so on.
Expanding the right-hand side into limbs we have
Since , the terms scaled by and are a multiple of the binary modulus and, thus, congruent to zero . They can be eliminated and we don’t need to compute them. So we are left with 3 intermediate products that we call :
So far, we have introduced these checked computations to our constraints
- Computation of
Let’s call . Remember that our goal is to constrain that (recall that any more significant bits than the 264th are ignored in ). Decomposing that claim into limbs, that means
We face two challenges
- Since are at least bits each, the right side of the equation above does not fit in
- The subtraction of the remainder’s limbs and could require borrowing
For the moment, let’s not worry about the possibility of borrows and instead focus on the first problem.
The first problem is that our native field is too small to constrain . We could split this up by multiplying and separately and create constraints that carefully track borrows and carries between limbs. However, a more efficient approach is combined the whole computation together and accumulate all the carries and borrows in order to reduce their number.
The trick is to assume a space large enough to hold the computation, view the outcome in binary and then split it up into chunks that fit in the native modulus.
To this end, it helps to know how many bits these intermediate products require. On the left side of the equation, is at most bits. We can compute this by substituting the maximum possible binary values (all bits set to 1) into like this
So fits in bits. Similarly, needs at most bits and is at most bits.
The diagram below shows the right hand side of the zero-sum equality from equation (2). That is, the value . Let’s look at how the different bits of and impact it.
0 L 2L 3L 4L |-------------|-------------|-------------|-------------|-------------| : |--------------p0-----------:-| 2L + 1 : |-------------:-p1------------| 2L + 2 p10➚ : p11➚ |----------------p2-------------| 2L + 3 : |-----r0------| : : |-----r1------| : |-----r2------| \__________________________/ \______________________________/ ≈ h0 ≈ h1
Within our native field modulus we can fit up to bits, for small values of (but sufficient for our case). Thus, we can only constrain approximately half of at a time. In the diagram above the vertical line at 2L bisects into two bit values: and (the exact definition of these values follow). Our goal is to constrain and .
Computing the zero-sum halves: and
Now we can derive how to compute and from and .
The direct approach would be to bisect both and and then define as just the sum of the lower bits of and minus and . Similarly would be just the sum of upper bits of and minus . However, each bisection requires constraints for the decomposition and range checks for the two halves. Thus, we would like to avoid bisections as they are expensive.
Ideally, if our ’s lined up on the boundary, we would not need to bisect at all. However, we are unlucky and it seems like we must bisect both and . Fortunately, we can at least avoid bisecting by allowing it to be summed into like this
Note that is actually greater than bits in length. This may not only be because it contains whose length is , but also because adding may cause an overflow. The maximum length of is computed by substituting in the maximum possible binary value of for the added terms and for the subtracted terms of the above equation.
which is bits.
N.b. This computation assumes correct sizes values for and , which we assure by range checks on the limbs.
Next, we compute as
The maximum size of is computed as
In order to obtain the maximum value of , we define . Since the maximum value of was , then the maximum value of is . For , the maximum value was , and thus:
which is bits.
Thus far we have the following constraints
- Composition of and result in
- Range check
- Range check
For the next step we would like to constrain and to zero. Unfortunately, we are not able to do this!
Firstly, as defined may not be zero because we have not bisected it precisely at bits, but have allowed it to contain the full bits. Recall that these two additional bits are because is at most bits long, but also because adding increases it to . These two additional bits are not part of the first bits of and, thus, are not necessarily zero. That is, they are added from the second bits (i.e. ).
Secondly, whilst the highest bits of would wrap to zero , when placed into the smaller bit in the native field, this wrapping does not happen. Thus, ’s highest bits may be nonzero.
We can deal with this non-zero issue by computing carry witness values.
Computing carry witnesses values and
Instead of constraining and to zero, there must be satisfying witness and such that the following constraints hold.
- There exists such that
- There exists such that
Here is the last two bits of ’s bits, i.e., the result of adding the highest bit of and any possible carry bit from the operation of . Similarly, corresponds to the highest bits of . It looks like this
0 L 2L 3L 4L |-------------|-------------|-------------|-------------|-------------| : |--------------h0-----------:--| 2L + 2 : ↖v0 :-------------h1-------------| 2L + 3 : \____________/ : v1➚
Remember we only need to prove the first bits of are zero, since everything is and . It may not be clear how this approach proves the bits are indeed zero because within and there are bits that are nonzero. The key observation is that these bits are too high for .
By making the argument with and we are proving that is something where the least significant bits are all zeros and that is something where the are also zeros. Any nonzero bits after do not matter, since everything is .
All that remains is to range check and
- Range check
- Range check
When unconstrained, computing could require borrowing. Fortunately, we have the constraint that the least significant bits of are
0 (i.e. ), which means borrowing cannot happen.
Borrowing is prevented because when the the least significant bits of the result are
0 it is not possible for the minuend to be less than the subtrahend. We can prove this by contradiction.
- the least significant bits of be
Suppose that borrowing occurs, that is, that . Recall that the length of is at most bits. Therefore, since the top two bits of must be zero and so we have
where denotes the least significant bits of .
Recall also that the length of is bits. We know this because limbs of the result are each constrained to be in . So the result of this subtraction is bits. Since the least significant bits of the subtraction are
0 this means that
which is a contradiction with .
Range checks should be the dominant cost, let’s see how many we have.
Range check (3) requires two range checks for
Range check (8) requires two range checks and a decomposition check that is merged in (6).
The range checks on and follow from the range checks on and .
So we have 3.a, 3.b, 4, 7, 8.a, 8.b.
|Range check||Gate type(s)||Witness||Rows|
|8.b||degree-8 constraint or plookup||1|
|3.b, 4, 8.a||4|
So we have 1 multi-range-check, 1 single-range-check and 2 low-degree range checks. This consumes just over 5 rows.
Use CRT to constrain
Until now we have constrained the equation , but remember that our application of the CRT means that we must also constrain the equation . We are leveraging the fact that if the identity holds for all moduli in , then it holds for .
Thus, we must check , which is over .
This gives us equality as long as the divisors are coprime. That is, as long as . Since the native modulus is prime, this is true.
Thus, to perform this check is simple. We compute
using our native field modulus with constraints like this
and then constrain
Note that we do not use the negated foreign field modulus here.
This requires a single constraint of the form
with all of the terms expanded into the limbs according the the above equations. The values and do not need to be in the witness.
Range check both sides of
Until now we have constrained that equation holds modulo and modulo , so by the CRT it holds modulo . Remember that, as outlined in the “Overview” section, we must prove our equation over the positive integers, rather than . By doing so, we assure that our solution is not equal to some where , in which case or would be invalid.
First, let’s consider the right hand side . We have
Recall that we have parameterized , so if we can bound and such that
then we have achieved our goal. We know that must be less than , so that is our first check, leaving us with
Therefore, to check , we need to check
This should come at no surprise, since that is how we parameterized earlier on. Note that by checking we assure correctness, while checking assures our solution is unique (sometimes referred to as canonical).
Next, we must perform the same checks for the left hand side (i.e., ). Since and must be less than the foreign field modulus , this means checking
So we have
Since we have
To perform the above range checks we use the upper bound check method described in the Foreign Field Addition RFC.
The upper bound check is as follows. We must constrain over the positive integers, so
Remember is our negated foreign field modulus. Thus, we have
So to check we just need to compute and check
So we only need to check
The first check is always true since we are operating over the positive integers and . Observe that the second check also constrains that , since and thus
Therefore, to constrain we only need constraints for
and we don’t require an additional multi-range-check for .
Cost of bound checks
This section analyzes the structure and costs of bounds checks for foreign field addition and foreign field multiplication.
In our foreign field addition design the operands and do not need to be less than . The field overflow bit for foreign field addition is at most 1. That is, , where is allowed to be greater than . Therefore,
These can be chained along times as desired. The final result
Since the bit length of increases logarithmically with the number of additions, in Kimchi we must only check that the final in the chain is less than to constrain the entire chain.
Security note: In order to defer the check to the end of any chain of additions, it is extremely important to consider the potential impact of wraparound in . That is, we need to consider whether the addition of a large chain of elements greater than the foreign field modulus could wrap around. If this could happen then the check could fail to detect an invalid witness. Below we will show that this is not possible in Kimchi.
Recall that our foreign field elements are comprised of 3 limbs of 88-bits each that are each represented as native field elements in our proof system. In order to wrap around and circumvent the check, the highest limb would need to wrap around. This means that an attacker would need to perform about additions of elements greater than then foreign field modulus. Since Kimchi’s native moduli (Pallas and Vesta) are 255-bits, the attacker would need to provide a witness for about additions. This length of witness is greater than Kimchi’s maximum circuit (resp. witness) length. Thus, it is not possible for the attacker to generate a false proof by causing wraparound with a large chain of additions.
In summary, for foreign field addition in Kimchi it is sufficient to only bound check the last result in a chain of additions (and subtractions)
- Compute bound with addition gate (2 rows)
- Range check (4 rows)
In foreign field multiplication, the situation is unfortunately different, and we must check that each of and are less than . We cannot adopt the strategy from foreign field addition where the operands are allowed to be greater than because the bit length of would increases linearly with the number of multiplications. That is,
and after a chain of multiplication we have
where quickly overflows our CRT modulus . For example, assuming our maximum foreign modulus of and either of Kimchi’s native moduli (i.e. Pallas or Vesta), for . That is, an overflow is possible for a chain of greater than 1 foreign field multiplication. Thus, we must check and are less than for each multiplication.
Fortunately, in many situations the input operands may already be checked either as inputs or as outputs of previous operations, so they may not be required for each multiplication operation.
Thus, the and checks (or equivalently and ) are our main focus because they must be done for every multiplication.
- Compute bound with addition gate (2 rows)
- Compute bound with addition gate (2 rows)
- Range check (4 rows)
- Range check (4 rows)
This costs 12 rows per multiplication. In a subsequent section, we will reduce it to 8 rows.
Due to the limited number of permutable cells per gate, we do not have enough cells for copy constraining and (or and ) to their respective range check gadgets. To address this issue, we must decompose into 2 limbs instead of 3, like so
Thus, is decomposed into two limbs (at most bits) and (at most bits).
Note that must be range checked by a
multi-range-check gadget. To do this the
multi-range-check gadget must
- Store a copy of the limbs and in its witness
- Range check that they are bit each
- Constrain that (this is done with a special mode of the
- Have copy constraints for and as outputs of the
ForeignFieldMulgate and inputs to the
Note that since the foreign field multiplication gate computes from which is already in the witness and and have copy constraints to a
multi-range-check gadget that fully constrains their decomposition from , then the
ForeignFieldMul gate does not need to store an additional copy of and .
Since the and checks must be done for each multiplication it makes sense to integrate them into the foreign field multiplication gate. By doing this we can save 4 rows per multiplication.
Doing this doesn’t require adding a lot more witness data because the operands for the bound computations and are already present in the witness of the multiplication gate. We only need to store the bounds and in permutable witness cells so that they may be copied to multi-range-check gates to check they are each less than .
To constrain , the equation we use is
where is the original value, is the field overflow bit and is the remainder and our desired addition result (e.g. the bound). Rearranging things we get
which is just
Recall from the section “Avoiding borrows” that is often larger than . At first this seems like it could be a problem because in multiplication each operation must be less than . However, this is because the maximum size of the multiplication was quadratic in the size of (we use the CRT, which requires the bound that ). However, for addition the result is much smaller and we do not require the CRT nor the assumption that the operands are smaller than . Thus, we have plenty of space in -bit limbs to perform our addition.
So, the equation we need to constrain is
We can expand the left hand side into the 2 limb format in order to obtain 2 intermediate sums
where and are defined like this
and and are defined like this
Going back to our intermediate sums, the maximum bit length of sum is computed from the maximum bit lengths of and
which means is at most bits long.
Similarly, since and are less than , the max value of is
which means is at most bits long.
Thus, we must constrain
The accumulation of this into parts looks like this.
0 L 2L 3L=t 4L |-------------|-------------|-------------|-------------|-------------| : |------------s01------------:-| 2L + 1 : ↖w01 |------s2-----:-| L + 1 : ↖w2 : |------------x'01-----------| : |------x'2----| : \____________________________/ ≈ z01 \_____________/ ≈ z2
The two parts are computed with
Therefore, there are two carry bits and such that
In this scheme and are witness data, whereas and are formed from a constrained computation of witness data and constraint system public parameter . Note that due to carrying, witness and can be different than the values and computed from the limbs.
Thus, each bound addition requires the following witness data
where is baked into the gate coefficients. The following constraints are needed
Due to the limited number of copyable witness cells per gate, we are currently only performing this optimization for .
The witness data is
The checks are
Checks (1) - (5) assure that is at most bits. Whereas checks (10) - (11) assure that is at most bits. Altogether they are comprise a single
multi-range-check of and . However, as noted above, we do not have enough copyable cells to output and to the
multi-range-check gadget. Therefore, we adopt a strategy where the 2 limbs and are output to the
multi-range-check gadget where the decomposition of and into is constrained and then and are range checked.
Although and are not range checked directly, this is safe because, as shown in the “Bound checks” section, range-checking that also constrains that . Therefore, the updated checks are
Note that we don’t need to range-check is at most bits because it is already implicitly constrained by the
multi-range-check gadget constraining that and are each at most bits and that . Furthermore, since constraining the decomposition is already part of the
multi-range-check gadget, we do not need to do it here also.
To simplify things further, we can combine some of these checks. Recall our checked computations for the intermediate sums
where and . These do not need to be separate constraints, but are instead part of existing ones.
Checks (10) and (11) can be combined into a single constraint . Similarly, checks (3) - (5) and (8) can be combined into with and further expanded. The reduced constraints are
Finally, there is one more optimization that we will exploit. This optimization relies on the observation that for bound addition the second carry bit is always zero. This this may be obscure, so we will prove it by contradiction. To simplify our work we rename some variables by letting and . Thus, being non-zero corresponds to a carry in .
Proof: To get a carry in the highest limbs during bound addition, we need
where is the maximum possible size of (before it overflows) and is the overflow bit from the addition of the least significant limbs and . This means
We cannot allow to overflow the foreign field, so we also have
Since we have
Notice that . Now we have
However, this is a contradiction with the definition of our negated foreign field modulus limb .
We have proven that is always zero, so that allows use to simplify our constraints. We now have