Summary
This RFC proposes adding Bitswap to our libp2p networking stack in order to address issues related to our current gossip network pub/sub layer.
Motivation
Mina has very large messages that are broadcast over the gossip net pub/sub layer. This incurs a high bandwidth cost due to the way our pub/sub rebroadcast cycles work to consistently propagate messages throughout the network. For example, we observe blocks on mainnet as large as ~2mb. That is the size of just a single block, and each block broadcast message has a multiplicative cost on bandwidth as it is rebroadcast throughout the network. This bandwidth cost also translates into CPU cost, since incoming messages must be hashed and checked against the de-duplication cache before they are processed. We currently observe behaviors where the libp2p helper process can be pegged at 100% CPU on certain hardware setups when there is high pub/sub throughput on the network. Since gossip pub/sub is used not only for blocks, but also for transactions and snark work, broadcasting all of these simultaneously compounds the issue.
Implementing Bitswap in our libp2p layer will address this issue by allowing us to immediately reduce our pub/sub message size, while making the larger data referenced by pub/sub messages available upon request. It provides a mechanism for breaking up large data into chunks that can be distributed throughout the network (streamed back from multiple peers), and a system for finding peers on the network who are able to serve that data.
Detailed design
Bitswap is a module provided by libp2p that enables distributed data synchronization over a p2p network, somewhat comparable to how BitTorrent works. It works by splitting up data into chunks called blocks (we will explicitly refer to these as "Bitswap blocks" to disambiguate them from "blockchain blocks"), which are structured into a DAG with a single root. When a node on the network wants to download some data, it asks its peers to see which (if any) have the root Bitswap block corresponding to that data. If none of the peers have the data, it falls back to querying the gossip network's DHT to find a suitable node that can serve the data.
In this design, we will lay out an architecture to support Bitswap in our Mina implementation, along with a strategy for migrating Mina blocks into Bitswap to reduce current gossip pub/sub pressure. We limit the scope of migrating Mina data to Bitswap only to blocks for the context of this RFC, but in the future, we will also investigate moving snark work, transactions, and ledger data into Bitswap. Snark work and transactions will likely be modeled similarly to Mina blocks with respect to Bitswap, but ledger data will require some special thought, since its Bitswap block representation will have overlapping Bitswap blocks across different ledgers.
Bitswap Block Format
Bitswap blocks are chunks of arbitrary binary data which are content addressed by IPFS CIDs. There is no pre-defined maximum size of each Bitswap block, but IPFS uses 256kb, and the maximum recommended size of a Bitswap block is 1mb. Realistically, we want Bitswap blocks to be as small as possible, so we should start at 256kb for our maximum size, but keep the size of Bitswap blocks as a parameter we can tune so that we can optimize for block size vs block count.
While the Bitswap specification does not care about what data is stored in each block, we do require each block have a commonly-defined format:
- `[2 bytes]` count of links `n`
- `[n * 32 bytes]` links (each link is a 256-bit hash)
- `[up to (maxBlockSize - 2 - 32 * n) bytes]` data
Hence, a data blob is converted to a tree of blocks. We advertise the "root" block of the tree as the initial block to download for each resource we store in Bitswap, and the libp2p helper process will automatically explore all the child blocks referenced throughout the tree. To reconstruct the full binary blob from this tree, a breadth-first search (BFS) traversal should be used. BFS is preferable to DFS (the other traversal order) because it allows the blob to be loaded lazily, each time following links from blocks we already have, starting at the root block (in contrast to the order induced by DFS, where one has to descend to the deepest level before emitting a chunk of data).
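As an illustration, here is a minimal Go sketch of this BFS reconstruction. It assumes an in-memory block store keyed by hash and the block layout described above; the function names and the little-endian link count are assumptions made for the example, not part of the spec.

```go
package bitswap

import "encoding/binary"

// decodeBlock splits a raw Bitswap block into its links and data payload,
// following the layout above: [2 bytes] link count | [n * 32 bytes] links | data.
// The byte order of the link count is an assumption of this sketch.
func decodeBlock(raw []byte) (links [][32]byte, data []byte) {
	n := int(binary.LittleEndian.Uint16(raw[:2]))
	links = make([][32]byte, n)
	for i := range links {
		copy(links[i][:], raw[2+32*i:2+32*(i+1)])
	}
	return links, raw[2+32*n:]
}

// assembleBlob reconstructs the original blob by BFS from the root block:
// data segments are emitted in BFS order, which is what allows a node to
// stream the blob lazily as child blocks arrive.
func assembleBlob(store map[[32]byte][]byte, root [32]byte) []byte {
	var blob []byte
	queue := [][32]byte{root}
	for len(queue) > 0 {
		links, data := decodeBlock(store[queue[0]])
		queue = append(queue[1:], links...)
		blob = append(blob, data...)
	}
	return blob
}
```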
For links, a 256-bit version of the Blake2b hash is used. The packing algorithm can be implemented so that no padding is used in blocks, with at most `n = (blobSize - 32) / (maxBlockSize - 34)` blocks generated, of which `n - 1` blocks are exactly `maxBlockSize` bytes.
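To make the packing concrete, here is a simplified Go sketch that builds a two-level tree (a root block linking to leaf blocks) using Blake2b-256 from `golang.org/x/crypto/blake2b`. The two-level restriction, the greedy chunking, and the little-endian link count are simplifications of this sketch; the real packer generalizes to deeper trees and achieves the no-padding bound above.

```go
package bitswap

import (
	"encoding/binary"

	"golang.org/x/crypto/blake2b"
)

const maxBlockSize = 256 * 1024 // starting point proposed above; kept tunable

// encodeBlock serializes links and data into the block layout above.
func encodeBlock(links [][32]byte, data []byte) []byte {
	buf := make([]byte, 2+32*len(links)+len(data))
	binary.LittleEndian.PutUint16(buf, uint16(len(links)))
	for i, l := range links {
		copy(buf[2+32*i:], l[:])
	}
	copy(buf[2+32*len(links):], data)
	return buf
}

// packBlob splits a blob (small enough to fit a two-level tree) into a root
// block plus leaf blocks. Leaves carry consecutive chunks of the blob; the
// root carries the links followed by the leading chunk, so BFS emission
// order reproduces the blob exactly.
func packBlob(blob []byte) (store map[[32]byte][]byte, root [32]byte) {
	store = make(map[[32]byte][]byte)
	leafCap := maxBlockSize - 2 // leaves have a zero link count
	// Smallest number of leaves such that everything fits under one root.
	n := 0
	for maxBlockSize-2-32*n+n*leafCap < len(blob) {
		n++
	}
	rootCap := maxBlockSize - 2 - 32*n // room left in the root after n links
	cut := min(rootCap, len(blob))
	rootData, rest := blob[:cut], blob[cut:]
	var links [][32]byte
	for len(rest) > 0 {
		c := min(leafCap, len(rest))
		leaf := encodeBlock(nil, rest[:c])
		id := blake2b.Sum256(leaf)
		store[id] = leaf
		links = append(links, id)
		rest = rest[c:]
	}
	blk := encodeBlock(links, rootData)
	root = blake2b.Sum256(blk)
	store[root] = blk
	return store, root
}
```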
Bitswap Block Database
There exist some key constraints in choosing a good solution for the Bitswap block db. Importantly, the block database needs to support concurrent readers and a writer at the same time. Additionally, the database should be optimized for existence checks and reads, as these are the operations we will perform most frequently against it. Some consistency guarantees (the ability to rely on a happens-before relation between write and read events) are also required. The database should ideally be persistent, so that we can quickly reload the data in the event of a node crash (we want to avoid increasing bootstrap time for a node, as that keeps stake offline after crashes).
Given these constraints, LMDB is a good choice for the Bitswap block storage. It meets all of the above criteria, including persistence. As a bonus, it's light on extra RAM usage.
Database Schema
| Key | Value | Key bytes |
|---|---|---|
| `status/<root_cid>` | integer: 0..2 | `<1><32-byte blake2b digest>` |
| `block/<cid>` | bitswap block bytes | `<0><32-byte blake2b digest>` |
Status is an integer taking one of the values:
- `0` (not all descendant blocks present)
- `1` (all descendant blocks present)
- `2` (delete in process)
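For concreteness, here is a hedged Go sketch of the key encoding and the hot-path existence check, using the `github.com/bmatsuo/lmdb-go` binding as one possible choice (the binding and the helper names are assumptions, not something this RFC mandates):

```go
package bitswap

import "github.com/bmatsuo/lmdb-go/lmdb"

// Key prefixes from the schema above: block/<cid> keys start with byte 0x00,
// status/<root_cid> keys with byte 0x01, followed by the 32-byte blake2b digest.
func blockKey(cid [32]byte) []byte  { return append([]byte{0}, cid[:]...) }
func statusKey(cid [32]byte) []byte { return append([]byte{1}, cid[:]...) }

// hasBlock is the existence check the node performs most frequently.
// Read-only view transactions allow concurrent readers alongside one writer.
func hasBlock(env *lmdb.Env, dbi lmdb.DBI, cid [32]byte) (found bool, err error) {
	err = env.View(func(txn *lmdb.Txn) error {
		_, getErr := txn.Get(dbi, blockKey(cid))
		if lmdb.IsNotFound(getErr) {
			return nil // key absent; not an error for an existence check
		}
		found = getErr == nil
		return getErr
	})
	return found, err
}
```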
Additional data to store in the Daemon db
In addition to data already stored in the DB controlled by the Daemon, we would need to store:
- `headerHashToRootCid`: relation between header and associated root cid (if known)
- `rootCidsToDelete`: list of root cids marked for deletion (i.e. root cids related to blocks which were completely removed from frontier)
- `recentlyCreatedBlocks`: list of recently created blocks for which we didn't receive a confirming upcall
Invariants
The following invariants are maintained for the block storage:
- For each `status/{rcid}` in the DB, there is an entry for `{rcid}` in `headerHashToRootCid`
- For each cid for which `block/{cid}` exists in the DB, one of the following holds:
  - There exists `status/{cid}`
  - There exists `cid'` such that `cid` is in the link list of the bitswap block at `block/{cid'}`
- `status/{rc}` can progress strictly in the order: `null -> 0 -> 1 -> 2 -> null`
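The last invariant can be captured in a few lines of Go; `-1` stands in for the absent (`null`) state, and the helper name is hypothetical:

```go
// validStatusTransition enforces the progression null -> 0 -> 1 -> 2 -> null
// for a status/{rc} entry, where -1 encodes "no entry present".
func validStatusTransition(from, to int) bool {
	next := map[int]int{-1: 0, 0: 1, 1: 2, 2: -1}
	n, ok := next[from]
	return ok && n == to
}
```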
Initialization
Daemon initialization:
- Remove keys `k` from `rootCidsToDelete` for which `status/{k}` is absent
- Remove blocks `b` from `recentlyCreatedBlocks` for which `status/{b.root_cid}` is present and is not equal to `0`
- Start Helper
- Send bulk delete request to Helper for keys from `rootCidsToDelete`
- Send bulk download request to Helper for blocks that are known to the frontier but which do not have `status/{b.root_cid} == 1`
- Send add resource request for each block in `recentlyCreatedBlocks`
Helper has no initialization at all.
Helper to Daemon interface
Helper receives requests of the following kinds (sketched in Go after this list):
- Delete resources with root cid in list `[cid1, cid2, ...]`
  - Sends a single upcall upon deletion of all of these resources
- Download resources with root cid in list `[cid1, cid2, ...]`
  - Sends an upcall upon full retrieval of each resource (one per resource)
- Add resource with root cid `{root_cid}` and bitswap blocks `[block1, ..., blockN]` (no checks for hashes are made)
  - Sends an upcall confirming the resource was successfully added
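Sketched as Go types for illustration (the actual boundary is the Daemon-to-Helper IPC protocol, and these names are hypothetical):

```go
package bitswap

// Requests the Daemon sends to the Helper.
type DeleteResourcesRequest struct{ RootCids [][32]byte }   // answered by one upcall for the whole batch
type DownloadResourcesRequest struct{ RootCids [][32]byte } // answered by one upcall per resource
type AddResourceRequest struct {
	RootCid [32]byte
	Blocks  [][]byte // raw bitswap blocks; hashes are not re-checked by the Helper
}

// Upcalls the Helper sends back to the Daemon.
type ResourcesDeletedUpcall struct{ RootCids [][32]byte }
type ResourceDownloadedUpcall struct{ RootCid [32]byte }
type ResourceAddedUpcall struct{ RootCid [32]byte }
```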
Synchronization
Block creation:
- Block is added to `recentlyCreatedBlocks`
- Add resource request is sent to Helper
- Upon add resource confirmation upcall, block is removed from `recentlyCreatedBlocks`
Frontier is moved forward and some old blocks get removed:
- Remove record for block from `headerHashToRootCid`
- Add block to `rootCidsToDelete`
- Delete resource request is sent to Helper
- Upon delete resource confirmation upcall, block is removed from `rootCidsToDelete`
A gossip message for a new header is received and the Daemon decides that the block body corresponding to the header has to be fetched:
- Store record in `headerHashToRootCid` for the header
- Send download request to Helper
- Upon receiving the download upcall, block and header are added to the frontier
  - In case the downloaded block arrived late and is no longer of interest, launch the deletion flow as described above
(We assume root cid is received along with the header in the gossip)
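A condensed sketch of how this last flow wires together on the Daemon side (the real daemon is OCaml; all names here are hypothetical, and the Helper interface is abstracted to the two calls used):

```go
package bitswap

// helperRPC abstracts the Helper requests used by this flow; hypothetical.
type helperRPC interface {
	Download(rootCids ...[32]byte)
	Delete(rootCids ...[32]byte)
}

type daemonState struct {
	helper              helperRPC
	headerHashToRootCid map[[32]byte][32]byte
	rootCidsToDelete    map[[32]byte]bool
	wantedByFrontier    func(rootCid [32]byte) bool // is the body still of interest?
}

// onNewHeader: a gossiped header was accepted and its body must be fetched.
func (d *daemonState) onNewHeader(headerHash, rootCid [32]byte) {
	d.headerHashToRootCid[headerHash] = rootCid
	d.helper.Download(rootCid)
}

// onDownloaded: upcall fired once the whole resource is locally available.
func (d *daemonState) onDownloaded(rootCid [32]byte) {
	if d.wantedByFrontier(rootCid) {
		// add the block and its header to the frontier here
		return
	}
	// The body arrived too late: launch the deletion flow described above.
	d.rootCidsToDelete[rootCid] = true
	d.helper.Delete(rootCid)
}
```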
Migrating Mina Blocks to Bitswap
To migrate Mina block propagation to Bitswap, we will separate a Mina block into
2 portions: a block header, and a block body. Most of the data in a Mina block
is stored inside of the staged_ledger_diff. The common data in every Mina
block is ~8.06kb (including the protocol_state_proof), so using everything
except for the staged_ledger_diff as the block header seems natural. The
staged_ledger_diff would then act as the block body for Mina blocks, and would
be downloaded/made available via Bitswap rather than broadcast over pub/sub.
Once this change lands, block broadcast messages will contain only the block
header and a root CID for the `staged_ledger_diff`. When a node receives a
new block header, the node will first verify the `protocol_state_proof` (all
public information that needs to be fed in for proof verification will be
available in the block header). Once the proof is checked, a node would then
download the `staged_ledger_diff` via Bitswap. Once that is downloaded, the node
would follow the same pattern as today for generating a breadcrumb, expanding
the `staged_ledger` from the parent breadcrumb and the new `staged_ledger_diff`,
after which the Mina block will be fully validated. At this point, the
breadcrumb is added to the frontier.
In summation, the proposed changes in order to move Mina blocks into Bitswap are:
- Define a separate block header (block w/o `staged_ledger_diff`, with new field `staged_ledger_diff_root_cid`).
- Verify the `staged_ledger_diff_root_cid` relation to the header.
- Rebroadcast block headers after proofs are checked, but before `staged_ledger_diff`s are verified and the breadcrumb is added to the frontier.
- Punish block producer public keys if they submit an invalid `staged_ledger_diff`, by ignoring all future block headers from that producer within the epoch (do not punish senders, as they may not have checked the `staged_ledger_diff` yet).
Punishing is done only within the epoch, as otherwise punish lists would accumulate without bound, and real adversaries would nevertheless find a way around it by moving stake to another address for the next epoch.
Verifying relation between root CID and header
One large difference from before is that nodes will rebroadcast the block header
to other nodes on the network before the staged_ledger_diff is downloaded and
verified, in order to avoid increasing block propagation time on the network
with the new addition of Bitswap. This change brings some unique problems that
we need to solve now, as previously, we wouldn't forward Mina blocks to other
nodes until we knew the block was fully valid.
In the new world, an adversary could broadcast the same block header and
proof while swapping out the `staged_ledger_diff` root Bitswap block CID with
different values to attack the network. An adversary can do this both as a
block producer and as a man-in-the-middle (MITM).
To rule out the MITM attack vector, a signature is carried along with the header. The signature is made over the pair of root CID and header hash with the block producer's secret key. No MITM actor can forge the signature, hence the attack becomes infeasible.
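A sketch of that check in Go, with `crypto/ed25519` standing in for Mina's actual signature scheme and Blake2b compressing the pair; both stand-ins are assumptions made purely for illustration:

```go
package bitswap

import (
	"crypto/ed25519"

	"golang.org/x/crypto/blake2b"
)

// certMessage binds the body root CID to a specific header: the producer
// signs the pair (root CID, header hash), so a MITM cannot swap the CID.
func certMessage(rootCid, headerHash [32]byte) []byte {
	h := blake2b.Sum256(append(rootCid[:], headerHash[:]...))
	return h[:]
}

// verifyBodyCertificate checks the signature carried alongside the header.
// ed25519 is a stand-in here; Mina uses its own signature scheme.
func verifyBodyCertificate(producer ed25519.PublicKey, rootCid, headerHash [32]byte, sig []byte) bool {
	return ed25519.Verify(producer, certMessage(rootCid, headerHash), sig)
}
```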
To mitigate the adversarial block producer case, the following tactic is
employed: if a node ever downloads a `staged_ledger_diff` which does not achieve
the target staged ledger hash after application to the parent staged ledger,
that node will ban the block producer public key associated with the block. This
significantly raises the cost of the attack and makes it in effect pointless.
Shipping plan
Shipping of the feature is two-staged, with the first stage shipped as a soft-fork release and the second stage shipped as part of a hard-fork.
Soft-fork
For the soft-fork stage, here is the anticipated changeset:
- Support for sharing blocks via Bitswap
- New pub/sub topics:
  - `mina/block-body/1.0.0`
  - `mina/tx/1.0.0`
  - `mina/snark-work/1.0.0`
- Engine to rebroadcast blocks from the new topics to the old `coda/consensus-messages/0.0.1` and vice versa
Legacy topic management
Most new nodes will support both old and new topics for broadcast. Nodes are able to filter subscriptions from other nodes based on what they subscribe to, configured using the `WithSubscriptionFilter` option. Utilizing this, nodes that support the new topics can filter out the old topic from nodes that support both topics. By filtering the topics like this, nodes running the new version can broadcast new blocks over both topics while avoiding sending the old message format to other nodes which support the new topic.
In particular, the following method of the filter is to be implemented:

```go
type SubscriptionFilter interface {
	...
	FilterIncomingSubscriptions(peer.ID, []*pb.RPC_SubOpts) ([]*pb.RPC_SubOpts, error)
}
```
On receiving an incoming subscription for the old topic, we check whether the same peer is already subscribed to all three new topics; if so, the subscription is filtered out. Node configuration code shall be implemented accordingly, subscribing to the new topics first.
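A hedged sketch of that filter, assuming some tracker of each peer's current subscriptions (go-libp2p-pubsub provides the `SubscriptionFilter` hook and `pb.RPC_SubOpts`; the tracker itself is hypothetical):

```go
package bitswap

import (
	pb "github.com/libp2p/go-libp2p-pubsub/pb"
	"github.com/libp2p/go-libp2p/core/peer"
)

const oldTopic = "coda/consensus-messages/0.0.1"

var newTopics = []string{"mina/block-body/1.0.0", "mina/tx/1.0.0", "mina/snark-work/1.0.0"}

type minaSubFilter struct {
	// isSubscribed is a hypothetical view of a peer's known subscriptions;
	// the node would have to maintain this from previously seen SubOpts.
	isSubscribed func(p peer.ID, topic string) bool
}

// CanSubscribe accepts both topic generations while the migration is in flight.
func (f *minaSubFilter) CanSubscribe(topic string) bool {
	if topic == oldTopic {
		return true
	}
	for _, t := range newTopics {
		if t == topic {
			return true
		}
	}
	return false
}

// FilterIncomingSubscriptions drops a peer's old-topic subscription once that
// peer is known to subscribe to all three new topics.
func (f *minaSubFilter) FilterIncomingSubscriptions(p peer.ID, subs []*pb.RPC_SubOpts) ([]*pb.RPC_SubOpts, error) {
	onAllNew := true
	for _, t := range newTopics {
		if !f.isSubscribed(p, t) {
			onAllNew = false
			break
		}
	}
	out := make([]*pb.RPC_SubOpts, 0, len(subs))
	for _, s := range subs {
		if onAllNew && s.GetTopicid() == oldTopic {
			continue // peer already lives on the new topics
		}
		out = append(out, s)
	}
	return out, nil
}
```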
Over time, when most of the network participants adopt the newer version, only a few specifically configured nodes (including some seeds) will continue servicing the old topic, while most of the network will live entirely on the new topic.
The Mina node will take two additional arguments:
- `--pubsub-v1`: `rw`, `ro`, or default `none`
- `--pubsub-v2`: default `rw`, `ro`, or `none`
Daemon runs message propagation for the legacy topic as follows:

- If `--pubsub-v2=rw` or `--pubsub-v2=ro`, listen to `mina/tx/1.0.0` and `mina/snark-work/1.0.0`
  - For each valid message, if `--pubsub-v1=rw`, resend it to `coda/consensus-messages/0.0.1`
- If `--pubsub-v2=rw` or `--pubsub-v2=ro`, listen to `mina/block/1.0.0` and add headers to the frontier
  - For each new block in the frontier, publish it to `coda/consensus-messages/0.0.1` if `--pubsub-v1=rw`
- If `--pubsub-v1=rw` or `--pubsub-v1=ro`, listen to `coda/consensus-messages/0.0.1`
  - For each valid block message, if `--pubsub-v2=rw`, resend the corresponding header to `mina/block/1.0.0`
  - For each valid transaction message, if `--pubsub-v2=rw`, resend it to `mina/tx/1.0.0`
  - For each valid snark work message, if `--pubsub-v2=rw`, resend it to `mina/snark-work/1.0.0`
Releasing of Bitswap will happen in two stages:
- Release with default parameters `--pubsub-v1=rw` and `--pubsub-v2=rw`
- After most block producers adopt the new version, change the default to `--pubsub-v1=none`
- Launch a network of "pubsub relays" which will service both `--pubsub-v1=rw` and `--pubsub-v2=rw` until the next hardfork
New block topic
New block topic presents a new message type comprising:
- Block header (as defined above)
- Block body certificate
The block body certificate is in turn a data structure with the following fields:
- 32-byte block body root
- Signature
Block body root is the hash of the root bitswap block containing the bytes of the block body. Signature is a digital signature over the pair (block body root hash, block header hash), made with the secret key corresponding to the block producer's key specified in the block's header.
Frontier absorbs block headers along with the corresponding block body roots and keeps them as a 1:1 relation. Whenever the same block header is received with a different block body root (and a valid block producer signature), the block producer is banned from producing a block for the given slot. Additionally, if a bitswap block referenced by the block body root is broken, the block producer is banned from producing a block for the given slot. In case of a ban for a slot, all of the corresponding data is cleaned out; the ban is local.
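A sketch of that frontier-side bookkeeping (names hypothetical; the producer signature over the pair is assumed to have been verified already):

```go
package bitswap

// slotKey identifies a producer/slot pair; bans are local and per-slot.
type slotKey struct {
	producer string // serialized producer public key
	slot     uint64
}

type bodyRootTracker struct {
	rootByHeader map[[32]byte][32]byte // header hash -> body root, kept 1:1
	bannedSlots  map[slotKey]bool
}

// observe records a (header, body root) pairing whose producer signature has
// already been checked. A second, different root for the same header proves
// equivocation: the producer is banned for the slot and the data cleaned out.
func (t *bodyRootTracker) observe(k slotKey, headerHash, bodyRoot [32]byte) {
	if prev, ok := t.rootByHeader[headerHash]; ok && prev != bodyRoot {
		t.bannedSlots[k] = true
		delete(t.rootByHeader, headerHash) // the ban stays local to this node
		return
	}
	t.rootByHeader[headerHash] = bodyRoot
}
```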
Hard-fork
In the next hardfork, the old topic becomes entirely abandoned.
Also, consider migrating to hash-based message ids as described in issue #9876.
Drawbacks
This adds significant complexity to how the protocol gossips information. The control flow for validating blocks is more complex than before, and there is new state to synchronize between the processes in the architecture. It also adds new delays to when the full block data will be available to each node (but the tradeoff is that we are able to more consistently gossip blocks around the network within the same slot in which those blocks are produced).
Rationale and alternatives
- It would be possible to download larger data from peers via RPC and still reduce the pub/sub message size, though there are some issues with this approach:
  - it is not guaranteed that any of your peers will have the data you need, in which case you need some alternative mechanism to discover who does have it
  - it puts a lot of bandwidth pressure on individual peers rather than spreading the load between multiple peers (which helps with both bandwidth pressure and data redundancy for increased availability)
- Alternatives to using LMDB as the Bitswap cache:
  - use SQLite
    - even though all we need is a key/value db, not a relational db, SQLite is portable and performant
    - would require us to enable both write-ahead logging and memory-mapped I/O in order to use it the way we would like to
  - use the raw Linux filesystem (from @georgeee)
    - would use a lot of inodes and file descriptors if we do not build a mechanism that stores multiple key-value pairs in shared files, which could prove tricky to implement
    - would need to solve concurrency problems related to concurrent readers/writers, which could be tricky to get correct and have confidence in
Verifying root CID to header relation
There is another option to verify the relation: include a commitment in the snark not only to the target staged ledger hash, but also to the root Bitswap block CID of the `staged_ledger_diff` that brings us to that state. The two options compare as follows:
| | Snark commitment | Signature |
|---|---|---|
| Snark proving time | Increased | Unchanged |
| Computational complexity | High | Low |
| Soft-fork | Incompatible | Compatible |
For the Snark commitment option, an adversary needs to generate a snark proof for each `staged_ledger_diff` they want to broadcast to the network, hence the high computational complexity of an attack, which is a desirable property. However, the Signature option is preferred because it is soft-fork-compatible and doesn't increase the complexity of the circuit.
Appendix
For reference on the above computation of ~8.06kb for a block without a staged ledger diff, here is a snippet of OCaml code that can be run in `dune utop src/lib/mina_transition`:

```ocaml
let open Core in
let open Mina_transition in
let precomputed_block =
  External_transition.Precomputed_block.t_of_sexp
  @@ Sexp.of_string External_transition_sample_precomputed_block.sample_block_sexp
in
let small_precomputed_block =
  { precomputed_block with staged_ledger_diff = Staged_ledger_diff.empty_diff }
in
let conv (t : External_transition.Precomputed_block.t) =
  External_transition.create
    ~protocol_state:t.protocol_state
    ~protocol_state_proof:t.protocol_state_proof
    ~staged_ledger_diff:t.staged_ledger_diff
    ~delta_transition_chain_proof:t.delta_transition_chain_proof
    ~validation_callback:(Mina_net2.Validation_callback.create_without_expiration ())
    ()
in
Protocol_version.set_current (Protocol_version.create_exn ~major:0 ~minor:0 ~patch:0) ;
External_transition.Stable.Latest.bin_size_t (conv small_precomputed_block) ;;
```