RFC 0042: Node Status Collection
Original source: 0042-node-status-collection.md
Summary
This RFC proposes a system that collects statistics related to node health, to give a clearer picture of the health of network participants. The node status collection system would gather these stats from the node and send them to a URL (by default, the o1labs Google Cloud backend). Stats sent to the o1labs backend would be persisted and used for debugging and monitoring.
Motivation
In the current version of the network, we have seen some performance and scaling issues. By collecting more stats from the network and analyzing them, we hope to improve the speed of the Mina protocol.
Protocol between the node status collection service and o1labs backend
- Node submits a payload to the backend using POST /submit every 5 slots
- Server saves the request
- Server replies with 200 with {"status": "OK"} if the payload is valid
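The node side of this exchange can be sketched as follows. This is a minimal illustration, not the daemon's implementation: the helper names `is_report_due` and `build_submit_request` are hypothetical, and the sketch assumes the node tracks the slot at which it last reported. It only builds the request; actually sending it is left to the caller.

```python
import json
import urllib.request
from typing import Optional

SUBMIT_INTERVAL_SLOTS = 5  # from the RFC: one report every 5 slots


def is_report_due(current_slot: int, last_reported_slot: Optional[int]) -> bool:
    """Return True when no report was sent yet or 5 slots have passed since the last one."""
    if last_reported_slot is None:
        return True
    return current_slot - last_reported_slot >= SUBMIT_INTERVAL_SLOTS


def build_submit_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /submit request carrying the JSON payload."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/submit",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A caller would check `is_report_due` once per slot and, when it returns True, pass the built request to `urllib.request.urlopen` and expect a 200 reply with `{"status": "OK"}`.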
Interface design
The node status collection system would make the following POST request to the
designated URL every 5 slots.
POST /submit to submit a JSON payload containing the following data:
{ "data":
{ "peer_id": "<base64 encoding of the libp2p peer id>"
, "ip_address": "<hash of ip address of the submitter>"
, "mina_version": "<mina version>"
, "git_hash": "<git hash of the mina node>"
, "timestamp": "<current time using RFC-3339 representation>"
, "libp2p_input_bandwidth": "<input bandwidth for libp2p helper system>"
, "libp2p_output_bandwidth": "<output bandwidth for libp2p helper system>"
, "libp2p_cpu_usage": "<cpu usage for libp2p helper system>"
, "pubsub_msg_received": { "new_state": { "count": "<integer, e.g. 10>"
, "bandwidth": "<integer, bandwidth in bytes>" }
, "transaction_pool_diff": { "count": "<integer, e.g. 11>"
, "bandwidth": "<integer, bandwidth in bytes>" }
, "snark_pool_diff": { "count": "<integer, e.g. 9>"
, "bandwidth": "<integer, bandwidth in bytes>" }
}
, "pubsub_msg_broadcasted": { "new_state": { "count": "<integer, e.g. 3>"
, "bandwidth": "<integer, bandwidth in bytes>" }
, "transaction_pool_diff": { "count": "<integer, e.g. 6>"
, "bandwidth": "<integer, bandwidth in bytes>" }
, "snark_pool_diff": { "count": "<integer, e.g. 1>"
, "bandwidth": "<integer, bandwidth in bytes>" }
}
, "rpc_received": { "get_transition_chain_proof": { "count": "<integer>"
, "bandwidth": "<integer, bandwidth in bytes>" }
, ...
}
, "rpc_sent": { "get_transition_chain_proof": { "count": "<integer>"
, "bandwidth": "<integer, bandwidth in bytes>" }
, ...
}
, "block_height_at_best_tip": "<integer, e.g. 43>"
, "max_observed_block_height": "<integer, e.g. 44>"
, "max_observed_unvalidated_block_height": "<integer, e.g. 44>"
, "slot_and_epoch_number_at_best_tip": { "epoch": "<integer, e.g. 8>"
, "slot": "<integer, e.g. 111>"
}
, "sync_status": "<Synced | Catchup | Bootstrap | Offline | Listening>"
, "catchup_job_states": { "To_initial_validate": "<integer, e.g. 10>"
, "Finished": "<integer, e.g. 11>"
, "To_verify": "<integer, e.g. 29>"
, "To_download": "<integer, e.g. 4>"
}
, "uptime_of_node": "<integer, duration of uptime in secs, e.g. 45678>"
, "blocks_received": [ { "hash": "<base64 encoding of block hash>"
, "sender": {"ip": "<ip address>"
,"peer_id": "<base64 encoding of peer id>"}
, "received_at": "<timestamp at which block is received>"
, "is_valid": "<bool>"
, "reason_for_rejection": "<null | reason for why it was rejected>"
}
, ...
]
, "peer_count": "<integer, number of peers in libp2p layer, e.g. 152>"
}
}
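A receiver or a test harness might sanity-check a payload of this shape before using it. The sketch below is illustrative only: `check_payload` and `rfc3339_timestamp` are hypothetical helpers, and the set of fields treated as required is an assumption, since the RFC does not say which fields are mandatory. Only the `sync_status` values come from the payload description above.

```python
from datetime import datetime, timezone

# Values listed for "sync_status" in the payload description.
SYNC_STATUSES = ("Synced", "Catchup", "Bootstrap", "Offline", "Listening")

# Hypothetical subset of fields treated as mandatory for this sketch;
# the RFC does not say which fields are required.
REQUIRED_FIELDS = ("peer_id", "ip_address", "mina_version", "git_hash",
                   "timestamp", "sync_status")


def rfc3339_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as an RFC-3339 string in UTC."""
    return moment.astimezone(timezone.utc).isoformat()


def check_payload(payload: dict) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    data = payload.get("data") if isinstance(payload, dict) else None
    if not isinstance(data, dict):
        return ['top-level "data" object is missing']
    problems = [f'missing field "{name}"' for name in REQUIRED_FIELDS
                if name not in data]
    status = data.get("sync_status")
    if status is not None and status not in SYNC_STATUSES:
        problems.append(f'unknown sync_status "{status}"')
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a submission in a single error message.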
If the post request is made against o1labs' backend service, then there would
be the following possible responses:
- 400 Bad Request with {"error": "<description of the error>"} when input is malformed
- 500 Internal Server Error with {"error": "<description of the error>"} for any server error
- 200 with {"status": "OK"}
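The mapping from outcome to response can be sketched as a single function. This is not the o1labs backend code: `handle_submit` is a hypothetical name, the `save` callback stands in for whatever persistence layer the backend uses, and "malformed" is interpreted here, as an assumption, to mean invalid JSON or a missing top-level "data" object.

```python
import json
from typing import Callable, Tuple


def handle_submit(raw_body: bytes, save: Callable[[dict], None]) -> Tuple[int, dict]:
    """Return (HTTP status code, JSON response body) for one POST /submit request."""
    try:
        payload = json.loads(raw_body)
    except ValueError as exc:  # covers invalid JSON and undecodable bytes
        return 400, {"error": f"malformed JSON: {exc}"}
    if not isinstance(payload, dict) or not isinstance(payload.get("data"), dict):
        return 400, {"error": 'missing top-level "data" object'}
    try:
        save(payload)  # hypothetical persistence hook
    except Exception as exc:  # any failure while saving is reported as a server error
        return 500, {"error": str(exc)}
    return 200, {"status": "OK"}
```

Keeping the handler independent of any web framework means the three response cases can be exercised directly in unit tests.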
daemon.json config to enable/disable the node status collection service
This service would be disabled by default. It can be turned on by setting the
statusReport field in the daemon.json file to true (the default value is
false).
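The default-off behaviour can be illustrated with a short sketch. It assumes, without confirmation from the RFC, that statusReport is a top-level boolean key in daemon.json; the function name `status_report_enabled` is hypothetical.

```python
import json


def status_report_enabled(daemon_json_text: str) -> bool:
    """Return True only when statusReport is explicitly set to true.

    Assumption: statusReport is a top-level key of daemon.json.
    A missing key means the service stays disabled.
    """
    config = json.loads(daemon_json_text)
    return config.get("statusReport", False) is True
```

Under this reading, a config of `{"statusReport": true}` enables the service, while `{}` or `{"statusReport": false}` leaves it off.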
Cloud storage
Cloud storage for the o1labs backend has the following structure:
<hash_of_ip_address>/<peer_id>/<created_at>.json
- <hash_of_ip_address> is the hash of the IP address of the submitter
- <peer_id> is the base64 encoding of the libp2p peer id of the submitter
The JSON file contains the content of the request if the request is valid.
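How an object key could be derived from a valid request is sketched below. This is an assumption-laden illustration: a plain dict stands in for the cloud bucket, the helper names are hypothetical, the key components are taken from the "ip_address" and "peer_id" fields of the payload, and the format of created_at (an RFC-3339 string in the usage example) is not specified by the RFC.

```python
import json


def storage_key(hash_of_ip_address: str, peer_id: str, created_at: str) -> str:
    """Build the object key <hash_of_ip_address>/<peer_id>/<created_at>.json."""
    return f"{hash_of_ip_address}/{peer_id}/{created_at}.json"


def store_request(bucket: dict, payload: dict, created_at: str) -> str:
    """Store the request content under its key; `bucket` is a stand-in for cloud storage."""
    data = payload["data"]
    # The payload's "ip_address" field already holds the hash of the submitter's IP.
    key = storage_key(data["ip_address"], data["peer_id"], created_at)
    bucket[key] = json.dumps(payload)
    return key
```

For example, a payload with `"ip_address": "abc123"` and `"peer_id": "peerXYZ"` stored at `2021-06-01T12:00:00Z` would land under `abc123/peerXYZ/2021-06-01T12:00:00Z.json`.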