KIP 227: Candidate and Validator Evaluation
| Author | Joseph |
|---|---|
| Discussions-To | https://github.com/kaiachain/kips/issues/84 |
| Status | Draft |
| Type | Core |
| Created | 2025-01-07 |
Simple Summary
This proposal presents VRank, a framework for quantitatively assessing the performance and stability of candidates and validators in the Kaia Chain network.
Abstract
As Kaia Chain transitions to a permissionless network structure, it is critical to hold validators more accountable. They play an important role in ensuring that the network runs smoothly, securely, and without problems. This KIP introduces VRank, a Validator Reputation Evaluation Framework that quantitatively assesses the performance and stability of both candidates and validators. VRank aims to ensure that nodes involved in the consensus mechanism are trustworthy and capable of meeting the security requirements of the permissionless Kaia Chain network.
Introduction
Kaia Chain plans to transition from a permissioned network to a permissionless one. In a permissionless network, anyone can become a validator without prior approval. This makes the network more decentralized, secure, and open to everyone, and it fits Kaia Chain’s goal of making the blockchain ecosystem more open and robust. Please refer to KGP-4: Permissionless Kaia Chain for details on the switch to a permissionless network.
Motivation
In decentralized networks employing blockchain technology, the reliability and performance of validator nodes are crucial for maintaining network stability, security, and efficiency. Validator nodes propose and validate new blocks, ensuring ledger integrity and establishing trust among participants through consistent operation.
However, not all validator nodes operate at optimal efficiency. Some nodes may experience frequent failures or delays, while others may behave maliciously, either deliberately or due to external pressures. These problems can lead to network delays, a risk of forks, and higher susceptibility to attacks such as double-spending and censorship.
Current systems for evaluating validator performance may fail to penalize persistent underperformance or malicious behavior effectively, and they do not consistently incentivize good performance. A thorough evaluation system is essential to assess the reliability of validator nodes precisely, discourage inadequate performance, and improve the overall integrity of the network.
This proposal introduces an innovative evaluation framework for candidates and validators, highlighting measurable performance metrics. Our objective is to create a more resilient and fair framework for assessing node performance by defining metrics such as the Proposal Failure Score (PFS) and Message Transmission Failure Scores (Total and Consecutive). This framework aims to identify underperforming or malicious nodes, which helps preserve high standards among validators.
The framework promotes consistent uptime and reliability. Validators have an incentive to maintain the stability and responsiveness of their nodes, thereby maintaining the performance of the network.
Specification
Parameters
| Constant | Value/Definition |
|---|---|
| FORK_BLOCK | TBD |
| CANDIDATE_READY_TIMEOUT | Protocol parameter (milliseconds). Default = 200ms. |
| BLOCK_TIME | 1 second per block |
| EPOCH_LENGTH | 86,400 blocks (approximately 1 day, assuming 1-second block time) |
| MAX_BYZANTINE_NODES (F) | Calculated as F = (n - 1) // 3, where n is the number of validators |
| DOWNTIME_THRESHOLD | 0.5% per day (equivalent to fewer than 432 blocks missed per day) |
| PFS_FAILURE_THRESHOLD | Threshold for the number of block proposal failures per day |
| CONSECUTIVE_FAILURE_LENGTH_10_CF | 10 consecutive failures define a 10-CF |
| CONSECUTIVE_FAILURE_LENGTH_15_CF | 15 consecutive failures define a 15-CF |
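The derived values in the table above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function names and the basis-point representation of DOWNTIME_THRESHOLD are assumptions made for illustration and are not part of the specification.

```python
EPOCH_LENGTH = 86_400          # blocks per epoch (about 1 day at 1 block per second)
DOWNTIME_THRESHOLD_BPS = 50    # 0.5% expressed in basis points (assumed representation)

def max_byzantine_nodes(n: int) -> int:
    """F = (n - 1) // 3 for a validator set of size n."""
    return (n - 1) // 3

def downtime_block_limit(epoch_length: int = EPOCH_LENGTH) -> int:
    """Number of blocks that corresponds to the downtime threshold over one epoch."""
    return epoch_length * DOWNTIME_THRESHOLD_BPS // 10_000

print(max_byzantine_nodes(10))   # 3
print(downtime_block_limit())    # 432
```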
Definition of a Stable Node
VRank defines a node qualified to perform the role of a validator as follows.
Downtime: A stable node must have downtime due to network or node issues below the DOWNTIME_THRESHOLD (equivalent to fewer than 432 blocks missed).
Block Proposal Participation: A stable node must consistently participate in block proposals with no more than PFS_FAILURE_THRESHOLD block proposal failures per day for any reason.
Node Models Definition
VRank categorizes nodes into four models to evaluate their performance and stability:
| Node | Performance | Impact |
|---|---|---|
| Uptime > 99.5%, no network issues | Excellent | Contributes to network stability |
| Uptime about 99.5%, temporarily unstable | Good | May delay block time |
| Uptime < 99.5% | Not good | May fail to propose a block |
| Halts continuously regardless of uptime | Bad | May affect consensus if the validator set consists of such nodes |
| Uptime > 99.5%, tries to destabilize the network | N/A | Threatens network integrity |
- Node A: Stable Node
  - Characteristics: Capable of performing validation duties with optimal performance and stability.
  - Impact on Network: Contributes positively to network stability and performance.
- Node B: Temporarily Unstable Node
  - Characteristics: Experiences brief, frequent network disruptions that last a few seconds.
  - Impact on Network: May delay block creation if selected as a proposer but does not cause a round change.
- Node C: Intermittently Stopping Node
  - Characteristics: Experiences longer network disruptions (tens of seconds).
  - Impact on Network: May fail to propose a block when selected as a proposer, resulting in round changes and significant delays.
- Node M: Malicious Node
  - Characteristics: Intentionally attempts to destabilize the network through malicious actions.
  - Impact on Network: Threatens network security and integrity. VRank aims to mitigate the influence of such nodes.
Block Header Changes
Starting from FORK_BLOCK, the block proposer must include a new field vrank in the block header.
```python
class Header:
    parentHash: hash
    # ... existing fields ...
    extra: bytes
    governance: bytes
    vote: bytes
    baseFee: int
    randomReveal: bytes
    mixHash: bytes
    vrank: bytes  # New field
```
The vrank field comprises two subfields, pfReport and crReport:
- pfReport: Proposal-failure report for the current block N. It records all round-change events that occurred while reaching consensus for block N, as an ordered list of (round number, proposer). If no round change occurred for block N, pfReport MUST be an empty list.
- crReport: CandidateReady report for the previous block. In the header of block N+1, it records the candidates who successfully submitted a valid CandidateReady message for target block N, formatted as [candidate ID, signature]. The crReport list MUST contain at most one entry per candidate ID. Candidate failures are derived from absence (missing entry) in crReport.
vrank encoding MUST be either empty bytes, or RLP([pfReport, crReport]). If vrank is empty bytes, it MUST be interpreted as pfReport = [] and crReport = [].
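To make the RLP([pfReport, crReport]) layout concrete, here is a minimal sketch of an RLP encoder applied to the two subfields. The byte-level representation of each element is an assumption: round numbers are encoded as big-endian integers without leading zeros, and proposers, candidate IDs, and signatures are passed as raw bytes. The helper names (`rlp_encode`, `encode_vrank`) are hypothetical.

```python
def _length_prefix(length: int, offset: int) -> bytes:
    """RLP length prefix: short form for <= 55 bytes, long form otherwise."""
    if length <= 55:
        return bytes([offset + length])
    length_bytes = length.to_bytes((length.bit_length() + 7) // 8, "big")
    return bytes([offset + 55 + len(length_bytes)]) + length_bytes

def rlp_encode(item) -> bytes:
    """Minimal RLP encoder for bytes and (nested) lists of bytes."""
    if isinstance(item, (bytes, bytearray)):
        data = bytes(item)
        if len(data) == 1 and data[0] < 0x80:
            return data  # a single byte below 0x80 is its own encoding
        return _length_prefix(len(data), 0x80) + data
    if isinstance(item, (list, tuple)):
        payload = b"".join(rlp_encode(x) for x in item)
        return _length_prefix(len(payload), 0xC0) + payload
    raise TypeError(f"cannot RLP-encode {type(item).__name__}")

def int_to_bytes(value: int) -> bytes:
    """Big-endian integer without leading zeros (0 becomes empty bytes)."""
    return value.to_bytes((value.bit_length() + 7) // 8, "big")

def encode_vrank(pf_report, cr_report) -> bytes:
    """pf_report: [(round, proposer)], cr_report: [(candidate_id, signature)]."""
    pf = [[int_to_bytes(rnd), proposer] for rnd, proposer in pf_report]
    cr = [[cid, sig] for cid, sig in cr_report]
    return rlp_encode([pf, cr])

# Two empty lists encode to three bytes: c2 c0 c0.
print(encode_vrank([], []).hex())  # c2c0c0
```

Under the rule above, a header may also carry empty bytes instead of this three-byte encoding when both reports are empty.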
CandidateReady Message Format
proposal_hash is the proposal identifier hash derived from the proposal received in preprepare (i.e., a hash that MUST NOT depend on the vrank field and can be computed by all validators/candidates at preprepare time).
CandidateReady MUST contain (block_number, proposal_hash, signature), where block_number is the target height N and proposal_hash is the proposal hash derived from the proposal received in preprepare for block N. The signature MUST be produced with the candidate’s validator signing key over a typed hash that includes domain separation, chain_id, fork_id, block_number, and proposal_hash (e.g., sig = Sign(keccak256("VRANK_CANDIDATE_READY_V1" || chain_id || fork_id || block_number || proposal_hash)), with an unambiguous canonical encoding). Validators MAY accept CandidateReady that arrives before proposal evidence is available, store it as pending, and validate it once (N, proposal_hash) is known.
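The text requires an unambiguous canonical encoding of the signed fields but does not fix one. The sketch below shows one possible fixed-width layout for the preimage; the field widths (32-byte chain_id and fork_id, 8-byte block_number), the integer type of fork_id, and the function name are all assumptions. Hashing and signing are left out because Keccak-256 is not in the Python standard library (`hashlib.sha3_256` is NIST SHA-3, which produces different digests).

```python
DOMAIN_TAG = b"VRANK_CANDIDATE_READY_V1"  # domain separation string from the text

def candidate_ready_preimage(chain_id: int, fork_id: int,
                             block_number: int, proposal_hash: bytes) -> bytes:
    """Build the byte string that would be hashed with keccak256 and then signed.

    Fixed-width big-endian fields keep the concatenation unambiguous:
    no two distinct field tuples can produce the same bytes.
    """
    if len(proposal_hash) != 32:
        raise ValueError("proposal_hash must be 32 bytes")
    return (
        DOMAIN_TAG
        + chain_id.to_bytes(32, "big")      # assumed width
        + fork_id.to_bytes(32, "big")       # assumed type and width
        + block_number.to_bytes(8, "big")   # assumed width
        + proposal_hash
    )

preimage = candidate_ready_preimage(1, 0, 100, bytes(32))
print(len(preimage))  # 128
```

Because block_number and proposal_hash are both part of the preimage, a signature produced for one target block cannot be replayed for another.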
Changes to Block Validation Process
Once FORK_BLOCK is reached, validators MUST validate the newly added vrank field in the block header. The values of the subfields (pfReport and crReport) are used to evaluate node performance using the components of the VRank framework.
Validators MUST apply the following rules:
- pfReport MUST be ordered by increasing round number and MUST match the validator’s local round-change record for the same block height.
- crReport MUST contain at most one entry per candidate ID.
- Each crReport entry MUST be verifiable against the candidate’s validator key and MUST carry a CandidateReady signature bound to (block_number = N, proposal_hash) for the reported target N = H - 1, where H is the block height of the header that contains the crReport.
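The three rules can be sketched as a single check. This is an illustrative outline, not the reference implementation: `verify_sig` is a hypothetical callback standing in for the signature check against the candidate’s validator key for target N = H - 1, and "increasing" is read here as strictly increasing, which is an assumption.

```python
from typing import Callable, List, Tuple

PfEntry = Tuple[int, bytes]    # (round number, proposer)
CrEntry = Tuple[bytes, bytes]  # (candidate ID, signature)

def validate_vrank(
    pf_report: List[PfEntry],
    cr_report: List[CrEntry],
    local_round_changes: List[PfEntry],
    verify_sig: Callable[[bytes, bytes], bool],
) -> bool:
    # Rule 1a: round numbers must be increasing (strictly, by assumption).
    rounds = [rnd for rnd, _ in pf_report]
    if any(a >= b for a, b in zip(rounds, rounds[1:])):
        return False
    # Rule 1b: the report must match this validator's own round-change record.
    if pf_report != local_round_changes:
        return False
    # Rule 2: at most one crReport entry per candidate ID.
    ids = [cid for cid, _ in cr_report]
    if len(ids) != len(set(ids)):
        return False
    # Rule 3: every entry must carry a valid CandidateReady signature.
    return all(verify_sig(cid, sig) for cid, sig in cr_report)
```

A header would be rejected as soon as any rule fails; the order of the checks does not affect the result.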
VRank Score Components
The VRank framework evaluates node performance using three independent metrics. Metrics reset at the epoch start block. These metrics apply separately to validators and candidates, allowing for a more focused assessment of each role’s responsibilities.
1. Proposal Failure Score (PFS)
Definition: The Proposal Failure Score (PFS) measures the number of times a validator fails to propose a block successfully.
Measurement Method: If a validator fails to propose a block, resulting in a round change, the proposal failure count increases by one.
Consensus Method: The proposer of block N MUST record the proposal failure information of block N in the pfReport field of the block header at height N, as an ordered list of (round number, proposer) for each failed round. Validators MUST compare pfReport(N) with their local round-change record for block N to reach consensus.
Score (Aggregation): For epoch index k, PFS MUST be computed from headers H ∈ [k*EPOCH_LENGTH, (k+1)*EPOCH_LENGTH - 1] by counting each validator’s proposal failures from pfReport.
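The aggregation amounts to counting how often each validator appears as the failed proposer across the epoch’s headers. A minimal sketch, assuming the pfReport of each header has already been decoded into a list of (round number, proposer) pairs; the function name and the use of strings as proposer identifiers are illustrative choices.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

def aggregate_pfs(pf_reports: Iterable[List[Tuple[int, str]]]) -> Dict[str, int]:
    """Count proposal failures per validator over one epoch's pfReports."""
    counts: Counter = Counter()
    for report in pf_reports:          # one report per header in the epoch
        for _round, proposer in report:
            counts[proposer] += 1      # each failed round is one failure
    return dict(counts)

# Three headers: no round change, one failure by V1, then failures by V2 and V1.
epoch_reports = [[], [(0, "V1")], [(0, "V2"), (1, "V1")]]
print(aggregate_pfs(epoch_reports))  # {'V1': 2, 'V2': 1}
```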
2. Message Transmission Failure Score (MFS)
The Message Transmission Failure Score is divided into two components:
a. Total Message Transmission Failure Score (TMFS)
b. Consecutive Message Transmission Failure Score (CMFS)
2.1 Total Message Transmission Failure Score (TMFS)
Definition: Measures the number of times a candidate fails to transmit the expected CandidateReady message during a block proposal cycle, after removing the highest F of the failure counts to address measurement distortions.
Measurement Method: During the evaluation period, if the proposer of block N+1 receives the block proposal from the proposer of block N but does not receive a CandidateReady message from candidate C within the specified timeout (CANDIDATE_READY_TIMEOUT), the total failure count for C increases by 1.
Consensus Method: The proposer of block N+1 MUST record, in the crReport field of the block header at height N+1, the list of (candidate ID, signature) for candidates that successfully submitted a valid CandidateReady message for target block N within the specified timeout (CANDIDATE_READY_TIMEOUT). The crReport list MUST contain at most one entry per candidate ID.
Score (Aggregation): For epoch k, TMFS MUST be computed from headers H ∈ [k*EPOCH_LENGTH + 1, (k+1)*EPOCH_LENGTH - 1], where each header reports target block N = H - 1 (thus covering EPOCH_LENGTH - 1 targets). For each candidate and reporter (the proposer of header H), missing entry = 1 failure in crReport(H); failures MUST be summed per reporter over the epoch, the highest F reporter totals discarded, and the remainder summed as the filtered TMFS.
Example: Mitigating Measurement Distortion in TMFS Calculation
Consider a network with 10 validators and 5 candidates. The following table shows the number of message transmission failures reported by each proposer (P1 to P10) for each candidate (C1 to C5).
| Candidate \ Proposer | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | Total Failures | Filtered Failures |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C1 | 14 | 12 | 15 | 34 | 12 | 32 | 20 | 8640 | 8637 | 8634 | 26050 | 139 |
| C2 | 48 | 10 | 59 | 33 | 49 | 49 | 41 | 8640 | 8637 | 8634 | 26200 | 289 |
| C3 | 48 | 22 | 40 | 41 | 44 | 27 | 61 | 8640 | 8637 | 8634 | 26194 | 283 |
| C4 | 50 | 29 | 45 | 30 | 23 | 2 | 42 | 56 | 56 | 64 | 397 | 221 |
| C5 | 71 | 34 | 62 | 5 | 11 | 20 | 18 | 30 | 19 | 13 | 283 | 116 |
Explanation
The high failure counts reported by P8, P9, and P10 for candidates C1 to C3 indicate potential measurement distortion by malicious validators.
To mitigate this, we exclude the highest F (in this case, F=3) failure counts for each candidate.
The Filtered Failures column shows the total failures after excluding the highest 3 counts.
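The filtering step can be reproduced directly from the table. The sketch below sorts each candidate’s per-proposer totals, drops the highest F, and sums the rest; the function name is illustrative.

```python
def filtered_tmfs(per_reporter_failures, f):
    """Sum per-reporter failure totals after discarding the f highest."""
    ordered = sorted(per_reporter_failures)
    keep = max(len(ordered) - f, 0)
    return sum(ordered[:keep])

n_validators = 10
F = (n_validators - 1) // 3  # 3

# Per-proposer failure totals (P1..P10) for each candidate, from the table.
reports = {
    "C1": [14, 12, 15, 34, 12, 32, 20, 8640, 8637, 8634],
    "C2": [48, 10, 59, 33, 49, 49, 41, 8640, 8637, 8634],
    "C3": [48, 22, 40, 41, 44, 27, 61, 8640, 8637, 8634],
    "C4": [50, 29, 45, 30, 23, 2, 42, 56, 56, 64],
    "C5": [71, 34, 62, 5, 11, 20, 18, 30, 19, 13],
}

filtered = {c: filtered_tmfs(v, F) for c, v in reports.items()}
print(filtered)  # {'C1': 139, 'C2': 289, 'C3': 283, 'C4': 221, 'C5': 116}
```

Note that the filter also removes the three highest honest reports for C4 and C5, so filtered scores are comparable across candidates only because the same number of reports is dropped for each.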
2.2 Consecutive Message Transmission Failure Score (CMFS)
Definition: CMFS measures extended sequences of consecutive transmission failures. This metric helps identify nodes with chronic instability.
Measurement Method:
If 10 consecutive proposers report that candidate C failed to send the CandidateReady message, this is recorded as one 10-CF instance; the CMFS increases by 1 for every 15 such instances.
If 15 consecutive proposers report that candidate C failed to send the CandidateReady message, this is recorded as one 15-CF instance; the CMFS increases by 2 for every 10 such instances.
Consensus Method: Same as TMFS.
Score (Aggregation): For epoch k, CMFS MUST be computed over the same EPOCH_LENGTH - 1 targets and N = H - 1 mapping as TMFS (missing crReport(H) entry = 1 failure), then track consecutive failures per candidate. The score MUST be CMFS = (count_10_cf // 15) * 1 + (count_15_cf // 10) * 2.
CMFS Calculation Example
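Consider the following example, in which candidates C1 and C2 have message transmission failures over a series of blocks. For brevity, the example counts 3-CFs and 5-CFs in place of the 10-CFs and 15-CFs used by the specification: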
| Block Number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 3-CFs | 5-CFs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C1 | x | x | x | x | x | x | x | 3 | 0 | ||||
| C2 | x | x | x | x | x | x | x | x | 1 | 1 |
Explanation:
C1 has three instances of 3 consecutive failures but no instance of 5 consecutive failures.
C2 has one instance of both 3-CFs and 5-CFs.
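Counting CF instances can be sketched as a scan over maximal runs of failures. The failure pattern below is hypothetical, chosen to yield one 3-CF and one 5-CF; it is not taken from the table. Two points are assumptions of this sketch: a run that reaches the longer threshold counts only toward the longer category, and a run still open at the end of the window is counted.

```python
from itertools import groupby
from typing import Iterable, Tuple

def count_cf_instances(failed: Iterable[bool], short: int, long: int) -> Tuple[int, int]:
    """Return (short-CF count, long-CF count) over maximal failure runs."""
    short_count = long_count = 0
    for is_failure, run in groupby(failed):
        if not is_failure:
            continue
        length = sum(1 for _ in run)
        if length >= long:
            long_count += 1
        elif length >= short:
            short_count += 1
    return short_count, long_count

def cmfs_score(count_10_cf: int, count_15_cf: int) -> int:
    """CMFS = (count_10_cf // 15) * 1 + (count_15_cf // 10) * 2."""
    return (count_10_cf // 15) * 1 + (count_15_cf // 10) * 2

# Hypothetical 11-block window: 5 failures, 1 success, 3 failures, 2 successes.
pattern = [True] * 5 + [False] + [True] * 3 + [False] * 2
print(count_cf_instances(pattern, short=3, long=5))  # (1, 1)
print(cmfs_score(15, 10))                            # 3
```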
Pseudocode for CMFS Calculation
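```python
from collections import defaultdict

# Per-candidate CMFS state; reset at each epoch start block.
cmfs_state = defaultdict(lambda: {
    'cf_start': None,   # block number where the current failure run began
    'count_10_cf': 0,
    'count_15_cf': 0,
    'cmfs_score': 0,
})

def update_cmfs(candidate, success, block_number):
    state = cmfs_state[candidate]
    if not success:
        if state['cf_start'] is None:
            # Start of a new consecutive failure sequence
            state['cf_start'] = block_number
    elif state['cf_start'] is not None:
        # End of a consecutive failure sequence.
        # A run still open when the epoch ends is not counted here.
        cf_length = block_number - state['cf_start']
        if cf_length >= 15:
            state['count_15_cf'] += 1
            # +2 for every 10 instances, matching (count_15_cf // 10) * 2
            if state['count_15_cf'] % 10 == 0:
                state['cmfs_score'] += 2
        elif cf_length >= 10:
            state['count_10_cf'] += 1
            # +1 for every 15 instances, matching (count_10_cf // 15) * 1
            if state['count_10_cf'] % 15 == 0:
                state['cmfs_score'] += 1
        # Reset the start point
        state['cf_start'] = None
```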
Rationale
Importance of Mitigating Malicious Behavior
Byzantine Nodes
In a permissionless environment, some validators may act maliciously, attempting to disrupt the network or unfairly penalize honest nodes. It is assumed that fewer than one-third of the validators (at most F) behave maliciously.
Filtering Mechanisms
To mitigate the impact of malicious validators, the highest F failure reports are excluded in TMFS calculations. This ensures that the actions of a few Byzantine nodes do not distort the evaluation of honest candidates.
Robust Scoring Algorithms
VRank’s design ensures that honest nodes are not unfairly penalized due to the actions of Byzantine nodes.
Importance of the 200ms Deadline
Ensuring Consensus Responsiveness
The 200ms deadline for CANDIDATE_READY_TIMEOUT ensures that candidates respond promptly, supporting the network’s goal of generating blocks every second.
Balancing Network Latency
The deadline accounts for global network conditions, allowing for network latency variations without unfairly penalizing candidates.
Score Aggregation Range at Epoch Boundaries (TMFS/CMFS)
TMFS/CMFS are derived from crReport committed in the next block header (N+1). Therefore, for epoch k, the measurable header range is H ∈ [k*EPOCH_LENGTH + 1, (k+1)*EPOCH_LENGTH - 1] (equivalently targets N = H - 1, i.e., exactly EPOCH_LENGTH - 1 targets).
The final target (k+1)*EPOCH_LENGTH - 1 is excluded because its crReport would only be available in header (k+1)*EPOCH_LENGTH. That header cannot be waited for: getCandidate((k+1)*EPOCH_LENGTH) must be determined before consensus on block (k+1)*EPOCH_LENGTH, leaving no time to wait for candidate messages at the epoch boundary.
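The range arithmetic can be checked with a short sketch. The function names are illustrative and not part of the specification.

```python
EPOCH_LENGTH = 86_400

def tmfs_header_range(k: int, epoch_length: int = EPOCH_LENGTH) -> range:
    """Headers H in [k*EPOCH_LENGTH + 1, (k+1)*EPOCH_LENGTH - 1] for epoch k."""
    return range(k * epoch_length + 1, (k + 1) * epoch_length)

def target_of(header_height: int) -> int:
    """Each header H reports CandidateReady results for target N = H - 1."""
    return header_height - 1

headers = tmfs_header_range(2)
print(headers[0], headers[-1], len(headers))  # 172801 259199 86399
print(target_of(headers[0]), target_of(headers[-1]))  # 172800 259198
```

The last target covered is (k+1)*EPOCH_LENGTH - 2, so each epoch measures exactly EPOCH_LENGTH - 1 targets.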
Importance of Chronic Failures (CMFS)
Nodes that repeatedly fail to transmit messages consecutively pose a significant risk to network stability. By implementing a robust policy that identifies and addresses these chronic failures through CMFS, the network ensures that only reliable and stable nodes continue participating in consensus.
Backward Compatibility
The introduction of VRank does not affect existing nodes before FORK_BLOCK. Nodes operating prior to FORK_BLOCK will continue to function as before. After FORK_BLOCK, the new vrank field and associated validation processes come into effect.
Security Considerations
Handling Byzantine Nodes
Assumption of One-Third Malicious Validators
We adopt the standard Byzantine fault tolerance assumption: Kaia Chain relies on fewer than one-third of validators being malicious (at most F = (n - 1) // 3) to ensure safety and liveness. VRank’s scoring mechanism is designed with this threshold in mind, allowing the network to function correctly even in the presence of some malicious actors.
Limitations and Contingencies
If the number of malicious validators exceeds F, that is, reaches one-third or more of the validator set, the network’s ability to reach consensus and maintain integrity may be compromised.
Justification for the Assumption
While it’s challenging to prevent all malicious activity, assuming that fewer than one-third of validators could be compromised provides a practical balance between security and network performance.
Implementation
TBD
Copyright
Copyright and related rights waived via CC0.