KIP 227: Candidate and Validator Evaluation Source

Author: Joseph, Lewis, Ian, Ollie, Lake
Discussions-To: https://github.com/kaiachain/kips/issues/84
Status: Draft
Type: Core
Created: 2025-01-07

Simple Summary

This proposal presents VRank, a framework for quantitatively assessing the performance and stability of candidates and validators in the Kaia Chain network.

Abstract

As Kaia Chain transitions to a permissionless network structure, it is critical to hold validators more accountable. They play an important role in ensuring that the network runs smoothly, securely, and without problems. This KIP introduces VRank, a Validator Reputation Evaluation Framework that quantitatively assesses the performance and stability of both candidates and validators. VRank aims to ensure that nodes involved in the consensus mechanism are trustworthy and capable of meeting the security requirements of the permissionless Kaia Chain network.

Introduction

Kaia Chain plans to move from a permissioned network to a permissionless one. In a permissionless network, anyone can become a validator without prior approval, which makes the network more decentralized, secure, and open. This change fits Kaia Chain’s goal of a more open and robust blockchain ecosystem. See KGP-4: Permissionless Kaia Chain for details on the transition.

Motivation

In decentralized networks employing blockchain technology, the reliability and performance of validator nodes are crucial for maintaining network stability, security, and efficiency. Validator nodes propose and validate new blocks, ensuring ledger integrity and establishing trust among participants through consistent operation.

However, not all validator nodes operate at optimal efficiency. Some nodes may experience frequent failures or delays, while others may behave maliciously, either deliberately or under external pressure. These problems can cause network delays, increase the risk of forks, and make the network more susceptible to attacks such as double-spending and censorship.

Current mechanisms for evaluating validator performance may fail to penalize persistent underperformance or malicious behavior, and they do not consistently reward good performance. A thorough evaluation system is needed to measure the reliability of validator nodes precisely, discourage inadequate performance, and improve the overall integrity of the network.

This proposal introduces an innovative evaluation framework for candidates and validators, highlighting measurable performance metrics. Our objective is to create a more resilient and fair framework for assessing node performance by defining metrics such as the Proposal Failure Score (PFS) and Candidate Failure Score (CFS). This framework aims to identify underperforming or malicious nodes, which helps preserve high standards among validators.

The framework promotes consistent uptime and reliability. Validators have an incentive to maintain the stability and responsiveness of their nodes, thereby maintaining the performance of the network.

Specification

Parameters

| Constant | Value/Definition |
| --- | --- |
| FORK_BLOCK | TBD |
| CANDIDATE_MSG_TIMEOUT | Protocol parameter (milliseconds). Default = 500ms. |
| EPOCH_LENGTH | 86,400 blocks (approximately 1 day, assuming 1-second block time) |
| MAX_BYZANTINE_NODES (F) | Calculated as F = (n - 1) // 3, where n is the number of validators |
| PFS_THRESHOLD | 2 (max proposal failures per epoch) |
| CFS_THRESHOLD | 300 (max candidate failures per epoch) |
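
The derived quantities in the table can be sketched in a few lines of Go. This is only an illustration: the helper names (MaxByzantine, EpochStart, IsEpochStart) are hypothetical, and the sketch assumes that an epoch starts at every multiple of EPOCH_LENGTH, as the epoch-start rule for cfReport suggests.

```go
package main

import "fmt"

// EpochLength mirrors EPOCH_LENGTH from the parameter table.
const EpochLength uint64 = 86400

// MaxByzantine returns F = (n - 1) // 3 for n validators (0 when n <= 0).
func MaxByzantine(n int) int {
	if n <= 0 {
		return 0
	}
	return (n - 1) / 3
}

// EpochStart returns the first block number of the epoch containing block n,
// assuming epochs start at multiples of EpochLength.
func EpochStart(n uint64) uint64 {
	return n - n%EpochLength
}

// IsEpochStart reports whether block n is the first block of an epoch.
func IsEpochStart(n uint64) bool {
	return n%EpochLength == 0
}

func main() {
	fmt.Println(MaxByzantine(4), MaxByzantine(10))        // 1 3
	fmt.Println(EpochStart(100000), IsEpochStart(172800)) // 86400 true
}
```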

Data Structures and Protocol Primitives

Block Header Extension (VRank)

Starting from FORK_BLOCK, the block header includes a new field VRank. The VRank field contains RLPEncode(cfReport) or nil if cfReport is empty.

type Header struct {
    ParentHash   common.Hash
    // ... existing fields ...
    Extra        []byte
    Governance   []byte
    Vote         []byte
    BaseFee      *big.Int
    RandomReveal []byte
    MixHash      []byte
    VRank        []byte  // New field
}
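
As a rough illustration of what ends up in header.VRank, the sketch below hand-encodes a list of 20-byte addresses using the standard RLP rules (short-string prefix 0x94 per address, list prefix 0xc0+len or 0xf7+len-of-len). The Address type and encodeCfReport are hypothetical stand-ins; a node would presumably use its existing RLP package rather than this hand-rolled encoder.

```go
package main

import "fmt"

// Address is a 20-byte account address (stand-in for common.Address).
type Address [20]byte

// encodeCfReport returns nil for an empty report, otherwise the RLP encoding
// of the report as a list of 20-byte strings.
func encodeCfReport(report []Address) []byte {
	if len(report) == 0 {
		return nil // the spec stores nil, not an encoded empty list
	}
	payloadLen := 21 * len(report) // 1 prefix byte + 20 address bytes per entry
	var out []byte
	if payloadLen <= 55 {
		out = append(out, 0xc0+byte(payloadLen)) // short list prefix
	} else {
		// Long list: 0xf7 + length-of-length, then the big-endian payload length.
		var lenBytes []byte
		for l := payloadLen; l > 0; l >>= 8 {
			lenBytes = append([]byte{byte(l)}, lenBytes...)
		}
		out = append(out, 0xf7+byte(len(lenBytes)))
		out = append(out, lenBytes...)
	}
	for _, a := range report {
		out = append(out, 0x94) // 0x80 + 20: short-string prefix
		out = append(out, a[:]...)
	}
	return out
}

func main() {
	fmt.Println(encodeCfReport(nil) == nil)                   // true
	fmt.Printf("%x\n", encodeCfReport([]Address{{0x01}})[:3]) // d59401
}
```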

Reports (pfReport, cfReport)

Both pfReport and cfReport are per-block data structures. A node’s presence in either report is undesirable: it indicates a failure, and the node may be penalized in future epoch evaluations.

pfReport (Proposal Failure Report): Extractable from header.Extra. Contains the list of proposers whose failure to propose caused a round change during consensus for the block.

Format: pfReport(N) -> [proposerAddrRound0, proposerAddrRound1, ...] with at most one entry per validator (validator(N)).

cfReport (Candidate Failure Report): Encoded in header.VRank at block N for target block N-1. Contains the list of candidates (nodes in CandTesting) that failed to send a valid VRankCandidate message on-time for block N-1. If N % EPOCH_LENGTH == 0 (that is, N = k*EPOCH_LENGTH, an epoch start), cfReport(N) MUST be empty.

Format: cfReport(N) -> [candidateAddr1, candidateAddr2, ...] with at most one entry per candidate of previous block (candidate(N-1)).

VRankPreprepare

VRankPreprepare is a message type sent by the proposer of block N to all candidates under CandTesting after having sent Istanbul Preprepare messages to consensus participants. It triggers candidates to respond with VRankCandidate. The timeout for candidate response is CANDIDATE_MSG_TIMEOUT (default 500ms).

type VRankPreprepare struct {
	Block *types.Block
	View  *istanbul.View
}

VRankCandidate

VRankCandidate is a message type sent by each candidate (node in CandTesting) to all validators under ValActive upon receiving VRankPreprepare. To be counted as on-time, a candidate’s VRankCandidate must arrive within CANDIDATE_MSG_TIMEOUT of the receiving validator’s preprepared_time.

Signature scheme: The signature MUST be produced with the candidate’s validator signing key over keccak256("VRANK_CANDIDATE_V1" || chain_id || block_number || round || block_hash), with an unambiguous canonical encoding of each field.

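type VRankCandidate struct {
	BlockNumber uint64      // number of the proposed block (N)
	Round       uint8       // consensus round of the view
	BlockHash   common.Hash // hash of the proposed block
	Sig         []byte      // candidate's signature over the digest defined above
}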
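
The signature scheme asks for an unambiguous canonical encoding but does not fix one. The sketch below shows one possible layout, purely as an assumption: the domain tag followed by fixed-width big-endian fields (8-byte chain_id, 8-byte block_number, 1-byte round, 32-byte block_hash). The function name candidatePreimage and the field widths are hypothetical; hashing with keccak256 and signing are left to the node's crypto library.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// domainTag separates VRankCandidate signatures from other signed messages.
const domainTag = "VRANK_CANDIDATE_V1"

// candidatePreimage builds the byte string to be hashed with keccak256.
// Fixed-width fields make the concatenation unambiguous (assumed layout).
func candidatePreimage(chainID, blockNumber uint64, round uint8, blockHash [32]byte) []byte {
	buf := make([]byte, 0, len(domainTag)+8+8+1+32)
	buf = append(buf, domainTag...)
	buf = binary.BigEndian.AppendUint64(buf, chainID)
	buf = binary.BigEndian.AppendUint64(buf, blockNumber)
	buf = append(buf, round)
	buf = append(buf, blockHash[:]...)
	return buf
}

func main() {
	var hash [32]byte
	hash[0] = 0xab
	// 8217 is used here only as an example chain ID.
	p := candidatePreimage(8217, 1000, 2, hash)
	fmt.Println(len(p)) // 67
}
```

Because every field has a fixed width, two different (chain_id, block_number, round, block_hash) tuples can never produce the same preimage, which is the property the canonical-encoding requirement is after.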

Consensus Protocol Integration

VRank runs in parallel with consensus. Both reports are produced during consensus for each block: pfReport(N) is derived from header.Extra of block N itself, while the cfReport for target block N is committed in the header of block N+1.

Proposer of block N

  1. After having sent Istanbul Preprepare messages to consensus participants, the proposer MUST send VRankPreprepare to all candidates in CandTesting.
  2. The round information is recorded in header.Extra as part of the existing consensus. If the proposer fails to propose and a round change occurs, the failed proposer’s address is recorded in pfReport(N).

Validators during consensus for block N

  1. When block N enters the preprepared pBFT state, each validator MUST record preprepared_time.
  2. Each validator MUST collect VRankCandidate messages from candidates in CandTesting and record each message’s arrival time.
  3. If a validator receives more than one VRankCandidate from the same candidate for the same view (block number N and round R), only the first valid message MUST be accepted; subsequent messages MUST be ignored.
  4. A candidate is counted as on-time if the message is valid and either (a) it arrives before preprepared_time, or (b) arrival_time - preprepared_time ≤ CANDIDATE_MSG_TIMEOUT. Otherwise, it will be recorded in cfReport(N+1).
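
Rules 3 and 4 above can be sketched as a small bookkeeping structure kept by each validator. This is a minimal sketch under stated assumptions: the Collector and msgKey types are hypothetical, candidates are identified by plain strings, and signature validation is assumed to have happened before Record is called.

```go
package main

import (
	"fmt"
	"time"
)

// CandidateMsgTimeout mirrors CANDIDATE_MSG_TIMEOUT (default 500ms).
const CandidateMsgTimeout = 500 * time.Millisecond

// msgKey identifies one candidate's message for one view (block number, round).
type msgKey struct {
	Candidate   string
	BlockNumber uint64
	Round       uint8
}

// Collector holds a validator's local observations for one block.
type Collector struct {
	PrepreparedAt time.Time            // local preprepared_time
	arrivals      map[msgKey]time.Time // first valid arrival per (candidate, view)
}

// Record stores the arrival time of an already-validated VRankCandidate.
// It returns false and changes nothing if a message for the same key exists.
func (c *Collector) Record(k msgKey, arrival time.Time) bool {
	if c.arrivals == nil {
		c.arrivals = make(map[msgKey]time.Time)
	}
	if _, dup := c.arrivals[k]; dup {
		return false
	}
	c.arrivals[k] = arrival
	return true
}

// OnTime applies rule 4: the message arrived before preprepared_time, or no
// later than CandidateMsgTimeout after it.
func (c *Collector) OnTime(k msgKey) bool {
	arrival, ok := c.arrivals[k]
	if !ok {
		return false // no valid message was received
	}
	return arrival.Before(c.PrepreparedAt) ||
		arrival.Sub(c.PrepreparedAt) <= CandidateMsgTimeout
}

func main() {
	c := &Collector{PrepreparedAt: time.Now()}
	k := msgKey{"C1", 100, 0}
	fmt.Println(c.Record(k, c.PrepreparedAt.Add(300*time.Millisecond))) // true
	fmt.Println(c.Record(k, c.PrepreparedAt.Add(900*time.Millisecond))) // false
	fmt.Println(c.OnTime(k))                                            // true
}
```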

Candidates (nodes in CandTesting)

  1. Upon receiving VRankPreprepare for block N, each candidate MUST broadcast VRankCandidate to all validators in ValActive.
  2. To be counted as on-time, the VRankCandidate MUST arrive at each validator within CANDIDATE_MSG_TIMEOUT of that validator’s preprepared_time for block N.

Proposer of block N+1

  1. When proposing block N+1, the proposer MUST build header.VRank from cfReport(N+1), which reports on target block N.
  2. cfReport(N+1) MUST include each candidate (in CandTesting at block N) who either (a) did not send a valid VRankCandidate for block N on-time, or (b) sent an invalid message.
  3. The proposer MUST encode cfReport in header.VRank as RLPEncode(cfReport) or nil if empty.
  4. Candidates in cfReport are counted as failures for CFS aggregation.
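
The proposer-side steps can be sketched as follows. The function buildCfReport and its string-based candidate identifiers are hypothetical; the on-time set is assumed to come from the proposer's own observations of VRankCandidate messages for the target block, with invalid messages simply treated as not on-time.

```go
package main

import "fmt"

// EpochLength mirrors EPOCH_LENGTH from the parameter table.
const EpochLength uint64 = 86400

// buildCfReport returns the cfReport to place in the header of block `next`.
// candidates is the CandTesting set at block next-1 in a deterministic order;
// onTime marks candidates whose valid VRankCandidate for block next-1 arrived
// on-time according to the proposer's local observations.
func buildCfReport(next uint64, candidates []string, onTime map[string]bool) []string {
	if next%EpochLength == 0 {
		return nil // the report MUST be empty at an epoch start
	}
	var report []string
	seen := make(map[string]bool, len(candidates))
	for _, c := range candidates {
		if onTime[c] || seen[c] {
			continue
		}
		seen[c] = true // at most one entry per candidate
		report = append(report, c)
	}
	return report
}

func main() {
	cands := []string{"C1", "C2", "C3"}
	onTime := map[string]bool{"C1": true, "C3": true}
	fmt.Println(buildCfReport(101, cands, onTime))   // [C2]
	fmt.Println(buildCfReport(86400, cands, onTime)) // []
}
```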

Block Validation

After FORK_BLOCK, validators MUST validate the newly added VRank field in the block header. The values of the subfields (cfReport) are used to evaluate node performance using the components of the VRank framework. Given a header with number N:

  • header.VRank MUST be nil or RLPEncode(cfReport(N)).
  • cfReport(N) MUST contain at most one entry per candidate ID. Each entry must be a candidate address from candidates(N-1).
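
The structural checks above, together with the epoch-start rule for cfReport, might look like the sketch below once header.VRank has been RLP-decoded. The function validateCfReport and the string-based addresses are hypothetical simplifications.

```go
package main

import (
	"errors"
	"fmt"
)

// EpochLength mirrors EPOCH_LENGTH from the parameter table.
const EpochLength uint64 = 86400

// validateCfReport checks a decoded cfReport(n) against candidates(n-1).
func validateCfReport(n uint64, report []string, prevCandidates map[string]bool) error {
	if n%EpochLength == 0 && len(report) != 0 {
		return errors.New("cfReport must be empty at an epoch start")
	}
	seen := make(map[string]bool, len(report))
	for _, addr := range report {
		if !prevCandidates[addr] {
			return fmt.Errorf("%s was not a candidate at block %d", addr, n-1)
		}
		if seen[addr] {
			return fmt.Errorf("duplicate entry for %s", addr)
		}
		seen[addr] = true
	}
	return nil
}

func main() {
	cands := map[string]bool{"C1": true, "C2": true}
	fmt.Println(validateCfReport(101, []string{"C1"}, cands))       // <nil>
	fmt.Println(validateCfReport(101, []string{"C1", "C1"}, cands)) // duplicate entry for C1
}
```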

Failure Scores (PFS, CFS)

Each score is per epoch, computed from pfReport and cfReport in epoch blocks. Higher values indicate worse performance; zero indicates no failures.

Proposal Failure Score (PFS): For a given block number N, PFS MUST be computed from pfReport(b) for blocks b ∈ [epochStart(N), N]. For each validator, count how many times the validator appears across all pfReports in the epoch (each round change adds one entry). PFS maps each validator address to its total proposal failure count.

Format: pfs(N) -> map[proposerAddr]score
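
A minimal sketch of the PFS aggregation, assuming the pfReports for blocks epochStart(N) through N have already been extracted from header.Extra (the extraction itself is not shown, and computePFS is a hypothetical name):

```go
package main

import "fmt"

// computePFS counts, per validator, how many times it appears across the
// given pfReports, one report per block in [epochStart(N), N].
func computePFS(pfReports [][]string) map[string]int {
	pfs := make(map[string]int)
	for _, report := range pfReports {
		for _, proposer := range report {
			pfs[proposer]++
		}
	}
	return pfs
}

func main() {
	reports := [][]string{
		{},           // block finalized in round 0, no failures
		{"V2"},       // V2 failed in round 0
		{"V2", "V3"}, // V2 and V3 failed before a later round succeeded
	}
	fmt.Println(computePFS(reports)) // map[V2:2 V3:1]
}
```

Validators that never appear in a pfReport are absent from the map, which corresponds to a score of zero.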

Candidate Failure Score (CFS): For a given block number N, CFS MUST be computed from cfReport(b) for blocks b ∈ [epochStart(N), N] (note that cfReport(epochStart(N)) is empty). For each candidate C and each block b in that range, if C is in cfReport(b), that counts as 1 failure attributed to the reporter (the proposer of block b). For each candidate, sum failures per reporter over the epoch, discard the highest F reporter totals (Byzantine filtering), and sum the remainder to obtain CFS.

Format: cfs(N) -> map[candidateAddr]score
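
The aggregation and Byzantine filtering can be sketched as below. The types and function names (BlockReport, filteredScore, computeCFS) are hypothetical, and candidates and reporters are plain strings for brevity. The data in main is taken from Example 1 of this section.

```go
package main

import (
	"fmt"
	"sort"
)

// BlockReport pairs the proposer (reporter) of a block with the cfReport
// carried in that block's header.
type BlockReport struct {
	Reporter string
	CfReport []string
}

// filteredScore discards the f highest per-reporter totals and sums the rest.
func filteredScore(totals []int, f int) int {
	sorted := append([]int(nil), totals...) // copy so the input is not reordered
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
	sum := 0
	for i := f; i < len(sorted); i++ {
		sum += sorted[i]
	}
	return sum
}

// computeCFS aggregates one epoch of cfReports into per-candidate scores.
// Reporters that never reported a candidate contribute 0 and are omitted.
func computeCFS(reports []BlockReport, f int) map[string]int {
	perReporter := make(map[string]map[string]int) // candidate -> reporter -> failures
	for _, r := range reports {
		for _, c := range r.CfReport {
			if perReporter[c] == nil {
				perReporter[c] = make(map[string]int)
			}
			perReporter[c][r.Reporter]++
		}
	}
	cfs := make(map[string]int, len(perReporter))
	for c, byReporter := range perReporter {
		totals := make([]int, 0, len(byReporter))
		for _, t := range byReporter {
			totals = append(totals, t)
		}
		cfs[c] = filteredScore(totals, f)
	}
	return cfs
}

func main() {
	// Blocks 5..9 with F = 1.
	epoch := []BlockReport{
		{"P1", nil},
		{"P2", nil},
		{"P3", []string{"C1", "C2", "C3"}},
		{"P4", []string{"C1", "C2"}},
		{"P4", []string{"C1", "C2"}},
	}
	fmt.Println(computeCFS(epoch, 1)) // map[C1:1 C2:1 C3:0]
}
```

Omitting zero totals does not change the result, because the discarded entries are always the largest ones.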

Example: Byzantine filtering in CFS

Example 1: Short epoch

EPOCH_LENGTH = 5
len(validators) = 4
len(candidates) = 3
F = (4 - 1) // 3 = 1

proposer(5)=P1, cfReport(5)=[]
proposer(6)=P2, cfReport(6)=[]
proposer(7)=P3, cfReport(7)=[C1,C2,C3]
proposer(8)=P4, cfReport(8)=[C1,C2]
proposer(9)=P4, cfReport(9)=[C1,C2]

Aggregated cfReport(N) where N ∈ [5, 9]:

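| Candidate \ Reporter | P1 | P2 | P3 | P4 | Total | Filtered (CFS) | Byzantine filtering |
| --- | --- | --- | --- | --- | --- | --- | --- |
| C1 | 0 | 0 | 1 | 2 | 3 | 1 | P4 is not counted |
| C2 | 0 | 0 | 1 | 2 | 3 | 1 | P4 is not counted |
| C3 | 0 | 0 | 1 | 0 | 1 | 0 | P3 is not counted |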

Example 2: Byzantine behavior

Consider a network with 10 validators (proposers P1–P10) and 5 candidates (C1–C5). The table shows how many times each candidate appears in cfReport(N) when each proposer produced a report (i.e., failures reported per candidate per reporter). P8, P9, and P10 report abnormally high counts for C1–C3, suggesting Byzantine behavior. With F = 3, we discard the highest 3 reporter totals per candidate and sum the remainder to obtain the filtered CFS.

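| Candidate \ Reporter | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | Total | Filtered (CFS) | Byzantine filtering |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C1 | 14 | 12 | 15 | 34 | 12 | 32 | 20 | 8640 | 8637 | 8634 | 26050 | 139 | exclude reports from P8, P9, P10 |
| C2 | 48 | 10 | 59 | 33 | 49 | 49 | 41 | 8640 | 8637 | 8634 | 26200 | 289 | exclude reports from P8, P9, P10 |
| C3 | 48 | 22 | 40 | 41 | 44 | 27 | 61 | 8640 | 8637 | 8634 | 26194 | 283 | exclude reports from P8, P9, P10 |
| C4 | 50 | 29 | 45 | 30 | 23 | 2 | 42 | 56 | 56 | 64 | 397 | 221 | exclude reports from P8, P9, P10 |
| C5 | 71 | 34 | 62 | 5 | 11 | 20 | 18 | 30 | 19 | 13 | 283 | 116 | exclude reports from P1, P2, P3 |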

Note: Each cfReport(N) is in the header of block N and reports on target block N - 1.

Score thresholds

A node must meet the following stability requirements to participate in consensus:

  • Block proposal participation: At most PFS_THRESHOLD proposal failures per epoch.
  • Downtime: Less than 0.5% downtime per epoch (fewer than 432 blocks missed).

Violations:

  • A validator exceeding PFS_THRESHOLD in an epoch is classified as not qualified.
  • A candidate exceeding CFS_THRESHOLD in an epoch is classified as not qualified.

The handling of not-qualified nodes is specified in KIP-286.
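
The threshold checks and the downtime arithmetic can be written out as a short sketch. The function names are hypothetical; "exceeding" is read here as strictly greater than the threshold, so a score equal to the threshold still qualifies.

```go
package main

import "fmt"

const (
	EpochLength  = 86400 // EPOCH_LENGTH in blocks
	PFSThreshold = 2     // PFS_THRESHOLD
	CFSThreshold = 300   // CFS_THRESHOLD
)

// validatorQualified reports whether a validator's PFS stays within the limit.
func validatorQualified(pfs int) bool { return pfs <= PFSThreshold }

// candidateQualified reports whether a candidate's CFS stays within the limit.
func candidateQualified(cfs int) bool { return cfs <= CFSThreshold }

// downtimeLimit returns 0.5% of an epoch, expressed in blocks.
func downtimeLimit() int { return EpochLength * 5 / 1000 }

func main() {
	fmt.Println(validatorQualified(2), validatorQualified(3))     // true false
	fmt.Println(candidateQualified(300), candidateQualified(301)) // true false
	fmt.Println(downtimeLimit())                                  // 432
}
```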

Rationale

Choice of CFS_THRESHOLD

Historical data from vrank logs indicates that most healthy nodes pass the evaluation with CFS well below 300 per epoch. The threshold of 300 is therefore set to distinguish underperforming or unstable candidates while allowing normal nodes to qualify for validator promotion.

Importance of Mitigating Malicious Behavior

Byzantine Nodes
In a permissionless environment, some validators may act maliciously, attempting to disrupt the network or unfairly penalize honest nodes. It is assumed that fewer than one-third of the validators behave maliciously, that is, at most F = (n - 1) // 3 out of n.

Filtering Mechanisms
To mitigate the impact of malicious validators, the highest F failure reports are excluded in CFS calculations. This ensures that the actions of a few Byzantine nodes do not distort the evaluation of honest candidates.

Robust Scoring Algorithms
VRank’s design ensures that honest nodes are not unfairly penalized due to the actions of Byzantine nodes.

Importance of the 500ms Deadline

Ensuring Consensus Responsiveness

The 500ms deadline for CANDIDATE_MSG_TIMEOUT ensures that candidates respond promptly, supporting the network’s goal of generating blocks every second.

Regional Centralization

Candidate-to-validator latency depends on distance: candidates near the validator cluster (where most validators are located) observe shorter latency; those farther away observe longer latency. A short timeout would favor candidates collocated with the validator majority and effectively exclude those at greater distance, reinforcing regional concentration. This carries significant risks: (a) a regional network outage could affect a large fraction of the validator set, and (b) operators outside the cluster face a higher barrier to participation.
A longer timeout (e.g., 500ms) allows candidates farther from the cluster to participate, improving decentralization and resilience.

Block Time Impact

A longer timeout does not necessarily slow block production. Block progression is driven by the proposer’s Istanbul Preprepare being sent and committed on time; VRankCandidate is an auxiliary evaluation message collected in parallel. The 500ms window accounts for global network latency variations without unfairly penalizing distant candidates, while the primary consensus path remains unaffected.

Design Choice

Given that regional centralization poses a greater risk than the marginal impact of a 500ms evaluation window, the design favors a longer timeout to support geographic diversity.

The omission of signatures in cfReport

If we required valid VRankCandidate messages (with signatures) to be included in cfReport, then each entry would need a verifiable signature. However, the proposer has the authority to include any candidate in cfReport regardless. A malicious proposer could intentionally omit a candidate’s valid signature and claim that the candidate did not send any message—thereby falsely penalizing an honest candidate (a false positive).

Including signatures would block false negatives (a proposer could not falsely claim a candidate failed when they actually sent a valid message). However, if the proposer is an accomplice of the candidate, they could collude to omit the candidate from cfReport even when the candidate failed—bypassing the signature check.

Given that signatures cannot fully prevent manipulation in either direction, and that signatures add significant size to the report, we decided to simplify: cfReport is a list of candidate addresses only (no signatures). The Byzantine filtering in CFS (excluding the highest F reporter totals) mitigates the impact of malicious proposers.

The exclusion of pfReport from header.VRank

pfReport is extracted from header.Extra rather than stored in header.VRank. Round-change information is recorded during consensus, before the block is finalized. If pfReport were written into header.VRank upon each round change, the header would need to be updated mid-consensus. Supporting such updates would require substantial changes to the current implementation. The Extra field is already populated during consensus with round-change data, so pfReport is derived from there instead.

Empty cfReport(k*EPOCH_LENGTH)

The validator set changes every EPOCH_LENGTH blocks, so there may be new validators at block k*EPOCH_LENGTH that were not validators at block k*EPOCH_LENGTH - 1. Those new validators did not participate in consensus for block k*EPOCH_LENGTH - 1 and therefore could not have collected VRankCandidate messages. The proposer of block k*EPOCH_LENGTH may be such a new validator, in which case it cannot produce a valid cfReport(k*EPOCH_LENGTH). Hence the VRank field in the header at k*EPOCH_LENGTH MUST be nil.

Backward Compatibility

The introduction of VRank does not affect existing nodes before FORK_BLOCK. Nodes operating prior to FORK_BLOCK will continue to function as before. After FORK_BLOCK, the new VRank field and associated validation processes come into effect.

Security Considerations

Handling Byzantine Nodes

Assumption of One-Third Malicious Validators
We adopt the standard Byzantine fault tolerance assumption that fewer than one-third of validators behave maliciously, that is, at most F = (n - 1) // 3 out of n. Kaia Chain relies on this assumption to ensure safety and liveness. VRank’s scoring mechanism is designed with this threshold in mind, allowing the network to function correctly even in the presence of some malicious actors.

Limitations and Contingencies

If the number of malicious validators exceeds one-third, the network’s ability to reach consensus and maintain integrity may be compromised.

Justification for the Assumption
While it is challenging to prevent all malicious activity, assuming that fewer than one-third of validators are compromised provides a practical balance between security and network performance.

Implementation

TBD

Appendix: Node Models

The following node models were considered when designing VRank and defining a stable node. They describe the philosophy behind the scoring thresholds and help clarify the types of behavior VRank aims to distinguish.

VRank categorizes nodes into four models to evaluate their performance and stability:

| Node | Performance | Impact |
| --- | --- | --- |
| Uptime > 99.5%, no network issues | Excellent | Contributes to network stability |
| Uptime about 99.5%, temporarily unstable | Good | May delay block time |
| Uptime < 99.5% | Not good | May fail to propose a block |
| Halts continuously regardless of uptime | Bad | May affect consensus if the network consists of nodes experiencing this |
| Uptime > 99.5%, tries to destabilize the network | N/A | Threatens network integrity |

  1. Node A: Stable Node

    Characteristics: Capable of performing validation duties with optimal performance and stability.

    Impact on Network: Contributes positively to network stability and performance.

  2. Node B: Temporarily Unstable Node

    Characteristics: Experiences brief, frequent network disruptions that last a few seconds.

    Impact on Network: May delay block creation if selected as a proposer but does not cause a round change.

  3. Node C: Intermittently Stopping Node

    Characteristics: Experiences longer network disruptions (tens of seconds).

    Impact on Network: May fail to propose a block when selected as a proposer, resulting in round changes and significant delays.

  4. Node M: Malicious Node

    Characteristics: Intentionally attempts to destabilize the network through malicious actions.

    Impact on Network: Threatens network security and integrity. VRank aims to mitigate the influence of such nodes.

Copyright and related rights waived via CC0.