
feat: optimize rpc calls #5394

Merged
sbackend123 merged 25 commits into master from feat/rpc-calls-optimisation
Apr 15, 2026

Conversation

@sbackend123
Contributor

@sbackend123 sbackend123 commented Mar 12, 2026

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

Description

Removed redundant RPC call patterns in the transaction flow. Added a cache layer for the BlockNumber RPC call.
Fixed lint issues in the new cache package and cleaned up minor lint findings in adjacent test/helper code.

Open API Spec Version Changes (if applicable)

Motivation and Context (Optional)

Related Issue (Optional)

#5388

Screenshots (if appropriate):

Screenshot from 2026-03-16 13-29-22

The hit ratio is most likely low because of high-load errors (EOF), which are fixed by this PR.

@sbackend123 sbackend123 changed the title from "RPC calls optimisation" to "Feat: RPC calls optimisation" Mar 12, 2026
@sbackend123 sbackend123 force-pushed the feat/rpc-calls-optimisation branch from f8ca4e2 to e55ac11 March 13, 2026 07:54
@sbackend123 sbackend123 force-pushed the feat/rpc-calls-optimisation branch from e55ac11 to 04f6520 March 13, 2026 08:00
@sbackend123 sbackend123 changed the title from "Feat: RPC calls optimisation" to "feat: optimize rpc calls" Mar 13, 2026
@sbackend123 sbackend123 marked this pull request as ready for review March 16, 2026 12:31
Comment thread cmd/bee/cmd/deploy.go Outdated
Comment thread pkg/bmt/proof_test.go
t.Helper()

var expSegments [][]byte
expSegments := make([][]byte, 0, len(exp))
Member

These are fine but feel like they belong in a separate PR.

Contributor Author

Yes, but the linter was failing, so I decided to fix everything. Next time I will create a separate PR.

Comment thread pkg/node/chain.go Outdated
Comment thread pkg/transaction/wrapped/cache/metrics.go Outdated
Comment thread pkg/transaction/wrapped/cache/cache.go Outdated
@gacevicljubisa
Member

Suggestion: Instead of a fixed TTL (blocktime*85/100, blocktime-2) + BlockNumber RPC, use a single HeaderByNumber(nil) call periodically to get both the block number and timestamp, then extrapolate between syncs with zero RPC calls.

How it works:

  • Every ~20 blocks, call HeaderByNumber(nil) (latest) — returns block number + timestamp in one RPC call. Store as anchor point.
  • Between syncs, estimate the current block from pure math: currentBlock = anchorBlock + (now - anchorTimestamp) / blockTime. No RPC calls at all.
  • Cache TTL: anchorTimestamp + (currentBlock - anchorBlock + 1) * blockTime - now.

Comment thread pkg/transaction/wrapped/cache/cache.go Outdated
Comment thread pkg/transaction/wrapped/cache/cache_test.go Outdated
Comment thread pkg/transaction/wrapped/cache/keys.go Outdated
Comment thread pkg/transaction/wrapped/cache/metrics.go Outdated
Comment thread pkg/transaction/wrapped/cache/cache.go Outdated
Comment thread pkg/transaction/wrapped/cache/cache_test.go Outdated
Comment thread pkg/transaction/wrapped/cache/cache.go Outdated
Comment thread pkg/transaction/wrapped/wrapped.go Outdated
Comment thread cmd/bee/cmd/deploy.go Outdated
Comment thread pkg/transaction/wrapped/cache/cache.go Outdated
@gacevicljubisa
Member

gacevicljubisa commented Mar 24, 2026

I don't see how this cache meaningfully reduces RPC calls. The BlockNumber callers (postage listener, storage incentives, tx confirmation, API) run on independent staggered timers and rarely overlap within the same block window. With the TTL at 85% of block time, each caller almost always finds the cache expired by its next poll, resulting in a fresh RPC call anyway. The singleflight deduplication only helps when callers hit BlockNumber at the exact same instant, which is uncommon given the staggered schedules. On Gnosis mainnet the block time is only 5s, so the cache TTL would be ~4.25s — hardly enough to span multiple independent caller cycles.

If we do want to reduce BlockNumber RPC calls, a more effective approach would be to use a single HeaderByNumber(nil) call periodically (e.g. every ~20 blocks, or more) to get both the block number and timestamp, then extrapolate between syncs: currentBlock = anchorBlock + (now - anchorTimestamp) / blockTime (mentioned here).

The filterPendingTransactions refactor (eliminating redundant TransactionByHash calls in nextNonce) is a genuine improvement and worth keeping.

@janos wdyt?

@janos
Member

janos commented Mar 24, 2026

> I don't see how this cache meaningfully reduces RPC calls. The BlockNumber callers (postage listener, storage incentives, tx confirmation, API) run on independent staggered timers and rarely overlap within the same block window. With the TTL at 85% of block time, each caller almost always finds the cache expired by its next poll, resulting in a fresh RPC call anyway. The singleflight deduplication only helps when callers hit BlockNumber at the exact same instant, which is uncommon given the staggered schedules. On Gnosis mainnet the block time is only 5s, so the cache TTL would be ~4.25s — hardly enough to span multiple independent caller cycles.
>
> If we do want to reduce BlockNumber RPC calls, a more effective approach would be to use a single HeaderByNumber(nil) call periodically (e.g. every ~20 blocks, or more) to get both the block number and timestamp, then extrapolate between syncs: currentBlock = anchorBlock + (now - anchorTimestamp) / blockTime (mentioned here).
>
> The filterPendingTransactions refactor (eliminating redundant TransactionByHash calls in nextNonce) is a genuine improvement and worth keeping.
>
> @janos wdyt?

Yes, given that block numbers change frequently, every 5 to 12s depending on the network (Gnosis and Sepolia), but not with exactly the same period every time. It is a very good suggestion to calculate the block number as you described, to reduce the frequency of the RPC call for the block number even further, and to estimate the block time via HeaderByNumber. In that case even singleflight is not needed, as internally block numbers will always be returned by their specific cache.

I would even go further and use the block time value that is calculated from HeaderByNumber instead of specifying it statically via options.

The consequence could be that it is required to get the block number at node startup, as both the block number and the block time are needed as known values. Maybe that is even good to do, and to exit the application in case there are problems with the RPC endpoint when getting the block number on start. Just thinking.

@sbackend123
Contributor Author

sbackend123 commented Mar 25, 2026

> Suggestion: Instead of a fixed TTL (blocktime*85/100, blocktime-2) + BlockNumber RPC, use a single HeaderByNumber(nil) call periodically to get both the block number and timestamp, then extrapolate between syncs with zero RPC calls.
>
> How it works:
>
> * Every ~20 blocks, call `HeaderByNumber(nil)` (latest) — returns block number + timestamp in one RPC call. Store as anchor point.
> * Between syncs, estimate the current block from pure math: `currentBlock = anchorBlock + (now - anchorTimestamp) / blockTime`. No RPC calls at all.
> * Cache TTL: `anchorTimestamp + (currentBlock - anchorBlock + 1) * blockTime - now`.

To me this sounds a little over-engineered on one side, and a not-so-flexible implementation on the other (maybe I'm missing something?). I also thought about what happens if we would like to cache another type of value.

> I don't see how this cache meaningfully reduces RPC calls. The BlockNumber callers (postage listener, storage incentives, tx confirmation, API) run on independent staggered timers and rarely overlap within the same block window. With the TTL at 85% of block time, each caller almost always finds the cache expired by its next poll, resulting in a fresh RPC call anyway. The singleflight deduplication only helps when callers hit BlockNumber at the exact same instant, which is uncommon given the staggered schedules. On Gnosis mainnet the block time is only 5s, so the cache TTL would be ~4.25s — hardly enough to span multiple independent caller cycles.
> If we do want to reduce BlockNumber RPC calls, a more effective approach would be to use a single HeaderByNumber(nil) call periodically (e.g. every ~20 blocks, or more) to get both the block number and timestamp, then extrapolate between syncs: currentBlock = anchorBlock + (now - anchorTimestamp) / blockTime (mentioned here).
> The filterPendingTransactions refactor (eliminating redundant TransactionByHash calls in nextNonce) is a genuine improvement and worth keeping.
> @janos wdyt?

> Yes, given that block numbers change frequently, every 5 to 12s depending on the network (Gnosis and Sepolia), but not with exactly the same period every time. It is a very good suggestion to calculate the block number as you described, to reduce the frequency of the RPC call for the block number even further, and to estimate the block time via HeaderByNumber. In that case even singleflight is not needed, as internally block numbers will always be returned by their specific cache.
>
> I would even go further and use the block time value that is calculated from HeaderByNumber instead of specifying it statically via options.
>
> The consequence could be that it is required to get the block number at node startup, as both the block number and the block time are needed as known values. Maybe that is even good to do, and to exit the application in case there are problems with the RPC endpoint when getting the block number on start. Just thinking.

I have a couple of concerns:

  1. Drift over time: block production is not perfectly uniform, so we can accumulate error between syncs, especially if the resync interval is relatively large.

  2. Risk of overshooting: extrapolation may produce a block number that has not actually been produced yet.

  3. With this approach we only solve the problem with the block_number call. If in the future we need to reduce some other RPC calls, we might need to add something else.

I think there can be other corner cases that would be more difficult to catch, since debugging will become more complex.

Member

@janos janos left a comment


@sbackend123 thank you for addressing my comments. As far as they are concerned, all is fine.

Given that some different approaches have been brought to light, along with responses on how sensitive storage incentivization is to block number precision, I am still not approving it, as there is a possibility to reduce the frequency of the RPC call for the block number even more.

@gacevicljubisa
Member

  1. Drift over time
  • Block time variance on Gnosis is small (typically <1s around the 5s average). With a resync every ~20 blocks (~100s), the maximum accumulated error is ~1 block. Using floor() in the extrapolation ensures you never jump ahead.
  • You can also calculate the average block time dynamically on each resync.
  2. Risk of overshooting
  • floor((now - anchorTimestamp) / blockTime) naturally rounds down, so you'll be at most 1 block behind.
  3. Only solves block_number
  • True, but also, using only a caching layer, I am not sure how much we can achieve with other RPC calls.

Contributor

@akrem-chabchoub akrem-chabchoub left a comment


There are some git conflicts that need to be resolved.

Comment thread pkg/storer/reserve_test.go Outdated
chunks = make([]swarm.Chunk, 0, int(chunksPerPO)*2)
putter = storer.ReservePutter()
)
chunks = make([]swarm.Chunk, 0, chunksPerPO*2)
Member

It is a duplicate; there is already an allocation on line 543.

Contributor

that's also why the linter is complaining

Comment thread pkg/transaction/wrapped/cache/metrics.go Outdated
Comment thread pkg/transaction/wrapped/wrapped.go Outdated
Comment thread go.mod Outdated
Comment thread pkg/transaction/transaction.go Outdated
Comment thread pkg/transaction/wrapped/wrapped.go Outdated
"github.com/ethersphere/bee/v2/pkg/transaction/wrapped/cache"
)

const defaultBlockSyncInterval = 10
Contributor

If the value is always this one, I think it makes more sense to have the CLI flag default to 10 and get rid of this const. It just complicates the configuration and adjustment (and gets rid of the necessity of using a sentinel value).

Comment thread pkg/transaction/wrapped/cache/cache.go Outdated
c.mu.RLock()
defer c.mu.RUnlock()

if c.valid {
Contributor

If this generalized cache object is already aware of the expiresAt time, I think it can already be "smart enough" to know something has expired. You can determine here whether valid is still valid, and in fact you can do away with the field on the type altogether and just check whether now() >= expiresAt, since any subsequent usage of this value has to also compare now to the expiresAt time (otherwise you're blindly using that value).

Also a note that generics increase the complexity here with no obvious place where they can be reused.

Comment thread pkg/transaction/wrapped/cache/cache.go Outdated
}
}

func (c *ExpiringSingleFlightCache[T]) Peek() (T, time.Time, bool) {
Contributor

This method does not need to be exported except for unit tests. Can we not have it tested by just checking the callbacks?

Contributor Author

Yes, we can

Comment thread pkg/transaction/wrapped/cache/cache.go Outdated
}

func (c *ExpiringSingleFlightCache[T]) PeekOrLoad(ctx context.Context, now time.Time, canReuse ReuseEvaluator[T], loader Loader[T]) (T, error) {
if value, expiresAt, ok := c.Peek(); ok {
Contributor

I am not 100% sure that Peek is needed. Since the cache is expiring and already has a notion of expiration time, why do you need to use both Peek and pass now (provided by the caller) back to the caller, using data they have already supplied, just in order to check whether now() >= expirationTime?

Would it not be easier to inline the mutex locking and the now > expirationTime check?

Contributor Author

Maybe I'm going too much into the details, but PeekOrLoad is already not the most trivial function, and I would keep Peek at least to avoid expanding PeekOrLoad(). Especially since there is also Set, which works with the mutex as well, so we end up with three methods, each responsible for a specific task. Does that make sense to you?

Contributor

Yes, I agree it is not trivial and there's already a bit too much indirection for what it is supposed to do. I just find it odd that the caller passes now to PeekOrLoad... just to end up getting it back in the callback (plus shadowing) and doing the relevance check in the callback with:

			if now.Before(expiresAt) {
				return true, expiresAt
			}

So why pass now at all? You can have it cleaner by just passing c.value and c.expiresAt into the callback and telling the callback to do the check, and call it a day. In both cases Peek is not needed.

Contributor Author

Ok, simplified

Comment thread pkg/transaction/wrapped/wrapped.go Outdated
number: header.Number.Uint64(),
timestamp: time.Unix(int64(header.Time), 0).UTC(),
}
return anchor, b.nextExpectedBlockTime(anchor, 0), nil
Contributor

I'm not sure I understand this very well. Intuitively I'm interpreting this as: this fetched block value will expire on the next upcoming block tick, but the whole change is about "caching" (or trusting the guessing of block numbers) in chunks (i.e. the expiration date is more like a next-sync date).
Also, the passing of 0 here as a time value isn't clear. Why?

Contributor Author

After a brief discussion we decided to remove expiresAt, so nextExpectedBlockTime is not needed anymore and we can simplify the implementation.

minimumGasTipCap: int64(minimumGasTipCap),
blockTime: blockTime,
metrics: newMetrics(),
blockSyncInterval: blockSyncInterval,
Member

We should prevent blockSyncInterval from being 0. Maybe set it to 1 in that case?
In pkg/transaction/wrapped/wrapped.go#L135 a division by zero can happen.
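A minimal guard in the direction suggested (clampBlockSyncInterval is a hypothetical helper to illustrate the fix, not the PR's actual code):

```go
package main

import "fmt"

// clampBlockSyncInterval ensures the interval is at least 1 so that a
// downstream division or modulo by blockSyncInterval cannot panic.
func clampBlockSyncInterval(v uint64) uint64 {
	if v == 0 {
		return 1
	}
	return v
}

func main() {
	fmt.Println(clampBlockSyncInterval(0))  // prints 1: clamped
	fmt.Println(clampBlockSyncInterval(10)) // prints 10: unchanged
}
```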

Contributor Author

Fixed

Contributor

@acud acud left a comment


LGTM. As general feedback, I'd find it helpful to have a few more comments in the code, as it makes reviews easier to do. They don't have to be long, but the general direction of what the code does or should do (method-level documentation) is generally good and gives the reader a bit more info while going through the code.

Comment thread cmd/bee/cmd/cmd.go
optionNameRedistributionAddress = "redistribution-address"
optionNameStakingAddress = "staking-address"
optionNameBlockTime = "block-time"
optionNameBlockSyncInterval = "block-sync-interval"
Contributor

Please update the packaging folder with the new flag (in docker-compose,...)

# Conflicts:
#	packaging/docker/docker-compose.yml
#	packaging/docker/env
@sbackend123 sbackend123 merged commit 53aa35e into master Apr 15, 2026
16 checks passed
@sbackend123 sbackend123 deleted the feat/rpc-calls-optimisation branch April 15, 2026 10:34
