
Conversation

@julienrbrt
Member

Closes: #2890

Add submission strategies.

@github-actions
Contributor

github-actions bot commented Jan 13, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build      Format      Lint       Breaking    Updated (UTC)
✅ passed  ⏩ skipped  ✅ passed  ✅ passed   Jan 15, 2026, 10:32 PM

@claude
Contributor

claude bot commented Jan 13, 2026

claude bot commented Jan 13, 2026 — Claude finished @julienrbrt's task in 3m 5s


Comprehensive Code Review Complete ✅

I've completed a thorough review of the DA submission batching strategies implementation. This is an excellent enhancement that adds much-needed flexibility for optimizing DA submission costs and latency. The code quality is high overall, with good test coverage and clear separation of concerns. Below are my detailed findings organized by priority and category.


Strengths

  1. Clean Architecture: The batching strategy interface is well-designed with clear separation between strategy logic and submission mechanics
  2. Excellent Test Coverage: Comprehensive unit tests for all four strategies with edge cases covered (81.89% patch coverage)
  3. Marshalling Optimization: The recent fix to cache marshalled bytes in pending_base.go elegantly solves the double-marshalling concern raised in earlier reviews
  4. Configuration Flexibility: Well-documented configuration options with sensible defaults
  5. Proper Concurrency Control: Good use of mutexes and atomic operations to prevent race conditions
  6. Error Handling: Critical errors (like ErrOversizedItem) are properly detected and halt the node to prevent live lock

🔴 Critical Issues

1. Potential Division by Zero in Batching Strategies

Location: batching_strategy.go:65, batching_strategy.go:129

threshold := int(float64(maxBlobSize) * s.sizeThreshold)

If maxBlobSize is 0, this will result in threshold = 0, which is safe but may not be the intended behavior. More critically, if this value comes from misconfiguration, it could lead to unexpected submission behavior. Consider adding validation:

if maxBlobSize <= 0 {
    return false // or return an error/log warning
}
threshold := int(float64(maxBlobSize) * s.sizeThreshold)

Impact: Medium - Could cause unexpected behavior with misconfigured max blob sizes
Recommendation: Add validation in ShouldSubmit methods or during strategy construction
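
For illustration, a guard in the size-based ShouldSubmit could look like this (a sketch only; the exact method signature and fields in batching_strategy.go may differ):

func (s *SizeBasedStrategy) ShouldSubmit(itemCount, currentSize int, maxBlobSize uint64, sinceLast time.Duration) bool {
    if maxBlobSize == 0 {
        // Misconfigured DA limit: decline size-triggered submission rather
        // than comparing against a zero threshold.
        return false
    }
    threshold := int(float64(maxBlobSize) * s.sizeThreshold)
    return currentSize >= threshold
}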


2. Max Blob Size Mismatch Risk

Location: block/internal/common/consts.go:3

The DefaultMaxBlobSize is set to 7MB, but the PR description mentions increasing it from 2MB to 8MB, and the review comments discuss Celestia's 8MB limit with padding considerations. There's a discrepancy:

const DefaultMaxBlobSize = 7 * 1024 * 1024 // 7MB fallback blob size limit

Concerns:

  • The comment says this accounts for padding, but it's unclear if 7MB is sufficient margin
  • The submitter uses this constant in submitter.go:202 for strategy decisions
  • If padding overhead is more than 1MB (12.5%), submissions could still fail

Recommendation:

  1. Document exactly how much overhead padding adds (consult Celestia docs)
  2. Either adjust the constant or add a configuration option for padding margin
  3. Consider making this configurable per DA layer since different DAs have different overhead (see the sketch below)
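
As a sketch of recommendations 2 and 3, the margin could be made explicit and configurable (all names below are illustrative, not from the PR):

// DefaultPaddingMargin is an assumed knob: the fraction of the DA layer's
// raw blob limit reserved for share padding and envelope overhead.
const DefaultPaddingMargin = 0.125 // 1MB of an 8MB Celestia blob

// effectiveMaxBlobSize derives the usable limit from the raw DA limit.
func effectiveMaxBlobSize(rawLimit uint64, paddingMargin float64) uint64 {
    return uint64(float64(rawLimit) * (1 - paddingMargin))
}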

🟡 High Priority Issues

3. Concurrent Goroutine Leak Risk in Submission Loop

Location: submitter.go:183-228, submitter.go:237-283

The submission loop spawns goroutines using TryLock() and go func(), but there's a potential issue:

if s.headerSubmissionMtx.TryLock() {
    go func() {
        defer s.headerSubmissionMtx.Unlock()
        // ... submission logic
    }()
}

Concerns:

  1. If the context is cancelled while goroutines are running, they may not terminate cleanly
  2. The wg.Wait() in Stop() only waits for the main loops, not these spawned goroutines
  3. During shutdown, pending goroutines could continue running after the context is cancelled

Recommendation:

if s.headerSubmissionMtx.TryLock() {
    s.wg.Add(1)  // Track the goroutine
    go func() {
        defer s.wg.Done()
        defer s.headerSubmissionMtx.Unlock()
        
        // Check context before proceeding
        select {
        case <-s.ctx.Done():
            return
        default:
        }
        
        // ... submission logic
    }()
}

4. Marshalled Cache Memory Growth

Location: block/internal/cache/pending_base.go:27-28

The marshalled cache uses sync.Map without size limits:

// Marshalling cache to avoid redundant marshalling
marshalledCache sync.Map // key: uint64 (height), value: []byte

Concerns:

  1. If DA submissions are delayed (network issues, DA downtime), the cache grows unbounded
  2. Each marshalled header/data is ~several KB, so thousands of pending items could consume significant memory
  3. The cache is only cleared in setLastSubmittedHeight after successful submission

Impact: Medium - Could cause memory pressure during DA outages
Recommendation:

  1. Add a maximum cache size or TTL
  2. Consider implementing an LRU eviction policy
  3. Monitor cache size with metrics (a capped-cache sketch follows below)
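
A size-capped sketch of recommendation 1 (the cap and the evict-lowest-height policy are assumptions; the PR's cache is a plain sync.Map):

import "sync"

// boundedMarshalCache is a sketch of a size-capped replacement for the
// sync.Map; maxEntries and the eviction policy are illustrative only.
type boundedMarshalCache struct {
    mu         sync.Mutex
    entries    map[uint64][]byte // height -> marshalled bytes
    maxEntries int
}

func (c *boundedMarshalCache) put(height uint64, b []byte) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if len(c.entries) >= c.maxEntries {
        // Evict the lowest height; evicted items can be re-marshalled on
        // demand, so this only trades CPU for bounded memory.
        lowest := ^uint64(0)
        for h := range c.entries {
            if h < lowest {
                lowest = h
            }
        }
        delete(c.entries, lowest)
    }
    c.entries[height] = b
}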

5. Error Handling in Size Estimation

Location: submitter.go:194-197, submitter.go:249-252

While the marshalling is now cached, the code retrieving from cache doesn't handle the case where marshalling failed during caching:

// Calculate total size (excluding signature)
totalSize := 0
for _, marshalled := range marshalledHeaders {
    totalSize += len(marshalled)
}

If GetPendingHeaders encounters a marshalling error (line pending_headers.go:68-70), it returns an error, but the error handling in the submitter only logs it:

headers, marshalledHeaders, err := s.cache.GetPendingHeaders(s.ctx)
if err != nil {
    s.logger.Error().Err(err).Msg("failed to get pending headers for batching decision")
    return
}

Recommendation: This is actually handled correctly - returning early if marshalling fails. No changes needed, but consider adding a metric to track marshalling failures.


🟠 Medium Priority Issues

6. Race Condition in Time Tracking

Location: submitter.go:177-178, submitter.go:234-235

lastSubmitNanos := s.lastHeaderSubmit.Load()
timeSinceLastSubmit := time.Since(time.Unix(0, lastSubmitNanos))

There's a tiny window where lastHeaderSubmit could be updated between Load() and time.Since(), leading to slightly inaccurate time calculations.

Impact: Very Low - Worst case is a batch submits slightly earlier/later than optimal
Recommendation: Accept this as-is, or snapshot the time at the start of the ticker loop
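
If the snapshot approach is chosen, it is a two-line change (sketch):

// Snapshot once per tick so the elapsed-time check and any subsequent
// Store calls agree on the same instant.
now := time.Now()
timeSinceLastSubmit := now.Sub(time.Unix(0, s.lastHeaderSubmit.Load()))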


7. Configuration Validation Missing

Location: batching_strategy.go:18-32

The NewBatchingStrategy function validates strategy names but doesn't validate the combination of parameters:

case "immediate":
    return &ImmediateStrategy{}, nil

Issues:

  • For "immediate" strategy, BatchSizeThreshold, BatchMaxDelay, and BatchMinItems are ignored silently
  • No warning to users that they configured unused parameters
  • Could lead to confusion when users expect BatchMinItems=10 to work with "immediate" strategy

Recommendation:

case "immediate":
    if cfg.BatchSizeThreshold != 0.8 || cfg.BatchMaxDelay.Duration != 0 || cfg.BatchMinItems != 1 {
        // Log warning that these params are ignored for immediate strategy
    }
    return &ImmediateStrategy{}, nil

8. Default BatchMaxDelay of 0 Could Be Confusing

Location: pkg/config/defaults.go:83

BatchMaxDelay: DurationWrapper{0}, // 0 means use DA BlockTime

The "0 means use DA BlockTime" convention is only documented in the comment and in the strategy constructors. Users might expect 0 to mean "no delay" instead.

Recommendation:

  1. Make the default explicit: BatchMaxDelay: DurationWrapper{6 * time.Second}
  2. Or add validation in config loading to set it to BlockTime if 0
  3. Document this behavior prominently in configuration docs (option 2 is sketched below)
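
Option 2 amounts to a small normalization step at config-load time (sketch; the exact load hook and field paths are assumed):

// Resolve the "0 means DA BlockTime" convention once, at load time, so the
// strategies never see a zero delay.
if cfg.DA.BatchMaxDelay.Duration == 0 {
    cfg.DA.BatchMaxDelay.Duration = cfg.DA.BlockTime.Duration
}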

🟢 Low Priority / Suggestions

9. Test Coverage for Edge Cases

The test coverage is good (81.89%), but a few additional test cases would be valuable:

  1. Concurrent submission attempts: Test behavior when multiple goroutines try to submit simultaneously (though the mutex should prevent this)
  2. Context cancellation during submission: Verify clean shutdown when context is cancelled mid-submission
  3. Large batch splitting: Test the limitBatchBySize function with various edge cases
  4. Cache invalidation: Test that marshalled cache is properly cleared after submissions

10. Code Documentation

Some complex logic could benefit from additional comments:

Location: submitter.go:164

checkInterval := max(s.config.DA.BlockTime.Duration/4, 100*time.Millisecond)

Recommendation: Explain why BlockTime/4 was chosen and why 100ms is the minimum
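
For example, a comment along these lines (the stated rationale is this review's guess and should be confirmed by the author):

// Poll the batching strategy a few times per DA block so size/time triggers
// fire promptly, but floor the interval at 100ms to avoid busy-looping when
// the DA block time is very short.
checkInterval := max(s.config.DA.BlockTime.Duration/4, 100*time.Millisecond)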


11. Metrics Enhancements

Consider adding these metrics for better observability:

  1. da_submission_batch_size - histogram of actual batch sizes submitted
  2. da_submission_strategy_decision_latency - time spent in ShouldSubmit
  3. da_marshalled_cache_size - current size of marshalled cache
  4. da_submission_trigger_reason - counter per strategy (size/time/immediate); a declaration sketch follows below
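
To illustrate, metrics 1 and 4 could be declared with prometheus/client_golang roughly as follows (the project's actual metrics plumbing may differ):

import "github.com/prometheus/client_golang/prometheus"

var (
    daSubmissionBatchSize = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "da_submission_batch_size",
        Help:    "Number of items per submitted DA batch.",
        Buckets: prometheus.ExponentialBuckets(1, 2, 12),
    })
    daSubmissionTriggerReason = prometheus.NewCounterVec(prometheus.CounterOpts{
        Name: "da_submission_trigger_reason",
        Help: "DA submissions counted by strategy trigger.",
    }, []string{"reason"}) // labels: "size", "time", "immediate"
)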

12. Performance: Unnecessary Time Operations

Location: submitter.go:224, submitter.go:279

s.lastHeaderSubmit.Store(time.Now().UnixNano())

time.Now() is called twice: once to store and potentially once in the next iteration to calculate duration. Minor optimization would be to reuse the time value.


📋 Code Quality & Style

Positive Observations:

  • ✅ Consistent error wrapping with context
  • ✅ Proper use of structured logging with zerolog
  • ✅ Good separation of concerns (strategy, submitter, cache)
  • ✅ Idiomatic Go code throughout
  • ✅ Table-driven tests with clear naming

Minor Style Suggestions:

  1. Consider extracting magic numbers to constants (e.g., checkInterval calculation)
  2. Some functions are getting long (daSubmissionLoop ~130 lines) - consider extracting submission logic to helper methods

🏗️ Architecture & Design

The overall architecture is solid:

  • Interface-based design allows easy addition of new strategies
  • Dependency injection makes testing straightforward
  • Configuration-driven behavior enables runtime tuning
  • Separation of concerns between strategy decision and submission mechanics

Suggestion for Future Enhancement: Consider making strategies pluggable at runtime without restart, possibly via a strategy registry pattern.


🔐 Security Considerations

No significant security issues identified. The code properly:

  • Uses context for cancellation
  • Validates signatures before submission
  • Checks proposer address matches genesis
  • Handles authentication tokens securely (marked with #nosec where appropriate)

📊 Performance Considerations

The marshalling cache optimization is excellent and addresses the main performance concern. Additional considerations:

  1. Memory: Monitor marshalled cache size during DA outages
  2. CPU: Marshalling happens once per item, which is optimal
  3. I/O: Store operations are minimal and well-batched
  4. Goroutines: Properly bounded with mutexes

Summary & Recommendation

This is a high-quality PR that significantly enhances the DA submission system. The batching strategies are well-designed, properly tested, and will provide meaningful cost savings and flexibility.

Recommended Actions Before Merge:

  1. 🔴 Address the goroutine lifecycle tracking (Issue #3)
  2. 🟡 Clarify/document the max blob size and padding calculations (Issue #2)
  3. 🟡 Add validation for division by zero in strategies (Issue #1)
  4. 🟠 Consider adding marshalled cache size limits (Issue #4)

Nice-to-Have (can be addressed in follow-up PRs):

  • Enhanced metrics for observability
  • Additional test coverage for concurrent scenarios
  • Configuration validation warnings for unused parameters

📝 Existing Review Comments Status

Reviewing the previous comments:

Gemini's double-fetching concern - Resolved by caching marshalled bytes
Gemini's marshalling error handling - Actually handled correctly by returning early
Division by zero warning - Still relevant (see Issue #1)
Marshalling CPU waste - Resolved with cache optimization
Padding/max blob size - Partially addressed but needs clarification (see Issue #2)


Overall Assessment: Approve with minor recommendations

The implementation is production-ready with the understanding that the critical issues listed above should be addressed soon (either before merge or in immediate follow-up). The code demonstrates strong engineering practices and will provide real value to users optimizing their DA costs.

@gemini-code-assist
Contributor

Summary of Changes

Hello @julienrbrt, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the Data Availability (DA) layer submission process by introducing a robust and configurable batching system. Instead of immediate submission, the system can now intelligently decide when to send data based on various strategies, such as accumulating a certain amount of data, waiting for a specific time interval, or an adaptive combination of both. This enhancement aims to improve the efficiency and cost-effectiveness of data submissions, particularly for DA layers like Celestia, by better utilizing blob capacity and reducing transaction overhead. The maximum blob size has also been updated to reflect current DA layer capabilities.

Highlights

  • Flexible DA Submission Strategies: Introduced an interface (sketched after this list) and four concrete implementations for batching strategies: Immediate, Size-Based, Time-Based, and Adaptive. This allows for configurable logic to determine when to submit data to the Data Availability layer, optimizing for latency, throughput, or cost.
  • Increased Default Max Blob Size: The DefaultMaxBlobSize has been increased from 2MB to 8MB across the codebase, aligning with Celestia's current blob size limits. This change allows for larger batches of data to be submitted in a single transaction.
  • Configurable Batching Parameters: New configuration flags and fields have been added to DAConfig to control the chosen batching strategy, size thresholds, maximum delays, and minimum item counts, providing fine-grained control over DA submission behavior.
  • Integrated into Submitter Loop: The DA submission loop in submitter.go has been refactored to utilize the new batching strategies. It now periodically checks the strategy to decide whether to submit pending headers and data, improving efficiency by avoiding premature or undersized submissions.
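
For orientation, the strategy interface plausibly has a shape like the following (reconstructed from this description, not copied from batching_strategy.go):

// BatchingStrategy decides when pending items should be flushed to DA.
type BatchingStrategy interface {
    // ShouldSubmit reports whether the pending batch should be submitted
    // now, given its item count, total marshalled size in bytes, the DA
    // blob size limit, and the time since the last successful submission.
    ShouldSubmit(itemCount, currentSize int, maxBlobSize uint64, sinceLastSubmit time.Duration) bool
}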



Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces configurable batching strategies for DA submissions, a valuable feature for optimizing costs and latency. The implementation is well-structured, adding immediate, size, time, and adaptive strategies, along with comprehensive tests. My review focuses on improving the efficiency and robustness of the new logic in the submitter, particularly around size estimation and data fetching.

// Wait if current utilization is below minimum threshold
// Use epsilon for floating point comparison
const epsilon = 0.001
currentUtilization := float64(currentSize) / float64(maxBlobSize)

medium

There's a potential for division by zero if maxBlobSize is 0. In Go, floating-point division by zero results in +Inf or NaN rather than a panic, but this can lead to unexpected behavior in the comparison that follows. It would be safer to add a guard against this, similar to the pattern used in calculateBatchMetrics.
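
A guard along these lines would suffice (sketch):

// Treat a zero blob size limit as "cannot evaluate utilization" instead of
// letting float division produce +Inf or NaN.
if maxBlobSize == 0 {
    return false
}
currentUtilization := float64(currentSize) / float64(maxBlobSize)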

Comment on lines 198 to 201
data, err := h.MarshalBinary()
if err == nil {
    totalSize += len(data)
}

medium

Errors from h.MarshalBinary() are silently ignored. This could lead to an inaccurate totalSize, causing the batching strategy to make a suboptimal decision (e.g., delaying a submission). It's better to log these errors for visibility.

Additionally, this size estimation logic is duplicated for data submission. Consider extracting it into a shared helper function to improve maintainability.

data, err := h.MarshalBinary()
if err != nil {
    s.logger.Warn().Err(err).Msg("failed to marshal header for size estimation")
    continue
}
totalSize += len(data)
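
Since the same loop appears for both headers and data, a shared helper could look roughly like this (sketch; the generic constraint assumes both types expose MarshalBinary):

import "github.com/rs/zerolog"

// estimateTotalSize sums marshalled sizes, logging and skipping items that
// fail to marshal, so headers and data share one code path.
func estimateTotalSize[T interface{ MarshalBinary() ([]byte, error) }](items []T, logger zerolog.Logger) int {
    total := 0
    for _, item := range items {
        b, err := item.MarshalBinary()
        if err != nil {
            logger.Warn().Err(err).Msg("failed to marshal item for size estimation")
            continue
        }
        total += len(b)
    }
    return total
}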

Comment on lines 256 to 259
data, err := d.MarshalBinary()
if err == nil {
    totalSize += len(data)
}

medium

Similar to the header submission logic, errors from d.MarshalBinary() are silently ignored here. This can lead to inaccurate size estimations and suboptimal batching. It's better to log these errors for improved diagnostics and robustness.

data, err := d.MarshalBinary()
if err != nil {
    s.logger.Warn().Err(err).Msg("failed to marshal data for size estimation")
    continue
}
totalSize += len(data)

@codecov

codecov bot commented Jan 15, 2026

Codecov Report

❌ Patch coverage is 81.89300% with 44 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.22%. Comparing base (52825bf) to head (ef3f6fd).

Files with missing lines                           Patch %   Lines
block/internal/submitting/submitter.go             71.95%    18 Missing and 5 partials ⚠️
block/internal/submitting/da_submitter.go          72.22%    5 Missing and 5 partials ⚠️
block/internal/cache/pending_data.go               78.94%    2 Missing and 2 partials ⚠️
block/internal/submitting/batching_strategy.go     93.22%    2 Missing and 2 partials ⚠️
block/internal/cache/pending_headers.go            89.47%    1 Missing and 1 partial ⚠️
block/internal/cache/manager.go                    87.50%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2973      +/-   ##
==========================================
+ Coverage   58.77%   59.22%   +0.44%     
==========================================
  Files         101      102       +1     
  Lines        9685     9826     +141     
==========================================
+ Hits         5692     5819     +127     
- Misses       3381     3390       +9     
- Partials      612      617       +5     
Flag       Coverage Δ
combined   59.22% <81.89%> (+0.44%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@julienrbrt julienrbrt marked this pull request as ready for review January 15, 2026 12:40
Contributor

@tac0turtle tac0turtle left a comment


Left two comments. I think we should clean up the marshalling flow so we don't marshal data and then never use it; that's wasted CPU cycles. Not a large issue in our case, but we should strive to reduce CPU cycles.

return nil, fmt.Errorf("failed to sign envelope: %w", err)
}
// Create the envelope and marshal it
envelope, err := header.MarshalDAEnvelope(envelopeSignature)

We can open an issue for this, as we are marshalling the header twice in this flow: once in the cache (this PR; previously elsewhere) and again in the envelope, even though we sign over the encoded data. If we made this a function that takes the encoded header, it would remove the need to pass headers around on top of the marshalled data.
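
A hypothetical shape of that refactor, just to illustrate the signature change (the placeholder body stands in for the real protobuf envelope):

// marshalDAEnvelope takes the already-encoded header bytes (the same bytes
// that were signed over), so the header is marshalled exactly once per
// submission. The real implementation would populate the project's protobuf
// envelope instead of this placeholder concatenation.
func marshalDAEnvelope(encodedHeader, envelopeSignature []byte) ([]byte, error) {
    if len(encodedHeader) == 0 {
        return nil, errors.New("empty encoded header")
    }
    out := make([]byte, 0, len(encodedHeader)+len(envelopeSignature))
    out = append(out, encodedHeader...)
    out = append(out, envelopeSignature...)
    return out, nil
}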

Contributor

@tac0turtle tac0turtle left a comment


nice job

@tac0turtle tac0turtle enabled auto-merge January 15, 2026 17:24
@julienrbrt
Copy link
Member Author

checking e2e, will fix!



Successfully merging this pull request may close these issues.

[FEATURE] bump blob size to 8mb
