Skip to content

Make _state atomic to prevent concurrent _read_buf mutation on TCP fa…#3347

Open
chenBright wants to merge 1 commit into
apache:masterfrom
chenBright:fix_rdma_handshake
Open

Make _state atomic to prevent concurrent _read_buf mutation on TCP fa…#3347
chenBright wants to merge 1 commit into
apache:masterfrom
chenBright:fix_rdma_handshake

Conversation

@chenBright

@chenBright chenBright commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Issue Number: resolve

Problem Summary:

RdmaEndpoint::_state is a plain (non-atomic) enum, but it is written
by the RDMA handshake bthread and read concurrently by the
event dispatcher bthread in OnNewDataFromTcp. This is a data race,
and on a weak memory model (ARM / POWER / RISC-V) it can let the two
threads concurrently mutate _socket->_read_buf and corrupt an IOBuf.

time   bthread A (OnNewDataFromTcp)                    bthread B (ProcessHandshakeAtServer)
────   ─────────────────────────                       ─────────────────────────────────────
 T0                                                    Program order:
                                                         s->_read_buf.append(magic, ...)
                                                         ep->_state = FALLBACK_TCP

       ═════════════════════════════ Actual order after CPU reordering ══════════════════════

 T1                                                    ep->_state = FALLBACK_TCP
                                                         ← hoisted: this store takes effect FIRST
                                                         (_nevent is still 0,
                                                          _read_buf still has no magic in it)

 T2   (new bytes arrive on the fd,
        epoll edge-trigger fires)
       Socket::StartInputEvent
         _nevent.fetch_add(acq_rel): 0 → 1, == 0
         → spawn bthread A: OnEdge → OnNewDataFromTcp

 T3   read ep->_state (plain read, no acquire)
       => observes FALLBACK_TCP ✓ (T1 is visible)

 T4   else if (_state == FALLBACK_TCP) ✓
       → InputMessenger::OnNewMessages(m)
         └─ DoRead →
              _read_buf.append_from_file_descriptor(fd, ...)
         ⚠ A is now WRITING into _read_buf

 T5                                                    s->_read_buf.append(magic, MAGIC_STR_LEN)
                                                         ← reordered: this append happens NOW
                                                         ⚠⚠⚠ A and B are concurrently mutating
                                                              the SAME IOBuf object!

                                                       Consequences:
                                                         • IOBuf internal linked list /
                                                           _ref_array / _data pointer / refcount
                                                           are mutated concurrently
                                                         • memory corruption, use-after-free,
                                                           refcount drift, crash
                                                         • or — more insidious — the bytes
                                                           appear to be accepted but the
                                                           internal state of _read_buf is
                                                           already corrupted

What is changed and the side effects?

Changed:

Make _state a butil::atomic<State>.

  • Terminal-state stores use memory_order_release and the corresponding loads use
    memory_order_acquire, so that everything published before a terminal state becomes
    visible to the reader:
    • the magic bytes appended back into _socket->_read_buf before FALLBACK_TCP;
    • the RDMA window/resource setup done before ESTABLISHED.
  • All other (non-terminal) handshake transitions use memory_order_relaxed.

Side effects:

  • Performance effects:

  • Breaking backward compatibility:


Check List:

…llback

RdmaEndpoint::_state was a plain enum, written by the handshake bthread
and read concurrently by the event-dispatching thread (OnNewDataFromTcp).
This is a data race, and on a weak memory model it can let the two threads
concurrently mutate _socket->_read_buf.

Make _state a butil::atomic<State>:
- Terminal-state stores use release and the matching loads use acquire,
  so data published before a terminal state (the magic bytes put back
  into _read_buf, and the RDMA window/resource setup before ESTABLISHED)
  is visible to the reader.

- Non-terminal handshake transitions use relaxed.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a concurrency data race in the RDMA/TCP handshake path by making RdmaEndpoint::_state atomic and using acquire/release semantics for terminal handshake transitions, preventing concurrent mutation of Socket::_read_buf during fallback/establishment.

Changes:

  • Changed RdmaEndpoint::_state from a plain State enum to butil::atomic<State>.
  • Updated all reads/writes of _state to use load()/store() with appropriate memory orders (release for terminal publishes, acquire for corresponding reads).
  • Minor refactors/formatting adjustments around handshake logic (e.g., rdma_on flag computation, line wrapping).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
src/brpc/rdma/rdma_endpoint.h Makes _state an atomic and adjusts TryReadOnTcp declaration to match new usage patterns.
src/brpc/rdma/rdma_endpoint.cpp Converts _state accesses to atomic load/store and adds acquire/release ordering to prevent handshake/TCP read buffer races.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@chenBright chenBright requested a review from yanglimingcn June 15, 2026 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants