MBarrier: Data Structure & APIs

MBarrier Object (64-bit, in shared memory)

phase
0 or 1
pending_count
arrivals left
expected_count
total expected
tx-count
bytes pending
phase Current phase (0 or 1). Flips automatically on completion.
pending_count Remaining arrivals needed before phase completes.
expected_count Total arrivals expected per phase (set by init).
tx-count Pending async bytes. Increased by arrive.expect_tx, decreased by HW on transfer completion.

Phase Completion Condition

pending_count == 0
&&
tx-count ≤ 0
A phase completes when all expected threads have arrived AND all expected async bytes have been transferred. Upon completion, the barrier auto-resets: phase flips, pending_count resets to expected_count, tx-count resets to 0.

Core APIs

InstructionRoleEffect on MBarrier
mbarrier.init setup Initialize barrier with expected arrival count phase=0, pending=count, tx=0
mbarrier.arrive producer Thread signals arrival at the barrier pending_count -= 1
mbarrier.arrive.expect_tx producer Arrive + declare expected async transfer bytes pending -= 1, tx_count += txCount
tcgen05.commit producer Tensor core signals MMA completion arrive::one on mbarrier
mbarrier.try_wait consumer Wait (suspend thread) until phase completes blocks until phase done
Three arrival patterns: (1) mbarrier.arrive — thread directly decrements pending_count. (2) arrive.expect_tx — thread arrives + declares bytes; TMA engine auto-decrements tx_count when transfer finishes. (3) tcgen05.commit — tensor core auto-arrives when MMA completes.