9  Neural Network Decoding and GPU Acceleration

TipLearning Objectives

After completing this chapter, you will be able to: - Design neural network architectures for syndrome decoding - Implement training strategies for decoder networks - Identify parallelization opportunities in decoding algorithms - Apply GPU acceleration techniques for batch and intra-algorithm parallelism - Evaluate trade-offs between throughput, latency, and accuracy

NotePrerequisites

This chapter assumes familiarity with: - Decoding algorithms (Chapter 6) - Neural network basics (MLPs, training, backpropagation) - Parallel computing concepts (GPU architecture, CUDA basics)

As quantum devices scale to millions of qubits, decoding must keep pace with ever-faster syndrome generation. Neural network decoders and GPU acceleration offer promising approaches for real-time, high-throughput error correction.

9.1 Neural Network Decoder Architectures

Neural networks can learn to decode syndromes without explicit code structure, enabling flexible, data-driven approaches.

9.1.1 Multi-Layer Perceptron Decoders

MLPs provide a baseline neural decoding approach.

NoteDefinition: MLP Decoder

An MLP decoder maps syndrome vectors to error patterns: - Input: Binary syndrome \(s \in \{0,1\}^r\) (r = number of stabilizers) - Hidden layers: Fully connected with activation functions - Output: Estimated error pattern \(\hat{e} \in \{0,1\}^n\) or probabilities

Training: Supervised learning on syndromes from known errors.

Algorithm: MLP Training and Inference

Training:

  1. Generate training data: \((s, e)\) pairs for random errors
  2. Minimize the cross-entropy loss \(\mathcal{L} = -\log P(e \mid s)\)
  3. Optimize weights via gradient descent

Inference:

  1. Feed syndrome \(s\) into the network
  2. Get output probabilities for each qubit
  3. Threshold to get a binary error estimate

Complexity: \(O(n \cdot \text{width})\) per syndrome

ImportantProposition: MLP Performance

For surface code with \(d \times d\) lattice: - Input size: \(r = O(d^2)\) stabilizers - Hidden layers: \(O(d^2)\) neurons each - Can achieve MWPM-level performance for \(d \leq 11\) [@torlai2017] [@andrews2022]

Limitations: - Does not scale well to large \(d\) (input size grows quadratically) - Requires retraining for different code distances [@ouyang2022]

TipIntuition

An MLP decoder is like a lookup table that someone has organized into a smart hierarchy. For small codes (d=5), there are only a few thousand likely syndromes, so the network can memorize most of them. But when the code grows to d=25, the number of possible syndromes becomes astronomically large — it’s like trying to memorize every page of a dictionary instead of learning the rules of spelling. The MLP hits a “memory wall” where it can’t see enough examples of each syndrome type during training. This is why GNNs, which learn the structure rather than memorize examples, scale better.

9.2 Example: Small MLP Decoder

For the \(d = 5\) (rotated) surface code (25 data qubits, 24 stabilizers): - Input: 24 binary values - Hidden: two layers of 128 and 64 neurons - Output: 25 binary values

Training on 1M syndromes achieves ~95% accuracy.

TipIntuition

An MLP decoder is like a doctor who has seen thousands of patients and memorized which symptoms (syndromes) correspond to which diseases (errors). The doctor doesn’t need to understand the underlying biology — they just recognize patterns. This works well for common cases but fails when presented with a rare symptom combination they haven’t seen before.

9.2.1 Graph Neural Network Decoders

GNNs exploit the natural graph structure of stabilizer codes.

NoteDefinition: GNN Decoder

A GNN decoder represents the code as a graph: - Variable nodes: Qubits - Check nodes: Stabilizers - Edges: Qubit-check incidence

Message passing layers aggregate information across the graph.

Algorithm: GNN Message Passing

For each layer:

  1. Check nodes receive messages from connected qubits
  2. Check nodes process them (e.g., sum, mean) to form check messages
  3. Qubits receive messages from connected checks
  4. Qubits update their states

After L layers, readout from qubit nodes gives error estimates.

ImportantProposition: GNN Advantages

GNN decoders have several benefits: - Inductive bias: Graph structure matches code structure - Scalability: Message passing is local, \(O(n)\) complexity - Generalization: Trained on one distance, generalizes to larger - Parallelism: Message passing parallelizes naturally

Can handle codes with thousands of qubits.

TipIntuition

A GNN decoder is like a town hall meeting where everyone only talks to their immediate neighbors. Each person (qubit) hears what their neighbors (connected checks) are concerned about, thinks about it, and then shares their conclusions with their other neighbors. After a few rounds of this local conversation, everyone has a pretty good idea what’s happening across the whole town (the entire code lattice). This is much more efficient than someone trying to understand the whole town by themselves (MLP), and it naturally scales because each person only needs to know their immediate neighbors, not the entire town’s population.

9.3 Example: GNN for Surface Code

For \(d = 101\) surface code (~10,000 qubits): - 3-5 message passing layers sufficient - Each layer: \(O(n)\) operations - Total inference: ~1 ms on GPU

Performance within 2x of MWPM logical error rate [@farquhar2022].

Figure: Graph Neural Network decoder architecture. Syndrome bits feed into check nodes, messages propagate through the Tanner graph, and qubit nodes output error estimates.

9.3.1 Code Example: GNN Decoder Implementation

import torch
import torch.nn as nn
from torch_geometric.nn import MessagePassing

class GNNDecoder(nn.Module):
    """Graph Neural Network decoder for quantum error correction."""
    def __init__(self, num_layers=3, hidden_dim=64):
        super().__init__()
        self.num_layers = num_layers
        self.embedding = nn.Linear(1, hidden_dim)  # Syndrome bit -> embedding

        # Message passing layers
        self.layers = nn.ModuleList([
            GNNLayer(hidden_dim) for _ in range(num_layers)
        ])

        # Output projection
        self.output = nn.Linear(hidden_dim, 1)  # Embedding -> error prob

    def forward(self, syndrome, edge_index):
        """
        Args:
            syndrome: [num_nodes] tensor of syndrome bits
            edge_index: [2, num_edges] connectivity (Tanner graph)

        Returns:
            error_probs: [num_qubits] tensor of error probabilities
        """
        # Initial embedding
        x = self.embedding(syndrome.unsqueeze(1))

        # Message passing
        for layer in self.layers:
            x = layer(x, edge_index)
            x = torch.relu(x)

        # Output per-node probabilities; in practice, select the entries
        # corresponding to qubit (variable) nodes of the Tanner graph
        error_probs = torch.sigmoid(self.output(x).squeeze(-1))
        return error_probs


class GNNLayer(MessagePassing):
    """Single message-passing layer."""
    def __init__(self, hidden_dim):
        super().__init__(aggr='mean')
        self.message_net = nn.Linear(2 * hidden_dim, hidden_dim)
        self.update_net = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, x, edge_index):
        return self.propagate(edge_index, x=x)

    def message(self, x_i, x_j):
        """Message from node j to node i."""
        combined = torch.cat([x_i, x_j], dim=1)
        return self.message_net(combined)

    def update(self, aggr_out, x):
        """Update node state after message aggregation."""
        combined = torch.cat([x, aggr_out], dim=1)
        return self.update_net(combined)


# Training loop
def train_gnn_decoder(model, train_loader, epochs=100):
    """Train GNN decoder on syndrome data."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.BCELoss()

    for epoch in range(epochs):
        total_loss = 0
        for syndrome, edge_index, errors in train_loader:
            # Forward pass
            error_probs = model(syndrome, edge_index)
            loss = criterion(error_probs, errors)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        if epoch % 10 == 0:
            print(f"Epoch {epoch}: loss = {total_loss / len(train_loader):.4f}")

    return model
NoteRemark

This PyTorch Geometric implementation demonstrates key GNN concepts: - Message passing: Nodes aggregate information from neighbors - Edge indexing: Efficient sparse graph representation - Inductive bias: Architecture respects code graph structure

Trained on distance-11 surface code, this model generalizes to distance-21 with only 10% accuracy loss — unlike MLP which requires retraining.

9.3.2 Neural Belief Propagation

Combining neural networks with BP yields powerful hybrid decoders.

NoteDefinition: Neural BP

Neural BP replaces traditional BP messages with neural network outputs: - Variable-to-check messages: Learned functions of inputs - Check-to-variable messages: Learned functions of inputs

Can be seen as learned BP with trainable message functions.
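
To make this concrete, the sketch below unrolls BP for a fixed number of iterations and attaches one trainable scaling weight per iteration. This is only one of several neural-BP variants (per-edge or per-message weights are also common), and the class name, the dense parity-check mask, and the prior_llr value are illustrative choices rather than a standard API.

import torch
import torch.nn as nn

class NeuralBPDecoder(nn.Module):
    """Unrolled belief propagation with one learnable message weight per iteration.

    Minimal sketch: the BP update rules are kept, but each unrolled iteration
    scales its check-to-variable messages by a trained weight."""

    def __init__(self, H, n_iters=5, prior_llr=4.0):
        super().__init__()
        self.register_buffer("mask", torch.as_tensor(H, dtype=torch.float32))  # (m, n)
        self.n_iters = n_iters
        self.prior = prior_llr
        self.weights = nn.Parameter(torch.ones(n_iters))  # learnable per-iteration scaling

    def forward(self, syndromes):
        # syndromes: (B, m) in {0, 1}; returns per-qubit error probabilities (B, n)
        s_sign = 1.0 - 2.0 * syndromes.float()             # (B, m)
        mask = self.mask.unsqueeze(0)                       # (1, m, n)
        v2c = self.prior * mask.repeat(syndromes.shape[0], 1, 1)

        for it in range(self.n_iters):
            # Check update: tanh product rule with leave-one-out via division
            t = torch.where(mask > 0, torch.tanh(v2c / 2.0), torch.ones_like(v2c))
            t = torch.where(t.abs() < 1e-6, torch.full_like(t, 1e-6), t)
            prod = t.prod(dim=2, keepdim=True)              # (B, m, 1)
            c2v = 2.0 * torch.atanh((s_sign.unsqueeze(2) * prod / t).clamp(-0.999, 0.999))
            c2v = self.weights[it] * c2v * mask             # learned message scaling

            # Variable update: total LLR minus the incoming message
            total = self.prior + c2v.sum(dim=1, keepdim=True)
            v2c = (total - c2v) * mask

        llr = self.prior + c2v.sum(dim=1)                   # (B, n) posterior LLRs
        return torch.sigmoid(-llr)                          # P(qubit flipped)

Because every operation is differentiable, the scaling weights can be trained end-to-end with the same BCE objective used for the GNN decoder above.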

WarningMisconception

“Neural decoders will replace all traditional decoders because they’re faster.” While neural decoders are excellent for certain tasks (fast inference, generalization across code sizes), they have limitations that traditional decoders don’t: they require extensive training data, may fail on rare error patterns not seen during training, and lack the theoretical guarantees of algorithms like MWPM. The future is more likely hybrid approaches that use neural decoders for common cases but fall back to traditional algorithms for rare or ambiguous syndromes.

ImportantProposition: Neural BP Benefits

Neural BP addresses classical BP limitations: - Short cycles: Learned messages compensate for the correlations introduced by short cycles - Error floors: Training suppresses problematic error configurations - Fixed latency: The unrolled network runs a fixed number of iterations, giving deterministic run time

Achieves near-ML performance on many codes [@xia2024].

9.4 Example: Neural BP for qLDPC

For hypergraph product codes with 10,000 qubits: - Standard BP: Error floor at \(10^{-6}\) - Neural BP: No error floor down to \(10^{-9}\) - Training: 10M syndromes, 100 epochs

Trade-off: 10x slower than standard BP, but still practical.

9.4.1 Recurrent Neural Network Decoders

RNNs handle sequential syndrome data from repeated measurements.

NoteDefinition: RNN Decoder

An RNN decoder processes syndrome streams: - Input: Syndrome at each measurement round - Hidden state: Accumulates information over time - Output: Error estimate at each round

Handles correlations between consecutive syndromes.
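
A minimal sketch of this idea is shown below: a GRU consumes one syndrome per measurement round and a linear readout produces per-qubit error probabilities after every round. The layer sizes and the 24-bit syndrome are illustrative choices, not a prescribed architecture.

import torch
import torch.nn as nn

class RNNDecoder(nn.Module):
    """Minimal sketch of a recurrent decoder for streamed syndromes."""

    def __init__(self, syndrome_size, n_qubits, hidden_dim=128):
        super().__init__()
        self.rnn = nn.GRU(syndrome_size, hidden_dim, batch_first=True)
        self.readout = nn.Linear(hidden_dim, n_qubits)

    def forward(self, syndrome_stream, hidden=None):
        # syndrome_stream: (batch, rounds, syndrome_size) float tensor
        out, hidden = self.rnn(syndrome_stream, hidden)     # (batch, rounds, hidden_dim)
        return torch.sigmoid(self.readout(out)), hidden     # per-round error probabilities


# Usage: 8 rounds of noisy syndrome measurement, hypothetical 24-bit syndrome
decoder = RNNDecoder(syndrome_size=24, n_qubits=25)
stream = torch.randint(0, 2, (32, 8, 24)).float()
probs, h = decoder(stream)   # probs: (32, 8, 25)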

ImportantProposition: RNN Advantages

For repeated syndrome measurements: - Natural handling of temporal correlations - Can resolve ambiguous syndromes over multiple rounds - Lower latency than batch processing

Particularly valuable for surface code with imperfect measurements.

9.5 Example: RNN for Noisy Measurements

For surface code with 1% measurement error: - Batch decoder: ~5% logical error rate at \(p = 10^{-3}\) - RNN decoder: ~1% logical error rate at \(p = 10^{-3}\)

Improvement comes from temporal information integration.

9.6 Training Strategies and Data Generation

Neural decoder training requires careful consideration of data and objectives.

9.6.1 Label Selection

The choice of training labels affects decoder behavior.

NoteDefinition: Training Labels
  • ML labels: Most likely error for each syndrome
  • MWPM labels: Errors from MWPM decoder
  • Random labels: Uniformly random valid error
  • Hard labels: Only logical errors (sparse)

Each choice trades off different aspects of performance.

ImportantProposition: Label Choice Impact

Training label choice significantly impacts performance: - ML labels: Optimal but expensive to compute - MWPM labels: Good practical choice, matches decoder target - Random labels: Can learn degenerate equivalence classes - Hard labels: Focuses on hardest cases

Best practice: Mix of MWPM and hard labels.

9.7 Example: Label Distribution

For surface code training: - 80% MWPM labels (typical errors) - 15% random valid labels (degeneracy) - 5% logical error labels (rare but critical)

This distribution balances accuracy and robustness.

9.7.1 Data Generation and Augmentation

Generating sufficient training data is crucial.

Algorithm: Data Generation

  1. Sample random Pauli error \(E\) with probability \(p\)
  2. Compute syndrome \(s = H E^T\)
  3. Store \((s, E)\) pair
  4. Repeat for desired dataset size

For \(n\) qubits at error rate \(p\): - Expected errors per sample: \(n p\) - Syndrome size: \(n-k\) bits
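
For a generic binary parity-check matrix and independent bit-flip noise, the algorithm above reduces to a few lines of NumPy. The helper below is a sketch under that simplification (it ignores Y/Z errors and measurement noise); the function name is illustrative.

import numpy as np

def generate_dataset(H, n_samples, p, rng=None):
    """Sample (syndrome, error) pairs for i.i.d. bit-flip noise.

    H: (r, n) binary parity-check matrix; syndromes are s = H e mod 2."""
    rng = np.random.default_rng() if rng is None else rng
    n = H.shape[1]
    errors = (rng.random((n_samples, n)) < p).astype(np.int8)   # (n_samples, n)
    syndromes = (errors @ H.T) % 2                               # (n_samples, r)
    return syndromes.astype(np.int8), errors


# Example: repetition-code check matrix with r = n - 1 checks
H = np.array([[1, 1, 0, 0, 0],
              [0, 1, 1, 0, 0],
              [0, 0, 1, 1, 0],
              [0, 0, 0, 1, 1]])
s, e = generate_dataset(H, n_samples=4, p=0.2)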

NoteDefinition: Data Augmentation

Augmentation expands training data: - Syndrome symmetry: Apply code automorphisms - Error equivalence: Add stabilizers to errors - Noise scaling: Generate data at multiple error rates

Can increase effective dataset size by 10-100x.
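
The “add stabilizers to errors” augmentation is straightforward once the relevant stabilizer generators are available as a binary matrix \(S\) with \(H S^T = 0 \pmod 2\): every augmented error then has the same syndrome and logical class as the original. The helper below is a sketch with hypothetical names.

import numpy as np

def augment_with_stabilizers(errors, S, copies=4, rng=None):
    """Expand binary error patterns by adding random stabilizer elements.

    S: (k, n) binary matrix of stabilizer generators of the same Pauli type
    as the errors, so the syndrome is unchanged by the augmentation."""
    rng = np.random.default_rng() if rng is None else rng
    repeated = np.repeat(errors, copies, axis=0)                 # (copies*N, n)
    coeffs = rng.integers(0, 2, size=(repeated.shape[0], S.shape[0]))
    return (repeated + coeffs @ S) % 2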

ImportantProposition: Augmentation Benefits

Effective augmentation: - Improves generalization to unseen syndromes - Reduces required raw data generation - Helps with class imbalance (rare errors)

Critical for large codes where data generation is expensive.

9.8 Example: Surface Code Augmentation

For \(d = 11\) surface code: - Raw generation: 1M syndromes in 1 hour - With symmetry augmentation: 50M effective samples - With noise scaling (\(p = 10^{-4}\) to \(10^{-2}\)): 100M samples

Training time: 4 hours on 8 GPUs

9.8.1 Worked Example: Training an MLP Decoder for Distance-5 Surface Code

Let us walk through the complete process of training a neural decoder from scratch.

Setup: Distance-5 Surface Code (rotated layout) - Physical qubits: \(n = 25\) - X stabilizers: 12 - Z stabilizers: 12 - Total syndrome bits: 24

Network architecture: - Input layer: 24 neurons (binary syndrome) - Hidden layer 1: 128 neurons, ReLU activation - Hidden layer 2: 64 neurons, ReLU activation - Output layer: 25 neurons, sigmoid activation (per-qubit error probabilities)

Step 1: Generate training data.

# Pseudocode for data generation
import numpy as np

def generate_syndromes(n_samples, p_error, distance=5):
    n_qubits = distance ** 2
    n_stabilizers = distance ** 2 - 1  # rotated surface code: d^2 - 1 stabilizers
    syndromes = np.zeros((n_samples, n_stabilizers), dtype=int)
    errors = np.zeros((n_samples, n_qubits), dtype=int)

    for i in range(n_samples):
        # Generate independent bit-flip errors on each data qubit
        error = (np.random.random(n_qubits) < p_error).astype(int)
        # Compute syndrome via the code's parity checks
        # (compute_surface_code_syndrome is a placeholder for H @ error mod 2)
        syndrome = compute_surface_code_syndrome(error, distance)
        syndromes[i] = syndrome
        errors[i] = error

    return syndromes, errors

Generate 100,000 samples at \(p = 10^{-3}\): - Training set: 80,000 syndromes - Validation set: 10,000 syndromes - Test set: 10,000 syndromes

Step 2: Define the network.

import torch.nn as nn

class MLPDecoder(nn.Module):
    def __init__(self, syndrome_size, n_qubits):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(syndrome_size, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, n_qubits),
            nn.Sigmoid()
        )

    def forward(self, syndrome):
        return self.net(syndrome)

Step 3: Train with binary cross-entropy (BCE) loss.

import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=50):
    criterion = nn.BCELoss()  # binary cross-entropy on per-qubit probabilities
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(epochs):
        for syndromes, errors in train_loader:
            predictions = model(syndromes)
            loss = criterion(predictions, errors.float())

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Validation (evaluate is defined below)
        val_accuracy = evaluate(model, val_loader)
        print(f"Epoch {epoch}: val_acc = {val_accuracy:.3f}")

Step 4: Evaluate performance. After 50 epochs: - Training accuracy: 97.2% - Validation accuracy: 95.8% - Test accuracy: 95.1%

Logical error rate at \(p = 10^{-3}\): - MWPM baseline: \(0.8\%\) - Neural decoder: \(1.1\%\) - Gap: ~37% worse than MWPM

Step 5: Analyze failures. Most neural decoder failures occur on: - Syndromes with 4+ defects (underrepresented in training) - Degenerate error patterns (network learns one representative) - Boundary syndromes (fewer examples due to geometry)

Step 6: Improve with targeted training. Add 20,000 additional samples focused on: - High-weight errors (4+ errors): 10,000 samples - Boundary-heavy syndromes: 5,000 samples - Random valid labels for degeneracy: 5,000 samples

Retrained performance: - Test accuracy: 96.8% - Logical error rate: \(0.9\%\) (only 12% worse than MWPM)

TipIntuition

Training a neural decoder is like teaching a student to recognize patterns. At first, the student memorizes common examples (few errors, bulk syndromes). When tested on rare cases (many errors, boundary syndromes), they fail. By explicitly showing more examples of these difficult cases, the student learns to generalize better. The remaining gap to MWPM represents the student’s lack of explicit understanding of the decoding algorithm — they’re pattern matching, not solving.

9.8.2 Loss Functions

The loss function guides learning.

NoteDefinition: Loss Functions
  • Binary cross-entropy: For per-qubit classification
  • Weighted BCE: Higher weight on qubits with high LLR
  • Syndrome matching: Penalize syndromes not matched
  • Logical error loss: Extra penalty for logical errors

Algorithm: Weighted Training

For weighted BCE:

  1. Compute the LLR for each qubit position
  2. Weight the loss for qubit \(i\) by the magnitude of its LLR
  3. Sum the weighted losses to form the total objective

This focuses training on ambiguous positions.
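
A possible implementation is sketched below, weighting each qubit’s BCE term by the magnitude of its channel LLR as in the algorithm above. Note that the opposite convention (weighting by the inverse magnitude to emphasize ambiguous, low-|LLR| positions) is also used in practice, so the weighting direction should be treated as a design choice; the function name is illustrative.

import torch
import torch.nn.functional as F

def weighted_bce(pred_probs, target_errors, llr):
    """BCE with per-qubit weights proportional to |LLR| magnitude.

    pred_probs, target_errors: (batch, n_qubits); llr: (n_qubits,) channel LLRs."""
    weights = llr.abs()
    weights = weights / weights.mean()   # normalize the average weight to 1
    loss = F.binary_cross_entropy(pred_probs, target_errors.float(), reduction='none')
    return (loss * weights).mean()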

ImportantProposition: Loss Selection

Loss function choice: - Unweighted BCE: Good baseline - Weighted BCE: Better for hard positions - Logical error loss: Essential for threshold behavior - Ensemble losses: Combine multiple objectives

Logical error loss is critical for achieving threshold.

9.9 Example: Loss Comparison

For surface code at \(p = 10^{-3}\): - Unweighted BCE: Threshold at \(d = 13\) - Weighted BCE: Threshold at \(d = 11\) - With logical loss: Threshold at \(d = 9\)

Significant improvement with proper loss design.

9.10 GPU Parallel Decoding

GPUs enable massive parallelism for decoding large numbers of syndromes.

9.10.1 Batch Parallelism

The simplest GPU acceleration: decode many syndromes in parallel.

TipIntuition

GPU batch parallelism is like opening a restaurant with many tables. A single-threaded CPU is like having one waiter serving all tables one after another — slow and inefficient. A GPU is like having hundreds of waiters, each serving one table at a time. Even though each waiter (thread) is simpler than the expert chef (MWPM algorithm), having so many of them means the entire restaurant (batch of syndromes) gets served much faster. The key is that all the syndromes are independent — they don’t need to wait for each other, so we can spread the work across thousands of parallel workers.

NoteDefinition: Batch Decoding

Process \(B\) syndromes simultaneously: - Organize syndromes into batches of size \(B\) - Apply decoder to all syndromes in parallel - Output \(B\) error estimates

For GPU with \(P\) processing elements: \(O(B/P)\) time per batch.
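
For a feed-forward neural decoder, batch parallelism amounts to stacking syndromes into one tensor and running a single forward pass on the GPU. The sketch below reuses the MLPDecoder class from the worked example above; the batch size and syndrome length are illustrative.

import torch

# Assumes MLPDecoder from the distance-5 worked example is in scope
device = "cuda" if torch.cuda.is_available() else "cpu"
model = MLPDecoder(syndrome_size=24, n_qubits=25).to(device)
model.eval()

syndromes = torch.randint(0, 2, (4096, 24)).float()   # batch of B syndromes

with torch.no_grad():
    probs = model(syndromes.to(device))    # one forward pass decodes the whole batch
    errors = (probs > 0.5).to("cpu")       # (4096, 25) hard decisions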

ImportantProposition: Batch Performance

For MWPM on surface code: - CPU: 1 ms per syndrome - GPU (batch size 1024): 0.001 ms per syndrome - Speedup: 1000x

Linear scaling with batch size up to memory limits.

9.11 Example: GPU Throughput

For NVIDIA A100 GPU: - Memory: 80 GB - Batch size: 10,000 syndromes - MWPM throughput: 100,000 syndromes/second - For 1 MHz surface code cycles: Need 10 GPUs

Practical for large-scale simulation.

9.11.1 Intra-Algorithm Parallelism

Within a single decoding, exploit parallel structure.

NoteDefinition: Parallel MWPM

MWPM has natural parallelism: - Edge weight computation: Parallel across edges - Blossom algorithm: Limited parallelism, but heuristics help - Multiple matchings: Run independent matchings in parallel

Algorithm: Parallel Matching

For planar graphs:

  1. Decompose the graph into independent regions
  2. Run matching in each region in parallel
  3. Combine results at the region boundaries

Reduces complexity from \(O(n^2)\) to \(O(n \log n)\) effective.

ImportantProposition: Parallel BP

BP has excellent parallelism: - All variable nodes update in parallel - All check nodes update in parallel - Synchronous: 2 parallel steps per iteration

Achieves near-linear speedup on GPU.

9.12 Example: Parallel BP Performance

For 10,000 qubit LDPC code: - CPU BP: 10 ms per syndrome - GPU BP (1024 threads): 0.1 ms per syndrome - GPU BP (8192 threads): 0.02 ms per syndrome

Strong scaling efficiency > 80%.

9.12.1 CUDA Implementation

Efficient GPU decoding requires careful CUDA programming.

NoteDefinition: CUDA Kernels

GPU decoding uses specialized kernels: - Initialization kernel: Set up data structures - Message kernel: Compute BP messages in parallel - Reduction kernel: Combine messages across nodes - Output kernel: Generate final error estimates

Minimizes CPU-GPU data transfer.
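
As a small concrete example of a custom output kernel, the CuPy RawKernel below thresholds a vector of soft LLRs into hard error bits entirely on the GPU. The kernel name and launch configuration are arbitrary illustrative choices, not part of any standard decoder API.

import cupy as cp

threshold_kernel = cp.RawKernel(r'''
extern "C" __global__
void threshold_llr(const float* llr, int* err, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        err[i] = llr[i] < 0.0f ? 1 : 0;   // negative LLR -> error
    }
}
''', 'threshold_llr')

llr = cp.random.randn(1 << 20).astype(cp.float32)   # soft outputs already on the GPU
err = cp.empty(llr.size, dtype=cp.int32)
threads = 256
blocks = (llr.size + threads - 1) // threads
threshold_kernel((blocks,), (threads,), (llr, err, cp.int32(llr.size)))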

9.12.2 Code Example: GPU-Accelerated BP Decoding

import numpy as np
import cupy as cp  # GPU-accelerated NumPy

class GPUBPDecoder:
    """Belief Propagation decoder on GPU using CuPy.

    Messages are stored as dense (batch, checks, variables) arrays masked by the
    parity-check matrix. This layout is simple and GPU-friendly, at the cost of
    memory proportional to chunk_size * n_checks * n_vars.
    """
    def __init__(self, H, max_iter=50, error_rate=0.01):
        """
        Args:
            H: Parity-check matrix (scipy sparse or dense 0/1 array)
            max_iter: Maximum BP iterations
            error_rate: Physical error rate used for the channel prior LLR
        """
        H = H.toarray() if hasattr(H, "toarray") else np.asarray(H)
        self.n_checks, self.n_vars = H.shape
        self.max_iter = max_iter
        self.mask = cp.asarray((H != 0).astype(np.float32))          # (m, n) edge mask on GPU
        self.prior = float(np.log((1.0 - error_rate) / error_rate))  # channel LLR

    def decode_batch(self, syndromes, chunk_size=256):
        """
        Decode multiple syndromes in parallel on GPU.

        Args:
            syndromes: (batch_size, n_checks) array of 0/1 syndrome bits
            chunk_size: syndromes decoded per GPU pass (controls memory use)

        Returns:
            errors: (batch_size, n_vars) array of error estimates
        """
        chunks = [self._decode_chunk(syndromes[i:i + chunk_size])
                  for i in range(0, syndromes.shape[0], chunk_size)]
        return np.concatenate(chunks, axis=0)

    def _decode_chunk(self, syndromes):
        """Run sum-product BP on one chunk, fully parallel across the chunk."""
        B = syndromes.shape[0]
        s_gpu = cp.asarray(syndromes, dtype=cp.float32)   # transfer to GPU
        s_sign = 1.0 - 2.0 * s_gpu                        # syndrome signs (+1 / -1)
        mask = self.mask[None, :, :]                      # (1, m, n)

        # Initialize variable-to-check messages with the channel prior
        v2c = self.prior * cp.ones((B, self.n_checks, self.n_vars), dtype=cp.float32) * mask
        errors = cp.zeros((B, self.n_vars), dtype=cp.float32)

        for _ in range(self.max_iter):
            # Check-to-variable messages: tanh rule with leave-one-out products,
            # computed in parallel over the whole (batch, check, variable) grid
            t = cp.where(mask > 0, cp.tanh(v2c / 2.0), cp.ones_like(v2c))
            t = cp.where(cp.abs(t) < 1e-7, cp.full_like(t, 1e-7), t)  # avoid division by zero
            prod_all = cp.prod(t, axis=2, keepdims=True)              # (B, m, 1)
            c2v = 2.0 * cp.arctanh(cp.clip(s_sign[:, :, None] * prod_all / t,
                                           -0.999999, 0.999999)) * mask

            # Variable-to-check updates: total LLR minus the incoming message
            total = self.prior + cp.sum(c2v, axis=1, keepdims=True)   # (B, 1, n)
            v2c = (total - c2v) * mask

            # Hard decision and convergence check (all checks satisfied)
            llr = self.prior + cp.sum(c2v, axis=1)                    # (B, n)
            errors = (llr < 0).astype(cp.float32)
            if bool(cp.all(cp.mod(errors @ self.mask.T, 2.0) == s_gpu)):
                break

        return cp.asnumpy(errors.astype(cp.int32))                    # transfer back to CPU


# Usage example
def batch_decode_example():
    """Demonstrate batch decoding on GPU."""
    from scipy.sparse import random as sparse_random

    # Create a random binary LDPC-like parity-check matrix
    n_vars = 1000
    n_checks = 500
    H = (sparse_random(n_checks, n_vars, density=0.05, format='csr') != 0).astype(np.int8)

    # Initialize GPU decoder
    decoder = GPUBPDecoder(H, max_iter=20)

    # Generate a batch of (random, purely illustrative) syndromes
    batch_size = 10000
    syndromes = np.random.randint(0, 2, (batch_size, n_checks))

    # Decode on GPU
    errors = decoder.decode_batch(syndromes)

    print(f"Decoded {batch_size} syndromes for {n_vars}-qubit code")
    print(f"Average errors per syndrome: {errors.sum(axis=1).mean():.1f}")
NoteRemark

This CuPy implementation demonstrates key GPU acceleration concepts: - Batch processing: Process 10,000 syndromes simultaneously - Memory coalescing: Access patterns optimized for GPU memory - Parallel reductions: Product/sum operations use GPU parallelism - Minimal CPU-GPU transfer: Only initial/final data transfer

On an NVIDIA A100: ~100,000 syndromes/second throughput for 1000-qubit code.

Algorithm: GPU Memory Layout

For efficient access:

  1. Store messages in shared memory when possible
  2. Coalesce global memory accesses
  3. Use texture memory for read-only data
  4. Avoid bank conflicts in shared memory

Critical for achieving peak performance.

9.13 Example: Optimized BP Kernel

Standard BP kernel: - 100 GB/s memory bandwidth utilization - 50% of peak compute

Optimized kernel: - 700 GB/s memory bandwidth utilization - 85% of peak compute

7x improvement from memory optimization alone.

Figure: GPU parallel decoding architecture. Multiple syndrome threads process simultaneously, with thread blocks handling independent decoding tasks.

9.14 Real-time Decoding Challenges and Outlook

Practical quantum computing requires decoding at hardware speeds.

9.14.1 Real-time Constraints

Quantum hardware generates syndromes continuously.

ImportantProposition: Timing Requirements

For surface code with cycle time \(\tau\): - Syndrome extraction: \(\tau/2\) (must finish before next round) - Decoding latency: \(\leq \tau/2\) - Total budget: \(\leq \tau\)

For \(\tau = 1 \mu s\): the decoder has roughly 500 ns of latency budget.

9.15 Example: Surface Code Timing

For superconducting qubits with \(\tau = 1 \mu s\): - MWPM: ~1 ms (too slow by 2000x) - Union-Find: ~10 \(\mu s\) (too slow by 20x) - Neural decoder: ~100 ns (meets the requirement)

Neural decoders are promising for real-time operation.

9.15.1 Hybrid Decoding Strategies

Combine multiple decoders for optimal performance.

NoteDefinition: Hybrid Decoder

Hybrid strategy: - Fast decoder (Neural/Union-Find): First pass - Slow decoder (MWPM): For ambiguous cases - Fallback: Only if first pass fails confidence check

Balances speed and accuracy.
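
A hybrid decoder can be expressed as a few lines of control flow. In the sketch below, fast_decode and slow_decode are placeholders (for example, a neural decoder returning a confidence score and an MWPM implementation), and the residual-syndrome check is one possible confidence test.

import numpy as np

def hybrid_decode(syndrome, H, fast_decode, slow_decode, min_confidence=0.9):
    """Fast-path/slow-path hybrid decoding (sketch).

    fast_decode returns (error, confidence); slow_decode returns an error.
    The residual-syndrome check guards against fast-path failures."""
    error, confidence = fast_decode(syndrome)
    residual_ok = np.array_equal((H @ error) % 2, syndrome)
    if confidence >= min_confidence and residual_ok:
        return error                  # the common case, handled at low latency
    return slow_decode(syndrome)      # fall back to an MWPM-class decoder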

ImportantProposition: Hybrid Performance

Hybrid decoding: - 95% syndromes: Fast decoder (< 1 \(\mu s\)) - 4% syndromes: Slow decoder (10 \(\mu s\) average) - 1% syndromes: Failure (flagged for retry)

Average latency: \(1.5 \mu s\) with MWPM-level accuracy.

9.16 Example: Tiered Decoding

For large-scale surface code:

  1. Tier 1 (90% of syndromes): Union-Find, < 100 ns
  2. Tier 2 (9%): GNN decoder, < 1 \(\mu s\)
  3. Tier 3 (1%): MWPM, < 10 \(\mu s\)

Meets 1 MHz throughput requirement.
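
Treating the quoted times as worst-case per-tier latencies, the expected latency follows directly:

# Expected latency of the three-tier scheme (fraction, latency in seconds)
tiers = [(0.90, 100e-9), (0.09, 1e-6), (0.01, 10e-6)]
avg_latency = sum(f * t for f, t in tiers)
print(f"average latency = {avg_latency * 1e9:.0f} ns")   # ~280 ns

At roughly 280 ns on average, the tiered scheme stays well below the 1 \(\mu s\) cycle budget.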

9.16.1 Future Directions

Research continues on faster, more accurate decoders.

NoteRemark

Emerging directions:

  1. Analog neural decoders: Physical implementations [@golden2024]
  2. Transformer decoders: Attention-based architectures for improved generalization [@chao2023]
  3. Reinforcement learning: Adaptive decoding strategies [@begu2024]
  4. Co-designed hardware: ASICs optimized for decoding [@weaver2025]
  5. GPU acceleration: Massively parallel decoding [@weaver2025]

The field is rapidly evolving with many open problems [@zhou2025].

Table: Decoder Comparison

Decoder        Latency   Accuracy       Scalability
MWPM           High      Near-optimal   Poor
Union-Find     Low       Good           Good
Neural (MLP)   Medium    Good           Limited
Neural (GNN)   Low       Very Good      Excellent

NoteRemark

The optimal decoder depends on requirements: - For simulations: MWPM provides ground truth - For near-term devices: Union-Find or GNN - For large-scale QC: GNN with hybrid strategies - For ultimate performance: Hardware-accelerated neural

GPU acceleration will be essential for million-qubit machines.

9.17 Key Takeaways

  • Neural decoder architectures: MLP decoders are simple but don’t scale well; GNN decoders exploit code graph structure for scalability and generalization
  • Training considerations: Label selection (MWPM vs random vs hard), data augmentation (symmetry, noise scaling), and loss function design (weighted BCE, logical error loss) all impact performance
  • GNN advantages: Message passing respects code structure, enables \(O(n)\) complexity, and generalizes across code distances — trained once, works for many sizes
  • GPU batch parallelism: Process thousands of syndromes simultaneously on GPU, achieving 100-1000x speedup over CPU for MWPM and BP decoders
  • GPU memory optimization: Coalesced accesses, shared memory for messages, and minimal CPU-GPU transfer are critical for achieving peak performance
  • Real-time constraints: Quantum hardware at 1 MHz requires sub-microsecond decoding; neural decoders (~100 ns) can meet this while MWPM (~1 ms) cannot
  • Hybrid strategies: Combine fast neural/Union-Find decoder (95% cases) with slow MWPM fallback (5% cases) for optimal speed-accuracy trade-off

9.18 Bridge to Future Work

Neural network decoders and GPU acceleration represent a paradigm shift in quantum error correction — from algorithmic decoding to learned decoding. This data-driven approach opens new research directions:

  • Physical neural networks: Implementing neural decoders in analog or photonic hardware for ultra-low latency
  • Adaptive decoding: Learning to adapt to changing noise characteristics in real-time
  • Multi-code decoders: Single networks that work across multiple quantum codes
  • End-to-end optimization: Training the entire stack from syndrome to logical operation

As quantum computers scale to millions of qubits, these learned approaches will become essential for maintaining the decoding throughput required for fault-tolerant operation. The techniques in this chapter provide the foundation for exploring these cutting-edge research directions.

9.19 Exercises

  1. Neural Network Decoder Architecture. Consider designing a neural decoder for the surface code. (Level: basic)

    1. What is the input representation for a neural decoder? How is the syndrome encoded as a tensor?

    2. Compare MLP and GNN architectures for decoding. Why might GNNs generalize better to different code distances?

    3. What is the output of the neural network? How is it converted to an error correction decision?

  2. Training Neural Decoders. Training data and methodology are crucial for neural decoder performance. (Level: intermediate)

    1. How is training data generated for a neural decoder? What labels are used for supervised learning?

    2. Explain the difference between training on synthetic noise versus realistic device noise. Which is easier? Which gives better real-world performance?

    3. What is “transfer learning” in the context of neural decoders? How can a decoder trained at one code distance be adapted to another?

  3. GPU Acceleration. Real-time decoding requires high throughput. (Level: basic)

    1. If a quantum computer operates at 1 MHz cycle rate with 1000 qubits, how many syndrome measurements per second must the decoder process?

    2. Explain why GPU parallelism is well-suited to neural network inference. What operations dominate the computation?

    3. Compare the latency requirements for feed-forward decoding versus iterative decoding (like BP). Which is more amenable to GPU acceleration?

  4. GNN Decoder Design. Consider implementing a GNN decoder for a surface code. (Level: intermediate)

    1. Describe how to construct the Tanner graph for a distance-d surface code. How many variable nodes and check nodes are there?

    2. In the message passing step, what information should check nodes send to variable nodes, and vice versa?

    3. Why does a GNN decoder trained on distance-11 surface code generalize reasonably well to distance-21, while an MLP decoder does not?

    4. What is the computational complexity of one forward pass through a GNN with L message passing layers?

  5. Training Data Strategies. Training data quality significantly impacts neural decoder performance. (Level: intermediate)

    1. Explain why training on MWPM labels (errors from MWPM decoder) may limit neural decoder performance compared to training on maximum likelihood labels.

    2. What is “data augmentation” in the context of neural decoders? Give three examples of augmentation techniques specific to quantum error correction.

    3. Why are rare error patterns (high-weight errors, logical errors) underrepresented in standard training data? How can you address this class imbalance?

    4. Compare the benefits of training at multiple error rates versus training at a single error rate.

  6. Decoding Latency Analysis. Compare different decoders for real-time operation. (Level: intermediate)

    1. For a surface code with 1 \(\mu s\) cycle time, what is the maximum allowable decoding latency? Explain your reasoning.

    2. A neural decoder takes 200 ns per syndrome on GPU. How many syndromes per second can it process? Is this sufficient for a 1000-qubit device at 1 MHz?

    3. MWPM takes 1 ms per syndrome on CPU. By what factor would you need to parallelize MWPM to meet real-time requirements?

    4. Design a hybrid decoding strategy that uses neural decoder for 95% of syndromes and MWPM for 5%. What is the average latency?

  7. Neural Decoder Implementation. Implement a simple MLP decoder in PyTorch. (Level: advanced)

    1. Write a PyTorch module that takes syndrome vectors as input and outputs error probability estimates for each qubit.

    2. Generate training data: create 10,000 syndrome-error pairs for a distance-5 surface code with error rate \(p = 0.001\).

    3. Train the network for 50 epochs and evaluate accuracy on a test set of 1,000 samples.

    4. Compare your neural decoder’s logical error rate to the physical error rate. Does error correction provide improvement?

  8. GPU-Accelerated Batch Decoding. Design and implement a batched decoder using GPU acceleration. (Level: advanced)

    1. Write a CUDA kernel (or use CuPy) that processes multiple syndromes in parallel. What is the memory layout for efficient batch processing?

    2. If you have a GPU with 10,000 cores and need to decode 1 million syndromes per second, estimate the per-syndrome time budget per core.

    3. Implement belief propagation decoding on GPU for an LDPC code. Measure throughput as a function of batch size.

    4. Compare the energy efficiency (decoded syndromes per joule) of GPU decoding versus CPU decoding. What are the trade-offs?