gates and circuits advanced · 16 min read · By LIPAI WANG · May 1, 2026

Toffoli Decomposition and T-Count Optimization: How Reversible Logic Becomes Fault-Tolerant

The Toffoli gate (controlled-controlled-NOT) is the universal classical reversible gate, the building block of every quantum arithmetic circuit, and the dominant non-Clifford operation in most fault-tolerant algorithms. Naive Toffoli decomposition uses 7 T gates; Selinger's 2013 optimization uses 4; Jones-Glassman's measurement-based variant uses 3 with a small probability of failure. This tutorial covers the standard decomposition, the optimized variants, and the T-count optimization passes that follow.

Prerequisites: Tutorial 53: The Solovay-Kitaev Theorem

The Toffoli gate (controlled-controlled-NOT) flips its target qubit if and only if both control qubits are $|1\rangle$ . It is the universal classical reversible gate — every classical Boolean function can be implemented using only Toffolis and ancillas. In quantum computing, every algorithm with classical-arithmetic subroutines (Shor’s modular arithmetic, Grover’s oracle calls, quantum chemistry’s particle-counting projectors) eventually compiles into Toffolis as the non-Clifford workhorse.

The Toffoli is not in the Clifford group. By tutorial 26’s Eastin-Knill, that means it’s not transversal on the surface code and must be implemented via magic-state distillation or equivalent. The dominant cost question for fault-tolerant arithmetic: how many T gates per Toffoli?

The answer has shifted over time:

2010 textbook (Nielsen-Chuang): 7 T gates per Toffoli.
Selinger 2013: 4 T gates per Toffoli with shorter circuits.
Jones 2013 / Glassman & Selinger: 3 T gates with measurement and classical feedback (a small probability of failure that retries).
Recent (Cody Jones 2013, Maslov 2016): various optimizations for Toffoli sequences (Toffoli ladders, Toffoli cascades) that share T gates across multiple Toffolis.

These improvements compound. A modular-exponentiation circuit for RSA-2048 originally estimated at ~ $3 \times 10^9$ Toffolis with 7 T-each = $2.1 \times 10^{10}$ T gates is now estimated at $\sim 1 \times 10^{10}$ T gates after optimization — half the magic-state factory budget.

This tutorial covers the standard 7-T decomposition, Selinger’s 4-T variant, the measurement-based 3-T variant, and the T-count optimization passes that compilers run on long Toffoli sequences.

The standard 7-T decomposition

The Toffoli’s matrix is the $8 \times 8$ identity except that the last $2 \times 2$ block is the X gate. In gate notation:

control1 ────●────
              │
control2 ────●────
              │
target    ────⊕────

The standard decomposition (Nielsen-Chuang 2010, originally Barenco et al. 1995) into Cliffords and T gates uses 7 T gates and several CNOTs and Hadamards:

target:    H ──── T ──── ⊕ ──── T† ──── ⊕ ──── T ──── ⊕ ──── T† ──── H
                         │              │              │
control1:  ────────────── ●──── T ────── ────── T ──── ●─── ── ── ────
                                          │
control2:  ────────────────── ───────── ●─── ── T† ── ── ── ── ── ────

Reading T-gate count: 4 T gates on the target qubit, 1 T gate on each control = 6, plus 1 more from the simplification: total 7 T gates.

Other Cliffords needed: 6 CNOTs and 2 Hadamards on the target. The total gate count is small; the T count of 7 is what matters because T gates are exponentially more expensive than Cliffords in fault-tolerant cost.

The decomposition is exact — no approximation, no distance bound. The Toffoli equals this Clifford+T circuit identically (modulo global phase).

Why 7 T gates

A counting argument explains why the Toffoli takes around this many T gates.

The Clifford group on 3 qubits is finite. The Toffoli is not in it (it doesn’t preserve the Pauli group under conjugation in the right way). Any Clifford-equivalent extension of the Toffoli must therefore use non-Clifford gates. The minimum T-count over all decompositions is provably at least 4 (Amy-Maslov-Mosca 2014 lower bound).

Selinger 2013 achieved 4 T gates exactly. So the Nielsen-Chuang 7-T decomposition is not optimal — it is just the simplest known decomposition that fits cleanly on textbook pages. Real compilers use Selinger’s 4-T variant.

Selinger’s 4-T variant

Selinger 2013 (“Quantum circuits of T-depth one”) gave a Toffoli decomposition using only 4 T gates and T-depth 1 — meaning all 4 T gates can be applied in a single time step (in parallel on different qubits), not sequentially. T-depth-1 circuits are particularly attractive because the parallelism reduces wall-clock time on a fault-tolerant machine.

The Selinger circuit uses 7 ancilla qubits in addition to the 3 qubits of the Toffoli. The ancillas are prepared, used, and discarded; the net effect on the original 3 qubits is exactly the Toffoli action. 4 T gates total.

The trade: more ancillas (and the Cliffords to set them up) for fewer T gates. In fault-tolerant computation, this trade is usually a clear win — Cliffords and ancilla preparation are cheap; T gates are the binding cost.

Production fault-tolerant compilers (Microsoft Q# resource estimator, Google’s surface-code-resource-estimator) use Selinger’s 4-T as the baseline Toffoli decomposition.

The measurement-based 3-T variant

Jones 2013 and Glassman-Selinger introduced a measurement-based Toffoli using 3 T gates and a classical-feedback correction. The structure:

Prepare a specific 4-qubit ancilla state (a “Toffoli magic state”) using 3 T gates and Cliffords.
Couple the ancilla to the input qubits via Cliffords + a measurement.
Based on the measurement outcome, apply a Clifford correction.

The measurement outcome is random — half the time the protocol succeeds at applying the Toffoli on the first try; the other half requires a Clifford-correction step. The expected T-count averages to 3 T gates per Toffoli if you account for the small probability of needing additional correction.

This is the lowest published T-count for Toffoli. Whether it is used in production depends on the compiler’s tolerance for measurement-based protocols and classical feedback latency. In real fault-tolerant systems, measurements take 1-10 µs and classical feedback takes another few µs; the latency is non-negligible compared to the time saved by reducing T-count.

The 2026 production picture: Selinger’s 4-T is the default; Jones’s 3-T is used when latency budget allows.

T-count optimization across multiple Toffolis

For algorithms with many Toffolis in sequence (e.g., a chain of multi-bit adders), the T-count can be further reduced by exploiting structure across Toffolis. Several optimization passes:

Toffoli ladders

A “ladder” of Toffolis with shared inputs (e.g., a controlled-AND across many qubits) can sometimes share T gates. Cody Jones 2013 showed that an $n$ -qubit Toffoli ladder needs only $\sim n$ T gates total, not $\sim 4n$ as naive concatenation would give.

Hadamard gadgets

Many Toffoli decompositions include Hadamard gates that can be commuted past T gates with a phase correction. Heyfron-Campbell 2018 systematically applied this to reduce T-count by 20-40% on real algorithms.

Synthesis-time optimization

Compiler passes (e.g., t-par by Amy-Maslov-Mosca, gridsynth post-processing) optimize T-count at the synthesis stage by aggregating phase rotations and recognizing simplifications.

The cumulative effect on real algorithms: Gidney-Ekerå’s 2021 RSA-2048 estimate uses $\sim 7$ T-per-Toffoli; later optimizations (Litinski 2019, Babbush et al. 2018 chemistry-specific) bring the effective T-per-Toffoli closer to 3-4 for the dominant arithmetic circuits.

A small T-count counting demonstration

Concrete code that builds a Toffoli circuit with the standard decomposition and counts T gates:

import pennylane as qml
import numpy as np

dev = qml.device("default.qubit", wires=3)


@qml.qnode(dev)
def standard_toffoli_decomp(input_state):
    """Standard Nielsen-Chuang 7-T Toffoli decomposition."""
    qml.BasisState(input_state, wires=[0, 1, 2])
    # Standard decomposition. Comments mark each T gate.
    qml.Hadamard(wires=2)
    qml.CNOT(wires=[1, 2])
    qml.adjoint(qml.T)(wires=2)         # T†  (1)
    qml.CNOT(wires=[0, 2])
    qml.T(wires=2)                       # T   (2)
    qml.CNOT(wires=[1, 2])
    qml.adjoint(qml.T)(wires=2)         # T†  (3)
    qml.CNOT(wires=[0, 2])
    qml.T(wires=1)                       # T   (4)
    qml.T(wires=2)                       # T   (5)
    qml.Hadamard(wires=2)
    qml.CNOT(wires=[0, 1])
    qml.T(wires=0)                       # T   (6)
    qml.adjoint(qml.T)(wires=1)         # T†  (7)
    qml.CNOT(wires=[0, 1])
    return qml.probs(wires=[0, 1, 2])


# Verify: applying to |110> should give |111> (Toffoli flips target when both controls are 1).
print("Input |110> ->", standard_toffoli_decomp(np.array([1, 1, 0])))
# Verify: applying to |101> should give |101> (target unchanged when not both controls are 1).
print("Input |101> ->", standard_toffoli_decomp(np.array([1, 0, 1])))


def count_t_gates(qml_circuit_func, *args):
    """Count T and T-dagger gates in a PennyLane circuit."""
    with qml.tape.QuantumTape() as tape:
        qml_circuit_func(*args)
    t_count = sum(1 for op in tape.operations if op.name in ('T', 'Adjoint(T)'))
    return t_count


# (Note: pennylane's tape introspection is more complex than this in practice;
# this is illustrative.)

The standard 7-T decomposition has the T gates clearly visible. Optimized variants (Selinger 4-T, Jones 3-T) require more ancillas and are not as cleanly written line-by-line. Production compilers handle the optimized decomposition automatically.

When Toffoli optimization matters most

T-count optimization is most impactful when:

The algorithm has many Toffolis. Classical-arithmetic-heavy circuits (Shor’s, quantum hashing) have $10^{9}$ + Toffolis where small per-Toffoli savings compound.
The target hardware is fault-tolerant. On NISQ hardware, T gates are not specially expensive — they’re just another rotation. Optimization payoff is in the FTQC era.
You’re targeting a specific resource estimate. Algorithm authors often optimize T-count to minimize the magic-state-factory budget; resource estimators typically assume the optimization has been done.

For NISQ-era variational algorithms with few Toffolis, the standard 7-T decomposition is fine. For RSA-class cryptanalysis algorithms, every T-gate optimization translates to millions of qubits saved at the magic-state-factory level.

Common misconceptions

“More CNOTs is always worse than more T gates.” Wrong direction. CNOTs are nearly free (transversal Cliffords). T gates dominate the cost. Trade more CNOTs for fewer T gates whenever possible.

“The standard Toffoli decomposition is optimal.” It is not — Selinger’s 4-T variant achieves better. Don’t quote 7 T per Toffoli as the cost of fault-tolerant arithmetic in 2026; quote 4 (or 3 with measurement).

“T-count is the only thing that matters.” It dominates the qubit-time budget, but other costs are real: T-depth (parallelism), measurement costs, classical feedback latency, and ancilla qubit overhead all matter at the system level. Toffoli synthesis trades some of these against each other.

“Optimization passes always reduce T-count.” Sometimes they fail or give worse results due to missed pattern recognition, conflicting passes, or pathological circuit structure. Real production pipelines apply multiple passes and pick the best result.

Decision rule

For each Toffoli in a fault-tolerant compilation:

Standard 7-T decomposition. Simplest; use for prototyping or small algorithms where the optimization complexity isn’t worth the savings.
Selinger 4-T decomposition. Default for production. Use when you have at least 7 ancillas available and don’t need measurement-based tricks.
Jones 3-T measurement-based decomposition. Use when latency budget allows for measurement + classical feedback per Toffoli, and the small probability of failure is acceptable.

For multi-Toffoli circuits, run a T-count optimization pass (Amy-Maslov-Mosca’s t-par, Heyfron-Campbell, or compiler-specific) after decomposition. Expect 20-40% T-count reduction on typical arithmetic circuits.

Exercises

1. Counting T gates for Shor’s

Shor’s algorithm for $N$ -bit integers uses ~ $3 \times N$ Toffolis in the modular-exponentiation circuit (Gidney-Ekerå 2021 estimate for RSA-2048 with $N = 2048$ is ~ $6 \times 10^9$ Toffolis after optimization). Compute total T-count using (a) standard 7-T, (b) Selinger 4-T, (c) Jones 3-T.

Show answer

(a) Standard: $6 \times 10^9 \times 7 = 4.2 \times 10^{10}$ T gates. (b) Selinger: $6 \times 10^9 \times 4 = 2.4 \times 10^{10}$ T gates. (c) Jones: $6 \times 10^9 \times 3 = 1.8 \times 10^{10}$ T gates. The optimization between standard and Jones cuts T-count by 57%. The corresponding magic-state factory area scales linearly with T-count. Cutting T-count in half cuts factory area in half — saving on the order of $5 \times 10^6$ physical qubits in the RSA-2048 resource estimate. Toffoli optimization is the largest single lever in fault-tolerant resource reduction, frequently larger than algorithmic improvements.

2. Why measurement helps

Jones’s 3-T Toffoli uses a measurement plus classical feedback. Why does this allow lower T-count than the unitary-only Selinger 4-T?

Show answer

Unitary circuits must implement the Toffoli on every input deterministically. Measurement-based circuits can prepare a specific magic state (using fewer T gates than a generic unitary Toffoli would need), then “deliver” it to the target qubits via teleportation-style measurement. The classical-feedback step does the work that would otherwise require additional T gates in a unitary protocol. The catch: latency. Measurement + feedback on superconducting hardware takes several microseconds, while a Selinger 4-T Toffoli is a few hundred nanoseconds. Use measurement-based variants when latency budget allows; use unitary variants when you need fast deterministic operation. This is one of the recurring fault-tolerance design decisions.

3. Toffoli ladders

A “Toffoli ladder” is a sequence of Toffolis where each Toffoli’s target becomes the control of the next. Such ladders appear in classical bit-counting circuits. Why does the optimized ladder T-count scale linearly with $n$ rather than as $4n$ ?

Show answer

The naive concatenation of $n$ Toffolis at 4 T-gates each gives $4n$ T-gates. But adjacent Toffolis in a ladder share inputs and outputs, allowing structural overlap: phase rotations on shared qubits can be combined, ancillas can be reused, and the chain of dependencies allows commuting and merging T gates. Cody Jones 2013 showed that a ladder of $n$ Toffolis can be decomposed using only $\sim 4n / 2 = 2n$ T-gates, then refined further to closer to $n$ T-gates with appropriate ancilla scheduling. The ladder structure lets you amortize T-gate cost across the chain. This is why algorithms with structured Toffoli sequences benefit disproportionately from compilation passes.

4. Choosing the right Toffoli for QFT

The quantum Fourier transform (tutorial 11) uses controlled-rotations $R_z(\pi / 2^k)$ for $k = 1, 2, \ldots, n$ . After Solovay-Kitaev synthesis (tutorial 53), each becomes a sequence of Cliffords + T gates. For an $n$ -qubit QFT, what’s the dominant cost contribution?

Show answer

The QFT has $\binom{n}{2} \sim n^2/2$ controlled-rotations. After synthesis, each rotation requires $\sim 4 \log_2(1/\epsilon)$ T-gates (Ross-Selinger). Total: $\sim 2 n^2 \log_2(1/\epsilon)$ T-gates from the rotations. The QFT itself contains no Toffolis explicitly. The cost is dominated by single-qubit-rotation synthesis, not Toffoli decomposition — different sub-cost-line than algorithms like Shor’s modular exponentiation, which is dominated by Toffoli-arithmetic. Different algorithms hit different cost lines: arithmetic-heavy algorithms (Shor, hashing) are Toffoli-dominated; rotation-heavy algorithms (QFT, chemistry) are synthesis-dominated. Resource estimation must account for the dominant cost in each algorithm class.

Where this goes next

Tutorial 55 covers controlled-unitary synthesis — the related but distinct problem of implementing $C\text{-}U$ for arbitrary $U$ , which arises in quantum phase estimation, amplitude amplification, and many block-encoding constructions. Tutorial 56 covers ZX-calculus, a graphical language for quantum circuits that enables a different style of T-count optimization through diagrammatic rewriting.

The standard 7-T decomposition

Why 7 T gates

Selinger’s 4-T variant

The measurement-based 3-T variant

T-count optimization across multiple Toffolis

Toffoli ladders

Hadamard gadgets

Synthesis-time optimization

A small T-count counting demonstration

When Toffoli optimization matters most

Common misconceptions

Decision rule

Exercises

1. Counting T gates for Shor’s

2. Why measurement helps

3. Toffoli ladders

4. Choosing the right Toffoli for QFT

Where this goes next

Quantum, for people who already code.