gates and circuits advanced · 18 min read · By LIPAI WANG · May 1, 2026

The Solovay-Kitaev Theorem: Approximating Any Single-Qubit Unitary with Clifford+T

The Solovay-Kitaev theorem says any single-qubit unitary can be approximated to accuracy ε by a sequence of Clifford and T gates of length polylogarithmic in 1/ε. This is the structural reason fault-tolerant quantum computing can use a finite gate set without losing expressivity. The original proof gives O(log^3.97(1/ε)) gates; modern algorithms achieve nearly optimal O(log(1/ε)) for restricted classes. This tutorial covers the theorem, the algorithm, and the practical compilation tooling that descends from it.

Prerequisites: Tutorial 25: The Clifford Group, Tutorial 26: The Eastin-Knill Theorem

Tutorial 25 introduced the Clifford group as the “easy” gate set — finite, classically simulable, transversal on stabilizer codes. Tutorial 26 used Eastin-Knill to show why no error-correcting code can have a universal transversal gate set, forcing fault-tolerant quantum computing to combine Cliffords (transversal) with the T gate (non-transversal, distilled). The natural concern that follows: with only Cliffords and T, can we actually implement arbitrary single-qubit unitaries we might need?

The Solovay-Kitaev theorem (Solovay 1995, formalized by Kitaev 2002) answers this: yes, and efficiently. Any single-qubit unitary $U$ can be approximated to accuracy $\epsilon$ by a sequence of Cliffords and T gates of length $O(\log^c(1/\epsilon))$ , where $c \approx 3.97$ in the original proof and has been pushed close to $1$ in modern algorithms. This is the structural foundation of fault-tolerant compilation: every algorithm that uses arbitrary rotations gets compiled into Clifford+T, with the cost predicted by Solovay-Kitaev-style scaling.

This tutorial covers the theorem, the constructive algorithm, the modern Ross-Selinger-style improvements that achieve near-optimal length, and the practical compilation toolchain that uses these results.

The setup

The single-qubit Clifford+T gate set is $\{H, S, T\}$ . Compositions of these generate a discrete subgroup of $\mathrm{SU}(2)$ . The relevant question: how dense is this subgroup in $\mathrm{SU}(2)$ , and how efficiently can we approximate a target unitary?

Density itself follows from a basic fact: any non-Clifford gate, combined with Cliffords, generates a dense subgroup of $\mathrm{SU}(2)$ . The T gate is non-Clifford (tutorial 25 showed $T X T^\dagger = (X+Y)/\sqrt{2}$ , not a Pauli), so the Clifford+T group is dense. Density alone is not enough — we want efficient approximation, not just existence of approximating sequences.

The Solovay-Kitaev theorem makes “efficient” precise:

Theorem (Solovay 1995, Kitaev 2002, formalized by Dawson-Nielsen 2006). For any universal gate set $\mathcal{G}$ closed under inverses and any target unitary $U \in \mathrm{SU}(2)$ , there is an algorithm that, given $U$ and target accuracy $\epsilon$ , produces a sequence of gates from $\mathcal{G}$ approximating $U$ to within distance $\epsilon$ , using $O(\log^c(1/\epsilon))$ gates and running in $O(\log^c(1/\epsilon))$ classical time, where $c \approx 3.97$ .

The “distance” here is the operator-norm distance: $\|U - V\| \leq \epsilon$ . The constant $c \approx 3.97$ is from the original proof; later refinements have pushed it down.

The constructive algorithm

The Solovay-Kitaev algorithm is recursive. The intuition: to approximate $U$ to accuracy $\epsilon$ , find an approximation $U_0$ to accuracy $\sqrt{\epsilon}$ from a precomputed library, then “correct” the error $U U_0^{-1}$ using a group commutator of two unitaries each approximated to accuracy roughly $\epsilon^{2/3}$ . Recurse on those.

Pseudocode (slightly simplified):

def solovay_kitaev(U, level n):
    if n == 0:
        return BASIC_APPROX(U)  # nearest precomputed gate sequence
    U_prev = solovay_kitaev(U, n-1)         # approximation at coarser accuracy
    delta = U * inverse(U_prev)              # the residual error
    V, W = decompose_as_commutator(delta)    # find V, W with VWV^-1 W^-1 ≈ delta
    V_approx = solovay_kitaev(V, n-1)
    W_approx = solovay_kitaev(W, n-1)
    return V_approx * W_approx * inverse(V_approx) * inverse(W_approx) * U_prev

The recursion has depth $n = O(\log\log(1/\epsilon))$ , and each level’s accuracy improves by a power of $\sim 3/2$ . The total number of gates is $O((\log(1/\epsilon))^c)$ where $c$ depends on the constants in the commutator decomposition.

The non-trivial step is decompose_as_commutator(delta): given a small unitary $\delta$ near the identity, find unitaries $V, W$ near the identity such that the group commutator $V W V^{-1} W^{-1} = \delta$ . For $\delta$ very close to the identity, this always exists and can be computed analytically using the Lie-algebra correspondence.

Why $\log^{3.97}$ and not $\log$

Each commutator-decomposition step doubles the number of gates needed (you have $V, W, V^{-1}, W^{-1}$ instead of just $\delta$ ). At level $n$ of the recursion, the gate count is roughly $5^n$ times the gate count at level 0, where the factor 5 comes from $V W V^{-1} W^{-1} U_\text{prev}$ . With $n = O(\log\log(1/\epsilon))$ , the total scaling becomes:

\text{Gates} \;\sim\; 5^{\log\log(1/\epsilon)} \;=\; (\log(1/\epsilon))^{\log 5} \;\approx\; (\log(1/\epsilon))^{2.32}.

Adding the constants from the basic approximation step pushes the exponent up to around 3.97 in the original analysis. The “3.97” is not fundamental — it is an artifact of the commutator-decomposition construction. Modern algorithms based on number theory achieve closer to $O(\log(1/\epsilon))$ with much better constants.

Modern algorithms: Ross-Selinger 2016

The state-of-the-art for single-qubit Clifford+T synthesis is the Ross-Selinger algorithm (Ross-Selinger 2016, “Optimal ancilla-free Clifford+T approximation of z-rotations”). It achieves:

Approximate gate count: $\sim 4 \log(1/\epsilon)$ T gates for $z$ -axis rotations.
Optimal up to a constant — within a small multiplicative factor of the information-theoretic lower bound.
Runtime polynomial in $\log(1/\epsilon)$ .

The algorithm works by reducing the synthesis problem to an algebraic-number-theory problem: factor a specific element in $\mathbb{Z}[\omega]$ (the ring of cyclotomic integers with $\omega = e^{i \pi/4}$ ) into a small number of factors of bounded norm. Each factor corresponds to a specific T-or-Clifford gate sequence. The number-theoretic factorization is solved by a clever lattice-reduction-style algorithm that runs in time polynomial in the bit length of $1/\epsilon$ .

The practical implication: for $z$ -rotations to accuracy $10^{-10}$ , Ross-Selinger uses about 80 T gates. Solovay-Kitaev with the original commutator algorithm uses roughly $\log^{3.97}(10^{10}) \approx 5{,}000$ T gates. Three orders of magnitude difference.

For arbitrary $\mathrm{SU}(2)$ rotations (not just $z$ -rotations), the analogous algorithms achieve $\sim 6\log(1/\epsilon)$ T gates by combining Ross-Selinger with Euler-angle decomposition.

Implementation: open-source tools

Production fault-tolerant compilers don’t reimplement Solovay-Kitaev from scratch. Several open-source libraries handle single-qubit synthesis:

PyQt synthesis (within Cirq) — basic Solovay-Kitaev implementation, useful for rough estimates.
Sleek (Microsoft Q# resource estimator) — Ross-Selinger-derived algorithms for Clifford+T synthesis, used in Microsoft’s resource estimator.
Newton’s method T-counter (Bocharov-Roetteler 2014) — another efficient single-qubit synthesis approach.
gridsynth (Selinger) — open-source Haskell implementation of Ross-Selinger.

For a real algorithm using arbitrary rotations (e.g., a chemistry circuit with continuous angles), the compilation flow is:

Decompose the algorithm into Cliffords plus arbitrary single-qubit rotations $R_z(\theta)$ .
For each $R_z(\theta)$ , choose target accuracy $\epsilon_\text{single}$ such that the cumulative error stays below the algorithm’s overall budget.
Run Ross-Selinger (or equivalent) on each $R_z(\theta_i)$ to get a Clifford+T sequence.
Compose into the full circuit. Optimize T-count via post-pass optimizers (Nam et al. 2018, Heyfron-Campbell 2018).

A small synthesis demonstration

This is a stripped-down Solovay-Kitaev sketch that gives you a feel for the recursion. The implementation is illustrative, not production-grade — gridsynth or Microsoft’s tools are what you’d use in a real workflow.

import numpy as np
from itertools import product

# Basic gate set: H, S, T (and their inverses for completeness).
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
S = np.array([[1, 0], [0, 1j]])
T = np.array([[1, 0], [0, np.exp(1j * np.pi / 4)]])
SDAG = S.conj().T
TDAG = T.conj().T

BASIC_GATES = {'H': H, 'S': S, 'T': T, 'Sd': SDAG, 'Td': TDAG}


def gate_distance(U, V):
    """Operator-norm-style distance between two single-qubit unitaries."""
    M = U @ V.conj().T
    eigs = np.linalg.eigvals(M)
    return min(np.abs(np.angle(eigs)))


def basic_approx_library(max_length=4):
    """Precompute all gate sequences of length <= max_length and their unitaries."""
    library = []
    for length in range(1, max_length + 1):
        for combo in product(BASIC_GATES.keys(), repeat=length):
            U = np.eye(2, dtype=complex)
            for g in combo:
                U = BASIC_GATES[g] @ U
            library.append(("·".join(combo), U))
    return library


def basic_approximation(U, library):
    """Find the closest precomputed sequence."""
    best_seq, best_unitary, best_dist = None, None, float('inf')
    for seq, V in library:
        d = gate_distance(U, V)
        if d < best_dist:
            best_seq, best_unitary, best_dist = seq, V, d
    return best_seq, best_unitary, best_dist


# Example: try to approximate a small z-rotation.
target_angle = 0.1  # ~5.7 degrees
target = np.array([[np.exp(-1j * target_angle / 2), 0],
                   [0, np.exp(1j * target_angle / 2)]])

library = basic_approx_library(max_length=4)
seq, approx, d = basic_approximation(target, library)
print(f"Target: R_z({target_angle})")
print(f"Best basic approximation: {seq}")
print(f"Distance: {d:.4f}")

The basic-approximation step is what gets recursively refined by the Solovay-Kitaev commutator iteration. Real implementations precompute a much larger library and then refine. The output is a Clifford+T sequence whose T-count grows polylogarithmically with target accuracy.

Common misconceptions

“Solovay-Kitaev gives optimal T-count.” No. The original Solovay-Kitaev algorithm gives $O(\log^{3.97}(1/\epsilon))$ T-count, far from optimal. Ross-Selinger and related modern algorithms give near-optimal $O(\log(1/\epsilon))$ . Don’t quote Solovay-Kitaev’s $\log^{3.97}$ as the cost of fault-tolerant compilation in 2026.

“Single-qubit synthesis is the dominant cost of fault-tolerant compilation.” Often it is not. Magic-state distillation (tutorial 24) typically dominates the qubit budget, while gate count is dominated by the algorithm’s arithmetic. Single-qubit synthesis is one of several costs that compose into the resource estimate.

“Solovay-Kitaev works for any gate set.” It works for any universal gate set closed under inverses. The Clifford+T set satisfies this. Non-universal gate sets cannot give arbitrary rotations regardless of length.

“More T gates means a more accurate approximation.” Roughly yes, but the relationship is logarithmic: doubling T-count from 40 to 80 might improve accuracy from $10^{-5}$ to $10^{-10}$ . This is why high-precision algorithms have manageable T-counts despite needing absurd-sounding accuracy targets.

“You can replace T gates with arbitrary $R_z(\theta)$ on real hardware.” Only on NISQ devices that support continuous rotations directly. For fault-tolerant computation, only the discrete Clifford+T set has fault-tolerant constructions; arbitrary rotations must be synthesized.

Decision rule

When a fault-tolerant compilation flow encounters an $R_z(\theta)$ for non-special $\theta$ :

Determine accuracy budget. Total algorithm error $\epsilon$ , divided across the algorithm’s $N$ rotations: per-rotation accuracy $\epsilon / N$ .
Run Ross-Selinger (or equivalent). Get a Clifford+T sequence with $\sim 4 \log(N/\epsilon)$ T gates per rotation.
Compose and optimize. Post-pass T-count optimizers can reduce total T-count by 10-30% by exploiting circuit-level cancellations.
Account for distillation. Each T gate costs a magic-state factory output. For RSA-2048-scale algorithms, the synthesis T-count is the input to the magic-state factory budget.

For circuits with very few non-Clifford rotations or with specific structure (e.g., chemistry’s particle-conserving rotations), specialized synthesis methods can sometimes beat generic Solovay-Kitaev/Ross-Selinger.

Exercises

1. Counting T gates for chemistry

A chemistry simulation needs $10^6$ rotations $R_z(\theta_i)$ , with total error budget $\epsilon = 10^{-3}$ . Estimate the T-count using Ross-Selinger.

Show answer

Per-rotation accuracy: $\epsilon_\text{single} = 10^{-3} / 10^6 = 10^{-9}$ . Ross-Selinger T-count per rotation: $\sim 4 \log_2(10^9) \approx 4 \cdot 30 = 120$ T gates. Total: $10^6 \times 120 = 1.2 \times 10^8$ T gates. For RSA-2048, Gidney-Ekerå quoted ~ $2 \times 10^{10}$ T gates. This chemistry simulation is two orders of magnitude smaller — easier for the magic-state factories. The accuracy budget breakdown is the dominant lever in resource estimation; tightening or loosening per-rotation accuracy by a factor of $10$ adjusts the T-count proportionally.

2. Why log^3.97 is not the bottleneck

The original Solovay-Kitaev gives $O(\log^{3.97}(1/\epsilon))$ scaling. For $\epsilon = 10^{-10}$ , that’s roughly $5{,}000$ T gates per rotation. Modern Ross-Selinger gives $\sim 4 \log_2(1/\epsilon) = 132$ T gates per rotation. Why was the field happy with Solovay-Kitaev for nearly two decades?

Show answer

Solovay-Kitaev was good enough for asymptotic complexity claims and resource-estimation bounds — quantum-algorithm papers typically reported “polylogarithmic gate count” without optimizing constants. The $5{,}000$ -vs- $132$ gap matters in practice but not in theoretical claims. As the field moved from theoretical estimates to actual fault-tolerant resource estimates (around 2014-2016), the constants started to matter, and Ross-Selinger’s order-of-magnitude improvement became operationally significant. Theoretical-paper culture and practical-engineering culture have different definitions of “good enough.” Solovay-Kitaev was good enough for the former for a long time.

3. Why Clifford+T specifically

Why is the standard fault-tolerant gate set Clifford+T rather than something like Clifford+S^(1/n) for some other $n$ ?

Show answer

Two reasons: (1) The T gate is the non-Clifford gate in the Clifford hierarchy with the simplest magic-state distillation protocol — Bravyi-Kitaev’s 15-to-1 protocol distills T-states cheaply. Other non-Clifford gates have higher distillation overheads. (2) The Clifford+T set generates a dense subgroup of $\mathrm{SU}(2)$ (universality) with optimal-up-to-constants T-count synthesis (Ross-Selinger). Other gate sets (e.g., Clifford+Toffoli) are universal too but have different distillation and synthesis costs. The Clifford+T choice is a confluence of (a) cheap distillation and (b) good synthesis algorithms. Reed-Muller codes have transversal T but lose transversal H — different tradeoff. The standard choice is Clifford+T because it’s the lowest total cost in the dominant fault-tolerance architecture.

4. Per-rotation budget tradeoffs

Suppose you can change the algorithm’s per-rotation accuracy from $10^{-10}$ to $10^{-8}$ at the cost of more rotations (say, going from $10^7$ rotations to $10^9$ ). Which is better for total T-count?

Show answer

Option A: $10^7$ rotations × $\sim 4 \log_2(10^{10}) = 4 \cdot 33 = 132$ T-gates each = $1.32 \times 10^9$ T total. Option B: $10^9$ rotations × $\sim 4 \log_2(10^8) = 4 \cdot 27 = 108$ T-gates each = $1.08 \times 10^{11}$ T total. Option A wins by two orders of magnitude. The lesson: rotation count drives total T-count linearly, while per-rotation accuracy enters logarithmically. Reduce the number of rotations even at the cost of higher per-rotation accuracy. This is the structural principle behind algorithms like qubitization that reduce the per-step rotation count compared to naive Trotterization.

Where this goes next

Tutorial 54 covers the dual problem at the gate level: Toffoli decomposition and T-count optimization, which determines how many T gates each non-Clifford operation in the algorithm contributes before single-qubit synthesis. Together, Solovay-Kitaev (single-qubit synthesis) and Toffoli decomposition (multi-qubit non-Clifford) account for almost all the T-count in a fault-tolerant compilation pipeline.

The setup

The constructive algorithm

Why log⁡3.97\log^{3.97}log3.97 and not log⁡\loglog

Modern algorithms: Ross-Selinger 2016

Implementation: open-source tools

A small synthesis demonstration

Common misconceptions

Decision rule

Exercises

1. Counting T gates for chemistry

2. Why log^3.97 is not the bottleneck

3. Why Clifford+T specifically

4. Per-rotation budget tradeoffs

Where this goes next

Quantum, for people who already code.

Why $\log^{3.97}$ and not $\log$