Quantum Outpost
quantum ml · advanced · 22 min read

Quantum Kernels and Feature Maps

Quantum kernels sidestep variational training entirely: they embed data into a quantum Hilbert space via a fixed feature map and use the inner product as a kernel for a classical SVM. This tutorial builds the ZZ feature map from Havlíček et al. 2019, implements a quantum SVC in Qiskit, and explains the reproducing-kernel view that unifies the approach.

Prerequisites: Tutorial 15: QML Foundations

Variational QML (Tutorial 15) trains a parameterized circuit with gradient descent. Quantum kernel methods take a different route: fix a quantum feature map, compute inner products between embedded data points, and feed the resulting kernel matrix to a classical SVM. No variational parameters, no training of the quantum part, no barren plateaus.

It’s a clean idea and a serious line of QML research. Havlíček et al.’s 2019 Nature paper (“Supervised learning with quantum-enhanced feature spaces”) laid out the playbook and gave the field its most concrete claim of possible quantum advantage. Seven years on, we know when it works (small structured datasets with the right feature map) and when it doesn’t (most tabular data — XGBoost still wins). This tutorial builds the method from first principles.

Classical kernels, briefly

A kernel is a function K(x, y) that equals an inner product in some (possibly implicit) feature space:

K(x, y) = \langle \phi(x), \phi(y) \rangle_\mathcal{H}.

The big deal: many ML algorithms (SVM, kernel ridge regression, Gaussian processes) depend only on these inner products, never on the ϕ’s themselves. You can work in a feature space too high-dimensional to represent directly — even infinite-dimensional — as long as you can compute pairwise inner products.

Common classical kernels:

  • Linear: K(x, y) = x^T y.
  • Polynomial: K(x, y) = (x^T y + c)^d.
  • RBF (Gaussian): K(x, y) = exp(−γ‖x − y‖²).

The kernel trick: pick a useful kernel, feed the pairwise values into an SVM, and get nonlinear decision boundaries in the original space.
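To make the “valid kernel” requirement concrete, here is a numpy-only sketch (the helper name `rbf_kernel_matrix` is mine, not a library function) that builds an RBF Gram matrix and verifies the positive semi-definiteness a valid kernel guarantees:

```python
import numpy as np

def rbf_kernel_matrix(X: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Pairwise K[i, j] = exp(-gamma * ||X[i] - X[j]||^2)."""
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))  # clamp tiny negative rounding

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = rbf_kernel_matrix(X)

# A valid kernel's Gram matrix is symmetric PSD, with K(x, x) = 1 for RBF.
print(f"min eigenvalue: {np.linalg.eigvalsh(K).min():.2e}")
```

The same PSD check is a useful sanity test later for sampled quantum kernel matrices, where shot noise can push small eigenvalues slightly negative.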

Quantum kernels

Define a feature map by state preparation:

\phi(x) \equiv |\phi(x)\rangle = U_\phi(x)\,|0\rangle^{\otimes n},

where U_\phi(x) is a data-dependent circuit. The fidelity quantum kernel is:

K(x, y) = \big|\langle \phi(x)\,|\,\phi(y)\rangle\big|^2.

This is the squared overlap between two embedded states. It’s automatically a valid kernel: K(x, y) = Tr(ρ_x ρ_y) with ρ_x = |ϕ(x)⟩⟨ϕ(x)|, an inner product of density matrices, so every Gram matrix is positive semi-definite by construction.

How to compute. For each pair of data points, prepare U_\phi(y)^\dagger\,U_\phi(x)\,|0\rangle^{\otimes n} and measure the probability of the all-zeros outcome:

\Pr[\text{all } 0\text{'s}] = \big|\langle 0^{\otimes n}|\,U_\phi(y)^\dagger\,U_\phi(x)\,|0^{\otimes n}\rangle\big|^2 = K(x, y).

This is the “kernel-by-overlap” circuit. For a dataset of N training points, the SVM needs O(N²) such circuits. On real hardware with shot noise, each needs thousands of shots — so quantum-kernel SVMs are computationally heavy.

The ZZ feature map

Havlíček et al. introduced the ZZ feature map:

U_\phi(x) = U_Z(x)\,H^{\otimes n}\,U_Z(x)\,H^{\otimes n},

where

U_Z(x) = \exp\!\left(i\sum_{j} x_j\,Z_j + i\sum_{j < k} (\pi - x_j)(\pi - x_k)\,Z_j Z_k\right).

The sums include single-qubit Z rotations (linear terms in x) and two-qubit ZZ rotations (pairwise terms). The Hadamard layers create superposition; the ZZ terms generate entanglement and make the feature-space embedding genuinely nonlinear. The depth-2 structure (two copies of the encoding) increases expressivity.

Intuition: this feature map is conjectured to be hard to simulate classically — a classical surrogate would seemingly need to track the full quantum state, which scales exponentially in n. If the data happens to be separable in this feature space, a quantum SVM could beat any polynomial-time classical method.

Whether real datasets have this property is the million-dollar question we address in Tutorial 17.

Qiskit implementation

import numpy as np
from qiskit import QuantumCircuit
from qiskit.circuit import ParameterVector
from qiskit_aer import AerSimulator


def zz_feature_map(n: int, x: np.ndarray, depth: int = 2) -> QuantumCircuit:
    """ZZ feature map of the given depth (a.k.a. `reps`) on n qubits."""
    qc = QuantumCircuit(n)
    for _ in range(depth):
        qc.h(range(n))
        for j in range(n):
            qc.rz(2 * x[j], j)
        for j in range(n):
            for k in range(j + 1, n):
                qc.cx(j, k)
                qc.rz(2 * (np.pi - x[j]) * (np.pi - x[k]), k)
                qc.cx(j, k)
    return qc


def quantum_kernel_entry(x: np.ndarray, y: np.ndarray, n: int) -> float:
    """Compute K(x, y) = |⟨φ(x)|φ(y)⟩|² via the overlap circuit."""
    qc = QuantumCircuit(n, n)
    qc.compose(zz_feature_map(n, x), inplace=True)
    qc.compose(zz_feature_map(n, y).inverse(), inplace=True)
    qc.measure(range(n), range(n))
    shots = 4096
    counts = AerSimulator().run(qc, shots=shots).result().get_counts()
    zeros = counts.get("0" * n, 0)
    return zeros / shots


def quantum_kernel_matrix(X1: np.ndarray, X2: np.ndarray, n: int) -> np.ndarray:
    """Compute the pairwise kernel matrix K[i, j] = K(X1[i], X2[j])."""
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for i, x in enumerate(X1):
        for j, y in enumerate(X2):
            K[i, j] = quantum_kernel_entry(x, y, n)
    return K
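For small n you can cross-check the sampled estimate against an exact, numpy-only computation. The sketch below rebuilds the feature map’s action as dense linear algebra (helper names are mine), using RZ(θ) = diag(e^{−iθ/2}, e^{iθ/2}) and the fact that the CX–RZ(θ)–CX pattern implements exp(−i(θ/2) Z⊗Z), so each repetition is a Hadamard layer followed by one diagonal phase layer; the qubit-ordering convention here is an arbitrary choice:

```python
import numpy as np
from functools import reduce

def zz_statevector(x: np.ndarray, depth: int = 2) -> np.ndarray:
    """Exact |phi(x)> for the ZZ feature map, as dense linear algebra."""
    n = len(x)
    H1 = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    H = reduce(np.kron, [H1] * n)          # Hadamard on every qubit
    # Diagonal layer: phase exp(-i [sum_j x_j z_j + sum_{j<k} (pi-x_j)(pi-x_k) z_j z_k])
    # on basis state |s>, where z_j = +1/-1 is the Z eigenvalue of qubit j.
    angles = np.zeros(2 ** n)
    for b in range(2 ** n):
        z = np.array([1 - 2 * ((b >> j) & 1) for j in range(n)])
        a = np.sum(x * z)
        for j in range(n):
            for k in range(j + 1, n):
                a += (np.pi - x[j]) * (np.pi - x[k]) * z[j] * z[k]
        angles[b] = a
    D = np.exp(-1j * angles)
    psi = np.zeros(2 ** n, dtype=complex)
    psi[0] = 1.0
    for _ in range(depth):
        psi = D * (H @ psi)                # one repetition: H layer, then phases
    return psi

def exact_kernel(x: np.ndarray, y: np.ndarray) -> float:
    return abs(np.vdot(zz_statevector(x), zz_statevector(y))) ** 2

x = np.array([0.3, 1.1])
y = np.array([0.7, 0.2])
print(f"K(x, x) = {exact_kernel(x, x):.6f}")  # 1 by normalization
```

Comparing `exact_kernel` against the sampled `quantum_kernel_entry` on a few pairs is a cheap way to catch sign and ordering bugs before burning shots.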

A quantum SVC in action

Use the kernel matrix with scikit-learn’s generic SVC. The only change from the classical workflow: kernel="precomputed", and you supply the Gram matrices yourself.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler

# --- Low-dim synthetic data that fits in few qubits ---
X, y = make_classification(n_samples=40, n_features=3, n_informative=3,
                           n_redundant=0, random_state=7)

# Scale features to [0, π] (feature-map expects small angles)
X = MinMaxScaler(feature_range=(0, np.pi)).fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7)

n_qubits = X.shape[1]

K_train = quantum_kernel_matrix(X_train, X_train, n_qubits)
K_test = quantum_kernel_matrix(X_test,  X_train, n_qubits)

clf = SVC(kernel="precomputed")
clf.fit(K_train, y_train)

train_acc = clf.score(K_train, y_train)
test_acc = clf.score(K_test, y_test)
print(f"Quantum SVC train accuracy: {train_acc:.2%}")
print(f"Quantum SVC test accuracy:  {test_acc:.2%}")
# Quantum SVC train accuracy: 96.67%
# Quantum SVC test accuracy:  90.00%

Compare against a classical RBF SVM on the same data:

rbf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
print(f"Classical RBF SVC test accuracy: {rbf.score(X_test, y_test):.2%}")
# Classical RBF SVC test accuracy: 90.00%

On this 3-feature synthetic dataset, the two kernels tie. That’s the honest-result pattern across most benchmarks: quantum kernels match, but do not clearly beat, well-tuned classical kernels on standard problems.

The reproducing-kernel view

Every kernel K has a corresponding reproducing-kernel Hilbert space (RKHS) \mathcal{H}_K — the “implicit feature space” where the kernel computes inner products. For the fidelity quantum kernel, the underlying feature map sends x to the density matrix ρ_x = |ϕ(x)⟩⟨ϕ(x)| (which quotients out the global phase), and the RKHS is generated by these embeddings.

Key theorem (Representer): for any kernel K and any convex loss function, the SVM solution has the form

f(x) = \sum_i \alpha_i\,K(x_i, x),

— a linear combination of kernel evaluations at training points. The classifier always has this form; you never need to represent ϕ explicitly.

For quantum kernels, this means the SVM effectively searches the subspace of the quantum Hilbert space spanned by training embeddings. The quantum-classical performance gap comes down to: does this subspace capture the data’s structure better than the RBF RKHS?
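The representer form is directly inspectable in scikit-learn: a fitted SVC exposes the coefficients via `dual_coef_` (which stores y_i α_i for the support vectors) and `intercept_`, and the decision function really is Σ_i α_i K(x_i, x) + b. A sketch using a classical RBF kernel for speed (the same check works with a quantum kernel matrix):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # XOR-like labels

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

clf = SVC(kernel="precomputed").fit(rbf(X, X), y)

# Reconstruct f(x) = sum_i alpha_i K(x_i, x) + b from the support vectors.
X_new = rng.normal(size=(5, 2))
K_new = rbf(X_new, X)                 # kernel against *training* points
sv = clf.support_                     # indices of the support vectors
manual = K_new[:, sv] @ clf.dual_coef_[0] + clf.intercept_[0]
print(np.allclose(manual, clf.decision_function(K_new)))  # matches the library
```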

Projected vs fidelity kernels

An alternative to the raw fidelity kernel |⟨ϕ(x)|ϕ(y)⟩|²: the projected quantum kernel, introduced by Huang et al. 2021. Instead of computing the squared overlap of full states, project each state onto local reduced density matrices and compute a kernel in that space:

K_\text{proj}(x, y) = \exp\!\left(-\gamma \sum_i \|\rho_i^x - \rho_i^y\|_1\right),

where ρ_i^x is the reduced density matrix of qubit i in the state |ϕ(x)⟩.

Projected kernels work better on real data because they exploit local structure (which most datasets have) and are less affected by the “exponential concentration” problem that plagues fidelity kernels at scale. If you’re going to try quantum kernels on a real dataset in 2026, try both.
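A minimal numpy sketch of the projected construction, assuming you already have the embedded statevectors (function names are mine; this uses the trace-norm form of the formula above):

```python
import numpy as np

def reduced_density_matrices(psi: np.ndarray, n: int) -> list[np.ndarray]:
    """Single-qubit reduced density matrices rho_i of an n-qubit statevector."""
    t = psi.reshape([2] * n)
    rhos = []
    for i in range(n):
        M = np.moveaxis(t, i, 0).reshape(2, -1)   # qubit i vs. all the rest
        rhos.append(M @ M.conj().T)               # partial trace over the rest
    return rhos

def projected_kernel(psi_x, psi_y, n, gamma=1.0):
    rx = reduced_density_matrices(psi_x, n)
    ry = reduced_density_matrices(psi_y, n)
    # Trace (nuclear) norm of each local difference
    dist = sum(np.linalg.norm(a - b, "nuc") for a, b in zip(rx, ry))
    return np.exp(-gamma * dist)

# Tiny example with hand-made 2-qubit states
psi_x = np.array([1, 0, 0, 0], dtype=complex)               # |00>
psi_y = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)  # Bell state
print(projected_kernel(psi_x, psi_x, 2))  # 1.0 for identical states
```

In practice the ρ_i come from tomography-style local measurements on hardware rather than full statevectors; the statevector version is for simulation.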

The exponential concentration problem

Here’s the catch that motivates all the caveats:

As n grows, for random feature maps on Haar-random input distributions, the fidelity kernel K(x, y) concentrates exponentially around its average value. In other words: all pairwise kernels become nearly equal, the SVM can’t distinguish data points, and classification accuracy collapses.

This is closely related to barren plateaus in variational QML — both are manifestations of high-dimensional Hilbert-space geometry making generic quantum operations indistinguishable. To avoid it, use structured feature maps (exploiting symmetries, locality, or problem-specific inductive bias) and modest qubit counts (well under 50 for fidelity kernels).
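The concentration is easy to see numerically. For Haar-random states the expected squared overlap is exactly 1/2^n, so pairwise kernel values collapse toward zero as n grows — a quick numpy demonstration, independent of any particular feature map:

```python
import numpy as np

def random_state(n: int, rng) -> np.ndarray:
    """Haar-random n-qubit statevector via normalized complex Gaussians."""
    v = rng.normal(size=2 ** n) + 1j * rng.normal(size=2 ** n)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(42)
for n in [2, 4, 6, 8]:
    overlaps = [
        abs(np.vdot(random_state(n, rng), random_state(n, rng))) ** 2
        for _ in range(200)
    ]
    print(f"n={n}: mean |<psi|phi>|^2 = {np.mean(overlaps):.5f}  (1/2^n = {1 / 2**n:.5f})")
```

Structured feature maps on real data are not Haar-random, which is exactly why structure and locality are the escape hatch.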

Two tasks quantum kernels do well

Honest positive findings, with citations:

  1. Quantum data. If your data consists of quantum states (measurement outcomes from a quantum experiment, molecular ground states), a quantum kernel operates natively and avoids the information loss of classical compression. See Huang et al. 2022 for molecular learning.

  2. Learning from parities / modular functions. Glick et al. 2021 show that quantum kernels with specific structured feature maps provably beat classical kernels on learning parity-like functions. These are carefully constructed problems, not real-world datasets, but they prove the principle.

Exercises

1. Build a linear quantum kernel

Use a minimal feature map |\phi(x)\rangle = \prod_i R_y(x_i)\,|0\rangle and derive the resulting kernel by hand. What is K(x, y)?

Answer

State: |\phi(x)\rangle = \prod_i \big(\cos(x_i/2)\,|0\rangle + \sin(x_i/2)\,|1\rangle\big). Inner product: \prod_i \cos((x_i - y_i)/2). Squared: \prod_i \cos^2((x_i - y_i)/2). This is essentially a cosine-similarity kernel with a specific geometry — still a valid kernel, but with limited expressivity.
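A numerical check of this closed form (illustrative numpy only):

```python
import numpy as np
from functools import reduce

def ry_state(x: np.ndarray) -> np.ndarray:
    """Product state from Ry(x_i)|0> on each qubit."""
    qubits = [np.array([np.cos(xi / 2), np.sin(xi / 2)]) for xi in x]
    return reduce(np.kron, qubits)

rng = np.random.default_rng(1)
x, y = rng.uniform(0, np.pi, 3), rng.uniform(0, np.pi, 3)

kernel = abs(np.vdot(ry_state(x), ry_state(y))) ** 2
closed_form = np.prod(np.cos((x - y) / 2) ** 2)
print(np.isclose(kernel, closed_form))  # the two expressions agree
```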

2. Complexity of kernel matrix

For a training set of size N, how many circuit evaluations do you need to compute the full kernel matrix? What’s the complexity of training an SVM given the kernel?

Answer

The matrix is symmetric, with N(N+1)/2 unique entries. Each needs ~4096 shots, for a total shot budget of ~4096 · N²/2. SVM training on a precomputed kernel is O(N²) to O(N³) depending on the solver. For N = 1000 training points: ~2 × 10⁹ shots. At 1000 shots/sec on real hardware, that’s roughly 24 days of runtime. Keep training sets small for real-hardware QSVC.
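The arithmetic, as a throwaway calculator (the 4096 shots per entry and 1000 shots/sec throughput are the answer’s own assumptions):

```python
N = 1000                        # training points
shots_per_entry = 4096
entries = N * (N + 1) // 2      # unique entries of the symmetric kernel matrix
total_shots = entries * shots_per_entry
days = total_shots / 1000 / 86400   # assumed throughput: 1000 shots/sec
print(f"{total_shots:.2e} shots ≈ {days:.1f} days")
```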

3. Fidelity vs projected on moons

Reimplement the moons-dataset classifier from Tutorial 15 using a ZZ fidelity kernel, then using a projected kernel. Which performs better? Why?

Hint

Moons has strong local structure (nearby points share a label). The projected kernel exploits this; the fidelity kernel doesn’t. Expect the projected kernel to roughly match the variational result (~90%) and the fidelity kernel to underperform (below 85%).

4. Derive the RKHS dimension

For the ZZ feature map on n = 2 qubits, depth 1, what’s the effective dimension of the RKHS? (Count independent degrees of freedom in the embedded states up to global phase.)

Answer

The embedded state lives in the 2-qubit Hilbert space (4 complex dimensions); modulo normalization and global phase, that is the projective space \mathbb{CP}^3, with 6 real dimensions. For a 2-D input x ∈ ℝ², the map x ↦ |\phi(x)\rangle traces a smooth 2-manifold inside this 6-dimensional space. The effective RKHS dimension is correspondingly small — bounded by the ambient state manifold. That’s expressive, but not dramatically more than a low-degree polynomial kernel in 2 variables.

What you should take away

  • Quantum kernels = fixed feature map + classical SVM. No variational training of the quantum part.
  • The ZZ feature map at depth 2 is the standard benchmark; its classical simulation hardness is conjectured, not proven.
  • Fidelity kernels concentrate exponentially at scale; projected kernels are the better path for real data.
  • Quantum data is where quantum kernels have the cleanest edge. For classical tabular data, classical kernels are still competitive or better.
  • Scalability wall: the O(N²) circuit evaluations needed to build the kernel matrix are brutal at N ≳ 1000 on real hardware.

Next — and finally for this track: is QML worth it? A skeptic’s benchmark against XGBoost on real tabular data, a tour of the “dequantization” results that took quantum advantages back, and an honest recommendation on when QML earns its complexity.

