Chapter 24: Quantum Machine Learning

Machine learning has transformed nearly every corner of technology - from image recognition and language translation to drug discovery and game playing. Naturally, researchers have asked: can quantum computing make machine learning even more powerful? The field of Quantum Machine Learning (QML) explores this question, seeking to identify problems where quantum computers provide genuine advantages for learning tasks.

This chapter will take an honest, technically grounded tour through the landscape of QML. We will start with a concise refresher on classical machine learning, then explore quantum kernel methods, quantum neural networks, generative models, and reinforcement learning. We will end with the most important section: a candid assessment of where QML actually helps today, where it might help in the future, and where the hype outpaces the evidence. Quantum computing is a powerful tool, but it is not magic - and understanding its real capabilities requires the same rigor we have applied throughout this textbook.

24.1 Classical ML Refresher

Before adding "quantum" to "machine learning," we need a clear picture of what classical machine learning does and how it works. This section provides the essential background; readers already familiar with ML may skim it.

Supervised Learning

In supervised learning, we are given a training set of input-output pairs $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^d$ is a $d$-dimensional feature vector and $y_i$ is the corresponding label (a class for classification, or a real number for regression). The goal is to learn a function $f: \mathbb{R}^d \to \mathcal{Y}$ that generalizes well to unseen data - that is, $f$ correctly predicts labels for inputs not in the training set.

The key question is not whether the model can memorize the training data (any sufficiently large model can), but whether it can generalize. This tension between fitting the training data (low training error) and performing well on new data (low test error) is the fundamental challenge of machine learning, formalized as the bias-variance trade-off.

Neural Networks

A neural network is a parameterized function built from layers of simple operations. Each layer applies a linear transformation (matrix multiplication plus bias) followed by a nonlinear activation function:

$$\mathbf{h}^{(l)} = \sigma\left( W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)} \right)$$

where $W^{(l)}$ is the weight matrix, $\mathbf{b}^{(l)}$ is the bias vector, $\sigma$ is a nonlinear activation function (such as ReLU or sigmoid), and $\mathbf{h}^{(l)}$ is the output of layer $l$. The input $\mathbf{h}^{(0)} = \mathbf{x}$ feeds through $L$ layers to produce the output $f(\mathbf{x}) = \mathbf{h}^{(L)}$.
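
To make the notation concrete, here is a minimal NumPy sketch of this forward pass; the layer sizes, the ReLU activation, and the random initialization are arbitrary choices for illustration.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    """Compute h^(L) by applying h = relu(W h + b) layer by layer."""
    h = x
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)
    return h

# Example: a network mapping 3 inputs -> 2 hidden units -> 1 output.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 3)), rng.normal(size=(1, 2))]
biases = [np.zeros(2), np.zeros(1)]
print(forward(np.array([0.5, -1.0, 2.0]), weights, biases))

In practice the final layer usually omits the nonlinearity (for regression) or applies a softmax (for classification).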

Gradient Descent

The parameters $\theta = \{W^{(l)}, \mathbf{b}^{(l)}\}$ are optimized by minimizing a loss function $\mathcal{L}(\theta)$ that measures how poorly the model's predictions match the training labels. The optimization is done by gradient descent: repeatedly computing the gradient $\nabla_\theta \mathcal{L}$ and updating the parameters in the direction that reduces the loss:

$$\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}$$

where $\eta$ is the learning rate. The gradient is computed efficiently using backpropagation, an application of the chain rule that propagates error signals backward through the network. Modern deep learning owes its success largely to the efficiency of backpropagation combined with hardware acceleration (GPUs and TPUs) that can process massive datasets in parallel.
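
The update rule itself is a single line of code. Below is a toy sketch, assuming a one-parameter linear model with a mean-squared-error loss and a hand-derived gradient (no backpropagation is needed at this scale); the data and learning rate are invented for illustration.

import numpy as np

# Toy dataset generated from y = 3x plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=50)
y = 3.0 * x + 0.1 * rng.normal(size=50)

w, eta = 0.0, 0.1                         # parameter and learning rate
for step in range(200):
    residual = w * x - y
    grad = np.mean(2.0 * residual * x)    # gradient of the mean squared error
    w = w - eta * grad                    # gradient descent update
print(w)                                  # converges near the true slope of 3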

The Kernel Trick

An alternative to neural networks that will be central to our quantum discussion is the kernel method. The idea: instead of learning a complex nonlinear function directly, map the data into a high-dimensional feature space $\mathcal{H}$ using a function $\phi: \mathbb{R}^d \to \mathcal{H}$, in which the data ideally becomes linearly separable. A linear classifier in this feature space corresponds to a nonlinear classifier in the original space.

The kernel trick avoids explicitly computing $\phi(\mathbf{x})$ by noting that many algorithms only need inner products between feature vectors. If we can compute the kernel function $k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$ efficiently, we never need to work in the (potentially infinite-dimensional) feature space directly. Support Vector Machines (SVMs) are the most well-known kernel method.
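
As a concrete (classical) example, the sketch below computes the Gram matrix of the widely used RBF kernel $k(\mathbf{x}, \mathbf{x}') = \exp(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2)$, whose implicit feature space is infinite-dimensional yet never constructed explicitly; the value of $\gamma$ and the sample points are arbitrary.

import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2), computed without ever forming phi(x)."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :]
                - 2.0 * X @ Y.T)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(rbf_kernel(X, X))   # 3 x 3 kernel matrix; diagonal entries equal 1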

Key Concept.

Classical machine learning optimizes parameterized functions (neural networks, kernel methods) to minimize a loss function on training data while generalizing to unseen data. The kernel trick is especially relevant for QML: it shows that working in a high-dimensional feature space can be done implicitly through inner products, which is exactly what quantum circuits enable naturally.

24.2 Quantum Kernel Methods

The connection between kernel methods and quantum computing is remarkably natural. A quantum circuit maps classical data $\mathbf{x}$ into a quantum state $|\phi(\mathbf{x})\rangle$ living in an exponentially large Hilbert space - exactly the kind of high-dimensional feature space that kernel methods exploit. The quantum kernel is simply the inner product between these quantum states:

$$k(\mathbf{x}, \mathbf{x}') = |\langle \phi(\mathbf{x}) | \phi(\mathbf{x}') \rangle|^2$$

This is the probability that the quantum state $|\phi(\mathbf{x}')\rangle$ passes a test for being the state $|\phi(\mathbf{x})\rangle$ - a quantity that can be estimated on a quantum computer using a simple circuit.

The Quantum Feature Map

A quantum feature map is a parameterized circuit $U(\mathbf{x})$ that encodes classical data into a quantum state:

$$|\phi(\mathbf{x})\rangle = U(\mathbf{x}) |0\rangle^{\otimes n}$$

Common choices include:

  • Angle encoding: Each feature $x_j$ becomes a rotation angle on its own qubit: $U(\mathbf{x}) = \bigotimes_j R_Y(x_j)$. This uses one qubit per feature but creates a relatively simple feature space (see the code sketch after this list).
  • Amplitude encoding: The feature vector is encoded as the amplitudes of a quantum state: $|\phi(\mathbf{x})\rangle = \sum_j x_j |j\rangle / \|\mathbf{x}\|$. This uses only $\log_2(d)$ qubits for a $d$-dimensional vector, but preparing this state efficiently is non-trivial.
  • IQP (Instantaneous Quantum Polynomial) encoding: Alternating layers of Hadamard gates and diagonal unitaries parameterized by the data, creating feature spaces that are provably hard to simulate classically.
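
The code sketch below makes angle encoding concrete by exact statevector simulation with NumPy: each feature sets an $R_Y$ angle on its own qubit, followed by one CNOT (matching the two-qubit circuit in the interactive explorer below). The particular gate sequence is just an illustrative assumption.

import numpy as np

def ry(theta):
    """Single-qubit R_Y rotation matrix."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

def feature_map(x):
    """|phi(x)> = CNOT (R_Y(x0) tensor R_Y(x1)) |00> -- angle encoding plus one entangler."""
    state = np.kron(ry(x[0]), ry(x[1])) @ np.array([1.0, 0.0, 0.0, 0.0])
    return CNOT @ state

print(feature_map([0.4, 1.3]))   # the four amplitudes of the encoded 2-qubit state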

Interactive: Quantum Feature Map Explorer

Explore the parameterized circuits used to encode classical data into quantum states. These feature maps are the building blocks of quantum classifiers and quantum kernel methods. Each architecture creates a different family of quantum states, affecting the model's expressiveness and trainability.

OPENQASM 2.0;
include "qelib1.inc";
qreg q[2];
creg c[2];
ry({x0}) q[0];
ry({x1}) q[1];
cx q[0], q[1];
measure q[0] -> c[0];
measure q[1] -> c[1];

Adjust the data inputs $x_0$ and $x_1$ to see how different data points are encoded into quantum states. The Bloch sphere shows the single-qubit reduced state.

Estimating the Quantum Kernel

To compute $k(\mathbf{x}, \mathbf{x}')$, we can use the swap test or, more directly, the compute-uncompute (inversion) test: apply $U(\mathbf{x})$ to $|0\rangle^{\otimes n}$, then $U(\mathbf{x}')^\dagger$, and measure whether all qubits return to $|0\rangle$. The probability of obtaining the all-zeros outcome is exactly $|\langle \phi(\mathbf{x}) | \phi(\mathbf{x}') \rangle|^2$.
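
On a simulator the overlap can be computed directly from the statevectors. The sketch below reuses the feature_map function from the sketch above; on hardware, the same number would be estimated as the frequency of the all-zeros outcome over many shots.

import numpy as np

def quantum_kernel(x, x_prime):
    """k(x, x') = |<phi(x)|phi(x')>|^2, evaluated exactly from simulated states."""
    return np.abs(np.vdot(feature_map(x), feature_map(x_prime))) ** 2

print(quantum_kernel([0.4, 1.3], [0.4, 1.3]))    # identical inputs -> 1.0
print(quantum_kernel([0.4, 1.3], [2.0, -0.5]))   # dissimilar inputs -> value below 1.0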

The sandbox below demonstrates a simple quantum feature map using angle encoding on 2 qubits. Two data points are encoded using $R_Y$ and $R_Z$ rotations, and the overlap between their quantum states gives the kernel value. Modify the angles to see how different data points produce different kernel values.

The probability of measuring 00 estimates the kernel value $k(\mathbf{x}, \mathbf{x}')$. When the two data points are identical, the circuit perfectly uncomputes and you get 00 with certainty ($k = 1$). As the data points diverge, the kernel value decreases. Try setting both data points to the same values to verify.


Quantum SVM Workflow

The complete Quantum Support Vector Machine (QSVM) workflow, sketched in code after the list, is:

  1. Choose a quantum feature map $U(\mathbf{x})$.
  2. For all pairs of training data points, estimate the kernel matrix $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ using the quantum computer. This requires $O(N^2)$ kernel estimates for $N$ training points, each built from many repeated circuit executions (shots).
  3. Feed the kernel matrix into a classical SVM solver to find the optimal decision boundary.
  4. For a new test point $\mathbf{x}^*$, estimate $k(\mathbf{x}_i, \mathbf{x}^*)$ for each support vector $\mathbf{x}_i$ and classify using the trained SVM.
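
A minimal end-to-end sketch of these four steps, reusing the quantum_kernel function defined above and scikit-learn's SVC with a precomputed kernel (scikit-learn is an assumed dependency); the dataset is a toy one, and a hardware experiment would estimate every kernel entry from repeated shots rather than from exact simulation.

import numpy as np
from sklearn.svm import SVC

# Step 1 happened above: the feature map behind quantum_kernel is fixed.
X_train = np.array([[0.1, 0.2], [0.3, 0.1], [2.5, 2.4], [2.2, 2.8]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.2, 0.3], [2.4, 2.5]])

def kernel_matrix(A, B):
    """Steps 2 and 4: estimate k(a, b) for every pair of points."""
    return np.array([[quantum_kernel(a, b) for b in B] for a in A])

K_train = kernel_matrix(X_train, X_train)                  # N x N Gram matrix
svm = SVC(kernel="precomputed").fit(K_train, y_train)      # Step 3: classical SVM solver

K_test = kernel_matrix(X_test, X_train)                    # kernels against training points
print(svm.predict(K_test))                                 # Step 4: predicted labels
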
Common Misconception.

The potential quantum advantage in quantum kernel methods lies in the expressiveness of the feature space, not in speeding up the SVM optimization itself (which is done classically). A quantum advantage exists only if the quantum feature map creates a kernel that captures data structure that no efficiently computable classical kernel can. For most practical datasets studied so far, classical kernels perform comparably or better.

24.3 Quantum Neural Networks

A quantum neural network (QNN), also called a parameterized quantum circuit (PQC) or variational quantum circuit, is the quantum analog of a classical neural network. The idea: build a quantum circuit with tunable gate parameters, measure an observable at the output, and optimize the parameters to minimize a loss function - exactly the same train-predict paradigm as classical ML.

Architecture

A typical QNN has three stages:

  1. Data encoding: Classical input $\mathbf{x}$ is encoded into a quantum state using a feature map $U(\mathbf{x})$, as in the kernel method.
  2. Variational layers: A parameterized circuit $V(\theta)$ with trainable angles $\theta = (\theta_1, \ldots, \theta_m)$ processes the encoded state. This typically consists of alternating layers of single-qubit rotations ($R_Y$, $R_Z$) and entangling gates (CNOT).
  3. Measurement: The expectation value of an observable $O$ (such as $Z_0$, the Pauli-$Z$ operator on the first qubit) is measured to produce the model's prediction $f(\mathbf{x}; \theta) = \langle \psi(\mathbf{x}, \theta) | O | \psi(\mathbf{x}, \theta) \rangle$.

The circuit below illustrates a simple 2-qubit QNN with one variational layer. The data is encoded via $R_Y$ gates, followed by a CNOT for entanglement and trainable $R_Z$ rotations.

Adjust the data inputs ($x_0$, $x_1$) to see how different data points are encoded, and the weight parameters ($\theta_0$, $\theta_1$) to see how the trainable circuit transforms the output distribution. In a real QNN, the weights would be optimized by a classical optimizer to minimize a loss function over the training data.
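
Below is a minimal exact simulation of a 2-qubit QNN of this kind. One detail is changed for the sake of the example: the trainable layer uses $R_Y$ rather than $R_Z$ rotations, so that the parameters visibly change the measured $\langle Z_0 \rangle$; everything else (the encoding, the observable, the sample inputs) is an illustrative assumption.

import numpy as np

I2 = np.eye(2)
Z = np.array([[1.0, 0.0], [0.0, -1.0]])
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def qnn(x, theta):
    """f(x; theta) = <Z_0> after encoding, an entangling CNOT, and a trainable R_Y layer."""
    state = np.array([1.0, 0.0, 0.0, 0.0])                   # |00>
    state = np.kron(ry(x[0]), ry(x[1])) @ state               # data encoding
    state = CNOT @ state                                      # entanglement
    state = np.kron(ry(theta[0]), ry(theta[1])) @ state       # trainable layer
    Z0 = np.kron(Z, I2)                                       # observable on qubit 0
    return float(state @ Z0 @ state)

print(qnn(x=[0.4, 1.3], theta=[0.2, -0.7]))                   # a prediction in [-1, 1]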

Training with the Parameter-Shift Rule

Computing gradients of quantum circuits requires a technique called the parameter-shift rule. For a gate of the form $R_\alpha(\theta) = e^{-i \theta \sigma / 2}$ (where $\sigma$ is a Pauli operator), the gradient of an expectation value with respect to $\theta$ can be computed exactly using two circuit evaluations:

$$\frac{\partial}{\partial \theta} \langle O \rangle = \frac{1}{2} \left[ \langle O \rangle_{\theta + \pi/2} - \langle O \rangle_{\theta - \pi/2} \right]$$

This plays the role that backpropagation plays classically: it provides exact gradients (not finite-difference approximations) using only measurements of the same circuit at shifted parameter values. Unlike backpropagation, however, it does not reuse intermediate results: each gradient component requires two circuit evaluations, so computing the full gradient of $m$ parameters requires $2m$ circuit runs - a cost that grows linearly with the number of parameters.
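
In code, the rule is two evaluations per parameter. The sketch below differentiates the toy qnn function from the previous snippet (its trainable gates are $R_Y$ rotations of the required form $e^{-i\theta\sigma/2}$, so the rule is exact); the finite-difference check is included only to make that exactness visible.

import numpy as np

def parameter_shift_gradient(x, theta):
    """Gradient of qnn(x; theta): two shifted circuit evaluations per parameter."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for k in range(len(theta)):
        shift = np.zeros_like(theta)
        shift[k] = np.pi / 2
        grad[k] = 0.5 * (qnn(x, theta + shift) - qnn(x, theta - shift))
    return grad

x, theta = [0.4, 1.3], np.array([0.2, -0.7])
print(parameter_shift_gradient(x, theta))

# Sanity check against central finite differences (step size chosen arbitrarily).
eps = 1e-5
print([(qnn(x, theta + eps * e) - qnn(x, theta - eps * e)) / (2 * eps) for e in np.eye(2)])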

Key Concept.

Quantum neural networks are parameterized quantum circuits trained by classical optimization. The parameter-shift rule provides exact gradients from circuit measurements, analogous to backpropagation in classical neural networks. However, each gradient evaluation requires additional circuit runs, and the total training cost scales linearly with the number of parameters.

QNN Architecture Diagram

A quantum neural network has three stages: data encoding, variational layers with trainable weights, and measurement. Adjust the weights to see how the output distribution changes.

Classical vs Quantum Kernel Comparison

Compare how a classical RBF kernel and a quantum kernel classify the same dataset. The decision boundary depends on the kernel's ability to separate the data in feature space. For linearly separable data, both kernels perform similarly. The quantum kernel's advantage, if any, appears on data with structure that aligns with the quantum feature map.

OPENQASM 2.0;
include "qelib1.inc";
qreg q[2];
creg c[2];
ry(1.57) q[0];
ry(1.57) q[1];
measure q[0] -> c[0];
measure q[1] -> c[1];

OPENQASM 2.0;
include "qelib1.inc";
qreg q[2];
creg c[2];
ry({angle}) q[0];
ry({angle}) q[1];
cx q[0], q[1];
rz({angle}) q[1];
cx q[0], q[1];
measure q[0] -> c[0];
measure q[1] -> c[1];

The left panel shows a simple product-state encoding (analogous to an RBF kernel). The right panel adds entanglement via CNOT + RZ, creating richer correlations that a quantum kernel can exploit. Adjust the angle to see how the distributions diverge.

24.4 Quantum Generative Models

Generative models learn the underlying probability distribution of a dataset and can generate new samples that resemble the training data. Classical examples include Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and diffusion models. Quantum computing offers a natural fit for generative modeling because quantum circuits inherently produce probability distributions when measured.

Quantum Circuit Born Machines

A Quantum Circuit Born Machine (QCBM) is perhaps the simplest quantum generative model. A parameterized quantum circuit $V(\theta)$ acts on the initial state $|0\rangle^{\otimes n}$ to produce a quantum state whose measurement statistics define the model distribution:

$$p_\theta(x) = \left| \langle x | V(\theta) | 0^{\otimes n} \rangle \right|^2$$

The parameters $\theta$ are optimized so that $p_\theta$ matches the target distribution $p_{\text{data}}$. The loss function is typically the KL divergence or the Maximum Mean Discrepancy (MMD) between the model distribution and the empirical data distribution.
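
A minimal sketch, assuming a two-qubit circuit and a made-up target distribution: the measurement probabilities of the circuit define $p_\theta$, and the KL divergence to the target serves as the loss. A real QCBM would estimate $p_\theta$ from finite measurement samples and update $\theta$ with a classical optimizer.

import numpy as np

CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def born_distribution(theta):
    """p_theta over bitstrings 00, 01, 10, 11 for V(theta) = CNOT (R_Y(t0) tensor R_Y(t1))."""
    state = CNOT @ np.kron(ry(theta[0]), ry(theta[1])) @ np.array([1.0, 0.0, 0.0, 0.0])
    return np.abs(state) ** 2

def kl_divergence(p, q, eps=1e-12):
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p_data = np.array([0.5, 0.0, 0.0, 0.5])      # toy target: equal mixture of 00 and 11
theta = np.array([np.pi / 2, 0.0])           # this choice reproduces the target exactly
print(born_distribution(theta))              # ~[0.5, 0, 0, 0.5]
print(kl_divergence(p_data, born_distribution(theta)))   # ~0 when model matches data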

The appeal of QCBMs is that they can natively represent distributions that are hard to sample from classically. A quantum circuit of polynomial depth can produce distributions that, under standard complexity-theoretic assumptions, require exponential classical resources to sample (this is essentially the argument behind quantum computational supremacy experiments). However, whether such hard-to-sample distributions arise naturally in practical machine learning tasks is an open question.

Quantum GANs

Quantum GANs extend the adversarial training framework to the quantum setting. A quantum generator circuit $G(\theta_G)$ produces quantum states, and a discriminator (which can be quantum or classical) tries to distinguish generated samples from real data. The generator and discriminator are trained in an adversarial loop, as in classical GANs. Quantum GANs have been demonstrated on small-scale problems, but face the same training stability challenges as their classical counterparts, compounded by the noise and limited qubit count of near-term quantum hardware.

Quantum Boltzmann Machines

Quantum Boltzmann Machines (QBMs) are inspired by classical Restricted Boltzmann Machines but use a quantum Hamiltonian to define the model distribution. The key idea is that thermal states of quantum Hamiltonians can encode correlations that classical Boltzmann machines cannot efficiently represent, particularly those involving quantum entanglement between visible and hidden units. However, training QBMs requires estimating thermal expectation values, which is itself a computationally demanding task.

Note.

Quantum generative models are theoretically motivated by the fact that quantum circuits can produce distributions that are hard to sample classically. However, the practical question is whether the data distributions encountered in real-world applications actually require quantum resources to model effectively. For most current applications, classical generative models (especially diffusion models and transformers) outperform quantum approaches at practical scales.

24.5 Quantum Reinforcement Learning

Reinforcement learning (RL) is the branch of ML where an agent learns to make decisions by interacting with an environment, receiving rewards, and adjusting its strategy (policy) to maximize cumulative reward. RL has achieved spectacular successes in game playing (Go, Atari, StarCraft) and robotics. Quantum enhancements to RL come in several flavors:

Quantum-Enhanced Policy Networks

The simplest approach replaces the classical neural network that parameterizes the agent's policy with a quantum neural network (PQC). The quantum circuit takes in the environment state (encoded into rotation angles), processes it through variational layers, and outputs measurement statistics that determine the action probabilities. The parameters are updated using policy gradient methods (such as REINFORCE), with gradients computed via the parameter-shift rule.
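
A minimal sketch of a quantum policy, assuming a single-qubit circuit and a two-action environment: the environment state is encoded as one rotation angle, a trainable rotation follows, and the two Z-basis measurement probabilities are used directly as action probabilities. The encoding, circuit, and action mapping are all illustrative assumptions.

import numpy as np

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def policy(state_feature, theta):
    """pi(a | s): action probabilities from one qubit's measurement statistics."""
    psi = ry(theta) @ ry(state_feature) @ np.array([1.0, 0.0])   # encode s, then trainable gate
    return np.abs(psi) ** 2                                      # [P(action 0), P(action 1)]

probs = policy(state_feature=0.8, theta=-0.3)
action = np.random.default_rng(0).choice(2, p=probs)             # sample an action (REINFORCE-style)
print(probs, action)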

Initial experiments on simple RL environments (CartPole, FrozenLake) have shown that quantum policy networks can match classical networks using fewer trainable parameters. However, it is not yet clear whether this parameter efficiency translates to a practical advantage, since the cost of each quantum circuit evaluation (including state preparation, gate application, and measurement) may be much higher than the cost of a classical forward pass.

Quantum Environments

A more natural application of quantum RL is when the environment itself is quantum - for example, controlling a quantum system for state preparation, error correction, or optimal quantum control. In these settings, the agent must interact with a quantum system that cannot be efficiently simulated classically, providing a clear motivation for quantum processing in the RL loop.

Grover-Enhanced Search in RL

Some proposals use Grover's algorithm or amplitude amplification to speed up the exploration phase of RL, providing quadratic speedups in searching over large action spaces. These approaches are theoretically sound but require fault-tolerant quantum computers with many qubits, placing them firmly in the future rather than the near term.

Common Misconception.

Quantum reinforcement learning does not mean the quantum computer "learns faster" in any simple sense. The quantum advantage, if it exists, comes from either (1) the expressiveness of quantum policy representations in fewer parameters, (2) inherently quantum environments that classical computers cannot efficiently simulate, or (3) quantum speedups in specific subroutines like search. For classical environments with efficient classical simulators, the case for quantum RL remains unproven.

24.6 The Honest Assessment: Where Does QML Actually Help?

This is the most important section of this chapter. The hype around quantum machine learning has been enormous, with breathless claims about "exponential speedups" and "quantum advantage for AI." A responsible treatment demands that we separate established results from speculation, and genuine advantages from wishful thinking. Here is the current state of the field, as honestly as we can present it.

Where Quantum Advantage Is Established (Theory)

  • Learning quantum systems: When the data itself comes from a quantum process (quantum states, quantum channels, quantum Hamiltonians), quantum computers can learn properties of that system exponentially faster than any classical approach. This is because quantum data can only be fully accessed by quantum measurements, and classical descriptions of quantum states are exponentially large. This is the clearest and most rigorous advantage.
  • Specific kernel separations: Theoretical constructions exist where quantum kernels can classify data that no classical kernel of polynomial size can. However, these constructions rely on cryptographic or algebraic hardness assumptions and use contrived datasets that do not arise in practical applications.
  • Sampling complexity: Quantum circuits can sample from distributions that are provably hard to sample classically (related to quantum supremacy). If a learning task requires sampling from such distributions, quantum computers have an inherent advantage.

The Barren Plateau Problem

The single biggest obstacle to practical QML is the barren plateau phenomenon. Discovered by McClean et al. in 2018, it shows that for random parameterized quantum circuits, the gradient of the loss function vanishes exponentially with the number of qubits:

$$\text{Var}\left[\frac{\partial \mathcal{L}}{\partial \theta_k}\right] \leq O\left(\frac{1}{2^n}\right)$$

where $n$ is the number of qubits. This means that as the system size grows, the optimization landscape becomes exponentially flat and featureless - the optimizer has no gradient signal to follow. This is not just a worst-case issue; it affects a wide class of circuit architectures, loss functions, and noise models.
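
The scaling can be observed numerically even at small sizes. The sketch below estimates the variance of a single parameter-shift gradient component over randomly initialized circuits (layers of random-axis Pauli rotations followed by a ring of CNOTs) for increasing qubit counts. The circuit family, depth, and sample count are arbitrary choices, and the precise decay rate depends on all of them; the qualitative trend is the point.

import numpy as np

rng = np.random.default_rng(0)
I2 = np.eye(2, dtype=complex)
PAULI = {"x": np.array([[0, 1], [1, 0]], dtype=complex),
         "y": np.array([[0, -1j], [1j, 0]], dtype=complex),
         "z": np.array([[1, 0], [0, -1]], dtype=complex)}

def rot(axis, theta):
    """exp(-i theta P / 2) for a Pauli operator P."""
    return np.cos(theta / 2) * I2 - 1j * np.sin(theta / 2) * PAULI[axis]

def kron_all(mats):
    out = np.array([[1.0 + 0j]])
    for m in mats:
        out = np.kron(out, m)
    return out

def cnot(control, target, n):
    """Full 2^n x 2^n matrix for CNOT(control -> target)."""
    P0 = np.diag([1.0 + 0j, 0.0])
    P1 = np.diag([0.0 + 0j, 1.0])
    term0 = kron_all([P0 if q == control else I2 for q in range(n)])
    term1 = kron_all([P1 if q == control else (PAULI["x"] if q == target else I2)
                      for q in range(n)])
    return term0 + term1

def expectation_z0(n, layers, thetas, axes):
    """<Z_0> after `layers` rounds of random single-qubit rotations plus a CNOT ring."""
    state = np.zeros(2 ** n, dtype=complex)
    state[0] = 1.0
    k = 0
    for _ in range(layers):
        state = kron_all([rot(axes[k + q], thetas[k + q]) for q in range(n)]) @ state
        k += n
        for q in range(n):
            state = cnot(q, (q + 1) % n, n) @ state
    Z0 = kron_all([PAULI["z"] if q == 0 else I2 for q in range(n)])
    return float(np.real(np.vdot(state, Z0 @ state)))

def gradient_variance(n, layers, samples=200):
    """Variance of the parameter-shift gradient w.r.t. the first rotation angle."""
    grads = []
    for _ in range(samples):
        axes = rng.choice(list("xyz"), size=layers * n)
        thetas = rng.uniform(0, 2 * np.pi, size=layers * n)
        shift = np.zeros_like(thetas)
        shift[0] = np.pi / 2
        grads.append(0.5 * (expectation_z0(n, layers, thetas + shift, axes)
                            - expectation_z0(n, layers, thetas - shift, axes)))
    return float(np.var(grads))

for n in range(2, 7):
    print(n, gradient_variance(n, layers=n))   # the gradient variance decays as n grows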

Worse still, recent theoretical results have revealed a fundamental dilemma: strategies that provably avoid barren plateaus (such as restricting circuits to polynomial subspaces with local structure) often produce circuits that are classically simulable. In other words, the very circuits that are trainable tend to be the ones that offer no quantum advantage. Circuits that could provide quantum advantage tend to be the ones that are untrainable. This "barren plateau paradox" is an active area of research with no clear resolution.

Key Concept.

The barren plateau problem is the central theoretical obstacle to scalable QML. Gradient variance vanishes exponentially with qubit count for generic parameterized circuits. Current evidence suggests a fundamental tension: circuits that avoid barren plateaus are often classically simulable, while circuits with genuine quantum expressiveness are often untrainable. This does not prove QML is useless, but it constrains the design space severely.

Where the Evidence Is Weak or Negative

  • Classical data, classical tasks: For standard ML benchmarks (image classification, natural language processing, tabular data), no quantum algorithm has demonstrated a practical advantage over classical methods on datasets of meaningful size. Classical deep learning scales to billions of parameters on massive datasets; current QML is limited to tens of qubits and toy-scale problems.
  • Quantum kernel dequantization: Several quantum kernel methods proposed for classical data have been "dequantized" - shown to be achievable classically with comparable performance. When quantum kernels are tuned for good generalization on practical datasets, they often converge to kernels that can be efficiently approximated classically.
  • Data loading bottleneck: Many QML algorithms assume efficient quantum access to classical data (loading $N$ data points into quantum states in $O(\log N)$ time). In practice, loading classical data into a quantum computer is expensive and can negate any quantum speedup.
  • Noise: Near-term quantum hardware is noisy, and noise degrades the expressiveness and trainability of quantum circuits. Error mitigation helps (as we will see in Chapter 25) but introduces sampling overhead that further reduces any potential speedup.

Where QML Might Help (Honest Speculation)

  • Quantum chemistry and materials: ML models trained on quantum simulation data (molecular energies, wavefunctions, material properties) could benefit from quantum feature maps that capture the structure of quantum systems more naturally than classical representations.
  • Structured combinatorial problems: Problems where the data or the model has inherent combinatorial or graph structure might benefit from quantum approaches, especially if the structure aligns with the native connectivity of quantum hardware.
  • Fault-tolerant era: With error-corrected quantum computers and thousands of logical qubits, the landscape changes dramatically. Algorithms like quantum linear algebra (HHL) and quantum principal component analysis could provide genuine speedups for certain ML tasks - but this is likely a decade or more away.

A Summary Table

Claim                                    | Status             | Evidence Level
QML learns quantum systems faster        | Established        | Rigorous proofs
Quantum kernels separate from classical  | Theoretical only   | Contrived datasets; not practical
QNNs outperform classical NNs            | Unproven           | Small-scale demos; no scaling evidence
Quantum speedup for classical ML tasks   | Unlikely near-term | Data loading bottleneck; dequantization
Barren plateaus are avoidable at scale   | Open question      | Avoiding them may remove quantum advantage
QML for quantum chemistry data           | Promising          | Natural fit; early results encouraging

Note.

This assessment reflects the state of research through 2025. The field is evolving rapidly, and future theoretical breakthroughs or hardware advances could change the picture significantly. The intellectually honest position is neither hype ("quantum ML will revolutionize everything") nor dismissal ("quantum ML is useless") but careful optimism tempered by a clear-eyed view of current limitations.