GPUs and a smarter algorithm let classical computers simulate 100–124 qubits of ruthenium chemistry in hours
This paper reports a way to run a quantum-chemistry algorithm on many GPUs so that classical hardware can solve problems previously thought to need quantum computers. The authors build a parallel version of the iterative qubit coupled cluster (iQCC) method that is accelerated by GPUs (graphics processing units). Using this code, they completed full ground‑state calculations for ruthenium catalyst models with 100–124 qubits on NVIDIA GPUs in about 1.2 to 45 hours, and they report better accuracy than a standard classical benchmark, the Density Matrix Renormalization Group (DMRG).
At a high level, iQCC is a way to represent and improve an electronic‑structure solution directly in the language of qubits. The algorithm repeatedly dresses the system Hamiltonian (the mathematical object that encodes the energy), that is, transforms it with simple unitary operations called entanglers. The key algorithmic choice here is to pick entanglers only from the Direct Interaction Space (DIS). These operators are guaranteed to give non‑zero energy gradients, so every optimization step can lower the energy and the method avoids the “barren plateau” problem (when gradients vanish and training a variational circuit stalls).
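To make the dressing step concrete, here is a minimal Python sketch. It is not the authors' code. It assumes each entangler has the form exp(-i t P / 2) for a single Pauli string P (a product of X, Y, Z matrices), stores each Pauli string as a pair of integer bit masks (x, z), and uses the all-zeros qubit state as the reference state. The function names and the tiny two-qubit Hamiltonian are hypothetical. Under these assumptions a term that commutes with P is left unchanged, while a term T that anticommutes with P becomes cos(t) T - i sin(t) T P. An entangler has a non-zero gradient at t = 0 only if some product T P is diagonal, which is the screening idea behind the DIS.

```python
import math

def popcount(n):
    """Number of set bits in a non-negative integer."""
    return bin(n).count("1")

def commutes(a, b):
    """Two Pauli strings commute iff they differ on an even number of
    positions where both act non-trivially with different letters."""
    (x1, z1), (x2, z2) = a, b
    return (popcount(x1 & z2) + popcount(z1 & x2)) % 2 == 0

def pauli_mul(a, b):
    """Product of two Pauli strings as (phase, string).
    Convention (an assumption of this sketch): bit j of x set means an X
    factor on qubit j, bit j of z set means a Z factor, both set means Y."""
    (x1, z1), (x2, z2) = a, b
    x3, z3 = x1 ^ x2, z1 ^ z2
    k = (popcount(x1 & z1) + popcount(x2 & z2)
         + 2 * popcount(z1 & x2) - popcount(x3 & z3)) % 4
    return (1, 1j, -1, -1j)[k], (x3, z3)

def dress(ham, entangler, t):
    """Return U^dagger H U for U = exp(-i t P / 2).
    ham maps Pauli strings (x, z) to real coefficients."""
    out = {}
    for term, coeff in ham.items():
        if commutes(term, entangler):
            out[term] = out.get(term, 0.0) + coeff
            continue
        # Anticommuting term: T -> cos(t) T - i sin(t) T P.
        phase, new_term = pauli_mul(term, entangler)  # phase is +i or -i here
        out[term] = out.get(term, 0.0) + math.cos(t) * coeff
        out[new_term] = (out.get(new_term, 0.0)
                         + (-1j * phase).real * math.sin(t) * coeff)
    return out

def energy_on_zero_state(ham):
    """Expectation value in |00...0>: only strings without X or Y contribute."""
    return sum(c for (x, _z), c in ham.items() if x == 0)

def gradient_at_zero(ham, entangler):
    """dE/dt at t = 0 for the reference state |00...0>."""
    g = 0.0
    for term, coeff in ham.items():
        if commutes(term, entangler):
            continue
        phase, (x3, _z3) = pauli_mul(term, entangler)
        if x3 == 0:  # product is diagonal, so it has a non-zero expectation
            g += (-1j * phase).real * coeff
    return g

# Hypothetical two-qubit Hamiltonian: 0.5 * X0 X1 (bit 0 is qubit 0).
ham = {(0b11, 0b00): 0.5}
in_dis = (0b11, 0b01)      # Y0 X1: same flip pattern as the term above
not_in_dis = (0b01, 0b01)  # Y0 alone: product with X0 X1 is not diagonal
print(gradient_at_zero(ham, in_dis), gradient_at_zero(ham, not_in_dis))
```

In this toy case only the entangler whose flip pattern matches a Hamiltonian term has a non-zero gradient, so a DIS-style screen would keep it and discard the other. The real method works with a mean-field reference state and far larger Hamiltonians, which this sketch does not attempt to reproduce.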
The main technical advance in this work is parallelization and GPU acceleration. The authors avoid gathering all Hamiltonian terms on one processor. Instead, they split the terms across compute nodes by a bit‑wise partitioning rule so that each node holds a disjoint subset. New terms generated during dressing are routed to the right partition deterministically, which cuts down the all‑to‑all communication that usually slows multi‑GPU runs. Heavy algebraic steps, which involve contracting many Pauli strings (products of simple X, Y, Z matrices that describe qubit operators), are offloaded to GPUs. The implementation also uses dynamic load balancing and a polynomial approximation for optimizing many entanglers; in the authors' tests, truncation orders K = 2–6 recover sub‑millihartree accuracy. Together, these changes give a speedup of more than two orders of magnitude over a serial CPU approach and O(M/P) memory per node, where M is the number of Hamiltonian terms and P is the number of processors.
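The partitioning idea can be sketched in a few lines of Python. The summary does not state the authors' actual rule, so the rule below is a hypothetical stand-in: each Pauli string is stored as two integer bit masks (x, z), and its owner is given by the low bits of x XOR z, with the number of partitions assumed to be a power of two. The function names are also hypothetical. The sketch shows the two properties described above: every term has exactly one owner, so the subsets are disjoint and each holds roughly M/P terms, and the owner of a newly generated term can be computed from its bits alone, with no lookup on other nodes.

```python
def popcount(n):
    """Number of set bits in a non-negative integer."""
    return bin(n).count("1")

def owner(term, n_parts):
    """Hypothetical bit-wise rule: low bits of (x XOR z) select the partition.
    n_parts must be a power of two so the mask below is valid."""
    if n_parts <= 0 or n_parts & (n_parts - 1):
        raise ValueError("n_parts must be a power of two")
    x, z = term
    return (x ^ z) & (n_parts - 1)

def partition(ham, n_parts):
    """Split a {(x, z): coeff} Hamiltonian into disjoint per-node dicts."""
    parts = [{} for _ in range(n_parts)]
    for term, coeff in ham.items():
        parts[owner(term, n_parts)][term] = coeff
    return parts

def destinations(parts, entangler):
    """For each source partition, collect the partitions that must receive
    the new terms created by dressing with the given entangler."""
    px, pz = entangler
    n_parts = len(parts)
    dests = {}
    for src, local in enumerate(parts):
        for (x, z) in local:
            # Only terms that anticommute with the entangler create new terms.
            if (popcount(x & pz) + popcount(z & px)) % 2 == 1:
                new_term = (x ^ px, z ^ pz)  # masks of the product string
                dests.setdefault(src, set()).add(owner(new_term, n_parts))
    return dests

# Toy input: all 64 Pauli strings on three qubits, split four ways.
ham = {(x, z): 1.0 for x in range(8) for z in range(8)}
parts = partition(ham, 4)
routes = destinations(parts, (0b011, 0b001))
print([len(p) for p in parts], routes)
```

With this particular rule the owner of a product is the XOR of the owners of its factors, so every source partition sends its new terms to a single destination and each dressing step becomes a pairwise exchange. That is a property of the stand-in rule chosen here; whether the authors' rule behaves the same way is not stated in this summary.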