Generative quantum eigensolver (GQE) training using PennyLane data¶
Published: September 20, 2024. Last updated: November 8, 2024.
We will be demonstrating and evaluating a novel algorithm proposed by Nakaji et al. in their paper The generative quantum eigensolver (GQE) and its application for ground state search, which employs a classical generative model of quantum circuits for ground-state energy estimation of a molecular Hamiltonian [1]. It was proposed as a scalable alternative to the variational quantum eigensolver (VQE) approach, where the quantum state is represented as a quantum circuit with tunable parameters that are optimized during training to arrive at a state minimizing the corresponding energy E. In GQE, by contrast, the structure of the quantum circuit is produced by a trained generative model.
We will primarily focus on offline training on a fixed training dataset – thanks to the molecular data available in PennyLane Datasets. By the end of the demo, we will show that the model gradually provides a better estimate for the energies and, in turn, can sample energies close to the ground state energy calculated by PennyLane.
This demo is organized as follows. Firstly, we compare the GQE and VQE algorithms, and afterwards, we dive deeper into describing GPT-QE, i.e., the GQE algorithm that uses a GPT model, and its training. Next, we generate the training dataset for our GPT model using PennyLane, and give details on our model architecture and training implementation. After that, we evaluate the model throughout its training and discuss its performance in estimating the ground state. And lastly, we discuss the results, potential ways of optimizing the code, and its extension to the "online training" phase.
GQE vs. VQE¶
Despite the relative success of VQE, there are issues with its trainability for large problem instances [1]. This shortcoming makes it less competitive against classical machine learning (ML) algorithms for large problems. To bypass this, the GQE algorithm was proposed. Specifically, GQE uses a classical generative model from which quantum circuits are sampled as sequences of unitaries drawn from a given operator pool. This generative model is then trained to predict quantum circuits that evolve an initial state to states that better approximate the ground state.
The main difference between the two approaches is where the tunable parameters are embedded: it is the classical GQE model that is optimized, as opposed to the variational quantum circuit of VQE. GQE can thereby sidestep the barren plateau problem of VQE and the evaluation of quantum gradients across large circuits, making it more amenable to larger problems.

GPT-QE background¶
In particular, the model architecture used by Nakaji et al. [1] for GQE was a generative pre-trained transformer (GPT) [2], [3], a choice reflected in the name GPT-QE. As language models, GPTs are successful at generating sequences of words that closely resemble human natural language. This performance is harnessed for quantum chemistry by constructing quantum states $\rho$ as a sequence of unitary operators which are, in turn, represented by quantum circuits. That is, we let $\rho = U \rho_0 U^\dagger$ for some fixed initial state $\rho_0$, where the aforementioned sequence is $U = U_{j_N} U_{j_{N-1}} \cdots U_{j_1}$. The GPT model samples a sequence of integers $j_1, j_2, \ldots, j_N$ indexing a pool of operators $\{U_j\}$ generated using molecular data from PennyLane Molecules. In the parlance of language models, we interpret these integers as tokens and the pool as the vocabulary. The goal of training is then to minimize the corresponding energy $E = \operatorname{Tr}(\hat{H}\rho)$, where $\hat{H}$ is the Hamiltonian of the molecule in question.
Each token $j_i$ is sampled with probability proportional to $\exp(-\beta w_{j_i})$, where $w_{j_i}$ is the unnormalized log-probability (or logit) returned by the GPT model for the token $j_i$, and $\beta$ is an inverse temperature controlling the trade-off between exploration and exploitation. The probability of sampling a state through this method is then proportional to $\exp(-\beta w_{\text{sum}})$, where $w_{\text{sum}} = \sum_{i=1}^{N} w_{j_i}$, while the desired probability for the corresponding energy is $\exp(-\beta E)$. We thus have a constraint that the total logit equal the energy of the corresponding state, $w_{\text{sum}} = E$, which can be imposed by training GPT-QE to minimize the loss function $C = (w_{\text{sum}} - E)^2$. With this constraint satisfied, GPT-QE would sample states of smaller energies with increasing likelihood.
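To make the sampling rule concrete, here is a small illustrative sketch (with hypothetical logit values, not part of the GPT-QE implementation) of how a single token would be drawn from model logits and how the loss for one sequence would be computed:

import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.3, -1.2, 0.7, -0.5])  # hypothetical logits w_j for a 4-operator pool
beta = 5.0                            # inverse temperature

# token probabilities proportional to exp(-beta * w_j): lower logits are favored
probs = np.exp(-beta * w)
probs /= probs.sum()
token = rng.choice(len(w), p=probs)

# training target: the total logit of a sampled sequence should match its energy,
# giving the per-sequence loss C = (w_sum - E)^2
w_sum, E = -1.05, -1.10               # hypothetical total logit and true energy
loss = (w_sum - E) ** 2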
More concretely, we summarize the pre-training loop in Figure 2. It is called pre-training because it involves learning from a fixed dataset first, before transitioning to the "real" training which utilizes data generated by the model itself. In this demo, we will refer to pre-training as offline training, since the GPT model does not receive feedback from the sequences it samples, and to the latter phase as online training.

Figure 2: Overview for offline training of GPT-QE¶
Dataset construction via PennyLane¶
Firstly, let us construct the static dataset we will use for offline training. We choose to generate our own dataset in order to illustrate the sequences and energies more concretely. Our dataset will be made of random sequences of tokens, which, as described above, correspond to indices into a vocabulary of unitary operators. We then define an energy function in PennyLane to calculate the energy of the state corresponding to a token sequence. Applying this function, we obtain a dataset of token sequences and energies for offline training of the GPT model.
Loading molecular information¶
For simplicity, let us consider the hydrogen molecule and load the corresponding dataset from PennyLane. Recall that we need a vocabulary of operators $U_j$, an initial state $\rho_0$, and the Hamiltonian $\hat{H}$ for a hydrogen molecule. We also get the ground state energy for later comparison with the results.
Specifically, the unitary operators $U_j$ are time-evolution operators, as prescribed in [1]. The non-identity operators are generated in PennyLane using SingleExcitation and DoubleExcitation, which depend on the number of electrons and orbitals of the molecule.
import numpy as np
import pennylane as qml


def generate_molecule_data(molecules="H2"):
    datasets = qml.data.load("qchem", molname=molecules)

    # Get the time set T
    op_times = np.sort(np.array([-2**k for k in range(1, 5)] + [2**k for k in range(1, 5)]) / 160)

    # Build operator set P for each molecule
    molecule_data = dict()
    for dataset in datasets:
        molecule = dataset.molecule
        num_electrons, num_qubits = molecule.n_electrons, 2 * molecule.n_orbitals
        singles, doubles = qml.qchem.excitations(num_electrons, num_qubits)
        double_excs = [qml.DoubleExcitation(time, wires=double) for double in doubles for time in op_times]
        single_excs = [qml.SingleExcitation(time, wires=single) for single in singles for time in op_times]
        identity_ops = [qml.exp(qml.I(range(num_qubits)), 1j * time) for time in op_times]  # For the identity
        operator_pool = double_excs + single_excs + identity_ops
        molecule_data[dataset.molname] = {
            "op_pool": np.array(operator_pool),
            "num_qubits": num_qubits,
            "hf_state": dataset.hf_state,
            "hamiltonian": dataset.hamiltonian,
            "expected_ground_state_E": dataset.fci_energy,
        }
    return molecule_data


molecule_data = generate_molecule_data("H2")
h2_data = molecule_data["H2"]
op_pool = h2_data["op_pool"]
num_qubits = h2_data["num_qubits"]
init_state = h2_data["hf_state"]
hamiltonian = h2_data["hamiltonian"]
grd_E = h2_data["expected_ground_state_E"]
op_pool_size = len(op_pool)
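At this point, it can be helpful to peek at what we loaded. The snippet below is an optional, illustrative check; the exact pool size depends on the excitations PennyLane generates for the molecule.

# Quick inspection of the loaded data (output depends on the dataset)
print(op_pool_size)   # pool size: (n_singles + n_doubles + 1) * number of time steps
print(op_pool[0])     # one time-evolution operator from the pool
print(init_state)     # the Hartree-Fock state used as the initial state
print(grd_E)          # FCI ground state energy for later comparison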
Defining the energy function¶
In PennyLane, we define the energy function $E = \operatorname{Tr}\left(\hat{H}\, U_{j_N} \cdots U_{j_1} \rho_0 U_{j_1}^\dagger \cdots U_{j_N}^\dagger\right)$ corresponding to Eq. 1 of [1]. Here, energy_circuit takes in the operator sequence $U_{j_1}, U_{j_2}, \ldots, U_{j_N}$ and returns the energy of the corresponding quantum state.
As a slight extension, we also calculate the energies for each subsequence of operators to help with the training of the model. That is, for a sequence of three operators $U_{j_1}, U_{j_2}, U_{j_3}$, we compute the energies for $U_{j_1}$ and $U_{j_2} U_{j_1}$ in addition to the energy of the full sequence, which is what was described in [1]. This can be done simply in PennyLane using Snapshot, as shown below.
dev = qml.device("default.qubit", wires=num_qubits)


@qml.qnode(dev)
def energy_circuit(gqe_ops):
    # Computes Eq. 1 from Nakaji et al. based on the selected unitary operators
    qml.BasisState(init_state, wires=range(num_qubits))  # Initial state <-- Hartree-Fock state
    for op in gqe_ops:
        qml.Snapshot(measurement=qml.expval(hamiltonian))
        qml.apply(op)  # Applies each of the unitary operators
    return qml.expval(hamiltonian)


energy_circuit = qml.snapshots(energy_circuit)


def get_subsequence_energies(op_seq):
    # Collates the energies of each subsequence for a batch of sequences
    energies = []
    for ops in op_seq:
        es = energy_circuit(ops)
        energies.append(
            [es[k].item() for k in list(range(1, len(ops))) + ["execution_results"]]
        )
    return np.array(energies)
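As a quick usage example (a sketch with a randomly drawn sequence, not part of the dataset generation itself), a batch containing one 3-operator sequence should yield three subsequence energies:

# Example: subsequence energies for one random sequence of 3 operators
example_ops = op_pool[np.random.randint(op_pool_size, size=(1, 3))]
print(get_subsequence_energies(example_ops))  # expected shape: (1, 3)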
Token sequence generation with corresponding energies¶
With these ingredients, we can now construct a dataset containing sequences of tokens and their energies. Since we cannot feed the operators directly to the GPT model, we need to tokenize them. The indices of op_pool seem to be a good candidate, but we instead choose the tokens to be the op_pool indices shifted by 1. This lets us define a special token 0 that tells the GPT model where the sequence starts.
We generate train_size random operator sequences of length seq_len and calculate the energies of the sequences and all of their subsequences.
# Generate sequences of indices of operators in the vocabulary
train_size = 1024
seq_len = 4
train_op_pool_inds = np.random.randint(op_pool_size, size=(train_size, seq_len))

# Corresponding sequences of operators
train_op_seq = op_pool[train_op_pool_inds]

# Corresponding token sequences with a special starting token prepended
train_token_seq = np.concatenate([
    np.zeros(shape=(train_size, 1), dtype=int),  # starting token is 0
    train_op_pool_inds + 1,  # shift operator indices by one
], axis=1)

# Calculate the energies for each subsequence in the training set
train_sub_seq_en = get_subsequence_energies(train_op_seq)
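Before moving on, it is worth sanity-checking the array shapes. This is a small illustrative check; the printed shapes follow from the train_size and seq_len values above.

# Sanity check of dataset shapes
print(train_token_seq.shape)   # (1024, 5): seq_len tokens plus the starting token 0
print(train_sub_seq_en.shape)  # (1024, 4): one energy per subsequence length
print(train_token_seq[0])      # e.g. [0, t1, t2, t3, t4] with tokens t_i >= 1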
GPT-QE offline training¶
Having set up our training dataset, we can start implementing our offline training loop, as illustrated in Figure 2. We outline our implementation below.
GPT model implementation details¶
The GPT model we will use in this demo is mostly implemented in the nanoGPT repo (a reimplementation of the OpenAI GPT-2) as the class GPT, with the model hyperparameters stored in the dataclass GPTConfig. Namely, we will use 12 attention layers, 12 attention heads, and 768 embedding dimensions, matching those described in [1].
We can import from the nanoGPT repo directly by running the curl command (commented out below) in a Jupyter notebook.
Since nanoGPT is trained as a language model, its loss function and sampling method are defined differently. We therefore define the subclass GPTQE below to override some nanoGPT methods and make them more suitable for our case.
# !curl -O https://raw.githubusercontent.com/karpathy/nanoGPT/master/model.py
from model import GPT, GPTConfig

import torch
from torch.nn import functional as F


class GPTQE(GPT):
    def forward(self, idx):
        device = idx.device
        b, t = idx.size()
        pos = torch.arange(0, t, dtype=torch.long, device=device)  # shape (t)

        # forward the GPT model itself
        tok_emb = self.transformer.wte(idx)  # token embeddings of shape (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos)  # position embeddings of shape (t, n_embd)
        x = self.transformer.drop(tok_emb + pos_emb)
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)
        return logits

    def calculate_loss(self, tokens, energies):
        current_tokens, next_tokens = tokens[:, :-1], tokens[:, 1:]
        # calculate the logits for the next possible tokens in the sequence
        logits = self(current_tokens)
        # get the logit for the actual next token in the sequence
        next_token_mask = torch.nn.functional.one_hot(
            next_tokens, num_classes=self.config.vocab_size
        )
        next_token_logits = (logits * next_token_mask).sum(axis=2)
        # calculate the cumulative logits for each subsequence
        cumsum_logits = torch.cumsum(next_token_logits, dim=1)
        # match cumulative logits to subsequence energies
        loss = torch.mean(torch.square(cumsum_logits - energies))
        return loss

    @torch.no_grad()
    def generate(self, n_sequences, max_new_tokens, temperature=1.0, device="cpu"):
        idx = torch.zeros(size=(n_sequences, 1), dtype=int, device=device)
        total_logits = torch.zeros(size=(n_sequences, 1), device=device)
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
            # forward the model to get the logits for the next index in the sequence
            logits = self(idx_cond)
            # pluck the logits at the final step
            logits = logits[:, -1, :]
            # set the logit of the starting token so that its probability of being sampled is zero
            logits[:, 0] = float("inf")
            # apply softmax to convert logits to (normalized) probabilities, scaled by the desired temperature
            probs = F.softmax(-logits / temperature, dim=-1)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)
            # accumulate the logits of the sampled tokens
            total_logits += torch.gather(logits, index=idx_next, dim=1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)
        return idx, total_logits
However, it is important to note that the loss function calculate_loss we defined differs from the one described in [1], which is $(\exp(-w_{\text{sum}}) - \exp(-E))^2$. As described above, we instead directly compute the mean squared error between $w_{\text{sum}}$ and $E$. Since the exponential function is one-to-one, both loss functions impose the same minimum. Using the error between exponentials may even introduce numerical instabilities in training, since the loss would take differences of potentially large numbers.
In addition to this change from [1], we also use the error between the cumulative sum of logits and the corresponding energy of each subsequence, instead of just the error between the total logit and the energy of the entire sequence. This gives more training data to the model and should help with logit matching for the intermediate tokens.
Since it was not explicitly shown in [1], another possible deviation we made is in the logit calculation during offline training. It seems that the logits in [1] were accumulated by looping through the sequential generation of tokens. To fill in the blanks, we implement the logit calculation in a manner that we think is more efficient: we directly pass the fixed training sequences to the model and retrieve the relevant logits from its output. This works because the attention blocks use a causal mask, so the logits of earlier tokens in a sequence are not affected by later tokens, which has the same effect as sequential token generation.
We initialize our GPT model below and see that it has around 85 million parameters. When saved, the model is 324.25 MB in size.
tokens = torch.from_numpy(train_token_seq).to("cuda")
energies = torch.from_numpy(train_sub_seq_en).to("cuda")

gpt = GPTQE(GPTConfig(
    vocab_size=op_pool_size + 1,
    block_size=seq_len,
    dropout=0.2,
    bias=False,
)).to("cuda")

opt = gpt.configure_optimizers(
    weight_decay=0.01, learning_rate=5e-5, betas=(0.9, 0.999), device_type="cuda"
)
number of parameters: 84.98M
num decayed parameter tensors: 50, with 84,963,072 parameters
num non-decayed parameter tensors: 25, with 19,200 parameters
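As a quick sanity check of the parallel logit computation described above (an optional, illustrative check; we switch to eval mode so that dropout does not interfere), the logits obtained from a single forward pass over a full training sequence should match those from an incremental, token-by-token evaluation, thanks to the causal attention mask:

# Optional check: causal masking makes one-shot and sequential logits agree
gpt.eval()
seq = tokens[:1]                      # one training sequence, shape (1, seq_len + 1)
full_logits = gpt(seq[:, :-1])        # logits for all positions in one pass
step_logits = gpt(seq[:, :2])[:, -1]  # logits after seeing only the first two tokens
print(torch.allclose(full_logits[:, 1], step_logits))  # expected: True
gpt.train()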
GPT offline training¶
We now implement a training loop for our GPT model. This can be framed as a straightforward supervised learning problem. We sketch the steps of each training iteration/epoch below:
- Shuffle the training set and split it into n_batches minibatches.
- For each minibatch, calculate the average loss and its gradients, then take an optimizer step.
- Every n-th iteration (n is 500 here), evaluate the GPT model:
  - Generate a batch of sequences and the predicted energies (total logits). Note that these are not necessarily the same sequences as in the training set.
  - Calculate the true energies using PennyLane.
  - Calculate the mean absolute error as a metric to track the learning progress, and save the GPT model every time the metric improves.
n_batches = 8
train_inds = np.arange(train_size)
losses = []
pred_Es_t = []
true_Es_t = []
current_mae = 10000

gpt.train()
for i in range(10000):
    # Shuffle batches of the training set
    np.random.shuffle(train_inds)
    token_batches = torch.tensor_split(tokens[train_inds], n_batches)
    energy_batches = torch.tensor_split(energies[train_inds], n_batches)

    # SGD on random minibatches
    loss_record = 0
    for token_batch, energy_batch in zip(token_batches, energy_batches):
        opt.zero_grad()
        loss = gpt.calculate_loss(token_batch, energy_batch)
        loss.backward()
        opt.step()
        loss_record += loss.item() / n_batches
    losses.append(loss_record)

    if (i + 1) % 500 == 0:
        # For GPT evaluation
        gpt.eval()
        gen_token_seq, pred_Es = gpt.generate(
            n_sequences=100,
            max_new_tokens=seq_len,
            temperature=0.001,  # Use a low temperature to emphasize the difference in logits
            device="cuda",
        )
        pred_Es = pred_Es.cpu().numpy()

        gen_inds = (gen_token_seq[:, 1:] - 1).cpu().numpy()
        gen_op_seq = op_pool[gen_inds]
        true_Es = get_subsequence_energies(gen_op_seq)[:, -1].reshape(-1, 1)

        mae = np.mean(np.abs(pred_Es - true_Es))
        ave_E = np.mean(true_Es)

        pred_Es_t.append(pred_Es)
        true_Es_t.append(true_Es)

        print(f"Iteration: {i+1}, Loss: {losses[-1]}, MAE: {mae}, Ave E: {ave_E}")

        if mae < current_mae:
            current_mae = mae
            torch.save(gpt, f"./seq_len={seq_len}/gqe.pt")
            print("Saved model!")

        gpt.train()

pred_Es_t = np.concatenate(pred_Es_t, axis=1)
true_Es_t = np.concatenate(true_Es_t, axis=1)
Iteration: 500, Loss: 0.004496691238049528, MAE: 0.13945468622863236, Ave E: -1.1161227981406456
Saved model!
Iteration: 1000, Loss: 0.001162520404255374, MAE: 0.11792013497926974, Ave E: -1.116178063434579
Saved model!
Iteration: 1500, Loss: 0.0006311882560414964, MAE: 0.08421050347067748, Ave E: -1.1304435666682537
Saved model!
Iteration: 2000, Loss: 0.0002220232025956396, MAE: 0.03313205549288038, Ave E: -1.13411711385679
Saved model!
Iteration: 2500, Loss: 9.021296506465553e-05, MAE: 0.03720317687198404, Ave E: -1.1360217383940532
Iteration: 3000, Loss: 0.00011929328764308375, MAE: 0.010246824522607662, Ave E: -1.1355033629645301
Saved model!
Iteration: 3500, Loss: 4.015137835017087e-05, MAE: 0.008332604993116905, Ave E: -1.1362737218253494
Saved model!
Iteration: 4000, Loss: 0.00025425587370956726, MAE: 0.03346923599957368, Ave E: -1.13442109812976
Iteration: 4500, Loss: 4.590269966149363e-05, MAE: 0.0086580669691949, Ave E: -1.1344678899103924
Iteration: 5000, Loss: 2.7407370499136962e-05, MAE: 0.006680762382889203, Ave E: -1.136412143925528
Saved model!
Iteration: 5500, Loss: 3.778071550021417e-05, MAE: 0.014272903220676704, Ave E: -1.1362969016861684
Iteration: 6000, Loss: 2.2792776141250974e-05, MAE: 0.007428675818214263, Ave E: -1.1367647064449693
Iteration: 6500, Loss: 1.9002385742602413e-05, MAE: 0.004431537870071902, Ave E: -1.135880723613281
Saved model!
Iteration: 7000, Loss: 1.5268728079291623e-05, MAE: 0.002464256235883442, Ave E: -1.1356989137037925
Saved model!
Iteration: 7500, Loss: 1.1030378864566936e-05, MAE: 0.007000517223791054, Ave E: -1.1360445255294285
Iteration: 8000, Loss: 7.638036884241474e-06, MAE: 0.0044611951680048586, Ave E: -1.1352658877947734
Iteration: 8500, Loss: 1.616690860258467e-05, MAE: 0.004094392133172753, Ave E: -1.1356437076129735
Iteration: 9000, Loss: 7.37882245331426e-06, MAE: 0.004240113290004896, Ave E: -1.1358971131175264
Iteration: 9500, Loss: 1.004411104422562e-05, MAE: 0.010631562300185794, Ave E: -1.1368761600775912
Iteration: 10000, Loss: 1.809862392776087e-05, MAE: 0.01987725166307399, Ave E: -1.1345492765523346
CPU times: user 2h 12min 24s, sys: 8.18 s, total: 2h 12min 32s
Wall time: 2h 12min 32s
With a preliminary look at the training logs above, we see that the offline training took 2 h and 12 min for 10,000 training iterations. The code execution time was measured by including the %%time magic command before the code block.
We also note that the mean absolute error between the predicted and true energies of the generated sequences quickly became smaller during the earlier parts of the training, but fluctuated more later on. As mentioned earlier, a model version is saved each time we get better performance; the best model version was saved at the 7,000th iteration, where the mean absolute error is at its lowest. The other quantity we observe is the average true energy of the generated sequences. It is promising to see that, throughout training, this stays close to the ground state energy grd_E = -1.1372633205048763 Ha.
GPT-QE results¶
Having finished the offline training, let’s take a look at some of our results.
Training loss curve¶
One of the first things we can look at is the training loss curve. Recall that for our case, the loss is the mean squared error between the cumulative sums of logits (the predicted subsequence energies) and the corresponding true subsequence energies. Reducing this quantity makes the model better at predicting correct energies and, in turn, at sampling sequences with lower energies. Since the raw loss values become very small, we plot the loss on a log scale below to magnify the trend.
import holoviews as hv
import hvplot.pandas
import pandas as pd
hvplot.extension('matplotlib')
losses = pd.read_csv("./seq_len=4/trial7/losses.csv")["0"]
loss_fig = losses.hvplot(
title="Training loss progress", ylabel="loss", xlabel="Training epochs", logy=True
).opts(fig_size=600, fontscale=2, aspect=1.2)
loss_fig

Figure 3: The subsequence loss for each training iteration¶
We see in Figure 3 that the loss continues to decrease until around the 4,000th iteration, where the model makes a large error but quickly recovers as training continues. This may signal that the GPT model started focusing on learning something erroneous too quickly, so more regularization noise (like dropout) may be needed to avoid this.
Evaluation progress¶
We now track the performance of the GPT model throughout its training in Figure 4 below. As mentioned before, at every 500th iteration we let the model generate a batch of sequences and also return the total logits (predicted energies) used in the sequence generation. In Figure 4, the average predicted energies correspond to the red markers, and the distribution of predicted energies is represented by the red area. Once we have the generated sequences, we can also let PennyLane calculate the true sequence energies. Similarly, the blue markers are the average true energies and the blue area represents the true energy distribution.
df_true = pd.read_csv("./seq_len=4/trial7/true_Es_t.csv").iloc[:, 1:]
df_pred = pd.read_csv("./seq_len=4/trial7/pred_Es_t.csv").iloc[:, 1:]
df_true.columns = df_true.columns.astype(int)
df_pred.columns = df_pred.columns.astype(int)
df_trues_stats = pd.concat([df_true.mean(axis=0), df_true.min(axis=0), df_true.max(axis=0)], axis=1).reset_index()
df_trues_stats.columns = ["Training Iterations", "Ave True E", "Min True E", "Max True E"]
df_preds_stats = pd.concat([df_pred.mean(axis=0), df_pred.min(axis=0), df_pred.max(axis=0)], axis=1).reset_index()
df_preds_stats.columns = ["Training Iterations", "Ave Pred E", "Min Pred E", "Max Pred E"]
fig = (
df_trues_stats.hvplot.scatter(x="Training Iterations", y="Ave True E", label="Mean True Energies") *
df_trues_stats.hvplot.line(x="Training Iterations", y="Ave True E", alpha=0.5, linewidth=1) *
df_trues_stats.hvplot.area(x="Training Iterations", y="Min True E", y2="Max True E", alpha=0.1)
) * (
df_preds_stats.hvplot.scatter(x="Training Iterations", y="Ave Pred E", label="Mean Predicted Energies") *
df_preds_stats.hvplot.line(x="Training Iterations", y="Ave Pred E", alpha=0.5, linewidth=1) *
df_preds_stats.hvplot.area(x="Training Iterations", y="Min Pred E", y2="Max Pred E", alpha=0.1)
)
fig = fig * hv.Curve([[0, grd_E], [10000, grd_E]], label="Ground State Energy").opts(color="k", alpha=0.4, linestyle="dashed")
fig = fig.opts(ylabel="Sequence Energies", title="GQE Evaluations", fig_size=600, fontscale=2)
fig

Figure 4: True and predicted energies for sequences generated by the GPT model for each 500th training iteration¶
We now see that the energies predicted by the model improve in accuracy, aligning more closely with the true energies during training. The increase in accuracy then allows the model to correctly sample states with lower energies. This is supported in Figure 4 where the sampled true energies get closer to the ground state energy (the dashed line).
Note that at around the 4,000th iteration, the predicted energies are very far from the true energies, which makes sense given our observation in Figure 3. Also note that at the 7,000th iteration, the averages of the predicted and true energies are the closest, and even their respective spreads overlap well. This is when the best-performing model version was saved, as seen in the training logs. For later iterations, however, the predicted energies no longer improved. This may indicate that the GPT model started overfitting on the training dataset: the model became great at predicting the correct energies for the training set (as observed by the decreasing loss in Figure 3), but not at generalizing to sequences outside it (like the sequences the model generated on its own). One way to avoid such overfitting could be online training, so that the GPT model is not restricted to a fixed dataset on which to overfit.
Sequence generation comparison¶
Here, we compare some statistics of the true energies corresponding to sequences generated by a “random” model, the latest version of the model after all the training iterations, and the best model version saved based on the mean absolute error between the true and predicted energies during training (for our case, this is the model saved at the 7,000th iteration). Note that we consider the training set to be generated by a random model since the token sequences are just sampled uniformly.
# Latest model
gen_token_seq_, _ = gpt.generate(
    n_sequences=1024,
    max_new_tokens=seq_len,
    temperature=0.001,
    device="cuda",
)
gen_inds_ = (gen_token_seq_[:, 1:] - 1).cpu().numpy()
gen_op_seq_ = op_pool[gen_inds_]
true_Es_ = get_subsequence_energies(gen_op_seq_)[:, -1].reshape(-1, 1)

# Best model
loaded = torch.load("./seq_len=4/trial7/gqe.pt")
loaded_token_seq_, _ = loaded.generate(
    n_sequences=1024,
    max_new_tokens=seq_len,
    temperature=0.001,
    device="cuda",
)
loaded_inds_ = (loaded_token_seq_[:, 1:] - 1).cpu().numpy()
loaded_op_seq_ = op_pool[loaded_inds_]
loaded_true_Es_ = get_subsequence_energies(loaded_op_seq_)[:, -1].reshape(-1, 1)

# Summary table
df_compare_Es = pd.DataFrame({
    "Source": ["Random", "Latest Model", "Best Model"],
    "Aves": [train_sub_seq_en[:, -1].mean(), true_Es_.mean(), loaded_true_Es_.mean()],
    "Mins": [train_sub_seq_en[:, -1].min(), true_Es_.min(), loaded_true_Es_.min()],
    "Maxs": [train_sub_seq_en[:, -1].max(), true_Es_.max(), loaded_true_Es_.max()],
    "Mins_error": [
        abs(train_sub_seq_en[:, -1].min() - grd_E),
        abs(true_Es_.min() - grd_E),
        abs(loaded_true_Es_.min() - grd_E),
    ],
})
df_compare_Es
Source Aves Mins Maxs Mins_error
0 Random -1.114531 -1.136982 -1.027878 2.811117e-04
1 Latest Model -1.132200 -1.137038 -1.125118 2.253048e-04
2 Best Model -1.135560 -1.137263 -1.125118 7.712067e-07
We observe that the minimum energy of the random training set is already close to the ground state energy grd_E = -1.1372633205048763 Ha, with an error of around 2.81e-04 Ha. But the maximum energy is relatively far, so the random sequences give a wide spread of energies. The energies of the sequences generated by the latest and best GPT model versions, however, have a narrower spread, and their average and minimum energies are very close to grd_E, closer than those in the random training set: around a 2.25e-04 Ha error for the minimum energy generated by the latest model version, and a 7.71e-07 Ha error for the best model version. It is then very interesting that a well-trained GPT model can generate sequences better than those in the training set, even though every sequence the model has seen came from that set. That is, the model was able to generalize.
Between the two GPT model versions, the latest version is worse than the best version: the minimum-energy error for the latest version has the same order of magnitude as that of the random training set, while the error for the best version is 3 orders of magnitude smaller. This is consistent with our observation in Figure 4 that the performance of the model worsened after the 7,000th iteration. That is, the predicted energies started to deviate further from the true energies, which in turn caused the sampled states to differ from the intended lower-energy states.
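As a final check on the table above, we can compare each minimum-energy error against chemical accuracy, taken here as roughly 1.6e-3 Ha (about 1 kcal/mol); the BelowChemAcc column name is our own addition:

# Compare the minimum-energy errors with chemical accuracy (~1.6e-3 Ha)
chemical_accuracy = 1.6e-3  # Ha, approximately 1 kcal/mol
print(df_compare_Es.assign(BelowChemAcc=df_compare_Es["Mins_error"] < chemical_accuracy))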
Conclusion¶
In this demo, we saw that GPT-QE is a viable alternative for estimating the ground-state energy of a hydrogen molecule. The best underlying GPT model version can generate a state whose energy is only around 7.71e-07 Ha away from the ground state energy, well below chemical accuracy. Additionally, since the GPT model being optimized is completely detached from the quantum simulations, there is, for example, no need to compute gradients across a potentially large quantum circuit. The machinery of classical ML can thus be harnessed without worrying too much about the quantum algorithm side of the problem.
The reader can also experiment with other molecules from PennyLane, tweak several hyperparameters of the GPT model (like dropout and n_layer), and include standard ML callbacks in its training (like an early stopping mechanism and a learning rate schedule). The code itself is also open to further optimization, for instance, using PennyLane Lightning to evaluate the energies faster. An online training loop can also be implemented similarly to our offline version; the reader would just need to sample sequences from the current GPT model, instead of a fixed dataset, at each training iteration. To balance exploration and exploitation, one would also define a schedule for the inverse temperature: initially letting the GPT model sample more randomly through a high temperature, then gradually decreasing it so that the model focuses more on the higher-probability (lower-energy) states.
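To make this extension concrete, here is a minimal sketch of such an online loop with an annealed temperature. This is an illustration, not the implementation from [1]: the hyperparameters n_online_iters, t_start, t_min, and decay are hypothetical choices, and the loop reuses the gpt, opt, op_pool, and get_subsequence_energies objects defined earlier.

# A minimal sketch of online training (hypothetical hyperparameters)
n_online_iters, t_start, t_min, decay = 1000, 1.0, 0.001, 0.995

for i in range(n_online_iters):
    temperature = max(t_min, t_start * decay**i)  # anneal from exploration to exploitation

    # sample sequences from the current model instead of a fixed dataset
    gpt.eval()
    gen_token_seq, _ = gpt.generate(
        n_sequences=64, max_new_tokens=seq_len, temperature=temperature, device="cuda"
    )

    # evaluate the sampled sequences with PennyLane
    gen_op_seq = op_pool[(gen_token_seq[:, 1:] - 1).cpu().numpy()]
    gen_energies = torch.from_numpy(get_subsequence_energies(gen_op_seq)).to("cuda")

    # take one optimizer step on the freshly generated data
    gpt.train()
    opt.zero_grad()
    loss = gpt.calculate_loss(gen_token_seq, gen_energies)
    loss.backward()
    opt.step()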
References¶
[1] Kouhei Nakaji et al., "The generative quantum eigensolver (GQE) and its application for ground state search". arXiv:2401.09253 (2024).
[2] Alec Radford et al., "Language Models are Unsupervised Multitask Learners". OpenAI blog, 1(8):9 (2019).
[3] Ashish Vaswani et al., "Attention is All you Need". Advances in Neural Information Processing Systems, 30 (2017).
Joseph Bunao
Developer specializing in machine learning applications for quantum hardware and software research
Zy Niu
AI Tooling for Researchers.