Generative quantum eigensolver (GQE) training using PennyLane data¶
Published: September 20, 2024. Last updated: November 8, 2024.
We will be demonstrating and evaluating a novel algorithm proposed by Nakaji et al. in their paper The generative quantum eigensolver (GQE) and its application for ground state search, which employs a classical generative model of quantum circuits for ground-state energy estimation of a molecular Hamiltonian [1]. It was proposed as a scalable alternative to the variational quantum eigensolver (VQE) approach, where the quantum state is represented as a quantum circuit with tunable parameters that are optimized during training to arrive at a state minimizing the corresponding energy E. In GQE, by contrast, the structure of the quantum circuit is produced by a trained generative model.
We will primarily focus on offline training on a fixed training dataset – thanks to the molecular data available in PennyLane Datasets. By the end of the demo, we will show that the model gradually provides a better estimate for the energies and, in turn, can sample energies close to the ground state energy calculated by PennyLane.
This demo is organized as follows. Firstly, we compare the GQE and VQE algorithms, and afterwards, we dive deeper into describing GPT-QE, i.e., the GQE algorithm that uses a GPT model, and its training. Next, we generate the training dataset for our GPT model using PennyLane, and give details on our model architecture and training implementation. After that, we evaluate the model throughout its training and discuss its performance in estimating the ground state. And lastly, we discuss the results, potential ways of optimizing the code, and its extension to the "online training" phase.
GQE vs. VQE¶
Despite the relative success of VQE, there are issues with its trainability for large problem instances [1]. This shortcoming makes it less competitive against classical machine learning (ML) algorithms for large problems. To bypass this, the GQE algorithm was proposed. Specifically, GQE uses a classical generative model from which quantum circuits are sampled as sequences of unitaries drawn from a given operator pool. This generative model is then trained to predict quantum circuits that evolve an initial state to states that better approximate the ground state.
The main difference between the two approaches is where the tunable parameters are embedded: it is the classical GQE model that is optimized, as opposed to the variational quantum circuit of VQE. GQE can thereby sidestep the barren plateau problem of VQE and the evaluation of quantum gradients across large circuits, making it more amenable to larger problems.

GPT-QE background¶
In particular, the model architecture used by Nakaji et al. [1] for GQE was a generative pre-trained transformer (GPT) [2], [3], a choice reflected in the name GPT-QE. As language models, GPTs are successful at generating sequences of words that closely resemble human natural language. This performance is harnessed for quantum chemistry by constructing quantum states $\rho$ as a sequence of unitary operators which are, in turn, represented by quantum circuits. That is, we let $\rho = U \rho_0 U^\dagger$ for some fixed initial state $\rho_0$, where the aforementioned sequence is $U = U_{j_N} U_{j_{N-1}} \cdots U_{j_1}$. The GPT model samples a sequence of integers $j_1, j_2, \ldots, j_N$ indexing a pool of operators $\{U_j\}$ generated using molecular data from PennyLane Molecules. In the parlance of language models, we interpret these integers as tokens and the pool as the vocabulary. The goal of training is then to minimize the corresponding energy $E = \operatorname{Tr}(\hat{H}\rho)$, where $\hat{H}$ is the Hamiltonian of the molecule in question.
Each token $j_i$ is sampled with probability proportional to $\exp(-\beta w_{j_i})$, where $w_{j_i}$ is the unnormalized log-probability (or logit) returned by the GPT model for the token $j_i$, and $\beta$ is an inverse temperature controlling the trade-off between exploration and exploitation. The probability of sampling a state through this method is then proportional to $\exp(-\beta w_{\text{sum}})$, where $w_{\text{sum}} = \sum_{i=1}^{N} w_{j_i}$, while the desired probability for the corresponding energy is $\exp(-\beta E)$. We thus have a constraint that the total logit equal the energy of the corresponding state, $w_{\text{sum}} = E$, which can be imposed by training GPT-QE to minimize the loss function $C = (w_{\text{sum}} - E)^2$. With this constraint satisfied, GPT-QE would sample states of smaller energies with increasing likelihood.
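To make the sampling rule concrete, here is a small illustrative sketch (with hypothetical logit values, not part of the GPT-QE implementation) of how a single token would be drawn from model logits and how the loss for one sequence would be computed:

import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.3, -1.2, 0.7, -0.5])  # hypothetical logits w_j for a 4-operator pool
beta = 5.0                            # inverse temperature

# token probabilities proportional to exp(-beta * w_j): lower logits are favored
probs = np.exp(-beta * w)
probs /= probs.sum()
token = rng.choice(len(w), p=probs)

# training target: the total logit of a sampled sequence should match its energy,
# giving the per-sequence loss C = (w_sum - E)^2
w_sum, E = -1.05, -1.10               # hypothetical total logit and true energy
loss = (w_sum - E) ** 2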
More concretely, we summarize the pre-training loop in Figure 2. It is called pre-training because it involves learning from a fixed dataset first, before transitioning to the "real" training which utilizes data generated by the model itself. In this demo, we will refer to pre-training as offline training, since the GPT model does not receive feedback from the sequences it samples, and to the latter phase as online training.

Figure 2: Overview for offline training of GPT-QE¶
Dataset construction via PennyLane¶
Firstly, let us construct the static dataset we will use for offline training. We choose to generate our own dataset in order to illustrate the sequences and energies more concretely. Our dataset will be made of random sequences of tokens, which, as described above, correspond to indices into a vocabulary of unitary operators. We then define an energy function in PennyLane to calculate the energy of the state corresponding to a token sequence. Applying this function, we obtain a dataset of token sequences and energies for offline training of the GPT model.
Loading molecular information¶
For simplicity, let us consider the hydrogen molecule and load the corresponding dataset from PennyLane. Recall that we need a vocabulary of operators $U_j$, an initial state $\rho_0$, and the Hamiltonian $\hat{H}$ for a hydrogen molecule. We also get the ground state energy for later comparison with the results.
Specifically, the unitary operators $U_j$ are time-evolution operators, as prescribed in [1]. The non-identity operators are generated in PennyLane using SingleExcitation and DoubleExcitation, which depend on the number of electrons and orbitals of the molecule.
import numpy as np
import pennylane as qml


def generate_molecule_data(molecules="H2"):
    datasets = qml.data.load("qchem", molname=molecules)

    # Get the time set T
    op_times = np.sort(np.array([-2**k for k in range(1, 5)] + [2**k for k in range(1, 5)]) / 160)

    # Build operator set P for each molecule
    molecule_data = dict()
    for dataset in datasets:
        molecule = dataset.molecule
        num_electrons, num_qubits = molecule.n_electrons, 2 * molecule.n_orbitals
        singles, doubles = qml.qchem.excitations(num_electrons, num_qubits)
        double_excs = [qml.DoubleExcitation(time, wires=double) for double in doubles for time in op_times]
        single_excs = [qml.SingleExcitation(time, wires=single) for single in singles for time in op_times]
        identity_ops = [qml.exp(qml.I(range(num_qubits)), 1j * time) for time in op_times]  # For the identity
        operator_pool = double_excs + single_excs + identity_ops
        molecule_data[dataset.molname] = {
            "op_pool": np.array(operator_pool),
            "num_qubits": num_qubits,
            "hf_state": dataset.hf_state,
            "hamiltonian": dataset.hamiltonian,
            "expected_ground_state_E": dataset.fci_energy,
        }
    return molecule_data


molecule_data = generate_molecule_data("H2")
h2_data = molecule_data["H2"]
op_pool = h2_data["op_pool"]
num_qubits = h2_data["num_qubits"]
init_state = h2_data["hf_state"]
hamiltonian = h2_data["hamiltonian"]
grd_E = h2_data["expected_ground_state_E"]
op_pool_size = len(op_pool)
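At this point, it can be helpful to peek at what we loaded. The snippet below is an optional, illustrative check; the exact pool size depends on the excitations PennyLane generates for the molecule.

# Quick inspection of the loaded data (output depends on the dataset)
print(op_pool_size)   # pool size: (n_singles + n_doubles + 1) * number of time steps
print(op_pool[0])     # one time-evolution operator from the pool
print(init_state)     # the Hartree-Fock state used as the initial state
print(grd_E)          # FCI ground state energy for later comparison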
Defining the energy function¶
In PennyLane, we define the energy function $E = \operatorname{Tr}\left(\hat{H}\, U_{j_N} \cdots U_{j_1} \rho_0 U_{j_1}^\dagger \cdots U_{j_N}^\dagger\right)$ corresponding to Eq. 1 of [1]. Here, energy_circuit takes in the operator sequence $U_{j_1}, U_{j_2}, \ldots, U_{j_N}$ and returns the energy of the corresponding quantum state.
As a slight extension, we also calculate the energies for each subsequence of operators to help with the training of the model. That is, for a sequence of three operators $U_{j_1}, U_{j_2}, U_{j_3}$, we compute the energies for $U_{j_1}$ and $U_{j_2} U_{j_1}$ in addition to the energy of the full sequence, which is what was described in [1]. This can be done simply in PennyLane using Snapshot, as shown below.
dev = qml.device("default.qubit", wires=num_qubits)


@qml.qnode(dev)
def energy_circuit(gqe_ops):
    # Computes Eq. 1 from Nakaji et al. based on the selected unitary operators
    qml.BasisState(init_state, wires=range(num_qubits))  # Initial state <-- Hartree-Fock state
    for op in gqe_ops:
        qml.Snapshot(measurement=qml.expval(hamiltonian))
        qml.apply(op)  # Applies each of the unitary operators
    return qml.expval(hamiltonian)


energy_circuit = qml.snapshots(energy_circuit)


def get_subsequence_energies(op_seq):
    # Collates the energies of each subsequence for a batch of sequences
    energies = []
    for ops in op_seq:
        es = energy_circuit(ops)
        energies.append(
            [es[k].item() for k in list(range(1, len(ops))) + ["execution_results"]]
        )
    return np.array(energies)
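As a quick usage example (a sketch with a randomly drawn sequence, not part of the dataset generation itself), a batch containing one 3-operator sequence should yield three subsequence energies:

# Example: subsequence energies for one random sequence of 3 operators
example_ops = op_pool[np.random.randint(op_pool_size, size=(1, 3))]
print(get_subsequence_energies(example_ops))  # expected shape: (1, 3)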
Token sequence generation with corresponding energies¶
With these ingredients, we can now construct a dataset containing sequences of tokens and their energies. Since we cannot feed the operators directly to the GPT model, we need to tokenize them. The indices of op_pool seem to be a good candidate, but we instead choose the tokens to be the op_pool indices shifted by 1. This lets us define a special token 0 that tells the GPT model where the sequence starts.
We generate train_size random operator sequences of length seq_len and calculate the energies of the sequences and all of their subsequences.
# Generate sequences of indices of operators in the vocabulary
train_size = 1024
seq_len = 4
train_op_pool_inds = np.random.randint(op_pool_size, size=(train_size, seq_len))

# Corresponding sequences of operators
train_op_seq = op_pool[train_op_pool_inds]

# Corresponding token sequences with a special starting token prepended
train_token_seq = np.concatenate([
    np.zeros(shape=(train_size, 1), dtype=int),  # starting token is 0
    train_op_pool_inds + 1,  # shift operator indices by one
], axis=1)

# Calculate the energies for each subsequence in the training set
train_sub_seq_en = get_subsequence_energies(train_op_seq)
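Before moving on, it is worth sanity-checking the array shapes. This is a small illustrative check; the printed shapes follow from the train_size and seq_len values above.

# Sanity check of dataset shapes
print(train_token_seq.shape)   # (1024, 5): seq_len tokens plus the starting token 0
print(train_sub_seq_en.shape)  # (1024, 4): one energy per subsequence length
print(train_token_seq[0])      # e.g. [0, t1, t2, t3, t4] with tokens t_i >= 1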
GPT-QE offline training¶
Having set up our training dataset, we can start implementing our offline training loop, as illustrated in Figure 2. We outline our implementation below.
GPT model implementation details¶
The GPT model we will use in this demo is mostly implemented in the nanoGPT repo (a reimplementation of the OpenAI GPT-2) as the class GPT, with the model hyperparameters stored in the dataclass GPTConfig. Namely, we will use 12 attention layers, 12 attention heads, and 768 embedding dimensions, matching those described in [1].
We can import from the nanoGPT repo directly by running the curl command (commented out below) in a Jupyter notebook.
Since nanoGPT is trained as a language model, its loss function and sampling method are defined differently. We therefore define the subclass GPTQE below to override some nanoGPT methods and make them more suitable for our case.
# !curl -O https://raw.githubusercontent.com/karpathy/nanoGPT/master/model.py
from model import GPT, GPTConfig

import torch
from torch.nn import functional as F


class GPTQE(GPT):
    def forward(self, idx):
        device = idx.device
        b, t = idx.size()
        pos = torch.arange(0, t, dtype=torch.long, device=device)  # shape (t)

        # forward the GPT model itself
        tok_emb = self.transformer.wte(idx)  # token embeddings of shape (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos)  # position embeddings of shape (t, n_embd)
        x = self.transformer.drop(tok_emb + pos_emb)
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)
        return logits

    def calculate_loss(self, tokens, energies):
        current_tokens, next_tokens = tokens[:, :-1], tokens[:, 1:]
        # calculate the logits for the next possible tokens in the sequence
        logits = self(current_tokens)
        # get the logit for the actual next token in the sequence
        next_token_mask = torch.nn.functional.one_hot(
            next_tokens, num_classes=self.config.vocab_size
        )
        next_token_logits = (logits * next_token_mask).sum(axis=2)
        # calculate the cumulative logits for each subsequence
        cumsum_logits = torch.cumsum(next_token_logits, dim=1)
        # match cumulative logits to subsequence energies
        loss = torch.mean(torch.square(cumsum_logits - energies))
        return loss

    @torch.no_grad()
    def generate(self, n_sequences, max_new_tokens, temperature=1.0, device="cpu"):
        idx = torch.zeros(size=(n_sequences, 1), dtype=int, device=device)
        total_logits = torch.zeros(size=(n_sequences, 1), device=device)
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
            # forward the model to get the logits for the next index in the sequence
            logits = self(idx_cond)
            # pluck the logits at the final step
            logits = logits[:, -1, :]
            # set the logit of the starting token so that its probability of being sampled is zero
            logits[:, 0] = float("inf")
            # apply softmax to convert logits to (normalized) probabilities, scaled by the desired temperature
            probs = F.softmax(-logits / temperature, dim=-1)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)
            # accumulate the logits of the sampled tokens
            total_logits += torch.gather(logits, index=idx_next, dim=1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)
        return idx, total_logits
However, it is important to note that the loss function calculate_loss we defined differs from the one described in [1], which is $(\exp(-w_{\text{sum}}) - \exp(-E))^2$. As described above, we instead directly compute the mean squared error between $w_{\text{sum}}$ and $E$. Since the exponential function is one-to-one, both loss functions impose the same minimum. Using the error between exponentials may even introduce numerical instabilities in training, since the loss would take differences of potentially large numbers.
In addition to this change from [1], we also use the error between the cumulative sum of logits and the corresponding energy of each subsequence, instead of just the error between the total logit and the energy of the entire sequence. This gives more training data to the model and should help with logit matching for the intermediate tokens.
Since it was not explicitly shown in [1], another possible deviation we made is in the logit calculation during offline training. It seems that the logits in [1] were accumulated by looping through the sequential generation of tokens. To fill in the blanks, we implement the logit calculation in a manner that we think is more efficient: we directly pass the fixed training sequences to the model and retrieve the relevant logits from its output. This works because the attention blocks use a causal mask, so the logits of earlier tokens in a sequence are not affected by later tokens, which has the same effect as sequential token generation.
We initialize our GPT model below and see that it has around 85 million parameters. When saved, the model is 324.25 MB in size.
tokens = torch.from_numpy(train_token_seq).to("cuda")
energies = torch.from_numpy(train_sub_seq_en).to("cuda")

gpt = GPTQE(GPTConfig(
    vocab_size=op_pool_size + 1,
    block_size=seq_len,
    dropout=0.2,
    bias=False,
)).to("cuda")

opt = gpt.configure_optimizers(
    weight_decay=0.01, learning_rate=5e-5, betas=(0.9, 0.999), device_type="cuda"
)
number of parameters: 84.98M
num decayed parameter tensors: 50, with 84,963,072 parameters
num non-decayed parameter tensors: 25, with 19,200 parameters
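As a quick sanity check of the parallel logit computation described above (an optional, illustrative check; we switch to eval mode so that dropout does not interfere), the logits obtained from a single forward pass over a full training sequence should match those from an incremental, token-by-token evaluation, thanks to the causal attention mask:

# Optional check: causal masking makes one-shot and sequential logits agree
gpt.eval()
seq = tokens[:1]                      # one training sequence, shape (1, seq_len + 1)
full_logits = gpt(seq[:, :-1])        # logits for all positions in one pass
step_logits = gpt(seq[:, :2])[:, -1]  # logits after seeing only the first two tokens
print(torch.allclose(full_logits[:, 1], step_logits))  # expected: True
gpt.train()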
GPT offline training¶
We now implement a training loop for our GPT model. This can be framed as a straightforward supervised learning problem. We sketch the steps of each training iteration/epoch below:
- Shuffle the training set and split it into n_batches minibatches.
- For each minibatch, calculate the average loss and its gradients, then take an optimizer step.
- Every n-th iteration (n is 500 here), evaluate the GPT model:
  - Generate a batch of sequences and the predicted energies (total logits). Note that these are not necessarily the same sequences as in the training set.
  - Calculate the true energies using PennyLane.
  - Calculate the mean absolute error as a metric to track the learning progress, and save the GPT model every time the metric improves.
n_batches = 8
train_inds = np.arange(train_size)
losses = []
pred_Es_t = []
true_Es_t = []
current_mae = 10000

gpt.train()
for i in range(10000):
    # Shuffle batches of the training set
    np.random.shuffle(train_inds)
    token_batches = torch.tensor_split(tokens[train_inds], n_batches)
    energy_batches = torch.tensor_split(energies[train_inds], n_batches)

    # SGD on random minibatches
    loss_record = 0
    for token_batch, energy_batch in zip(token_batches, energy_batches):
        opt.zero_grad()
        loss = gpt.calculate_loss(token_batch, energy_batch)
        loss.backward()
        opt.step()
        loss_record += loss.item() / n_batches
    losses.append(loss_record)

    if (i + 1) % 500 == 0:
        # For GPT evaluation
        gpt.eval()
        gen_token_seq, pred_Es = gpt.generate(
            n_sequences=100,
            max_new_tokens=seq_len,
            temperature=0.001,  # Use a low temperature to emphasize the difference in logits
            device="cuda",
        )
        pred_Es = pred_Es.cpu().numpy()

        gen_inds = (gen_token_seq[:, 1:] - 1).cpu().numpy()
        gen_op_seq = op_pool[gen_inds]
        true_Es = get_subsequence_energies(gen_op_seq)[:, -1].reshape(-1, 1)

        mae = np.mean(np.abs(pred_Es - true_Es))
        ave_E = np.mean(true_Es)

        pred_Es_t.append(pred_Es)
        true_Es_t.append(true_Es)

        print(f"Iteration: {i+1}, Loss: {losses[-1]}, MAE: {mae}, Ave E: {ave_E}")

        if mae < current_mae:
            current_mae = mae
            torch.save(gpt, f"./seq_len={seq_len}/gqe.pt")
            print("Saved model!")

        gpt.train()

pred_Es_t = np.concatenate(pred_Es_t, axis=1)
true_Es_t = np.concatenate(true_Es_t, axis=1)
Iteration: 500, Loss: 0.004496691238049528, MAE: 0.13945468622863236, Ave E: -1.1161227981406456
Saved model!
Iteration: 1000, Loss: 0.001162520404255374, MAE: 0.11792013497926974, Ave E: -1.116178063434579
Saved model!
Iteration: 1500, Loss: 0.0006311882560414964, MAE: 0.08421050347067748, Ave E: -1.1304435666682537
Saved model!
Iteration: 2000, Loss: 0.0002220232025956396, MAE: 0.03313205549288038, Ave E: -1.13411711385679
Saved model!
Iteration: 2500, Loss: 9.021296506465553e-05, MAE: 0.03720317687198404, Ave E: -1.1360217383940532
Iteration: 3000, Loss: 0.00011929328764308375, MAE: 0.010246824522607662, Ave E: -1.1355033629645301
Saved model!
Iteration: 3500, Loss: 4.015137835017087e-05, MAE: 0.008332604993116905, Ave E: -1.1362737218253494
Saved model!
Iteration: 4000, Loss: 0.00025425587370956726, MAE: 0.03346923599957368, Ave E: -1.13442109812976
Iteration: 4500, Loss: 4.590269966149363e-05, MAE: 0.0086580669691949, Ave E: -1.1344678899103924
Iteration: 5000, Loss: 2.7407370499136962e-05, MAE: 0.006680762382889203, Ave E: -1.136412143925528
Saved model!
Iteration: 5500, Loss: 3.778071550021417e-05, MAE: 0.014272903220676704, Ave E: -1.1362969016861684
Iteration: 6000, Loss: 2.2792776141250974e-05, MAE: 0.007428675818214263, Ave E: -1.1367647064449693
Iteration: 6500, Loss: 1.9002385742602413e-05, MAE: 0.004431537870071902, Ave E: -1.135880723613281
Saved model!
Iteration: 7000, Loss: 1.5268728079291623e-05, MAE: 0.002464256235883442, Ave E: -1.1356989137037925
Saved model!
Iteration: 7500, Loss: 1.1030378864566936e-05, MAE: 0.007000517223791054, Ave E: -1.1360445255294285
Iteration: 8000, Loss: 7.638036884241474e-06, MAE: 0.0044611951680048586, Ave E: -1.1352658877947734
Iteration: 8500, Loss: 1.616690860258467e-05, MAE: 0.004094392133172753, Ave E: -1.1356437076129735
Iteration: 9000, Loss: 7.37882245331426e-06, MAE: 0.004240113290004896, Ave E: -1.1358971131175264
Iteration: 9500, Loss: 1.004411104422562e-05, MAE: 0.010631562300185794, Ave E: -1.1368761600775912
Iteration: 10000, Loss: 1.809862392776087e-05, MAE: 0.01987725166307399, Ave E: -1.1345492765523346
CPU times: user 2h 12min 24s, sys: 8.18 s, total: 2h 12min 32s
Wall time: 2h 12min 32s
With a preliminary look at the training logs above, we see that the offline training took 2 h and 12 min for 10,000 training iterations. The code execution time was measured by including the %%time magic command before the code block.
We also note that the mean absolute error between the predicted and true energies of the generated sequences quickly became smaller during the earlier parts of the training, but fluctuated more later on. As mentioned earlier, a model version is saved each time we get better performance; the best model version was saved at the 7,000th iteration, where the mean absolute error is at its lowest. The other quantity we observe is the average true energy of the generated sequences. It is promising to see that, throughout training, this stays close to the ground state energy grd_E = -1.1372633205048763 Ha.
GPT-QE results¶
Having finished the offline training, let’s take a look at some of our results.
Training loss curve¶
One of the first things we can look at is the training loss curve. Recall that for our case, the loss is the mean squared error between the cumulative sums of logits (the predicted subsequence energies) and the corresponding true subsequence energies. Reducing this quantity makes the model better at predicting correct energies and, in turn, at sampling sequences with lower energies. Since the raw loss values become very small, we plot the loss on a log scale below to magnify the trend.
import holoviews as hv
import hvplot.pandas
import pandas as pd
hvplot.extension('matplotlib')
losses = pd.read_csv("./seq_len=4/trial7/losses.csv")["0"]
loss_fig = losses.hvplot(
title="Training loss progress", ylabel="loss", xlabel="Training epochs", logy=True
).opts(fig_size=600, fontscale=2, aspect=1.2)
loss_fig

Figure 3: The subsequence loss for each training iteration¶
We see in Figure 3 that the loss continues to decrease until around the 4,000th iteration, where the model makes a large error but quickly recovers as training continues. This may signal that the GPT model started focusing on learning something erroneous too quickly, so more regularization noise (like dropout) may be needed to avoid this.
Evaluation progress¶
We now track the performance of the GPT model throughout its training in Figure 4 below. As mentioned before, at every 500th iteration we let the model generate a batch of sequences and also return the total logits (predicted energies) used in the sequence generation. In Figure 4, the average predicted energies correspond to the red markers, and the distribution of predicted energies is represented by the red area. Once we have the generated sequences, we can also let PennyLane calculate the true sequence energies. Similarly, the blue markers are the average true energies and the blue area represents the true energy distribution.
df_true = pd.read_csv("./seq_len=4/trial7/true_Es_t.csv").iloc[:, 1:]
df_pred = pd.read_csv("./seq_len=4/trial7/pred_Es_t.csv").iloc[:, 1:]
df_true.columns = df_true.columns.astype(int)
df_pred.columns = df_pred.columns.astype(int)
df_trues_stats = pd.concat([df_true.mean(axis=0), df_true.min(axis=0), df_true.max(axis=0)], axis=1).reset_index()
df_trues_stats.columns = ["Training Iterations", "Ave True E", "Min True E", "Max True E"]
df_preds_stats = pd.concat([df_pred.mean(axis=0), df_pred.min(axis=0), df_pred.max(axis=0)], axis=1).reset_index()
df_preds_stats.columns = ["Training Iterations", "Ave Pred E", "Min Pred E", "Max Pred E"]
fig = (
df_trues_stats.hvplot.scatter(x="Training Iterations", y="Ave True E", label="Mean True Energies") *
df_trues_stats.hvplot.line(x="Training Iterations", y="Ave True E", alpha=0.5, linewidth=1) *
df_trues_stats.hvplot.area(x="Training Iterations", y="Min True E", y2="Max True E", alpha=0.1)
) * (
df_preds_stats.hvplot.scatter(x="Training Iterations", y="Ave Pred E", label="Mean Predicted Energies") *
df_preds_stats.hvplot.line(x="Training Iterations", y="Ave Pred E", alpha=0.5, linewidth=1) *
df_preds_stats.hvplot.area(x="Training Iterations", y="Min Pred E", y2="Max Pred E", alpha=0.1)
)
fig = fig * hv.Curve([[0, grd_E], [10000, grd_E]], label="Ground State Energy").opts(color="k", alpha=0.4, linestyle="dashed")
fig = fig.opts(ylabel="Sequence Energies", title="GQE Evaluations", fig_size=600, fontscale=2)
fig

Figure 4: True and predicted energies for sequences generated by the GPT model for each 500th training iteration¶
We now see that the energies predicted by the model improve in accuracy, aligning more closely with the true energies during training. The increase in accuracy then allows the model to correctly sample states with lower energies. This is supported in Figure 4 where the sampled true energies get closer to the ground state energy (the dashed line).
Note that at around the 4,000th iteration, the predicted energies are very far from the true energies, which makes sense given our observation in Figure 3. Also note that at the 7,000th iteration, the averages of the predicted and true energies are the closest, and even their respective spreads overlap well. This is when the best-performing model version was saved, as seen in the training logs. For later iterations, however, the predicted energies no longer improved. This may indicate that the GPT model started overfitting on the training dataset: the model became great at predicting the correct energies for the training set (as observed by the decreasing loss in Figure 3), but not at generalizing to sequences outside it (like the sequences the model generated on its own). One way to avoid such overfitting could be online training, so that the GPT model is not restricted to a fixed dataset on which to overfit.
Sequence generation comparison¶
Here, we compare some statistics of the true energies corresponding to sequences generated by a “random” model, the latest version of the model after all the training iterations, and the best model version saved based on the mean absolute error between the true and predicted energies during training (for our case, this is the model saved at the 7,000th iteration). Note that we consider the training set to be generated by a random model since the token sequences are just sampled uniformly.
# Latest model
gen_token_seq_, _ = gpt.generate(
    n_sequences=1024,
    max_new_tokens=seq_len,
    temperature=0.001,
    device="cuda",
)
gen_inds_ = (gen_token_seq_[:, 1:] - 1).cpu().numpy()
gen_op_seq_ = op_pool[gen_inds_]
true_Es_ = get_subsequence_energies(gen_op_seq_)[:, -1].reshape(-1, 1)

# Best model
loaded = torch.load("./seq_len=4/trial7/gqe.pt")
loaded_token_seq_, _ = loaded.generate(
    n_sequences=1024,
    max_new_tokens=seq_len,
    temperature=0.001,
    device="cuda",
)
loaded_inds_ = (loaded_token_seq_[:, 1:] - 1).cpu().numpy()
loaded_op_seq_ = op_pool[loaded_inds_]
loaded_true_Es_ = get_subsequence_energies(loaded_op_seq_)[:, -1].reshape(-1, 1)

# Summary table
df_compare_Es = pd.DataFrame({
    "Source": ["Random", "Latest Model", "Best Model"],
    "Aves": [train_sub_seq_en[:, -1].mean(), true_Es_.mean(), loaded_true_Es_.mean()],
    "Mins": [train_sub_seq_en[:, -1].min(), true_Es_.min(), loaded_true_Es_.min()],
    "Maxs": [train_sub_seq_en[:, -1].max(), true_Es_.max(), loaded_true_Es_.max()],
    "Mins_error": [
        abs(train_sub_seq_en[:, -1].min() - grd_E),
        abs(true_Es_.min() - grd_E),
        abs(loaded_true_Es_.min() - grd_E),
    ],
})
df_compare_Es
Source Aves Mins Maxs Mins_error
0 Random -1.114531 -1.136982 -1.027878 2.811117e-04
1 Latest Model -1.132200 -1.137038 -1.125118 2.253048e-04
2 Best Model -1.135560 -1.137263 -1.125118 7.712067e-07
We observe that the minimum energy of the random training set is already close to the ground state energy grd_E = -1.1372633205048763 Ha, with an error of around 2.81e-04 Ha. But the maximum energy is relatively far, so the random sequences give a wide spread of energies. The energies of the sequences generated by the latest and best GPT model versions, however, have a narrower spread, and their average and minimum energies are very close to grd_E, closer than those in the random training set: around a 2.25e-04 Ha error for the minimum energy generated by the latest model version, and a 7.71e-07 Ha error for the best model version. It is then very interesting that a well-trained GPT model can generate sequences better than those in the training set, even though every sequence the model has seen came from that set. That is, the model was able to generalize.
Between the two GPT model versions, the latest version is worse than the best version: the minimum-energy error for the latest version has the same order of magnitude as that of the random training set, while the error for the best version is 3 orders of magnitude smaller. This is consistent with our observation in Figure 4 that the performance of the model worsened after the 7,000th iteration. That is, the predicted energies started to deviate further from the true energies, which in turn caused the sampled states to differ from the intended lower-energy states.
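As a final check on the table above, we can compare each minimum-energy error against chemical accuracy, taken here as roughly 1.6e-3 Ha (about 1 kcal/mol); the BelowChemAcc column name is our own addition:

# Compare the minimum-energy errors with chemical accuracy (~1.6e-3 Ha)
chemical_accuracy = 1.6e-3  # Ha, approximately 1 kcal/mol
print(df_compare_Es.assign(BelowChemAcc=df_compare_Es["Mins_error"] < chemical_accuracy))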
Conclusion¶
In this demo, we saw that GPT-QE is a viable alternative for estimating the ground-state energy of a hydrogen molecule. The best underlying GPT model version can generate a state whose energy is only around 7.71e-07 Ha away from the ground state energy, well below chemical accuracy. Additionally, since the GPT model being optimized is completely detached from the quantum simulations, there is, for example, no need to compute gradients across a potentially large quantum circuit. The machinery of classical ML can thus be harnessed without worrying too much about the quantum algorithm side of the problem.
The reader can also experiment with other molecules from PennyLane, tweak several hyperparameters of the GPT model (like dropout and n_layer), and include standard ML callbacks in its training (like an early stopping mechanism and a learning rate schedule). The code itself is also open to further optimization, for instance, using PennyLane Lightning to evaluate the energies faster. An online training loop can also be implemented similarly to our offline version; the reader would just need to sample sequences from the current GPT model, instead of a fixed dataset, at each training iteration. To balance exploration and exploitation, one would also define a schedule for the inverse temperature: initially letting the GPT model sample more randomly through a high temperature, then gradually decreasing it so that the model focuses more on the higher-probability (lower-energy) states.
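To make this extension concrete, here is a minimal sketch of such an online loop with an annealed temperature. This is an illustration, not the implementation from [1]: the hyperparameters n_online_iters, t_start, t_min, and decay are hypothetical choices, and the loop reuses the gpt, opt, op_pool, and get_subsequence_energies objects defined earlier.

# A minimal sketch of online training (hypothetical hyperparameters)
n_online_iters, t_start, t_min, decay = 1000, 1.0, 0.001, 0.995

for i in range(n_online_iters):
    temperature = max(t_min, t_start * decay**i)  # anneal from exploration to exploitation

    # sample sequences from the current model instead of a fixed dataset
    gpt.eval()
    gen_token_seq, _ = gpt.generate(
        n_sequences=64, max_new_tokens=seq_len, temperature=temperature, device="cuda"
    )

    # evaluate the sampled sequences with PennyLane
    gen_op_seq = op_pool[(gen_token_seq[:, 1:] - 1).cpu().numpy()]
    gen_energies = torch.from_numpy(get_subsequence_energies(gen_op_seq)).to("cuda")

    # take one optimizer step on the freshly generated data
    gpt.train()
    opt.zero_grad()
    loss = gpt.calculate_loss(gen_token_seq, gen_energies)
    loss.backward()
    opt.step()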
References¶
[1] Kouhei Nakaji et al., "The generative quantum eigensolver (GQE) and its application for ground state search". arXiv:2401.09253 (2024).
[2] Alec Radford et al., "Language Models are Unsupervised Multitask Learners". OpenAI blog, 1(8):9 (2019).
[3] Ashish Vaswani et al., "Attention is All you Need". Advances in Neural Information Processing Systems, 30 (2017).
Joseph Bunao
Developer specializing in machine learning applications for quantum hardware and software research
Zy Niu
AI Tooling for Researchers.