
Accelerating Simulated Quantum Annealing with GPU and Tensor Cores

Foreword

Recently, our team submitted a research paper called “Accelerating Simulated Quantum Annealing with GPU and Tensor Cores,” which was accepted as a research paper at ISC High Performance 2022.

In this paper, we modify the SQA algorithm to increase its parallelism while still preserving its solution accuracy.

In general, SQA algorithms are often used to solve combinatorial optimization problems. We first map the combinatorial optimization problem to be solved (e.g., the traveling salesman problem (TSP)) onto an appropriate Ising model; different problems map to different Ising models. Several Ising models, such as the fully-connected model, the King graph model, and the EA model, are used by teams working on quantum-annealing-related projects (e.g., Tohoku University's SQA accelerated by FPGA and GPU, and Toshiba's Simulated Bifurcation Machine (SBM)).

In this work, we propose an innovative parallelization strategy called hierarchical update to vastly improve the efficiency of parallel computing; it accelerates state-of-the-art SQA implementations by a further 7X-47.2X in our case studies. Furthermore, we develop a tensorization scheme that leverages the Tensor Cores on modern GPUs, via cuBLAS from the NVIDIA CUDA Toolkit, to deliver up to 1.83X of additional speedup. Overall, our work solves fully-connected Ising models faster than any previous SQA work: our solution outperforms existing GPU-based solutions by 86.6X and FPGA-based solutions by 14X, while still maintaining the solution accuracy of SQA. Finally, we use our algorithm to solve the MAX-CUT instances of the Gset benchmark provided by Stanford University and compare the accuracy with the current state-of-the-art. With careful adjustment of the corresponding parameters, such as temperature, almost all Gset problems reach an average accuracy of over 97.5%.

Content

The structure of this article is as follows. First, I will briefly introduce the Ising model and SQA. With that background in place, we will look at how the previous state-of-the-art accelerated SQA to solve the fully-connected Ising problem. The second part elaborates on the acceleration strategy proposed in this work, “Hierarchical Update,” and describes how Tensor Cores are adopted during acceleration to achieve a higher speedup. After the methodology, we cover some of the experiments conducted in the paper to verify the effect of the proposed strategy, and close with a summary.

Introduction

Ising Model

The Ising model is a large lattice constructed from N spins. Each spin can be in one of two states, +1 or -1: a spin labeled +1 points up, and a spin labeled -1 points down.

In addition, as shown in Fig. 1, some of the spins are connected. The connection strength between spin i and spin j is called the coupling and is denoted J_{ij}.

Figure 1. Sample Ising model - EA model

With these parameters, we can formulate the Hamiltonian of the Ising model as follows.

H = -∑_{i<j} J_{ij} s_i s_j

When solving the Ising problem, our goal is to find a spin configuration that minimizes the system energy. Thus, we try to flip the spins of the Ising model to find the spin configuration with the minimal Hamiltonian. Solving an Ising problem is a combinatorial problem, and an NP-complete one.
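As a concrete illustration (not from the paper), the energy of a given spin configuration can be evaluated directly from the coupling matrix. A minimal NumPy sketch, with a hypothetical 3-spin example:

```python
import numpy as np

def ising_energy(J, s):
    """Hamiltonian H = -sum_{i<j} J_ij * s_i * s_j for spins s in {+1, -1}.

    J is a symmetric N x N coupling matrix with zero diagonal;
    the quadratic form s @ J @ s counts each pair twice, hence the 0.5.
    """
    return -0.5 * (s @ J @ s)

# Tiny hypothetical example: only spins 0 and 1 are coupled (J_01 = 1).
J = np.zeros((3, 3))
J[0, 1] = J[1, 0] = 1.0
s_aligned = np.array([1.0, 1.0, -1.0])    # coupled pair aligned
s_opposed = np.array([1.0, -1.0, -1.0])   # coupled pair anti-aligned

print(ising_energy(J, s_aligned))   # -1.0: the lower-energy configuration
print(ising_energy(J, s_opposed))   # 1.0
```

Minimizing this energy over all 2^N configurations is exactly the combinatorial search SQA performs heuristically.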

But why do we want to solve the Ising problem? Because we can reduce other NP-complete problems, such as MAX-CUT, MIN-CUT, and the traveling salesman problem, to the Ising problem, and then solve the problem we actually care about by solving the Ising problem.

Generally, the Ising problem is currently solved heuristically. One such heuristic, “Simulated Quantum Annealing” (SQA), is an algorithm for solving combinatorial optimization problems. Before discussing SQA, let's look at some Ising models with different connectivity.

There are many different Ising models shown in Fig. 2.

Figure 2. Ising models with different connectivities

The connectivity of the Ising model increases from the left-most to the right-most model in the figure. The higher the connectivity of an Ising model, the more difficult it is to accelerate, since attempting a spin-flip requires knowing the states of all the connected spins. Hence, for the right-most one, the fully-connected Ising model, the remaining (N-1) spins must stay in the same state while we attempt to flip a spin. Thus, N rounds are required to traverse the N spins, judging in each round whether flipping one spin lowers the Hamiltonian of the Ising model.
In our work, we accelerate solving the fully-connected Ising model, the most challenging Ising model to accelerate, with the simulated quantum annealing (SQA) algorithm.

Simulated Quantum Annealing

So what is simulated quantum annealing (SQA)? SQA is a quantum-inspired algorithm. To mimic quantum phenomena, it duplicates the original Ising model into M copies, as shown in the figure below. This multi-trotter Ising model is called a transverse-field Ising model (TFIM). Unlike the original fully-connected Ising model, each spin in the TFIM is also connected to the spins at the same position in the previous and next trotters.

Figure 3. TFIM

At the start of SQA, given an initial temperature and magnetic field, we randomly generate a set of spin configurations. As time increases (i.e., as the MC-step count increases), the value of T(t) gradually decreases. At each temperature, we traverse all the spins and try to flip any spin whose flip makes the system Hamiltonian lower under the current system state. But remember that we can only flip one spin at a time, since the Ising model we solve is fully connected.

OK, so to make a spin-flip, we need to know the energy difference caused by the flip. Here is the Hamiltonian of the fully-connected TFIM.
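The image of the formula is omitted here; a common path-integral form of the fully-connected TFIM Hamiltonian (following the conventions of Ref-1 and Ref-2 below; exact prefactors vary between papers) is:

```latex
H_{\mathrm{TFIM}}
  = -\sum_{m=1}^{M} \left(
      \sum_{i<j} \frac{J_{ij}}{M}\, s_{i,m}\, s_{j,m}
      + J_{\perp} \sum_{i=1}^{N} s_{i,m}\, s_{i,m+1}
    \right),
\qquad
J_{\perp} = \frac{T}{2} \ln \coth\!\left( \frac{\Gamma}{M\,T} \right),
```

where s_{i,m} is spin i on trotter m, Γ is the transverse-field strength, T is the temperature, and trotter indices are periodic (s_{i,M+1} = s_{i,1}).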

From this formula, we can separate out the contribution of a single spin to the system's total energy.

After manipulating the equation, we obtain the spin energy of spin i on trotter m.

Hence, knowing the spin energy of each spin, we can determine the energy difference caused by a spin-flip, since a spin state can only be +1 or -1.
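A minimal sketch of that observation, covering only the classical pairwise part (the inter-trotter coupling term of the TFIM is omitted, and the example values are made up):

```python
import numpy as np

def delta_energy(h, s, i):
    """Energy change from flipping spin i, given its local field h[i].

    Only terms containing s_i change sign, so for the pairwise part
    E_i = -s_i * h_i and flipping s_i gives dE = 2 * s_i * h_i.
    (The inter-trotter coupling term of the TFIM is omitted here.)
    """
    return 2.0 * s[i] * h[i]

J = np.array([[0.0, 1.0],
              [1.0, 0.0]])
s = np.array([1.0, 1.0])
h = J @ s                      # local fields h_i = sum_j J_ij * s_j

# Flipping spin 0 breaks the aligned pair: energy goes from -1 to +1.
print(delta_energy(h, s, 0))   # 2.0
```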

For more detail on the Hamiltonian and the energy difference caused by a spin-flip, refer to Ref-1 and Ref-2 below.

  • Ref-1: “Relationship between d-Dimensional Quantal Spin Systems and (d+1)-Dimensional Ising Systems: Equivalence, Critical Exponents and Systematic Approximants of the Partition Function and Spin Correlations”
  • Ref-2: “An Ising computer based on simulated quantum annealing by path integral Monte Carlo method” (IEEE International Conference on Rebooting Computing (ICRC))

Knowing the energy difference produced by a spin-flip, let's walk through the process of the SQA algorithm.

After storing the coupling data in the coupling matrix, we randomly generate a set of spin configurations. From the previous section, we know that we need the spin energy of each spin to decide whether to flip it. The equation shows that the bottleneck in computing the spin energy is the sigma term called the local field. Hence, before starting the Monte Carlo steps, we pre-calculate the local-field terms of all spins in parallel on the GPU, as shown in lines 5 to 10 of the algorithm.
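The pre-computation amounts to one matrix product. A NumPy sketch (sizes here are made up for illustration) of building every local field of every trotter at once:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 4                                # spins per trotter, trotter count
J = rng.standard_normal((N, N))
J = np.triu(J, 1)
J = J + J.T                                # symmetric couplings, zero diagonal
S = rng.choice([-1.0, 1.0], size=(N, M))   # column m = spins of trotter m

# One O(N^2 * M) product builds all local fields before the MC loop starts;
# this is the part performed in parallel on the GPU.
H = J @ S                                  # H[i, m] = sum_j J_ij * S[j, m]
print(H.shape)                             # (8, 4)
```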

After the initialization, each MC-step traverses all the spins of the TFIM in the N and M for-loops.

Since we have pre-calculated the local field of each spin, we can compute the energy difference caused by each spin-flip in O(1) time. We can then determine, in line 17, whether the flip lowers the system energy; if so, the spin is flipped in line 18. However, since all the spins are connected, the local fields of the remaining (N-1) spins must be updated after a spin-flip, which is done in lines 20 to 22.

Unlike constructing the local fields of the N spins, which costs O(N^2), updating the N local fields takes only O(N) thanks to the trick shown on the bottom right. For example, suppose spin i on trotter m is flipped; we then need to update the k-th spin on trotter m for k from 1 to N. Subtracting the old local field from the new one, and noting that only one spin flips at a time, every other term cancels (as shown in the figure above). Furthermore, since each spin state is either +1 or -1, we can replace the old spin state with the new spin state multiplied by -1. Hence updating one local field takes only O(1) time, and updating all N local fields costs O(N) rather than O(N^2).
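A single-trotter NumPy sketch of this O(N) trick (the trotter dimension is left out for clarity):

```python
import numpy as np

def update_local_fields(h, J, s, i):
    """Refresh all N local fields in O(N) after spin i was flipped.

    With s[i] already holding the *new* state, and s_old_i = -s[i]:
      new h_k = old h_k - J_ki * s_old_i + J_ki * s[i]
              = old h_k + 2 * J_ki * s[i]
    """
    h += 2.0 * J[:, i] * s[i]
    return h

rng = np.random.default_rng(1)
N = 16
J = rng.standard_normal((N, N))
J = np.triu(J, 1)
J = J + J.T
s = rng.choice([-1.0, 1.0], size=N)
h = J @ s                        # O(N^2) construction, done once
s[3] = -s[3]                     # flip one spin
h = update_local_fields(h, J, s, 3)
print(np.allclose(h, J @ s))     # True: matches a full O(N^2) rebuild
```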

Once all the local fields are correct again, we go back to line 13 and check whether the next spin-flip can lower the system energy, repeating until all spins have been traversed. The system energy then decreases, and the next MC-step begins, continuing until the final Monte Carlo step.

Original state-of-the-art (SOTA)

Before this work, the SQA algorithm had been accelerated with many different strategies. Tohoku University proposed the state-of-the-art, published in IEEE Access in 2020 and IEEE Transactions in 2019. The results of these two works are also the main points of comparison in our work.

They first parallelize the construction and the update of the local fields using accelerators.
Besides, SQA only forbids flipping connected spins at the same time; in the fully-connected TFIM, this means we cannot simultaneously flip spins on the same trotter, or the corresponding spins on the trotters directly above and below. We can, however, flip different spins on other trotters simultaneously, as shown in the figure below, using a pipeline method to increase the parallelism of the SQA algorithm.

These techniques reduce the time complexity of each MC-step from O(M·N^2) to O(N^2), or even lower when an accelerator further speeds up the construction and update of the local fields.

Methodology

Building on the SOTA from Tohoku University, we propose a new strategy that re-forms the SQA algorithm to expose more parallelism. We call the new method “Hierarchical Update.”

In the methodology section, we first discuss Hierarchical Update, and then turn to how Tensor Cores are used to achieve an even higher speedup.
Let's start with Hierarchical Update, which we abbreviate as the HU method.

Hierarchical Update (HU)

Unlike the original SQA, which updates N local fields after every spin-flip, HU updates the local fields of only some of the spins, rather than all the spins on the same trotter. For example, consider a TFIM constructed from four trotters with 64 spins per trotter. After initialization or a full update, the local field of every spin is correct; we color those spins green. But once a spin-flip happens, the local fields of all the spins on the same trotter become stale, and we color them red.

In addition, HU introduces a new variable called block_size (blk_sz), splitting the N spins into N/blk_sz blocks. Assume blk_sz = 32. Then we can separate the 64 spins into two groups, as shown by the green frames. Furthermore, we separate each group into two small blocks (framed in orange and blue). The orange and blue frames are used to maintain parallelism along the M dimension, which we will discuss later.

Let's look at the green frame to see how HU works. As in the original SQA, we first judge whether to flip spin 1. If spin 1 is flipped, the local fields of all the spins on the trotter need updating, so we color them red. Next, unlike the original method, we only update the local fields of the spins in the same block, i.e., spins 1 to 32.

We can then go straight on to judge spin 2, and if spin 2 is flipped, we again update only the local fields of the spins in the same block. When we reach the end of the block, spin 32, then whether or not spin 32 flips, we spend O(N) time updating the local fields of the whole trotter, after which every spin's local field is correct again.

We then move to the next block, judging its first spin and repeating the same procedure for each spin until the end of that block.

After traversing all the blocks, we have attempted to flip every spin on the trotter, seeking the lowest energy of the trotter for this MC-step.

Besides, as mentioned previously, we separate each block into two small blocks, framed in orange and blue. This split is what preserves parallelism along the M dimension.

In this example, when dealing with the first block, the odd trotters judge the spins in the orange block while the even trotters judge the spins in the blue block; thus all the trotters can be processed at the same time.

Once the spins in a small block have been considered, the odd and even trotters swap small blocks: the odd trotters start to judge the spins in the blue block, and the even trotters the spins in the orange block, as shown below.

After all the spins in both small blocks have been traversed, every spin in the corresponding block of every trotter has been judged, and we can move on to the next block as described above. This is the running process of the entire HU.
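The block-wise traversal above can be sketched for a single trotter as follows. This is an illustrative NumPy sketch, not the paper's GPU implementation: the inter-trotter terms and the orange/blue small-block scheduling are omitted, and a Metropolis acceptance rule is assumed for the flip decision.

```python
import numpy as np

def hu_sweep(J, s, h, blk_sz, beta, rng):
    """One HU sweep over a single trotter (inter-trotter terms omitted).

    Inside a block, only that block's blk_sz local fields are kept
    correct; when the block ends, one pass repairs the rest of the
    trotter, so a full-trotter update happens once per block instead
    of once per flip.
    """
    N = len(s)
    for b0 in range(0, N, blk_sz):
        blk = slice(b0, b0 + blk_sz)
        flipped = []
        for i in range(b0, b0 + blk_sz):
            dE = 2.0 * s[i] * h[i]
            if dE < 0 or rng.random() < np.exp(-beta * dE):  # Metropolis
                s[i] = -s[i]
                flipped.append(i)
                h[blk] += 2.0 * J[blk, i] * s[i]   # cheap in-block update
        outside = np.ones(N, dtype=bool)
        outside[blk] = False
        for i in flipped:                          # repair the rest once
            h[outside] += 2.0 * J[outside, i] * s[i]
    return s, h

rng = np.random.default_rng(2)
N, blk_sz = 32, 8
J = rng.standard_normal((N, N))
J = np.triu(J, 1)
J = J + J.T
s = rng.choice([-1.0, 1.0], size=N)
h = J @ s
s, h = hu_sweep(J, s, h, blk_sz, beta=2.0, rng=rng)
print(np.allclose(h, J @ s))   # True: fields consistent after the sweep
```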

Tensor Cores

In addition, to achieve a higher speedup, we use Tensor Cores to accelerate the matrix multiplication that updates the local fields of all the spins after one block of spins has been judged.

But note that at most blk_sz × M spins have been flipped.

Therefore, when updating the local fields, we do not need an N × N × M matrix multiplication, but only an N × blk_sz × M one, which greatly reduces the amount of unnecessary computation.
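A NumPy sketch of this reduced-dimension update (the paper performs the product with Tensor Cores via cuBLAS; sizes here are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, blk_sz = 64, 8, 16
J = rng.standard_normal((N, N))
J = np.triu(J, 1)
J = J + J.T
S_old = rng.choice([-1.0, 1.0], size=(N, M))
H = J @ S_old                       # correct local fields before the block

# Suppose the block of spins [0, blk_sz) was judged on every trotter and
# some of them flipped; everything outside the block is unchanged.
S_new = S_old.copy()
flips = rng.random((blk_sz, M)) < 0.5
S_new[:blk_sz][flips] *= -1.0

# Only the judged block can change, so the fields are repaired with an
# (N x blk_sz) @ (blk_sz x M) product instead of (N x N) @ (N x M).
H += J[:, :blk_sz] @ (S_new[:blk_sz] - S_old[:blk_sz])
print(np.allclose(H, J @ S_new))    # True
```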

Analysis for HU

After talking about the HU, let’s give a simple analysis for HU.

The left side shows the amount of computation (multiplications and additions) needed by the original strategy; the right-hand side shows the amount needed when adopting HU.

The analysis shows that, to judge N spins, the original strategy needs 2·N·N multiplications and N·N additions, while the HU method needs 2·blk_sz·N + 2·N·N/blk_sz multiplications and blk_sz·N + N·N/blk_sz additions. Already for blk_sz = 2, HU's computation count is lower than the original strategy's; a larger blk_sz reduces the computation further, until the GPU's computational resource limit is reached. This is the first reason HU speeds up SQA.
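The counts above can be checked numerically. A small Python sketch (function names are mine) evaluating both formulas for a hypothetical N = 1024:

```python
def ops_original(N):
    """Multiplications and additions to judge N spins, original scheme:
    every flip triggers a full O(N) local-field update."""
    return 2 * N * N, N * N

def ops_hu(N, blk_sz):
    """Same counts with HU: cheap blk_sz-wide updates per spin, plus one
    full-trotter update per block (N / blk_sz blocks in total)."""
    mults = 2 * blk_sz * N + 2 * N * N // blk_sz
    adds = blk_sz * N + N * N // blk_sz
    return mults, adds

N = 1024
m0, a0 = ops_original(N)
for blk_sz in (2, 8, 32):
    m1, a1 = ops_hu(N, blk_sz)
    print(blk_sz, m0 / m1, a0 / a1)   # reduction factor vs. the original
```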

Besides, in our implementation, we let one SM handle one trotter. This reduces the number of GPU kernel launches, costing less kernel-launch time. Furthermore, since block_size is small, we can keep all the working data in the GPU's local memory rather than global memory, giving faster memory access. Again, we use Tensor Cores during the local-field update to reach higher acceleration. Finally, we maintain M-dimension parallelism so that the HU method reaches an even higher acceleration.

Results

In the performance evaluation, we conducted five different experiments to validate our work; I will only cover some of the results here.

Experiment 1 - Benefits of Hierarchical Update

For the first experiment, we compare the time taken by 100 annealing steps with and without the HU method, for M from 4 to 512 and for the number of spins N from 1024 to 32768.
In these figures, the dashed lines show the time taken without the HU method, and the solid lines the time taken with it. The results clearly show that the time with HU is far lower than without, for all N and M.

Experiment 2 - Per-Step Annealing Time

In this experiment, we compare the per-step annealing time under different N and M for four methods: Tohoku's GPU version, Tohoku's FPGA version, and our proposed work with and without HU. We take the time of Tohoku's GPU version as the baseline for computing the speedup of the other three.

As shown in the table, the speedup of our work over the baseline grows as N and M grow. At N = 32768 and M = 512, the HU method reaches a speedup of 47.2X, and HU combined with Tensor Cores reaches as much as 86.6X.
Furthermore, the proposed method is faster than Tohoku University's works at all problem sizes, whether they use a GPU or an FPGA as the accelerator.

Experiment 5 - The Quality of Solution

In the final experiment, we test the solution quality of our proposed work.

We use the Gset benchmark of MAX-CUT problems provided by Stanford University.
We run SQA for 1000 annealing steps and search for the best solution quality by adjusting the initial temperature and the per-MC-step temperature decrement.

The blue squares mark the currently best-known CUT values. The stars mark the best solutions we found, and the green triangles the average CUT we found. The orange circles denote the best solutions found by Tohoku University, and the pink circles Tohoku's average CUT.

The table shows that even our lowest average CUT exceeds 97.5% of the best known value, indicating that the HU method does not significantly impact solution quality. The experimental results above therefore demonstrate that the HU method greatly speeds up the SQA algorithm without hurting solution quality.

Conclusion

We accelerate the SQA algorithm for solving Ising problems by proposing the HU method and by using Tensor Cores for additional speedup.
Compared to the previous state-of-the-art from Tohoku University, we reach speedups of 1.7X to 86.6X over their GPU version and 2.7X to 13.68X over their FPGA version, while maintaining the solution quality of SQA.

20-minute pre-recorded video

Video link

Contact me

yhchiara@gmail.com or r09944072@csie.ntu.edu.tw