High performance Boson sampling simulation via data-flow engines (2024)

Boson sampling (BS) is viewed as an accessible quantum computing paradigm for demonstrating computational advantage over classical computers. In this context, the evolution of permanent calculation algorithms attracts significant attention, as the simulation of BS experiments involves the evaluation of a vast number of permanents. For this reason, we generalize the Balasubramanian–Bax–Franklin–Glynn permanent formula, aiming to efficiently integrate it into the BS strategy of Clifford and Clifford (2020). A reduction in simulation complexity originating from multiplicities in photon occupation was achieved through the incorporation of an n-ary Gray code ordering of the addends during the permanent evaluation. Implementing the devised algorithm on FPGA-based data-flow engines, we leverage the resulting tool to accelerate boson sampling simulations for up to 40 photons. Drawing samples from a 60-mode interferometer, the achieved rate averages around 80 s per sample, employing 4 FPGA chips. The developed design facilitates the simulation of both ideal and lossy boson sampling experiments.

1. Introduction

The main idea behind quantum supremacy [1–3] is to use a quantum processor to solve a mathematical problem that is intractable for classical computers, that is, a problem whose solution on classical hardware would require resources (like the execution time) scaling exponentially with the problem size [4–11]. Undoubtedly, it is fundamental for researchers to ascertain that the quantum device used in the experiment indeed solves its designated task in the way it is expected to. Without such confirmation, one could not assess the strength of quantum supremacy claims. For this reason, the development of powerful validation protocols is of high importance.

One of the proposed quantum algorithms to surpass classical computing devices is the so-called boson sampling (BS) [4, 5] in which bosons are drawn from a distribution of indistinguishable particles. In case the relation $m \gg n^2$ between the number of output modes m and the number of input photons n holds, the total execution time of the simulation scales exponentially with n, since there are no efficient classical methods to simulate the quantum correlations originating from the peculiar nature of multi-body bosonic systems consisting of indistinguishable particles. For this reason, BS has been subjected to numerous theoretical and experimental investigations. Due to the intensive interest shown towards this topic, several flavors of BS have been formulated, such as Gaussian BS [12–14], scattershot BS [15–17], BS translated into the time domain [18, 19], or BS implemented with trapped-ion technology [20, 21], ultracold atoms [22] and microwave cavities [23]. From an experimental point of view, after the proof-of-principle realizations of small-scale photonic interferometers [24–27], researchers were experimenting to increase the number of modes on the photonic chip [28] and to multiplex on-chip single-photon sources [29–31]. Currently, the largest conventional BS experiment was reported in the work of Wang et al [32] via a 60-mode interferometer with 20 input photons and 14 measured photons.

The period between 2020 and 2023 was especially fruitful for outstanding achievements in the experimental realization of BS from Gaussian states [33–35]. These experimental designs turned out to be impossible to simulate within a reasonable time frame, even for the most efficient exact algorithms developed to simulate Gaussian BS [36–38]. Approximate samplers [37, 39, 40], while for smaller systems able to generate faked samples closer to the ideal distribution than the experiments, also become computationally too expensive at these scales.

Because of the dedicated race to experimentally demonstrate provable quantum advantage, validation protocols have also grown in importance [40]. Shortly after the BS proposal, researchers tried to find a way to verify the results of the samplers. Initially, the validators used statistical tests to show that the samplers draw samples from a distribution different from classical counterparts [41], such as the mean-field approach or the uniform distribution. During recent years, we could see the emergence of more refined BS validation approaches [42, 43]. Some of those new ideas arise from techniques used in different branches of computer science, like pattern recognition [44] or computer vision [45]. These validators require permanent computation or access to samples drawn from a bona fide boson sampler, which means they can benefit from a high-performance permanent computation technique.

From a mathematical point of view, the key element in the simulation of BS is a fast evaluation of the permanent function. The reason for that is the connection between the permanents of specific matrices and the probabilities of BS outcomes. Thus, BS simulation requires many permanent computations even with the most efficient known algorithm of Clifford and Clifford [46, 47]. The permanent of an n×n matrix A is defined by the formula

$$\operatorname{perm}(A) = \sum_{\sigma \in S_n} \prod_{i=1}^{n} a_{i,\sigma(i)}, \qquad (1)$$

where $S_n$ is the set of permutations of $\{1, 2, \ldots, n\}$. The formula has a factorial time complexity of $\mathcal{O}(n!\,n)$. Valiant [48] proved that the evaluation of the permanent of a matrix containing only zeros and ones is #P (sharp P)-complete, a class at least as hard as NP-complete containing many counting problems, e.g. counting perfect matchings, graph colorings with k colors, satisfying assignments of general Boolean formulas, etc. Unlike the determinant, which has quite a similar definition to equation (1), there is no linear algebra simplification to calculate the permanent in polynomial time. Currently, the most efficient approaches to calculate the permanent have a computational complexity of $\mathcal{O}(n^2 2^n)$, which can be further reduced to $\mathcal{O}(n 2^n)$ if data recycling of intermediate results is implemented via Gray code ordering. The $\mathcal{O}(n 2^n)$ variants of the so-called Ryser [49] and the Balasubramanian–Bax–Franklin–Glynn (BB/FG) [50] formulas were benchmarked on the Tianhe-2 supercomputer in [51], implying better numerical accuracy of the BB/FG method.
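For reference, equation (1) translates directly into a brute-force Python sketch (the helper name is ours, NumPy assumed); it is practical only for tiny matrices, but it serves as a ground truth for the faster formulas discussed below.

```python
from itertools import permutations
import numpy as np

def permanent_naive(A):
    """Permanent by direct evaluation of equation (1); O(n! * n) time."""
    n = A.shape[0]
    return sum(
        np.prod([A[i, sigma[i]] for i in range(n)])
        for sigma in permutations(range(n))
    )
```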

In this work, we designed a novel scalable implementation to calculate the permanent via the BB/FG formula. In order to account for collisions of photons on the optical modes of the interferometer, we introduce an extended version of the BB/FG formula based on the concept of n-ary Gray code ordering [52] to take into account data repetition during the permanent evaluation and improve the performance of the BS simulation. We also implemented the designed computational model on FPGA-based data-flow engines (DFEs), further improving the computational performance of the BS simulation provided by the Piquasso package [53]. The so-called data-flow programming model [54] utilizes a fundamentally different approach to increase computational concurrency than traditional parallel programming models on GPU and CPU hardware. The basic concept of data-flow programming can be explained by thinking of a stream of computational data flowing through hardware. Each of the utilized resources performs a fixed logical transformation on the elements of the stream, transforming a single data element in each clock cycle. By chaining up hardware resources we end up with a data-flow computing model: while one hardware element is transforming the ith element in the data stream, the next hardware element in the chain is already working on the previously transformed, (i−1)th element of the stream. By providing a long enough data stream to the hardware one can realize an efficient parallel execution, even if the computational tasks have a high degree of data dependency.

The high-level DFE development framework of Maxeler Technologies provides an efficient programming background to design data-flow applications in terms of high-level building blocks (such as support for fixed-point arithmetic operations with complex numbers, automatic pipeline scheduling, stream-holds, memory controllers, etc) instead of the laborious work of programming low-level VHSIC Hardware Description Language (VHDL) components. By combining a novel permanent computation approach based on the BB/FG formula with the data-flow programming model, we developed a permanent computing DFE capable of supporting exact BS simulations (with and without photon losses) up to 40 photons. The computational complexity of the BS simulation can be significantly reduced if photon occupation multiplicities on the optical modes are taken into consideration. Our DFE permanent calculator is adapted to this generalization by streaming the addends in n-ary Gray code ordering directly on the FPGA chip. With this accessory, it took ∼80 s per sample to draw 40-photon samples from a random 60-mode interferometer via four concurrently operating DFEs.

The manuscript is organized as follows: in section 2 we discuss the n-ary Gray code implementation to evaluate the permanent function with the BB/FG formula, accounting for photon occupation multiplicities on the photonic modes. Then, in section 3 we describe the DFE implementation of the permanent calculation engine developed for FPGA chips. Finally, in section 4 we provide the results of our numerical experiments on the classical simulation of BS incorporating DFE permanent calculation implementations.

2. Evaluation of the permanent function: a novel scalable approach

The two lowest-complexity methods of computing the permanent involve the formulas of Ryser [49] and BB/FG [50] when using Gray code ordering of the addends. Ryser's approach computes the permanent as

$$\operatorname{perm}(A) = (-1)^n \sum_{S \subseteq \{1,\ldots,n\}} (-1)^{|S|} \prod_{i=1}^{n} \sum_{j \in S} a_{ij}. \qquad (2)$$

The formula has an exponential computational complexity of $\mathcal{O}(n^2 2^n)$, which is significantly reduced when the subsets $S \subseteq \{1,\ldots,n\}$ are ordered in a way that only a single element of S changes from one subset to the next. In this case, the value of the inner sum of the matrix elements can be reused, i.e. only a single matrix element needs to be added to or subtracted from the reused sum to obtain the new value corresponding to the next S. By reducing the complexity of the inner sum from $\mathcal{O}(n)$ to $\mathcal{O}(1)$ this way, the Ryser formula can be evaluated with an overall $\mathcal{O}(n 2^n)$ complexity. By a later technique reported in [55] the complexity can be further reduced by a factor of 2 in the outer sum using the expression:

$$\operatorname{perm}(A) = (-1)^{n-1}\, 2 \sum_{S \subseteq \{1,\ldots,n-1\}} (-1)^{|S|} \prod_{i=1}^{n} \left( x_i + \sum_{j \in S} a_{ij} \right), \qquad (3)$$

where $x_i$ is defined by $x_i = a_{i,n} - \tfrac{1}{2}\sum_{j=1}^{n} a_{ij}$. Another highly efficient method to calculate the permanent is provided by the BB/FG formula:

$$\operatorname{perm}(A) = \frac{1}{2^{n-1}} \sum_{\delta} \left( \prod_{i=1}^{n} \delta_i \right) \prod_{j=1}^{n} \sum_{i=1}^{n} \delta_i a_{ij}, \qquad (4)$$

where $\delta = (\delta_1, \delta_2, \ldots, \delta_n)$ with δ1 = 1 and $\delta_i \in \{-1, +1\}$ for $2 \le i \le n$. Notice that, in contrast with the traditional definition of the BB/FG formula, in the inner sum of equation (4) we calculate the column sums of the input matrix instead of the row sums. This choice is motivated by a practical reason implied by the following reasoning: regarding the unitary matrix describing the scattering in the optical interferometer, the rows of the unitary are associated with the output states, while the columns are related to the input modes. Non-trivial photon multiplicities on the output modes (expected to occur more often than multiplicities on the input modes) are described by repeated rows in the unitary. As we will show in section 2.2, by accounting for these multiplicities one can significantly reduce the complexity of the permanent evaluation.

The computational complexity of the BB/FG formula is $\mathcal{O}(n^2 2^{n-1})$, while it can be reduced to $\mathcal{O}(n 2^{n-1})$ if Gray code ordering is applied in the evaluation of the outer sum. The Ryser and BB/FG formulas follow quite different approaches to evaluate the permanent, also resulting in dissimilar numerical properties. In the benchmark comparison of the two methods reported in [51], the authors found that the BB/FG formula gives numerically more precise results than Ryser's formula in the context of bounded bit-sized data types. This finding was further justified by our numerical analysis reported in section 3.2, comparing the numerical accuracy of the individual algorithms with the numerical results obtained by the multiple-precision floating-point arithmetic provided by the GNU MPFR library [56].

2.1. Parallel BB/FG implementation to calculate the permanent

In this section we discuss the technical foundations of our CPU implementation to calculate the permanent of a square matrix, which can be further improved to account for the photon multiplicities discussed in section 2.2. Our algorithm implementing the Gray code ordered BB/FG formula (4) has a computational complexity of $\mathcal{O}(n 2^{n-1})$ while minimizing the overhead of the logic associated with the generation of the auxiliary data needed for the Gray code ordering. Table 1 shows an example of a so-called binary reflected Gray code sequence encoded by 3 bits (000, 001, 011, 010, 110, 111, 101, 100). Via Gray code ordering, in each cycle of the outer sum of equation (4) only one element of the δ vector is changed, making it possible to reuse the column sum (i.e. the inner sum in equation (4)) in the next cycle and subtract/add only the elements of a single row of the input matrix (see [51] for details).

Here we note some important properties of reflected Gray code counting allowing us to efficiently implement the algorithm in a parallel environment. First of all, the Gray code corresponding to a decimal index $i$ is $g_i = i \oplus (i \gg 1)$, where $\oplus$ is a bit-wise logical XOR operation and $\gg$ is a right bitshift operation. The ability to determine the Gray code corresponding to any decimal index i in constant time implies an efficient way to evaluate the permanent in parallel. Simply, we divide the set of $2^{n-1}$ decimal indices into smaller contiguous subsets that can be processed concurrently on the available CPU threads. For the first element in the subset, each column sum is initialized with n arithmetic operations. Whether a matrix element $a_{ij}$ is subtracted from or added to the column sum is determined by the elements δi, which are mapped to the individual bits of the Gray code: δi is considered to be +1 (−1) if the corresponding Gray code bit is 0 (1). After initializing the initial column sums on each CPU thread, additional addends to the permanent are calculated via sequential iterations over the elements of the given subset of decimal indices. To this end, we need to determine the changing bit in the Gray code and its position. In principle, the bit mask of the changed bit is given by a bit-wise comparison of the Gray code with its prior value, $g_i \oplus g_{i-1}$. However, according to the generation rule of the Gray code gi from the decimal index i, we can use a more efficient way to determine the changed bit: in each cycle, the position of the changed bit in the Gray code is given by the position of the lowest set bit of the decimal index. Also, since in each iteration only one element of the Gray code is changed, we can keep track of the 'parity' of the δ vector (i.e. the sign of the product of the elements δi) from cycle to cycle in an efficient way without iterating over the elements of the vector. We provide a reference implementation of our recursive permanent algorithm in the Piquasso Boost library [57], providing high-performance C++ computing engines for the Python-based universal bosonic quantum computer simulator framework Piquasso. We also note that the described method can be used to scale up the computation of the permanent over multiple computing nodes (using the Message Passing Interface (MPI) protocol), so in contrast with the claim of [51] the Gray-coded permanent evaluation can be efficiently parallelized.
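The update logic described above can be condensed into the following single-threaded Python sketch (the function name is ours, NumPy assumed). The parallel version differs only in that each thread initializes its column sums from the Gray code $g_i$ of the first index of its chunk, as described above.

```python
import numpy as np

def permanent_bbfg(A):
    """Gray-coded BB/FG permanent, equation (4); O(n * 2^(n-1)) arithmetic.

    delta_1 is pinned to +1; the remaining n-1 signs are driven by a binary
    reflected Gray code counter, so each step updates the column sums with
    a single row of A."""
    n = A.shape[0]
    col_sum = A.sum(axis=0).astype(complex)   # delta = (+1, ..., +1)
    total = np.prod(col_sum)                  # addend of Gray code 0
    parity = 1                                # sign of prod(delta)
    for i in range(1, 1 << (n - 1)):
        pos = (i & -i).bit_length() - 1       # changed bit = lowest set bit of i
        g = i ^ (i >> 1)                      # Gray code of decimal index i
        sign = -2.0 if (g >> pos) & 1 else 2.0  # bit 0 -> 1 flips +1 to -1
        col_sum += sign * A[pos + 1, :]       # bit pos drives row pos + 1
        parity = -parity                      # exactly one delta flipped
        total += parity * np.prod(col_sum)
    return total / 2 ** (n - 1)
```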

2.2. Handling row and column multiplicities in the input matrix

In the general case, multiple photons might occupy the same optical mode at the output of the interferometer. Since the output modes are associated with the row indices of the unitary describing the scattering process of the photons, a multi-photon collision at the output modes is mathematically described by repeated rows in the unitary. In particular, the number of times a specific row is repeated in the unitary is identical to the number of photons occupying the corresponding optical mode. Following the permanent evaluation strategy encoded by the BB/FG formula, it becomes clear that row multiplicities in the input matrix make some of the addends to the permanent show up multiple times during the calculation. Such a complexity reduction was already reported in several previous works, in [47, 58, 59] by generalizing the Ryser formula, or in [60] by turning the permanent calculation into the evaluation of the Hafnian function accounting for multiplicities [36]. Here we argue that it is possible to make use of the addend multiplicities and reduce the overall complexity of the BB/FG permanent calculation as well. To the best of our knowledge, this improvement to the BB/FG algorithm has not been reported before.

For example, let us assume that the kth output mode (k > 1) is occupied by Mk photons, resulting in Mk identical rows in the unitary at row indices $l, l+1, \ldots, l+M_k-1$. Consequently, the

$$\sum_{i=l}^{l+M_k-1} \delta_i a_{ij} \qquad (5)$$

partial column sum might result in $M_k+1$ different outcomes depending on the number $N_-$ of −1 values and the number $N_+$ of +1 values among the δi ($l \le i \le l+M_k-1$) elements of the δ vector. Each individual outcome occurs exactly $\binom{M_k}{N_+}$ times during the permanent evaluation. Taking the Mk multiplicities for each optical output mode labeled by k, in total

$$\prod_{k=1}^{m} \left( M_k + 1 \right) \qquad (6)$$

different outcomes of the inner column sum show up during the calculation, determining the overall complexity of the permanent evaluation. According to the BB/FG formula, the first row (k = 1) of the matrix is always taken with the coefficient δ1 = 1, hence we always treat the first row with a multiplicity of 1, even if it is repeated in the matrix. Computationally, the best practice is to move one of the non-repeated rows (if one exists) to the top of the input matrix. The possible computational speedup achieved via the outlined combinatorial simplification depends on the specific use case. The exponential factor in the evaluation complexity is determined by the number of involved rows of the input matrix. In general, the complexity of the calculation can be reduced to

$$\mathcal{O}\left( n \prod_{k=1}^{m} \left( M_k + 1 \right) \right) \qquad (7)$$

with $n = \sum_k M_k$ summed over the optical modes. In order to account for row multiplicities within the BB/FG formula, we make use of reflected n-ary Gray code counting instead of the binary Gray code counting described in section 2.1:

$$\operatorname{perm}\left(A^{(M,N)}\right) = \frac{1}{2^{n-1}} \sum_{\Delta} \left[ \prod_{k=1}^{m} (-1)^{M_k-\Delta_k} \binom{M_k}{\Delta_k} \right] \prod_{j=1}^{m} \left( \sum_{i=1}^{m} \left( 2\Delta_i - M_i \right) a_{ij} \right)^{N_j}, \qquad (8)$$

where $A^{(M,N)}$ denotes the matrix constructed from the square matrix A describing the interferometer, M and N are the row and column multiplicities respectively, such that the photon count is $n = \sum_k M_k = \sum_j N_j$, and Δ is the n-ary Gray code counter required for the efficient computation. (For the physical meaning of the multiplicities see equation (5).) The n-ary Gray code is also known as the non-Boolean Gray code or Guan code [61]. As the name implies, the individual 'digits' of the n-ary Gray code are encoded by numbers from $\{0, 1, \ldots\}$ instead of the binary values 0 and 1. While the limits of the individual Gray code digits can differ from each other, one can construct a specific n-ary Gray code counter in which the kth digit counts the number of +1 values among the δi elements corresponding to the repeated rows describing the kth output mode. Thus, according to our reasoning, the kth digit of the Gray code counts from 0 to Mk. In order to construct such a reflected n-ary Gray code counter, we followed the work of Kurt et al [52]. In each iteration, only a single digit changes its value, enabling one to reuse the calculated column sums in the next iteration, similarly to the case of binary reflected Gray code ordering.

Due to the reflected nature of the counter, it is possible to determine the Gray code corresponding to a specific decimal index in constant time (for details see [52]). Thus, the algorithm can be executed in parallel in a similar fashion as we described for the binary reflected Gray code counter. In the Piquasso Boost library, we provide our implementation of the BB/FG algorithm accounting for row multiplicities.
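The combinatorics of this section can be illustrated with the Python sketch below (the function name is ours, NumPy assumed); it reproduces the result of the expanded-matrix BB/FG formula while evaluating only the distinct column-sum outcomes. For clarity it enumerates the outcomes with itertools.product; the production implementations instead walk them in reflected n-ary Gray code order so that consecutive terms differ in a single digit only.

```python
from itertools import product
from math import comb
import numpy as np

def permanent_with_multiplicities(A, row_mult, col_mult):
    """BB/FG permanent of the matrix obtained by repeating the k-th row
    (j-th column) of the m x m matrix A row_mult[k] (col_mult[j]) times,
    cf. equation (8)."""
    m = A.shape[0]
    n = sum(row_mult)
    assert n == sum(col_mult) and row_mult[0] >= 1
    # delta of the very first (expanded) row is pinned to +1, so the first
    # row group has only row_mult[0] - 1 free signs.
    free = [row_mult[0] - 1] + list(row_mult[1:])
    total = 0.0j
    for deltas in product(*(range(M + 1) for M in free)):
        weight = 1.0
        coeff = np.empty(m)
        for k, (d, M) in enumerate(zip(deltas, free)):
            weight *= comb(M, d) * (-1) ** (M - d)       # multiplicity, sign
            coeff[k] = 2 * d - M + (1 if k == 0 else 0)  # +1: pinned delta_1
        col_sum = coeff @ A               # one inner sum per distinct column
        total += weight * np.prod(col_sum ** np.asarray(col_mult))
    return total / 2 ** (n - 1)
```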

3. DFE design and implementation to evaluate the permanent

In order to further increase the computational speed of permanent evaluation, we developed a DFE implementation of the algorithm described in the previous section. In this section, we discuss the technical aspects of the developed FPGA (field-programmable gate array) based permanent calculation engines realized via a static data-flow programming model. Since the configuration of the programmable hardware elements on the FPGA chip (in other words, the uploading of the program to the FPGA card) is time-consuming, the implementation needs to be general enough to avoid the need to re-configure the FPGA card during the BS runtime. In addition, by accounting for the particular needs implied by the physics working behind the simulated BS architecture, we encountered further optimization requirements for the DFE implementation, such as the ability to calculate the permanents of multiple matrices in one shot to amortize the initialization overhead even at small matrix sizes. In order to provide a thorough description of our implementation while sparing the reader from too many low-level technical details, we discuss these optimization details separately in the supplementary material. Here we rather focus on the description of the basic concept of evaluating permanents on DFEs.

Due to the high resource requirements associated with floating point operations on the FPGA chip, we focused on designing a scheme where fixed point number representation is used for the arithmetic operations, gradually increasing the bitwidth of the number representation from the initial 64 bits (sufficient to represent the elements of the input unitary) up to 128 bits used to derive an accurate final result up to a unitary size of $40\times40$. (While the floating point units on modern CPUs are highly specialized and optimized, the FPGA with the aid of generic look-up tables (LUTs) and digital signal processing (DSP) units scales better with fixed point operations. For example, a double precision floating point addition requires a delay of 14 clock ticks, while 1 tick is sufficient for a fixed point addition.) Starting from the 64 bit double precision format of the input unitary on the CPU side, we perform a conversion of each floating point value x into a fixed point representation encoded by a b = 64 bit variable a, such that a is the nearest integer to $2^{62}x$, with 62 standing for the number of fractional bits. The remaining two bits are used to store the integer part (for the case when the matrix element is unity) and the sign of a in two's complement encoding. (Since the input matrix is expected to be unitary, the magnitude of the matrix elements can never be larger than unity.)
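A minimal sketch of this number conversion (the helper names are ours; the DFE performs the equivalent operations in hardware):

```python
FRAC_BITS = 62   # 64 bits = 1 sign bit + 1 integer bit + 62 fractional bits

def to_fixed(x: float) -> int:
    """Nearest 64 bit fixed point representation of a real value |x| <= 1."""
    a = round(x * (1 << FRAC_BITS))
    assert -(1 << 63) <= a < (1 << 63)   # two's complement range
    return a

def from_fixed(a: int) -> float:
    """Inverse conversion, used when interpreting results on the CPU side."""
    return a / (1 << FRAC_BITS)
```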

In order to ease congestion on the FPGA chip and avoid timing issues associated with long routing paths, we split the input matrix into four blocks (see the left side of figure 1) and stream the column blocks into 4 distinct 'column summing' DFE kernels (indicated by green colored boxes in figure 1). This way the computations could be partitioned over the Xilinx Alveo U250 FPGA chip used in our numerical experiments while increasing the computational concurrency at the same time. The purpose of these kernels is to (i) calculate the initial column sums $\sum_i \delta_i a_{ij}$ for each column in the first n ticks of the kernel (here n is the matrix size). Secondly, (ii) once the initial column sums are calculated, a Gray code counter is launched (in the nth clock tick) to stream the last n − 3 elements of the δ vectors defined in the BB/FG permanent formula of equation (4). (The corresponding δi elements are given by the individual bits of the Gray code; further details on generating the Gray code counter logic via bit-wise operations were discussed in section 2.1.) The first element of each δ vector is 1 by definition, while the second and third elements are fixed by the following reasoning: in order to increase the computational concurrency on the DFE, the Gray code counter is multiplexed to create 4 concurrent streams of δ vectors by fixing the second and third elements to (0, 0), (0, 1), (1, 0) or (1, 1), as indicated by the colored δ streams in figure 1. Consequently, in each of the 4 'column sum' kernels, 4 concurrent δ vector streams are used to calculate the $\sum_i \delta_i a_{ij}$ weighted column sums. In each clock cycle, 4 new column sums are concurrently calculated for each column index j (distributed between the 4 column sum kernels) corresponding to the 4 multiplexed δ streams: from the value of the most recent column sum, the new value is determined by adding or subtracting twice the matrix element taken from row i. The row index i corresponds to the element index where a change occurred in the current δ vector compared to the previous one (due to the design this index is the same for all of the multiplexed δ streams).
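The effect of the multiplexing can be mimicked in software with the following sketch (function names are ours, NumPy assumed): each of the four streams fixes the two lowest Gray code bits, and the four partial sums add up to the full outer sum of equation (4). Unlike the hardware kernels, the sketch recomputes every column sum from scratch; it only demonstrates the stream partitioning.

```python
import numpy as np

def stream_addends(A, fixed_bits):
    """Sum of the addends of equation (4) whose two lowest Gray code bits
    (i.e. delta_2 and delta_3) equal fixed_bits; one such stream is fed to
    each multiplexed lane. Requires n >= 3."""
    n = A.shape[0]
    acc = 0.0j
    for i in range(1 << (n - 3)):
        g = ((i ^ (i >> 1)) << 2) | fixed_bits   # full (n-1)-bit Gray code
        delta = np.array([1] + [1 - 2 * ((g >> b) & 1) for b in range(n - 1)])
        acc += np.prod(delta) * np.prod(delta @ A)
    return acc

def permanent_4_streams(A):
    # The four lanes together cover every delta vector exactly once.
    n = A.shape[0]
    return sum(stream_addends(A, fb) for fb in range(4)) / 2 ** (n - 1)
```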

Figure 1. Schematic layout of the permanent calculator DFE: the input matrix is split into four column blocks feeding four 'column summing' kernels (green), whose outputs are multiplied together in the product kernels (red) and accumulated in the final 'sum up' kernel (blue).

The calculated column sum streams are gathered into groups according to the multiplexed δ streams to provide the input for the next layer of computing kernels. The red-colored kernels in figure 1 calculate the products of the incoming column sum streams (each of them corresponding to a specific column index j), i.e. the stream data arriving in the same clock cycle are multiplied with each other over a binary tree reduction of depth $\lceil \log_2 n \rceil$.

The most expensive operations in the design are the multiplications of the complex numbers associated with the column sums. According to the most widely used approach, a complex multiplication is broken down into 4 fixed point multiplications and 2 additions as $(a+bi)(c+di) = (ac-bd) + (ad+bc)i$. The formula suggested by Knuth in [63] allows the computation of the product as $(a+bi)(c+di) = (ac-bd) + \big((a+b)(c+d) - ac - bd\big)i$, which can be implemented with 3 multiplications and 5 additions. There is also the Karatsuba-style formula $(a+bi)(c+di) = \big(c(a+b) - b(c+d)\big) + \big(c(a+b) + a(d-c)\big)i$, which Knuth credits to Peter Ungar from 1963. This approach uses the same number of operations as the prior one. From a pipelining perspective on the FPGA chip, the first formula is more balanced, as the multiplications and additions yielding the real and imaginary parts can be implemented concurrently. However, it was shown in [64] that the numeric stability of the Ungar formula is better, as only the imaginary part is affected by additional inaccuracy. Fortunately, in the context of the fixed point numbers used in our design, this aspect is of less importance. It should also be noted that the additions resulting in the final real and imaginary components fall within the range $[-2, 2]$, requiring an extra leading bit in the intermediate computations. In addition, the fixed point multiplications need to be further broken down to deal with efficient tiling on the hardware units of the FPGA chip, such as digital signal processing (DSP) multipliers.
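For illustration, the Ungar variant reads as follows in Python (the helper name is ours); the assertion checks it against the 4-multiplication textbook formula.

```python
def cmul_ungar(a, b, c, d):
    """(a + bi)(c + di) via Ungar's 3-multiplication formula."""
    t1 = c * (a + b)
    t2 = a * (d - c)
    t3 = b * (c + d)
    return t1 - t3, t1 + t2   # (real part, imaginary part)

# agrees with the 4-multiplication textbook formula:
assert cmul_ungar(1.5, -2.0, 0.25, 3.0) == (
    1.5 * 0.25 - (-2.0) * 3.0,
    1.5 * 3.0 + (-2.0) * 0.25,
)
```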

In our implementation, we followed the pioneering results of [65–67]. According to [65], one can use the Karatsuba multiplication formula by optimally splitting the input multiplicands into smaller bitwidth parts corresponding to the least common multiple of the widths of the utilized input bit ports of the DSP units. (Thus, the optimal selection of the tiling factors depends on the width of the input multiplicands.) A slightly different approach is to work out a DSP-count optimized recursive Karatsuba-like formula for the individual input bit size configurations by following the reasoning of [66, 67], which is applicable to rectangular-sized tilings. (The DSP48 units embedded inside our Xilinx Alveo U250 FPGA cards have 27 × 18 bit wide input ports to perform signed multiplications.) Since our DFE implementation is designed to handle at most $40\times40$ matrices, one needs to calculate the product reduction of 40 complex numbers. The tree-like reduction is performed over 6 levels, providing a balance between resource utilization and pipelining. On the first level, 20 products are calculated from 64 bit wide input fixed point values (streamed from the column sum kernels), resulting in 79 bit results (the remaining bits are discarded). On the second level, the 20 results are paired up into 10 pairs; the multiplications are performed again with 79 bit wide results. The third level calculates the pair-wise products of the 10 numbers, resulting in 93 bit results. The remaining 3 levels reducing the final 5 numbers are done with results of increasing bitwidth, up to 189 bit precision. All values are fixed point numbers with 2 integer bits (including the sign bit) and the remaining bits as fractional bits. The final values are accumulated at 192 (or 256) bits of precision, 6 of which are integer bits. During the development, numerous technical details were addressed to end up with an efficient strategy to multiply large bitwidth complex numbers. We solved many issues related to the correct management of the fan-outs of intermediate computations and to the careful choice between LUTs and DSP units for multiplications on small bitwidth tiles, in order to reach the limits of our design while keeping it decomposable on the FPGA chip. The most cumbersome limiting factor was the requirement to keep the numerical accuracy comparable to the CPU implementations using extended precision floating point number representation (see section 3.2 for the comparison of the DFE implementation to the CPU ones). To this end, we needed to increase the bitwidth of the fixed point numbers in the multiplication binary tree as much as possible by utilizing more and more hardware resources, while keeping up with the timing constraints by designing a suitable pipeline structure on the chip.

Finally, the results of the four product kernels are streamed toward the final 'sum up' kernel, indicated by the blue-colored region in figure 1. Due to the four multiplexed Gray code streams, four different column product reductions enter the kernel in each tick. The incoming streams are summed up and added to the partial permanent labeled by Perm in figure 1. To sum up the permanent addends into a single scalar result, a single clock tick deep data-flow loop is designed (corresponding to the addition of two fixed point numbers) by removing any registers from the underlying logic. Finally, the result, encoded by a 192 (or 256) bit complex number, is streamed back to the CPU.

Beyond the discussed layout issues, many other optimizations were implemented to increase the practical usability of the developed permanent calculator DFE in quantum computing simulations. The most important issue in our work was to generalize the implementation to matrices of variable size. Since the initial uploading of the DFE program onto the FPGA card takes about ∼1 minute, using the engine in realistic BS simulations without this generalization would be infeasible. In section S1 of the supplementary material, we provide the details of our solution to generalize the DFE implementation to calculate the permanent of arbitrary-sized matrices. While preserving the computational accuracy (see section 3.2), however, the maximal size of the input matrix for which the layout could still fit onto the chip turned out to be $40\times40$ (with builds compiled at 300 and 330 MHz clock frequencies, depending on the maximal matrix size). For cases where the FPGA would not tick enough times to produce a result, i.e. the input matrix is too small to fill the compute pipeline, the permanent computations are performed solely on the CPU side. (While such cases could be computed in one tick on the FPGA with some additional logic, the communication delay with the DFE would likely exceed the cost of the computation on the CPU side.) We would also like to highlight at this point another important optimization of the design: when feeding the permanent calculator DFE with multiple matrices sequentially (i.e. one after another) in a single execution, one can retrieve the calculated permanents of all the matrices in one DFE execution. Since the execution of our program on the DFE incurs an overhead of about a millisecond (including I/O data transfer and other initialization overhead once the DFE program is already uploaded to the FPGA card), a further advantage is expected for the DFE if one can calculate the permanents of several matrices in a single execution, amortizing the overhead between the matrices. Following this idea, in section S4 of the supplementary material we show that a significant speedup can be realized in permanent evaluation even for small matrices, speeding up the simulation of BS when a large number of smaller matrices needs to be processed.

Depending on the specific needs of individual simulation tasks, it might be advantageous to introduce more concurrency into a single evaluation of a permanent by splitting the calculation between multiple DFEs. This is especially useful when dealing with larger matrices (up to $40\times40$ in our case). In section S2 of the supplementary material, we provide further details on the technical background related to scaling up the permanent calculation over multiple DFEs. In order to utilize DFEs to support the simulation of photonic quantum computers (when multiple photons can collide onto the same output mode), one also needs to implement the necessary logic accounting for row multiplicities, reducing the permanent calculation complexity. This optimization was discussed in section 2.2.

Finally, we cannot finish this section without mentioning that in order to prevent integer overflow during the arithmetic operations on the DFE, we need to keep all intermediate results within certain bounds corresponding to the limits of the used fixed point number representations. This can be achieved by a well-formulated normalization strategy for the input matrices, while the normalization factors can be used to re-scale the output received from the DFE to end up with the correct value of the permanent. We provide further details on the applied normalization strategy in section S5 of the supplementary material.

3.1. Performance benchmark of the permanent calculator implementations

In this section, we provide the details of our performance benchmark comparing the DFE permanent calculator implementation to the fastest CPU implementations available today. The numerical experiments were conducted using Xilinx Alveo U250 FPGA cards implementing a bitstream generated from VHDL, which was in turn generated from Java sources by Maxeler's MaxCompiler 2021.1. Thus, the heart of the developed permanent calculator DFE is a static data-flow computation model translated into VHDL by the MaxCompiler. In the performance benchmark, we compared the DFE implementation to the Piquasso Boost library's parallel C++ BB/FG implementation and to TheWalrus v0.16.2 library's Ryser implementation in double and quad precision (implementing 128 bit floats when the GNU or Intel compiler is used to build the source, and long double precision for other compilers). The measurements were obtained by averaging the execution time of 5 computations for smaller matrices, while for the largest matrices a single execution measurement was taken.

The CPU computations were performed on a two-socket system of AMD EPYC 7542 32-core processors with two logical threads on each core and two non-uniform memory access (NUMA) nodes, where we bound the computations to one of the NUMA nodes. (Thus, the CPU benchmark measured the performance of 64 computing threads.) The FPGA communication uses a high-speed PCI Express 3.0 channel over which the input matrix and a series of other input parameters are uploaded, like scalar initialization parameters for each kernel, clock synchronization data, etc. The clock frequency of the DFE implementation was 320 MHz. The initialization time of programming the DFE when uploading the compiled bitstream takes 56.2 s for a single DFE, while it takes 112.9 s for the dual DFE mode (i.e. splitting the permanent evaluation over two DFEs). The uploaded bitstream is preserved on the DFE until it is reprogrammed, hence the initial uploading time needs to be spent only at the very start of the usage. The initialization of TheWalrus library takes around 6 milliseconds, while the BB/FG implementation of the Piquasso Boost library exhibits the most negligible loading delay of about 19.5 microseconds.

We compared the performance of our implementation provided in the Piquasso Boost library to that of TheWalrus version 0.16.2 [60], which also provides parallelized C++ engines to evaluate the permanent function. (Newer versions of TheWalrus contain neither the C++ engines nor the extended precision implementations.) The results are plotted in figure 2, showing the performance of the different libraries. The red-colored data points show the execution time of the single and dual DFE implementations. (For details related to the dual mode see section S2 of the supplementary material.) For matrices below the crossover size, the initialization overhead dominates the execution of the DFE, resulting in a constant execution time. (The permanent evaluation of the smallest matrices is done solely on the CPU side, explaining the first 3 data points in the set.) Above the crossover region, the advantage of the DFE execution becomes evident. Only the double precision BB/FG implementation comes close to the performance of the DFE. However, while the single DFE implementation is only about 1.4 times faster than the double precision BB/FG implementation, the accuracy of the DFE implementation is significantly better, being better than or identical to that of the long double precision BB/FG implementation up to the maximal matrix size. (For details see section 3.2.) The long double precision CPU implementations are significantly slower than the DFE (see the indicated speedup factors in the legend of figure 2), implying the advantage of the DFE over CPU implementations when high numerical accuracy is required. The total run-time to evaluate the permanent of an n×n unitary on a single (k = 2) and dual (k = 3) DFE setup can be calculated as $t(n) = t_0 + (n - 1 + 2^{n-1-k})/f$, where t0 is a small PCI Express communication delay plus the pipeline delay of the kernels (approximately half a millisecond), and f is the clock frequency in Hertz. The first n − 1 clock cycles are spent initializing the column sums, and in the remaining $2^{n-1-k}$ ticks the permanent is evaluated. Our design is compiled for a 330 MHz clock frequency, so the execution time for a $40\times40$ input matrix on the dual DFE build is in total 207 s (equivalent to 337 GOPS, where 1 GOPS is 10^9 operations per second). For such 'long-run' executions, the t0 initialization overhead is negligible. For comparison, [51] reported the time required to compute the permanent of a $40\times40$ matrix on 98 304 CPU cores of the Tianhe-2 supercomputer to be 24 s using Ryser's formula in double precision. Consequently, one CPU server with two DFE engines calculates a $40\times40$ permanent only 8.6 × slower than 4096 CPU nodes (each node containing 24 cores) in the benchmark of [51].
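The run-time model can be evaluated with a few lines of Python (the helper name is ours; the 0.5 ms default for t0 follows the estimate above):

```python
def dfe_runtime(n, k, f_hz, t0=5e-4):
    """Predicted run time of an n x n permanent on the DFE: n - 1 ticks of
    column sum initialization plus 2**(n - 1 - k) ticks of Gray code
    streaming (k = 2 for a single DFE, k = 3 for the dual build), on top of
    the ~0.5 ms PCIe/pipeline delay t0."""
    return t0 + (n - 1 + 2 ** (n - 1 - k)) / f_hz

print(dfe_runtime(40, k=3, f_hz=330e6))   # ~208 s, in line with the reported 207 s
```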

Figure 2. Execution times of the benchmarked permanent calculation implementations as a function of the matrix size (single and dual DFE versus CPU implementations; see text).

From a practical usability point of view, we also examined the performance of the permanent calculation implementations when photon multiplicities on the photonic modes induce repeated rows and columns in the unitary describing the photonic scattering process. (See the details in section 2.2.) In this case, the average photon multiplicity on the photonic modes is controlled by the unitary size n and the total photon number. In figure 3, we compare the performance of the DFE engine to the CPU implementations of the Piquasso Boost library for photon counts 20 and 30. The figure shows the execution time measured for permanent evaluation on random unitaries constructed for photons randomly distributed over the photonic modes. From our numerical results, we see that for photon counts larger than 20 the DFE turns out to be more efficient than the CPU implementations in the evaluation of permanents. While matching the numerical accuracy of the long double precision BB/FG implementation, the DFE is about 3.6 times faster than the double precision BB/FG implementation at photon count 30 and about 7.4 times faster at photon count 40. (The speedups compared to the long double precision implementation at the same photon counts are 35.4 × and 27.9 ×, respectively.) By using the dual DFE implementation the speedup is further doubled. At lower photon counts, however, the PCIe delay time of the DFE dominates the execution. (See the left side of figure 3.) In the repeated-row DFE variant of the permanent calculator, the column sum initialization achieves the same performance by staggering the computations via a loop tiling/staggering technique. This requires the CPU to pre-compute and upload some additional data, which scales on the order of the loop length. The loop length in practice is small, 9 cycles. The cycle delay originates from the multiplications used in computing the binomial updates. To keep the parallelism of 4 or 8, we also force the first 3 or 4 row multiplicities to be 1 (the original formula already requires one such reduction, and 2 or 3 extra are needed for the parallelism), which can slightly increase the operation count. However, by choosing the smallest multiplicities for row expansion, the cost of maintaining power-of-2 parallelism is largely mitigated. The DFE implementation accounting for row/column multiplicities is also compiled for a 330 MHz clock frequency.

Figure 3. Execution times of permanent evaluation for unitaries with repeated rows and columns, comparing the DFE engine with the CPU implementations of the Piquasso Boost library at fixed photon counts (see text).

Finally, in section S4 of the supplementary material we show that by streaming multiple input matrices to the DFE in a single execution, one needs to pay the PCIe delay time only once and divide it between the permanent calculation instances, thus lowering the crossover in the DFE-CPU performance down to a matrix size of n ≈ 13.

3.2. Numerical accuracy of permanent calculator engines

As pointed out earlier in [51], numerical accuracy becomes a notable issue as the size of the input matrix increases. The final numerical precision of an implementation depends on the interplay of various factors: the number representations used in the individual arithmetic operations, the conversions between them, and the computational design of the mathematical operations all have a great effect on the accuracy of the final result. For example, [51] found that the BB/FG formula shows higher numerical fidelity than the Ryser formula in the experiment of reproducing the analytical result evaluated for the identity matrix. According to our reasoning, the different numerical properties of the two approaches can be explained by the difference in the number of addends in the inner sums of equations (2) and (4). While in equation (4) the sum always involves the same number of matrix elements before calculating the products of the column sums, in equation (2) the number of summed matrix elements varies according to the actual subset S, resulting in a possibly wider range of magnitudes in the calculated products before summing them up. In order to increase the accuracy of the calculated permanent in the Piquasso Boost library, we evaluate the outer sum of the BB/FG formula by classifying the individual addends according to their magnitude, splitting them into 'pockets', and always summing partial results that are close to each other in magnitude. (We also applied this strategy in our previous work [68].)
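One possible realization of this 'pocket' summation is sketched below (the function name and the pocket granularity are ours; the Piquasso Boost internals differ in their details):

```python
import math

def pocket_sum(addends):
    """Accumulate addends in 'pockets' indexed by binary exponent so that
    only values of similar magnitude are summed directly; the pockets are
    then combined from the smallest magnitude to the largest."""
    pockets = {}
    for x in addends:
        key = math.frexp(abs(x))[1] // 16    # ~16 binades per pocket
        pockets[key] = pockets.get(key, 0.0) + x
    return sum(pockets[k] for k in sorted(pockets))
```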

Now we examine the numerical accuracy of the individual calculation methods. We implemented the designed scalable BB/FG algorithm with several levels of floating point number representation in the Piquasso Boost library and compared the results to the Ryser formula implemented in TheWalrus package (version 0.16.2). Among them, the least accurate variant turned out to be the Ryser formula implemented with double precision (i.e. 64 bit) floating point arithmetic, followed by the double precision BB/FG formula evaluated with the Gray code strategy. By default, the Piquasso Boost library uses extended precision floating point arithmetic to calculate the permanent, providing high numerical accuracy for photonic quantum computer simulations up to photon numbers still accessible on traditional HPC nodes.

To establish a proper ground truth, we also incorporated the GNU Multiple Precision (GMP) arithmetic library's extension for Multiple Precision Floating-Point Reliability (MPFR) [56] to compute the permanent with 'infinite precision' using the designed recursive BB/FG algorithm. (The correctness of the 'infinite precision' implementation was tested on special input cases. Namely, when evaluating the BB/FG formula on m×n rectangular-shaped matrices with $m < n$, the final result should inevitably be 0 due to the exact cancellation of the addends. Such a computation, where an approximation with normal floating or fixed point numbers would never return an exact 0 result, provides very convincing evidence for the correctness of the implementation.) Figure 4 shows the relative error

$$\epsilon = \frac{\left| \operatorname{perm}(A) - \operatorname{perm}_{\mathrm{MPFR}}(A) \right|}{\left| \operatorname{perm}_{\mathrm{MPFR}}(A) \right|} \qquad (9)$$

of the several benchmarked permanent calculation implementations, with $\operatorname{perm}_{\mathrm{MPFR}}$ standing for the infinite precision implementation. Our numerical experiments carried out on random unitary matrices justify our expectations. (This choice is justified as our primary goal is to use the developed permanent calculation engines to support photonic quantum computer simulations, where the physical system is described by unitary matrices.) The 64 bit floating point representation is significantly less accurate than the extended precision counterparts. Secondly, our results revealed that the Ryser formula is less accurate than the BB/FG variant. (The accuracy of the double precision Gray-coded BB/FG implementation is close to that of the Ryser formula evaluated with extended precision.) A reduction in accuracy associated with the extended precision Ryser method first appears at a matrix size of n ≈ 20, and the difference increases with the matrix size. (See the green circles in figure 4.) Though we left the discussion of the implementation details of the DFE permanent calculator engine to section 3, here we just note that the numerical accuracy of the DFE implementation stays close to that of the extended precision CPU implementation of the BB/FG formula; we experienced a deviation only at the largest matrix sizes. However, even at such matrix sizes the accuracy of the DFE implementation still remains better than that of the extended precision Ryser formula.

Figure 4. Relative error of the benchmarked permanent calculation implementations as a function of the matrix size, measured against the 'infinite precision' MPFR reference (see text).

According to our reasoning, the main source of numerical error in the CPU implementations can be explained by the fact that in the Gray code variants of the Ryser and BB/FG formulas, some computational data are reused from cycle to cycle, in each turn modifying their value by addition/subtraction. Consequently, some addends to the permanent are derived via exponentially many arithmetic operations. This results in an accumulation of numerical error compared to the naive $\mathcal{O}(n^2 2^{n-1})$ implementations, where each addend added to the permanent is the result of only about $n^2$ arithmetic operations. This reasoning leads us to the conclusion that the $\mathcal{O}(n^2 2^{n-1})$ implementations are expected to be more accurate than the Gray-coded $\mathcal{O}(n 2^{n-1})$ variants. We justified this expectation by evaluating the numerical accuracy of the BB/FG formula without the Gray code strategy. The associated orange-colored data points in figure 4 reveal that the double precision, non-Gray-coded BB/FG formula indeed gives as good accuracy as the Gray-coded BB/FG formula evaluated with extended precision. The Gray-coded implementations, in turn, perform so much faster that executing them in extended precision (to obtain equivalent numerical precision) is still favorable.

4. Classical simulation of BS including photon losses

The classic experimental setup of BS is shown in figure 5. Given an m-mode n-particle Fock state $|\vec{s}\rangle = |s_1, s_2, \ldots, s_m\rangle$ and an interferometer described by an m×m matrix U, one performs the passive (particle-number preserving) linear-optical evolution of $|\vec{s}\rangle$ with the interferometer and then measures the outcome using particle-number resolving detectors.

Figure 5. The classic experimental setup of BS (left) and its lossy variant, where beam splitters transfer particles into inaccessible modes (right).

In the subsequent formulas, we use si and ti to label the components of the occupation vectors describing the input state $|\vec{s}\rangle$ and the output state $|\vec{t}\rangle$ as $\vec{s} = (s_1, \ldots, s_m)$ and $\vec{t} = (t_1, \ldots, t_m)$. In the particle-conserving case, $\sum_i s_i = \sum_i t_i = n$. The output probability pU associated with a given process $|\vec{s}\rangle \to |\vec{t}\rangle$ is then given by

$$p_U\big(\vec{s} \to \vec{t}\,\big) = \frac{\left| \operatorname{perm}(U_{ST}) \right|^2}{\prod_{i=1}^{m} s_i! \, \prod_{j=1}^{m} t_j!}, \qquad (10)$$

where UST is an n×n matrix constructed from the unitary U describing the interferometer as follows: first, we construct an m×n matrix US by taking the ith column of U si times, and then we take the jth row of US tj times. The hardness of BS is a consequence of the fact that the probability amplitude in equation (10) is proportional to a matrix permanent. The situation is especially clear in the regime $m \gg n^2$, where for a typical Haar-random U the probability distribution given in equation (10) is concentrated on the collision-free subspace spanned by states $|\vec{t}\rangle$ with $t_i \le 1$. This is the regime in which the arguments for hardness put forward in the original BS paper [4] hold (see also [69] for a more refined analysis of the hardness of this problem). The regime where the number of photons is comparable to the number of modes was investigated in the separate analysis of [70], establishing evidence for the hardness of BS simulation in this regime as well.
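The construction of $U_{ST}$ and the probability of equation (10) can be sketched as follows (the function name is ours, NumPy assumed; permanent_bbfg refers to the sketch in section 2.1, but any permanent engine can be substituted):

```python
import numpy as np
from math import factorial, prod

def outcome_probability(U, s, t):
    """Probability of the process |s> -> |t> from equation (10): repeat the
    i-th column of U s_i times, then the j-th row of the result t_j times,
    and evaluate the permanent of the resulting n x n matrix U_ST."""
    U_S = np.repeat(U, s, axis=1)      # m x n: one column per input photon
    U_ST = np.repeat(U_S, t, axis=0)   # n x n: one row per output photon
    p = permanent_bbfg(U_ST)           # any permanent engine works here
    norm = prod(factorial(k) for k in s) * prod(factorial(k) for k in t)
    return abs(p) ** 2 / norm
```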

From an experimental point of view, it should be noted that in real-world devices realizing BS, photon losses occur on a regular basis. Thus,

$$\sum_{i=1}^{m} t_i \leq \sum_{i=1}^{m} s_i = n, \qquad (11)$$

where all of the symbols were defined previously. Lossy BS introduces several new challenges into the simulation. In [71, 72] the authors discuss loss handling and a classical algorithm for simulating lossy BS. In general, an interferometer is a network of interconnected beam splitters; in the lossy scenario, a lossy beam splitter transfers the particle into an inaccessible (unmeasurable) mode. The practicality of this approach is evident, as its implementation needs no new tools besides beam splitters. Graphically, it is usually presented as in the right panel of figure 5. The authors of [72] also noticed that the uniform part of the losses (i.e. the loss that applies uniformly to all of the modes) can be extracted from the interferometer and applied as a pure-loss channel at the beginning of the simulation, leading to a new instance of the matrix $U$ describing the interferometer. One may still incorporate further, non-uniform losses into the model, while the simulation becomes computationally less demanding due to the lower number of particles at its input. It is worth pointing out that the matrix describing a lossy interferometer is no longer unitary, as its singular values can be smaller than unity.
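Since a uniform loss channel commutes with the uniformly lossy part of the interferometer, it can be applied directly to the input occupations. Below is a minimal sketch of this pure-loss channel, assuming a uniform transmissivity `eta`; the function name is ours.

```python
import numpy as np

rng = np.random.default_rng()

def apply_uniform_loss(s, eta):
    """Pure-loss channel at the input: each of the s_i photons in mode i
    survives independently with probability eta (binomial thinning), so
    the downstream simulation runs with fewer particles."""
    return rng.binomial(np.asarray(s), eta)

# Example: 20 photons in the first 20 of 60 modes, 30% loss (eta = 0.7).
s = np.array([1] * 20 + [0] * 40)
print(apply_uniform_loss(s, 0.7).sum())  # on average 14 surviving photons
```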

The most popular algorithm for BS simulation is the Clifford and Clifford algorithm [46, 47]. Although it was designed for the exact simulation of BS, the algorithm can be adapted to lossy BS simulations as well by virtually doubling the number of optical modes according to the reasoning of [71], while taking the sample only from the first half of the outputs.
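The mode-doubling trick admits a compact linear-algebra formulation: a uniformly lossy interferometer $\sqrt{\eta}\,U$ can be embedded into a $2m \times 2m$ unitary, and sampling only the first $m$ output modes reproduces the lossy distribution. The sketch below assumes a uniform transmissivity $\eta$; the block structure is one standard dilation choice, not necessarily the one used in Piquasso Boost.

```python
import numpy as np

def lossy_dilation(U, eta):
    """Embed the sub-unitary sqrt(eta)*U into a 2m x 2m unitary W.
    Photons enter the first m modes; the last m modes collect the lost
    photons and are simply not sampled. W @ W.conj().T == I for unitary U."""
    m = U.shape[0]
    I = np.eye(m, dtype=complex)
    t, r = np.sqrt(eta), np.sqrt(1.0 - eta)
    return np.block([[t * U, r * I],
                     [r * U, -t * I]])

# Sanity check on a random unitary (QR decomposition of a Gaussian matrix).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4)))
W = lossy_dilation(Q, 0.7)
print(np.allclose(W @ W.conj().T, np.eye(8)))  # True
```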

We implemented the Clifford and Clifford algorithm in the Piquasso Boost high-performance C++ library, linked against both traditional CPU and DFE permanent calculation engines. A BS experiment can be designed via the high-level programming interface of the Piquasso framework in the Python language, and the BS simulation can be scaled up over multiple servers via the MPI communication protocol. In our numerical experiments, we used AMD EPYC 7542 32-core processor servers, each server containing two CPU nodes and two Xilinx Alveo U250 FPGA cards. We executed the BS simulation on 4 MPI processes, each process utilizing one CPU node with one FPGA card. Figure 6 shows the results of our BS simulation performance measurements carried out on host servers equipped with the DFE permanent evaluation engines. Following the reasoning of Clifford and Clifford [47], the theoretical prediction of the average sampling time (i.e. averaged over many samples) in the BS simulation (proportional to the average-case sampling complexity) can be given by the

$$\bar{t}(n) = T_0 \, n \, 2^n \qquad (12)$$

expression in the $m \gg n$ limit. The expression contains a single parameter $T_0$, providing a portable measure to capture the performance of a BS simulator. We fitted equation (12) to three data sets obtained from the averaged sampling time of a 60-mode interferometer. One data set corresponds to an ideal BS without any photon loss in the simulation. The second and third data sets were obtained by simulations assuming 30% and 50% losses. (The 60-mode interferometer and 30% loss correspond to the experimental setup of [32].) In the case of lossy BS, the simulation implies a doubling of the photonic modes, resulting in the increased sampling times also reported in figure 6. We found that equation (12) describes the complexity of BS simulations very well at larger input photon counts, and we obtained the value of the single parameter $T_0$ from the fitting. At lower input photon counts, the measured sampling performance differs from the prediction of equation (12): at the corresponding $10^{-3}$–$10^{-6}$ s timescale, the CPU-DFE performance crossover, the CPU parallelization overhead, and the initialization cost of the computational model (such as the Python-C++ binding, library loading, etc) dominate the execution time of the BS simulator.
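Since equation (12) is linear in $T_0$, the fit reduces to a one-parameter least-squares problem with a closed-form solution. A minimal sketch follows; the function name and the sample data are ours, for illustration only.

```python
import numpy as np

def fit_T0(photon_counts, avg_times):
    """Least-squares fit of T0 in t(n) = T0 * n * 2**n (equation (12)).
    The model is linear in T0, so the optimum is <t, b> / <b, b> with
    basis b(n) = n * 2**n."""
    n = np.asarray(photon_counts, dtype=float)
    t = np.asarray(avg_times, dtype=float)
    b = n * 2.0 ** n
    return float(t @ b / (b @ b))

# Synthetic data generated from the model itself with T0 = 2e-9 s:
ns = np.arange(20, 31)
ts = 2e-9 * ns * 2.0 ** ns
print(fit_T0(ns, ts))  # recovers ~2e-9
```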

[Figure 6. Measured average sampling time per sample as a function of the input photon number for ideal and lossy BS simulations, with fits of equation (12).]

When the number of optical modes $m$ is significantly larger than the photon count (the $m = 120$ case), the average execution time scales exponentially with the number of input photons. Conversely, when the interferometer features fewer optical modes, the average time complexity tends towards sub-exponential behavior, as evidenced by the slight deviation of the blue data points in figure 6 from a straight line. This finding resembles the original hardness result of Aaronson and Arkhipov for exact BS, which requires $m \gg n^2$ as a crucial condition in their hardness analysis.

Consequently, it is recommended that experimentalists design devices with a substantially higher number of optical modes than input photons; however, the ratio between the number of optical modes and the photon count does not necessarily need to be large. Based on this finding, achieving quantum advantage in boson sampling experiments appears feasible, as a moderate increase in system size relative to the photon count can suffice for this purpose.

Finally, we note that the obtained $T_0$ fitting parameter is expected to be inversely proportional to the number of incorporated DFEs, scaling ideally up to a large number of accelerators. Since the CPU servers hosting the DFEs collect the samples on their own, gathering the samples over the host servers poses a one-time communication overhead that is negligible compared to the overall sampling time. We successfully demonstrated this design using two CPU servers hosting 4 FPGA cards in total.

5.Conclusions

In this work, we described a novel scalable approach to evaluate the permanent function based on the BB/FG formula. Utilizing the mathematical properties of the reflected Gray code ordering, one may concurrently evaluate partitions of the outer sum, paving the way for parallel computation of the permanent. We further generalized the algorithm for cases when column or row multiplicities occur in the input matrix (corresponding to multiple occupations of photonic modes), whereby the complexity of the permanent calculation can be reduced. We achieved this by utilizing a generalized n-ary Gray code ordering of the outer sum in the BB/FG permanent formula, with digits varying between zero and the occupation numbers of the individual optical modes. This generalization makes the BB/FG formula applicable to high-performance BS simulation utilizing a significant reduction in computational complexity, as was previously demonstrated using the Ryser formula as well [47, 58, 59]. The main advantage of the BB/FG formula over the Ryser variant lies in the numerical accuracy of the calculated permanent value. Our numerical experiments using the MPFR multi-precision library showed that Ryser's formula loses against the BB/FG method by several orders of magnitude in accuracy in both double and extended precision calculations.
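For reference, the n-ary (mixed-radix) reflected Gray code underlying this generalization can be generated looplessly, changing one digit by one unit per step, with digit $i$ running from zero up to the corresponding occupation number. A minimal sketch following Knuth's Algorithm H (the naming is ours; each radix must be at least 2, i.e. each occupation at least 1):

```python
def mixed_radix_gray(radices):
    """Loopless reflected mixed-radix Gray code (Knuth, TAOCP 7.2.1.1,
    Algorithm H). Yields every digit tuple with 0 <= a[i] < radices[i],
    each step changing exactly one digit by +1 or -1. For a matrix with
    row multiplicities s_i, use radices[i] = s_i + 1."""
    n = len(radices)
    a = [0] * n                 # current digits
    f = list(range(n + 1))      # focus pointers
    o = [1] * n                 # directions of change
    yield tuple(a)
    while True:
        j = f[0]
        f[0] = 0
        if j == n:              # sequence exhausted
            return
        a[j] += o[j]
        if a[j] == 0 or a[j] == radices[j] - 1:
            o[j] = -o[j]        # reverse direction at the digit's boundary
            f[j] = f[j + 1]
            f[j + 1] = j + 1
        yield tuple(a)

# Example: occupations (2, 1) -> radices (3, 2); six tuples, one-digit steps.
print(list(mixed_radix_gray([3, 2])))
# [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1)]
```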

We also implemented the BB/FG method on FPGA-based DFEs. Our implementation is capable of handling matrices of arbitrary size (up to 40 columns) without the overhead of reprogramming the FPGA chips, and it accounts for row multiplicities in the input matrix via the n-ary Gray code ordering ported to the FPGA chips. The throughput of the DFEs was further increased by organizing the input matrices into batches and executing multiple permanent calculations on the FPGA chips in a single execution. Finally, the fixed-point number representation implemented on the FPGA chips provides accuracy competitive with the extended precision BB/FG CPU implementation in the evaluation of the permanents of unitary matrices. The accuracy of the DFE implementation, equivalent to extended precision, holds up to matrices of size n = 40.

These features turned out to be essential for achieving an efficient BS simulator design supported by high-performance DFEs. We integrated our permanent evaluation DFEs into the computation flow of both the ideal and the lossy BS simulation. Since the simulation of lossy BS involves twice as many photonic modes as the ideal variant of the same BS design, the simulation of lossy BS takes more time in general. On average, our setup of 4 Alveo U250 FPGA cards made it possible to take a single sample from a 60-mode interferometer with 40 input photons in around 80 s without photon losses; introducing photon losses into the design increased the time needed to draw a single sample accordingly. The theoretical description of [47] fits the measured performance data very well by fitting a single parameter (labeled $T_0$ in equation (12)) to all data points. The fitting parameter provides a portable measure to compare different BS simulator implementations; however, we did not find any comparable work in the literature on BS simulation providing similar performance measurements to ours. Instead, we can compare the performance of our simulator to a real BS experiment involving 60 photonic modes, 20 input photons, and 14 measured photons (i.e. a loss of 30% on average). We could simulate the described design within milliseconds per sample, while in [32] the authors detected 150 valid samples of 14-photon coincidence measurements in 26 h.

We should mention that numerous possible sources can be named to cause discrepancies between theory and experiments. Photon losses originating from the imperfection of the photon detectors, or from photons disappearing in the circuitry, can be accounted for via different loss models. Our boson sampling simulator implements photon loss by means of unmeasured photonic modes through which photons can virtually escape from the device without being captured. A real experiment can also be affected by decoherence, imperfect sources, and uncertainties in the interferometer settings, which are not captured by the simulation technique; these might be sources of differences between the simulator and a real experiment. In addition, we should not forget about the high dimensionality of the photon distribution, which makes it hard to draw conclusive judgments about the equivalence or difference of numerical and experimental results. For this purpose, several validation approaches have been developed [40–45], as mentioned in the introduction.

Finally, we note that the BS simulation capabilities described in this work can be further improved by utilizing the concept of approximate BS described in [72], in which part of the optical modes are treated within the mean-field (MF) approximation. In this approach, the number of approximated modes is a hyper-parameter of the algorithm, controlling both the speed and the fidelity of the BS simulation. We leave the study of approximate BS simulation with DFEs for future work.

Acknowledgments

This research was supported by the Ministry of Culture and Innovation and the National Research, Development and Innovation Office within the Quantum Information National Laboratory of Hungary (Grant No. 2022-2.1.1-NL-2022-00004), by the ÚNKP-22-5 New National Excellence Program of the Ministry for Culture and Innovation from the source of the National Research, Development and Innovation Fund, and by the Hungarian Scientific Research Fund (OTKA) Grants Nos. K134437 and FK135220. We also acknowledge funding by the QuantERA II project HQCC-101017733. R P acknowledges support from the Hungarian Academy of Sciences through the Bolyai János Stipendium (BO/00571/22/11) as well. We acknowledge the computational resources provided by the Wigner Scientific Computational Laboratory (WSCLAB) (the former Wigner GPU Laboratory). T R and M O acknowledge financial support by the Foundation for Polish Science through TEAM-NET project (Contract No. POIR.04.04.00-00-17C1/18-00). Supported by IRAP AstroCeNT (MAB/2018/7) funded by FNP from ERDF.

Data availability statement

All data that support the findings of this study are included within the article (and any supplementary files).
