My work involves a lot of computer calculations. By and large, there are three different types of such computations:
- Large numerical calculations that can run in parallel without supervision. Usually their output is just some number, for example numerically computing millions of Feynman integrals in order to determine the beta function of a theory. These are typically purpose-built native programs without a user interface. Once such a program has been developed, it runs for thousands of hours and writes the result into a file. This type of project is probably what most people would imagine when they hear “computational physics”.
- More experimental calculations. Sometimes their output is a number (such as generating and counting all graphs of a certain type), but often they are about checking some property, finding the distribution of something, and so on. The difference from the previous type is that these computations require much more coding, adjustment, and specific input values. While these programs are not “interactive” in the usual sense, they often run for only a few hours, after which their output needs to be inspected and the program adjusted.
- The third type of computer-intensive work is the more or less interactive evaluation, transformation, and plotting of data. For this, I mostly use generic mathematical software like Mathematica or R. In my more recent papers, most plots are done with pgfplots right inside LaTeX, but this, too, requires preparation of the data. This type of computation is generally faster than the others above, but it can still take several hours, for example the correlation analysis for statistics of Feynman integrals, or the measurement and transformation of growth rates in tropical field theory.
Specifically for the recent project concerning tropical field theory, I wrote several programs that compute various series expansions in that theory. In the process of developing one of them, I got interested in figuring out how fast the various computers that I or friends had access to at the time actually were for this purpose, so I did some benchmarks. This particular program is a C++ code which computes the mass anomalous dimension of the tropical theory to 50 loops, but the precise physical purpose is uninteresting for us here. It can run on an arbitrary number of parallel threads. I compiled it with GCC individually for each of the machines, although this rarely produced faster execution than the pre-compiled version.
The program takes a few minutes to run; all speeds below will be reported in “runs per hour”, i.e. 3600 divided by the runtime in seconds. These numbers should be taken as approximations. I did make sure no other programs were running, but of course, the performance of a system depends on all sorts of details such as memory speed, processor cooling, power management settings, etc., which I did not investigate in detail.
For reference, I measured the speed on one of the University’s compute cluster nodes. There are nodes with various specifications; this particular one has several Xeon 6342 processors, each of which comes with 24 cores / 48 threads at 2.8GHz base and 3.5GHz turbo frequency. I tested it with up to 24 threads, which therefore all run on physical cores of the same CPU, and the performance shows a nice regular increase. A single thread reaches around 1.4 runs/h, but with an increasing number of threads, the efficiency decreases. This has two reasons: Firstly, the CPU probably runs at its highest boost frequency with a single thread, but at a lower frequency under full load. Secondly, the parallelization in the program is never quite perfect: for example, starting and synchronizing many parallel threads introduces overhead and wait times, not every workload can be distributed evenly over any number of threads, parallel memory accesses may block each other, and so on. In fact, the present program parallelizes remarkably well; at 24 threads we reach around 17 runs/h, or about 0.7 runs/h per thread.


Now, let’s compare this to a typical desktop processor: a Ryzen 7 5800 with 8 physical cores of 2 threads each, at a base frequency of 3.4GHz and a boost frequency of 4.6GHz. This particular machine is water cooled and runs at around 4.1GHz even under full load. This can be seen from the plots below: up to 8 threads, the performance per thread decreases only slightly. Beyond 8 threads, there is a substantial drop in performance, probably caused by simultaneous multithreading (SMT) setting in: running two SMT threads on one physical core is not as fast as having two physical cores. Another, smaller reduction in performance is visible when going from 4 to 5 threads. This processor consists of two core complexes with 4 cores each; the effect might be caused by the second core complex getting involved, but maybe there is some other mechanism.


The peak performance of the Ryzen 5800 is reached at 16 threads. Performance collapses when even more threads are started: apparently, the threads block each other and make the overall computation slower.
Incidentally, I had access to a quite similar processor, but for notebooks: the Ryzen 7 PRO 6850H also has 8 cores / 16 threads, but at 3.2 and 4.7GHz frequencies. Since it is not water cooled, the performance decreases a bit more quickly with thread count than on the desktop. Still, given the lower power consumption of 45W vs. 65W, I was surprised that the laptop got so close to the desktop. Curiously, this CPU shows horrific performance specifically at 16 threads. I don’t know why, but the effect persisted over several repetitions.


It is quite interesting to compare this outcome to another pair of machines, which are much newer and based on Intel processors. The desktop is a Core i7 14700K. It has 8 “performance” cores, each of which runs at up to 5.6GHz and supports 2 threads. Additionally, there are 12 “efficiency” cores at up to 4.3GHz, which, however, support only one thread per core. The CPU can therefore run 28 threads in parallel in total, and it is a good example of the modern paradigm of combining multiple types of physical cores to save energy at low load without sacrificing performance at full load. Numerous such CPUs with different configurations are on the market, and I was particularly interested in whether the efficiency cores are useful at all for numerical computations.
In the plots below, we see the familiar pattern of an almost linear increase in performance as more physical cores get involved, which breaks down above 8 threads. From that point on, another almost linear pattern emerges. Interestingly, while there are some fluctuations, there is no visible break at either 16 or 20 threads. From this data, it is not even clear whether the first additional threads run as second threads on the performance cores or on the efficiency cores. Above 28 threads, we observe the familiar breakdown in performance once more threads are started than can physically run in parallel. All in all, the performance of this CPU is quite remarkable: more than 25 runs/h is a level the Xeon 6342 server does not reach even with all 48 threads in use.


On the laptop side, we have a Core Ultra 165U, featuring 2 performance cores with up to 4.9GHz and two threads each, as well as 8 single-threaded efficiency cores at up to 2.1GHz. Unlike the previous ones, this is a CPU meant to be energy efficient in small ultrabooks: with only 15W TDP, it runs at significantly lower power levels than the 45W of the 6850H and the 125W of the 14700K. This shows in the performance plots below; the performance of up to two threads is still good, but there just isn’t enough power budget for anything meaningful beyond that. The fact that the efficiency cores run at only half the clock rate of those in the 14700K clearly shows in the data. Interestingly, this is one of the few machines where there is no significant drop in performance if too many threads (here: more than 12) are used.


I tested some additional CPUs, which showed roughly similar patterns to the ones above. The table below shows the performance (measured in runs/hour of this particular program) of the different CPUs when 1, 4, or the best-performing number of threads are used. For the server, I wasn’t able to measure systematically beyond 24 threads due to scheduling issues. “freq” is the maximum boost frequency of any core in GHz, “thr.” is the total number of threads, including SMT or efficiency cores where available, and “TDP” is the thermal design power in W.
| CPU | freq | thr. | TDP | 1 thread | 4 threads | best threads |
| --- | --- | --- | --- | --- | --- | --- |
| Xeon 6342 | 3.5 | 48 | 230 | 1.4 | 5.4 | >17.4 (>24) |
| Ryzen 5800 | 4.6 | 16 | 65 | 2.0 | 7.6 | 18.3 (16) |
| Ryzen 6850H | 4.7 | 16 | 45 | 1.7 | 6.2 | 13.3 (14) |
| Core i7 14700K | 5.6 | 28 | 125 | 2.2 | 8.5 | 25.7 (27) |
| Core Ultra 165U | 4.9 | 12 | 15 | 1.7 | 3.3 | 5.5 (10) |
| Core i7 8550 | 4.0 | 8 | 15 | 1.2 | 3.2 | 4.3 (8) |
| Core i5 13500T | 4.6 | 20 | 35 | 1.7 | 6.0 | 10.5 (16) |
It is important to keep in mind that all of this data is for only one particular program, which happens to be nicely parallelizable and require very little RAM. The table is not meant to establish a precise hierarchy of these CPUs, but merely give an impression of the kind of performance one typically sees.
All in all, it turns out that a high-end desktop PC can in fact compete with one of the university compute nodes: the lower number of cores is largely compensated by the higher clock speed. This is why, in practice, I run many of my computations on desktop computers. The true advantage of the compute cluster, of course, is not the individual CPUs. Instead, these nodes have around 1TB of RAM, which is often crucial for larger computations, and there are more than a dozen nodes freely available that run 24/7. Hence, for tasks that take more than a day, I do in fact use the cluster, not because it is faster, but because the program can run there without blocking my desktop PC.
