Math Benchmarks
The AMDC is used for real-time control of motor drive systems: every X seconds, the AMDC samples various sensor inputs, performs some math on the sampled values, and then updates the PWM outputs based on the results. In the default firmware, X is 100 microseconds, i.e., a control rate of 10 kHz. For this all to work correctly, the firmware must compute the required math operations in a short time, i.e., much less than 100 us.
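Conceptually, each callback cycle looks something like the following sketch. Note that all of the names below are hypothetical placeholders, not the actual AMDC firmware API:

```c
// Sketch of one 10 kHz control callback: sample, compute, actuate.
// These declarations are illustrative stand-ins, not real firmware functions.
double read_adc(int channel);                 // hypothetical sensor read
double controller_update(double measurement); // hypothetical control law
void pwm_set_duty(int channel, double duty);  // hypothetical PWM update

void control_callback(void)
{
    double current = read_adc(0);              // 1. sample sensor inputs
    double duty = controller_update(current);  // 2. math (must finish in << 100 us)
    pwm_set_duty(0, duty);                     // 3. update PWM outputs
}
```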
The AMDC uses a PicoZed system-on-module for its “brains”. This module contains an AMD Xilinx Zynq-7000 system-on-chip, which is the main processor. This SoC combines a dual-core DSP with an FPGA. The code which computes the math operations described above runs on the DSP, which is a standard ARM Cortex-A9 core and a relatively powerful processor.
We are interested in understanding how long various math operations, e.g., `sin()`, `sqrt()`, `/`, etc., take to complete on the Cortex-A9 processor, in order to get a better idea of the timing budget per callback cycle.
Mathematical operations from the `<math.h>` library, as well as common operations and casts, were timed and benchmarked.
Benchmark Methodology
For each test, the operation was run 100 times with different operands. Clock cycles were counted for the total of 100 operations, and then a baseline number of clock cycles was subtracted from this total. The baseline was also gathered empirically for every test: it is the number of clock cycles it took simply to set the output variable of the operations to an increasing value.
To understand the sequence of events, take the example of the `addi` test:

1. First, the baseline was calculated. The global variable `mathCommandOutputInt` was set to `(1 + 0)` and the number of clock cycles taken to perform the operation was logged. Then `mathCommandOutputInt` was set to `(1 + 1)`, then `(1 + 2)`, etc. The total number of clock cycles for 100 sets of the variable was summed.
2. Then, the operations were performed. Two random numbers (from 0 to 1024) were selected as the inputs to `addi`. Let's use `639` and `912` as examples. The time taken to set `mathCommandOutputInt` to `(639 + 912 + 0)` is recorded. Then `mathCommandOutputInt` is set to `(639 + 912 + 1)`, then `(639 + 912 + 2)`, etc., and the clock cycles taken for all 100 operations are summed. The reason for the changing addend is to avoid caching of the results and to ensure that the processor must redo the add operation every time.
3. With the measured and baseline clock cycle totals logged, the baseline is subtracted from the measurement. The average is then `(measured - baseline) / 100.0` clock cycles per operation.
Code example:

```c
/* gather baseline */
for (register int i = 0; i < iter; i++) {
    register uint32_t tickStart = cpu_timer_now();       // start timer
    mathCommandOutputInt = 1 + i;                        // set output variable to an increasing value
    baselineTotalTicks += (cpu_timer_now() - tickStart); // record timer
}

/* run test on integer add */
for (register int i = 0; i < iter; i++) {
    register uint32_t tickStart = cpu_timer_now();       // start timer
    mathCommandOutputInt = input1 + (input2 + i);        // input1 + input2 + i (increasing value)
    totalTicks += (cpu_timer_now() - tickStart);         // record timer
}

printf("total ticks: %li ticks\n", totalTicks);

totalTicks -= baselineTotalTicks;                 // subtract baseline ticks
double averageTicks = (double) totalTicks / iter; // compute average ticks
double averageTime = cpu_timer_ticks_to_usec(abs(totalTicks)) / iter;

printf("\naverage ticks for add: %lf ticks\n", averageTicks);
printf("average time for add: %lf microseconds\n", averageTime);
```
Additional information:

- In some cases, the time taken to perform the operation was less than the baseline, as in the case of `muli`. This results in a negative number on the graph. This is unexpected, and further tests would be required to determine the cause.
- Tests were run on the debug version of the code, which uses the `-O2` optimization flag with the gcc compiler.
- The code that ran these tests lives here, and the Python driver code is here.
- Different output variables were used for integers, floats, and doubles to avoid implicit casting.
- It was difficult to control caching speedups. Operands to the operations were either constants or declared as `register` to try to mitigate this problem, but with more complex operations there are many factors that cannot be controlled.
Results

Common Operations
| Operation | Argument Signature | Time per Operation (ns) |
|---|---|---|
|  |  | 1.365 |
|  |  | 1.605 |
|  |  | 1.395 |
|  |  | 24.285 |
|  |  | 37.395 |
|  |  | 0.150 |
|  |  | 0.105 |
|  |  | 0.015 |
|  |  | -2.625 |
|  |  | 0.030 |
|  |  | -0.045 |
|  |  | 0.000 |
|  |  | -2.700 |
|  |  | 91.635 |
The `addi`, `subi`, `muli`, and `divi` entries refer to integer addition, subtraction, multiplication, and division. Similarly, `addf`, `subf`, `mulf`, and `divf` refer to 32-bit floating point operations. Finally, `add`, `sub`, `mul`, and `div` refer to 64-bit floating point (double) operations.
A few of these results are negative, which indicates that the operation performed faster than the baseline. This shouldn’t happen, but there are likely factors that were not accounted for, or extra optimizations that the compiler figured out, that result in this behaviour. In any case, we can conclude that these operations are essentially “free”: it takes as long to compute and output the result as it does to output a constant result.
Other Operations
| Operation | Argument Signature | Time per Operation (ns) |
|---|---|---|
|  |  | 32.940 |
|  |  | 35.235 |
|  |  | 28.935 |
|  |  | 10.770 |
|  |  | 37.800 |
|  |  | 191.775 |
|  |  | 337.920 |
|  |  | 276.795 |
|  |  | 248.175 |
|  |  | 348.900 |
|  |  | 168.105 |
|  |  | 154.410 |
|  |  | 178.830 |
|  |  | 273.495 |
|  |  | 226.110 |
|  |  | 130.025 |
|  |  | 26.445 |
|  |  | 347.580 |
|  |  | 360.630 |
|  |  | 124.335 |
|  |  | 269.865 |
|  |  | 102.555 |
|  |  | 35.265 |
|  |  | 844.876 |
|  |  | 226.110 |
|  |  | 284.175 |
All of these functions come from the `<math.h>` library.
Casts
| Operation | Argument Signature | Time per Operation (ns) |
|---|---|---|
|  |  | 0.015 |
|  |  | 0.150 |
|  |  | 3.015 |
|  |  | 1.545 |
|  |  | 1.575 |
|  |  | 0.150 |
Naming format:

- `castfi` refers to a cast from a float (f) to an int (i).
- `castid` refers to a cast from an int (i) to a double (d).
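Each cast was presumably timed with the same loop pattern as the `addi` test above. A minimal sketch of what the `castfi` case might look like (the operand name `inputFloat` is an assumption, not taken from the actual test code):

```c
/* run test on float-to-int cast (castfi), following the add-test pattern;
   adding i keeps the value changing so the cast cannot be cached away,
   and the matching baseline loop would absorb the cost of the float add */
for (register int i = 0; i < iter; i++) {
    register uint32_t tickStart = cpu_timer_now();  // start timer
    mathCommandOutputInt = (int) (inputFloat + i);  // cast a changing float to int
    totalTicks += (cpu_timer_now() - tickStart);    // record timer
}
```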
Conclusions
The numbers in the graphs are in nanoseconds, not clock cycles. However, the conversion is simple: at the current clock frequency of 666.666 MHz, 1 clock cycle equals 1.5 nanoseconds. Changing the clock frequency will not change the number of clock cycles needed to complete an operation, but it will change the number of nanoseconds per clock cycle.
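For reference, a minimal conversion helper (a sketch; the function name is illustrative, not part of the firmware):

```c
#include <stdint.h>

// Converts clock cycles to nanoseconds for a given CPU frequency.
// At 666.666 MHz: 1e9 / 666.666e6 ≈ 1.5 ns per cycle.
static inline double cycles_to_ns(uint32_t cycles, double freq_hz)
{
    return (double) cycles * 1.0e9 / freq_hz;
}
```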
Quick analysis:

- Integer division and integer and double modulo are the slowest common operations by far.
- Both float and double common operations tended to outperform their integer counterparts.
- The `sqrt` function was significantly faster than others in a similar complexity group.
- Inverse trig operations outperformed normal trig operations.
- Normal trig and inverse trig operations outperformed hyperbolic trig and inverse hyperbolic trig operations, with the exception of `tanh` and `atanh`, which performed much faster than expected.
- Ceiling and floor operations were slower than I thought, being much slower than any floating point common operations and almost as slow as `sqrt`.
- By far the slowest operation was `pow`, which is likely because it takes in two doubles and also allows negative inputs.
- The natural log was quite a bit faster than log base 10.
- `cbrt` was significantly slower than `sqrt`.
Acceleration strategies
Integer division is so slow that it is faster to cast both inputs to doubles, do double division, and then cast the result back to an integer. (A likely reason: the Cortex-A9 has no hardware integer divide instruction, so integer division falls back to a software routine, while double division runs on the hardware floating-point unit.)
```c
// Computes (arg1 / arg2) for two integers
int fastdivi(int arg1, int arg2) {
    return (int) ((double) arg1 / arg2);
}
```
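This substitution should also stay exact for 32-bit integers: both operands fit losslessly in a double's 53-bit mantissa, and the final cast truncates toward zero just like C integer division, so `fastdivi` returns the same result as `arg1 / arg2` (barring the usual `INT_MIN / -1` overflow case).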
Likewise, why bother using the slow `floor()` function when you can just cast the double to an int and then back to a double?
```c
// Truncates the decimal portion of a double (note: rounds negative numbers up)
double fastfloor(double arg) {
    return (double) (int) arg;
}
```
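Note that the casts truncate toward zero, so `fastfloor` disagrees with `floor()` for negative non-integer inputs. If negative inputs matter, a branch can patch this up; a sketch (not benchmarked here, so whether it still beats `floor()` is an open question):

```c
// Floor that is also correct for negative inputs: truncate toward zero,
// then subtract 1 if truncation rounded the value up.
double fastfloor2(double arg) {
    double t = (double) (int) arg;
    return (t > arg) ? t - 1.0 : t;
}
```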
One strange result is the slowness of the `hypot()` function, which takes in two lengths and returns the hypotenuse as if the two inputs were the legs of a right triangle. This function is abnormally slow given that the formula is `sqrt(x * x + y * y)`, which should take around 42.615 nanoseconds; instead it takes 181.5 nanoseconds on average. (The library version likely spends the extra time guarding against overflow and underflow in the intermediate squares, which the naive formula does not.)
```c
// Calculates the hypotenuse of a right triangle given the lengths of the two legs
// (note: unlike hypot(), this can overflow or underflow for extreme inputs)
double fasthypot(double arg1, double arg2) {
    return sqrt(arg1 * arg1 + arg2 * arg2);
}
```
Perhaps the common cause of these operations taking longer is overhead associated with setting extra variables. To find out, I ran some extra tests to see if I could increase the speed of certain operations.

The fast functions above outperform their standard counterparts in every case.
Conclusion
The experimental approach and results presented on this page can be used as a reference for ballparking how many of a certain operation will fit within a time window. For example, given 100 μs of total time, the `pow()` function can be called about 140 times, while about 4350 integer divisions can be performed.
Users can estimate the expected run-time of their code by counting each type of math operation and summing the per-operation times from the tables above. They can then profile their code on real hardware, using approaches similar to the one above, to validate the prediction.
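As a sketch of that bookkeeping, using only the two per-operation figures quoted above (the operation counts are invented for illustration):

```c
#include <stdio.h>

int main(void)
{
    // Per-operation times implied by the figures quoted above (assumed):
    const double T_POW_NS  = 100000.0 / 140.0;   // ~714 ns per pow()
    const double T_DIVI_NS = 100000.0 / 4350.0;  // ~23 ns per integer division

    // Hypothetical callback body: 5 pow() calls and 20 integer divisions
    double est_ns = 5.0 * T_POW_NS + 20.0 * T_DIVI_NS;
    printf("estimated %.0f ns of the 100000 ns budget\n", est_ns);
    return 0;
}
```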
If you have any other ideas for speeding up operations, create a discussion thread on the AMDC-Firmware repo here!