Clang has a new attribute (https://github.com/llvm/llvm-project/pull/124834)
to mark globals to be allocated in non-large sections. We can mark all globals
that are referenced from hardcoded assembly (which implicitly assumes they
are in non-large sections) with this attribute, to avoid running into
problems when dav1d is built with -mcmodel=medium with clang.
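
As a rough illustration of what such an annotation could look like, the sketch below wraps the attribute in a macro that expands to nothing on compilers that lack it. The spelling model("small"), the macro name ATTR_SMALL_DATA and the table itself are assumptions made for this example and are not taken from the patch; the linked PR is the authority on the exact syntax.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the attribute is spelled model("small") on x86-64.
 * Compilers that do not report support get an empty macro. */
#if defined(__has_attribute)
# if defined(__x86_64__) && __has_attribute(model)
#  define ATTR_SMALL_DATA __attribute__((model("small")))
# endif
#endif
#ifndef ATTR_SMALL_DATA
# define ATTR_SMALL_DATA
#endif

/* Hypothetical table that hand-written assembly would reference by symbol,
 * assuming it is reachable with a 32 bit RIP-relative offset. */
ATTR_SMALL_DATA const uint16_t example_asm_table[4] = { 1, 2, 4, 8 };

/* C code keeps using the table as usual; the attribute only affects
 * which section the data is placed in. */
int example_table_sum(void) {
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += example_asm_table[i];
    assert(sum > 0);
    return sum;
}
```

Keeping the attribute behind a macro means tables shared between C and assembly only need one extra token in their definition.
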
This patch adds a vectorised variant of the mv_projection calculation
and a faster initialisation of motion vectors for load_tmvs_neon.
Checkasm uplifts after this patch on some Neoverse and Cortex CPU cores
compared to the C reference compiled with GCC-13 and Clang-19:
                 GCC    Clang
AWS Graviton 4:  1.62x  1.59x
Cortex-X4:       1.45x  1.46x
Cortex-X3:       1.68x  1.69x
Cortex-X1:       1.55x  1.52x
Cortex-A720:     1.54x  1.57x
Cortex-A715:     1.47x  1.55x
Cortex-A78:      1.21x  1.18x
Cortex-A76:      1.38x  1.37x
Cortex-A72:      1.08x  1.11x
Cortex-A520:     0.97x  1.18x
Cortex-A510:     0.99x  1.14x
Cortex-A55:      1.16x  1.23x
This patch increases .text by ~660 bytes, but the new code is still
about 0.5 KiB smaller than the reference implementation.
For the bilin cases, this seems to make things marginally faster
(measured on x86_64; 7-25% faster with compiler autovectorization).
For 8tap, it doesn't make much of a difference at all.
Before:                                       GCC     Clang
mc_scaled_8tap_regular_w128_8bpc_c:      115155.5   98549.3
mc_scaled_8tap_regular_w128_8bpc_ssse3:   17936.0   18411.1
mc_scaled_bilinear_w128_8bpc_c:           40290.0   51812.9
mc_scaled_bilinear_w128_8bpc_ssse3:       18243.9   18177.0
After:
mc_scaled_8tap_regular_w128_8bpc_c:      116304.3   99453.2
mc_scaled_8tap_regular_w128_8bpc_ssse3:   18387.0   18077.3
mc_scaled_bilinear_w128_8bpc_c:           37381.4   41145.0
mc_scaled_bilinear_w128_8bpc_ssse3:       18423.8   18031.6
(Benchmarked with seed 0; the total runtime of the scaled
benchmarks is significantly affected by the random seed.)
This reduces the stack usage of these functions from around 65 KB
each, to less than 1 KB for bilin, and around 2 KB for 8tap.
With this in place, the required stack space for dav1d should
be mostly identical across configurations; on x86_64 (both with
and without assembly), it can run with 62 KB of stack, and
on arm and aarch64, it can run with 58 KB of stack.
Switch to the same cache-friendly algorithm as was done for arm64
in 2e73051c57 and for the reference C code in 8291a66e50.
Unlike the arm64 implementation, this uses a main loop in C
(very similar to the one in the main C implementation in 8291a66e50)
rather than assembly; this gives a bit more overhead on the call to
each function, but it shouldn't affect the big picture much.
Performance-wise, this doesn't make much of a difference - it makes
things a little bit faster on some cores, and a little bit slower
on others:
Before:                  Cortex A7        A8       A53       A72       A73
wiener_7tap_8bpc_neon:    269384.4  147730.7  140028.5   92662.5   92929.0
wiener_7tap_10bpc_neon:   352690.2  159970.2  169427.8  116614.9  119371.1
After:
wiener_7tap_8bpc_neon:    238328.0  157274.1  134588.6   92200.3   97619.6
wiener_7tap_10bpc_neon:   336369.3  162182.0  161954.4  125521.2  130634.0
This is mostly in line with the results on arm64 in 2e73051c57.
On arm64, there was a somewhat larger speedup for the 7tap case,
mostly attributed to unrolling the vertical filter (and the new
filter_hv function) to operate on 16 pixels at a time. On arm32,
there aren't enough registers to do that, so we can't get such gains
from unrolling. (Reducing the unrolling in the arm64 version to match
arm32 also gives performance numbers similar to the arm32 ones here.)
In the arm64 version, we also added separate 5tap versions of all
functions; that is not done for arm32 at this point.
This increases the binary size by 2 KB.
This doesn't have any immediate effect on how much stack space
dav1d requires in total, since the largest stack users on arm
currently are the 8tap_scaled functions.
This uses a separate function for combined horizontal and vertical
filtering, without needing to write the intermediate results
back to memory in between.
This mostly serves as an example for how to adjust the logic for
that case; unless we actually merge the horizontal and vertical
filtering within the _hv function, we still need space for a
7th row on the stack within that function (which means we use just
as much stack as before), but we also need one extra memcpy to
write it into the right destination.
In a build where the compiler is allowed to vectorize and inline
the wiener functions into each other, this change actually reduces
the final binary size by 4 KB, if the C version of the wiener filter
is retained.
This change makes the vectorized C code as fast as it was before
with Clang 18; on Xcode Clang 16, it's 2x slower than it was before.
Unfortunately, with GCC, this change makes the code a bit slower
again.
This increases the binary size by 9 KB, on aarch64 with Xcode Clang 16,
if the C version of the filter is retained (which it isn't
by default).
This makes the vectorized C code roughly as fast as it was before
the rewrite on GCC; on Clang it also becomes 1.3x-2.0x faster,
while still being slower than it was initially.
This reduces the stack usage of these functions (the C version)
significantly.
These C versions aren't used on architectures that already have
wiener filters implemented in assembly, but they matter both when
running with assembly disabled (e.g. for sanitizer builds) and as
an example of how to do a cache-efficient SIMD implementation.
This roughly matches how these functions are implemented in the
aarch64 assembly (although that assembly function uses a mainloop
function written in assembly, and custom calling conventions
between the functions).
With this in place, dav1d can run with around 76 KB of stack
with assembly disabled.
This increases the binary size by around 14 KB (in the case of
aarch64 with Xcode Clang 16), unless built with (the default)
-Dtrim_dsp=true. (By default, the C version of the wiener filter
gets skipped entirely.)
On 32 bit arm, the assembly wiener function implementation still
uses large buffers on the stack, but since other functions use less
stack there, dav1d can still run with 72 KB of stack.
Unfortunately, this change also makes the functions slower, depending
on how well the compiler was able to optimize the previous version.
On GCC (which didn't manage to vectorize the functions so well before),
it becomes 1.6x-2.0x slower, while it gets 2.5x-5x slower on Clang
(where it was very well vectorized before).
Most of this performance can be gained back with later changes on
top, though.
It previously used 'pixel' which is typedefed to uint8_t in files
that aren't bitdepth-templated, but those are indices and not
pixels so that was just confusing and misleading.
f->task_thread.error can be set during flushing. Not resetting it can
lead to c->task_thread.first being increased after a frame has already been
submitted post-flush. That's fine if it happens on the very first frame,
but on any subsequent frame it results in a wrong frame ordering.
Once a non-first frame is treated as the first one, its tasks can't
execute (since they depend on a truly previous frame that is now considered
to come after it), and c->task_thread.cur is increased past that frame with
no way of being reset, eventually leading to a hang.
This makes it possible to immediately detect unintended out-of-bounds
writes like the ones fixed in 72b5380757 and 1c7433a5eb.
Extend the PIXEL_RECT macro to provide a variable containing the
full, padded height of the buffer, for uses that operate on the
full buffer.
Allow overwriting past the right edge of the target output rectangle,
up to an alignment of 64 pixels, but allow no overwrite past the
bottom.
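
The overwrite policy above can be sketched with a canary-filled buffer. This is only an illustration of the idea, not the actual PIXEL_RECT macro: the GuardedRect type, the rect_* helpers, the canary value and the padding sizes are all hypothetical names and numbers chosen for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GUARD    0xA5   /* canary byte (hypothetical value) */
#define ALIGN    64     /* writes may spill right up to this alignment */
#define PAD_ROWS 4
#define PAD_COLS 64
#define MAX_W    128
#define MAX_H    32
#define STRIDE   (PAD_COLS + MAX_W + PAD_COLS)
#define ROWS     (PAD_ROWS + MAX_H + PAD_ROWS)

typedef struct {
    uint8_t buf[ROWS * STRIDE]; /* full, padded buffer */
    int w, h;                   /* target output rectangle */
} GuardedRect;

/* Fill the whole padded buffer with the canary before the call under test. */
void rect_init(GuardedRect *r, int w, int h) {
    assert(w > 0 && w <= MAX_W && h > 0 && h <= MAX_H);
    memset(r->buf, GUARD, sizeof(r->buf));
    r->w = w;
    r->h = h;
}

/* Pointer to the top-left pixel of the target rectangle. */
uint8_t *rect_origin(GuardedRect *r) {
    return r->buf + PAD_ROWS * STRIDE + PAD_COLS;
}

/* Returns 1 if every byte outside the allowed area still holds the canary.
 * Allowed: rows 0..h-1, columns 0..(w rounded up to ALIGN)-1. */
int rect_padding_intact(const GuardedRect *r) {
    const int allowed_w = (r->w + ALIGN - 1) & ~(ALIGN - 1);
    for (int y = 0; y < ROWS; y++) {
        for (int x = 0; x < STRIDE; x++) {
            const int py = y - PAD_ROWS, px = x - PAD_COLS;
            const int writable = py >= 0 && py < r->h &&
                                 px >= 0 && px < allowed_w;
            if (!writable && r->buf[y * STRIDE + x] != GUARD)
                return 0;
        }
    }
    return 1;
}
```

A stray write that happens to store the canary value itself goes unnoticed with this scheme, so a real test harness would likely want a less predictable fill pattern.
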
Switch to the same cache-friendly algorithm as was done for arm64
in c121b831e2.
This uses much less stack memory, and is much more cache friendly.
In this form, most of the individual asm functions only operate on
one single row of data at a time.
Some of the functions used to be unrolled to operate on two rows
at a time, while they now only operate on one at a time. In practice,
this is still a large performance win, as data is accessed in a
much more cache friendly manner.
This gives a 2-37% speedup, and reduces the peak amount of stack
used for these functions from 255 KB to 33 KB.
Before:               Cortex A7         A8        A53        A72        A73
sgr_3x3_8bpc_neon:     873990.7   748341.9   543410.2   383200.4   357502.9
sgr_3x3_10bpc_neon:    909728.0   732594.5   560123.6   392765.5   359377.7
sgr_5x5_8bpc_neon:     591597.9   527353.1   350347.4   263464.9   243098.8
sgr_5x5_10bpc_neon:    637958.2   529462.8   364613.3   280664.6   255164.6
sgr_mix_8bpc_neon:    1458977.4  1185423.2   884017.7   632922.5   587395.2
sgr_mix_10bpc_neon:   1532376.5  1259111.4   918729.3   658787.6   600317.0
After:
sgr_3x3_8bpc_neon:     836138.7   635556.5   530596.1   335794.6   348209.9
sgr_3x3_10bpc_neon:    850835.4   596445.0   534583.2   342713.4   349713.5
sgr_5x5_8bpc_neon:     577039.7   443916.5   341684.8   223374.0   232841.3
sgr_5x5_10bpc_neon:    600975.7   400041.3   347529.8   234759.9   239351.7
sgr_mix_8bpc_neon:    1297988.7   925739.1   830360.7   545476.1   548706.6
sgr_mix_10bpc_neon:   1340112.6   914395.7   873342.4   574815.7   554681.6
With this change in place, dav1d can run with around 72 KB of stack
on arm targets.
Not all functions have been merged in the same way as they were
for arm64 in c121b831e2, so some minor differences remain; it's
possible to incrementally optimize this, e.g. to fuse box3/5_row_v
with calc_row_ab1/2, fuse finish_filter_row1/2 with sgr_weighted_row1,
and make a version of finish_filter_row1 that produces 2 rows, as is
done for arm64.
It's also possible to rewrite the logic for calculating sgr_x_by_x
in the same way as was done for arm64 in 79db162487.
This applies the same simplifications that were done for the C
code and the x86 assembly in 4613d3a530, and the arm64 assembly
in ce80e6daf6, to the arm32 implementation.
This gives a minor speedup of a couple of percent.
Before:               Cortex A7         A8        A53        A72        A73
sgr_3x3_8bpc_neon:     926600.0   753468.3   553704.1   399379.1   369674.4
sgr_5x5_8bpc_neon:     621722.9   540412.7   357275.9   274474.3   254996.0
sgr_mix_8bpc_neon:    1529715.1  1171282.5   894982.9   659996.6   610407.2
After:
sgr_3x3_8bpc_neon:     899020.3   697278.6   541569.9   382824.3   353891.8
sgr_5x5_8bpc_neon:     602183.2   498322.9   348974.5   264833.9   243837.7
sgr_mix_8bpc_neon:    1497870.8  1182121.3   880470.9   635939.3   590909.3
After processing one block, this accidentally jumped to the loop
for processing two lines at once.
The same bug was replicated in both 32 and 64 bit versions.
This reduces the stack usage of these functions (the C version)
significantly, and gives them a 15-40% speedup (on an Apple M3,
with Xcode Clang 16).
The C version of this function does matter; even though we have
assembly implementations of it on x86 and aarch64, those only
cover the 8 and 10 bpc cases, while the C version is used as
fallback for 12 bpc.
This matches how these functions are implemented in the aarch64
assembly; operate over a window of 3 or 5 lines (of 384 pixels
each), instead of doing a full 384 x 64 block.
The individual functions for filtering a line each end up
much simpler, and closer to how this can be implemented in
assembly - but the overall business logic ends up much much
more complex.
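
The windowed approach described above can be sketched as follows. This is a simplified stand-in and not the dav1d code: it computes plain 3x3 box sums, the function names are hypothetical, and replicating edge pixels is an assumption made to keep the example short. What it shows is the core idea of keeping only a few rows of intermediates alive and rotating row pointers instead of filling a full block-sized buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { W = 384, WINDOW = 3 };

/* Horizontal 3-tap box sum of one row, with edge pixels replicated. */
void box3_row_h(int32_t *dst, const uint8_t *src, int w) {
    for (int x = 0; x < w; x++) {
        const int l = x > 0 ? x - 1 : 0;
        const int r = x < w - 1 ? x + 1 : w - 1;
        dst[x] = src[l] + src[x] + src[r];
    }
}

/* 3x3 box sums for a w x h image, keeping only WINDOW rows of
 * intermediate data on the stack. Rows outside the image are replicated. */
void box3_windowed(int32_t *dst, const uint8_t *src, int w, int h) {
    int32_t rows[WINDOW][W];
    int32_t *ring[WINDOW] = { rows[0], rows[1], rows[2] };
    assert(w > 0 && w <= W && h > 0);

    /* Prime the window with row -1 (a copy of row 0) and row 0. */
    box3_row_h(ring[0], src, w);
    memcpy(ring[1], ring[0], sizeof(int32_t) * w);

    for (int y = 0; y < h; y++) {
        /* Bring in row y+1, or replicate the last row at the bottom. */
        if (y + 1 < h)
            box3_row_h(ring[2], src + (y + 1) * w, w);
        else
            memcpy(ring[2], ring[1], sizeof(int32_t) * w);

        /* Vertical pass over the three rows currently in the window. */
        for (int x = 0; x < w; x++)
            dst[y * w + x] = ring[0][x] + ring[1][x] + ring[2][x];

        /* Rotate the pointers; the oldest row buffer gets reused next. */
        int32_t *oldest = ring[0];
        ring[0] = ring[1];
        ring[1] = ring[2];
        ring[2] = oldest;
    }
}

/* Straightforward reference: clamp coordinates, sum the 3x3 neighbourhood. */
int32_t box3_at(const uint8_t *src, int w, int h, int x, int y) {
    int32_t sum = 0;
    for (int dy = -1; dy <= 1; dy++) {
        for (int dx = -1; dx <= 1; dx++) {
            int yy = y + dy, xx = x + dx;
            yy = yy < 0 ? 0 : yy >= h ? h - 1 : yy;
            xx = xx < 0 ? 0 : xx >= w ? w - 1 : xx;
            sum += src[yy * w + xx];
        }
    }
    return sum;
}
```

The per-row helper stays trivial, while the loop that primes, fills and rotates the window carries the bookkeeping, which mirrors the trade-off described above.
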
The main difference from the aarch64 assembly implementation
is that any buffer which is of int16_t size in the aarch64
assembly implementation uses the type "coef" here, which
is 32 bit in the 10/12 bpc cases. (This is required for handling
the 12 bpc cases.)
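
One way to see why 16 bit storage can run out at 12 bpc is to look at the largest value a box sum can take. The sketch below assumes the buffers in question hold sums of up to 25 pixels (a 5x5 box); that is an assumption made for illustration, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Largest possible sum of n pixels at the given bit depth. */
int32_t max_box_sum(int bitdepth, int n) {
    assert(bitdepth == 8 || bitdepth == 10 || bitdepth == 12);
    assert(n > 0);
    return (int32_t)n * ((1 << bitdepth) - 1);
}

/* Returns 1 if the value can be stored in an int16_t without overflow. */
int fits_int16(int32_t v) {
    return v >= INT16_MIN && v <= INT16_MAX;
}
```

Under that assumption, the worst case is 25 * 255 = 6375 at 8 bpc and 25 * 1023 = 25575 at 10 bpc, both below INT16_MAX (32767), while 25 * 4095 = 102375 at 12 bpc is not.
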
With this in place, dav1d can run with around 66 KB of stack
on x86_64 with assembly enabled, with around 74 KB of stack on
aarch64 with assembly enabled, and with 118 KB of stack with
assembly disabled.
This increases the binary size by around 14 KB (in the case of
aarch64 with Xcode Clang 16).
On 32 bit arm, dav1d still requires around 270 KB of stack, as
that assembly implementation of the SGR filter uses a different
algorithm.
As the machine specific init file is included in the common
template, give symbols and defines unique names that won't
clash with similar ones in the main template.
When renumbering argument registers in 1648c232ee, this one
register reference was missed.
The missed register was meant to compare h with 2, but accidentally
ended up comparing bitdepth_max to 2. In the case of 8 bpc, there's
actually no bitdepth_max parameter, so it ended up comparing an
uninitialized value.
For individual tests in dav1d-test-data, the default timeout
is 30 seconds (which is the Meson default if nothing is
specified). Previously it ran with a multiplier of 4, resulting
in a total timeout of 120 seconds.
When running tests in QEMU, exceeding this 120 second timeout
could happen occasionally. Raise the multiplier to 10, allowing
each individual job to run for up to 5 minutes.
This should hopefully reduce the number of stray failures in the
CI.
For tests that already have a higher default timeout set, such
as checkasm, which has a 180 second default timeout, this results
in a much longer timeout period. However, as long as we don't
frequently see issues where these actually hang, it should be
beneficial to just let them run to completion, rather than
aborting early due to a tight timeout.
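
The timeout figures above follow from simple multiplication. The sketch below assumes the multiplier scales each test's own timeout linearly, as the numbers in the text imply; effective_timeout is a hypothetical helper written for this example and not part of Meson.

```c
#include <assert.h>

/* Effective timeout in seconds: the per-test timeout times the multiplier. */
int effective_timeout(int base_seconds, int multiplier) {
    assert(base_seconds > 0 && multiplier > 0);
    return base_seconds * multiplier;
}
```

With the 30 second default, a multiplier of 4 gives 120 seconds and a multiplier of 10 gives 300 seconds (5 minutes); for checkasm's 180 second timeout, a multiplier of 10 gives 1800 seconds (30 minutes).
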
This applies the same simplifications that were done for the C
code and the x86 assembly in 4613d3a530, to the arm64 implementation.
This gives a minor speedup of a couple of percent.
Before:             Cortex A53        A55        A72        A73        A76   Apple M3
sgr_3x3_8bpc_neon:    368583.2   363654.2   279958.1   272065.1   169353.3      354.6
sgr_5x5_8bpc_neon:    258570.7   255018.5   200410.6   199478.3   117968.3      260.9
sgr_mix_8bpc_neon:    603698.1   577383.3   482468.3   436540.4   256632.9      541.8
After:
sgr_3x3_8bpc_neon:    367873.2   357884.1   275462.4   268363.9   165909.8      346.0
sgr_5x5_8bpc_neon:    254988.4   248184.2   190875.1   196939.1   120517.2      252.1
sgr_mix_8bpc_neon:    589204.7   563565.8   414025.6   427702.2   251651.2      533.4