100 Commits
Author SHA1 Message Date
Martin Storsjö and Jean-Baptiste Kempf 720adf9b5b ci: Add -Dtrim_dsp=false in a couple of aarch64/arm configurations
For the "release" build configurations, trim_dsp defaults to true,
while it defaults to false for "debugoptimized". This means that
the configurations with release mode, without -Dtrim_dsp=false,
didn't actually run checkasm before.

In practice, checkasm is covered by later, full-test configurations,
but this ensures that we do test it at this stage as well, as
intended.
2026-06-07 22:36:10 +02:00
Martin Storsjö c85856e360 aarch64: Fix a name mismatch in a macro error message
For the 64 bit assembly, the macro is just named "sub_sp", while it
was named "sub_sp_align" in the 32 bit form.
2026-05-15 14:24:57 +03:00
Martin Storsjö 037430193a arm: Fix up code style slightly
The existing code has been written striving to align columns so
that the largest register names can be typed, e.g. r10 on ARM
(and similarly for x10 or q10 on AArch64), or v31.16b for AArch64
vectors.

Fix some cases, where the current forms were clearly
inconsistent/wrong. Not all cases have been fixed up to match this
norm, but some individual ones that were clearly wrong have been
fixed.
2026-05-06 15:32:26 +00:00
Martin Storsjö 7b9ab8373e ci: Update the main CI image
This version includes llvm-symbolizer, which should improve
backtraces in sanitizer builds with Clang.
2026-05-06 15:11:16 +00:00
Martin Storsjö ac5dfb0a85 examples: Treat SDL2 headers as system headers
This makes those headers included with -isystem rather than -I,
which makes the compiler skip producing any warnings about them
(as they're expected to be out of the user code's control).

This avoids warnings with newer versions of the
dav1d-debian-unstable CI image, warnings (treated as errors in CI)
like this:

    In file included from /usr/include/SDL2/SDL_config.h:51,
                     from /usr/include/SDL2/SDL_stdinc.h:33,
                     from /usr/include/SDL2/SDL_main.h:25,
                     from /usr/include/SDL2/SDL.h:31,
                     from ../examples/dav1dplay.c:33:
    /usr/include/SDL2/SDL_config_unix.h:186:9: error: 'HAVE_GETAUXVAL' redefined [-Werror]
      186 | #define HAVE_GETAUXVAL 1
          |         ^~~~~~~~~~~~~~
    In file included from ../examples/dav1dplay.c:27:
    ./config.h:66:9: note: this is the location of the previous definition
       66 | #define HAVE_GETAUXVAL 0
          |         ^~~~~~~~~~~~~~

Recently, Debian Unstable has switched from providing the
actual SDL 2 to providing the SDL 2 API through the sdl2-compat
package on top of SDL 3.

The SDL 2 headers expose their full config.h as part of their
installed headers (that the user code ends up including). This
includes unnamespaced defines, such as "#define HAVE_GETAUXVAL 1".

This issue hasn't shown up with the original SDL 2 package in
Debian, due to a Debian packaging detail. While most SDL 2
headers are installed in /usr/include/SDL2 (and user code
includes it as <SDL.h>, requiring the build system to include
/usr/include/SDL2), the Debian packaging has replaced
/usr/include/SDL2/SDL_config.h with a header that includes
<SDL2/_real_SDL_config.h>, which then gets resolved in
/usr/include/x86_64-linux-gnu/SDL2. Due to this being included
from a compiler default system include path
(/usr/include/x86_64-linux-gnu), no warnings about the header
were printed, even though that one also produced the same kind
of conflicting redefinitions. (We could also avoid the same issue
by attempting to include <SDL2/SDL.h> instead of <SDL.h>,
avoiding the use of the build system provided include directory,
resolving that from /usr/include, and having the compiler consider
it a system header.)

The sdl2-compat package in Debian doesn't redirect that header
in the same way, but includes SDL_config_unix.h in the same
directory in /usr/include/SDL2. Due to this being included
from a user specified -I (as long as it is included as <SDL.h>,
not <SDL2/SDL.h>), it's considered a user header, and warnings
are printed for it.

It seems like SDL 3 no longer exposes their config.h headers as
part of the installed headers.

The conflict between SDL 2's config.h's HAVE_GETAUXVAL and
ours stems from the fact that we only try to detect GETAUXVAL
on architectures where we want to use it (arm/aarch64, loongarch,
ppc or riscv). On x86, where we don't need it, we don't try
to detect it, and set "#define HAVE_GETAUXVAL 0" in our
config.h.

To avoid warnings due to the conflict, we can declare the
SDL 2 dependency with the argument "include_type: 'system'",
which should silence any warnings in the SDL headers. This
Meson feature is available since Meson 0.52.0 (and we currently
require Meson 0.54.0).

An alternative way to avoid the redefinition conflict would be
to always try to detect getauxval on all architectures, to make
our config.h agree with SDL 2's config headers.

A third (and much more hacky) way around the conflict would be
to avoid the public SDL headers including the SDL_config header
by defining "SDL_config_h_" before including SDL.h. Doing this
also requires manually including a couple more standard headers
before SDL.h (stdint.h, stdio.h, stddef.h).
2026-05-06 14:00:31 +03:00
Martin Storsjö 556c5202b4 ci: Add testing on macOS on Apple Silicon too 2026-05-03 22:39:01 +03:00
Martin Storsjö e1bd6f76c2 checkasm: Readd a dependency on threads
3a2a874994, which switched to using
the checkasm core from the separate checkasm project, removed the
thread dependency from the checkasm executable, as the checkasm
library itself has a thread dependency.

However, checkasm doesn't always include that thread dependency;
it only does that when pthread_setaffinity_np is detected.

The dav1d object files themselves use pthreads as well, causing
undefined symbols if checkasm doesn't link in pthreads.

This should fix linking on OpenBSD after
3a2a874994, fixing issue #467.
2026-05-03 12:42:39 +03:00
Martin Storsjö 5cfc383268 arm: mc: Optimize prep_neon for the w4/w8 cases
Use alternating registers for immediately sequential loads/stores,
pack two 4 pixel rows into one register.

Before:                           Cortex A7      A8     A53     A55     A72     A73     A76
mct_8tap_regular_w4_0_8bpc_neon:      112.0    68.6    79.7    82.9    45.3    39.4    24.4
mct_8tap_regular_w8_0_8bpc_neon:      158.2    89.5   108.4   113.4    55.4    53.0    30.0
After:
mct_8tap_regular_w4_0_8bpc_neon:       89.7    69.9    76.3    85.1    36.2    35.2    25.0
mct_8tap_regular_w8_0_8bpc_neon:      149.0    92.7   102.6   115.8    56.6    52.8    31.4

The numbers aren't entirely consistent, but this is mostly favourable.
2026-04-29 15:56:01 +03:00
Martin Storsjö 727d0f984b arm: mc: Fix a comment typo
This seems to be right in all the other similar places
(arm/64/mc.S, arm/32/mc16.S and arm/64/mc16.S).
2026-04-29 15:56:01 +03:00
Martin Storsjö aa4504729c arm: Fix a typo in a URL
This was added in a00289b6d8.
2026-03-31 13:48:43 +03:00
Martin Storsjö 594d1601ff arm: Add Armv9.3-A GCS (Guarded Control Stack) support
Signal that our assembly is compliant with the GCS feature, if
the GCS feature is enabled in the compiler (available since Clang
18 and GCC 15) - this is enabled by -mbranch-protection=standard
with a new enough compiler.

GCS doesn't require any specific modifications to the assembly
code, but requires that all functions return to the expected call
address (checked through a shadow stack).
2026-03-17 20:40:05 +00:00
Martin Storsjö 4fd22e97d8 arm: Switch to a more correct Windows flag for detecting I8MM
Newer revisions of WinSDK 10.0.26100.0 have exposed more flags for
IsProcessorFeaturePresent; now there is a separate one for
detecting specifically I8MM and not just SVE-I8MM. Switch to using
this flag instead.
2026-03-04 15:16:37 +02:00
Martin Storsjö de4ce4f32d arm: mc: Add missing # for some immediate constants, for consistency
The assembler doesn't require the # here, but we use that everywhere
else, so add it here as well for consistency.
2026-02-06 16:01:57 +02:00
Martin Storsjö 9c13b5fbd0 subprojects: Update checkasm to v1.1.0
This version, together with the previous commit
574e7f4727, fixes issue #460.

Due to checkasm internal restructuring, one may run into build
issues if rebuilding in an old build directory after updating
the checkasm subproject, without getting rid of older meson
generated headers in the build directory.
2026-02-03 09:27:22 +00:00
Martin Storsjö a44b589872 Silence a new MSVC warning
This silences the following warnings in MSVC 2026 18.0 (and
2022 17.14):

    ../tools/dav1d_cli_parse.c(213): warning C5287: operands are different enum types 'CpuFlags' and 'CpuMask'; use an explicit cast to silence this warning
    ../tools/dav1d_cli_parse.c(214): warning C5287: operands are different enum types 'CpuFlags' and 'CpuMask'; use an explicit cast to silence this warning
    ../tools/dav1d_cli_parse.c(215): warning C5287: operands are different enum types 'CpuFlags' and 'CpuMask'; use an explicit cast to silence this warning
    ../tools/dav1d_cli_parse.c(216): warning C5287: operands are different enum types 'CpuFlags' and 'CpuMask'; use an explicit cast to silence this warning

This warning flag was new in MSVC 2022 17.14, but it was buggy
in that version - it produced spurious warnings for other cases
as well (and using an explicit cast to silence it didn't work
as advertised), see [1] and [2].

The bugs were fixed in 18.0, and the remaining construct that it
warns about is something that is somewhat reasonable to warn about:

    enum CpuFlags {
        DAV1D_X86_CPU_FLAG_SSE2        = 1 << 0,
        DAV1D_X86_CPU_FLAG_SSSE3       = 1 << 1,
    };
    enum CpuMask {
        X86_CPU_MASK_SSE2      = DAV1D_X86_CPU_FLAG_SSE2,
        X86_CPU_MASK_SSSE3     = DAV1D_X86_CPU_FLAG_SSSE3     | X86_CPU_MASK_SSE2,
    };

Instead of adding explicit casts on the constants from the foreign
enum, just disable this warning.

[1] https://developercommunity.visualstudio.com/t/False-positive-C5287:-operands-are-diff/10915265
[2] https://developercommunity.visualstudio.com/t/warning-C5287:-operands-are-different-e/10877942
2026-01-20 12:00:29 +02:00
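As a rough illustration of the alternative that a44b589872 decided
against, the sketch below casts the constants from the foreign enum
explicitly, as the warning text suggests. The SKETCH_* names are
hypothetical and are not dav1d identifiers, and whether such casts
silence C5287 in a given MSVC version is an assumption here.

```c
/* Hypothetical names; only the pattern mirrors the enums quoted in
 * the commit message above. */
enum SketchFlags {
    SKETCH_FLAG_SSE2  = 1 << 0,
    SKETCH_FLAG_SSSE3 = 1 << 1,
};

enum SketchMask {
    /* Casting the enumerators of the other enum to int makes the
     * mixing of the two enum types explicit. */
    SKETCH_MASK_SSE2  = (int) SKETCH_FLAG_SSE2,
    /* Each mask also contains all the implied lower level flags. */
    SKETCH_MASK_SSSE3 = (int) SKETCH_FLAG_SSSE3 | SKETCH_MASK_SSE2,
};
```

The values are unchanged by the casts; the cost is extra noise on
every line of the mask enum, which is presumably why disabling the
warning was preferred.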
Martin Storsjö 04b69f93e5 checkasm: Reinstate check for TRIM_DSP_FUNCTIONS
This was lost in 3a2a874994.

Without this, checkasm ends up printing a quite confusing output
consisting only of the functions that have two or more assembly
implementations, if trim_dsp happens to be enabled.
2026-01-14 22:29:38 +02:00
Martin Storsjö afd13d8906 arm: Fix a few misindented lines 2026-01-09 14:00:39 +02:00
Martin Storsjö 574e7f4727 checkasm: Pass HAVE_C11_GENERIC to checkasm as -DCHECKASM_HAVE_GENERIC=1/0
For this to have an effect, it requires using a newer version of
the wrapped checkasm subproject; including checkasm commit
be05a7972e47c658a7c5c186294d27caa5735db2 or newer.
2026-01-07 16:22:18 +02:00
Martin Storsjö b2f9c10670 checkasm: Fix building with MSVC
The glue code in our headers, for integrating with the external
checkasm, was incompatible with MSVC.

MSVC has a nonstandard handling of __VA_ARGS__ with macros; when
one macro invokes another macro, __VA_ARGS__ gets treated as one
single parameter and can't map to more than one parameter in the
invoked macro. (In other words, when calling another macro,
__VA_ARGS__ must map in its entirety to a ... parameter of the
other macro.)

Modern versions of MSVC do implement the correct mode as well,
but default to the old one for backwards compatibility. To
choose the new mode, we'd have to build our code with
-Zc:preprocessor. That's certainly doable, but it's fairly easy to
avoid the issue as well.

To avoid this issue, change the variadic PIXEL_RECT(...) to explicitly
name its arguments. There's actually no variability in the arguments
involved here. (Alternatively, we could force the preprocessor to expand
the arguments one extra time, avoiding the issue, with e.g.
"#define EXPAND(x) x" and wrapping PIXEL_RECT with it, e.g.
"#define PIXEL_RECT(...) EXPAND(BUF_RECT(pixel, __VA_ARGS__))".)

See [1], [2] and [3] for more discussion on the matter.

[1] https://stackoverflow.com/a/5134656/3115956
[2] https://stackoverflow.com/a/7459803/3115956
[3] https://learn.microsoft.com/en-us/cpp/preprocessor/preprocessor-experimental-overview?view=msvc-160
2026-01-07 11:57:09 +02:00
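The EXPAND() alternative mentioned in b2f9c10670 can be sketched in
isolation as below. The SKETCH_* macros are hypothetical stand-ins
(SKETCH_SUM3 plays the role of a fixed-arity macro like BUF_RECT);
this is not the glue code from the dav1d headers.

```c
/* Hypothetical fixed-arity macro, standing in for BUF_RECT. */
#define SKETCH_SUM3(a, b, c) ((a) + (b) + (c))

/* Forces one extra expansion pass over its argument. */
#define SKETCH_EXPAND(x) x

/* Variadic wrapper that prepends a fixed first argument. With the
 * traditional MSVC preprocessor, __VA_ARGS__ would otherwise be
 * passed on as one single argument; wrapping the inner call in
 * SKETCH_EXPAND() is the workaround described above. */
#define SKETCH_SUM_FROM_100(...) \
    SKETCH_EXPAND(SKETCH_SUM3(100, __VA_ARGS__))
```

On a conforming preprocessor, SKETCH_SUM_FROM_100(20, 3) expands to
((100) + (20) + (3)) with or without the wrapper, so the extra
expansion is harmless there.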
Martin Storsjö e7c280e4cd x86: Sync the latest upstream version of x86inc.asm 2025-11-12 23:19:22 +02:00
Martin Storsjö 2eac05d648 checkasm: arm: Use X() instead of inline ifdefs
This works fine when the referenced symbol has the same prefix
as PRIVATE_PREFIX in the same file; otherwise we could also
create a macro like X() that only prepends the extern symbol
prefix but no symbol namespace prefix.
2025-11-12 15:54:40 +02:00
Martin Storsjö b129d9f2cb mc: Reduce stack use in {put,prep}_scaled_{bilin,8tap}
For the bilin cases, this seems to make things marginally faster
(measured on x86_64; 7-25% faster with compiler autovectorization).
For 8tap, it doesn't make much of a difference at all.

Before:                                      GCC   Clang
mc_scaled_8tap_regular_w128_8bpc_c:     115155.5   98549.3
mc_scaled_8tap_regular_w128_8bpc_ssse3:  17936.0   18411.1
mc_scaled_bilinear_w128_8bpc_c:          40290.0   51812.9
mc_scaled_bilinear_w128_8bpc_ssse3:      18243.9   18177.0
After:
mc_scaled_8tap_regular_w128_8bpc_c:     116304.3   99453.2
mc_scaled_8tap_regular_w128_8bpc_ssse3:  18387.0   18077.3
mc_scaled_bilinear_w128_8bpc_c:          37381.4   41145.0
mc_scaled_bilinear_w128_8bpc_ssse3:      18423.8   18031.6

(Benchmarked with the seed 0; the total runtime for the scaled
benchmarks is significantly affected by the random seed.)

This reduces the stack usage of these functions from around 65 KB
each, to less than 1 KB for bilin, and around 2 KB for 8tap.

With this in place, the required stack space for dav1d should
be mostly identical across configurations; on x86_64 (both with
and without assembly), it can run with 62 KB of stack, and
on arm and aarch64, it can run with 58 KB of stack.
2025-01-02 15:30:21 +00:00
Martin Storsjö 2ba57aa535 arm32: looprestoration: Rewrite the wiener functions
Switch to the same cache-friendly algorithm as was done for arm64
in 2e73051c57 and for the reference
C code in 8291a66e50.

Contrary to the arm64 implementation, this uses a main loop in C
(very similar to the one in the main C implementation in
8291a66e50) rather than assembly;
this gives a bit more overhead on the call to each function, but
it shouldn't affect the big picture much.

Performance-wise, this doesn't make much of a difference - it makes
things a little bit faster on some cores, and a little bit slower
on others:

Before:                 Cortex A7        A8       A53       A72       A73
wiener_7tap_8bpc_neon:   269384.4  147730.7  140028.5   92662.5   92929.0
wiener_7tap_10bpc_neon:  352690.2  159970.2  169427.8  116614.9  119371.1
After:
wiener_7tap_8bpc_neon:   238328.0  157274.1  134588.6   92200.3   97619.6
wiener_7tap_10bpc_neon:  336369.3  162182.0  161954.4  125521.2  130634.0

This is mostly in line with the results on arm64 in
2e73051c57. On arm64, there was a
bit larger speedup for the 7tap case, mostly attributed to
unrolling the vertical filter (and the new filter_hv function) to
operate on 16 pixels at a time. On arm32, there aren't enough
registers to do that, so we can't get such gains from unrolling.
(Reducing the unrolling on the arm64 version to match the case
on arm32 also shows similar performance numbers as on arm32 here.)

In the arm64 version, we also added separate 5tap versions of all
functions; not doing that for arm32 at this point.

This increases the binary size by 2 KB.

This doesn't have any immediate effect on how much stack space
dav1d requires in total, since the largest stack users on arm
currently are the 8tap_scaled functions.
2024-12-20 14:32:32 +02:00
Martin Storsjö 8291a66e50 looprestoration: Use only 6 row buffer for wiener, like NEON/x86
This uses a separate function for combined horizontal and vertical
filtering, without needing to write the intermediate results
back to memory in between.

This mostly serves as an example for how to adjust the logic for
that case; unless we actually merge the horizontal and vertical
filtering within the _hv function, we still need space for a
7th row on the stack within that function (which means we use just
as much stack as before), but we also need one extra memcpy to
write it into the right destination.

In a build where the compiler is allowed to vectorize and inline
the wiener functions into each other, this change actually reduces
the final binary size by 4 KB, if the C version of the wiener filter
is retained.

This change makes the vectorized C code as fast as it was before
with Clang 18; on Xcode Clang 16, it's 2x slower than it was before.

Unfortunately, with GCC, this change makes the code a bit slower
again.
2024-12-19 14:19:19 +02:00
Martin Storsjö a149f5c3c0 looprestoration: Make the C wiener h filter more optimizable for the compiler
This increases the binary size by 9 KB, on aarch64 with Xcode Clang 16,
if the C version of the filter is retained (which it isn't
by default).

This makes the vectorized C code roughly as fast as it was before
the rewrite on GCC; on Clang it also becomes 1.3x-2.0x faster,
while still being slower than it was initially.
2024-12-19 14:19:19 +02:00
Martin Storsjö 9da303e989 looprestoration: Rewrite the C version of the wiener filter
This reduces the stack usage of these functions (the C version)
significantly.

These C versions aren't used on architectures that already have
wiener filters implemented in assembly, but they matter both if
running with assembly disabled (e.g. for sanitizer builds), and
as an example of how to do a cache efficient SIMD
implementation.

This roughly matches how these functions are implemented in the
aarch64 assembly (although that assembly function uses a mainloop
function written in assembly, and custom calling conventions
between the functions).

With this in place, dav1d can run with around 76 KB of stack
with assembly disabled.

This increases the binary size by around 14 KB (in the case of
aarch64 with Xcode Clang 16), unless built with (the default)
-Dtrim_dsp=true. (By default, the C version of the wiener filter
gets skipped entirely.)

On 32 bit arm, the assembly wiener function implementation still
uses large buffers on the stack though, but due to other functions
using less stack there, dav1d can still run with 72 KB of stack
there.

Unfortunately, this change also makes the functions slower, depending
on how well the compiler was able to optimize the previous version.
On GCC (which didn't manage to vectorize the functions so well before),
it becomes 1.6x-2.0x slower, while it gets 2.5x-5x slower on Clang
(where it was very well vectorized before).

Most of this performance can be gained back with later changes on
top, though.
2024-12-19 14:19:13 +02:00
Martin Storsjö and Jean-Baptiste Kempf f8d2620d82 checkasm: looprestoration: Do strict bounds checking of the output
This allows immediately detecting unintended writes out of
bounds like the ones fixed in
72b5380757 and
1c7433a5eb.

Extend the PIXEL_RECT macro to provide a variable containing the
full, padded height of the buffer, for uses that operate on the
full buffer.

Allow overwriting past the right edge of the target output rectangle,
up to an alignment of 64 pixels, but allow no overwrite past the
bottom.
2024-11-21 09:05:33 +00:00
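The overwrite policy in f8d2620d82 can be sketched as a canary check
over a padded buffer. This is a minimal sketch and not the checkasm
implementation: the helper names, the canary value, the 8 bpc pixel
type and the reading of "an alignment of 64 pixels" as "the width
rounded up to a multiple of 64" are all assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { SKETCH_CANARY = 0xA5 }; /* hypothetical fill value */

/* Round w up to a multiple of align (align is a power of two). */
static int sketch_align_up(int w, int align) {
    return (w + align - 1) & ~(align - 1);
}

/* Fill the whole padded buffer before calling the tested function. */
static void sketch_rect_poison(uint8_t *buf, int stride, int rows) {
    memset(buf, SKETCH_CANARY, (size_t) stride * rows);
}

/* The target rectangle covers rows [0, h) and columns [0, w).
 * Writes are tolerated to the right of it, up to the width rounded
 * up to 64 pixels, but not in any row at or below h.
 * Returns 1 if every pixel outside the tolerated area is intact. */
static int sketch_rect_check(const uint8_t *buf, int stride,
                             int rows, int w, int h) {
    const int allowed_w = sketch_align_up(w, 64);
    for (int y = 0; y < rows; y++) {
        for (int x = 0; x < stride; x++) {
            const int allowed = y < h && x < allowed_w;
            if (!allowed && buf[y * stride + x] != SKETCH_CANARY)
                return 0;
        }
    }
    return 1;
}
```

A check like this cannot notice a stray write that happens to store
the canary value itself, which is an inherent limit of the approach.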
Martin Storsjö 30c3dd8edd arm32: looprestoration: Rewrite the SGR functions
Switch to the same cache-friendly algorithm as was done for arm64
in c121b831e2.

This uses much less stack memory, and is much more cache friendly.
In this form, most of the individual asm functions only operate on
one single row of data at a time.

Some of the functions used to be unrolled to operate on two rows
at a time, while they now only operate on one at a time. In practice,
this is still a large performance win, as data is accessed in a
much more cache friendly manner.

This gives a 2-37% speedup, and reduces the peak amount of stack
used for these functions from 255 KB to 33 KB.

Before:              Cortex A7         A8        A53        A72        A73
sgr_3x3_8bpc_neon:    873990.7   748341.9   543410.2   383200.4   357502.9
sgr_3x3_10bpc_neon:   909728.0   732594.5   560123.6   392765.5   359377.7
sgr_5x5_8bpc_neon:    591597.9   527353.1   350347.4   263464.9   243098.8
sgr_5x5_10bpc_neon:   637958.2   529462.8   364613.3   280664.6   255164.6
sgr_mix_8bpc_neon:   1458977.4  1185423.2   884017.7   632922.5   587395.2
sgr_mix_10bpc_neon:  1532376.5  1259111.4   918729.3   658787.6   600317.0
After:
sgr_3x3_8bpc_neon:    836138.7   635556.5   530596.1   335794.6   348209.9
sgr_3x3_10bpc_neon:   850835.4   596445.0   534583.2   342713.4   349713.5
sgr_5x5_8bpc_neon:    577039.7   443916.5   341684.8   223374.0   232841.3
sgr_5x5_10bpc_neon:   600975.7   400041.3   347529.8   234759.9   239351.7
sgr_mix_8bpc_neon:   1297988.7   925739.1   830360.7   545476.1   548706.6
sgr_mix_10bpc_neon:  1340112.6   914395.7   873342.4   574815.7   554681.6

With this change in place, dav1d can run with around 72 KB of stack
on arm targets.

Not all functions have been merged in the same way as they were
for arm64 in c121b831e2, so some
minor differences remain; it's possible to incrementally optimize
this, e.g. to fuse box3/5_row_v with calc_row_ab1/2, fuse
finish_filter_row1/2 with sgr_weighted_row1, and make a version of
finish_filter_row1 that produces 2 rows, like is done for arm64.

It's also possible to rewrite the logic for calculating sgr_x_by_x
in the same way as was done for arm64 in
79db162487.
2024-11-19 11:58:25 +02:00
Martin Storsjö 1b7f126361 arm32: looprestoration: Apply simplifications to align with C code
This applies the same simplifications that were done for the C
code and the x86 assembly in 4613d3a530,
and the arm64 assembly in ce80e6daf6,
to the arm32 implementation.

This gives a minor speedup of around a couple percent.

Before:             Cortex A7         A8        A53        A72        A73
sgr_3x3_8bpc_neon:   926600.0   753468.3   553704.1   399379.1   369674.4
sgr_5x5_8bpc_neon:   621722.9   540412.7   357275.9   274474.3   254996.0
sgr_mix_8bpc_neon:  1529715.1  1171282.5   894982.9   659996.6   610407.2
After:
sgr_3x3_8bpc_neon:   899020.3   697278.6   541569.9   382824.3   353891.8
sgr_5x5_8bpc_neon:   602183.2   498322.9   348974.5   264833.9   243837.7
sgr_mix_8bpc_neon:  1497870.8  1182121.3   880470.9   635939.3   590909.3
2024-11-18 16:08:00 +02:00
Martin Storsjö c43debf1b1 arm64: looprestoration: Fix a comment typo 2024-11-18 16:07:40 +02:00
Martin Storsjö 1c7433a5eb arm: looprestoration: Fix the single line loop in sgr_weighted2
After processing one block, this accidentally jumped to the loop
for processing two lines at once.

The same bug was replicated in both 32 and 64 bit versions.
2024-11-18 16:07:40 +02:00
Martin Storsjö f32b314616 looprestoration: Rewrite the C version of the SGR filter
This reduces the stack usage of these functions (the C version)
significantly, and gives them a 15-40% speedup (on an Apple M3,
with Xcode Clang 16).

The C version of this function does matter; even though we have
assembly implementations of it on x86 and aarch64, those only
cover the 8 and 10 bpc cases, while the C version is used as
fallback for 12 bpc.

This matches how these functions are implemented in the aarch64
assembly; operate over a window of 3 or 5 lines (of 384 pixels
each), instead of doing a full 384 x 64 block.

The individual functions for filtering a line each end up
much simpler, and closer to how this can be implemented in
assembly - but the overall business logic ends up much much
more complex.

The main difference to the aarch64 assembly implementation,
is that any buffer which is of int16_t size in the aarch64
assembly implementation, uses the type "coef" here, which
is 32 bit in the 10/12 bpc cases. (This is required for handling
the 12 bpc cases.)

With this in place, dav1d can run with around 66 KB of stack
on x86_64 with assembly enabled, with around 74 KB of stack on
aarch64 with assembly enabled, and with 118 KB of stack with
assembly disabled.

This increases the binary size by around 14 KB (in the case of
aarch64 with Xcode Clang 16).

On 32 bit arm, dav1d still requires around 270 KB of stack, as
that assembly implementation of the SGR filter uses a different
algorithm.
2024-11-18 15:57:19 +02:00
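The idea in f32b314616 of operating over a small window of lines,
instead of a full block of intermediate data, can be sketched with a
plain 3x3 box sum. This is not the SGR filter itself: the function
names, the tiny row width and the edge clamping are assumptions made
to keep the example short.

```c
#include <stdint.h>

enum { SKETCH_W = 8 }; /* hypothetical row width; the source uses 384 */

/* Horizontal 3-tap sum of one row, clamping at the left/right edge. */
static void sketch_sum3_row_h(int32_t *dst, const uint8_t *src, int w) {
    for (int x = 0; x < w; x++) {
        const int l = x > 0 ? x - 1 : 0;
        const int r = x < w - 1 ? x + 1 : w - 1;
        dst[x] = src[l] + src[x] + src[r];
    }
}

/* 3x3 box sum of a w x h image (stride w, 1 <= w <= SKETCH_W), which
 * keeps only three rows of intermediate sums alive at any time. */
static void sketch_box3_windowed(int32_t *dst, const uint8_t *src,
                                 int w, int h) {
    int32_t bufs[3][SKETCH_W];
    int32_t *rows[3] = { bufs[0], bufs[1], bufs[2] };
    if (w < 1 || w > SKETCH_W || h < 1)
        return;

    /* Prime the window; the row above the top is clamped to row 0. */
    sketch_sum3_row_h(rows[0], src, w);
    sketch_sum3_row_h(rows[1], src, w);
    for (int y = 0; y < h; y++) {
        /* Produce the newest row; clamp below the bottom edge. */
        const int below = y + 1 < h ? y + 1 : h - 1;
        sketch_sum3_row_h(rows[2], src + below * w, w);
        for (int x = 0; x < w; x++)
            dst[y * w + x] = rows[0][x] + rows[1][x] + rows[2][x];
        /* Rotate the row pointers instead of copying any data. */
        int32_t *const oldest = rows[0];
        rows[0] = rows[1];
        rows[1] = rows[2];
        rows[2] = oldest;
    }
}
```

The intermediate storage is 3 rows regardless of the image height,
which is where the stack savings come from; the price is the extra
bookkeeping for priming and rotating the window.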
Martin Storsjö 01d417c2fa arm: looprestoration: Give symbols and defines unique names
As the machine specific init file is included in the common
template, give symbols and defines unique names that won't
clash with similar ones in the main template.
2024-11-18 15:39:28 +02:00
Martin Storsjö 847eece170 arm: looprestoration: Add spacing around operators 2024-11-18 15:39:28 +02:00
Martin Storsjö 56a55933b3 arm: looprestoration: Get rid of unnecessary rotate_ab_N intermediate functions 2024-11-18 15:39:28 +02:00
Martin Storsjö 9db59d8904 arm: looprestoration: Apply 'const' more consistently on parameters 2024-11-18 15:39:28 +02:00
Martin Storsjö 72b5380757 arm64: looprestoration: Fix use of the wrong register
When renumbering argument registers in
1648c232ee, this one register
reference was missed.

The missed register was meant to compare h with 2, but accidentally
ended up comparing bitdepth_max to 2. In the case of 8 bpc, there's
actually no bitdepth_max parameter, so it ended up comparing an
uninitialized value.
2024-11-15 12:23:11 +02:00
Martin Storsjö and Jean-Baptiste Kempf bed3a34365 arm: Use /proc/cpuinfo on linux if getauxval is unavailable
On really old libc versions, getauxval isn't available. Fall back
on /proc/cpuinfo in those cases, just like we do on android too.
2024-11-14 14:44:21 +00:00
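A fallback like the one in bed3a34365 boils down to looking for a
flag name on the "Features" line of /proc/cpuinfo. The helper below
is a hypothetical sketch of that lookup, operating on text that has
already been read into memory; it is not the parser used in dav1d.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if `flag` appears as a whole word on a line starting
 * with "Features" in the NUL terminated text `cpuinfo`, else 0. */
static int sketch_cpuinfo_has_flag(const char *cpuinfo,
                                   const char *flag) {
    const size_t flag_len = strlen(flag);
    const char *line = cpuinfo;
    while (line && *line) {
        const char *eol = strchr(line, '\n');
        const size_t len = eol ? (size_t) (eol - line) : strlen(line);
        if (len >= 8 && !strncmp(line, "Features", 8)) {
            const char *end = line + len;
            const char *p = memchr(line, ':', len);
            if (p)
                p++; /* skip the ':' separator */
            while (p && p < end) {
                /* Skip whitespace, then measure one token. */
                while (p < end && (*p == ' ' || *p == '\t'))
                    p++;
                const char *tok = p;
                while (p < end && *p != ' ' && *p != '\t')
                    p++;
                if ((size_t) (p - tok) == flag_len &&
                    !strncmp(tok, flag, flag_len))
                    return 1;
            }
        }
        line = eol ? eol + 1 : NULL;
    }
    return 0;
}
```

Matching whole tokens matters here, since a plain substring search
for e.g. "asimd" would also match "asimddp".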
Martin Storsjö and Jean-Baptiste Kempf 718b62c8cd ci: Raise the timeout multipliers for jobs that run in QEMU
For individual tests in dav1d-test-data, the default timeout
is 30 seconds (which is the Meson default if nothing is
specified). Previously it ran with a multiplier of 4, resulting
in a total timeout of 120 seconds.

When running tests in QEMU, exceeding this 120 second timeout
could happen occasionally. Raise the multiplier to 10, allowing
each individual job to run for up to 5 minutes.

This should hopefully reduce the number of stray failures in the
CI.

For tests that already have a higher default timeout set, such
as checkasm which has got a 180 second default timeout, this results
in a much longer timeout period. However as long as we don't
frequently see issues where these actually hang, it should be
beneficial to just let them run to completion, rather than
aborting early due to a tight timeout.
2024-11-14 13:38:18 +00:00
Martin Storsjö 1648c232ee arm64: looprestoration: Remove an unnecessary duplicate parameter in dav1d_sgr_weighted2_Xbpc_neon
Also fix one case where the 32 bit input parameter w (which was in
x6, now in x4) was used without zero extension, by referring to
it as w4 instead.
2024-11-14 11:53:50 +02:00
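The zero extension issue fixed in 1648c232ee can be modelled in C.
On AArch64, the upper 32 bits of a register carrying a 32 bit
argument are unspecified, so reading it as x4 may pick up garbage,
while reading it as w4 only sees the low half. The two helpers below
are hypothetical models of those two views, not code from dav1d.

```c
#include <stdint.h>

/* Models using the full 64 bit register (x4): whatever the caller
 * left in the upper 32 bits becomes part of the value. */
static uint64_t sketch_read_as_x(uint64_t reg) {
    return reg;
}

/* Models using the 32 bit view (w4): only the low 32 bits are read,
 * and the result is zero extended. */
static uint64_t sketch_read_as_w(uint64_t reg) {
    return (uint32_t) reg;
}
```

With a register holding 0xdeadbeef00000010, the w view yields the
intended width of 16, while the x view yields a huge bogus value.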
Martin Storsjö ce80e6daf6 arm64: looprestoration: Apply simplifications to align with C code
This applies the same simplifications that were done for the C
code and the x86 assembly in 4613d3a530,
to the arm64 implementation.

This gives a minor speedup of around a couple percent.

Before:            Cortex A53        A55        A72        A73       A76  Apple M3
sgr_3x3_8bpc_neon:   368583.2   363654.2   279958.1   272065.1  169353.3  354.6
sgr_5x5_8bpc_neon:   258570.7   255018.5   200410.6   199478.3  117968.3  260.9
sgr_mix_8bpc_neon:   603698.1   577383.3   482468.3   436540.4  256632.9  541.8
After:
sgr_3x3_8bpc_neon:   367873.2   357884.1   275462.4   268363.9  165909.8  346.0
sgr_5x5_8bpc_neon:   254988.4   248184.2   190875.1   196939.1  120517.2  252.1
sgr_mix_8bpc_neon:   589204.7   563565.8   414025.6   427702.2  251651.2  533.4
2024-11-13 23:39:04 +02:00
Martin Storsjö 8bd31a92a5 arm: looprestoration: Split an overly long line 2024-11-13 15:38:20 +02:00
Martin Storsjö 55fb9433b7 checkasm: Remove leftover comment
This comment no longer is relevant after
9278a14cf4.
2024-10-18 14:37:28 +00:00
Martin Storsjö 23f2769266 meson: Test support for aarch64 extensions with gas-preprocessor too 2024-10-18 10:55:59 +00:00
Martin Storsjö b13d1bc2bb meson: Move checks for gas-preprocessor earlier
Locate the assembler tools before checking for support for various
assembler features.
2024-10-18 10:55:59 +00:00
Martin Storsjö 166e1df543 tests: Add an option to dav1d_argon.bash for using a wrapper tool
This allows executing all the tools within e.g. valgrind.

This matches the "meson test --wrap <tool>" feature.
2024-09-06 20:32:45 +00:00
Martin Storsjö 41511bf12e aarch64: Split the jump tables to a separate const section
This should allow executing in environments where the executable
memory isn't readable.

Use 4 byte entries instead of 2; most object file formats support
relocations for a 4 byte symbol difference across sections, which
allows keeping the rest of the table lookup code similar to what
it was before.

Referencing a symbol in an arbitrary location in the executable
requires a two instruction sequence (adrp+add, via the movrel
macro).

Thus, the cost of this rewrite is doubling the size of the jump
tables (which were quite small so far), and adding one instruction
in each jump table setup prologue. On an ELF build, the .text section
shrinks by 1176 bytes, and the .rodata section grows by 3136 bytes,
i.e. a 1960 byte increase.

While refactoring, prefer doing sign extension during the load
(using ldrsw rather than ldr, to avoid using the "sxtw" modifier on
the add instruction), as extending ALU arithmetics have a higher
latency.

MS armasm64 doesn't seem to support calculating symbol differences
across sections (see [1]), so keep the jump tables in the text
section there, to let the assembler calculate it at assembly time
instead. (Keeping the condition as _WIN32 for simplicity, as we don't
interact directly with armasm64, but it is wrapped in gas-preprocessor.)

[1] https://developercommunity.visualstudio.com/t/armasm64-unable-to-create-cross-section/10722340
2024-08-29 20:43:57 +00:00
Martin Storsjö 0d8abee540 Fix the macro parameter name for the CHECK_SIZE macro 2024-08-29 23:29:30 +03:00
Martin Storsjö ccb02ddf8d aarch64: Enable detection of SVE/SVE2 on Windows
WinSDK 10.0.26100 added these processor feature constants.

Unfortunately, no constant was added for I8MM, but if SVE_I8MM
is available, we can at least be sure that regular I8MM is
available too.
2024-08-26 14:04:37 +03:00
Martin Storsjö 27491dd953 aarch64: Fix a label typo
Apparently, this case isn't actually ever executed, at least in most
checkasm runs, but some tools could complain about the relocation
against 160b, which pointed elsewhere than intended.
2024-08-24 10:08:00 +03:00
Martin Storsjö e560d2ba08 aarch64: Avoid looping through the BTI instructions
This does the same optimizations as
3329f8d139 and
1790e1329d on the rest of the
code.
2024-08-23 16:15:45 +03:00
Martin Storsjö 5a33c5c628 aarch64: ipred: Use the right fill width loop in ipred_z3_fill_padding_neon
This makes the code behave as intended, when filling a rectangle
with arbitrary width (filling with the largest power of two width
until filled); previously, it accidentally fell back on writing 4
pixel wide stripes immediately.

No measurable effect on checkasm benchmarks though.
2024-08-23 12:10:35 +03:00
Martin Storsjö 3329f8d139 aarch64: mc16: Optimize the BTI landing pads in put/prep_neon
Don't include the BTI landing pad instruction in the loops.

If built with BTI enabled, AARCH64_VALID_JUMP_TARGET expands to
a no-op instruction that indicates that indirect jumps can land
there. But there's no need for the loops to include that instruction.
2024-08-22 16:34:39 +03:00
Martin Storsjö 7fbcdc6d04 aarch64: Explicitly use the ldur instruction where relevant in mc_dotprod.S
The ldr instruction can only handle offsets that are a multiple
of the element size; most assemblers implicitly produce the ldur
instruction when a non-aligned offset is provided.

Older versions of MS armasm64, however, error out on this. Since
MSVC 2022 17.8, armasm64 implicitly can produce ldur, but 2022 17.7
and earlier require explicitly writing the instruction as ldur.

Despite this, even older versions still fail to build the mc_dotprod.S
sources, with errors like this:

    src\libdav1d.a.p\mc_dotprod.obj.asm(556) : error A2513: operand 2: Constant value out of range
        mov             x10, (((0*15-1)<<7)|(3*15-1))

This happens on MSVC 2022 17.1 and older, while 17.2 and newer
accept the negative value expression here.

In practice, HAVE_DOTPROD doesn't get enabled by the Meson configure
script at the moment, as it uses inline assembly to test for external
assembler features.
2024-06-25 19:10:59 +00:00
Martin Storsjö 9469e18458 arm64: msac: Explicitly use the ldur instruction
The ldr instruction can take an immediate offset which is a multiple
of the loaded element size. If the ldr instruction is given an
immediate offset which isn't a multiple of the element size,
most assemblers implicitly generate a "ldur" instruction instead.

Older versions of MS armasm64.exe don't do this, but instead error
out with "error A2518: operand 2: Memory offset must be aligned".
(Current versions don't do this but correctly generate "ldur"
implicitly.)

Switch this instruction to an explicit "ldur", like we do elsewhere,
to fix building with these older tools.
2024-05-19 22:36:09 +03:00
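The ldr/ldur distinction from 9469e18458 and 7fbcdc6d04 can be sketched as a small C helper. It assumes the usual A64 immediate forms (a 12 bit unsigned offset scaled by the access size for ldr, a 9 bit signed byte offset for ldur); the function names are hypothetical and only serve the illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ldr with an unsigned immediate: the offset must be a non-negative
 * multiple of the access size, at most 4095 * size. */
static bool fits_ldr_scaled(int64_t offset, int size) {
    return offset >= 0 && offset % size == 0 && offset / size <= 4095;
}

/* ldur: unscaled signed byte offset in the range -256..255. */
static bool fits_ldur_unscaled(int64_t offset) {
    return offset >= -256 && offset <= 255;
}

/* Mnemonic to write out explicitly for assemblers that don't pick the
 * unscaled form on their own; NULL if neither immediate form fits. */
static const char *pick_load(int64_t offset, int size) {
    if (fits_ldr_scaled(offset, size))
        return "ldr";
    if (fits_ldur_unscaled(offset))
        return "ldur";
    return NULL;
}
```

For example, a 16 byte access at offset 8*5-16 = 24 is not a multiple of 16, so it needs the unscaled form.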
Martin Storsjö 236e1d1912 tools: Make ARM cpu flags imply relevant lower level flags
The --cpumask flag takes only a single flag name; one can't set
a combination like neon+dotprod.

Therefore, apply the same pattern as for x86, by adding mask values
that contain all the implied lower level flags.

This is somewhat complicated, as the set of features isn't entirely
linear - in particular, SVE doesn't imply either dotprod or i8mm,
and SVE2 only implies dotprod, but not i8mm.

This makes sure that "dav1d --cpumask dotprod" actually uses any
SIMD at all, as it previously only set the dotprod flag but not
neon, which essentially opted out from all SIMD.
2024-04-26 15:09:19 +00:00
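The mask pattern described in 236e1d1912 could look roughly like this in C. The flag names and bit values are hypothetical, and two points are assumptions rather than statements from the commit message: that i8mm implies dotprod, and that SVE and SVE2 imply neon.

```c
#include <string.h>

/* Hypothetical single-bit feature flags (values chosen for illustration). */
enum {
    FLAG_NEON    = 1 << 0,
    FLAG_DOTPROD = 1 << 1,
    FLAG_I8MM    = 1 << 2,
    FLAG_SVE     = 1 << 3,
    FLAG_SVE2    = 1 << 4,
};

/* Mask values that also contain the implied lower level flags. */
enum {
    MASK_NEON    = FLAG_NEON,
    MASK_DOTPROD = MASK_NEON | FLAG_DOTPROD,
    MASK_I8MM    = MASK_DOTPROD | FLAG_I8MM,            /* assumption: implies dotprod */
    MASK_SVE     = MASK_NEON | FLAG_SVE,                /* neither dotprod nor i8mm */
    MASK_SVE2    = MASK_SVE | FLAG_DOTPROD | FLAG_SVE2, /* dotprod, but not i8mm */
};

/* Map a single flag name to its mask; 0 means the name is unknown. */
static unsigned cpumask_for(const char *name) {
    static const struct { const char *name; unsigned mask; } table[] = {
        { "neon",    MASK_NEON    },
        { "dotprod", MASK_DOTPROD },
        { "i8mm",    MASK_I8MM    },
        { "sve",     MASK_SVE     },
        { "sve2",    MASK_SVE2    },
    };
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        if (!strcmp(table[i].name, name))
            return table[i].mask;
    return 0;
}
```

Because the feature set isn't linear, each mask is spelled out from its parts instead of being derived from the previous table entry.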
Martin Storsjö cb8151c969 aarch64: Avoid unaligned jump tables
Manually add a padding 0 entry to make the odd number of .hword
entries align with the instruction size.

This fixes assembling with GAS, with the --gdwarf2 option, where
it previously produced the error message "unaligned opcodes detected
in executable segment".

The message is slightly misleading, as the error is printed even
if there actually are no opcodes that are misaligned, as the jump
table is the last thing within the .text section. The issue can
be reproduced with an input as small as this, assembled with
"as --gdwarf2 -c test.s".

        .text
        nop
        .hword 0

See a6228f47f0 for earlier cases of
the same error - although in those cases, we actually did have more
code and labels following the unaligned jump tables.

This error is present with binutils 2.39 and earlier; in
binutils 2.40, this input no longer is considered an error, fixed
in https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=6f6f5b0adc9efd103c434fd316e8c880a259775d.
2024-04-22 09:11:37 +00:00
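The padding rule from cb8151c969 amounts to rounding the table size up to a multiple of the instruction size. A minimal C sketch of that arithmetic, with hypothetical names, assuming 2 byte .hword entries and 4 byte instructions:

```c
#include <stddef.h>

enum { HWORD_SIZE = 2, INSN_SIZE = 4 };

/* Number of .hword entries after padding the table to a multiple of the
 * instruction size; an odd entry count gains one padding entry. */
static size_t padded_hword_entries(size_t entries) {
    size_t bytes  = entries * HWORD_SIZE;
    size_t padded = (bytes + INSN_SIZE - 1) & ~(size_t)(INSN_SIZE - 1);
    return padded / HWORD_SIZE;
}
```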
Martin Storsjö and J. Dekker 5e31720b89 checkasm: Add support for the private macOS kperf API for benchmarking
On AArch64, the performance counter registers usually are
restricted and not accessible from user space.

On macOS, we currently use mach_absolute_time() as timer on
aarch64. This measures wallclock time but with a very coarse
resolution.

There is a private API, kperf, that one can use for getting
high precision timers though. Unfortunately, it requires running
the checkasm binary as root (e.g. with sudo).

Also, as it is a private, undocumented API, it can potentially
change at any time.

This is handled by adding a new meson build option, for switching
to this timer. If the timer source in checkasm could be changed
at runtime with an option, this wouldn't need to be a build time
option.

This allows getting benchmarks like this:

mc_8tap_regular_w16_hv_8bpc_c:              1522.1 ( 1.00x)
mc_8tap_regular_w16_hv_8bpc_neon:            331.8 ( 4.59x)

Instead of this:

mc_8tap_regular_w16_hv_8bpc_c:                 9.0 ( 1.00x)
mc_8tap_regular_w16_hv_8bpc_neon:              1.9 ( 4.76x)

Co-authored-by: J. Dekker <jdek@itanimul.li>
2024-04-02 10:35:29 +00:00
Martin Storsjö 024b260cb9 arm32: Fix right shifts in the 16bpc iwht implementation
These shifts used the wrong element size; this only was noticed in
some argon tests.
2024-03-08 21:49:57 +00:00
Martin Storsjö fd60097eb2 checkasm: aarch64: Print the SVE vector length, if available 2024-03-04 23:04:51 +02:00
Martin Storsjö e1f80dec00 aarch64: Check for assembler support for various aarch64 extensions
First check if the assembler supports the ".arch" directive, and
what architecture levels are supported.

In principle, we'd only need to check for support for ".arch armv8.2-a",
since that's enough for enabling the i8mm and sve2 extensions.

However, recent Clang versions (before version 17) weren't able to
enable the dotprod and i8mm extensions via the ".arch_extension"
directives, so check for support for armv8.4-a and armv8.6-a as well,
which enable dotprod and i8mm implicitly.

This allows assembling these instructions on most commonly available
GCC and Clang based toolchains, while still allowing toggling support
for the instruction sets on and off within the source files.

Within assembly, we disable these extensions by default, so that
instructions enabled within these extension sets can't be used
by accident in unintended functions. Code meaning to use these
extensions can be assembled like this:

    #if HAVE_SVE
    ENABLE_SVE
    // code
    DISABLE_SVE
    #endif
2024-03-04 20:50:39 +00:00
Martin Storsjö 0d2e83cc16 ci: Add an aarch64 cross compile CI job with a recent Clang 2024-02-28 16:40:14 +00:00
Martin Storsjö 302334a6be ci: Test aarch64 with QEMU, with varying SVE vector lengths
This allows testing all modern aarch64 CPU features, that the
HW based test runners might not support.

Especially for SVE, this allows testing all valid vector lengths,
which might not exist in hardware form yet.
2024-02-28 16:40:14 +00:00
Martin Storsjö 39be9fb438 ci: Bump to the latest dav1d-debian-unstable image
This one contains aarch64 cross tools, for use with QEMU.
2024-02-28 16:40:14 +00:00
Martin Storsjö 5149b27447 checkasm: Map SIGBUS to the right error text
This was missed in 2ef970a885.

Also print this text for EXCEPTION_IN_PAGE_ERROR on Windows.
2023-12-15 14:10:01 +02:00
Martin Storsjö 2179b30c84 checkasm: Fix catching crashes on Windows on ARM
longjmp on Windows uses SEH to unwind on ARM/ARM64 too, just like on
x86_64, thus use RtlCaptureContext/RtlRestoreContext instead of
setjmp/longjmp on those architectures as well.
2023-11-01 19:28:07 +02:00
Martin Storsjö a7e12b6284 windows: Clarify unicode characters in RC files
Windows RC files can have strings expressed either as narrow
chars expressed in a specific codepage, or as wide unicode strings.
Regardless of which way they are expressed, they are converted into
unicode strings in the compiled resource files.

When using narrow strings, even if using escaped chars like \251,
those chars are interpreted according to a specific codepage. The
codepage can be specified with arguments to the RC/windres tool
(or with a pragma, but not all tools support the pragmas),
but when no codepage is specified, the exact interpretation varies.

llvm-rc takes a hard stance, defaulting to only accepting ANSI
chars unless something else has been specified (and pragmas aren't
supported). llvm-windres defaults to CP 850 though, for compatibility
with what most people probably intend.

However, GNU windres and MS rc.exe actually default to what the
system's current default codepage is. That means that if the resource
file is built on a machine with e.g. Japanese as the default locale,
the file gets built differently, with a different Unicode character
than what was intended.

By converting the strings to wide strings, it is unambiguous that
\251 refers to the Unicode code point U+00A9 (octal 0251), i.e. the
copyright sign.

This fixes building the RC files with llvm-rc. With GNU windres,
llvm-windres and rc.exe, the files still generate the bitwise exact
same output as before.
2023-07-08 00:24:57 +03:00
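The narrow versus wide distinction from a7e12b6284 has a close analogue in C string literals, which is used here as an illustration only; RC files are not C, and the helper names are hypothetical. The sketch assumes a platform where wchar_t values are Unicode code points.

```c
#include <wchar.h>

/* The same octal escape, written as a narrow and as a wide literal. */
static const char    narrow_copyright[] = "\251";
static const wchar_t wide_copyright[]   = L"\251";

/* Narrow form: just the byte 0xA9; which character that denotes
 * depends on the codepage used to interpret it. */
static unsigned narrow_byte(void) {
    return (unsigned char)narrow_copyright[0];
}

/* Wide form: the value 0251 (octal) is the code point itself. */
static unsigned long wide_code_point(void) {
    return (unsigned long)wide_copyright[0];
}
```

The byte value is the same in both cases; only the wide form pins down which character is meant.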
Martin Storsjö and Henrik Gramner bc76a22015 arm: ipred: Update pal_pred to work with packed indices 2023-07-06 23:12:02 +02:00
Martin Storsjö 616bfd1506 arm32: refmvs: Fix building with MS armasm
Add an explicit align before the jump table; this avoids armasm bugs
in how label differences are calculated. This matches how all other
jump tables are written in our 32 bit arm assembly.
2023-07-01 11:36:39 +03:00
Martin Storsjö b33d77f903 arm32: refmvs: Add NEON implementation of save_tmvs
Relative speedup compared to C:
             Cortex A7     A8     A9    A53    A72    A73
save_tmvs_neon:   1.20   1.42   1.25   1.58   1.26   1.99
2023-06-30 11:44:17 +03:00
Martin Storsjö a1d7763f7b arm64: refmvs: Use addp instead of trn2+add
Also improve scheduling in the prologue and fix a few cases of
inconsistent indentation.

Before:        Cortex A53       A55      A72       A73      A76  Apple M1
save_tmvs_neon:   73657.2   74470.9  72238.1   56095.4  34135.7  207.9
After:
save_tmvs_neon:   72187.2   74434.6  71068.9   56043.9  33237.4  201.0

(The changes to the M1 numbers are mostly measurement noise though.)
2023-06-30 11:42:33 +03:00
Martin Storsjö 189d47c2fa arm64: refmvs: Fix building with MSVC
Binutils and LLVM assemblers can infer that this str instruction must
be stur (and implicitly assemble it into that instruction), while MS
armasm64 errored out with this message:

src\libdav1d.a.p\refmvs.obj.asm(673) : error A2518: operand 2: Memory offset must be aligned
        str             q2, [x3, #(8*5-16)]
2023-06-28 15:37:09 +03:00
Martin Storsjö c39779f474 arm64: refmvs: Process two blocks at a time in save_tmvs
Before:        Cortex A53       A55     A72       A73      A76  Apple M1
save_tmvs_neon:   79184.7   79889.9  54720.2  54522.6  29919.6  216.4
After:
save_tmvs_neon:   73780.0   74339.2  70414.1  59102.0  35028.4  213.9

The benefit from this is marginal on Cortex A53 and A55, and Apple
M1, while this change actually makes the code notably slower on
Cortex A72, A73 and A76.
2023-06-27 00:10:21 +03:00
Martin Storsjö 6aa37aec8f arm64: refmvs: Add NEON implementation of save_tmvs
               Cortex A53       A55      A72      A73      A76  Apple M1
save_tmvs_c:     116768.4  122653.1  82587.7  90445.0  45386.8  242.1
save_tmvs_neon:   79184.7   79889.9  54720.2  54522.6  29919.6  216.4

Relative speedup compared with C:
            Cortex A53    A55    A72    A73    A76   Apple M1
save_tmvs_neon:   1.47   1.54   1.51   1.66   1.52   1.12
2023-06-27 00:10:21 +03:00
Martin Storsjö c121b831e2 arm64: looprestoration: Rewrite the SGR functions
Make them operate in a more cache friendly manner, interleaving the
various passes, and merging some of the functions that operate on
data in similar patterns.

This reduces the amount of stack used from 207 KB to 14 KB for sgr_3x3,
from 207 KB to 16 KB for sgr_5x5 and from 255 KB to 33 KB for sgr_mix.

This does however increase the size of the binary by about 12 KB. (The
executable code generated from assembly actually shrinks by a little,
but the higher level logic in C is quite nontrivial.)

This is somewhat similar to what was done for x86 in
fe2bb77424.

Benchmarks from checkasm:

Before:             Cortex A53        A55        A72        A73        A76   Apple M1
sgr_3x3_8bpc_neon:    493005.0   483133.2   365056.3   345197.9   202819.1   537.3
sgr_5x5_8bpc_neon:    353152.6   349614.3   268962.2   248431.8   142302.4   385.9
sgr_mix_8bpc_neon:    829903.9   815910.9   622858.5   577238.0   333362.9   881.7
sgr_3x3_10bpc_neon:   504778.6   499851.6   379203.1   346695.2   199738.7   537.0
sgr_5x5_10bpc_neon:   363111.9   362489.7   267903.1   247506.5   138417.2   351.3
sgr_mix_10bpc_neon:   853053.7   846768.8   628349.6   584553.8   328399.5   843.6

After:
sgr_3x3_8bpc_neon:    387949.9   384216.4   294423.7   301968.2   184643.1   492.4
sgr_5x5_8bpc_neon:    259854.7   257233.2   193983.7   198388.4   128497.0   341.2
sgr_mix_8bpc_neon:    606401.5   595661.3   457209.7   462721.8   281906.7   738.6
sgr_3x3_10bpc_neon:   392472.7   394100.5   296048.1   304339.4   184271.4   471.3
sgr_5x5_10bpc_neon:   257248.3   257651.1   197552.5   199655.1   130739.7   322.9
sgr_mix_10bpc_neon:   605263.3   611197.4   441789.3   461339.2   286320.1   721.4

Speedup vs before:
                        27-41%     25-40%     23-42%     13-26%      5-18%   8-19%
2023-06-22 13:57:17 +03:00
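The "Speedup vs before" row in c121b831e2 appears consistent with computing before/after - 1 per benchmark and quoting the range per core. A minimal C sketch of that arithmetic, under that assumption (the function name is hypothetical):

```c
/* Speedup of `after` relative to `before`, in percent:
 * (before / after - 1) * 100. */
static double speedup_percent(double before, double after) {
    return (before / after - 1.0) * 100.0;
}
```

For the Cortex A53 column, sgr_3x3_8bpc_neon (493005.0 before, 387949.9 after) comes out at about 27%, and sgr_5x5_10bpc_neon (363111.9 before, 257248.3 after) at about 41%, matching the ends of the quoted 27-41% range.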
Martin Storsjö 3c2f2087d8 arm64: looprestoration: Properly use 32 bit registers for 32 bit parameters
This issue isn't caught by checkasm, since these functions are
internal to the SGR implementation, and checkasm only affects
the parameters on the external DSP function interface.

This could potentially trigger errors with future compilers.
2023-06-22 11:03:35 +03:00
Martin Storsjö 77d0cbaf0e Avoid an MSVC warning about conversion to smaller data types
After 8f320d5958, MSVC started
producing this warning:

[63/123] Compiling C object src/libdav1d.a.p/obu.c.obj
../src/obu.c(708): warning C4244: '=': conversion from 'uint16_t' to 'uint8_t',
possible loss of data
2023-06-07 11:04:37 +00:00
Martin Storsjö ca39c862ac arm64: ipred: 16 bpc NEON implementation of the Z2 function
Relative speedup over unvectorized C code:
                          Cortex A53    A55    A72    A73    A76   Apple M1
intra_pred_z2_w4_16bpc_neon:    2.98   2.98   2.38   2.77   3.19   7.75
intra_pred_z2_w8_16bpc_neon:    3.91   4.22   2.64   3.29   3.73   4.78
intra_pred_z2_w16_16bpc_neon:   4.43   5.12   2.89   3.90   3.50   4.26
intra_pred_z2_w32_16bpc_neon:   5.08   6.36   3.44   4.40   4.05   4.96
intra_pred_z2_w64_16bpc_neon:   4.68   5.97   3.29   4.40   3.68   5.23
2023-05-25 16:51:35 +03:00
Martin Storsjö 1dd0cd3a39 arm64: ipred: Remove unnecessary instructions from z2_fill 2023-05-25 16:51:35 +03:00
Martin Storsjö and Jean-Baptiste Kempf 8af8244a3a arm64: ipred: 8 bpc NEON implementation of the Z2 function
Relative speedup over C code:
                         Cortex A53    A55    A72    A73    A76   Apple M1
intra_pred_z2_w4_8bpc_neon:    3.91   3.55   3.31   3.94   3.46   8.50
intra_pred_z2_w8_8bpc_neon:    5.68   5.67   4.31   5.31   4.34   5.83
intra_pred_z2_w16_8bpc_neon:   8.39   9.28   5.53   7.04   7.01   9.45
intra_pred_z2_w32_8bpc_neon:   7.01   8.01   5.04   6.32   5.48   7.48
intra_pred_z2_w64_8bpc_neon:   8.73  10.25   5.92   7.61   6.63  10.05
2023-05-05 15:40:57 +00:00
Martin Storsjö e75caab99e arm64: ipred: 16 bpc NEON implementation of the Z3 function
Relative speedup over the C code:
                          Cortex A53    A55    A72    A73    A76   Apple M1
intra_pred_z3_w4_16bpc_neon:    3.06   2.87   2.17   1.97   2.33   7.75
intra_pred_z3_w8_16bpc_neon:    3.90   3.94   2.97   3.16   2.93   4.43
intra_pred_z3_w16_16bpc_neon:   4.08   4.48   3.31   4.68   3.13   5.00
intra_pred_z3_w32_16bpc_neon:   4.43   4.85   3.50   4.02   3.33   5.62
intra_pred_z3_w64_16bpc_neon:   4.68   5.30   3.72   3.96   3.52   5.78
2023-03-21 08:57:44 +02:00
Martin Storsjö 2eb9239100 arm64: ipred: 16 bpc NEON implementation of the Z1 function
Relative speedup over the C code:
                          Cortex A53    A55    A72    A73    A76   Apple M1
intra_pred_z1_w4_16bpc_neon:    3.49   2.63   2.83   3.85   3.14   9.00
intra_pred_z1_w8_16bpc_neon:    6.19   4.39   3.65   6.58   4.99   6.50
intra_pred_z1_w16_16bpc_neon:   6.65   4.64   3.97   7.78   4.87   7.00
intra_pred_z1_w32_16bpc_neon:   7.76   5.49   5.17   7.83   5.59   8.24
intra_pred_z1_w64_16bpc_neon:   8.02   5.80   5.33   8.41   5.77   8.70
2023-03-21 08:57:43 +02:00
Martin Storsjö ec38062a12 arm: ipred: Make a SIMD pixel_set function for padding
For 8 bpc, there's probably not much difference to a decent memset,
but for 16 bpc, there might be a bigger difference.
2023-03-21 08:57:43 +02:00
Martin Storsjö 6f5bf165e4 arm64: ipred: Use fewer registers for table lookups in w=8 in z3_fill1 for 8bpc
Add comments explaining the exact dimensions of the gather tables
used currently. That reasoning shows that the w=8 case can do with
one register less.

Before:                  Cortex A53     A55     A72     A73    A76  Apple M1
intra_pred_z3_w8_8bpc_neon:   356.2   376.2   218.9   246.4  176.1  0.6
After:
intra_pred_z3_w8_8bpc_neon:   339.6   357.3   205.6   232.3  160.0  0.5
2023-03-21 08:57:43 +02:00
Martin Storsjö 7be5347c97 arm64: ipred: Improve accumulation ordering in 8bpc z1
Start out the multiplication/accumulation with a register that is
available sooner.

Before:                    Cortex A53      A55      A72      A73     A76   Apple M1
intra_pred_z1_w8_8bpc_neon:     266.3    268.9    146.6    155.3   103.9   0.4
intra_pred_z1_w16_8bpc_neon:    528.6    574.4    333.9    364.3   209.1   0.7
intra_pred_z1_w32_8bpc_neon:   1149.3   1245.4    752.3    811.5   503.4   1.3
intra_pred_z1_w64_8bpc_neon:   2198.4   2360.6   1462.9   1575.0  1007.6   2.4
After:
intra_pred_z1_w8_8bpc_neon:     266.3    269.1    146.6    155.0   100.1   0.4
intra_pred_z1_w16_8bpc_neon:    528.6    573.3    347.9    352.4   204.3   0.7
intra_pred_z1_w32_8bpc_neon:   1149.2   1245.3    763.4    759.6   474.8   1.3
intra_pred_z1_w64_8bpc_neon:   2198.8   2360.6   1430.0   1417.4   943.5   2.3
2023-03-21 08:57:43 +02:00
Martin Storsjö 92d93f4b35 arm64: ipred: Optimize the 3tap filter padding in z1_filter_edge
The second register will at most contain one valid pixel, the
padding pixel. Thus skip padding the register and just fill it
with the padding pixel.
2023-03-21 08:57:43 +02:00
Martin Storsjö 8ee450cbd0 arm64: ipred: Remove leftover instructions at the start of z3_fill2
There were redundant leftovers from copypasting bits when writing this
function.
2023-03-21 08:57:43 +02:00
Martin Storsjö ab6977bc04 arm64: ipred: Rename a misnamed local label in the assembly
This is for cases with h >= 16.
2023-03-21 08:57:42 +02:00
Martin Storsjö da9602a32b arm64: ipred: Fix a misindented operand in the assembly 2023-03-21 08:57:42 +02:00
Martin Storsjö 50a89b6383 arm: ipred: Fix a misindented line in the C wrapper 2023-03-21 08:57:42 +02:00
Martin Storsjö and Matthias Dressel 5c9d651edc Add a -j option to dav1d_argon.bash 2023-03-01 19:59:10 +01:00
Martin Storsjö ef0fb0b6fc Fix building with MSVC after recent commit
98b0c96d21 added an include of
src/ref.h in src/fg_apply_tmpl.c. That template source file is
included in tests/checkasm/filmgrain.c.

src/ref.h includes <stdatomic.h>. Including this file requires
declaring a dependency on stdatomic_dependencies in meson, which
provides the fallback implementation of stdatomic.h when building
with MSVC.
2023-02-27 01:04:25 +02:00
Martin Storsjö 77b3955537 checkasm: Add an --affinity= option for selecting a CPU core
Add an option for selecting the core where the single thread of
checkasm runs. This allows benchmarking on specific CPU cores on
heterogeneous CPUs, like ARM big.LITTLE configurations.

On Linux, one can easily wrap an invocation of checkasm with
"taskset -c <n> [...]" - so this option isn't very essential
there - however it is quite useful on Windows.

On Windows, it is somewhat possible to do the same by launching
the tool with "start /B /affinity <hexmask> [...]", but that
doesn't work well with scripting ("start" returns before the
command has finished running, and it's not obvious how to
invoke "start" from within WSL).

Using "taskset" to launch processes on specific cores within WSL
on Windows doesn't work - regardless of the Linux level affinity,
the process ends up running on the performance cores anyway.
2023-01-31 15:33:58 +02:00
Martin Storsjö 99956c737a arm64: ipred: 8 bpc NEON implementation of the Z3 function
The implementation is a hybrid between two approaches. One is generic
(but non-ideal), for cases with large max_base_y; it fills two pixel
columns at a time, i.e. it loops over pixels first vertically, then
horizontally, which is not optimal.

For cases with smaller max_base_y, it does two rows at a time, essentially
doing gathers with the TBX instruction.

Relative speedup over the C code:

                         Cortex A53    A55    A72    A73    A76   Apple M1
intra_pred_z3_w4_8bpc_neon:    3.32   2.89   2.78   3.52   2.52   9.67
intra_pred_z3_w8_8bpc_neon:    6.24   5.55   4.76   5.60   4.11   6.40
intra_pred_z3_w16_8bpc_neon:   7.64   7.07   4.37   6.23   4.18   8.60
intra_pred_z3_w32_8bpc_neon:   7.51   7.21   4.34   5.92   4.27   7.88
intra_pred_z3_w64_8bpc_neon:   6.82   6.25   4.08   5.83   3.52   7.31
2023-01-31 10:16:16 +02:00
Martin Storsjö fd4f348e70 arm64: ipred: 8 bpc NEON implementation of the Z1 function
Relative speedup over the C code:

                         Cortex A53    A55    A72    A73    A76  Apple M1
intra_pred_z1_w4_8bpc_neon:    4.09   3.15   3.63   4.16   3.27  13.00
intra_pred_z1_w8_8bpc_neon:    6.93   5.66   5.57   6.76   5.51   5.50
intra_pred_z1_w16_8bpc_neon:   7.81   6.85   6.24   7.78   6.59   9.00
intra_pred_z1_w32_8bpc_neon:  10.56   9.95   8.72  10.95   8.28  13.33
intra_pred_z1_w64_8bpc_neon:  11.00  11.38   9.11  11.62   8.65  14.61

(The speedup numbers for M1 are kinda noisy due to the very coarse
granularity of the timer used there.)
2023-01-27 23:54:44 +02:00
Martin Storsjö 2e990b370e checkasm: ipred: Iterate 5 times for each Z1/Z2/Z3 function
These functions contain a number of different codepaths; try to
make sure that we hit most codepaths for each size combination.

This both gives better test coverage in one single run of checkasm,
and should also give a better averaged runtime in benchmarks.
2023-01-27 23:54:20 +02:00
Martin Storsjö 8a4932ff03 Implement atomic_compare_exchange_strong in the atomic compat headers
This fixes building with MSVC (and older GCC versions) after
3e7886db54.
2022-10-26 16:14:52 +03:00
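For reference, the standard C11 function that 8a4932ff03 adds a fallback for is typically used in a retry loop like the one below. This is a generic usage sketch built on <stdatomic.h>, not code from dav1d or its compat headers; `update_max` is a hypothetical helper.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Raise *max to `candidate` unless it already is at least that large.
 * Returns true if this call stored the new value. */
static bool update_max(atomic_int *max, int candidate) {
    int seen = atomic_load(max);
    while (seen < candidate) {
        /* On failure, `seen` is overwritten with the current value,
         * so the loop condition is re-evaluated against fresh data. */
        if (atomic_compare_exchange_strong(max, &seen, candidate))
            return true;
    }
    return false;
}
```

A fallback implementation has to preserve both halves of that contract: the boolean result and the write-back of the observed value into the expected argument.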
Martin Storsjö 345127a795 arm: itx: Add clipping to row_clip_min/max in the 10 bpc codepaths
This fixes conformance with the argon test samples, in particular
with these samples:
    profile0_core/streams/test10100_579_8614.obu
    profile0_core/streams/test10218_6914.obu

This gives a pretty notable slowdown to these transforms - some
examples:

Before:                                 Cortex A53       A72       A73    Apple M1
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:       365.7     290.2     299.8    0.3
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:    1865.2    1384.1    1457.5    2.6
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:   33976.3   26817.0   24864.2   40.4
After:
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:       397.7     322.2     335.1    0.4
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:    2121.9    1336.7    1664.6    2.6
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:   38569.4   27622.6   28176.0   51.0

Thus, for the transforms alone, it makes them around 10-13% slower
(the Apple M1 measurements are too noisy to be conclusive here).

Measured on actual full decoding, it makes decoding of 10 bpc
Chimera roughly 1% slower on an Apple M1, which is close to measurement
noise anyway.
2022-09-19 20:40:34 +00:00
Martin Storsjö cc9651f516 Don't use gas-preprocessor with clang-cl for arm targets
Since meson 0.58.0 (released in May 2021), meson accepts adding '.S'
assembly files as source files to the clang-cl compiler.

If using an older version of meson, keep using gas-preprocessor
just like for MSVC builds.
2022-09-15 11:25:37 +03:00
Martin Storsjö 08c708015e tools: Allocate the priv structs with proper alignment
Previously, they could be allocated with any random alignment
matching the end of the MuxerContext/DemuxerContext. The
priv structs themselves can have members that require specific
alignment, or at least the default alignment of malloc()/calloc()
(which is sufficient for native types such as uint64_t and
doubles).

This fixes crashes in some arm builds, where GCC (correctly) wants
to use 64 bit aligned stores to write to MD5Context.
2022-09-14 15:59:19 +03:00
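One common way to get the alignment that 08c708015e asks for, when a private struct is placed in the same allocation as its context, is to round the offset up to the strictest fundamental alignment. The sketch below assumes that layout; `Context`, `alloc_with_priv` and `align_up` are hypothetical and not the actual MuxerContext/DemuxerContext code.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round `size` up to a multiple of `align` (which must be a power of two). */
static size_t align_up(size_t size, size_t align) {
    return (size + align - 1) & ~(align - 1);
}

/* Hypothetical context type with a trailing private area. */
typedef struct Context {
    char kind;
    void *priv;
} Context;

/* Allocate the context and its private data in one zeroed block, with
 * the private data starting at a max_align_t aligned offset. */
static Context *alloc_with_priv(size_t priv_size) {
    const size_t priv_offset = align_up(sizeof(Context), _Alignof(max_align_t));
    Context *c = calloc(1, priv_offset + priv_size);
    if (!c)
        return NULL;
    c->priv = (uint8_t *)c + priv_offset;
    return c;
}
```

Since calloc() already returns memory aligned for any fundamental type, keeping the offset a multiple of that alignment is enough for members such as uint64_t or double.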