dav1d

mirror of https://code.videolan.org/videolan/dav1d synced 2026-06-11 04:03:05 +00:00

Author	SHA1	Message	Date
Martin Storsjö and Jean-Baptiste Kempf	720adf9b5b	ci: Add -Dtrim_dsp=false in a couple of aarch64/arm configurations For the "release" build configurations, trim_dsp defaults to true, while it defaults to false for "debugoptimized". This means that the configurations with release mode, without -Dtrim_dsp=false, did not actually run checkasm before. In practice, checkasm is covered by later, full-test configurations, but this ensures that we do test it at this stage as well, as intended.	2026-06-07 22:36:10 +02:00
Martin Storsjö	c85856e360	aarch64: Fix a name mismatch in a macro error message For the 64 bit assembly, the macro is just named "sub_sp", while it was named "sub_sp_align" in the 32 bit form.	2026-05-15 14:24:57 +03:00
Martin Storsjö	037430193a	arm: Fix up code style slightly The existing code has been written striving to align columns so that the largest register names can be typed, e.g. r10 on ARM (and similarly for x10 or q10 on AArch64), or v31.16b for AArch64 vectors. Fix some cases, where the current forms were clearly inconsistent/wrong. Not all cases have been fixed up to match this norm, but some individual ones that were clearly wrong have been fixed.	2026-05-06 15:32:26 +00:00
Martin Storsjö	7b9ab8373e	ci: Update the main CI image This version includes llvm-symbolizer, which should improve backtraces in sanitizer builds with Clang.	2026-05-06 15:11:16 +00:00
Martin Storsjö	ac5dfb0a85	examples: Treat SDL2 headers as system headers This makes those headers included with -isystem rather than -I, which makes the compiler skip producing any warnings about them (as they're expected to be out of the user code's control). This avoids warnings with newer versions of the dav1d-debian-unstable CI image, warnings (treated as errors in CI) like this: In file included from /usr/include/SDL2/SDL_config.h:51, from /usr/include/SDL2/SDL_stdinc.h:33, from /usr/include/SDL2/SDL_main.h:25, from /usr/include/SDL2/SDL.h:31, from ../examples/dav1dplay.c:33: /usr/include/SDL2/SDL_config_unix.h:186:9: error: 'HAVE_GETAUXVAL' redefined [-Werror] 186 | #define HAVE_GETAUXVAL 1 | ^~~~~~~~~~~~~~ In file included from ../examples/dav1dplay.c:27: ./config.h:66:9: note: this is the location of the previous definition 66 | #define HAVE_GETAUXVAL 0 | ^~~~~~~~~~~~~~ Recently, Debian Unstable has switched from providing the actual SDL 2 to providing the SDL 2 API through the sdl2-compat package on top of SDL 3. The SDL 2 headers expose their full config.h as part of their installed headers (that the user code ends up including). This includes unnamespaced defines, such as "#define HAVE_GETAUXVAL 1". This issue hasn't shown up with the original SDL 2 package in Debian, due to a Debian packaging detail. While most SDL 2 headers are installed in /usr/include/SDL2 (and user code includes it as <SDL.h>, requiring the build system to include /usr/include/SDL2), the Debian packaging has replaced /usr/include/SDL2/SDL_config.h with a header that includes <SDL2/_real_SDL_config.h>, which then gets resolved in /usr/include/x86_64-linux-gnu/SDL2. Due to this being included from a compiler default system include path (/usr/include/x86_64-linux-gnu), no warnings about the header were printed, even though that one also produced the same kind of conflicting redefinitions.
(We could also avoid the same issue by attempting to include <SDL2/SDL.h> instead of <SDL.h>, avoiding the use of the build system provided include directory, resolving that from /usr/include, and having the compiler consider it a system header.) The sdl2-compat package in Debian doesn't redirect that header in the same way, but includes SDL_config_unix.h in the same directory in /usr/include/SDL2. Due to this being included from a user specified -I (as long as it is included as <SDL.h>, not <SDL2/SDL.h>), it's considered a user header, and warnings are printed for it. It seems like SDL 3 no longer exposes its config.h headers as part of the installed headers. The conflict between SDL 2's config.h's HAVE_GETAUXVAL and ours stems from the fact that we only try to detect getauxval on architectures where we want to use it (arm/aarch64, loongarch, ppc or riscv). On x86, where we don't need it, we don't try to detect it, and set "#define HAVE_GETAUXVAL 0" in our config.h. To avoid warnings due to the conflict, we can declare the SDL 2 dependency with the argument "include_type: 'system'", which should silence any warnings in the SDL headers. This Meson feature is available since Meson 0.52.0 (and we currently require Meson 0.54.0). An alternative way to avoid the redefinition conflict would be to always try to detect getauxval on all architectures, to make our config.h agree with SDL 2's config headers. A third (and much more hacky) way around the conflict would be to avoid the public SDL headers including the SDL_config header by defining "SDL_config_h_" before including SDL.h. Doing this also requires manually including a couple more standard headers before SDL.h (stdint.h, stdio.h, stddef.h).	2026-05-06 14:00:31 +03:00
Martin Storsjö	556c5202b4	ci: Add testing on macOS on Apple Silicon too	2026-05-03 22:39:01 +03:00
Martin Storsjö	e1bd6f76c2	checkasm: Readd a dependency on threads `3a2a874994`, which switched to using the checkasm core from the separate checkasm project, removed the thread dependency from the checkasm executable, as the checkasm library itself has a thread dependency. However, checkasm doesn't always include that thread dependency, it only does that when pthread_setaffinity_np is detected. The dav1d object files themselves use pthreads as well, causing undefined symbols if checkasm doesn't link in pthreads. This should fix linking on OpenBSD after `3a2a874994`, fixing issue #467.	2026-05-03 12:42:39 +03:00
Martin Storsjö	5cfc383268	arm: mc: Optimize prep_neon for the w4/w8 cases Use alternating registers for immediately sequential loads/stores, pack two 4 pixel rows into one register. Before: Cortex A7 A8 A53 A55 A72 A73 A76 mct_8tap_regular_w4_0_8bpc_neon: 112.0 68.6 79.7 82.9 45.3 39.4 24.4 mct_8tap_regular_w8_0_8bpc_neon: 158.2 89.5 108.4 113.4 55.4 53.0 30.0 After: mct_8tap_regular_w4_0_8bpc_neon: 89.7 69.9 76.3 85.1 36.2 35.2 25.0 mct_8tap_regular_w8_0_8bpc_neon: 149.0 92.7 102.6 115.8 56.6 52.8 31.4 The numbers aren't entirely consistent, but this is mostly favourable.	2026-04-29 15:56:01 +03:00
Martin Storsjö	727d0f984b	arm: mc: Fix a comment typo This seems to be right in all the other similar places (arm/64/mc.S, arm/32/mc16.S and arm/64/mc16.S).	2026-04-29 15:56:01 +03:00
Martin Storsjö	aa4504729c	arm: Fix a typo in a URL This was added in `a00289b6d8`.	2026-03-31 13:48:43 +03:00
Martin Storsjö	594d1601ff	arm: Add Armv9.3-A GCS (Guarded Control Stack) support Signal that our assembly is compliant with the GCS feature, if the GCS feature is enabled in the compiler (available since Clang 18 and GCC 15) - this is enabled by -mbranch-protection=standard with a new enough compiler. GCS doesn't require any specific modifications to the assembly code, but requires that all functions return to the expected call address (checked through a shadow stack).	2026-03-17 20:40:05 +00:00
Martin Storsjö	4fd22e97d8	arm: Switch to a more correct Windows flag for detecting I8MM Newer revisions of WinSDK 10.0.26100.0 have exposed more flags for IsProcessorFeaturePresent; now there is a separate one for detecting specifically I8MM and not just SVE-I8MM. Switch to using this flag instead.	2026-03-04 15:16:37 +02:00
Martin Storsjö	de4ce4f32d	arm: mc: Add missing # for some immediate constants, for consistency The assembler doesn't require the # here, but we use that everywhere else, so add it here as well for consistency.	2026-02-06 16:01:57 +02:00
Martin Storsjö	9c13b5fbd0	subprojects: Update checkasm to v1.1.0 This version, together with the previous commit `574e7f4727`, fixes issue #460. Due to checkasm internal restructuring, one may run into build issues if rebuilding in an old build directory after updating the checkasm subproject, without getting rid of older meson generated headers in the build directory.	2026-02-03 09:27:22 +00:00
Martin Storsjö	a44b589872	Silence a new MSVC warning This silences the following warnings in MSVC 2026 18.0 (and 2022 17.14): ../tools/dav1d_cli_parse.c(213): warning C5287: operands are different enum types 'CpuFlags' and 'CpuMask'; use an explicit cast to silence this warning ../tools/dav1d_cli_parse.c(214): warning C5287: operands are different enum types 'CpuFlags' and 'CpuMask'; use an explicit cast to silence this warning ../tools/dav1d_cli_parse.c(215): warning C5287: operands are different enum types 'CpuFlags' and 'CpuMask'; use an explicit cast to silence this warning ../tools/dav1d_cli_parse.c(216): warning C5287: operands are different enum types 'CpuFlags' and 'CpuMask'; use an explicit cast to silence this warning This warning flag was new in MSVC 2022 17.14, but it was buggy in that version - it produced spurious warnings for other cases as well (and using an explicit cast to silence it didn't work as advertised), see [1] and [2]. The bugs were fixed in 18.0, and the remaining construct that it warns about is something that is somewhat reasonable to warn about: enum CpuFlags { DAV1D_X86_CPU_FLAG_SSE2 = 1 << 0, DAV1D_X86_CPU_FLAG_SSSE3 = 1 << 1, }; enum CpuMask { X86_CPU_MASK_SSE2 = DAV1D_X86_CPU_FLAG_SSE2, X86_CPU_MASK_SSSE3 = DAV1D_X86_CPU_FLAG_SSSE3 | X86_CPU_MASK_SSE2, }; Instead of adding explicit casts on the constants from the foreign enum, just disable this warning. [1] https://developercommunity.visualstudio.com/t/False-positive-C5287:-operands-are-diff/10915265 [2] https://developercommunity.visualstudio.com/t/warning-C5287:-operands-are-different-e/10877942	2026-01-20 12:00:29 +02:00
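The construct behind C5287 can be looked at in isolation. The sketch below reuses the two enums quoted in the message above and shows the explicit-cast alternative that the commit chose not to take. The helper `mask_has_flag` is hypothetical and not part of dav1d; it only illustrates that a mask value carries the implied lower-level flag.

```c
#include <stdbool.h>

enum CpuFlags {
    DAV1D_X86_CPU_FLAG_SSE2  = 1 << 0,
    DAV1D_X86_CPU_FLAG_SSSE3 = 1 << 1,
};

enum CpuMask {
    X86_CPU_MASK_SSE2  = DAV1D_X86_CPU_FLAG_SSE2,
    /* Combining a CpuFlags constant with a CpuMask constant is the pattern
     * the warning is about. The (int) casts are the explicit-cast
     * alternative; in C they do not change the resulting value. */
    X86_CPU_MASK_SSSE3 = (int)DAV1D_X86_CPU_FLAG_SSSE3 | (int)X86_CPU_MASK_SSE2,
};

/* Hypothetical helper: true if `mask` contains the bit for `flag`. */
static bool mask_has_flag(unsigned mask, enum CpuFlags flag) {
    return (mask & (unsigned)flag) != 0;
}
```

With these definitions, the SSSE3 mask contains both the SSSE3 and SSE2 bits, while the SSE2 mask contains only the SSE2 bit.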
Martin Storsjö	04b69f93e5	checkasm: Reinstate check for TRIM_DSP_FUNCTIONS This was lost in `3a2a874994`. Without this, checkasm ends up printing a quite confusing output consisting only of the functions that have two or more assembly implementations, if trim_dsp happens to be enabled.	2026-01-14 22:29:38 +02:00
Martin Storsjö	afd13d8906	arm: Fix a few misindented lines	2026-01-09 14:00:39 +02:00
Martin Storsjö	574e7f4727	checkasm: Pass HAVE_C11_GENERIC to checkasm as -DCHECKASM_HAVE_GENERIC=1/0 For this to have an effect, it requires using a newer version of the wrapped checkasm subproject; including checkasm commit be05a7972e47c658a7c5c186294d27caa5735db2 or newer.	2026-01-07 16:22:18 +02:00
Martin Storsjö	b2f9c10670	checkasm: Fix building with MSVC The glue code in our headers, for integrating with the external checkasm, was incompatible with MSVC. MSVC has a nonstandard handling of __VA_ARGS__ with macros; when one macro invokes another macro, __VA_ARGS__ gets treated as one single parameter and can't map to more than one parameter in the invoked macro. (In other words, when calling another macro, __VA_ARGS__ must map in its entirety to a ... parameter of the other macro.) Modern versions of MSVC do implement the correct mode as well, but default to the old one for backwards compatibility. To choose the new mode, we'd have to build our code with -Zc:preprocessor. That's certainly doable, but it's fairly easy to avoid the issue as well. To avoid this issue, change the variadic PIXEL_RECT(...) to explicitly name its arguments. There's actually no variability in the arguments involved here. (Alternatively, we could force the preprocessor to expand the arguments one extra time, avoiding the issue, with e.g. "#define EXPAND(x) x" and wrapping PIXEL_RECT with it, e.g. "#define PIXEL_RECT(...) EXPAND(BUF_RECT(pixel, __VA_ARGS__))".) See [1], [2] and [3] for more discussion on the matter. [1] https://stackoverflow.com/a/5134656/3115956 [2] https://stackoverflow.com/a/7459803/3115956 [3] https://learn.microsoft.com/en-us/cpp/preprocessor/preprocessor-experimental-overview?view=msvc-160	2026-01-07 11:57:09 +02:00
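The two ways around the MSVC preprocessor behaviour described above can be compared side by side. In this sketch, `BUF_RECT` is a hypothetical stand-in that just computes a byte count (the real macro in the checkasm glue code does something else), and the `_NAMED` / `_VARIADIC` suffixes and the two wrapper functions are invented for illustration.

```c
#include <stddef.h>

typedef unsigned char pixel;

/* Hypothetical stand-in for the real BUF_RECT: size of a w x h buffer. */
#define BUF_RECT(type, w, h) (sizeof(type) * (size_t)(w) * (size_t)(h))

/* Approach taken by the commit: name the arguments explicitly, so that
 * __VA_ARGS__ never has to be split across several parameters. */
#define PIXEL_RECT_NAMED(w, h) BUF_RECT(pixel, w, h)

/* Alternative mentioned in the message: force one extra expansion pass,
 * so the forwarded arguments are split before BUF_RECT is invoked. */
#define EXPAND(x) x
#define PIXEL_RECT_VARIADIC(...) EXPAND(BUF_RECT(pixel, __VA_ARGS__))

static size_t rect_named(void)    { return PIXEL_RECT_NAMED(16, 8); }
static size_t rect_variadic(void) { return PIXEL_RECT_VARIADIC(16, 8); }
```

On a conforming preprocessor both variants expand to the same expression. The named form has the advantage of not depending on any expansion-order subtleties at all.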
Martin Storsjö	e7c280e4cd	x86: Sync the latest upstream version of x86inc.asm	2025-11-12 23:19:22 +02:00
Martin Storsjö	2eac05d648	checkasm: arm: Use X() instead of inline ifdefs This works fine when the referenced symbol has the same prefix as PRIVATE_PREFIX in the same file; otherwise we could also create a macro like X() that only prepends the extern symbol prefix but no symbol namespace prefix.	2025-11-12 15:54:40 +02:00
Martin Storsjö	b129d9f2cb	mc: Reduce stack use in {put,prep}_scaled_{bilin,8tap} For the bilin cases, this seems to make things marginally faster (measured on x86_64; 7-25% faster with compiler autovectorization). For 8tap, it doesn't make much of a difference at all. Before: GCC Clang mc_scaled_8tap_regular_w128_8bpc_c: 115155.5 98549.3 mc_scaled_8tap_regular_w128_8bpc_ssse3: 17936.0 18411.1 mc_scaled_bilinear_w128_8bpc_c: 40290.0 51812.9 mc_scaled_bilinear_w128_8bpc_ssse3: 18243.9 18177.0 After: mc_scaled_8tap_regular_w128_8bpc_c: 116304.3 99453.2 mc_scaled_8tap_regular_w128_8bpc_ssse3: 18387.0 18077.3 mc_scaled_bilinear_w128_8bpc_c: 37381.4 41145.0 mc_scaled_bilinear_w128_8bpc_ssse3: 18423.8 18031.6 (Benchmarked with the seed 0; the total runtime for the scaled benchmarks are significantly affected by the random seed.) This reduces the stack usage of these functions from around 65 KB each, to less than 1 KB for bilin, and around 2 KB for 8tap. With this in place, the required stack space for dav1d should be mostly identical across configurations; on x86_64 (both with and without assembly), it can run with 62 KB of stack, and on arm and aarch64, it can run with 58 KB of stack.	2025-01-02 15:30:21 +00:00
Martin Storsjö	2ba57aa535	arm32: looprestoration: Rewrite the wiener functions Switch to the same cache-friendly algorithm as was done for arm64 in `2e73051c57` and for the reference C code in `8291a66e50`. Contrary to the arm64 implementation, this uses a main loop in C (very similar to the one in the main C implementation in `8291a66e50`) rather than assembly; this gives a bit more overhead on the call to each function, but it shouldn't affect the big picture much. Performance-wise, this doesn't make much of a difference - it makes things a little bit faster on some cores, and a little bit slower on others: Before: Cortex A7 A8 A53 A72 A73 wiener_7tap_8bpc_neon: 269384.4 147730.7 140028.5 92662.5 92929.0 wiener_7tap_10bpc_neon: 352690.2 159970.2 169427.8 116614.9 119371.1 After: wiener_7tap_8bpc_neon: 238328.0 157274.1 134588.6 92200.3 97619.6 wiener_7tap_10bpc_neon: 336369.3 162182.0 161954.4 125521.2 130634.0 This is mostly in line with the results on arm64 in `2e73051c57`. On arm64, there was a bit larger speedup for the 7tap case, mostly attributed to unrolling the vertical filter (and the new filter_hv function) to operate on 16 pixels at a time. On arm32, there aren't enough registers to do that, so we can't get such gains from unrolling. (Reducing the unrolling on the arm64 version to match the case on arm32 also shows similar performance numbers as on arm32 here.) In the arm64 version, we also added separate 5tap versions of all functions; not doing that for arm32 at this point. This increases the binary size by 2 KB. This doesn't have any immediate effect on how much stack space dav1d requires in total, since the largest stack users on arm currently are the 8tap_scaled functions.	2024-12-20 14:32:32 +02:00
Martin Storsjö	8291a66e50	looprestoration: Use only 6 row buffer for wiener, like NEON/x86 This uses a separate function for combined horizontal and vertical filtering, without needing to write the intermediate results back to memory in between. This mostly serves as an example for how to adjust the logic for that case; unless we actually merge the horizontal and vertical filtering within the _hv function, we still need space for a 7th row on the stack within that function (which means we use just as much stack as before), but we also need one extra memcpy to write it into the right destination. In a build where the compiler is allowed to vectorize and inline the wiener functions into each other, this change actually reduces the final binary size by 4 KB, if the C version of the wiener filter is retained. This change makes the vectorized C code as fast as it was before with Clang 18; on Xcode Clang 16, it's 2x slower than it was before. Unfortunately, with GCC, this change makes the code a bit slower again.	2024-12-19 14:19:19 +02:00
Martin Storsjö	a149f5c3c0	looprestoration: Make the C wiener h filter more optimizable for the compiler This increases the binary size by 9 KB, on aarch64 with Xcode Clang 16, if the C version of the filter is retained (which it isn't by default). This makes the vectorized C code roughly as fast as it was before the rewrite on GCC; on Clang it also becomes 1.3x-2.0x faster, while still being slower than it was initially.	2024-12-19 14:19:19 +02:00
Martin Storsjö	9da303e989	looprestoration: Rewrite the C version of the wiener filter This reduces the stack usage of these functions (the C version) significantly. These C versions aren't used on architectures that already have wiener filters implemented in assembly, but they matter both if running e.g. with assembly disabled (e.g. for sanitizer builds), and matter as example for how to do a cache efficient SIMD implementation. This roughly matches how these functions are implemented in the aarch64 assembly (although that assembly function uses a mainloop function written in assembly, and custom calling conventions between the functions). With this in place, dav1d can run with around 76 KB of stack with assembly disabled. This increases the binary size by around 14 KB (in the case of aarch64 with Xcode Clang 16), unless built with (the default) -Dtrim_dsp=true. (By default, the C version of the wiener filter gets skipped entirely.) On 32 bit arm, the assembly wiener function implementation still uses large buffers on the stack though, but due to other functions using less stack there, dav1d can still run with 72 KB of stack there. Unfortunately, this change also makes the functions slower, depending on how well the compiler was able to optimize the previous version. On GCC (which didn't manage to vectorize the functions so well before), it becomes 1.6x-2.0x slower, while it gets 2.5x-5x slower on Clang (where it was very well vectorized before). Most of this performance can be gained back with later changes on top, though.	2024-12-19 14:19:13 +02:00
Martin Storsjö and Jean-Baptiste Kempf	f8d2620d82	checkasm: looprestoration: Do strict bounds checking of the output This would make it possible to immediately detect unintended writes out of bounds like the ones fixed in `72b5380757` and `1c7433a5eb`. Extend the PIXEL_RECT macro to provide a variable containing the full, padded height of the buffer, for uses that operate on the full buffer. Allow overwriting past the right edge of the target output rectangle, up to an alignment of 64 pixels, but allow no overwrite past the bottom.	2024-11-21 09:05:33 +00:00
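The bounds policy described above (writes tolerated to the right up to a 64 pixel alignment, none below the bottom edge) can be sketched with a canary-filled buffer. This is not the checkasm implementation; the function names are hypothetical, and one byte per pixel (the 8 bpc case) is assumed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: pre-fill the whole padded buffer with a canary. */
static void fill_canary(unsigned char *buf, size_t size, unsigned char canary) {
    memset(buf, canary, size);
}

/* Hypothetical check over a stride x padded_h buffer of 8 bpc pixels.
 * Writes are permitted only in rows [0, h) and in columns
 * [0, w rounded up to a multiple of 64). Anything else must still hold
 * the canary value. */
static bool rect_writes_in_bounds(const unsigned char *buf, size_t stride,
                                  size_t padded_h, size_t w, size_t h,
                                  unsigned char canary) {
    const size_t allowed_w = (w + 63) & ~(size_t)63;
    for (size_t y = 0; y < padded_h; y++) {
        for (size_t x = 0; x < stride; x++) {
            const bool allowed = y < h && x < allowed_w;
            if (!allowed && buf[y * stride + x] != canary)
                return false;
        }
    }
    return true;
}
```

A limitation of any canary scheme like this is that a stray write which happens to store the canary value itself goes unnoticed.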
Martin Storsjö	30c3dd8edd	arm32: looprestoration: Rewrite the SGR functions Switch to the same cache-friendly algorithm as was done for arm64 in `c121b831e2`. This uses much less stack memory, and is much more cache friendly. In this form, most of the individual asm functions only operate on one single row of data at a time. Some of the functions used to be unrolled to operate on two rows at a time, while they now only operate on one at a time. In practice, this is still a large performance win, as data is accessed in a much more cache friendly manner. This gives a 2-37% speedup, and reduces the peak amount of stack used for these functions from 255 KB to 33 KB. Before: Cortex A7 A8 A53 A72 A73 sgr_3x3_8bpc_neon: 873990.7 748341.9 543410.2 383200.4 357502.9 sgr_3x3_10bpc_neon: 909728.0 732594.5 560123.6 392765.5 359377.7 sgr_5x5_8bpc_neon: 591597.9 527353.1 350347.4 263464.9 243098.8 sgr_5x5_10bpc_neon: 637958.2 529462.8 364613.3 280664.6 255164.6 sgr_mix_8bpc_neon: 1458977.4 1185423.2 884017.7 632922.5 587395.2 sgr_mix_10bpc_neon: 1532376.5 1259111.4 918729.3 658787.6 600317.0 After: sgr_3x3_8bpc_neon: 836138.7 635556.5 530596.1 335794.6 348209.9 sgr_3x3_10bpc_neon: 850835.4 596445.0 534583.2 342713.4 349713.5 sgr_5x5_8bpc_neon: 577039.7 443916.5 341684.8 223374.0 232841.3 sgr_5x5_10bpc_neon: 600975.7 400041.3 347529.8 234759.9 239351.7 sgr_mix_8bpc_neon: 1297988.7 925739.1 830360.7 545476.1 548706.6 sgr_mix_10bpc_neon: 1340112.6 914395.7 873342.4 574815.7 554681.6 With this change in place, dav1d can run with around 72 KB of stack on arm targets. Not all functions have been merged in the same way as they were for arm64 in `c121b831e2`, so some minor differences remain; it's possible to incrementally optimize this, e.g. to fuse box3/5_row_v with calc_row_ab1/2, fuse finish_filter_row1/2 with sgr_weighted_row1, and make a version of finish_filter_row1 that produces 2 rows, like is done for arm64. 
It's also possible to rewrite the logic for calculating sgr_x_by_x in the same way as was done for arm64 in `79db162487`.	2024-11-19 11:58:25 +02:00
Martin Storsjö	1b7f126361	arm32: looprestoration: Apply simplifications to align with C code This applies the same simplifications that were done for the C code and the x86 assembly in `4613d3a530`, and the arm64 assembly in `ce80e6daf6`, to the arm32 implementation. This gives a minor speedup of around a couple percent. Before: Cortex A7 A8 A53 A72 A73 sgr_3x3_8bpc_neon: 926600.0 753468.3 553704.1 399379.1 369674.4 sgr_5x5_8bpc_neon: 621722.9 540412.7 357275.9 274474.3 254996.0 sgr_mix_8bpc_neon: 1529715.1 1171282.5 894982.9 659996.6 610407.2 After: sgr_3x3_8bpc_neon: 899020.3 697278.6 541569.9 382824.3 353891.8 sgr_5x5_8bpc_neon: 602183.2 498322.9 348974.5 264833.9 243837.7 sgr_mix_8bpc_neon: 1497870.8 1182121.3 880470.9 635939.3 590909.3	2024-11-18 16:08:00 +02:00
Martin Storsjö	c43debf1b1	arm64: looprestoration: Fix a comment typo	2024-11-18 16:07:40 +02:00
Martin Storsjö	1c7433a5eb	arm: looprestoration: Fix the single line loop in sgr_weighted2 After processing one block, this accidentally jumped to the loop for processing two lines at once. The same bug was replicated in both 32 and 64 bit versions.	2024-11-18 16:07:40 +02:00
Martin Storsjö	f32b314616	looprestoration: Rewrite the C version of the SGR filter This reduces the stack usage of these functions (the C version) significantly, and gives them a 15-40% speedup (on an Apple M3, with Xcode Clang 16). The C version of this function does matter; even though we have assembly implementations of it on x86 and aarch64, those only cover the 8 and 10 bpc cases, while the C version is used as fallback for 12 bpc. This matches how these functions are implemented in the aarch64 assembly; operate over a window of 3 or 5 lines (of 384 pixels each), instead of doing a full 384 x 64 block. The individual functions for filtering a line each end up much simpler, and closer to how this can be implemented in assembly - but the overall business logic ends up much more complex. The main difference to the aarch64 assembly implementation, is that any buffer which is of int16_t size in the aarch64 assembly implementation, uses the type "coef" here, which is 32 bit in the 10/12 bpc cases. (This is required for handling the 12 bpc cases.) With this in place, dav1d can run with around 66 KB of stack on x86_64 with assembly enabled, with around 74 KB of stack on aarch64 with assembly enabled, and with 118 KB of stack with assembly disabled. This increases the binary size by around 14 KB (in the case of aarch64 with Xcode Clang 16). On 32 bit arm, dav1d still requires around 270 KB of stack, as that assembly implementation of the SGR filter uses a different algorithm.	2024-11-18 15:57:19 +02:00
Martin Storsjö	01d417c2fa	arm: looprestoration: Give symbols and defines unique names As the machine specific init file is included in the common template, give symbols and defines unique names that won't clash with similar ones in the main template.	2024-11-18 15:39:28 +02:00
Martin Storsjö	847eece170	arm: looprestoration: Add spacing around operators	2024-11-18 15:39:28 +02:00
Martin Storsjö	56a55933b3	arm: looprestoration: Get rid of unnecessary rotate_ab_N intermediate functions	2024-11-18 15:39:28 +02:00
Martin Storsjö	9db59d8904	arm: looprestoration: Apply 'const' more consistently on parameters	2024-11-18 15:39:28 +02:00
Martin Storsjö	72b5380757	arm64: looprestoration: Fix use of the wrong register When renumbering argument registers in `1648c232ee`, this one register reference was missed. The missed register was meant to compare h with 2, but accidentally ended up comparing bitdepth_max to 2. In the case of 8 bpc, there's actually no bitdepth_max parameter, so it ended up comparing an uninitialized value.	2024-11-15 12:23:11 +02:00
Martin Storsjö and Jean-Baptiste Kempf	bed3a34365	arm: Use /proc/cpuinfo on linux if getauxval is unavailable On really old libc versions, getauxval isn't available. Fall back on /proc/cpuinfo in those cases, just like we do on android too.	2024-11-14 14:44:21 +00:00
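A /proc/cpuinfo fallback of this kind comes down to finding a feature name on the "Features" line. The sketch below is not dav1d's parser; `cpuinfo_has_feature` is a hypothetical function that works on text already read into memory, so it can be tried without touching the file system.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: returns true if `feature` appears as a whole,
 * whitespace-separated word on a line starting with "Features". */
static bool cpuinfo_has_feature(const char *text, const char *feature) {
    const size_t flen = strlen(feature);
    const char *line = text;
    if (flen == 0)
        return false;
    while (line && *line) {
        const char *eol = strchr(line, '\n');
        const size_t len = eol ? (size_t)(eol - line) : strlen(line);
        if (len >= 8 && !strncmp(line, "Features", 8)) {
            const char *end = line + len;
            const char *p = memchr(line, ':', len);
            if (p)
                p++; /* skip the colon itself */
            while (p && p < end) {
                /* Skip separators, then measure one token. */
                while (p < end && (*p == ' ' || *p == '\t')) p++;
                const char *tok = p;
                while (p < end && *p != ' ' && *p != '\t') p++;
                if ((size_t)(p - tok) == flen && !strncmp(tok, feature, flen))
                    return true;
            }
        }
        line = eol ? eol + 1 : NULL;
    }
    return false;
}
```

Matching whole words matters here: a plain substring search for "vfp" would also match "vfpv3".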
Martin Storsjö and Jean-Baptiste Kempf	718b62c8cd	ci: Raise the timeout multipliers for jobs that run in QEMU For individual tests in dav1d-test-data, the default timeout is 30 seconds (which is the Meson default if nothing is specified). Previously it ran with a multiplier of 4, resulting in a total timeout of 120 seconds. When running tests in QEMU, exceeding this 120 second timeout could happen occasionally. Raise the multiplier to 10, allowing each individual job to run for up to 5 minutes. This should hopefully reduce the number of stray failures in the CI. For tests that already have a higher default timeout set, such as checkasm which has got a 180 second default timeout, this results in a much longer timeout period. However as long as we don't frequently see issues where these actually hang, it should be beneficial to just let them run to completion, rather than aborting early due to a tight timeout.	2024-11-14 13:38:18 +00:00
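The timeout figures in the message above follow from multiplying each test's default timeout by the multiplier. A minimal sketch of that arithmetic, with a hypothetical function name:

```c
/* Hypothetical helper: effective per-test timeout in seconds, given the
 * test's default timeout and the timeout multiplier. */
static int effective_timeout(int default_seconds, int multiplier) {
    return default_seconds * multiplier;
}
```

For the default 30 second tests this gives 120 seconds at a multiplier of 4 and 300 seconds (5 minutes) at 10; for checkasm's 180 second default, a multiplier of 10 gives 1800 seconds.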
Martin Storsjö	1648c232ee	arm64: looprestoration: Remove an unnecessary duplicate parameter in dav1d_sgr_weighted2_Xbpc_neon Also fix one case where the 32 bit input parameter w (which was in x6, now in x4) was used without zero extension, by referring to it as w4 instead.	2024-11-14 11:53:50 +02:00
Martin Storsjö	ce80e6daf6	arm64: looprestoration: Apply simplifications to align with C code This applies the same simplifications that were done for the C code and the x86 assembly in `4613d3a530`, to the arm64 implementation. This gives a minor speedup of around a couple percent. Before: Cortex A53 A55 A72 A73 A76 Apple M3 sgr_3x3_8bpc_neon: 368583.2 363654.2 279958.1 272065.1 169353.3 354.6 sgr_5x5_8bpc_neon: 258570.7 255018.5 200410.6 199478.3 117968.3 260.9 sgr_mix_8bpc_neon: 603698.1 577383.3 482468.3 436540.4 256632.9 541.8 After: sgr_3x3_8bpc_neon: 367873.2 357884.1 275462.4 268363.9 165909.8 346.0 sgr_5x5_8bpc_neon: 254988.4 248184.2 190875.1 196939.1 120517.2 252.1 sgr_mix_8bpc_neon: 589204.7 563565.8 414025.6 427702.2 251651.2 533.4	2024-11-13 23:39:04 +02:00
Martin Storsjö	8bd31a92a5	arm: looprestoration: Split an overly long line	2024-11-13 15:38:20 +02:00
Martin Storsjö	55fb9433b7	checkasm: Remove leftover comment This comment no longer is relevant after `9278a14cf4`.	2024-10-18 14:37:28 +00:00
Martin Storsjö	23f2769266	meson: Test support for aarch64 extensions with gas-preprocessor too	2024-10-18 10:55:59 +00:00
Martin Storsjö	b13d1bc2bb	meson: Move checks for gas-preprocessor earlier Locate the assembler tools before checking for support for various assembler features.	2024-10-18 10:55:59 +00:00
Martin Storsjö	166e1df543	tests: Add an option to dav1d_argon.bash for using a wrapper tool This allows executing all the tools within e.g. valgrind. This matches the "meson test --wrap <tool>" feature.	2024-09-06 20:32:45 +00:00
Martin Storsjö	41511bf12e	aarch64: Split the jump tables to a separate const section This should allow executing in environments where the executable memory isn't readable. Use 4 byte entries instead of 2; most object file formats support relocations for a 4 byte symbol difference across sections, which allows keeping the rest of the table lookup code similar to what it was before. Referencing a symbol in an arbitrary location in the executable requires a two instruction sequence (adrp+add, via the movrel macro). Thus, the cost of this rewrite is doubling the size of the jump tables (which were quite small so far), and adding one instruction in each jump table setup prologue. On an ELF build, the .text section shrinks by 1176 bytes, and the .rodata section grows by 3136 bytes, i.e. a 1960 byte increase. While refactoring, prefer doing sign extension during the load (using ldrsw rather than ldr, to avoid using the "sxtw" modifier on the add instruction), as extending ALU arithmetics have a higher latency. MS armasm64 doesn't seem to support calculating symbol differences across sections (see [1]), so keep the jump tables in the text section there, to let the assembler calculate it at assembly time instead. (Keeping the condition as _WIN32 for simplicity, as we don't interact directly with armasm64, but it is wrapped in gas-preprocessor.) [1] https://developercommunity.visualstudio.com/t/armasm64-unable-to-create-cross-section/10722340	2024-08-29 20:43:57 +00:00
Martin Storsjö	0d8abee540	Fix the macro parameter name for the CHECK_SIZE macro	2024-08-29 23:29:30 +03:00
Martin Storsjö	ccb02ddf8d	aarch64: Enable detection of SVE/SVE2 on Windows WinSDK 10.0.26100 added these processor feature constants. Unfortunately, no constant was added for I8MM, but if SVE_I8MM is available, we can at least be sure that regular I8MM is available too.	2024-08-26 14:04:37 +03:00
Martin Storsjö	27491dd953	aarch64: Fix a label typo Apparently, this case isn't actually ever executed, at least in most checkasm runs, but some tools could complain about the relocation against 160b, which pointed elsewhere than intended.	2024-08-24 10:08:00 +03:00
Martin Storsjö	e560d2ba08	aarch64: Avoid looping through the BTI instructions This does the same optimizations as `3329f8d139` and `1790e1329d` on the rest of the code.	2024-08-23 16:15:45 +03:00
Martin Storsjö	5a33c5c628	aarch64: ipred: Use the right fill width loop in ipred_z3_fill_padding_neon This makes the code behave as intended, when filling a rectangle with arbitrary width (filling with the largest power of two width until filled); previously, it accidentally fell back on writing 4 pixel wide stripes immediately. No measurable effect on checkasm benchmarks though.	2024-08-23 12:10:35 +03:00
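The intended fill strategy, writing the largest power-of-two stripe that still fits until the width is covered, can be sketched in C. This is an illustration of the idea for a single row only, not a translation of ipred_z3_fill_padding_neon; the function name and the maximum stripe width of 64 are assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: fill `width` bytes at `dst` with `value`, using
 * stripes of 64, 32, 16, 8, 4, 2 or 1 bytes, always picking the largest
 * stripe that still fits. Returns the number of stripes written. */
static int fill_row_pow2(unsigned char *dst, unsigned char value, size_t width) {
    int stripes = 0;
    size_t stripe = 64; /* assumed largest stripe width */
    while (width > 0) {
        while (stripe > width)
            stripe >>= 1;
        memset(dst, value, stripe);
        dst += stripe;
        width -= stripe;
        stripes++;
    }
    return stripes;
}
```

For a width of 100 this writes three stripes (64, 32 and 4), whereas always falling back to 4 wide stripes would take 25.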
Martin Storsjö	3329f8d139	aarch64: mc16: Optimize the BTI landing pads in put/prep_neon Don't include the BTI landing pad instruction in the loops. If built with BTI enabled, AARCH64_VALID_JUMP_TARGET expands to a no-op instruction that indicates that indirect jumps can land there. But there's no need for the loops to include that instruction.	2024-08-22 16:34:39 +03:00
Martin Storsjö	7fbcdc6d04	aarch64: Explicitly use the ldur instruction where relevant in mc_dotprod.S The ldr instruction only can handle offsets that are a multiple of the element size; most assemblers implicitly produce the ldur instruction when a non-aligned offset is provided. Older versions of MS armasm64, however, error out on this. Since MSVC 2022 17.8, armasm64 implicitly can produce ldur, but 2022 17.7 and earlier require explicitly writing the instruction as ldur. Despite this, even older versions still fail to build the mc_dotprod.S sources, with errors like this: src\libdav1d.a.p\mc_dotprod.obj.asm(556) : error A2513: operand 2: Constant value out of range mov x10, (((0*15-1)<<7)|(3*15-1)) This happens on MSVC 2022 17.1 and older, while 17.2 and newer accept the negative value expression here. In practice, HAVE_DOTPROD doesn't get enabled by the Meson configure script at the moment, as it uses inline assembly to test for external assembler features.	2024-06-25 19:10:59 +00:00
Martin Storsjö	9469e18458	arm64: msac: Explicitly use the ldur instruction The ldr instruction can take an immediate offset which is a multiple of the loaded element size. If the ldr instruction is given an immediate offset which isn't a multiple of the element size, most assemblers implicitly generate a "ldur" instruction instead. Older versions of MS armasm64.exe don't do this, but instead error out with "error A2518: operand 2: Memory offset must be aligned". (Current versions don't do this but correctly generate "ldur" implicitly.) Switch this instruction to an explicit "ldur", like we do elsewhere, to fix building with these older tools.	2024-05-19 22:36:09 +03:00
Martin Storsjö	236e1d1912	tools: Make ARM cpu flags imply relevant lower level flags The --cpumask flag only takes one single flag name, one can't set a combination like neon+dotprod. Therefore, apply the same pattern as for x86, by adding mask values that contain all the implied lower level flags. This is somewhat complicated, as the set of features isn't entirely linear - in particular, SVE doesn't imply either dotprod or i8mm, and SVE2 only implies dotprod, but not i8mm. This makes sure that "dav1d --cpumask dotprod" actually uses any SIMD at all, as it previously only set the dotprod flag but not neon, which essentially opted out from all SIMD.	2024-04-26 15:09:19 +00:00
Martin Storsjö	cb8151c969	aarch64: Avoid unaligned jump tables Manually add a padding 0 entry to make the odd number of .hword entries align with the instruction size. This fixes assembling with GAS, with the --gdwarf2 option, where it previously produced the error message "unaligned opcodes detected in executable segment". The message is slightly misleading, as the error is printed even if there actually are no opcodes that are misaligned, as the jump table is the last thing within the .text section. The issue can be reproduced with an input as small as this, assembled with "as --gdwarf2 -c test.s". .text nop .hword 0 See `a6228f47f0` for earlier cases of the same error - although in those cases, we actually did have more code and labels following the unaligned jump tables. This error is present with binutils 2.39 and earlier; in binutils 2.40, this input no longer is considered an error, fixed in https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=6f6f5b0adc9efd103c434fd316e8c880a259775d.	2024-04-22 09:11:37 +00:00
Martin Storsjö and J. Dekker	5e31720b89	checkasm: Add support for the private macOS kperf API for benchmarking On AArch64, the performance counter registers usually are restricted and not accessible from user space. On macOS, we currently use mach_absolute_time() as timer on aarch64. This measures wallclock time but with a very coarse resolution. There is a private API, kperf, that one can use for getting high precision timers though. Unfortunately, it requires running the checkasm binary as root (e.g. with sudo). Also, as it is a private, undocumented API, it can potentially change at any time. This is handled by adding a new meson build option, for switching to this timer. If the timer source in checkasm could be changed at runtime with an option, this wouldn't need to be a build time option. This allows getting benchmarks like this: mc_8tap_regular_w16_hv_8bpc_c: 1522.1 ( 1.00x) mc_8tap_regular_w16_hv_8bpc_neon: 331.8 ( 4.59x) Instead of this: mc_8tap_regular_w16_hv_8bpc_c: 9.0 ( 1.00x) mc_8tap_regular_w16_hv_8bpc_neon: 1.9 ( 4.76x) Co-authored-by: J. Dekker <jdek@itanimul.li>	2024-04-02 10:35:29 +00:00
Martin Storsjö	024b260cb9	arm32: Fix right shifts in the 16bpc iwht implementation These shifts used the wrong element size; this only was noticed in some argon tests.	2024-03-08 21:49:57 +00:00
Martin Storsjö	fd60097eb2	checkasm: aarch64: Print the SVE vector length, if available	2024-03-04 23:04:51 +02:00
Martin Storsjö	e1f80dec00	aarch64: Check for assembler support for various aarch64 extensions First check if the assembler supports the ".arch" directive, and what architecture levels are supported. In principle, we'd only need to check for support for ".arch armv8.2-a", since that's enough for enabling the i8mm and sve2 extensions. However, recent Clang versions (before version 17) weren't able to enable the dotprod and i8mm extensions via the ".arch_extension" directives, so check for support for armv8.4-a and armv8.6-a as well, which enable dotprod and i8mm implicitly. This allows assembling these instructions on most commonly available GCC and Clang based toolchains, while still allowing toggling support for the instruction sets on and off within the source files. Within assembly, we disable these extensions by default, so that instructions enabled within these extension sets can't be used by accident in unintended functions. Code meaning to use these extensions can be assembled like this: #if HAVE_SVE ENABLE_SVE // code DISABLE_SVE #endif	2024-03-04 20:50:39 +00:00
Martin Storsjö	0d2e83cc16	ci: Add an aarch64 cross compile CI job with a recent Clang	2024-02-28 16:40:14 +00:00
Martin Storsjö	302334a6be	ci: Test aarch64 with QEMU, with varying SVE vector lengths This allows testing all modern aarch64 CPU features that the HW based test runners might not support. Especially for SVE, this allows testing all valid vector lengths, which might not exist in hardware form yet.	2024-02-28 16:40:14 +00:00
Martin Storsjö	39be9fb438	ci: Bump to the latest dav1d-debian-unstable image This one contains aarch64 cross tools, for use with QEMU.	2024-02-28 16:40:14 +00:00
Martin Storsjö	5149b27447	checkasm: Map SIGBUS to the right error text This was missed in `2ef970a885`. Also print this text for EXCEPTION_IN_PAGE_ERROR on Windows.	2023-12-15 14:10:01 +02:00
Martin Storsjö	2179b30c84	checkasm: Fix catching crashes on Windows on ARM longjmp on Windows uses SEH to unwind on ARM/ARM64 too, just like on x86_64, thus use RtlCaptureContext/RtlRestoreContext instead of setjmp/longjmp on those architectures as well.	2023-11-01 19:28:07 +02:00
Martin Storsjö	a7e12b6284	windows: Clarify unicode characters in RC files Windows RC files can have strings expressed either as narrow chars expressed in a specific codepage, or as wide unicode strings. Regardless of which way they are expressed, they are converted into unicode strings in the compiled resource files. When using narrow strings, even if using escaped chars like \251, those chars are interpreted according to a specific codepage. The codepage can be specified with arguments to the RC/windres tool (or with a pragma, but not all tools support the pragmas), but when no codepage is specified, the exact interpretation varies. llvm-rc uses a hard stance of defaulting to only accepting ANSI chars unless something else has been specified (and pragmas aren't supported). llvm-windres defaults to CP 1252 though, for compatibility with what most people probably intend. However, GNU windres and MS rc.exe actually default to what the system's current default codepage is. That means that if the resource file is built on a machine with e.g. Japanese as the default locale, the file gets built differently, with a different Unicode character than what was intended. By converting the strings to wide strings, it is unambiguous that \251 refers to the Unicode code point u00A9 (octal 0251), i.e. copyright sign. This fixes building the RC files with llvm-rc. With GNU windres, llvm-windres and rc.exe, the files still generate the bitwise exact same output as before.	2023-07-08 00:24:57 +03:00
Martin Storsjö and Henrik Gramner	bc76a22015	arm: ipred: Update pal_pred to work with packed indices	2023-07-06 23:12:02 +02:00
Martin Storsjö	616bfd1506	arm32: refmvs: Fix building with MS armasm Add an explicit align before the jump table; this avoids armasm bugs in how label differences are calculated. This matches how all other jump tables are written in our 32 bit arm assembly.	2023-07-01 11:36:39 +03:00
Martin Storsjö	b33d77f903	arm32: refmvs: Add NEON implementation of save_tmvs Relative speedup compared to C: Cortex A7 A8 A9 A53 A72 A73 save_tmvs_neon: 1.20 1.42 1.25 1.58 1.26 1.99	2023-06-30 11:44:17 +03:00
Martin Storsjö	a1d7763f7b	arm64: refmvs: Use addp instead of trn2+add Also improve scheduling in the prologue and fix a few cases of inconsistent indentation. Before: Cortex A53 A55 A72 A73 A76 Apple M1 save_tmvs_neon: 73657.2 74470.9 72238.1 56095.4 34135.7 207.9 After: save_tmvs_neon: 72187.2 74434.6 71068.9 56043.9 33237.4 201.0 (The changes to the M1 numbers are mostly measurement noise though.)	2023-06-30 11:42:33 +03:00
Martin Storsjö	189d47c2fa	arm64: refmvs: Fix building with MSVC Binutils and LLVM assemblers can infer that this str instruction must be stur (and implicitly assemble it into that instruction), while MS armasm64 errored out with this message: src\libdav1d.a.p\refmvs.obj.asm(673) : error A2518: operand 2: Memory offset must be aligned str q2, [x3, #(8*5-16)]	2023-06-28 15:37:09 +03:00
Martin Storsjö	c39779f474	arm64: refmvs: Process two blocks at a time in save_tmvs Before: Cortex A53 A55 A72 A73 A76 Apple M1 save_tmvs_neon: 79184.7 79889.9 54720.2 54522.6 29919.6 216.4 After: save_tmvs_neon: 73780.0 74339.2 70414.1 59102.0 35028.4 213.9 The benefit from this is marginal on Cortex A53 and A55, and Apple M1, while this change actually makes the code notably slower on Cortex A72, A73 and A76.	2023-06-27 00:10:21 +03:00
Martin Storsjö	6aa37aec8f	arm64: refmvs: Add NEON implementation of save_tmvs Cortex A53 A55 A72 A73 A76 Apple M1 save_tmvs_c: 116768.4 122653.1 82587.7 90445.0 45386.8 242.1 save_tmvs_neon: 79184.7 79889.9 54720.2 54522.6 29919.6 216.4 Relative speedup compared with C: Cortex A53 A55 A72 A73 A76 Apple M1 save_tmvs_neon: 1.47 1.54 1.51 1.66 1.52 1.12	2023-06-27 00:10:21 +03:00
Martin Storsjö	c121b831e2	arm64: looprestoration: Rewrite the SGR functions Make them operate in a more cache friendly manner, interleaving the various passes, and merging some of the functions that operate on data in similar patterns. This reduces the amount of stack used from 207 KB to 14 KB for sgr_3x3, from 207 KB to 16 KB for sgr_5x5 and from 255 KB to 33 KB for sgr_mix. This does however increase the size of the binary by about 12 KB. (The executable code generated from assembly actually shrinks by a little, but the higher level logic in C is quite nontrivial.) This is somewhat similar to what was done for x86 in `fe2bb77424`. Benchmarks from checkasm: Before: Cortex A53 A55 A72 A73 A76 Apple M1 sgr_3x3_8bpc_neon: 493005.0 483133.2 365056.3 345197.9 202819.1 537.3 sgr_5x5_8bpc_neon: 353152.6 349614.3 268962.2 248431.8 142302.4 385.9 sgr_mix_8bpc_neon: 829903.9 815910.9 622858.5 577238.0 333362.9 881.7 sgr_3x3_10bpc_neon: 504778.6 499851.6 379203.1 346695.2 199738.7 537.0 sgr_5x5_10bpc_neon: 363111.9 362489.7 267903.1 247506.5 138417.2 351.3 sgr_mix_10bpc_neon: 853053.7 846768.8 628349.6 584553.8 328399.5 843.6 After: sgr_3x3_8bpc_neon: 387949.9 384216.4 294423.7 301968.2 184643.1 492.4 sgr_5x5_8bpc_neon: 259854.7 257233.2 193983.7 198388.4 128497.0 341.2 sgr_mix_8bpc_neon: 606401.5 595661.3 457209.7 462721.8 281906.7 738.6 sgr_3x3_10bpc_neon: 392472.7 394100.5 296048.1 304339.4 184271.4 471.3 sgr_5x5_10bpc_neon: 257248.3 257651.1 197552.5 199655.1 130739.7 322.9 sgr_mix_10bpc_neon: 605263.3 611197.4 441789.3 461339.2 286320.1 721.4 Speedup vs before: 27-41% 25-40% 23-42% 13-26% 5-18% 8-19%	2023-06-22 13:57:17 +03:00
Martin Storsjö	3c2f2087d8	arm64: looprestoration: Properly use 32 bit registers for 32 bit parameters This issue isn't caught by checkasm, since these functions are internal to the SGR implementation, and checkasm only affects the parameters on the external DSP function interface. This could potentially trigger errors with future compilers.	2023-06-22 11:03:35 +03:00
Martin Storsjö	77d0cbaf0e	Avoid an MSVC warning about conversion to smaller data types After `8f320d5958`, MSVC started producing this warning: [63/123] Compiling C object src/libdav1d.a.p/obu.c.obj ../src/obu.c(708): warning C4244: '=': conversion from 'uint16_t' to 'uint8_t', possible loss of data	2023-06-07 11:04:37 +00:00
Martin Storsjö	ca39c862ac	arm64: ipred: 16 bpc NEON implementation of the Z2 function Relative speedup over unvectorized C code: Cortex A53 A55 A72 A73 A76 Apple M1 intra_pred_z2_w4_16bpc_neon: 2.98 2.98 2.38 2.77 3.19 7.75 intra_pred_z2_w8_16bpc_neon: 3.91 4.22 2.64 3.29 3.73 4.78 intra_pred_z2_w16_16bpc_neon: 4.43 5.12 2.89 3.90 3.50 4.26 intra_pred_z2_w32_16bpc_neon: 5.08 6.36 3.44 4.40 4.05 4.96 intra_pred_z2_w64_16bpc_neon: 4.68 5.97 3.29 4.40 3.68 5.23	2023-05-25 16:51:35 +03:00
Martin Storsjö	1dd0cd3a39	arm64: ipred: Remove unnecessary instructions from z2_fill	2023-05-25 16:51:35 +03:00
Martin Storsjö and Jean-Baptiste Kempf	8af8244a3a	arm64: ipred: 8 bpc NEON implementation of the Z2 function Relative speedup over C code: Cortex A53 A55 A72 A73 A76 Apple M1 intra_pred_z2_w4_8bpc_neon: 3.91 3.55 3.31 3.94 3.46 8.50 intra_pred_z2_w8_8bpc_neon: 5.68 5.67 4.31 5.31 4.34 5.83 intra_pred_z2_w16_8bpc_neon: 8.39 9.28 5.53 7.04 7.01 9.45 intra_pred_z2_w32_8bpc_neon: 7.01 8.01 5.04 6.32 5.48 7.48 intra_pred_z2_w64_8bpc_neon: 8.73 10.25 5.92 7.61 6.63 10.05	2023-05-05 15:40:57 +00:00
Martin Storsjö	e75caab99e	arm64: ipred: 16 bpc NEON implementation of the Z3 function Relative speedup over the C code: Cortex A53 A55 A72 A73 A76 Apple M1 intra_pred_z3_w4_16bpc_neon: 3.06 2.87 2.17 1.97 2.33 7.75 intra_pred_z3_w8_16bpc_neon: 3.90 3.94 2.97 3.16 2.93 4.43 intra_pred_z3_w16_16bpc_neon: 4.08 4.48 3.31 4.68 3.13 5.00 intra_pred_z3_w32_16bpc_neon: 4.43 4.85 3.50 4.02 3.33 5.62 intra_pred_z3_w64_16bpc_neon: 4.68 5.30 3.72 3.96 3.52 5.78	2023-03-21 08:57:44 +02:00
Martin Storsjö	2eb9239100	arm64: ipred: 16 bpc NEON implementation of the Z1 function Relative speedup over the C code: Cortex A53 A55 A72 A73 A76 Apple M1 intra_pred_z1_w4_16bpc_neon: 3.49 2.63 2.83 3.85 3.14 9.00 intra_pred_z1_w8_16bpc_neon: 6.19 4.39 3.65 6.58 4.99 6.50 intra_pred_z1_w16_16bpc_neon: 6.65 4.64 3.97 7.78 4.87 7.00 intra_pred_z1_w32_16bpc_neon: 7.76 5.49 5.17 7.83 5.59 8.24 intra_pred_z1_w64_16bpc_neon: 8.02 5.80 5.33 8.41 5.77 8.70	2023-03-21 08:57:43 +02:00
Martin Storsjö	ec38062a12	arm: ipred: Make a SIMD pixel_set function for padding For 8 bpc, there's probably not much difference to a decent memset, but for 16 bpc, there might be a bigger difference.	2023-03-21 08:57:43 +02:00
Martin Storsjö	6f5bf165e4	arm64: ipred: Use fewer registers for table lookups in w=8 in z3_fill1 for 8bpc Add comments explaining the exact dimensions of the gather tables used currently. That reasoning shows that the w=8 case can do with one register less. Before: Cortex A53 A55 A72 A73 A76 Apple M1 intra_pred_z3_w8_8bpc_neon: 356.2 376.2 218.9 246.4 176.1 0.6 After: intra_pred_z3_w8_8bpc_neon: 339.6 357.3 205.6 232.3 160.0 0.5	2023-03-21 08:57:43 +02:00
Martin Storsjö	7be5347c97	arm64: ipred: Improve accumulation ordering in 8bpc z1 Start out the multiplication/accumulation with a register that is available sooner. Before: Cortex A53 A55 A72 A73 A76 Apple M1 intra_pred_z1_w8_8bpc_neon: 266.3 268.9 146.6 155.3 103.9 0.4 intra_pred_z1_w16_8bpc_neon: 528.6 574.4 333.9 364.3 209.1 0.7 intra_pred_z1_w32_8bpc_neon: 1149.3 1245.4 752.3 811.5 503.4 1.3 intra_pred_z1_w64_8bpc_neon: 2198.4 2360.6 1462.9 1575.0 1007.6 2.4 After: intra_pred_z1_w8_8bpc_neon: 266.3 269.1 146.6 155.0 100.1 0.4 intra_pred_z1_w16_8bpc_neon: 528.6 573.3 347.9 352.4 204.3 0.7 intra_pred_z1_w32_8bpc_neon: 1149.2 1245.3 763.4 759.6 474.8 1.3 intra_pred_z1_w64_8bpc_neon: 2198.8 2360.6 1430.0 1417.4 943.5 2.3	2023-03-21 08:57:43 +02:00
Martin Storsjö	92d93f4b35	arm64: ipred: Optimize the 3tap filter padding in z1_filter_edge The second register will at most contain one valid pixel, the padding pixel. Thus skip padding the register and just fill it with the padding pixel.	2023-03-21 08:57:43 +02:00
Martin Storsjö	8ee450cbd0	arm64: ipred: Remove leftover instructions at the start of z3_fill2 There were redundant leftovers from copypasting bits when writing this function.	2023-03-21 08:57:43 +02:00
Martin Storsjö	ab6977bc04	arm64: ipred: Rename a misnamed local label in the assembly This is for cases with h >= 16.	2023-03-21 08:57:42 +02:00
Martin Storsjö	da9602a32b	arm64: ipred: Fix a misindented operand in the assembly	2023-03-21 08:57:42 +02:00
Martin Storsjö	50a89b6383	arm: ipred: Fix a misindented line in the C wrapper	2023-03-21 08:57:42 +02:00
Martin Storsjö and Matthias Dressel	5c9d651edc	Add a -j option to dav1d_argon.bash	2023-03-01 19:59:10 +01:00
Martin Storsjö	ef0fb0b6fc	Fix building with MSVC after recent commit `98b0c96d21` added an include of src/ref.h in src/fg_apply_tmpl.c. That template source file is included in tests/checkasm/filmgrain.c. src/ref.h includes <stdatomic.h>. Including this file requires declaring a dependency on stdatomic_dependencies in meson, which provides the fallback implementation of stdatomic.h when building with MSVC.	2023-02-27 01:04:25 +02:00
Martin Storsjö	77b3955537	checkasm: Add an --affinity= option for selecting a CPU core Add an option for selecting the core where the single thread of checkasm runs. This allows benchmarking on specific CPU cores on heterogeneous CPUs, like ARM big.LITTLE configurations. On Linux, one can easily wrap an invocation of checkasm with "taskset -c <n> [...]" - so this option isn't very essential there - however it is quite useful on Windows. On Windows, it is somewhat possible to do the same by launching the tool with "start /B /affinity <hexmask> [...]", but that doesn't work well with scripting ("start" returns before the command has finished running, and it's not obvious how to invoke "start" from within WSL). Using "taskset" to launch processes on specific cores within WSL on Windows doesn't work - regardless of the Linux level affinity, the process ends up running on the performance cores anyway.	2023-01-31 15:33:58 +02:00
Martin Storsjö	99956c737a	arm64: ipred: 8 bpc NEON implementation of the Z3 function The implementation is a hybrid between two approaches; one generic (but non-ideal) for cases with large max_base_y, which fills two pixel columns at a time, i.e. looping over pixels first vertically, then horizontally - i.e. in a non-optimal manner. For cases with smaller max_base_y, it does two rows at a time, essentially doing gathers with the TBX instruction. Relative speedup over the C code: Cortex A53 A55 A72 A73 A76 Apple M1 intra_pred_z3_w4_8bpc_neon: 3.32 2.89 2.78 3.52 2.52 9.67 intra_pred_z3_w8_8bpc_neon: 6.24 5.55 4.76 5.60 4.11 6.40 intra_pred_z3_w16_8bpc_neon: 7.64 7.07 4.37 6.23 4.18 8.60 intra_pred_z3_w32_8bpc_neon: 7.51 7.21 4.34 5.92 4.27 7.88 intra_pred_z3_w64_8bpc_neon: 6.82 6.25 4.08 5.83 3.52 7.31	2023-01-31 10:16:16 +02:00
Martin Storsjö	fd4f348e70	arm64: ipred: 8 bpc NEON implementation of the Z1 function Relative speedup over the C code: Cortex A53 A55 A72 A73 A76 Apple M1 intra_pred_z1_w4_8bpc_neon: 4.09 3.15 3.63 4.16 3.27 13.00 intra_pred_z1_w8_8bpc_neon: 6.93 5.66 5.57 6.76 5.51 5.50 intra_pred_z1_w16_8bpc_neon: 7.81 6.85 6.24 7.78 6.59 9.00 intra_pred_z1_w32_8bpc_neon: 10.56 9.95 8.72 10.95 8.28 13.33 intra_pred_z1_w64_8bpc_neon: 11.00 11.38 9.11 11.62 8.65 14.61 (The speedup numbers for M1 are kinda noisy due to the very coarse granularity of the timer used there.)	2023-01-27 23:54:44 +02:00
Martin Storsjö	2e990b370e	checkasm: ipred: Iterate 5 times for each Z1/Z2/Z3 function These functions contain a number of different codepaths; try to make sure that we hit most codepaths for each size combination. This gives better test coverage in one single run of checkasm, and should also give a better averaged runtime in benchmarks.	2023-01-27 23:54:20 +02:00
Martin Storsjö	8a4932ff03	Implement atomic_compare_exchange_strong in the atomic compat headers This fixes building with MSVC (and older GCC versions) after `3e7886db54`.	2022-10-26 16:14:52 +03:00
Martin Storsjö	345127a795	arm: itx: Add clipping to row_clip_min/max in the 10 bpc codepaths This fixes conformance with the argon test samples, in particular with these samples: profile0_core/streams/test10100_579_8614.obu profile0_core/streams/test10218_6914.obu This gives a pretty notable slowdown to these transforms - some examples: Before: Cortex A53 A72 A73 Apple M1 inv_txfm_add_8x8_dct_dct_1_10bpc_neon: 365.7 290.2 299.8 0.3 inv_txfm_add_16x16_dct_dct_2_10bpc_neon: 1865.2 1384.1 1457.5 2.6 inv_txfm_add_64x64_dct_dct_4_10bpc_neon: 33976.3 26817.0 24864.2 40.4 After: inv_txfm_add_8x8_dct_dct_1_10bpc_neon: 397.7 322.2 335.1 0.4 inv_txfm_add_16x16_dct_dct_2_10bpc_neon: 2121.9 1336.7 1664.6 2.6 inv_txfm_add_64x64_dct_dct_4_10bpc_neon: 38569.4 27622.6 28176.0 51.0 Thus, for the transforms alone, it makes them around 10-13% slower (the Apple M1 measurements are too noisy to be conclusive here). Measured on actual full decoding, it makes decoding of 10 bpc Chimera around maybe 1% slower on an Apple M1 - close to measurement noise anyway.	2022-09-19 20:40:34 +00:00
Martin Storsjö	cc9651f516	Don't use gas-preprocessor with clang-cl for arm targets Since meson 0.58.0 (released in May 2021), meson accepts adding '.S' assembly files as source files to the clang-cl compiler. If using an older version of meson, keep using gas-preprocessor just like for MSVC builds.	2022-09-15 11:25:37 +03:00
Martin Storsjö	08c708015e	tools: Allocate the priv structs with proper alignment Previously, they could be allocated with any random alignment matching the end of the MuxerContext/DemuxerContext. The priv structs themselves can have members that require specific alignment, or at least the default alignment of malloc()/calloc() (which is sufficient for native types such as uint64_t and doubles). This fixes crashes in some arm builds, where GCC (correctly) wants to use 64 bit aligned stores to write to MD5Context.	2022-09-14 15:59:19 +03:00