dav1d

mirror of https://code.videolan.org/videolan/dav1d synced 2026-06-11 04:03:05 +00:00

Author	SHA1	Message	Date
Pranav Kant and Henrik Gramner	7d4b789f55	Mark C globals with small code model We have a new option in clang (https://github.com/llvm/llvm-project/pull/124834) to mark globals to be allocated in non-large sections. We can mark all globals that are referenced from hardcoded assembly (which implicitly references globals assuming they are in non-large sections) with this attribute to avoid running into problems when dav1d is built with -mcmodel=medium with clang.	2025-02-21 15:55:00 +00:00
Jean-Baptiste Kempf	42b2b24fb8	Update NEWS for 1.5.1 1.5.1	2025-01-19 22:33:54 +01:00
Wan-Teh Chang and Ronald S. Bultje	40ff2a1251	Include <string.h> for memcpy()	2025-01-10 01:54:41 +00:00
Arpad Panyik	edb16889d1	AArch64: Add Neon implementation of load_tmvs This patch adds a vectorised variant of the mv_projection calculation and a faster initialisation of motion vectors for load_tmvs_neon. Checkasm uplifts after this patch on some Neoverse and Cortex CPU cores compared to the C reference compiled with GCC-13 and Clang-19: GCC Clang AWS Graviton 4: 1.62x 1.59x Cortex-X4: 1.45x 1.46x Cortex-X3: 1.68x 1.69x Cortex-X1: 1.55x 1.52x Cortex-A720: 1.54x 1.57x Cortex-A715: 1.47x 1.55x Cortex-A78: 1.21x 1.18x Cortex-A76: 1.38x 1.37x Cortex-A72: 1.08x 1.11x Cortex-A520: 0.97x 1.18x Cortex-A510: 0.99x 1.14x Cortex-A55: 1.16x 1.23x This patch increases the .text by ~660 bytes, but it is smaller than the reference implementation by about 0.5 KiB.	2025-01-09 14:59:31 +01:00
Martin Storsjö	b129d9f2cb	mc: Reduce stack use in {put,prep}_scaled_{bilin,8tap} For the bilin cases, this seems to make things marginally faster (measured on x86_64; 7-25% faster with compiler autovectorization). For 8tap, it doesn't make much of a difference at all. Before: GCC Clang mc_scaled_8tap_regular_w128_8bpc_c: 115155.5 98549.3 mc_scaled_8tap_regular_w128_8bpc_ssse3: 17936.0 18411.1 mc_scaled_bilinear_w128_8bpc_c: 40290.0 51812.9 mc_scaled_bilinear_w128_8bpc_ssse3: 18243.9 18177.0 After: mc_scaled_8tap_regular_w128_8bpc_c: 116304.3 99453.2 mc_scaled_8tap_regular_w128_8bpc_ssse3: 18387.0 18077.3 mc_scaled_bilinear_w128_8bpc_c: 37381.4 41145.0 mc_scaled_bilinear_w128_8bpc_ssse3: 18423.8 18031.6 (Benchmarked with the seed 0; the total runtime for the scaled benchmarks are significantly affected by the random seed.) This reduces the stack usage of these functions from around 65 KB each, to less than 1 KB for bilin, and around 2 KB for 8tap. With this in place, the required stack space for dav1d should be mostly identical across configurations; on x86_64 (both with and without assembly), it can run with 62 KB of stack, and on arm and aarch64, it can run with 58 KB of stack.	2025-01-02 15:30:21 +00:00
Brad Smith and Jean-Baptiste Kempf	cd5bfa124a	riscv: Fix building on non-Linux OS's CLOCK_MONOTONIC_RAW is not POSIX/portable.	2024-12-29 18:32:23 +00:00
James Almer	5ea4939a1d	obu: don't print warnings for Metadata OBUs of types "Unregistered user private"	2024-12-27 13:48:54 -03:00
Martin Storsjö	2ba57aa535	arm32: looprestoration: Rewrite the wiener functions Switch to the same cache-friendly algorithm as was done for arm64 in `2e73051c57` and for the reference C code in `8291a66e50`. Contrary to the arm64 implementation, this uses a main loop in C (very similar to the one in the main C implementation in `8291a66e50`) rather than assembly; this gives a bit more overhead on the call to each function, but it shouldn't affect the big picture much. Performance-wise, this doesn't make much of a difference - it makes things a little bit faster on some cores, and a little bit slower on others: Before: Cortex A7 A8 A53 A72 A73 wiener_7tap_8bpc_neon: 269384.4 147730.7 140028.5 92662.5 92929.0 wiener_7tap_10bpc_neon: 352690.2 159970.2 169427.8 116614.9 119371.1 After: wiener_7tap_8bpc_neon: 238328.0 157274.1 134588.6 92200.3 97619.6 wiener_7tap_10bpc_neon: 336369.3 162182.0 161954.4 125521.2 130634.0 This is mostly in line with the results on arm64 in `2e73051c57`. On arm64, there was a bit larger speedup for the 7tap case, mostly attributed to unrolling the vertical filter (and the new filter_hv function) to operate on 16 pixels at a time. On arm32, there's not enough registers to do that, so we can't get such gains from unrolling. (Reducing the unrolling on the arm64 version to match the case on arm32 also shows similar performance numbers as on arm32 here.) In the arm64 version, we also added separate 5tap versions of all functions; not doing that for arm32 at this point. This increases the binary size by 2 KB. This doesn't have any immediate effect on how much stack space dav1d requires in total, since the largest stack users on arm currently are the 8tap_scaled functions.	2024-12-20 14:32:32 +02:00
Martin Storsjö	8291a66e50	looprestoration: Use only 6 row buffer for wiener, like NEON/x86 This uses a separate function for combined horizontal and vertical filtering, without needing to write the intermediate results back to memory in between. This mostly serves as an example for how to adjust the logic for that case; unless we actually merge the horizontal and vertical filtering within the _hv function, we still need space for a 7th row on the stack within that function (which means we use just as much stack as before), but we also need one extra memcpy to write it into the right destination. In a build where the compiler is allowed to vectorize and inline the wiener functions into each other, this change actually reduces the final binary size by 4 KB, if the C version of the wiener filter is retained. This change makes the vectorized C code as fast as it was before with Clang 18; on Xcode Clang 16, it's 2x slower than it was before. Unfortunately, with GCC, this change makes the code a bit slower again.	2024-12-19 14:19:19 +02:00
Martin Storsjö	a149f5c3c0	looprestoration: Make the C wiener h filter more optimizable for the compiler This increases the binary size by 9 KB, on aarch64 with Xcode Clang 16, if the C version of the filter is retained (which it isn't by default). This makes the vectorized C code roughly as fast as it was before the rewrite on GCC; on Clang it also becomes 1.3x-2.0x faster, while still being slower than it was initially.	2024-12-19 14:19:19 +02:00
Martin Storsjö	9da303e989	looprestoration: Rewrite the C version of the wiener filter This reduces the stack usage of these functions (the C version) significantly. These C versions aren't used on architectures that already have wiener filters implemented in assembly, but they matter both if running e.g. with assembly disabled (e.g. for sanitizer builds), and matter as example for how to do a cache efficient SIMD implementation. This roughly matches how these functions are implemented in the aarch64 assembly (although that assembly function uses a mainloop function written in assembly, and custom calling conventions between the functions). With this in place, dav1d can run with around 76 KB of stack with assembly disabled. This increases the binary size by around 14 KB (in the case of aarch64 with Xcode Clang 16), unless built with (the default) -Dtrim_dsp=true. (By default, the C version of the wiener filter gets skipped entirely.) On 32 bit arm, the assembly wiener function implementation still uses large buffers on the stack though, but due to other functions using less stack there, dav1d can still run with 72 KB of stack there. Unfortunately, this change also makes the functions slower, depending on how well the compiler was able to optimize the previous version. On GCC (which didn't manage to vectorize the functions so well before), it becomes 1.6x-2.0x slower, while it gets 2.5x-5x slower on Clang (where it was very well vectorized before). Most of this performance can be gained back with later changes on top, though.	2024-12-19 14:19:13 +02:00
Luc Trudeau	d242c47b43	Replace Av1Block with pal_sz in read_pal_indices	2024-12-02 09:32:33 -05:00
Henrik Gramner	9a75cebc36	Explicitly use uint8_t for the order_palette() scratch buffer It previously used 'pixel' which is typedefed to uint8_t in files that aren't bitdepth-templated, but those are indices and not pixels so that was just confusing and misleading.	2024-12-02 13:47:04 +01:00
victorien	575af25859	flush: Reset f->task_thread.error f->task_thread.error can be set during flushing, not resetting this can lead to c->task_thread.first being increased after having already submitted a frame post flushing. That's fine if it happens on the very first frame, but if that's the case on any subsequent frame it will incur a wrong frame ordering. Now that a non-first frame will be considered as such, its tasks won't be able to execute (since they depend on a truly previous frame considered as being after) and c->task_thread.cur will be increased past that frame, with no way of it being reset, eventually leading to a hang.	2024-11-28 17:56:13 +01:00
Wan-Teh Chang and Ronald S. Bultje	767efeca06	Fix ClangTidy misc-include-cleaner warnings	2024-11-26 14:26:25 +00:00
Martin Storsjö and Jean-Baptiste Kempf	f8d2620d82	checkasm: looprestoration: Do strict bounds checking of the output This would allow immediately detecting unintended writes out of bounds like the ones fixed in `72b5380757` and `1c7433a5eb`. Extend the PIXEL_RECT macro to provide a variable containing the full, padded height of the buffer, for uses that operate on the full buffer. Allow overwriting past the right edge of the target output rectangle, up to an alignment of 64 pixels, but allow no overwrite past the bottom.	2024-11-21 09:05:33 +00:00
Brad Smith and Jean-Baptiste Kempf	f15666b703	riscv: Enable FreeBSD / OpenBSD elf_aux_info() support	2024-11-21 08:41:38 +00:00
Martin Storsjö	30c3dd8edd	arm32: looprestoration: Rewrite the SGR functions Switch to the same cache-friendly algorithm as was done for arm64 in `c121b831e2`. This uses much less stack memory, and is much more cache friendly. In this form, most of the individual asm functions only operate on one single row of data at a time. Some of the functions used to be unrolled to operate on two rows at a time, while they now only operate on one at a time. In practice, this is still a large performance win, as data is accessed in a much more cache friendly manner. This gives a 2-37% speedup, and reduces the peak amount of stack used for these functions from 255 KB to 33 KB. Before: Cortex A7 A8 A53 A72 A73 sgr_3x3_8bpc_neon: 873990.7 748341.9 543410.2 383200.4 357502.9 sgr_3x3_10bpc_neon: 909728.0 732594.5 560123.6 392765.5 359377.7 sgr_5x5_8bpc_neon: 591597.9 527353.1 350347.4 263464.9 243098.8 sgr_5x5_10bpc_neon: 637958.2 529462.8 364613.3 280664.6 255164.6 sgr_mix_8bpc_neon: 1458977.4 1185423.2 884017.7 632922.5 587395.2 sgr_mix_10bpc_neon: 1532376.5 1259111.4 918729.3 658787.6 600317.0 After: sgr_3x3_8bpc_neon: 836138.7 635556.5 530596.1 335794.6 348209.9 sgr_3x3_10bpc_neon: 850835.4 596445.0 534583.2 342713.4 349713.5 sgr_5x5_8bpc_neon: 577039.7 443916.5 341684.8 223374.0 232841.3 sgr_5x5_10bpc_neon: 600975.7 400041.3 347529.8 234759.9 239351.7 sgr_mix_8bpc_neon: 1297988.7 925739.1 830360.7 545476.1 548706.6 sgr_mix_10bpc_neon: 1340112.6 914395.7 873342.4 574815.7 554681.6 With this change in place, dav1d can run with around 72 KB of stack on arm targets. Not all functions have been merged in the same way as they were for arm64 in `c121b831e2`, so some minor differences remain; it's possible to incrementally optimize this, e.g. to fuse box3/5_row_v with calc_row_ab1/2, fuse finish_filter_row1/2 with sgr_weighted_row1, and make a version of finish_filter_row1 that produces 2 rows, like is done for arm64. 
It's also possible to rewrite the logic for calculating sgr_x_by_x in the same way as was done for arm64 in `79db162487`.	2024-11-19 11:58:25 +02:00
Martin Storsjö	1b7f126361	arm32: looprestoration: Apply simplifications to align with C code This applies the same simplifications that were done for the C code and the x86 assembly in `4613d3a530`, and the arm64 assembly in `ce80e6daf6`, to the arm32 implementation. This gives a minor speedup of around a couple percent. Before: Cortex A7 A8 A53 A72 A73 sgr_3x3_8bpc_neon: 926600.0 753468.3 553704.1 399379.1 369674.4 sgr_5x5_8bpc_neon: 621722.9 540412.7 357275.9 274474.3 254996.0 sgr_mix_8bpc_neon: 1529715.1 1171282.5 894982.9 659996.6 610407.2 After: sgr_3x3_8bpc_neon: 899020.3 697278.6 541569.9 382824.3 353891.8 sgr_5x5_8bpc_neon: 602183.2 498322.9 348974.5 264833.9 243837.7 sgr_mix_8bpc_neon: 1497870.8 1182121.3 880470.9 635939.3 590909.3	2024-11-18 16:08:00 +02:00
Martin Storsjö	c43debf1b1	arm64: looprestoration: Fix a comment typo	2024-11-18 16:07:40 +02:00
Martin Storsjö	1c7433a5eb	arm: looprestoration: Fix the single line loop in sgr_weighted2 After processing one block, this accidentally jumped to the loop for processing two lines at once. The same bug was replicated in both 32 and 64 bit versions.	2024-11-18 16:07:40 +02:00
Martin Storsjö	f32b314616	looprestoration: Rewrite the C version of the SGR filter This reduces the stack usage of these functions (the C version) significantly, and gives them a 15-40% speedup (on an Apple M3, with Xcode Clang 16). The C version of this function does matter; even though we have assembly implementations of it on x86 and aarch64, those only cover the 8 and 10 bpc cases, while the C version is used as fallback for 12 bpc. This matches how these functions are implemented in the aarch64 assembly; operate over a window of 3 or 5 lines (of 384 pixels each), instead of doing a full 384 x 64 block. The individual functions for filtering a line each end up much simpler, and closer to how this can be implemented in assembly - but the overall business logic ends up much much more complex. The main difference to the aarch64 assembly implementation, is that any buffer which is of int16_t size in the aarch64 assembly implementation, uses the type "coef" here, which is 32 bit in the 10/12 bpc cases. (This is required for handling the 12 bpc cases.) With this in place, dav1d can run with around 66 KB of stack on x86_64 with assembly enabled, with around 74 KB of stack on aarch64 with assembly enabled, and with 118 KB of stack with assembly disabled. This increases the binary size by around 14 KB (in the case of aarch64 with Xcode Clang 16). On 32 bit arm, dav1d still requires around 270 KB of stack, as that assembly implementation of the SGR filter uses a different algorithm.	2024-11-18 15:57:19 +02:00
Martin Storsjö	01d417c2fa	arm: looprestoration: Give symbols and defines unique names As the machine specific init file is included in the common template, give symbols and defines unique names that won't clash with similar ones in the main template.	2024-11-18 15:39:28 +02:00
Martin Storsjö	847eece170	arm: looprestoration: Add spacing around operators	2024-11-18 15:39:28 +02:00
Martin Storsjö	56a55933b3	arm: looprestoration: Get rid of unnecessary rotate_ab_N intermediate functions	2024-11-18 15:39:28 +02:00
Martin Storsjö	9db59d8904	arm: looprestoration: Apply 'const' more consistently on parameters	2024-11-18 15:39:28 +02:00
Marvin Scholz	c8fdaa8611	checkasm: add loongarch GAS file to checkasm_asm_sources This is not an object so putting it in the objects variable seems wrong and would also break using gaspp for that file.	2024-11-16 14:51:35 +01:00
Maryla Ustarroz	f772f3e678	Fix comments on Dav1dMasteringDisplay The '///<' syntax is used to document a field after the field. Mistakenly using it before the field results in the documentation going to the wrong field, see: https://videolan.videolan.me/dav1d/structDav1dMasteringDisplay.html	2024-11-15 16:55:27 +01:00
Martin Storsjö	72b5380757	arm64: looprestoration: Fix use of the wrong register When renumbering argument registers in `1648c232ee`, this one register reference was missed. The missed register was meant to compare h with 2, but accidentally ended up comparing bitdepth_max to 2. In the case of 8 bpc, there's actually no bitdepth_max parameter, so it ended up comparing an uninitialized value.	2024-11-15 12:23:11 +02:00
Martin Storsjö and Jean-Baptiste Kempf	bed3a34365	arm: Use /proc/cpuinfo on linux if getauxval is unavailable On really old libc versions, getauxval isn't available. Fall back on /proc/cpuinfo in those cases, just like we do on android too.	2024-11-14 14:44:21 +00:00
Martin Storsjö and Jean-Baptiste Kempf	718b62c8cd	ci: Raise the timeout multipliers for jobs that run in QEMU For individual tests in dav1d-test-data, the default timeout is 30 seconds (which is the Meson default if nothing is specified). Previously it ran with a multiplier of 4, resulting in a total timeout of 120 seconds. When running tests in QEMU, exceeding this 120 second timeout could happen occasionally. Raise the multiplier to 10, allowing each individual job to run for up to 5 minutes. This should hopefully reduce the amount of stray failures in the CI. For tests that already have a higher default timeout set, such as checkasm which has got a 180 second default timeout, this results in a much longer timeout period. However as long as we don't frequently see issues where these actually hang, it should be beneficial to just let them run to completion, rather than aborting early due to a tight timeout.	2024-11-14 13:38:18 +00:00
Martin Storsjö	1648c232ee	arm64: looprestoration: Remove an unnecessary duplicate parameter in dav1d_sgr_weighted2_Xbpc_neon Also fix one case where the 32 bit input parameter w (which was in x6, now in x4) was used without zero extension, by referencing to it as w4 instead.	2024-11-14 11:53:50 +02:00
Martin Storsjö	ce80e6daf6	arm64: looprestoration: Apply simplifications to align with C code This applies the same simplifications that were done for the C code and the x86 assembly in `4613d3a530`, to the arm64 implementation. This gives a minor speedup of around a couple percent. Before: Cortex A53 A55 A72 A73 A76 Apple M3 sgr_3x3_8bpc_neon: 368583.2 363654.2 279958.1 272065.1 169353.3 354.6 sgr_5x5_8bpc_neon: 258570.7 255018.5 200410.6 199478.3 117968.3 260.9 sgr_mix_8bpc_neon: 603698.1 577383.3 482468.3 436540.4 256632.9 541.8 After: sgr_3x3_8bpc_neon: 367873.2 357884.1 275462.4 268363.9 165909.8 346.0 sgr_5x5_8bpc_neon: 254988.4 248184.2 190875.1 196939.1 120517.2 252.1 sgr_mix_8bpc_neon: 589204.7 563565.8 414025.6 427702.2 251651.2 533.4	2024-11-13 23:39:04 +02:00
Martin Storsjö	8bd31a92a5	arm: looprestoration: Split an overly long line	2024-11-13 15:38:20 +02:00
Luca Barbato	330e20672e	x86: Use the decl and init macros for put_8tap and prep_8tap	2024-11-10 14:18:23 +01:00
Luca Barbato	f966172feb	loongarch: Use the decl and init macros for put_8tap and prep_8tap	2024-11-10 14:01:36 +01:00
Luca Barbato	a403b575b1	mc: Factor out the decl and init macros They can be used across arches.	2024-11-10 14:00:19 +01:00
Luca Barbato	ac1fa6cbca	ppc: use a jumptable for the blends It makes the code tidier and the runtime is not slow.	2024-11-10 12:35:04 +01:00
Luca Barbato	4f088e42cb	ppc: blend_h pwr9 implementation blend_h_w2_8bpc_pwr9: 18.4 ( 1.20x) blend_h_w4_8bpc_pwr9: 27.2 ( 1.26x) blend_h_w8_8bpc_pwr9: 27.9 ( 2.22x) blend_h_w16_8bpc_pwr9: 35.1 ( 3.28x) blend_h_w32_8bpc_pwr9: 57.4 ( 3.88x) blend_h_w64_8bpc_pwr9: 97.9 ( 4.70x) blend_h_w128_8bpc_pwr9: 207.6 ( 5.18x)	2024-11-10 12:35:04 +01:00
Luca Barbato	423cf6e2bf	ppc: blend_v pwr9 implementation blend_v_w2_8bpc_pwr9: 25.0 ( 1.12x) blend_v_w4_8bpc_pwr9: 79.3 ( 1.35x) blend_v_w8_8bpc_pwr9: 79.5 ( 2.43x) blend_v_w16_8bpc_pwr9: 108.0 ( 3.58x) blend_v_w32_8bpc_pwr9: 153.5 ( 4.69x)	2024-11-10 12:35:04 +01:00
Luca Barbato	08681fdf13	ppc: blend pwr9 implementation blend_w4_8bpc_pwr9: 14.4 ( 1.90x) blend_w8_8bpc_pwr9: 19.9 ( 3.62x) blend_w16_8bpc_pwr9: 50.6 ( 5.17x) blend_w32_8bpc_pwr9: 125.8 ( 5.33x)	2024-11-10 12:35:04 +01:00
Brad Smith and Jean-Baptiste Kempf	93f12c117a	Provide dav1d_getauxval() wrapper for getauxval() and elf_aux_info()	2024-11-05 13:10:58 +00:00
Nathan E. Egge	a17c862576	riscv64/mc: Only process w*3/4 elements in blend_v Setting VL for this function only impacts the 16bpc performance and only on the SpacemiT K1 which has two vector units of length 128b each. Kendryte K230 Before After Delta blend_v_w2_8bpc_c: 220.0 ( 1.00x) 221.3 ( 1.00x) 0.59% blend_v_w2_8bpc_rvv: 145.7 ( 1.51x) 148.2 ( 1.49x) 1.72% blend_v_w4_8bpc_c: 942.1 ( 1.00x) 943.7 ( 1.00x) 0.17% blend_v_w4_8bpc_rvv: 240.4 ( 3.92x) 242.9 ( 3.89x) 1.04% blend_v_w8_8bpc_c: 1782.3 ( 1.00x) 1783.8 ( 1.00x) 0.08% blend_v_w8_8bpc_rvv: 252.6 ( 7.06x) 254.9 ( 7.00x) 0.91% blend_v_w16_8bpc_c: 3650.9 ( 1.00x) 3647.0 ( 1.00x) -0.11% blend_v_w16_8bpc_rvv: 495.5 ( 7.37x) 494.4 ( 7.38x) -0.22% blend_v_w32_8bpc_c: 7013.0 ( 1.00x) 7018.2 ( 1.00x) 0.07% blend_v_w32_8bpc_rvv: 807.9 ( 8.68x) 802.0 ( 8.75x) -0.73% blend_v_w2_16bpc_c: 226.1 ( 1.00x) 225.5 ( 1.00x) -0.27% blend_v_w2_16bpc_rvv: 148.6 ( 1.52x) 148.9 ( 1.51x) 0.20% blend_v_w4_16bpc_c: 1010.7 ( 1.00x) 1006.7 ( 1.00x) -0.40% blend_v_w4_16bpc_rvv: 306.7 ( 3.30x) 307.4 ( 3.27x) 0.23% blend_v_w8_16bpc_c: 1990.2 ( 1.00x) 1996.1 ( 1.00x) 0.30% blend_v_w8_16bpc_rvv: 519.5 ( 3.83x) 523.4 ( 3.81x) 0.75% blend_v_w16_16bpc_c: 3744.5 ( 1.00x) 3742.4 ( 1.00x) -0.06% blend_v_w16_16bpc_rvv: 899.6 ( 4.16x) 906.4 ( 4.13x) 0.76% blend_v_w32_16bpc_c: 7047.5 ( 1.00x) 7079.3 ( 1.00x) 0.45% blend_v_w32_16bpc_rvv: 1475.5 ( 4.78x) 1483.3 ( 4.77x) 0.53% SpacemiT K1 Before After Delta blend_v_w2_8bpc_c: 216.3 ( 1.00x) 214.4 ( 1.00x) -0.88% blend_v_w2_8bpc_rvv: 144.0 ( 1.50x) 143.6 ( 1.49x) -0.28% blend_v_w4_8bpc_c: 919.8 ( 1.00x) 918.1 ( 1.00x) -0.18% blend_v_w4_8bpc_rvv: 236.6 ( 3.89x) 236.4 ( 3.88x) -0.08% blend_v_w8_8bpc_c: 1739.3 ( 1.00x) 1736.8 ( 1.00x) -0.14% blend_v_w8_8bpc_rvv: 236.8 ( 7.34x) 236.3 ( 7.35x) -0.21% blend_v_w16_8bpc_c: 3374.7 ( 1.00x) 3374.9 ( 1.00x) 0.01% blend_v_w16_8bpc_rvv: 297.0 (11.36x) 296.8 (11.37x) -0.07% blend_v_w32_8bpc_c: 6647.5 ( 1.00x) 6645.5 ( 1.00x) -0.03% blend_v_w32_8bpc_rvv: 403.3 (16.48x) 
402.4 (16.51x) -0.22% blend_v_w2_16bpc_c: 221.4 ( 1.00x) 220.1 ( 1.00x) -0.59% blend_v_w2_16bpc_rvv: 146.3 ( 1.51x) 147.3 ( 1.49x) 0.68% blend_v_w4_16bpc_c: 973.3 ( 1.00x) 972.7 ( 1.00x) -0.06% blend_v_w4_16bpc_rvv: 280.3 ( 3.47x) 282.1 ( 3.45x) 0.64% blend_v_w8_16bpc_c: 1814.8 ( 1.00x) 1816.2 ( 1.00x) 0.08% blend_v_w8_16bpc_rvv: 376.6 ( 4.82x) 376.9 ( 4.82x) 0.08% blend_v_w16_16bpc_c: 3485.5 ( 1.00x) 3485.5 ( 1.00x) 0.00% blend_v_w16_16bpc_rvv: 531.1 ( 6.56x) 525.6 ( 6.63x) -1.04% blend_v_w32_16bpc_c: 6788.3 ( 1.00x) 6778.8 ( 1.00x) -0.14% blend_v_w32_16bpc_rvv: 904.5 ( 7.51x) 854.6 ( 7.93x) -5.52%	2024-11-05 04:11:55 +00:00
Nathan E. Egge	907dd87191	riscv64/mc16: Unroll 16bpc RVV blend_v 2x Kendryte K230 Before After Delta blend_v_w2_16bpc_c: 225.8 ( 1.00x) 225.7 ( 1.00x) -0.04% blend_v_w2_16bpc_rvv: 194.7 ( 1.16x) 148.6 ( 1.52x) -23.68% blend_v_w4_16bpc_c: 1011.3 ( 1.00x) 1005.8 ( 1.00x) -0.54% blend_v_w4_16bpc_rvv: 387.2 ( 2.61x) 305.4 ( 3.29x) -21.13% blend_v_w8_16bpc_c: 1878.5 ( 1.00x) 1872.7 ( 1.00x) -0.31% blend_v_w8_16bpc_rvv: 475.3 ( 3.95x) 435.6 ( 4.30x) -8.35% blend_v_w16_16bpc_c: 3601.9 ( 1.00x) 3601.6 ( 1.00x) -0.01% blend_v_w16_16bpc_rvv: 891.2 ( 4.04x) 892.7 ( 4.03x) 0.17% blend_v_w32_16bpc_c: 7043.7 ( 1.00x) 7058.8 ( 1.00x) 0.21% blend_v_w32_16bpc_rvv: 1384.5 ( 5.09x) 1478.0 ( 4.78x) 6.75% SpacemiT K1 Before After Delta blend_v_w2_16bpc_c: 222.6 ( 1.00x) 220.5 ( 1.00x) -0.94% blend_v_w2_16bpc_rvv: 195.7 ( 1.14x) 146.6 ( 1.50x) -25.09% blend_v_w4_16bpc_c: 972.3 ( 1.00x) 972.0 ( 1.00x) -0.03% blend_v_w4_16bpc_rvv: 349.1 ( 2.79x) 281.9 ( 3.45x) -19.25% blend_v_w8_16bpc_c: 1812.1 ( 1.00x) 1813.0 ( 1.00x) 0.05% blend_v_w8_16bpc_rvv: 481.5 ( 3.76x) 376.0 ( 4.82x) -21.91% blend_v_w16_16bpc_c: 3488.4 ( 1.00x) 3484.6 ( 1.00x) -0.11% blend_v_w16_16bpc_rvv: 608.7 ( 5.73x) 523.4 ( 6.66x) -14.01% blend_v_w32_16bpc_c: 6795.3 ( 1.00x) 6792.4 ( 1.00x) -0.04% blend_v_w32_16bpc_rvv: 934.8 ( 7.27x) 907.3 ( 7.49x) -2.94%	2024-11-04 20:20:37 +00:00
Nathan E. Egge	9710e7de9c	riscv64/mc16: Branchless vsetvl in blend_v function Kendryte K230 Before After Delta blend_v_w2_16bpc_c: 226.0 ( 1.00x) 226.1 ( 1.00x) 0.04% blend_v_w2_16bpc_rvv: 194.0 ( 1.16x) 193.9 ( 1.17x) -0.05% blend_v_w4_16bpc_c: 1011.8 ( 1.00x) 1009.4 ( 1.00x) -0.24% blend_v_w4_16bpc_rvv: 392.7 ( 2.58x) 390.8 ( 2.58x) -0.48% blend_v_w8_16bpc_c: 1987.9 ( 1.00x) 1988.0 ( 1.00x) 0.01% blend_v_w8_16bpc_rvv: 561.5 ( 3.54x) 560.2 ( 3.55x) -0.23% blend_v_w16_16bpc_c: 3738.1 ( 1.00x) 3739.1 ( 1.00x) 0.03% blend_v_w16_16bpc_rvv: 934.1 ( 4.00x) 932.2 ( 4.01x) -0.20% blend_v_w32_16bpc_c: 7031.0 ( 1.00x) 7030.1 ( 1.00x) -0.01% blend_v_w32_16bpc_rvv: 1403.3 ( 5.01x) 1395.8 ( 5.04x) -0.53% SpacemiT K1 Before After Delta blend_v_w2_16bpc_c: 221.0 ( 1.00x) 221.2 ( 1.00x) 0.09% blend_v_w2_16bpc_rvv: 195.2 ( 1.13x) 196.0 ( 1.13x) 0.41% blend_v_w4_16bpc_c: 969.8 ( 1.00x) 971.9 ( 1.00x) 0.22% blend_v_w4_16bpc_rvv: 348.8 ( 2.78x) 349.1 ( 2.78x) 0.09% blend_v_w8_16bpc_c: 1812.6 ( 1.00x) 1814.9 ( 1.00x) 0.13% blend_v_w8_16bpc_rvv: 486.1 ( 3.73x) 484.3 ( 3.75x) -0.37% blend_v_w16_16bpc_c: 3483.0 ( 1.00x) 3485.1 ( 1.00x) 0.06% blend_v_w16_16bpc_rvv: 608.7 ( 5.72x) 607.4 ( 5.74x) -0.21% blend_v_w32_16bpc_c: 6791.8 ( 1.00x) 6794.2 ( 1.00x) 0.04% blend_v_w32_16bpc_rvv: 940.6 ( 7.22x) 942.1 ( 7.21x) 0.16%	2024-11-04 19:46:26 +00:00
Nathan E. Egge	28d1c21779	riscv64/mc16: Add VLEN=256 8bpc RVV blend_v function SpacemiT K1 Before After Delta blend_v_w2_16bpc_c: 221.5 ( 1.00x) 220.3 ( 1.00x) -0.54% blend_v_w2_16bpc_rvv: 193.5 ( 1.14x) 194.3 ( 1.13x) 0.41% blend_v_w4_16bpc_c: 968.8 ( 1.00x) 967.2 ( 1.00x) -0.17% blend_v_w4_16bpc_rvv: 442.2 ( 2.19x) 347.4 ( 2.78x) -21.44% blend_v_w8_16bpc_c: 1809.4 ( 1.00x) 1811.2 ( 1.00x) 0.10% blend_v_w8_16bpc_rvv: 557.4 ( 3.25x) 483.2 ( 3.75x) -13.31% blend_v_w16_16bpc_c: 3481.4 ( 1.00x) 3473.4 ( 1.00x) -0.23% blend_v_w16_16bpc_rvv: 844.3 ( 4.12x) 603.1 ( 5.76x) -28.57% blend_v_w32_16bpc_c: 6783.1 ( 1.00x) 6749.8 ( 1.00x) -0.49% blend_v_w32_16bpc_rvv: 1406.1 ( 4.82x) 919.4 ( 7.34x) -34.61%	2024-11-04 18:52:32 +00:00
Nathan E. Egge	aa2deb898e	riscv64/mc16: Add 16bpc RVV blend_v function Kendryte K230 blend_v_w2_16bpc_c: 226.5 ( 1.00x) blend_v_w2_16bpc_rvv: 192.2 ( 1.18x) blend_v_w4_16bpc_c: 1010.3 ( 1.00x) blend_v_w4_16bpc_rvv: 390.5 ( 2.59x) blend_v_w8_16bpc_c: 1994.2 ( 1.00x) blend_v_w8_16bpc_rvv: 561.7 ( 3.55x) blend_v_w16_16bpc_c: 3737.9 ( 1.00x) blend_v_w16_16bpc_rvv: 928.0 ( 4.03x) blend_v_w32_16bpc_c: 7064.7 ( 1.00x) blend_v_w32_16bpc_rvv: 1428.9 ( 4.94x) SpacemiT K1 blend_v_w2_16bpc_c: 220.8 ( 1.00x) blend_v_w2_16bpc_rvv: 193.5 ( 1.14x) blend_v_w4_16bpc_c: 967.3 ( 1.00x) blend_v_w4_16bpc_rvv: 439.5 ( 2.20x) blend_v_w8_16bpc_c: 1810.2 ( 1.00x) blend_v_w8_16bpc_rvv: 555.3 ( 3.26x) blend_v_w16_16bpc_c: 3476.4 ( 1.00x) blend_v_w16_16bpc_rvv: 830.9 ( 4.18x) blend_v_w32_16bpc_c: 6772.9 ( 1.00x) blend_v_w32_16bpc_rvv: 1356.3 ( 4.99x)	2024-11-04 18:52:30 +00:00
Nathan E. Egge	c783088fe7	riscv64/mc16: Unroll 16bpc RVV blend 2x Kendryte K230 Before After Delta blend_w4_16bpc_c: 210.0 ( 1.00x) 208.9 ( 1.00x) -0.52% blend_w4_16bpc_rvv: 88.5 ( 2.37x) 66.2 ( 3.15x) -25.20% blend_w8_16bpc_c: 614.1 ( 1.00x) 613.5 ( 1.00x) -0.10% blend_w8_16bpc_rvv: 143.1 ( 4.29x) 126.9 ( 4.83x) -11.32% blend_w16_16bpc_c: 2371.2 ( 1.00x) 2371.3 ( 1.00x) 0.00% blend_w16_16bpc_rvv: 461.1 ( 5.14x) 413.2 ( 5.74x) -10.39% blend_w32_16bpc_c: 5998.4 ( 1.00x) 5998.4 ( 1.00x) 0.00% blend_w32_16bpc_rvv: 978.4 ( 6.13x) 1013.1 ( 5.92x) 3.55% SpacemiT K1 Before After Delta blend_w4_16bpc_c: 205.8 ( 1.00x) 205.9 ( 1.00x) 0.05% blend_w4_16bpc_rvv: 80.9 ( 2.54x) 64.9 ( 3.17x) -19.78% blend_w8_16bpc_c: 599.9 ( 1.00x) 599.9 ( 1.00x) 0.00% blend_w8_16bpc_rvv: 134.4 ( 4.46x) 101.9 ( 5.89x) -24.18% blend_w16_16bpc_c: 2316.5 ( 1.00x) 2316.5 ( 1.00x) 0.00% blend_w16_16bpc_rvv: 302.0 ( 7.67x) 262.8 ( 8.81x) -12.98% blend_w32_16bpc_c: 5861.9 ( 1.00x) 5861.4 ( 1.00x) -0.01% blend_w32_16bpc_rvv: 589.6 ( 9.94x) 602.2 ( 9.73x) 2.14%	2024-10-31 07:11:35 +00:00
Nathan E. Egge	67c60d76e1	riscv64/mc16: Branchless vsetvl in blend function Kendryte K230 Before After Delta blend_w4_16bpc_c: 208.8 ( 1.00x) 209.9 ( 1.00x) 0.53% blend_w4_16bpc_rvv: 85.9 ( 2.43x) 88.6 ( 2.37x) 3.14% blend_w8_16bpc_c: 613.2 ( 1.00x) 614.3 ( 1.00x) 0.18% blend_w8_16bpc_rvv: 145.4 ( 4.22x) 143.1 ( 4.29x) -1.58% blend_w16_16bpc_c: 2371.9 ( 1.00x) 2373.6 ( 1.00x) 0.07% blend_w16_16bpc_rvv: 464.0 ( 5.11x) 461.2 ( 5.15x) -0.60% blend_w32_16bpc_c: 6005.6 ( 1.00x) 6007.7 ( 1.00x) 0.03% blend_w32_16bpc_rvv: 981.6 ( 6.12x) 979.4 ( 6.13x) -0.22% SpacemiT K1 Before After Delta blend_w4_16bpc_c: 206.4 ( 1.00x) 205.7 ( 1.00x) -0.34% blend_w4_16bpc_rvv: 79.5 ( 2.60x) 81.0 ( 2.54x) 1.89% blend_w8_16bpc_c: 600.7 ( 1.00x) 599.7 ( 1.00x) -0.17% blend_w8_16bpc_rvv: 133.3 ( 4.51x) 134.1 ( 4.47x) 0.60% blend_w16_16bpc_c: 2315.9 ( 1.00x) 2315.2 ( 1.00x) -0.03% blend_w16_16bpc_rvv: 305.2 ( 7.59x) 300.7 ( 7.70x) -1.47% blend_w32_16bpc_c: 5861.1 ( 1.00x) 5860.2 ( 1.00x) -0.02% blend_w32_16bpc_rvv: 592.5 ( 9.89x) 589.5 ( 9.94x) -0.51%	2024-10-31 07:11:35 +00:00
Nathan E. Egge	3437a26b3d	riscv64/mc16: Add VLEN=256 8bpc RVV blend function SpacemiT K1 Before After Delta blend_w4_16bpc_c: 206.8 ( 1.00x) 206.0 ( 1.00x) -0.39% blend_w4_16bpc_rvv: 95.8 ( 2.16x) 77.8 ( 2.65x) -18.79% blend_w8_16bpc_c: 600.4 ( 1.00x) 600.1 ( 1.00x) -0.05% blend_w8_16bpc_rvv: 161.7 ( 3.71x) 131.3 ( 4.57x) -18.80% blend_w16_16bpc_c: 2317.6 ( 1.00x) 2316.5 ( 1.00x) -0.05% blend_w16_16bpc_rvv: 459.6 ( 5.04x) 302.9 ( 7.65x) -34.09% blend_w32_16bpc_c: 5863.0 ( 1.00x) 5863.3 ( 1.00x) 0.01% blend_w32_16bpc_rvv: 992.7 ( 5.91x) 578.1 (10.14x) -41.76%	2024-10-31 07:11:35 +00:00
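
Several of the looprestoration commits listed above (for example `f32b314616`) describe replacing a full-block intermediate buffer with a small window of rows. The C sketch below illustrates that idea on a toy 3x3 box sum with edge replication. It is not dav1d's code: the names `box3_row_h`, `box3_full`, `box3_window`, `W` and `H` are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { W = 8, H = 6 }; /* toy block size, assumed for this sketch */

/* Horizontal 3-tap box sum of one row, replicating the edge pixels. */
static void box3_row_h(int32_t *dst, const uint8_t *src, int w) {
    for (int x = 0; x < w; x++) {
        const int l = x > 0 ? x - 1 : 0;
        const int r = x < w - 1 ? x + 1 : w - 1;
        dst[x] = src[l] + src[x] + src[r];
    }
}

/* Reference: keeps H rows of intermediate data on the stack. */
static void box3_full(int32_t *dst, const uint8_t *src, int w, int h) {
    int32_t tmp[H][W];
    assert(w >= 1 && w <= W && h >= 1 && h <= H);
    for (int y = 0; y < h; y++)
        box3_row_h(tmp[y], src + y * w, w);
    for (int y = 0; y < h; y++) {
        const int a = y > 0 ? y - 1 : 0;
        const int b = y < h - 1 ? y + 1 : h - 1;
        for (int x = 0; x < w; x++)
            dst[y * w + x] = tmp[a][x] + tmp[y][x] + tmp[b][x];
    }
}

/* Windowed: keeps only 3 rows of intermediate data and rotates pointers. */
static void box3_window(int32_t *dst, const uint8_t *src, int w, int h) {
    int32_t buf[3][W];
    int32_t *rows[3] = { buf[0], buf[1], buf[2] };
    assert(w >= 1 && w <= W && h >= 1);
    box3_row_h(rows[1], src, w);                        /* current row 0 */
    memcpy(rows[0], rows[1], (size_t)w * sizeof(int32_t)); /* top edge */
    for (int y = 0; y < h; y++) {
        if (y + 1 < h)
            box3_row_h(rows[2], src + (y + 1) * w, w);  /* next row */
        else
            memcpy(rows[2], rows[1], (size_t)w * sizeof(int32_t)); /* bottom edge */
        for (int x = 0; x < w; x++)
            dst[y * w + x] = rows[0][x] + rows[1][x] + rows[2][x];
        /* Rotate the window: the oldest row becomes scratch space. */
        int32_t *oldest = rows[0];
        rows[0] = rows[1];
        rows[1] = rows[2];
        rows[2] = oldest;
    }
}
```

Both functions produce the same output, but the windowed variant's intermediate storage no longer grows with the block height, which is the kind of stack saving those commit messages report.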

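
Commit `f8d2620d82` above describes strict bounds checking of filter output: writes past the right edge are tolerated up to an alignment boundary, while writes past the bottom are not. A minimal sketch of such a check, assuming a guard-byte scheme, could look like the following. It is not checkasm's actual PIXEL_RECT machinery; `check_bounds`, `write_fn`, `good_fill`, `bad_fill` and the constants are hypothetical names for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { MAX_W = 128, MAX_H = 32, PAD = 8, GUARD = 0xA5 };
enum { STRIDE = PAD + MAX_W + PAD, ROWS = PAD + MAX_H + PAD };

typedef void (*write_fn)(uint8_t *dst, ptrdiff_t stride, int w, int h);

/* Returns 1 if fn only touched the w x h rectangle, tolerating writes to
 * the right of it up to the next multiple of `align` pixels; 0 otherwise.
 * Limitation of this sketch: a stray write of the value GUARD itself
 * goes undetected. */
static int check_bounds(write_fn fn, int w, int h, int align) {
    static uint8_t buf[ROWS * STRIDE];
    assert(w > 0 && h > 0 && h <= MAX_H && align > 0);
    const int allowed_w = (w + align - 1) / align * align;
    assert(allowed_w <= MAX_W);

    memset(buf, GUARD, sizeof(buf));
    fn(buf + PAD * STRIDE + PAD, STRIDE, w, h);

    for (int y = 0; y < ROWS; y++) {
        for (int x = 0; x < STRIDE; x++) {
            const int inside = y >= PAD && y < PAD + h &&
                               x >= PAD && x < PAD + allowed_w;
            if (!inside && buf[y * STRIDE + x] != GUARD)
                return 0; /* a guard byte was overwritten */
        }
    }
    return 1;
}

/* Toy writer that stays in bounds. */
static void good_fill(uint8_t *dst, ptrdiff_t stride, int w, int h) {
    for (int y = 0; y < h; y++)
        memset(dst + y * stride, 0, (size_t)w);
}

/* Toy writer that writes one row too many past the bottom. */
static void bad_fill(uint8_t *dst, ptrdiff_t stride, int w, int h) {
    good_fill(dst, stride, w, h + 1);
}
```

With these toy writers, `check_bounds(good_fill, 20, 4, 64)` passes and `check_bounds(bad_fill, 20, 4, 64)` fails because the extra row lands in the guard area below the rectangle.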
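
Commit `cd5bfa124a` above notes that CLOCK_MONOTONIC_RAW is not POSIX and therefore not portable. One common way to handle this, sketched here under the assumption that falling back to the POSIX CLOCK_MONOTONIC clock is acceptable, is a compile-time check for the macro. The helper name `monotonic_ns` is hypothetical, and this is not necessarily how the dav1d fix is written.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Returns a monotonic timestamp in nanoseconds, or 0 on failure.
 * Hypothetical helper for illustration only. */
static uint64_t monotonic_ns(void) {
    struct timespec ts;
#ifdef CLOCK_MONOTONIC_RAW
    /* Non-POSIX extension (available on Linux, for example). */
    const clockid_t id = CLOCK_MONOTONIC_RAW;
#else
    /* Clock specified by POSIX. */
    const clockid_t id = CLOCK_MONOTONIC;
#endif
    if (clock_gettime(id, &ts) != 0)
        return 0;
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
```

Two successive calls can be subtracted to time a section of code; the difference is meaningful only within a single process run.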