2799 Commits
Author SHA1 Message Date
Pranav Kant and Henrik Gramner 7d4b789f55 Mark C globals with small code model
Clang has a new attribute (https://github.com/llvm/llvm-project/pull/124834)
that marks globals to be allocated in non-large sections. We can mark all globals
that are referenced from hardcoded assembly (which implicitly assumes those
globals are in non-large sections) with this attribute to avoid running
into problems when dav1d is built with -mcmodel=medium with clang.
2025-02-21 15:55:00 +00:00
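As a rough illustration of commit 7d4b789f55, such a marker can be wrapped in a macro that expands to nothing on compilers without the attribute. This is a minimal sketch, not the code from the commit: the macro name `ATTR_SMALL_MODEL`, the table `example_table`, and the spelling `model("small")` are assumptions based on the linked pull request.

```c
#include <stdint.h>

#ifndef __has_attribute
#define __has_attribute(x) 0   /* compilers without __has_attribute */
#endif

/* Hypothetical macro name. It expands to the attribute only on
 * x86-64 Clang builds that report support for it, else to nothing. */
#if defined(__clang__) && defined(__x86_64__) && __has_attribute(model)
#define ATTR_SMALL_MODEL __attribute__((model("small")))
#else
#define ATTR_SMALL_MODEL
#endif

/* Hypothetical table that hand-written assembly might address directly. */
ATTR_SMALL_MODEL const uint8_t example_table[4] = { 1, 2, 3, 4 };
```

Because the macro degrades to an empty definition, the same declaration should compile unchanged with GCC and with older Clang versions.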
Jean-Baptiste Kempf 42b2b24fb8 Update NEWS for 1.5.1 1.5.1 2025-01-19 22:33:54 +01:00
Wan-Teh Chang and Ronald S. Bultje 40ff2a1251 Include <string.h> for memcpy() 2025-01-10 01:54:41 +00:00
Arpad Panyik edb16889d1 AArch64: Add Neon implementation of load_tmvs
This patch adds a vectorised variant of the mv_projection calculation
and a faster initialisation of motion vectors for load_tmvs_neon.

Checkasm uplifts after this patch on some Neoverse and Cortex CPU cores
compared to the C reference compiled with GCC-13 and Clang-19:

                     GCC    Clang
 AWS Graviton 4:   1.62x    1.59x
 Cortex-X4:        1.45x    1.46x
 Cortex-X3:        1.68x    1.69x
 Cortex-X1:        1.55x    1.52x
 Cortex-A720:      1.54x    1.57x
 Cortex-A715:      1.47x    1.55x
 Cortex-A78:       1.21x    1.18x
 Cortex-A76:       1.38x    1.37x
 Cortex-A72:       1.08x    1.11x
 Cortex-A520:      0.97x    1.18x
 Cortex-A510:      0.99x    1.14x
 Cortex-A55:       1.16x    1.23x

This patch increases the .text size by ~660 bytes, but the new code is
about 0.5 KiB smaller than the reference implementation.
2025-01-09 14:59:31 +01:00
Martin Storsjö b129d9f2cb mc: Reduce stack use in {put,prep}_scaled_{bilin,8tap}
For the bilin cases, this seems to make things marginally faster
(measured on x86_64; 7-25% faster with compiler autovectorization).
For 8tap, it doesn't make much of a difference at all.

Before:                                      GCC   Clang
mc_scaled_8tap_regular_w128_8bpc_c:     115155.5   98549.3
mc_scaled_8tap_regular_w128_8bpc_ssse3:  17936.0   18411.1
mc_scaled_bilinear_w128_8bpc_c:          40290.0   51812.9
mc_scaled_bilinear_w128_8bpc_ssse3:      18243.9   18177.0
After:
mc_scaled_8tap_regular_w128_8bpc_c:     116304.3   99453.2
mc_scaled_8tap_regular_w128_8bpc_ssse3:  18387.0   18077.3
mc_scaled_bilinear_w128_8bpc_c:          37381.4   41145.0
mc_scaled_bilinear_w128_8bpc_ssse3:      18423.8   18031.6

(Benchmarked with the seed 0; the total runtime for the scaled
benchmarks is significantly affected by the random seed.)

This reduces the stack usage of these functions from around 65 KB
each, to less than 1 KB for bilin, and around 2 KB for 8tap.

With this in place, the required stack space for dav1d should
be mostly identical across configurations; on x86_64 (both with
and without assembly), it can run with 62 KB of stack, and
on arm and aarch64, it can run with 58 KB of stack.
2025-01-02 15:30:21 +00:00
Brad Smith and Jean-Baptiste Kempf cd5bfa124a riscv: Fix building on non-Linux OS's
CLOCK_MONOTONIC_RAW is not POSIX/portable.
2024-12-29 18:32:23 +00:00
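One common way to handle the portability issue named in commit cd5bfa124a is to use CLOCK_MONOTONIC_RAW only where the platform defines it. The sketch below is an assumption about how such a fallback can look, not the change made in the commit; the helper name `monotonic_ns` is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: returns a monotonic timestamp in nanoseconds.
 * CLOCK_MONOTONIC_RAW is a non-POSIX extension, so it is only used when
 * the platform headers define it; otherwise CLOCK_MONOTONIC is used. */
static uint64_t monotonic_ns(void)
{
    struct timespec ts;
#ifdef CLOCK_MONOTONIC_RAW
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
#else
    clock_gettime(CLOCK_MONOTONIC, &ts);
#endif
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
```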
James Almer 5ea4939a1d obu: don't print warnings for Metadata OBUs of types "Unregistered user private" 2024-12-27 13:48:54 -03:00
Martin Storsjö 2ba57aa535 arm32: looprestoration: Rewrite the wiener functions
Switch to the same cache-friendly algorithm as was done for arm64
in 2e73051c57 and for the reference
C code in 8291a66e50.

Contrary to the arm64 implementation, this uses a main loop in C
(very similar to the one in the main C implementation in
8291a66e50) rather than assembly;
this gives a bit more overhead on the call to each function, but
it shouldn't affect the big picture much.

Performance-wise, this doesn't make much of a difference - it makes
things a little bit faster on some cores, and a little bit slower
on others:

Before:                 Cortex A7        A8       A53       A72       A73
wiener_7tap_8bpc_neon:   269384.4  147730.7  140028.5   92662.5   92929.0
wiener_7tap_10bpc_neon:  352690.2  159970.2  169427.8  116614.9  119371.1
After:
wiener_7tap_8bpc_neon:   238328.0  157274.1  134588.6   92200.3   97619.6
wiener_7tap_10bpc_neon:  336369.3  162182.0  161954.4  125521.2  130634.0

This is mostly in line with the results on arm64 in
2e73051c57. On arm64, there was a
bit larger speedup for the 7tap case, mostly attributed to
unrolling the vertical filter (and the new filter_hv function) to
operate on 16 pixels at a time. On arm32, there's not enough
registers to do that, so we can't get such gains from unrolling.
(Reducing the unrolling on the arm64 version to match the case
on arm32 also shows similar performance numbers as on arm32 here.)

In the arm64 version, we also added separate 5tap versions of all
functions; not doing that for arm32 at this point.

This increases the binary size by 2 KB.

This doesn't have any immediate effect on how much stack space
dav1d requires in total, since the largest stack users on arm
currently are the 8tap_scaled functions.
2024-12-20 14:32:32 +02:00
Martin Storsjö 8291a66e50 looprestoration: Use only 6 row buffer for wiener, like NEON/x86
This uses a separate function for combined horizontal and vertical
filtering, without needing to write the intermediate results
back to memory in between.

This mostly serves as an example for how to adjust the logic for
that case; unless we actually merge the horizontal and vertical
filtering within the _hv function, we still need space for a
7th row on the stack within that function (which means we use just
as much stack as before), but we also need one extra memcpy to
write it into the right destination.

In a build where the compiler is allowed to vectorize and inline
the wiener functions into each other, this change actually reduces
the final binary size by 4 KB, if the C version of the wiener filter
is retained.

This change makes the vectorized C code as fast as it was before
with Clang 18; on Xcode Clang 16, it's 2x slower than it was before.

Unfortunately, with GCC, this change makes the code a bit slower
again.
2024-12-19 14:19:19 +02:00
Martin Storsjö a149f5c3c0 looprestoration: Make the C wiener h filter more optimizable for the compiler
This increases the binary size by 9 KB, on aarch64 with Xcode Clang 16,
if the C version of the filter is retained (which it isn't
by default).

This makes the vectorized C code roughly as fast as it was before
the rewrite on GCC; on Clang it also becomes 1.3x-2.0x faster,
while still being slower than it was initially.
2024-12-19 14:19:19 +02:00
Martin Storsjö 9da303e989 looprestoration: Rewrite the C version of the wiener filter
This reduces the stack usage of these functions (the C version)
significantly.

These C versions aren't used on architectures that already have
wiener filters implemented in assembly, but they matter both when
running with assembly disabled (e.g. for sanitizer builds), and
as an example of how to do a cache efficient SIMD
implementation.

This roughly matches how these functions are implemented in the
aarch64 assembly (although that assembly function uses a mainloop
function written in assembly, and custom calling conventions
between the functions).

With this in place, dav1d can run with around 76 KB of stack
with assembly disabled.

This increases the binary size by around 14 KB (in the case of
aarch64 with Xcode Clang 16), unless built with (the default)
-Dtrim_dsp=true. (By default, the C version of the wiener filter
gets skipped entirely.)

On 32 bit arm, the assembly wiener function implementation still
uses large buffers on the stack, but because other functions
use less stack there, dav1d can still run with 72 KB of stack
there.

Unfortunately, this change also makes the functions slower, depending
on how well the compiler was able to optimize the previous version.
On GCC (which didn't manage to vectorize the functions so well before),
it becomes 1.6x-2.0x slower, while it gets 2.5x-5x slower on Clang
(where it was very well vectorized before).

Most of this performance can be gained back with later changes on
top, though.
2024-12-19 14:19:13 +02:00
Luc Trudeau d242c47b43 Replace Av1Block with pal_sz in read_pal_indices 2024-12-02 09:32:33 -05:00
Henrik Gramner 9a75cebc36 Explicitly use uint8_t for the order_palette() scratch buffer
It previously used 'pixel' which is typedefed to uint8_t in files
that aren't bitdepth-templated, but those are indices and not
pixels so that was just confusing and misleading.
2024-12-02 13:47:04 +01:00
victorien 575af25859 flush: Reset f->task_thread.error
f->task_thread.error can be set during flushing. Not resetting it can
lead to c->task_thread.first being increased after a frame has already been
submitted post flushing. That's fine if it happens on the very first frame,
but on any subsequent frame it results in a wrong frame ordering.
Since a non-first frame is then treated as the first one, its tasks won't be
able to execute (they depend on a genuinely earlier frame that is now considered
to come after it), and c->task_thread.cur will be increased past that frame with
no way of being reset, eventually leading to a hang.
2024-11-28 17:56:13 +01:00
Wan-Teh Chang and Ronald S. Bultje 767efeca06 Fix ClangTidy misc-include-cleaner warnings 2024-11-26 14:26:25 +00:00
Martin Storsjö and Jean-Baptiste Kempf f8d2620d82 checkasm: looprestoration: Do strict bounds checking of the output
This makes it possible to immediately detect unintended writes out of
bounds like the ones fixed in
72b5380757 and
1c7433a5eb.

Extend the PIXEL_RECT macro to provide a variable containing the
full, padded height of the buffer, for uses that operate on the
full buffer.

Allow overwriting past the right edge of the target output rectangle,
up to an alignment of 64 pixels, but allow no overwrite past the
bottom.
2024-11-21 09:05:33 +00:00
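The rule described in commit f8d2620d82 (overwrite allowed to the right up to a 64 pixel alignment, none below) can be sketched as a guard-byte check over a padded buffer. This is not the checkasm implementation: the function name, the guard value, 8 bit pixels, and rounding the width up relative to the rectangle's left edge are all assumptions made for illustration.

```c
#include <stdint.h>

enum { GUARD_BYTE = 0xA5 };  /* hypothetical fill value for the padding */

/* Returns 1 if every byte outside the permitted region still holds
 * GUARD_BYTE. The permitted region is the w x h rectangle at (x0, y0),
 * widened to the right up to the next multiple of 64 pixels.
 * No rows above or below the rectangle may be touched. */
static int outside_untouched(const uint8_t *buf, int stride, int buf_h,
                             int x0, int y0, int w, int h)
{
    const int w_allowed = (w + 63) & ~63;
    for (int y = 0; y < buf_h; y++) {
        const int row_ok = y >= y0 && y < y0 + h;
        for (int x = 0; x < stride; x++) {
            const int col_ok = x >= x0 && x < x0 + w_allowed;
            if (!(row_ok && col_ok) && buf[y * stride + x] != GUARD_BYTE)
                return 0;
        }
    }
    return 1;
}
```

A caller would fill the whole padded buffer with the guard value, run the function under test, and then call the check with the full padded height.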
Brad Smith and Jean-Baptiste Kempf f15666b703 riscv: Enable FreeBSD / OpenBSD elf_aux_info() support 2024-11-21 08:41:38 +00:00
Martin Storsjö 30c3dd8edd arm32: looprestoration: Rewrite the SGR functions
Switch to the same cache-friendly algorithm as was done for arm64
in c121b831e2.

This uses much less stack memory, and is much more cache friendly.
In this form, most of the individual asm functions only operate on
one single row of data at a time.

Some of the functions used to be unrolled to operate on two rows
at a time, while they now only operate on one at a time. In practice,
this is still a large performance win, as data is accessed in a
much more cache friendly manner.

This gives a 2-37% speedup, and reduces the peak amount of stack
used for these functions from 255 KB to 33 KB.

Before:              Cortex A7         A8        A53        A72        A73
sgr_3x3_8bpc_neon:    873990.7   748341.9   543410.2   383200.4   357502.9
sgr_3x3_10bpc_neon:   909728.0   732594.5   560123.6   392765.5   359377.7
sgr_5x5_8bpc_neon:    591597.9   527353.1   350347.4   263464.9   243098.8
sgr_5x5_10bpc_neon:   637958.2   529462.8   364613.3   280664.6   255164.6
sgr_mix_8bpc_neon:   1458977.4  1185423.2   884017.7   632922.5   587395.2
sgr_mix_10bpc_neon:  1532376.5  1259111.4   918729.3   658787.6   600317.0
After:
sgr_3x3_8bpc_neon:    836138.7   635556.5   530596.1   335794.6   348209.9
sgr_3x3_10bpc_neon:   850835.4   596445.0   534583.2   342713.4   349713.5
sgr_5x5_8bpc_neon:    577039.7   443916.5   341684.8   223374.0   232841.3
sgr_5x5_10bpc_neon:   600975.7   400041.3   347529.8   234759.9   239351.7
sgr_mix_8bpc_neon:   1297988.7   925739.1   830360.7   545476.1   548706.6
sgr_mix_10bpc_neon:  1340112.6   914395.7   873342.4   574815.7   554681.6

With this change in place, dav1d can run with around 72 KB of stack
on arm targets.

Not all functions have been merged in the same way as they were
for arm64 in c121b831e2, so some
minor differences remain; it's possible to incrementally optimize
this, e.g. to fuse box3/5_row_v with calc_row_ab1/2, fuse
finish_filter_row1/2 with sgr_weighted_row1, and make a version of
finish_filter_row1 that produces 2 rows, like is done for arm64.

It's also possible to rewrite the logic for calculating sgr_x_by_x
in the same way as was done for arm64 in
79db162487.
2024-11-19 11:58:25 +02:00
Martin Storsjö 1b7f126361 arm32: looprestoration: Apply simplifications to align with C code
This applies the same simplifications that were done for the C
code and the x86 assembly in 4613d3a530,
and the arm64 assembly in ce80e6daf6,
to the arm32 implementation.

This gives a minor speedup of around a couple percent.

Before:             Cortex A7         A8        A53        A72        A73
sgr_3x3_8bpc_neon:   926600.0   753468.3   553704.1   399379.1   369674.4
sgr_5x5_8bpc_neon:   621722.9   540412.7   357275.9   274474.3   254996.0
sgr_mix_8bpc_neon:  1529715.1  1171282.5   894982.9   659996.6   610407.2
After:
sgr_3x3_8bpc_neon:   899020.3   697278.6   541569.9   382824.3   353891.8
sgr_5x5_8bpc_neon:   602183.2   498322.9   348974.5   264833.9   243837.7
sgr_mix_8bpc_neon:  1497870.8  1182121.3   880470.9   635939.3   590909.3
2024-11-18 16:08:00 +02:00
Martin Storsjö c43debf1b1 arm64: looprestoration: Fix a comment typo 2024-11-18 16:07:40 +02:00
Martin Storsjö 1c7433a5eb arm: looprestoration: Fix the single line loop in sgr_weighted2
After processing one block, this accidentally jumped to the loop
for processing two lines at once.

The same bug was replicated in both 32 and 64 bit versions.
2024-11-18 16:07:40 +02:00
Martin Storsjö f32b314616 looprestoration: Rewrite the C version of the SGR filter
This reduces the stack usage of these functions (the C version)
significantly, and gives them a 15-40% speedup (on an Apple M3,
with Xcode Clang 16).

The C version of this function does matter; even though we have
assembly implementations of it on x86 and aarch64, those only
cover the 8 and 10 bpc cases, while the C version is used as
fallback for 12 bpc.

This matches how these functions are implemented in the aarch64
assembly; operate over a window of 3 or 5 lines (of 384 pixels
each), instead of doing a full 384 x 64 block.

The individual functions for filtering a line each end up
much simpler, and closer to how this can be implemented in
assembly - but the overall business logic ends up much much
more complex.

The main difference to the aarch64 assembly implementation,
is that any buffer which is of int16_t size in the aarch64
assembly implementation, uses the type "coef" here, which
is 32 bit in the 10/12 bpc cases. (This is required for handling
the 12 bpc cases.)

With this in place, dav1d can run with around 66 KB of stack
on x86_64 with assembly enabled, with around 74 KB of stack on
aarch64 with assembly enabled, and with 118 KB of stack with
assembly disabled.

This increases the binary size by around 14 KB (in the case of
aarch64 with Xcode Clang 16).

On 32 bit arm, dav1d still requires around 270 KB of stack, as
that assembly implementation of the SGR filter uses a different
algorithm.
2024-11-18 15:57:19 +02:00
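The windowed approach described in commit f32b314616 (a few rows of 384 pixels instead of a full 384 x 64 block) can be illustrated with a much simpler filter. The sketch below computes a 3x3 box sum with edge replication while keeping only three rows of intermediate data on the stack. It is an illustration of the buffering pattern only, not the SGR filter; the function names and the packed layout (stride equal to w) are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { ROW_W = 384 };  /* row width quoted in the commit message */

/* Horizontal 3-tap sum of one row, replicating the edge pixels. */
static void hsum3_row(int32_t *out, const uint8_t *in, int w)
{
    for (int x = 0; x < w; x++) {
        const int l = in[x > 0 ? x - 1 : 0];
        const int r = in[x < w - 1 ? x + 1 : w - 1];
        out[x] = l + in[x] + r;
    }
}

/* 3x3 box sum over a packed w x h image (stride == w).
 * Intermediate storage: 3 * 384 * 4 = 4608 bytes, compared with
 * 384 * 64 * 4 = 98304 bytes for a full-block int32_t buffer. */
static void box3_windowed(int32_t *dst, const uint8_t *src, int w, int h)
{
    int32_t buf[3][ROW_W];
    int32_t *rows[3] = { buf[0], buf[1], buf[2] };
    assert(w >= 1 && w <= ROW_W && h >= 1);

    /* Prime the window: the row above row 0 replicates row 0. */
    hsum3_row(rows[0], src, w);
    memcpy(rows[1], rows[0], (size_t)w * sizeof(int32_t));

    for (int y = 0; y < h; y++) {
        const int yn = y + 1 < h ? y + 1 : h - 1;  /* replicate last row */
        hsum3_row(rows[2], src + (size_t)yn * w, w);
        for (int x = 0; x < w; x++)
            dst[(size_t)y * w + x] = rows[0][x] + rows[1][x] + rows[2][x];

        /* Rotate the pointers; the oldest row buffer gets reused. */
        int32_t *const oldest = rows[0];
        rows[0] = rows[1];
        rows[1] = rows[2];
        rows[2] = oldest;
    }
}
```

Rotating the row pointers rather than copying the rows is what keeps the working set small, at the cost of the more involved control flow that the commit message mentions.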
Martin Storsjö 01d417c2fa arm: looprestoration: Give symbols and defines unique names
As the machine specific init file is included in the common
template, give symbols and defines unique names that won't
clash with similar ones in the main template.
2024-11-18 15:39:28 +02:00
Martin Storsjö 847eece170 arm: looprestoration: Add spacing around operators 2024-11-18 15:39:28 +02:00
Martin Storsjö 56a55933b3 arm: looprestoration: Get rid of unnecessary rotate_ab_N intermediate functions 2024-11-18 15:39:28 +02:00
Martin Storsjö 9db59d8904 arm: looprestoration: Apply 'const' more consistently on parameters 2024-11-18 15:39:28 +02:00
Marvin Scholz c8fdaa8611 checkasm: add loongarch GAS file to checkasm_asm_sources
This is not an object so putting it in the objects variable seems wrong
and would also break using gaspp for that file.
2024-11-16 14:51:35 +01:00
Maryla Ustarroz f772f3e678 Fix comments on Dav1dMasteringDisplay
The '///<' syntax is used to document a field after the field.
Mistakenly using it before the field results in the documentation
going to the wrong field, see:
https://videolan.videolan.me/dav1d/structDav1dMasteringDisplay.html
2024-11-15 16:55:27 +01:00
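The Doxygen placement rule behind commit f772f3e678 can be shown on a small struct. The struct and its members below are hypothetical and are not the fields of Dav1dMasteringDisplay.

```c
#include <stdint.h>

/** Hypothetical struct, used only to illustrate comment placement. */
typedef struct ExampleDisplay {
    uint32_t max_luminance; ///< documents max_luminance, the member before it
    uint32_t min_luminance; ///< documents min_luminance

    /// A plain '///' comment documents the member that follows it.
    uint16_t white_point[2];
} ExampleDisplay;
```

Writing a `///<` comment on the line above a member would attach the text to the previous member instead, which is the kind of mistake the commit describes.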
Martin Storsjö 72b5380757 arm64: looprestoration: Fix use of the wrong register
When renumbering argument registers in
1648c232ee, this one register
reference was missed.

The missed register was meant to compare h with 2, but accidentally
ended up comparing bitdepth_max to 2. In the case of 8 bpc, there's
actually no bitdepth_max parameter, so it ended up comparing an
uninitialized value.
2024-11-15 12:23:11 +02:00
Martin Storsjö and Jean-Baptiste Kempf bed3a34365 arm: Use /proc/cpuinfo on linux if getauxval is unavailable
On really old libc versions, getauxval isn't available. Fall back
on /proc/cpuinfo in those cases, just like we do on android too.
2024-11-14 14:44:21 +00:00
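A fallback of the kind described in commit bed3a34365 usually means scanning the "Features" line of /proc/cpuinfo for a flag name. The sketch below is a hypothetical illustration, not dav1d's parser: both function names are made up, and lines longer than the 1024 byte buffer are not handled.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if `flag` appears in `line` as a whole word. */
static int line_has_flag(const char *line, const char *flag)
{
    const size_t n = strlen(flag);
    const char *p = line;
    if (n == 0)
        return 0;
    while ((p = strstr(p, flag)) != NULL) {
        const int start_ok = p == line || p[-1] == ' ' ||
                             p[-1] == '\t' || p[-1] == ':';
        const int end_ok = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';
        if (start_ok && end_ok)
            return 1;
        p += n;
    }
    return 0;
}

/* Scans a cpuinfo-style file for `flag` on a line starting with
 * "Features". Returns 0 if the file cannot be opened. */
static int cpuinfo_has_flag(const char *path, const char *flag)
{
    char line[1024];
    int found = 0;
    FILE *const f = fopen(path, "r");
    if (!f)
        return 0;
    while (!found && fgets(line, sizeof(line), f)) {
        if (!strncmp(line, "Features", 8))
            found = line_has_flag(line, flag);
    }
    fclose(f);
    return found;
}
```

The whole-word check matters because flag names can be substrings of each other; for example, a search for "simd" should not match "asimd".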
Martin Storsjö and Jean-Baptiste Kempf 718b62c8cd ci: Raise the timeout multipliers for jobs that run in QEMU
For individual tests in dav1d-test-data, the default timeout
is 30 seconds (which is the Meson default if nothing is
specified). Previously it ran with a multiplier of 4, resulting
in a total timeout of 120 seconds.

When running tests in QEMU, exceeding this 120 second timeout
could happen occasionally. Raise the multiplier to 10, allowing
each individual job to run for up to 5 minutes.

This should hopefully reduce the amount of stray failures in the
CI.

For tests that already have a higher default timeout set, such
as checkasm which has got a 180 second default timeout, this results
in a much longer timeout period. However as long as we don't
frequently see issues where these actually hang, it should be
beneficial to just let them run to completion, rather than
aborting early due to a tight timeout.
2024-11-14 13:38:18 +00:00
Martin Storsjö 1648c232ee arm64: looprestoration: Remove an unnecessary duplicate parameter in dav1d_sgr_weighted2_Xbpc_neon
Also fix one case where the 32 bit input parameter w (which was in
x6, now in x4) was used without zero extension, by referencing to
it as w4 instead.
2024-11-14 11:53:50 +02:00
Martin Storsjö ce80e6daf6 arm64: looprestoration: Apply simplifications to align with C code
This applies the same simplifications that were done for the C
code and the x86 assembly in 4613d3a530,
to the arm64 implementation.

This gives a minor speedup of around a couple percent.

Before:            Cortex A53        A55        A72        A73       A76  Apple M3
sgr_3x3_8bpc_neon:   368583.2   363654.2   279958.1   272065.1  169353.3  354.6
sgr_5x5_8bpc_neon:   258570.7   255018.5   200410.6   199478.3  117968.3  260.9
sgr_mix_8bpc_neon:   603698.1   577383.3   482468.3   436540.4  256632.9  541.8
After:
sgr_3x3_8bpc_neon:   367873.2   357884.1   275462.4   268363.9  165909.8  346.0
sgr_5x5_8bpc_neon:   254988.4   248184.2   190875.1   196939.1  120517.2  252.1
sgr_mix_8bpc_neon:   589204.7   563565.8   414025.6   427702.2  251651.2  533.4
2024-11-13 23:39:04 +02:00
Martin Storsjö 8bd31a92a5 arm: looprestoration: Split an overly long line 2024-11-13 15:38:20 +02:00
Luca Barbato 330e20672e x86: Use the decl and init macros for put_8tap and prep_8tap 2024-11-10 14:18:23 +01:00
Luca Barbato f966172feb loongarch: Use the decl and init macros for put_8tap and prep_8tap 2024-11-10 14:01:36 +01:00
Luca Barbato a403b575b1 mc: Factor out the decl and init macros
They can be used across arches.
2024-11-10 14:00:19 +01:00
Luca Barbato ac1fa6cbca ppc: use a jumptable for the blends
It makes the code tidier and the runtime is not slow.
2024-11-10 12:35:04 +01:00
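A jump table for width-specialised blend functions, as mentioned in commit ac1fa6cbca, can be sketched in plain C as an array of function pointers indexed by log2 of the width. This is an assumption about the general shape, not the ppc code: all names are hypothetical, the scalar kernel stands in for the vector code, and the rounding formula with 6 bit masks is assumed to match the blend operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*blend_fn)(uint8_t *dst, ptrdiff_t stride,
                         const uint8_t *tmp, const uint8_t *mask, int h);

/* Generates one scalar blend function per block width W.
 * Each output pixel is (dst*(64-m) + tmp*m + 32) >> 6 with m in 0..64. */
#define BLEND_FN(W)                                                        \
static void blend_w##W(uint8_t *dst, ptrdiff_t stride,                     \
                       const uint8_t *tmp, const uint8_t *mask, int h)     \
{                                                                          \
    for (int y = 0; y < h; y++, dst += stride, tmp += W, mask += W)        \
        for (int x = 0; x < W; x++)                                        \
            dst[x] = (uint8_t)((dst[x] * (64 - mask[x]) +                  \
                                tmp[x] * mask[x] + 32) >> 6);              \
}

BLEND_FN(4)
BLEND_FN(8)
BLEND_FN(16)
BLEND_FN(32)

/* Table indexed by log2(w) - 2, covering w = 4, 8, 16, 32. */
static const blend_fn blend_tbl[4] = { blend_w4, blend_w8, blend_w16, blend_w32 };

static void blend_dispatch(uint8_t *dst, ptrdiff_t stride, const uint8_t *tmp,
                           const uint8_t *mask, int w, int h)
{
    int idx = 0;
    assert(w == 4 || w == 8 || w == 16 || w == 32);
    while ((4 << idx) < w)
        idx++;
    blend_tbl[idx](dst, stride, tmp, mask, h);
}
```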
Luca Barbato 4f088e42cb ppc: blend_h pwr9 implementation
blend_h_w2_8bpc_pwr9:       18.4 ( 1.20x)
blend_h_w4_8bpc_pwr9:       27.2 ( 1.26x)
blend_h_w8_8bpc_pwr9:       27.9 ( 2.22x)
blend_h_w16_8bpc_pwr9:      35.1 ( 3.28x)
blend_h_w32_8bpc_pwr9:      57.4 ( 3.88x)
blend_h_w64_8bpc_pwr9:      97.9 ( 4.70x)
blend_h_w128_8bpc_pwr9:    207.6 ( 5.18x)
2024-11-10 12:35:04 +01:00
Luca Barbato 423cf6e2bf ppc: blend_v pwr9 implementation
blend_v_w2_8bpc_pwr9:      25.0 ( 1.12x)
blend_v_w4_8bpc_pwr9:      79.3 ( 1.35x)
blend_v_w8_8bpc_pwr9:      79.5 ( 2.43x)
blend_v_w16_8bpc_pwr9:    108.0 ( 3.58x)
blend_v_w32_8bpc_pwr9:    153.5 ( 4.69x)
2024-11-10 12:35:04 +01:00
Luca Barbato 08681fdf13 ppc: blend pwr9 implementation
blend_w4_8bpc_pwr9:      14.4 ( 1.90x)
blend_w8_8bpc_pwr9:      19.9 ( 3.62x)
blend_w16_8bpc_pwr9:     50.6 ( 5.17x)
blend_w32_8bpc_pwr9:    125.8 ( 5.33x)
2024-11-10 12:35:04 +01:00
Brad Smith and Jean-Baptiste Kempf 93f12c117a Provide dav1d_getauxval() wrapper for getauxval() and elf_aux_info() 2024-11-05 13:10:58 +00:00
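A wrapper of the kind named in commit 93f12c117a can be sketched as follows. This is not the dav1d implementation: the name `portable_getauxval` is hypothetical, and selecting the backend through OS macros (rather than build-system checks) is an assumption made to keep the example self-contained.

```c
#if defined(__linux__)
#include <sys/auxv.h>

/* Linux: getauxval() returns the value directly, or 0 if not found. */
static unsigned long portable_getauxval(unsigned long type)
{
    return getauxval(type);
}

#elif defined(__FreeBSD__) || defined(__OpenBSD__)
#include <sys/auxv.h>

/* FreeBSD / OpenBSD: elf_aux_info() writes the value into a buffer
 * and returns 0 on success. */
static unsigned long portable_getauxval(unsigned long type)
{
    unsigned long value = 0;
    if (elf_aux_info((int)type, &value, (int)sizeof(value)) != 0)
        return 0;
    return value;
}

#else

/* Unknown platform: report that nothing is available. */
static unsigned long portable_getauxval(unsigned long type)
{
    (void)type;
    return 0;
}
#endif
```

Callers can then query a capability word such as AT_HWCAP through one function, and a return value of 0 covers both "entry missing" and "no backend".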
Nathan E. Egge a17c862576 riscv64/mc: Only process w*3/4 elements in blend_v
Setting VL for this function only impacts the 16bpc performance and only
on the SpacemiT K1 which has two vector units of length 128b each.

Kendryte K230                Before             After         Delta

blend_v_w2_8bpc_c:        220.0 ( 1.00x)    221.3 ( 1.00x)    0.59%
blend_v_w2_8bpc_rvv:      145.7 ( 1.51x)    148.2 ( 1.49x)    1.72%
blend_v_w4_8bpc_c:        942.1 ( 1.00x)    943.7 ( 1.00x)    0.17%
blend_v_w4_8bpc_rvv:      240.4 ( 3.92x)    242.9 ( 3.89x)    1.04%
blend_v_w8_8bpc_c:       1782.3 ( 1.00x)   1783.8 ( 1.00x)    0.08%
blend_v_w8_8bpc_rvv:      252.6 ( 7.06x)    254.9 ( 7.00x)    0.91%
blend_v_w16_8bpc_c:      3650.9 ( 1.00x)   3647.0 ( 1.00x)   -0.11%
blend_v_w16_8bpc_rvv:     495.5 ( 7.37x)    494.4 ( 7.38x)   -0.22%
blend_v_w32_8bpc_c:      7013.0 ( 1.00x)   7018.2 ( 1.00x)    0.07%
blend_v_w32_8bpc_rvv:     807.9 ( 8.68x)    802.0 ( 8.75x)   -0.73%

blend_v_w2_16bpc_c:       226.1 ( 1.00x)    225.5 ( 1.00x)   -0.27%
blend_v_w2_16bpc_rvv:     148.6 ( 1.52x)    148.9 ( 1.51x)    0.20%
blend_v_w4_16bpc_c:      1010.7 ( 1.00x)   1006.7 ( 1.00x)   -0.40%
blend_v_w4_16bpc_rvv:     306.7 ( 3.30x)    307.4 ( 3.27x)    0.23%
blend_v_w8_16bpc_c:      1990.2 ( 1.00x)   1996.1 ( 1.00x)    0.30%
blend_v_w8_16bpc_rvv:     519.5 ( 3.83x)    523.4 ( 3.81x)    0.75%
blend_v_w16_16bpc_c:     3744.5 ( 1.00x)   3742.4 ( 1.00x)   -0.06%
blend_v_w16_16bpc_rvv:    899.6 ( 4.16x)    906.4 ( 4.13x)    0.76%
blend_v_w32_16bpc_c:     7047.5 ( 1.00x)   7079.3 ( 1.00x)    0.45%
blend_v_w32_16bpc_rvv:   1475.5 ( 4.78x)   1483.3 ( 4.77x)    0.53%

SpacemiT K1                  Before             After         Delta

blend_v_w2_8bpc_c:        216.3 ( 1.00x)    214.4 ( 1.00x)   -0.88%
blend_v_w2_8bpc_rvv:      144.0 ( 1.50x)    143.6 ( 1.49x)   -0.28%
blend_v_w4_8bpc_c:        919.8 ( 1.00x)    918.1 ( 1.00x)   -0.18%
blend_v_w4_8bpc_rvv:      236.6 ( 3.89x)    236.4 ( 3.88x)   -0.08%
blend_v_w8_8bpc_c:       1739.3 ( 1.00x)   1736.8 ( 1.00x)   -0.14%
blend_v_w8_8bpc_rvv:      236.8 ( 7.34x)    236.3 ( 7.35x)   -0.21%
blend_v_w16_8bpc_c:      3374.7 ( 1.00x)   3374.9 ( 1.00x)    0.01%
blend_v_w16_8bpc_rvv:     297.0 (11.36x)    296.8 (11.37x)   -0.07%
blend_v_w32_8bpc_c:      6647.5 ( 1.00x)   6645.5 ( 1.00x)   -0.03%
blend_v_w32_8bpc_rvv:     403.3 (16.48x)    402.4 (16.51x)   -0.22%

blend_v_w2_16bpc_c:       221.4 ( 1.00x)    220.1 ( 1.00x)   -0.59%
blend_v_w2_16bpc_rvv:     146.3 ( 1.51x)    147.3 ( 1.49x)    0.68%
blend_v_w4_16bpc_c:       973.3 ( 1.00x)    972.7 ( 1.00x)   -0.06%
blend_v_w4_16bpc_rvv:     280.3 ( 3.47x)    282.1 ( 3.45x)    0.64%
blend_v_w8_16bpc_c:      1814.8 ( 1.00x)   1816.2 ( 1.00x)    0.08%
blend_v_w8_16bpc_rvv:     376.6 ( 4.82x)    376.9 ( 4.82x)    0.08%
blend_v_w16_16bpc_c:     3485.5 ( 1.00x)   3485.5 ( 1.00x)    0.00%
blend_v_w16_16bpc_rvv:    531.1 ( 6.56x)    525.6 ( 6.63x)   -1.04%
blend_v_w32_16bpc_c:     6788.3 ( 1.00x)   6778.8 ( 1.00x)   -0.14%
blend_v_w32_16bpc_rvv:    904.5 ( 7.51x)    854.6 ( 7.93x)   -5.52%
2024-11-05 04:11:55 +00:00
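The title of commit a17c862576 refers to blend_v only needing the leftmost w*3/4 columns of each row. A scalar sketch of that idea follows. It is a simplified illustration and not the dav1d source: the function name is hypothetical, the per-column mask is passed in by the caller, and the rounding formula with 6 bit masks is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar blend_v: blends tmp into dst with a per-column
 * mask, touching only the first w*3/4 columns of every row. */
static void blend_v_sketch(uint8_t *dst, ptrdiff_t dst_stride,
                           const uint8_t *tmp, const uint8_t *mask,
                           int w, int h)
{
    const int active = (w * 3) >> 2;  /* columns that need processing */
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < active; x++)
            dst[x] = (uint8_t)((dst[x] * (64 - mask[x]) +
                                tmp[x] * mask[x] + 32) >> 6);
        dst += dst_stride;
        tmp += w;
    }
}
```

In a vector implementation the same bound would presumably be applied by setting the vector length to the active column count instead of the full width.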
Nathan E. Egge 907dd87191 riscv64/mc16: Unroll 16bpc RVV blend_v 2x
Kendryte K230                Before             After         Delta

blend_v_w2_16bpc_c:       225.8 ( 1.00x)    225.7 ( 1.00x)   -0.04%
blend_v_w2_16bpc_rvv:     194.7 ( 1.16x)    148.6 ( 1.52x)  -23.68%
blend_v_w4_16bpc_c:      1011.3 ( 1.00x)   1005.8 ( 1.00x)   -0.54%
blend_v_w4_16bpc_rvv:     387.2 ( 2.61x)    305.4 ( 3.29x)  -21.13%
blend_v_w8_16bpc_c:      1878.5 ( 1.00x)   1872.7 ( 1.00x)   -0.31%
blend_v_w8_16bpc_rvv:     475.3 ( 3.95x)    435.6 ( 4.30x)   -8.35%
blend_v_w16_16bpc_c:     3601.9 ( 1.00x)   3601.6 ( 1.00x)   -0.01%
blend_v_w16_16bpc_rvv:    891.2 ( 4.04x)    892.7 ( 4.03x)    0.17%
blend_v_w32_16bpc_c:     7043.7 ( 1.00x)   7058.8 ( 1.00x)    0.21%
blend_v_w32_16bpc_rvv:   1384.5 ( 5.09x)   1478.0 ( 4.78x)    6.75%

SpacemiT K1                  Before             After         Delta

blend_v_w2_16bpc_c:       222.6 ( 1.00x)    220.5 ( 1.00x)   -0.94%
blend_v_w2_16bpc_rvv:     195.7 ( 1.14x)    146.6 ( 1.50x)  -25.09%
blend_v_w4_16bpc_c:       972.3 ( 1.00x)    972.0 ( 1.00x)   -0.03%
blend_v_w4_16bpc_rvv:     349.1 ( 2.79x)    281.9 ( 3.45x)  -19.25%
blend_v_w8_16bpc_c:      1812.1 ( 1.00x)   1813.0 ( 1.00x)    0.05%
blend_v_w8_16bpc_rvv:     481.5 ( 3.76x)    376.0 ( 4.82x)  -21.91%
blend_v_w16_16bpc_c:     3488.4 ( 1.00x)   3484.6 ( 1.00x)   -0.11%
blend_v_w16_16bpc_rvv:    608.7 ( 5.73x)    523.4 ( 6.66x)  -14.01%
blend_v_w32_16bpc_c:     6795.3 ( 1.00x)   6792.4 ( 1.00x)   -0.04%
blend_v_w32_16bpc_rvv:    934.8 ( 7.27x)    907.3 ( 7.49x)   -2.94%
2024-11-04 20:20:37 +00:00
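The columns in these tables appear to follow two simple formulas, inferred here from the numbers rather than stated in the commit messages: the value in parentheses is the C time divided by the function's time in the same run, and Delta is the relative change from Before to After. The helper names below are hypothetical.

```c
/* Relative change in percent; negative means the After run is faster. */
static double delta_percent(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Speedup factor relative to the C reference from the same run. */
static double speedup(double c_time, double simd_time)
{
    return c_time / simd_time;
}
```

For the Kendryte K230 row blend_v_w2_16bpc_rvv in commit 907dd87191, delta_percent(194.7, 148.6) is about -23.68 and speedup(225.7, 148.6) is about 1.52, which agrees with the table.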
Nathan E. Egge 9710e7de9c riscv64/mc16: Branchless vsetvl in blend_v function
Kendryte K230                Before             After         Delta

blend_v_w2_16bpc_c:       226.0 ( 1.00x)    226.1 ( 1.00x)    0.04%
blend_v_w2_16bpc_rvv:     194.0 ( 1.16x)    193.9 ( 1.17x)   -0.05%
blend_v_w4_16bpc_c:      1011.8 ( 1.00x)   1009.4 ( 1.00x)   -0.24%
blend_v_w4_16bpc_rvv:     392.7 ( 2.58x)    390.8 ( 2.58x)   -0.48%
blend_v_w8_16bpc_c:      1987.9 ( 1.00x)   1988.0 ( 1.00x)    0.01%
blend_v_w8_16bpc_rvv:     561.5 ( 3.54x)    560.2 ( 3.55x)   -0.23%
blend_v_w16_16bpc_c:     3738.1 ( 1.00x)   3739.1 ( 1.00x)    0.03%
blend_v_w16_16bpc_rvv:    934.1 ( 4.00x)    932.2 ( 4.01x)   -0.20%
blend_v_w32_16bpc_c:     7031.0 ( 1.00x)   7030.1 ( 1.00x)   -0.01%
blend_v_w32_16bpc_rvv:   1403.3 ( 5.01x)   1395.8 ( 5.04x)   -0.53%

SpacemiT K1                  Before             After         Delta

blend_v_w2_16bpc_c:       221.0 ( 1.00x)    221.2 ( 1.00x)    0.09%
blend_v_w2_16bpc_rvv:     195.2 ( 1.13x)    196.0 ( 1.13x)    0.41%
blend_v_w4_16bpc_c:       969.8 ( 1.00x)    971.9 ( 1.00x)    0.22%
blend_v_w4_16bpc_rvv:     348.8 ( 2.78x)    349.1 ( 2.78x)    0.09%
blend_v_w8_16bpc_c:      1812.6 ( 1.00x)   1814.9 ( 1.00x)    0.13%
blend_v_w8_16bpc_rvv:     486.1 ( 3.73x)    484.3 ( 3.75x)   -0.37%
blend_v_w16_16bpc_c:     3483.0 ( 1.00x)   3485.1 ( 1.00x)    0.06%
blend_v_w16_16bpc_rvv:    608.7 ( 5.72x)    607.4 ( 5.74x)   -0.21%
blend_v_w32_16bpc_c:     6791.8 ( 1.00x)   6794.2 ( 1.00x)    0.04%
blend_v_w32_16bpc_rvv:    940.6 ( 7.22x)    942.1 ( 7.21x)    0.16%
2024-11-04 19:46:26 +00:00
Nathan E. Egge 28d1c21779 riscv64/mc16: Add VLEN=256 8bpc RVV blend_v function
SpacemiT K1                  Before             After         Delta

blend_v_w2_16bpc_c:       221.5 ( 1.00x)    220.3 ( 1.00x)   -0.54%
blend_v_w2_16bpc_rvv:     193.5 ( 1.14x)    194.3 ( 1.13x)    0.41%
blend_v_w4_16bpc_c:       968.8 ( 1.00x)    967.2 ( 1.00x)   -0.17%
blend_v_w4_16bpc_rvv:     442.2 ( 2.19x)    347.4 ( 2.78x)  -21.44%
blend_v_w8_16bpc_c:      1809.4 ( 1.00x)   1811.2 ( 1.00x)    0.10%
blend_v_w8_16bpc_rvv:     557.4 ( 3.25x)    483.2 ( 3.75x)  -13.31%
blend_v_w16_16bpc_c:     3481.4 ( 1.00x)   3473.4 ( 1.00x)   -0.23%
blend_v_w16_16bpc_rvv:    844.3 ( 4.12x)    603.1 ( 5.76x)  -28.57%
blend_v_w32_16bpc_c:     6783.1 ( 1.00x)   6749.8 ( 1.00x)   -0.49%
blend_v_w32_16bpc_rvv:   1406.1 ( 4.82x)    919.4 ( 7.34x)  -34.61%
2024-11-04 18:52:32 +00:00
Nathan E. Egge aa2deb898e riscv64/mc16: Add 16bpc RVV blend_v function
Kendryte K230

blend_v_w2_16bpc_c:       226.5 ( 1.00x)
blend_v_w2_16bpc_rvv:     192.2 ( 1.18x)
blend_v_w4_16bpc_c:      1010.3 ( 1.00x)
blend_v_w4_16bpc_rvv:     390.5 ( 2.59x)
blend_v_w8_16bpc_c:      1994.2 ( 1.00x)
blend_v_w8_16bpc_rvv:     561.7 ( 3.55x)
blend_v_w16_16bpc_c:     3737.9 ( 1.00x)
blend_v_w16_16bpc_rvv:    928.0 ( 4.03x)
blend_v_w32_16bpc_c:     7064.7 ( 1.00x)
blend_v_w32_16bpc_rvv:   1428.9 ( 4.94x)

SpacemiT K1

blend_v_w2_16bpc_c:       220.8 ( 1.00x)
blend_v_w2_16bpc_rvv:     193.5 ( 1.14x)
blend_v_w4_16bpc_c:       967.3 ( 1.00x)
blend_v_w4_16bpc_rvv:     439.5 ( 2.20x)
blend_v_w8_16bpc_c:      1810.2 ( 1.00x)
blend_v_w8_16bpc_rvv:     555.3 ( 3.26x)
blend_v_w16_16bpc_c:     3476.4 ( 1.00x)
blend_v_w16_16bpc_rvv:    830.9 ( 4.18x)
blend_v_w32_16bpc_c:     6772.9 ( 1.00x)
blend_v_w32_16bpc_rvv:   1356.3 ( 4.99x)
2024-11-04 18:52:30 +00:00
Nathan E. Egge c783088fe7 riscv64/mc16: Unroll 16bpc RVV blend 2x
Kendryte K230              Before               After         Delta

blend_w4_16bpc_c:       210.0 ( 1.00x)      208.9 ( 1.00x)   -0.52%
blend_w4_16bpc_rvv:      88.5 ( 2.37x)       66.2 ( 3.15x)  -25.20%
blend_w8_16bpc_c:       614.1 ( 1.00x)      613.5 ( 1.00x)   -0.10%
blend_w8_16bpc_rvv:     143.1 ( 4.29x)      126.9 ( 4.83x)  -11.32%
blend_w16_16bpc_c:     2371.2 ( 1.00x)     2371.3 ( 1.00x)    0.00%
blend_w16_16bpc_rvv:    461.1 ( 5.14x)      413.2 ( 5.74x)  -10.39%
blend_w32_16bpc_c:     5998.4 ( 1.00x)     5998.4 ( 1.00x)    0.00%
blend_w32_16bpc_rvv:    978.4 ( 6.13x)     1013.1 ( 5.92x)    3.55%

SpacemiT K1                Before               After         Delta

blend_w4_16bpc_c:       205.8 ( 1.00x)      205.9 ( 1.00x)    0.05%
blend_w4_16bpc_rvv:      80.9 ( 2.54x)       64.9 ( 3.17x)  -19.78%
blend_w8_16bpc_c:       599.9 ( 1.00x)      599.9 ( 1.00x)    0.00%
blend_w8_16bpc_rvv:     134.4 ( 4.46x)      101.9 ( 5.89x)  -24.18%
blend_w16_16bpc_c:     2316.5 ( 1.00x)     2316.5 ( 1.00x)    0.00%
blend_w16_16bpc_rvv:    302.0 ( 7.67x)      262.8 ( 8.81x)  -12.98%
blend_w32_16bpc_c:     5861.9 ( 1.00x)     5861.4 ( 1.00x)   -0.01%
blend_w32_16bpc_rvv:    589.6 ( 9.94x)      602.2 ( 9.73x)    2.14%
2024-10-31 07:11:35 +00:00
Nathan E. Egge 67c60d76e1 riscv64/mc16: Branchless vsetvl in blend function
Kendryte K230              Before               After         Delta

blend_w4_16bpc_c:       208.8 ( 1.00x)      209.9 ( 1.00x)    0.53%
blend_w4_16bpc_rvv:      85.9 ( 2.43x)       88.6 ( 2.37x)    3.14%
blend_w8_16bpc_c:       613.2 ( 1.00x)      614.3 ( 1.00x)    0.18%
blend_w8_16bpc_rvv:     145.4 ( 4.22x)      143.1 ( 4.29x)   -1.58%
blend_w16_16bpc_c:     2371.9 ( 1.00x)     2373.6 ( 1.00x)    0.07%
blend_w16_16bpc_rvv:    464.0 ( 5.11x)      461.2 ( 5.15x)   -0.60%
blend_w32_16bpc_c:     6005.6 ( 1.00x)     6007.7 ( 1.00x)    0.03%
blend_w32_16bpc_rvv:    981.6 ( 6.12x)      979.4 ( 6.13x)   -0.22%

SpacemiT K1                Before               After         Delta

blend_w4_16bpc_c:       206.4 ( 1.00x)      205.7 ( 1.00x)   -0.34%
blend_w4_16bpc_rvv:      79.5 ( 2.60x)       81.0 ( 2.54x)    1.89%
blend_w8_16bpc_c:       600.7 ( 1.00x)      599.7 ( 1.00x)   -0.17%
blend_w8_16bpc_rvv:     133.3 ( 4.51x)      134.1 ( 4.47x)    0.60%
blend_w16_16bpc_c:     2315.9 ( 1.00x)     2315.2 ( 1.00x)   -0.03%
blend_w16_16bpc_rvv:    305.2 ( 7.59x)      300.7 ( 7.70x)   -1.47%
blend_w32_16bpc_c:     5861.1 ( 1.00x)     5860.2 ( 1.00x)   -0.02%
blend_w32_16bpc_rvv:    592.5 ( 9.89x)      589.5 ( 9.94x)   -0.51%
2024-10-31 07:11:35 +00:00
Nathan E. Egge 3437a26b3d riscv64/mc16: Add VLEN=256 8bpc RVV blend function
SpacemiT K1                Before               After         Delta

blend_w4_16bpc_c:       206.8 ( 1.00x)      206.0 ( 1.00x)   -0.39%
blend_w4_16bpc_rvv:      95.8 ( 2.16x)       77.8 ( 2.65x)  -18.79%
blend_w8_16bpc_c:       600.4 ( 1.00x)      600.1 ( 1.00x)   -0.05%
blend_w8_16bpc_rvv:     161.7 ( 3.71x)      131.3 ( 4.57x)  -18.80%
blend_w16_16bpc_c:     2317.6 ( 1.00x)     2316.5 ( 1.00x)   -0.05%
blend_w16_16bpc_rvv:    459.6 ( 5.04x)      302.9 ( 7.65x)  -34.09%
blend_w32_16bpc_c:     5863.0 ( 1.00x)     5863.3 ( 1.00x)    0.01%
blend_w32_16bpc_rvv:    992.7 ( 5.91x)      578.1 (10.14x)  -41.76%
2024-10-31 07:11:35 +00:00