Optimize the width = 4 case of ipred_v_8bpc_neon by using simple stores
instead of lane stores, which can improve performance on some CPUs.
Relative runtime after this patch on some Cortex CPUs:
ipred_v: w4
Cortex-A55: 1.041x
Cortex-A510: 0.297x
Cortex-A520: 0.748x
Cortex-A76: 0.866x
Cortex-A78: 0.856x
Cortex-A715: 0.874x
Cortex-A720: 0.875x
Cortex-A725: 0.868x
Cortex-X1: 1.013x
Cortex-X3: 1.000x
Cortex-X925: 1.000x
This patch adds a vectorised variant of the mv_projection calculation
and a faster initialisation of motion vectors for load_tmvs_neon.
Checkasm uplifts after this patch on some Neoverse and Cortex CPU cores
compared to the C reference compiled with GCC-13 and Clang-19:
                 GCC    Clang
AWS Graviton 4:  1.62x  1.59x
Cortex-X4:       1.45x  1.46x
Cortex-X3:       1.68x  1.69x
Cortex-X1:       1.55x  1.52x
Cortex-A720:     1.54x  1.57x
Cortex-A715:     1.47x  1.55x
Cortex-A78:      1.21x  1.18x
Cortex-A76:      1.38x  1.37x
Cortex-A72:      1.08x  1.11x
Cortex-A520:     0.97x  1.18x
Cortex-A510:     0.99x  1.14x
Cortex-A55:      1.16x  1.23x
This patch increases the .text size by ~660 bytes, but the new code is
about 0.5 KiB smaller than the reference implementation.
Some instruction sequences can be merged after the lane load/store
patch (ec5c3052cf).
This change simplifies the loading of filter weights, saving 288
bytes in the Armv8.0 Neon path of the 6-tap and 8-tap MC functions.
The horizontal parts of the 6-tap HV subpel filters can be further
improved by using pointer arithmetic and by saving some instructions
(EXTs) in their data rearrangement code.
Relative runtime of micro benchmarks after this patch on Cortex CPU
cores:
HBD mct hv    X1      A78     A76     A72     A55
regular w8:   0.952x  0.989x  0.924x  0.973x  0.976x
regular w16:  0.961x  0.993x  0.928x  0.952x  0.971x
regular w32:  0.964x  0.996x  0.930x  0.973x  0.972x
regular w64:  0.963x  0.997x  0.930x  0.969x  0.974x
The 6-tap horizontal subpel filters can be further improved by using
pointer arithmetic and by saving some instructions (EXTs) in their
data rearrangement code.
Relative runtime of micro benchmarks after this patch on some Cortex
CPU cores:
regular:  X1      A78     A76     A55
mc w8:    0.915x  0.937x  0.900x  0.982x
mc w16:   0.917x  0.947x  0.911x  0.971x
mc w32:   0.914x  0.938x  0.873x  0.961x
mc w64:   0.918x  0.932x  0.882x  0.964x
The reduction parts of the horizontal HBD MC filters use
SRSHL+SQXTUN+SRSHL instruction sequences. In the horizontal case this
can be rewritten using a single SQSHRUN instruction with an additional
rounding value (34 for 10-bit and 40 for 12-bit).
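One way to see where 34 and 40 could come from is sketched below in scalar C. It assumes 14 - bitdepth intermediate bits, a first rounding shift by 6 minus that amount, a second rounding shift by that amount, and a final clamp to the pixel range; these parameters and all function names are assumptions for illustration, not taken from the patch. Under them, two nested rounding shifts fold into one truncating shift by 6 with a combined rounding constant.

```c
#include <assert.h>
#include <stdint.h>

/* Floor (arithmetic) right shift, written portably for negative inputs. */
static int32_t asr(int32_t x, int n) {
    return x >= 0 ? x >> n : -((-x + (1 << n) - 1) >> n);
}

/* Rounding right shift: add half of the divisor, then floor-shift. */
static int32_t rshift_round(int32_t x, int n) {
    return asr(x + (1 << (n - 1)), n);
}

static int32_t clamp(int32_t x, int32_t lo, int32_t hi) {
    return x < lo ? lo : x > hi ? hi : x;
}

/* Two rounding shifts with an unsigned 16-bit saturation in between. */
static int32_t reduce_two_step(int32_t sum, int bitdepth) {
    assert(bitdepth == 10 || bitdepth == 12);
    const int inter = 14 - bitdepth;           /* assumed intermediate bits */
    int32_t t = rshift_round(sum, 6 - inter);  /* first rounding shift */
    t = clamp(t, 0, 65535);                    /* saturating narrow */
    t = rshift_round(t, inter);                /* second rounding shift */
    return clamp(t, 0, (1 << bitdepth) - 1);   /* assumed final pixel clamp */
}

/* One truncating shift by 6 with the combined rounding value added first. */
static int32_t reduce_one_step(int32_t sum, int bitdepth) {
    assert(bitdepth == 10 || bitdepth == 12);
    const int32_t round = bitdepth == 10 ? 34 : 40;  /* 32 + 2 and 32 + 8 */
    int32_t t = asr(sum + round, 6);
    t = clamp(t, 0, 65535);                    /* saturating narrow */
    return clamp(t, 0, (1 << bitdepth) - 1);   /* assumed final pixel clamp */
}

/* Returns 1 if both reductions agree for every sum in [lo, hi]. */
static int reductions_agree(int bitdepth, int32_t lo, int32_t hi) {
    for (int32_t s = lo; s <= hi; s++)
        if (reduce_two_step(s, bitdepth) != reduce_one_step(s, bitdepth))
            return 0;
    return 1;
}
```

With these assumptions the constant is 32 (half of 64) plus the rounding term of the first shift: 2 for 10-bit and 8 for 12-bit.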
Relative runtime of micro benchmarks after this patch on some Cortex
CPU cores:
regular:  X1      A78     A76     A55
mc w2:    0.847x  0.864x  0.822x  0.859x
mc w4:    0.889x  0.994x  0.868x  0.917x
mc w8:    0.857x  0.911x  0.915x  0.978x
mc w16:   0.890x  0.982x  0.868x  0.974x
mc w32:   0.904x  0.991x  0.873x  0.967x
mc w64:   0.919x  1.003x  0.860x  0.970x
MS armasm64 cannot assemble some SVE instructions with immediate
operands, e.g.:
sub z0.h, z0.h, #8192
The proper form is:
sub z0.h, z0.h, #32, lsl #8
This patch contains the needed fixes.
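The rewrite from #8192 to #32, lsl #8 follows the SVE ADD/SUB immediate encoding: an unsigned 8-bit value, optionally shifted left by 8. A minimal C sketch of that split is shown below; the type and function names are hypothetical helpers for illustration and not part of any assembler.

```c
#include <stdint.h>

/* Result of splitting an immediate into the "#imm8{, lsl #8}" form. */
typedef struct {
    int ok;          /* 1 if the value is encodable, 0 otherwise */
    unsigned imm8;   /* 0..255 */
    unsigned shift;  /* 0 or 8 */
} sve_imm;

/* Hypothetical helper: find the explicit form of an SVE add/sub immediate. */
static sve_imm split_sve_imm(uint32_t value) {
    sve_imm r = { 0, 0, 0 };
    if (value <= 255) {
        /* Fits directly in the 8-bit field. */
        r.ok = 1;
        r.imm8 = value;
    } else if ((value & 0xff) == 0 && (value >> 8) <= 255) {
        /* A multiple of 256 up to 65280: use the shifted form. */
        r.ok = 1;
        r.imm8 = value >> 8;
        r.shift = 8;
    }
    return r;
}
```

For 8192 this gives imm8 = 32 and shift = 8, matching the form shown above.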
The macro parameter \xmy of filter_8tap_fn was used incorrectly as a
pointer instead of \lsrc. They refer to the same register, but in
different contexts.
The constants used for the subpel filters were placed in the .text
section for simplicity and peak performance, but this does not work on
systems with execute-only .text sections (e.g. OpenBSD).
The performance cost of moving the constants to the .rodata section
is small and mostly within measurement noise.
The DotProd/I8MM horizontal and HV/2D subpel filters use a -4 offset
for sampling instead of -3 to be better aligned in some cases. This
resulted in an out-of-bounds access, which led to crashes.
This patch fixes it.
Move the BTI landing pads out of the inner loops of the prep_neon
function. Only the width=4 and width=8 cases are affected.
If BTI is enabled, moving AARCH64_VALID_JUMP_TARGET out of the inner
loops gives better execution speed on Cortex-A510. Runtime relative to
the original (lower is better):
w4: 0.969x
w8: 0.722x
Out-of-order cores are not affected.
Move the BTI landing pads out of the inner loops of the put_neon
function. The only exception is the width=16 case, where it is already
outside of the loops.
When BTI is enabled, the relative runtime after omitting
AARCH64_VALID_JUMP_TARGET from the inner loops on Cortex-A510 (lower
is better):
w2: 0.981x
w4: 0.991x
w8: 0.612x
w32: 0.687x
w64: 0.813x
w128: 0.892x
Out-of-order CPUs are mostly unaffected.
Rewrite the accumulator initializations of the horizontal part of the
2D filters with zero register fills. This can improve performance on
out-of-order CPUs that can zero vector registers with zero latency.
Zeroed accumulators imply the use of rounding shifts at the end of
the filters.
The only exception is the very short *hv_filter4*, where the longer
latency of rounding shift could decrease the performance.
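The trade-off between a pre-loaded rounding constant and a zeroed accumulator can be sketched in scalar C as follows. This is only an illustration of the arithmetic: the function names, the tap count, and the shift amount are hypothetical, and the real code operates on vector registers.

```c
#include <stdint.h>

/* Floor (arithmetic) right shift, written portably for negative inputs. */
static int32_t floor_shr(int32_t x, int n) {
    return x >= 0 ? x >> n : -((-x + (1 << n) - 1) >> n);
}

/* Variant 1: the accumulator starts with the rounding constant, so a
 * plain (truncating) shift is enough at the end. */
static int32_t filter_preloaded(const int16_t *src, const int8_t *f,
                                int taps, int shift) {
    int32_t acc = 1 << (shift - 1);
    for (int i = 0; i < taps; i++)
        acc += f[i] * src[i];
    return floor_shr(acc, shift);
}

/* Variant 2: the accumulator starts at zero, so the rounding has to be
 * done by the final shift (what a rounding shift instruction provides). */
static int32_t filter_zeroed(const int16_t *src, const int8_t *f,
                             int taps, int shift) {
    int32_t acc = 0;
    for (int i = 0; i < taps; i++)
        acc += f[i] * src[i];
    return floor_shr(acc + (1 << (shift - 1)), shift);
}
```

Both variants return the same value; the difference is only where the rounding cost is paid, which is why a very short filter may prefer the first form.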
The *filter8* function uses a different (alternating) dot product
computation order for the DotProd+ feature level. This gives better
overall performance for out-of-order and some in-order CPU cores.
The i8mm version does not need to bias the loaded samples, so a
different instruction scheduling is beneficial, mostly affecting the
order of TBL instructions in the 8-tap case.
Relative performance of micro benchmarks (lower is better):
Cortex-X3:
mct_8tap_regular_w16_hv_8bpc_i8mm: 0.982x
mct_8tap_sharp_w16_hv_8bpc_i8mm: 0.979x
mct_8tap_regular_w8_hv_8bpc_i8mm: 0.972x
mct_8tap_sharp_w8_hv_8bpc_i8mm: 0.969x
mct_8tap_regular_w4_hv_8bpc_i8mm: 0.942x
mct_8tap_sharp_w4_hv_8bpc_i8mm: 0.935x
mc_8tap_regular_w16_hv_8bpc_i8mm: 0.988x
mc_8tap_sharp_w16_hv_8bpc_i8mm: 0.982x
mc_8tap_regular_w8_hv_8bpc_i8mm: 0.981x
mc_8tap_sharp_w8_hv_8bpc_i8mm: 0.975x
mc_8tap_regular_w4_hv_8bpc_i8mm: 0.998x
mc_8tap_sharp_w4_hv_8bpc_i8mm: 0.996x
mc_8tap_regular_w2_hv_8bpc_i8mm: 1.006x
mc_8tap_sharp_w2_hv_8bpc_i8mm: 0.993x
Cortex-A715:
mct_8tap_regular_w16_hv_8bpc_i8mm: 0.883x
mct_8tap_sharp_w16_hv_8bpc_i8mm: 0.931x
mct_8tap_regular_w8_hv_8bpc_i8mm: 0.882x
mct_8tap_sharp_w8_hv_8bpc_i8mm: 0.928x
mct_8tap_regular_w4_hv_8bpc_i8mm: 0.969x
mct_8tap_sharp_w4_hv_8bpc_i8mm: 0.934x
mc_8tap_regular_w16_hv_8bpc_i8mm: 0.881x
mc_8tap_sharp_w16_hv_8bpc_i8mm: 0.925x
mc_8tap_regular_w8_hv_8bpc_i8mm: 0.879x
mc_8tap_sharp_w8_hv_8bpc_i8mm: 0.925x
mc_8tap_regular_w4_hv_8bpc_i8mm: 0.917x
mc_8tap_sharp_w4_hv_8bpc_i8mm: 0.976x
mc_8tap_regular_w2_hv_8bpc_i8mm: 0.915x
mc_8tap_sharp_w2_hv_8bpc_i8mm: 0.972x
Cortex-A510:
mct_8tap_regular_w16_hv_8bpc_i8mm: 0.994x
mct_8tap_sharp_w16_hv_8bpc_i8mm: 0.949x
mct_8tap_regular_w8_hv_8bpc_i8mm: 0.987x
mct_8tap_sharp_w8_hv_8bpc_i8mm: 0.947x
mct_8tap_regular_w4_hv_8bpc_i8mm: 1.002x
mct_8tap_sharp_w4_hv_8bpc_i8mm: 0.999x
mc_8tap_regular_w16_hv_8bpc_i8mm: 0.989x
mc_8tap_sharp_w16_hv_8bpc_i8mm: 1.003x
mc_8tap_regular_w8_hv_8bpc_i8mm: 0.986x
mc_8tap_sharp_w8_hv_8bpc_i8mm: 1.000x
mc_8tap_regular_w4_hv_8bpc_i8mm: 1.007x
mc_8tap_sharp_w4_hv_8bpc_i8mm: 1.000x
mc_8tap_regular_w2_hv_8bpc_i8mm: 1.005x
mc_8tap_sharp_w2_hv_8bpc_i8mm: 1.000x
Replace the accumulator initializations of the vertical subpel
filters with register fills by zeros (which are usually zero latency
operations in this feature class). This implies the use of rounding
shifts at the end in the prep cases. Out-of-order CPU cores can
benefit from this change.
The width=16 case uses a simpler register duplication scheme that
relies on MOV instructions for the subsequent shuffles. This approach
uses a different register to load the data into for better instruction
scheduling and data dependency chain.
Relative performance of micro benchmarks (lower is better):
Cortex-X3:
mct_8tap_sharp_w16_v_8bpc_i8mm: 0.910x
mct_8tap_sharp_w8_v_8bpc_i8mm: 0.986x
mc_8tap_sharp_w16_v_8bpc_i8mm: 0.864x
mc_8tap_sharp_w8_v_8bpc_i8mm: 0.882x
mc_8tap_sharp_w4_v_8bpc_i8mm: 0.933x
mc_8tap_sharp_w2_v_8bpc_i8mm: 0.926x
Cortex-A715:
mct_8tap_sharp_w16_v_8bpc_i8mm: 0.855x
mct_8tap_sharp_w8_v_8bpc_i8mm: 0.784x
mct_8tap_sharp_w4_v_8bpc_i8mm: 1.069x
mc_8tap_sharp_w16_v_8bpc_i8mm: 0.850x
mc_8tap_sharp_w8_v_8bpc_i8mm: 0.779x
mc_8tap_sharp_w4_v_8bpc_i8mm: 0.971x
mc_8tap_sharp_w2_v_8bpc_i8mm: 0.975x
Cortex-A510:
mct_8tap_sharp_w16_v_8bpc_i8mm: 1.001x
mct_8tap_sharp_w8_v_8bpc_i8mm: 0.979x
mct_8tap_sharp_w4_v_8bpc_i8mm: 0.998x
mc_8tap_sharp_w16_v_8bpc_i8mm: 0.998x
mc_8tap_sharp_w8_v_8bpc_i8mm: 1.004x
mc_8tap_sharp_w4_v_8bpc_i8mm: 1.003x
mc_8tap_sharp_w2_v_8bpc_i8mm: 0.996x
Replace the accumulator initializations of the horizontal prep
filters with register fills by zeros. Most i8mm capable CPUs can do
these with zero latency, but we also need to use rounding shifts at
the end of the filter. We can see better performance with this
change on out-of-order CPUs.
Relative performance of micro benchmarks (lower is better):
Cortex-X3:
mct_8tap_sharp_w32_h_8bpc_i8mm: 0.914x
mct_8tap_sharp_w16_h_8bpc_i8mm: 0.906x
mct_8tap_sharp_w8_h_8bpc_i8mm: 0.877x
Cortex-A715:
mct_8tap_sharp_w32_h_8bpc_i8mm: 0.819x
mct_8tap_sharp_w16_h_8bpc_i8mm: 0.805x
mct_8tap_sharp_w8_h_8bpc_i8mm: 0.779x
Cortex-A510:
mct_8tap_sharp_w32_h_8bpc_i8mm: 0.999x
mct_8tap_sharp_w16_h_8bpc_i8mm: 1.001x
mct_8tap_sharp_w8_h_8bpc_i8mm: 0.996x
mct_8tap_sharp_w4_h_8bpc_i8mm: 0.915x
Simplify the TBL usage in the small block size (2, 4) parts of the 2D
(horizontal-vertical) put subpel filters. The 2-register TBLs are
replaced with the 1-register form because only the lower 64 bits of
the result are needed, and they can be extracted from a single source
register. Performance is not affected by this change.
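The equivalence behind this TBL simplification can be sketched with a scalar C model of the instruction. The model follows the architectural rule that out-of-range indices produce zero; the function names and the idea of comparing only the low 8 bytes are illustrative assumptions, not code from the patch.

```c
#include <stdint.h>

/* Model of a 1-register TBL: indices 0..15 select a byte of t0,
 * any other index yields 0. */
static void tbl1(uint8_t out[16], const uint8_t t0[16],
                 const uint8_t idx[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = idx[i] < 16 ? t0[idx[i]] : 0;
}

/* Model of a 2-register TBL: the table is t0 followed by t1,
 * indices 32 and above yield 0. */
static void tbl2(uint8_t out[16], const uint8_t t0[16],
                 const uint8_t t1[16], const uint8_t idx[16]) {
    for (int i = 0; i < 16; i++) {
        if (idx[i] < 16)
            out[i] = t0[idx[i]];
        else if (idx[i] < 32)
            out[i] = t1[idx[i] - 16];
        else
            out[i] = 0;
    }
}

/* Returns 1 if the lower 64 bits (8 bytes) of both results match. */
static int low_half_matches(const uint8_t t0[16], const uint8_t t1[16],
                            const uint8_t idx[16]) {
    uint8_t a[16], b[16];
    tbl1(a, t0, idx);
    tbl2(b, t0, t1, idx);
    for (int i = 0; i < 8; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

When every index feeding the low 8 bytes is below 16, the second table register never contributes to them, so the 1-register form gives the same low half.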
Simplify the inner loops of the DotProd code path of horizontal
subpel filters to avoid using 2-register TBL instructions. The
store part of block size 16 of the horizontal put case is also
simplified (str + add -> st1). This patch can improve performance
mostly on small cores like Cortex-A510 and newer. Other CPUs are
mostly unaffected.
Cortex-A510:
mct_8tap_sharp_w16_h_8bpc_dotprod: 2.77x -> 3.13x
mct_8tap_sharp_w32_h_8bpc_dotprod: 2.32x -> 2.56x
Cortex-A55:
mct_8tap_sharp_w16_h_8bpc_dotprod: 3.89x -> 3.89x
mct_8tap_sharp_w32_h_8bpc_dotprod: 3.35x -> 3.35x
Cortex-A715:
mct_8tap_sharp_w16_h_8bpc_dotprod: 3.79x -> 3.78x
mct_8tap_sharp_w32_h_8bpc_dotprod: 3.30x -> 3.30x
Cortex-A78:
mct_8tap_sharp_w16_h_8bpc_dotprod: 4.30x -> 4.31x
mct_8tap_sharp_w32_h_8bpc_dotprod: 3.79x -> 3.80x
Cortex-X3:
mct_8tap_sharp_w16_h_8bpc_dotprod: 4.74x -> 4.75x
mct_8tap_sharp_w32_h_8bpc_dotprod: 3.89x -> 3.91x
Cortex-X1:
mct_8tap_sharp_w16_h_8bpc_dotprod: 4.61x -> 4.62x
mct_8tap_sharp_w32_h_8bpc_dotprod: 3.67x -> 3.66x
Simplify the accumulator initializations of the DotProd code path of
vertical subpel filters. This also makes it possible for some CPUs to
use zero latency vector register moves. The load is also simplified
(ldr + add -> ld1) in the inner loop of vertical filter for block
size 16.
Add a \dot parameter to the filter_8tap_fn macro in preparation for
extending it with an i8mm code path. This patch also contains string
fixes and some instruction reorderings, along with some register
renaming to make the code more uniform. These changes don't affect
performance but simplify the code a bit.
The 8-tap sub-pel filters used for motion vector interpolation are:
regular, smooth, sharp. The regular and smooth filter kernels are
zero-padded, so they are effectively 6-tap filters (some of them are
5-tap or even 4-tap).
This patch specialises the high bit-depth versions of the
put_8tap_neon and prep_8tap_neon functions for 6-tap filters, avoiding
a lot of redundant work to multiply by and add zero. Wherever sharp
filtering is used, the 8-tap path will always be selected.
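The saving from a zero-padded kernel can be sketched in scalar C. The coefficient rows below are illustrative values chosen for this example and are not copied from the codec's filter tables; the function names are hypothetical as well.

```c
#include <stdint.h>

/* Illustrative kernels: the first has zero outer taps, the second does not. */
static const int8_t kernel_padded[8] = { 0, 1, -3, 63, 4, -2, 1, 0 };
static const int8_t kernel_full[8]   = { -1, 3, -7, 63, 8, -3, 1, 0 };

/* Full 8-tap sum over src[-3] .. src[4]. */
static int32_t filter_8tap(const int16_t *src, const int8_t f[8]) {
    int32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += f[i] * src[i - 3];
    return sum;
}

/* 6-tap sum over src[-2] .. src[3]: skips the two outer taps. */
static int32_t filter_6tap(const int16_t *src, const int8_t f[8]) {
    int32_t sum = 0;
    for (int i = 1; i < 7; i++)
        sum += f[i] * src[i - 3];
    return sum;
}

/* A kernel qualifies for the 6-tap path when both outer taps are zero. */
static int is_6tap(const int8_t f[8]) {
    return f[0] == 0 && f[7] == 0;
}
```

For a kernel that passes is_6tap, both functions return the same sum, but the 6-tap version performs two fewer multiply-accumulates per output sample and does not need the two outermost input samples.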
Benchmarks show a 0.5-10.8% FPS uplift, depending heavily on the
input video source. Binary size increase is ~8.5 KiB.
Optimize the 6-tap standard bit-depth horizontal-vertical combined
convolution to avoid unnecessary reads and horizontal convolution
steps at the beginning and end of the algorithm. This also saves some
instructions in the final binary.
Performance of this function increases by up to 5.5% depending on
block size.
The 6-tap sub-pel filter specialisation uses different code paths for
sharp (8-tap) and regular/smooth (6-tap) filtering kernels.
This patch enables benchmarking for the different code paths.
The 8-tap sub-pel filters used for motion vector interpolation are:
regular, smooth, sharp. The regular and smooth filter kernels are
zero-padded, so they are effectively 6-tap filters (some of them are
5-tap or even 4-tap).
This patch specialises the put_8tap_neon and prep_8tap_neon functions
for 6-tap filters, avoiding a lot of redundant work to multiply by
and add zero. Wherever sharp filtering is used, the 8-tap path will
always be selected.
Benchmarking this on a broad range of recent CPUs shows a 7-15% FPS
uplift.
Get raw sample video:
https://ultravideo.fi/video/Bosphorus_1920x1080_120fps_420_8bit_YUV_RAW.7z
Encode using:
aomenc --good --cpu-used=5 -w 1920 -h 1080 --bit-depth=8 --ivf -o Bosphorus_1080p_8bit.ivf Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m