43 Commits
Author SHA1 Message Date
Arpad Panyik and Martin Storsjö 62501cc7db AArch64: Optimize ipred_smooth_8bpc_neon
Optimize ipred_smooth_8bpc_neon by using simpler arithmetic operations
and removing the jump table.

Relative runtime after this patch on some Cortex CPUs:

ipred_smooth:   w4      w8      w16     w32     w64
Cortex-A55:   1.041x  0.839x  0.705x  0.765x  0.802x
Cortex-A510:  1.055x  0.880x  0.669x  0.694x  0.729x
Cortex-A520:  1.113x  0.922x  0.659x  0.737x  0.783x
Cortex-A76:   0.763x  0.733x  0.608x  0.707x  0.791x
Cortex-A78:   0.840x  0.712x  0.704x  0.748x  0.786x
Cortex-A715:  0.814x  0.655x  0.798x  0.837x  0.858x
Cortex-A725:  0.813x  0.653x  0.791x  0.830x  0.854x
Cortex-X1:    0.825x  0.686x  0.667x  0.729x  0.756x
Cortex-X3:    0.865x  0.617x  0.649x  0.674x  0.688x
Cortex-X925:  0.825x  0.677x  0.641x  0.686x  0.700x
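
The figures above are runtimes relative to the previous code, so values
below 1.0 are improvements. As a reading aid, a minimal sketch of turning
such a ratio into a speedup factor (the helper name is hypothetical and not
part of the patch):

```python
def speedup_from_relative_runtime(ratio: float) -> float:
    """Convert a relative runtime (new / old) into a speedup factor (old / new)."""
    if ratio <= 0:
        raise ValueError("relative runtime must be positive")
    return 1.0 / ratio


# Cortex-A76, w16 column of the table above: 0.608x relative runtime.
print(f"{speedup_from_relative_runtime(0.608):.3f}x")
```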
2026-05-26 12:30:26 +00:00
Arpad Panyik dbed372b70 AArch64: Optimize ipred_smooth_v_8bpc_neon further
Optimize ipred_smooth_v_8bpc_neon even further using vertical inner
loop for w >= 16 cases.

Relative runtime after this patch on some Cortex CPUs:

ipred_smooth_v:    w4      w8      w16     w32     w64
Cortex-A55:      0.985x  0.981x  0.810x  0.873x  0.907x
Cortex-A510:     0.966x  0.951x  0.950x  1.013x  1.047x
Cortex-A520:     0.924x  0.924x  0.890x  0.984x  1.030x
Cortex-A76:      0.978x  1.036x  0.899x  0.919x  0.918x
Cortex-A78:      0.997x  0.993x  0.986x  0.972x  0.983x
Cortex-A710:     1.002x  0.973x  0.984x  0.958x  1.002x
Cortex-A715:     1.073x  1.049x  1.005x  1.018x  1.012x
Cortex-A720:     1.001x  1.004x  0.990x  1.007x  1.008x
Cortex-A725:     1.002x  1.001x  0.985x  1.007x  1.006x
Cortex-X1:       0.996x  1.077x  0.927x  0.962x  0.970x
Cortex-X2:       1.012x  0.989x  0.881x  0.971x  0.981x
Cortex-X3:       1.006x  1.034x  0.841x  0.966x  0.962x
Cortex-X4:       1.020x  1.022x  0.915x  0.964x  0.985x
Cortex-X925:     1.000x  0.947x  0.936x  0.982x  0.996x
2026-05-20 13:48:18 +02:00
Arpad Panyik a38236491a AArch64: Optimize ipred_smooth_h_8bpc_neon further
Optimize ipred_smooth_h_8bpc_neon even further using vertical inner
loop for w >= 16 cases. Reorder instructions in the w = 4 handler for
small CPUs.

Relative runtime after this patch on some Cortex CPUs:

ipred_smooth_h:    w4      w8      w16     w32     w64
Cortex-A55:      0.964x  1.003x  0.891x  0.979x  1.030x
Cortex-A510:     0.952x  0.936x  0.928x  1.004x  1.050x
Cortex-A520:     0.921x  0.925x  0.921x  0.995x  1.032x
Cortex-A76:      0.993x  1.005x  0.977x  0.995x  0.996x
Cortex-A78:      0.991x  0.998x  1.042x  0.978x  1.015x
Cortex-A710:     1.020x  0.966x  1.015x  1.015x  1.008x
Cortex-A715:     1.026x  1.051x  1.039x  1.007x  1.024x
Cortex-A720:     0.954x  0.999x  1.018x  0.999x  1.020x
Cortex-A725:     0.962x  1.000x  1.018x  1.000x  1.021x
Cortex-X1:       1.019x  0.993x  0.924x  0.983x  0.989x
Cortex-X2:       1.013x  0.991x  0.872x  0.964x  1.023x
Cortex-X3:       1.030x  0.996x  0.840x  0.953x  1.024x
Cortex-X4:       1.026x  1.005x  0.952x  0.970x  0.986x
Cortex-X925:     1.000x  0.980x  0.865x  0.899x  0.892x
2026-05-20 13:44:22 +02:00
Arpad Panyik and Martin Storsjö 51b67010e2 AArch64: Optimize ipred_smooth_v_8bpc_neon
Optimize ipred_smooth_v_8bpc_neon by using simpler arithmetic operations
and removing the jump table.

Relative runtime after this patch on some Cortex CPUs:

ipred_smooth_v:    w4      w8     w16     w32     w64
Cortex-A55:     1.025x  0.847x  0.821x  0.830x  0.852x
Cortex-A510:    1.017x  0.923x  0.915x  0.883x  0.840x
Cortex-A520:    1.080x  0.972x  0.999x  0.934x  0.876x
Cortex-A76:     0.818x  0.575x  0.599x  0.723x  0.744x
Cortex-A78:     0.782x  0.571x  0.595x  0.641x  0.685x
Cortex-A715:    0.801x  0.586x  0.593x  0.651x  0.694x
Cortex-A725:    0.801x  0.579x  0.596x  0.649x  0.692x
Cortex-X1:      0.782x  0.560x  0.553x  0.623x  0.682x
Cortex-X3:      0.792x  0.594x  0.526x  0.526x  0.604x
Cortex-X925:    0.757x  0.678x  0.525x  0.554x  0.577x
2026-05-06 20:18:03 +00:00
Arpad Panyik and Martin Storsjö 4db1a05aad AArch64: Optimize ipred_smooth_h_8bpc_neon
Optimize ipred_smooth_h_8bpc_neon using simpler arithmetic operations.

Relative runtime after this patch on some Cortex CPUs:

ipred_smooth_h:    w4      w8     w16     w32     w64
Cortex-A55:     1.015x  0.857x  0.819x  0.835x  0.862x
Cortex-A510:    0.988x  0.860x  0.915x  0.879x  0.837x
Cortex-A520:    0.999x  0.883x  0.967x  0.929x  0.873x
Cortex-A76:     0.804x  0.637x  0.517x  0.573x  0.613x
Cortex-A78:     0.800x  0.586x  0.548x  0.639x  0.640x
Cortex-A715:    0.722x  0.642x  0.563x  0.627x  0.646x
Cortex-A725:    0.710x  0.639x  0.567x  0.622x  0.645x
Cortex-X1:      0.758x  0.570x  0.565x  0.548x  0.557x
Cortex-X3:      0.789x  0.589x  0.528x  0.563x  0.571x
Cortex-X925:    0.855x  0.739x  0.541x  0.551x  0.567x
2026-05-06 20:18:03 +00:00
Arpad Panyik c5726277ff AArch64: Optimize ipred_h_8bpc_neon
Optimize ipred_h_8bpc_neon using simpler stores and simpler indexing.

Relative runtime after this patch on some Cortex CPUs:

ipred_h:        w4      w8      w16     w32     w64
Cortex-A55:   1.054x  1.054x  0.978x  1.149x  1.097x
Cortex-A510:  0.455x  0.970x  0.973x  1.010x  1.002x
Cortex-A520:  0.973x  0.975x  0.979x  1.002x  1.000x
Cortex-A76:   0.791x  0.934x  0.912x  1.010x  0.999x
Cortex-A78:   0.771x  0.933x  0.957x  0.519x  0.510x
Cortex-A715:  0.838x  0.860x  0.893x  0.585x  0.661x
Cortex-A720:  0.839x  0.860x  0.892x  0.580x  0.659x
Cortex-A725:  0.809x  0.837x  0.871x  0.580x  0.660x
Cortex-X1:    0.973x  0.982x  0.989x  0.498x  0.660x
Cortex-X3:    0.971x  0.992x  0.987x  0.495x  0.661x
Cortex-X925:  0.950x  1.000x  1.000x  0.474x  0.655x
2026-04-16 16:02:28 +02:00
Arpad Panyik 47e2607e6c AArch64: Optimize ipred_v_8bpc_neon
Optimize the width = 4 case of ipred_v_8bpc_neon by using simple stores
instead of the lane stores which can improve performance on some CPUs.

Relative runtime after this patch on some Cortex CPUs:

 ipred_v:       w4
Cortex-A55:   1.041x
Cortex-A510:  0.297x
Cortex-A520:  0.748x
Cortex-A76:   0.866x
Cortex-A78:   0.856x
Cortex-A715:  0.874x
Cortex-A720:  0.875x
Cortex-A725:  0.868x
Cortex-X1:    1.013x
Cortex-X3:    1.000x
Cortex-X925:  1.000x
2026-04-15 17:37:46 +02:00
Arpad Panyik edb16889d1 AArch64: Add Neon implementation of load_tmvs
This patch adds a vectorised variant of the mv_projection calculation
and a faster initialisation of motion vectors for load_tmvs_neon.

Checkasm uplifts after this patch on some Neoverse and Cortex CPU cores
compared to the C reference compiled with GCC-13 and Clang-19:

                     GCC    Clang
 AWS Graviton 4:   1.62x    1.59x
 Cortex-X4:        1.45x    1.46x
 Cortex-X3:        1.68x    1.69x
 Cortex-X1:        1.55x    1.52x
 Cortex-A720:      1.54x    1.57x
 Cortex-A715:      1.47x    1.55x
 Cortex-A78:       1.21x    1.18x
 Cortex-A76:       1.38x    1.37x
 Cortex-A72:       1.08x    1.11x
 Cortex-A520:      0.97x    1.18x
 Cortex-A510:      0.99x    1.14x
 Cortex-A55:       1.16x    1.23x

This patch increases the .text by ~660 bytes, but the new code is about
0.5 KiB smaller than the reference implementation.
2025-01-09 14:59:31 +01:00
Arpad Panyik and Martin Storsjö 82e9155c75 AArch64: Trim Armv8.0 Neon path of 6-tap and 8-tap MC functions
There are some instruction sequences we could merge after the lane
load/store patch (ec5c3052cf).

This change will simplify the loading of filter weights to save 288
bytes in the Armv8.0 Neon path of 6-tap and 8-tap MC functions.
2024-09-12 11:31:07 +00:00
Arpad Panyik and Martin Storsjö ec5c3052cf AArch64: Optimize lane load/store in MC functions
Partial register writes can create long dependency chains, which can
reduce performance on out-of-order CPUs. This patch removes most of
these kinds of problems in MC functions by filling the full register
before other lane loading instructions.

Most lane extracting stores can also be optimized using FP scalar
stores when the 0th lane would be extracted.

Relative runtime of micro benchmarks after this patch on some Neoverse
and Cortex CPU cores:

8bpc neon                V2      V1      X3      X1    A715     A78     A76
 avg        w8:       0.942x  1.030x  0.936x  0.935x  1.000x  0.877x  0.976x
 w_avg      w8:       0.908x  0.913x  0.919x  0.914x  0.999x  0.905x  0.910x
 mask       w8:       0.937x  0.905x  0.929x  0.907x  1.009x  0.921x  0.868x
 w_mask 420 w4:       0.969x  0.968x  0.951x  0.962x  0.995x  0.976x  0.958x
 w_mask 420 w8:       0.979x  0.935x  0.936x  0.935x  0.996x  0.948x  0.959x
 blend      w4:       0.721x  0.841x  0.764x  0.822x  0.772x  0.826x  0.883x
 blend      w8:       0.692x  0.733x  0.686x  0.730x  0.828x  0.723x  0.762x
 blend    h w2:       0.738x  0.776x  0.746x  0.775x  0.683x  0.827x  0.851x
 blend    h w4:       0.858x  0.942x  0.880x  0.933x  0.784x  0.924x  0.965x
 blend    h w8:       0.804x  0.807x  0.806x  0.805x  0.814x  0.810x  0.748x
 blend    v w2:       0.898x  0.931x  0.903x  0.949x  0.784x  0.867x  0.875x
 blend    v w4:       0.935x  0.905x  0.933x  0.922x  0.763x  0.777x  0.807x
 blend    v w8:       0.803x  0.802x  0.804x  0.815x  0.674x  0.677x  0.678x

16bpc neon               V2      V1      X3      X1    A715     A78     A76
 avg        w4:       0.899x  0.967x  0.897x  0.948x  1.002x  0.901x  0.884x
 w_avg      w4:       0.952x  0.951x  0.936x  0.946x  0.997x  0.937x  0.925x
 mask       w4:       0.893x  0.958x  0.887x  0.948x  1.003x  0.938x  0.934x
 w_mask 420 w4:       0.933x  0.932x  0.932x  0.939x  1.000x  0.910x  0.955x
 w_mask 420 w8:       0.966x  0.962x  0.967x  0.961x  1.000x  0.990x  1.010x
 blend      w4:       0.367x  0.361x  0.370x  0.352x  0.418x  0.394x  0.476x
 blend    h w2:       0.365x  0.445x  0.369x  0.437x  0.416x  0.576x  0.699x
 blend    h w4:       0.343x  0.402x  0.342x  0.398x  0.418x  0.525x  0.603x
 blend    v w2:       0.464x  0.460x  0.460x  0.447x  0.494x  0.446x  0.503x
 blend    v w4:       0.432x  0.424x  0.437x  0.416x  0.433x  0.427x  0.534x
 blend    v w8:       0.936x  0.847x  0.949x  0.848x  1.007x  0.811x  0.785x

bilinear 8bpc neon       V2      V1      X3      X1    A715     A78     A76
 mct     w4  0:       0.982x  0.983x  0.955x  1.029x  0.784x  0.817x  0.814x
 mc      w2  h:       0.277x  0.333x  0.275x  0.325x  0.299x  0.435x  0.518x
 mct     w4  h:       0.835x  0.862x  0.814x  0.887x  1.074x  0.899x  0.884x
 mc      w2  v:       0.887x  0.966x  0.894x  0.945x  0.808x  0.953x  0.997x
 mc      w4  v:       0.762x  0.899x  0.766x  0.867x  0.695x  0.915x  1.017x
 mct     w4  v:       0.700x  0.812x  0.740x  0.777x  0.777x  0.824x  0.853x
 mc      w2 hv:       0.928x  0.985x  0.929x  0.978x  0.789x  0.969x  1.010x
 mct     w4 hv:       0.887x  0.913x  0.912x  0.920x  1.001x  0.922x  0.937x

bilinear 16bpc neon      V2      V1      X3      X1    A715     A78     A76
 mc      w2  0:       0.991x  1.032x  0.993x  0.970x  0.878x  0.925x  0.999x
 mct     w4  0:       0.811x  0.730x  0.797x  0.680x  0.808x  0.711x  0.805x
 mc      w4  h:       0.885x  0.901x  0.895x  0.905x  1.003x  0.909x  0.910x
 mct     w4  h:       0.902x  0.914x  0.898x  0.896x  1.000x  0.897x  0.934x
 mc      w2  v:       0.888x  0.966x  0.913x  0.955x  0.824x  0.958x  1.005x
 mc      w4  v:       0.897x  0.894x  0.903x  0.902x  1.001x  0.895x  0.895x
 mct     w4  v:       0.924x  0.908x  0.921x  0.901x  1.001x  0.904x  0.918x
 mc      w4 hv:       0.927x  0.925x  0.924x  0.933x  1.000x  0.936x  0.959x
 mct     w4 hv:       0.923x  0.944x  0.923x  0.944x  0.999x  0.931x  0.956x

8tap 8bpc neon           V2      V1      X3      X1    A715     A78     A76
 mct regular w4  0:   0.829x  0.854x  0.735x  0.861x  0.769x  0.766x  0.840x
 mc  regular w2  h:   0.984x  1.008x  0.983x  1.012x  0.986x  0.989x  0.995x
 mc  sharp   w2  h:   0.987x  1.008x  0.986x  1.011x  0.985x  0.989x  0.995x
 mc  regular w4  h:   0.907x  0.911x  0.916x  0.908x  0.997x  0.936x  0.932x
 mc  sharp   w4  h:   0.916x  0.914x  0.918x  0.913x  0.999x  0.939x  0.905x
 mct regular w4  h:   0.992x  0.979x  0.993x  0.971x  1.000x  0.986x  0.976x
 mct sharp   w4  h:   0.991x  0.979x  0.989x  0.984x  1.001x  0.979x  0.983x
 mc  regular w2  v:   1.002x  1.001x  1.005x  1.000x  1.000x  0.998x  0.983x
 mc  sharp   w2  v:   1.005x  1.001x  1.009x  0.998x  0.994x  0.997x  0.989x
 mc  regular w4  v:   0.985x  0.998x  0.991x  0.998x  1.000x  1.000x  0.983x
 mc  sharp   w4  v:   1.005x  1.002x  1.006x  1.002x  0.998x  0.991x  0.999x
 mct regular w4  v:   0.966x  0.967x  0.961x  0.974x  0.996x  0.954x  0.982x
 mct sharp   w4  v:   0.970x  0.944x  0.967x  0.944x  0.997x  0.951x  0.966x
 mc  regular w2 hv:   0.993x  0.993x  0.994x  0.987x  0.993x  0.985x  0.999x
 mc  sharp   w2 hv:   0.994x  0.996x  0.992x  0.998x  0.997x  0.999x  0.999x
 mc  regular w4 hv:   0.964x  0.958x  0.964x  0.960x  0.982x  0.938x  0.958x
 mc  sharp   w4 hv:   0.982x  0.981x  0.980x  0.982x  0.995x  0.986x  0.941x
 mct regular w4 hv:   0.993x  0.994x  0.992x  0.994x  0.996x  0.992x  0.988x
 mct sharp   w4 hv:   0.993x  0.996x  0.991x  0.996x  0.954x  0.992x  1.011x

8tap 16bpc neon          V2      V1      X3      X1    A715     A78     A76
 mc  regular w2  0:   0.869x  1.059x  0.874x  0.956x  0.883x  0.932x  1.000x
 mct regular w4  0:   0.348x  0.369x  0.354x  0.377x  0.560x  0.409x  0.648x
 mc  regular w2  h:   0.996x  0.988x  0.992x  0.985x  0.989x  0.991x  1.006x
 mc  sharp   w2  h:   0.996x  0.989x  0.979x  0.991x  0.987x  0.988x  0.997x
 mc  regular w4  h:   0.957x  0.937x  0.957x  0.948x  0.961x  0.927x  0.994x
 mc  sharp   w4  h:   0.966x  0.940x  0.962x  0.954x  0.985x  0.929x  0.970x
 mct regular w4  h:   0.922x  0.942x  0.932x  0.933x  1.007x  0.938x  0.905x
 mct sharp   w4  h:   0.919x  0.943x  0.919x  0.931x  0.971x  0.943x  0.929x
 mc  regular w2  v:   1.000x  0.997x  1.001x  1.003x  1.001x  0.999x  0.984x
 mc  sharp   w2  v:   1.000x  0.999x  1.000x  0.999x  1.000x  1.000x  0.993x
 mc  regular w4  v:   0.936x  0.941x  0.936x  0.939x  0.999x  0.928x  0.981x
 mc  sharp   w4  v:   0.955x  0.961x  0.949x  0.956x  0.999x  0.947x  0.953x
 mct regular w4  v:   0.977x  0.966x  0.979x  0.968x  0.990x  0.972x  0.972x
 mct sharp   w4  v:   0.973x  0.965x  0.981x  0.963x  0.994x  0.977x  0.974x
 mc  regular w2 hv:   0.995x  1.001x  0.995x  0.995x  0.995x  1.000x  0.981x
 mc  sharp   w2 hv:   0.993x  1.012x  0.993x  0.988x  0.996x  0.992x  1.008x
 mc  regular w4 hv:   0.938x  0.943x  0.939x  0.943x  0.986x  0.943x  0.997x
 mc  sharp   w4 hv:   0.969x  0.959x  0.970x  0.974x  0.986x  0.993x  0.997x
 mct regular w4 hv:   0.942x  0.970x  0.951x  0.960x  0.977x  0.958x  1.018x
 mct sharp   w4 hv:   0.923x  0.958x  0.934x  0.955x  0.973x  0.946x  0.986x
2024-09-06 11:40:46 +03:00
Arpad Panyik and Martin Storsjö a992a9bede AArch64: Optimize Armv8.0 Neon path of SBD H/HV 6-tap filters
The 6-tap horizontal and the horizontal parts of 6-tap HV subpel
filters can be further improved by some pointer arithmetic and saving
some instructions (EXTs) in their data rearrangement code.

Relative runtime of micro benchmarks after this patch on Cortex CPU
cores:

SBD mct h         X1     A78     A76     A72     A55
 regular  w8:  0.878x  0.894x  0.990x  0.923x  0.944x
 regular w16:  0.962x  0.931x  0.943x  0.949x  0.949x
 regular w32:  0.937x  0.937x  0.972x  0.938x  0.947x
 regular w64:  0.920x  0.965x  0.992x  0.936x  0.944x

SBD mct hv        X1     A78     A76     A72     A55
 regular  w8:  0.931x  0.970x  0.951x  0.950x  0.971x
 regular w16:  0.940x  0.971x  0.941x  0.952x  0.967x
 regular w32:  0.943x  0.972x  0.946x  0.961x  0.974x
 regular w64:  0.943x  0.973x  0.952x  0.944x  0.975x
2024-09-06 08:08:08 +00:00
Arpad Panyik and Martin Storsjö 2d808de191 AArch64: Optimize Armv8.0 Neon path of HBD HV 6-tap filters
The horizontal parts of 6-tap HV subpel filters can be further
improved by some pointer arithmetic and saving some instructions
(EXTs) in their data rearrangement code.

Relative runtime of micro benchmarks after this patch on Cortex CPU
cores:

HBD mct hv        X1     A78     A76     A72     A55
 regular  w8:  0.952x  0.989x  0.924x  0.973x  0.976x
 regular w16:  0.961x  0.993x  0.928x  0.952x  0.971x
 regular w32:  0.964x  0.996x  0.930x  0.973x  0.972x
 regular w64:  0.963x  0.997x  0.930x  0.969x  0.974x
2024-09-06 07:50:38 +00:00
Arpad Panyik and Martin Storsjö 93339ce857 AArch64: Optimize Armv8.0 Neon path of HBD horizontal 6-tap filters
The 6-tap horizontal subpel filters can be further improved by some
pointer arithmetic and saving some instructions (EXTs) in their data
rearrangement code.

Relative runtime of micro benchmarks after this patch on some Cortex
CPU cores:

regular:     X1      A78      A76      A55
 mc  w8:  0.915x   0.937x   0.900x   0.982x
 mc w16:  0.917x   0.947x   0.911x   0.971x
 mc w32:  0.914x   0.938x   0.873x   0.961x
 mc w64:  0.918x   0.932x   0.882x   0.964x
2024-09-06 07:38:18 +00:00
Arpad Panyik and Martin Storsjö 109b24277b AArch64: Optimize Armv8.0 Neon path of HBD horizontal filters
The reduction parts of the horizontal HBD MC filters use SRSHL+SQXTUN+
SRSHL instruction sequences. In the horizontal case this can be
rewritten using a single SQSHRUN instruction with an additional
rounding value (34 for 10-bit and 40 for 12-bit).
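
The arithmetic behind merging the two rounding shifts can be checked with a
small model. This is only a sketch: it assumes the two shift amounts are
6 - intermediate_bits and intermediate_bits with intermediate_bits =
14 - bitdepth, and it leaves out the saturation of the intermediate
narrowing step and the final pixel clamp. The function names are
hypothetical.

```python
def rshift_round(x: int, sh: int) -> int:
    """Rounding arithmetic right shift: add half of the step, then shift."""
    return (x + ((1 << sh) >> 1)) >> sh


def two_step(x: int, bitdepth: int) -> int:
    """Two consecutive rounding shifts (saturation omitted)."""
    intermediate_bits = 14 - bitdepth  # assumption: 4 for 10-bit, 2 for 12-bit
    first = rshift_round(x, 6 - intermediate_bits)
    return rshift_round(first, intermediate_bits)


def one_step(x: int, bitdepth: int) -> int:
    """Single truncating shift by 6 with a combined rounding value."""
    intermediate_bits = 14 - bitdepth
    rnd = 32 + ((1 << (6 - intermediate_bits)) >> 1)  # 34 (10-bit), 40 (12-bit)
    return (x + rnd) >> 6


# Exhaustive check over a range of signed filter sums for both bit depths.
for bitdepth in (10, 12):
    assert all(two_step(x, bitdepth) == one_step(x, bitdepth)
               for x in range(-(1 << 16), 1 << 16))
```

The combined constants evaluate to 34 and 40, matching the rounding values
quoted above.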

Relative runtime of micro benchmarks after this patch on some Cortex
CPU cores:

regular:     X1      A78      A76      A55
 mc  w2:  0.847x   0.864x   0.822x   0.859x
 mc  w4:  0.889x   0.994x   0.868x   0.917x
 mc  w8:  0.857x   0.911x   0.915x   0.978x
 mc w16:  0.890x   0.982x   0.868x   0.974x
 mc w32:  0.904x   0.991x   0.873x   0.967x
 mc w64:  0.919x   1.003x   0.860x   0.970x
2024-09-06 07:38:18 +00:00
Arpad Panyik and Martin Storsjö 472b31f838 AArch64: SVE MS armasm64 fix of HBD subpel filters
MS armasm64 cannot compile some SVE instructions with immediate
operands, e.g.:
  sub  z0.h, z0.h, #8192

The proper form is:
  sub  z0.h, z0.h, #32, lsl #8

This patch contains the needed fixes.
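
The two spellings name the same value because the SVE ADD/SUB immediate is
an unsigned 8-bit number with an optional left shift by 8. A minimal sketch
of producing the explicit form (the helper is hypothetical and not part of
the patch or of any assembler):

```python
def sve_arith_imm(value: int) -> str:
    """Format an SVE ADD/SUB immediate as imm8 with an optional LSL #8."""
    if 0 <= value <= 255:
        return f"#{value}"
    if value % 256 == 0 and 0 < value // 256 <= 255:
        return f"#{value >> 8}, lsl #8"
    raise ValueError(f"{value} is not encodable as imm8 or imm8 << 8")


# 8192 == 32 << 8, so this prints the explicit form shown above.
print(f"sub  z0.h, z0.h, {sve_arith_imm(8192)}")
```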
2024-08-22 19:33:06 +00:00
Arpad Panyik and Martin Storsjö 01558f3f66 AArch64: Add HBD subpel filters using 128-bit SVE2
Add an Armv9.0-A SVE2 code path for high bitdepth convolutions. Only
2D convolutions have 6-tap specialisations of their vertical passes.
All other convolutions are 4- or 8-tap filters which fit well with
the 4-element 16-bit SDOT instruction of SVE2.
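
Why 4- and 8-tap filters map well onto a 4-element dot product can be shown
with a scalar model. This is a sketch only: dot4 stands in for one
accumulator lane of the instruction, and the coefficients are illustrative
values that sum to 128, not taken from the patch.

```python
def dot4(samples, coeffs):
    """One 4-element dot product, modelling a single accumulator lane."""
    assert len(samples) == 4 and len(coeffs) == 4
    return sum(s * c for s, c in zip(samples, coeffs))


def filter_8tap(samples, coeffs):
    """8-tap convolution as two accumulated 4-element dot products."""
    assert len(samples) == 8 and len(coeffs) == 8
    return dot4(samples[:4], coeffs[:4]) + dot4(samples[4:], coeffs[4:])


# Illustrative coefficients (sum is 128) applied to a flat 10-bit signal.
coeffs = [-1, 3, -7, 127, 8, -3, 1, 0]
flat = [512] * 8
assert filter_8tap(flat, coeffs) == 512 * 128
```

A 4-tap filter needs a single dot4 call and an 8-tap filter exactly two,
whereas a 6-tap filter would leave two of the eight slots multiplied by zero.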

This patch renames HBD prep/put_neon to prep/put_16bpc_neon and
exports put_16bpc_neon.

Benchmarks show up to 17% FPS increase depending on the input video
and the CPU used.

This patch will increase the .text by around 8 KiB.

Relative performance to the C reference on some Cortex-A/X CPUs:

    regular     A715    A720      X3      X4    A510    A520
 w4 hv neon:    3.93x   4.10x   5.21x   5.17x   3.57x   5.27x
 w4 hv sve2:    4.99x   5.14x   6.00x   6.05x   4.33x   3.99x
 w8 hv neon:    1.72x   1.67x   1.98x   2.18x   2.95x   2.94x
 w8 hv sve2:    2.12x   2.29x   2.52x   2.62x   2.60x   2.60x
w16 hv neon:    1.59x   1.53x   1.83x   1.89x   2.35x   2.24x
w16 hv sve2:    1.94x   2.12x   2.33x   2.18x   2.06x   2.06x
w32 hv neon:    1.49x   1.50x   1.66x   1.76x   2.10x   2.16x
w32 hv sve2:    1.81x   2.09x   2.11x   2.09x   1.84x   1.87x
w64 hv neon:    1.52x   1.50x   1.55x   1.71x   1.95x   2.05x
w64 hv sve2:    1.84x   2.08x   1.97x   1.98x   1.74x   1.77x

 w4 h neon:     5.35x   5.47x   7.39x   5.78x   3.92x   5.19x
 w4 h sve2:     7.91x   8.35x  11.95x  10.33x   5.81x   5.42x
 w8 h neon:     4.49x   4.43x   6.50x   4.87x   7.18x   6.17x
 w8 h sve2:     6.09x   6.22x   9.59x   7.70x   7.89x   6.83x
w16 h neon:     2.53x   2.52x   2.34x   1.86x   2.71x   2.75x
w16 h sve2:     3.41x   3.47x   3.53x   3.25x   2.89x   2.96x
w32 h neon:     2.07x   2.08x   1.97x   1.56x   2.17x   2.21x
w32 h sve2:     2.76x   2.84x   2.94x   2.75x   2.24x   2.29x
w64 h neon:     1.86x   1.86x   1.76x   1.41x   1.87x   1.88x
w64 h sve2:     2.47x   2.54x   2.65x   2.46x   1.94x   1.94x

 w4 v neon:     5.22x   5.17x   6.36x   5.60x   4.23x   7.30x
 w4 v sve2:     5.86x   5.90x   7.81x   7.16x   4.86x   4.15x
 w8 v neon:     4.83x   4.79x   6.96x   6.45x   4.74x   8.40x
 w8 v sve2:     5.25x   5.23x   7.76x   6.79x   4.84x   4.13x
w16 v neon:     2.59x   2.60x   2.93x   2.47x   1.80x   4.16x
w16 v sve2:     2.85x   2.88x   3.36x   2.73x   1.86x   2.00x
w32 v neon:     2.12x   2.13x   2.33x   2.03x   1.34x   3.11x
w32 v sve2:     2.36x   2.40x   2.73x   2.32x   1.41x   1.48x
w64 v neon:     1.94x   1.92x   2.02x   1.78x   1.12x   2.59x
w64 v sve2:     2.16x   2.15x   2.37x   2.03x   1.17x   1.22x

 w4 0 neon:     1.75x   1.71x   1.44x   1.56x   3.18x   2.87x
 w4 0 sve2:     4.28x   4.39x   5.72x   6.42x   5.50x   4.68x
 w8 0 neon:     3.05x   3.04x   4.44x   4.64x   3.84x   3.52x
 w8 0 sve2:     3.85x   3.80x   5.45x   6.01x   4.92x   4.26x
w16 0 neon:     2.92x   2.93x   3.82x   3.23x   4.58x   4.44x
w16 0 sve2:     4.29x   4.27x   4.25x   4.15x   5.58x   5.29x
w32 0 neon:     2.73x   2.76x   3.50x   2.67x   4.44x   4.26x
w32 0 sve2:     4.09x   4.10x   3.75x   3.39x   5.67x   5.22x
w64 0 neon:     2.73x   2.70x   3.27x   3.14x   4.57x   4.68x
w64 0 sve2:     4.06x   3.97x   3.54x   3.18x   6.36x   6.25x

      sharp     A715    A720      X3      X4    A510    A520
 w4 hv neon:    3.54x   3.64x   4.43x   4.45x   3.03x   4.72x
 w4 hv sve2:    4.30x   4.55x   5.38x   5.26x   4.04x   3.76x
 w8 hv neon:    1.30x   1.25x   1.51x   1.60x   2.44x   2.43x
 w8 hv sve2:    1.86x   2.06x   2.09x   2.18x   2.37x   2.39x
w16 hv neon:    1.19x   1.16x   1.43x   1.36x   1.95x   1.98x
w16 hv sve2:    1.68x   1.91x   1.94x   1.84x   1.89x   1.94x
w32 hv neon:    1.13x   1.12x   1.30x   1.29x   1.75x   1.81x
w32 hv sve2:    1.58x   1.84x   1.75x   1.74x   1.70x   1.76x
w64 hv neon:    1.13x   1.13x   1.21x   1.25x   1.65x   1.69x
w64 hv sve2:    1.57x   1.84x   1.62x   1.67x   1.62x   1.65x

 w4 h neon:     5.38x   5.49x   7.46x   5.74x   3.93x   5.23x
 w4 h sve2:     7.86x   8.37x  11.99x  10.38x   5.81x   5.40x
 w8 h neon:     3.46x   3.49x   5.36x   4.64x   6.40x   5.62x
 w8 h sve2:     5.95x   6.23x   9.61x   7.76x   7.86x   6.89x
w16 h neon:     1.99x   1.97x   2.07x   1.91x   2.43x   2.51x
w16 h sve2:     3.42x   3.46x   3.75x   3.23x   2.89x   2.98x
w32 h neon:     1.67x   1.62x   1.66x   1.63x   1.95x   2.01x
w32 h sve2:     2.86x   2.84x   2.94x   2.72x   2.21x   2.29x
w64 h neon:     1.45x   1.45x   1.51x   1.48x   1.69x   1.70x
w64 h sve2:     2.47x   2.54x   2.64x   2.46x   1.93x   1.95x

 w4 v neon:     4.07x   4.01x   5.15x   4.74x   3.38x   6.56x
 w4 v sve2:     5.88x   5.86x   7.81x   7.15x   4.85x   4.39x
 w8 v neon:     3.64x   3.59x   5.38x   4.92x   3.59x   7.23x
 w8 v sve2:     5.23x   5.19x   7.77x   6.66x   4.81x   4.13x
w16 v neon:     1.93x   1.95x   2.25x   1.92x   1.35x   3.46x
w16 v sve2:     2.85x   2.88x   3.36x   2.71x   1.86x   1.94x
w32 v neon:     1.57x   1.58x   1.78x   1.60x   1.01x   2.67x
w32 v sve2:     2.36x   2.39x   2.73x   2.35x   1.41x   1.50x
w64 v neon:     1.44x   1.42x   1.54x   1.43x   0.85x   2.19x
w64 v sve2:     2.17x   2.15x   2.37x   2.06x   1.18x   1.25x
2024-08-22 12:52:56 +00:00
Arpad Panyik 713c076d80 AArch64: Add USMMLA impl. for SBD 6-tap H/HV filters
Add 6-tap variant of standard bit-depth horizontal subpel filters
using the Armv8.6 I8MM USMMLA matrix multiply instruction. This patch
also extends the HV filter with 6-tap horizontal pass using USMMLA.
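
One way a 2x8 by 8x2 matrix multiply can serve a 6-tap filter is sketched
below. The row layout is an assumption made for illustration (two sample
rows offset by two pixels, two coefficient rows offset by one), not
necessarily the arrangement used in the patch. The helper names and
coefficients are hypothetical.

```python
def usmmla_2x2(acc, a_rows, b_rows):
    """Model of one USMMLA: acc (2x2) += A (2x8 unsigned) * B (2x8 signed)^T."""
    return [[acc[i][j] + sum(a * b for a, b in zip(a_rows[i], b_rows[j]))
             for j in range(2)] for i in range(2)]


def filter_6tap_x4(src, x, f):
    """Four adjacent 6-tap outputs from one modelled matrix multiply."""
    # Coefficient rows: the 6 taps padded to 8, and the same taps shifted by one.
    b_rows = [list(f) + [0, 0], [0] + list(f) + [0]]
    # Sample rows: 8 pixels starting at x and 8 pixels starting at x + 2.
    a_rows = [src[x:x + 8], src[x + 2:x + 10]]
    r = usmmla_2x2([[0, 0], [0, 0]], a_rows, b_rows)
    return [r[0][0], r[0][1], r[1][0], r[1][1]]


src = list(range(10, 30))
taps = [1, -5, 36, 36, -5, 1]  # illustrative coefficients only
reference = [sum(src[i + k] * taps[k] for k in range(6)) for i in range(4)]
assert filter_6tap_x4(src, 0, taps) == reference
```

In this layout the two zero-padded slots of an 8-element row are what allow
a second, shifted copy of the 6 taps to share the same sample row.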

Benchmarks show up to 6-7% FPS increase depending on the input video
and the CPU used.

This patch will increase the .text by around 1.2 KiB.

Relative runtime of micro benchmarks after this patch on Neoverse
and Cortex CPU cores:

regular      V2      V1      X3    A720    A715    A520    A510
  w8 hv:  0.860x  0.895x  0.870x  0.896x  0.896x  0.938x  0.936x
 w16 hv:  0.829x  0.886x  0.865x  0.908x  0.906x  0.946x  0.944x
 w32 hv:  0.837x  0.883x  0.862x  0.914x  0.915x  0.953x  0.949x
 w64 hv:  0.840x  0.883x  0.862x  0.914x  0.914x  0.955x  0.952x

  w8 h:   0.746x  0.754x  0.747x  0.723x  0.724x  0.874x  0.866x
 w16 h:   0.749x  0.764x  0.745x  0.731x  0.731x  0.858x  0.852x
 w32 h:   0.739x  0.754x  0.738x  0.729x  0.729x  0.839x  0.837x
 w64 h:   0.736x  0.749x  0.733x  0.725x  0.726x  0.847x  0.836x
2024-08-21 23:41:48 +02:00
Arpad Panyik 287e90a3a6 AArch64: Fix typo in SBD 6-tap 2D/HV subpel filter
The macro parameter \xmy of filter_8tap_fn was used incorrectly as a
pointer instead of \lsrc. They refer to the same register but in
different context.
2024-08-12 19:41:45 +02:00
Arpad Panyik 2355eeb8f2 AArch64: Move constants of DotProd subpel filters to .rodata
The constants used for the subpel filters were placed in the .text
section for simplicity and peak performance, but this does not work on
systems with execute only .text sections (e.g.: OpenBSD).

The performance cost of moving the constants to the .rodata section
is small and mostly within the measurable noise.
2024-06-26 11:20:43 +02:00
Arpad Panyik 92f592ed10 AArch64: Fix potential out of bounds access in DotProd H/HV filters
The DotProd/I8MM horizontal and HV/2D subpel filters use -4 offset
for sampling instead of -3 to be better aligned in some cases. This
resulted in an out of bounds access, which led to crashes.

This patch fixes it.
2024-06-05 23:22:36 +02:00
Arpad Panyik and Martin Storsjö d835c6bf69 AArch64: Optimize prep_neon function
Optimize the widening copy part of subpel filters (the prep_neon
function). In this patch we combine widening shifts with widening
multiplications in the inner loops to get maximum throughput.
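
The two instruction kinds can be mixed because a widening left shift and a
widening multiply by a power of two give the same result. A minimal model,
assuming the 8 bpc case where samples are scaled by 16 (shift of 4); the
scale factor and function names are assumptions for illustration:

```python
INTERMEDIATE_BITS = 4  # assumption: 8-bit samples are scaled by 1 << 4


def widen_by_shift(px: int) -> int:
    """Model of a widening shift: 8-bit input, 16-bit result."""
    return (px << INTERMEDIATE_BITS) & 0xFFFF


def widen_by_multiply(px: int) -> int:
    """Model of a widening multiply by the constant 1 << INTERMEDIATE_BITS."""
    return (px * (1 << INTERMEDIATE_BITS)) & 0xFFFF


# Both forms agree for every 8-bit sample value.
assert all(widen_by_shift(p) == widen_by_multiply(p) for p in range(256))
```

Since the results are identical, alternating the two forms only changes
which execution units are used, not the output.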

The change will increase .text by 36 bytes.

Relative performance of micro benchmarks (lower is better):

Cortex-A55:
  mct_w4:   0.795x
  mct_w8:   0.913x
  mct_w16:  0.912x
  mct_w32:  0.838x
  mct_w64:  1.025x
  mct_w128: 1.002x

Cortex-A510:
  mct_w4:   0.760x
  mct_w8:   0.636x
  mct_w16:  0.640x
  mct_w32:  0.854x
  mct_w64:  0.864x
  mct_w128: 0.995x

Cortex-A72:
  mct_w4:   0.616x
  mct_w8:   0.854x
  mct_w16:  0.756x
  mct_w32:  1.052x
  mct_w64:  1.044x
  mct_w128: 0.702x

Cortex-A76:
  mct_w4:   0.837x
  mct_w8:   0.797x
  mct_w16:  0.841x
  mct_w32:  0.804x
  mct_w64:  0.948x
  mct_w128: 0.904x

Cortex-A78:
  mct_w16:  0.542x
  mct_w32:  0.725x
  mct_w64:  0.741x
  mct_w128: 0.745x

Cortex-A715:
  mct_w16:  0.561x
  mct_w32:  0.720x
  mct_w64:  0.740x
  mct_w128: 0.748x

Cortex-X1:
  mct_w32:  0.886x
  mct_w64:  0.882x
  mct_w128: 0.917x

Cortex-X3:
  mct_w32:  0.835x
  mct_w64:  0.803x
  mct_w128: 0.808x
2024-05-14 15:07:10 +00:00
Arpad Panyik and Martin Storsjö f0e779bc2a AArch64: Optimize jump table calculation of prep_neon
Save a complex arithmetic instruction in the jump table address
calculation of prep_neon function.
2024-05-14 15:07:10 +00:00
Arpad Panyik and Martin Storsjö 1790e1329d AArch64: Optimize BTI landing pads of prep_neon
Move the BTI landing pads out of the inner loops of prep_neon
function. Only the width=4 and width=8 cases are affected.

If BTI is enabled, moving the AARCH64_VALID_JUMP_TARGET out of the
inner loops gives better execution speed on Cortex-A510 relative to
the original (lower is better):
  w4: 0.969x
  w8: 0.722x

Out-of-order cores are not affected.
2024-05-14 15:07:10 +00:00
Arpad Panyik 8141546da9 AArch64: Optimize put_neon function
Optimize the copy part of subpel filters (the put_neon function).
For small block sizes (<16) the usage of general purpose registers
is usually the best way to do the copy.

Relative performance of micro benchmarks (lower is better):

Cortex-A55:
  w2:   0.991x
  w4:   0.992x
  w8:   0.999x
  w16:  0.875x
  w32:  0.775x
  w64:  0.914x
  w128: 0.998x

Cortex-A510:
  w2:   0.159x
  w4:   0.080x
  w8:   0.583x
  w16:  0.588x
  w32:  0.966x
  w64:  1.111x
  w128: 0.957x

Cortex-A76:
  w2:   0.903x
  w4:   0.683x
  w8:   0.944x
  w16:  0.948x
  w32:  0.919x
  w64:  0.855x
  w128: 0.991x

Cortex-A78:
  w32:  0.867x
  w64:  0.820x
  w128: 1.011x

Cortex-A715:
  w32:  0.834x
  w64:  0.778x
  w128: 1.000x

Cortex-X1:
  w32:  0.809x
  w64:  0.762x
  w128: 1.000x

Cortex-X3:
  w32:  0.733x
  w64:  0.720x
  w128: 0.999x
2024-05-13 16:52:21 +02:00
Arpad Panyik 645d1f9fd2 AArch64: Optimize jump table calculation of put_neon
Save a complex arithmetic instruction in the jump table address
calculation of put_neon function.
2024-05-13 16:50:56 +02:00
Arpad Panyik 83452c6e3f AArch64: Optimize BTI landing pads of put_neon
Move the BTI landing pads out of the inner loops of put_neon
function, the only exception is the width=16 case where it is already
outside of the loops.

When BTI is enabled, the relative performance of omitting the
AARCH64_VALID_JUMP_TARGET from the inner loops on Cortex-A510 (lower
is better):
  w2:   0.981x
  w4:   0.991x
  w8:   0.612x
  w32:  0.687x
  w64:  0.813x
  w128: 0.892x

Out-of-order CPUs are mostly unaffected.
2024-05-13 16:27:30 +02:00
Arpad Panyik and Jean-Baptiste Kempf a6d57b1140 AArch64: Optimize the init of DotProd+ 2D subpel filters
Removed some unnecessary vector register copies from the initial
horizontal filter parts of the HV subpel filters. The performance
improvements are better for the smaller filter block sizes.

The narrowing shifts at the end of *filter8* were also rewritten,
because the previous form was only beneficial for the Cortex-A55 among
the DotProd capable CPU cores. On other out-of-order or newer CPUs the
UZP1+SHRN instruction combination is better.

Relative performance of micro benchmarks (lower is better):

Cortex-A55:
  mct regular w4:  0.980x
  mct regular w8:  1.007x
  mct regular w16: 1.007x

  mct sharp w4:    0.983x
  mct sharp w8:    1.012x
  mct sharp w16:   1.005x

Cortex-A510:
  mct regular w4:  0.935x
  mct regular w8:  0.984x
  mct regular w16: 0.986x

  mct sharp w4:    0.927x
  mct sharp w8:    0.983x
  mct sharp w16:   0.987x

Cortex-A78:
  mct regular w4:  0.974x
  mct regular w8:  0.988x
  mct regular w16: 0.991x

  mct sharp w4:    0.971x
  mct sharp w8:    0.987x
  mct sharp w16:   0.979x

Cortex-A715:
  mct regular w4:  0.958x
  mct regular w8:  0.993x
  mct regular w16: 0.998x

  mct sharp w4:    0.974x
  mct sharp w8:    0.991x
  mct sharp w16:   0.997x

Cortex-X1:
  mct regular w4:  0.983x
  mct regular w8:  0.993x
  mct regular w16: 0.996x

  mct sharp w4:    0.974x
  mct sharp w8:    0.990x
  mct sharp w16:   0.995x

Cortex-X3:
  mct regular w4:  0.953x
  mct regular w8:  0.993x
  mct regular w16: 0.997x

  mct sharp w4:    0.981x
  mct sharp w8:    0.993x
  mct sharp w16:   0.995x
2024-05-12 14:33:03 +00:00
Arpad Panyik 643195f546 AArch64: Optimize 2D i8mm subpel filters
Rewrite the accumulator initializations of the horizontal part of the
2D filters with zero register fills. It can improve the performance
on out-of-order CPUs which can fill vector registers by zero with
zero latency. Zeroed accumulators imply the usage of the rounding
shifts at the end of filters.
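
The link between zeroed accumulators and rounding shifts can be modelled in
a few lines: if the rounding constant is no longer preloaded into the
accumulator, the final shift has to add it. A sketch with hypothetical
function names and an arbitrary shift amount:

```python
def biased_accumulate(products, sh):
    """Accumulator preloaded with the rounding constant, truncating shift."""
    acc = 1 << (sh - 1)
    for p in products:
        acc += p
    return acc >> sh


def zeroed_accumulate(products, sh):
    """Accumulator starts at zero; rounding is applied by the final shift."""
    acc = 0
    for p in products:
        acc += p
    return (acc + (1 << (sh - 1))) >> sh


# Both schemes agree, including for negative partial products.
products = [-300, 1270, 45, -12]
assert biased_accumulate(products, 6) == zeroed_accumulate(products, 6)
```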

The only exception is the very short *hv_filter4*, where the longer
latency of rounding shift could decrease the performance.

The *filter8* function uses a different (alternating) dot product
computation order for the DotProd+ feature level, which gives better
overall performance for out-of-order and some in-order CPU cores.

The i8mm version does not need to use bias for the loaded samples, so
a different instruction scheduling is beneficial mostly affecting the
order of TBL instructions in the 8-tap case.

Relative performance of micro benchmarks (lower is better):

Cortex-X3:
  mct_8tap_regular_w16_hv_8bpc_i8mm:  0.982x
  mct_8tap_sharp_w16_hv_8bpc_i8mm:    0.979x
  mct_8tap_regular_w8_hv_8bpc_i8mm:   0.972x
  mct_8tap_sharp_w8_hv_8bpc_i8mm:     0.969x
  mct_8tap_regular_w4_hv_8bpc_i8mm:   0.942x
  mct_8tap_sharp_w4_hv_8bpc_i8mm:     0.935x
  mc_8tap_regular_w16_hv_8bpc_i8mm:   0.988x
  mc_8tap_sharp_w16_hv_8bpc_i8mm:     0.982x
  mc_8tap_regular_w8_hv_8bpc_i8mm:    0.981x
  mc_8tap_sharp_w8_hv_8bpc_i8mm:      0.975x
  mc_8tap_regular_w4_hv_8bpc_i8mm:    0.998x
  mc_8tap_sharp_w4_hv_8bpc_i8mm:      0.996x
  mc_8tap_regular_w2_hv_8bpc_i8mm:    1.006x
  mc_8tap_sharp_w2_hv_8bpc_i8mm:      0.993x

Cortex-A715:
  mct_8tap_regular_w16_hv_8bpc_i8mm:  0.883x
  mct_8tap_sharp_w16_hv_8bpc_i8mm:    0.931x
  mct_8tap_regular_w8_hv_8bpc_i8mm:   0.882x
  mct_8tap_sharp_w8_hv_8bpc_i8mm:     0.928x
  mct_8tap_regular_w4_hv_8bpc_i8mm:   0.969x
  mct_8tap_sharp_w4_hv_8bpc_i8mm:     0.934x
  mc_8tap_regular_w16_hv_8bpc_i8mm:   0.881x
  mc_8tap_sharp_w16_hv_8bpc_i8mm:     0.925x
  mc_8tap_regular_w8_hv_8bpc_i8mm:    0.879x
  mc_8tap_sharp_w8_hv_8bpc_i8mm:      0.925x
  mc_8tap_regular_w4_hv_8bpc_i8mm:    0.917x
  mc_8tap_sharp_w4_hv_8bpc_i8mm:      0.976x
  mc_8tap_regular_w2_hv_8bpc_i8mm:    0.915x
  mc_8tap_sharp_w2_hv_8bpc_i8mm:      0.972x

Cortex-A510:
  mct_8tap_regular_w16_hv_8bpc_i8mm:  0.994x
  mct_8tap_sharp_w16_hv_8bpc_i8mm:    0.949x
  mct_8tap_regular_w8_hv_8bpc_i8mm:   0.987x
  mct_8tap_sharp_w8_hv_8bpc_i8mm:     0.947x
  mct_8tap_regular_w4_hv_8bpc_i8mm:   1.002x
  mct_8tap_sharp_w4_hv_8bpc_i8mm:     0.999x
  mc_8tap_regular_w16_hv_8bpc_i8mm:   0.989x
  mc_8tap_sharp_w16_hv_8bpc_i8mm:     1.003x
  mc_8tap_regular_w8_hv_8bpc_i8mm:    0.986x
  mc_8tap_sharp_w8_hv_8bpc_i8mm:      1.000x
  mc_8tap_regular_w4_hv_8bpc_i8mm:    1.007x
  mc_8tap_sharp_w4_hv_8bpc_i8mm:      1.000x
  mc_8tap_regular_w2_hv_8bpc_i8mm:    1.005x
  mc_8tap_sharp_w2_hv_8bpc_i8mm:      1.000x
2024-05-09 09:53:05 +02:00
Arpad Panyik b2eca1aca7 AArch64: Optimize vertical i8mm subpel filters
Replace the accumulator initializations of the vertical subpel
filters with register fills by zeros (which are usually zero latency
operations in this feature class). This implies the usage of rounding
shifts at the end in the prep cases. Out-of-order CPU cores can
benefit from this change.

The width=16 case uses a simpler register duplication scheme that
relies on MOV instructions for the subsequent shuffles. This approach
uses a different register to load the data into for better instruction
scheduling and data dependency chain.

Relative performance of micro benchmarks (lower is better):

Cortex-X3:
mct_8tap_sharp_w16_v_8bpc_i8mm:  0.910x
mct_8tap_sharp_w8_v_8bpc_i8mm:   0.986x

mc_8tap_sharp_w16_v_8bpc_i8mm:   0.864x
mc_8tap_sharp_w8_v_8bpc_i8mm:    0.882x
mc_8tap_sharp_w4_v_8bpc_i8mm:    0.933x
mc_8tap_sharp_w2_v_8bpc_i8mm:    0.926x

Cortex-A715:
mct_8tap_sharp_w16_v_8bpc_i8mm:  0.855x
mct_8tap_sharp_w8_v_8bpc_i8mm:   0.784x
mct_8tap_sharp_w4_v_8bpc_i8mm:   1.069x

mc_8tap_sharp_w16_v_8bpc_i8mm:   0.850x
mc_8tap_sharp_w8_v_8bpc_i8mm:    0.779x
mc_8tap_sharp_w4_v_8bpc_i8mm:    0.971x
mc_8tap_sharp_w2_v_8bpc_i8mm:    0.975x

Cortex-A510:
mct_8tap_sharp_w16_v_8bpc_i8mm:  1.001x
mct_8tap_sharp_w8_v_8bpc_i8mm:   0.979x
mct_8tap_sharp_w4_v_8bpc_i8mm:   0.998x

mc_8tap_sharp_w16_v_8bpc_i8mm:   0.998x
mc_8tap_sharp_w8_v_8bpc_i8mm:    1.004x
mc_8tap_sharp_w4_v_8bpc_i8mm:    1.003x
mc_8tap_sharp_w2_v_8bpc_i8mm:    0.996x
2024-05-08 23:28:52 +02:00
Arpad Panyik and Jean-Baptiste Kempf d1bdf4f1ff AArch64: Optimize horizontal i8mm prep filters
Replace the accumulator initializations of the horizontal prep
filters with register fills by zeros. Most i8mm capable CPUs can do
these with zero latency, but we also need to use rounding shifts at
the end of the filter. We can see better performance with this
change on out-of-order CPUs.

Relative performance of micro benchmarks (lower is better):

Cortex-X3:
mct_8tap_sharp_w32_h_8bpc_i8mm:  0.914x
mct_8tap_sharp_w16_h_8bpc_i8mm:  0.906x
mct_8tap_sharp_w8_h_8bpc_i8mm:   0.877x

Cortex-A715:
mct_8tap_sharp_w32_h_8bpc_i8mm:  0.819x
mct_8tap_sharp_w16_h_8bpc_i8mm:  0.805x
mct_8tap_sharp_w8_h_8bpc_i8mm:   0.779x

Cortex-A510:
mct_8tap_sharp_w32_h_8bpc_i8mm:  0.999x
mct_8tap_sharp_w16_h_8bpc_i8mm:  1.001x
mct_8tap_sharp_w8_h_8bpc_i8mm:   0.996x
mct_8tap_sharp_w4_h_8bpc_i8mm:   0.915x
2024-05-08 20:16:13 +00:00
Arpad Panyik 1776c45a08 AArch64: Add basic i8mm support for convolutions
Add an Armv8.6-A i8mm code path for standard bitdepth convolutions.
Only horizontal-vertical (HV) convolutions have 6-tap specialisations
of their vertical passes. All other convolutions are 4- or 8-tap
filters which fit well with the 4-element USDOT instruction.
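
As a rough scalar illustration of why 4- and 8-tap filters map well onto a
4-element dot product: one dot product step covers a 4-tap filter, and two
steps accumulating into the same register cover an 8-tap filter. The helper
names below are hypothetical, and `usdot4` only models a single 32-bit lane of
the instruction.

```c
#include <stdint.h>

/* Scalar model of one USDOT lane: four unsigned samples times four
 * signed coefficients, accumulated into a 32-bit value. */
static int32_t usdot4(int32_t acc, const uint8_t px[4], const int8_t coef[4])
{
    for (int i = 0; i < 4; i++)
        acc += (int32_t)px[i] * coef[i];
    return acc;
}

/* A 4-tap filter is a single dot product step. */
static int32_t filter_4tap(const uint8_t px[4], const int8_t coef[4])
{
    return usdot4(0, px, coef);
}

/* An 8-tap filter is two dot product steps on the same accumulator. */
static int32_t filter_8tap(const uint8_t px[8], const int8_t coef[8])
{
    int32_t acc = usdot4(0, px, coef);
    return usdot4(acc, px + 4, coef + 4);
}
```

A 6-tap filter does not divide evenly into groups of four, which is consistent
with only the vertical passes of the HV case getting a 6-tap specialisation.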

Benchmarks show a 4-9% FPS increase relative to the Armv8.4-A
code path, depending on the input video and the CPU used.

This patch will increase the .text by around 5.7 KiB.

Relative performance to the C reference on some Cortex CPU cores:

                       Cortex-A715   Cortex-X3  Cortex-A510
regular w4 hv neon:          7.20x      11.20x        4.40x
regular w4 hv dotprod:      12.77x      18.35x        6.21x
regular w4 hv i8mm:         14.50x      21.42x        6.16x

  sharp w4 hv neon:          6.24x       9.77x        3.96x
  sharp w4 hv dotprod:       9.76x      14.02x        5.20x
  sharp w4 hv i8mm:         10.84x      16.09x        5.42x

regular w8 hv neon:          2.17x       2.46x        3.17x
regular w8 hv dotprod:       3.04x       3.11x        3.03x
regular w8 hv i8mm:          3.57x       3.40x        3.27x

  sharp w8 hv neon:          1.72x       1.93x        2.75x
  sharp w8 hv dotprod:       2.49x       2.54x        2.62x
  sharp w8 hv i8mm:          2.80x       2.79x        2.70x

regular w16 hv neon:         1.90x       2.17x        2.02x
regular w16 hv dotprod:      2.59x       2.64x        1.93x
regular w16 hv i8mm:         3.01x       2.85x        2.05x

  sharp w16 hv neon:         1.51x       1.72x        1.74x
  sharp w16 hv dotprod:      2.17x       2.22x        1.70x
  sharp w16 hv i8mm:         2.42x       2.42x        1.72x

regular w32 hv neon:         1.80x       1.96x        1.81x
regular w32 hv dotprod:      2.43x       2.36x        1.74x
regular w32 hv i8mm:         2.83x       2.51x        1.83x

  sharp w32 hv neon:         1.42x       1.54x        1.56x
  sharp w32 hv dotprod:      2.07x       2.00x        1.55x
  sharp w32 hv i8mm:         2.29x       2.16x        1.55x

regular w64 hv neon:         1.82x       1.89x        1.70x
regular w64 hv dotprod:      2.43x       2.25x        1.65x
regular w64 hv i8mm:         2.84x       2.39x        1.73x

  sharp w64 hv neon:         1.43x       1.47x        1.49x
  sharp w64 hv dotprod:      2.08x       1.91x        1.49x
  sharp w64 hv i8mm:         2.30x       2.07x        1.48x

regular w128 hv neon:        1.77x       1.84x        1.75x
regular w128 hv dotprod:     2.37x       2.18x        1.70x
regular w128 hv i8mm:        2.76x       2.33x        1.78x

  sharp w128 hv neon:        1.40x       1.45x        1.42x
  sharp w128 hv dotprod:     2.04x       1.87x        1.43x
  sharp w128 hv i8mm:        2.24x       2.02x        1.42x

regular w8 h neon:           3.16x       3.51x        3.43x
regular w8 h dotprod:        4.97x       7.43x        4.95x
regular w8 h i8mm:           7.28x      10.38x        5.69x

  sharp w8 h neon:           2.71x       2.77x        3.10x
  sharp w8 h dotprod:        4.92x       7.14x        4.94x
  sharp w8 h i8mm:           7.21x      10.11x        5.70x

regular w16 h neon:          2.79x       2.76x        3.53x
regular w16 h dotprod:       3.81x       4.77x        3.13x
regular w16 h i8mm:          5.21x       6.04x        3.56x

  sharp w16 h neon:          2.31x       2.38x        3.12x
  sharp w16 h dotprod:       3.80x       4.74x        3.13x
  sharp w16 h i8mm:          5.20x       5.98x        3.56x

regular w64 h neon:          2.49x       2.46x        2.94x
regular w64 h dotprod:       3.17x       3.60x        2.41x
regular w64 h i8mm:          4.22x       4.40x        2.72x

  sharp w64 h neon:          2.07x       2.06x        2.60x
  sharp w64 h dotprod:       3.16x       3.58x        2.40x
  sharp w64 h i8mm:          4.20x       4.38x        2.71x

regular w8 v neon:           6.11x       8.05x        4.07x
regular w8 v dotprod:        5.45x       8.15x        4.01x
regular w8 v i8mm:           7.30x       9.46x        4.19x

  sharp w8 v neon:           4.23x       5.46x        3.09x
  sharp w8 v dotprod:        5.43x       7.96x        4.01x
  sharp w8 v i8mm:           7.26x       9.12x        4.19x

regular w16 v neon:          3.44x       4.33x        2.40x
regular w16 v dotprod:       3.20x       4.53x        2.85x
regular w16 v i8mm:          4.09x       5.27x        2.87x

  sharp w16 v neon:          2.50x       3.14x        1.82x
  sharp w16 v dotprod:       3.20x       4.52x        2.86x
  sharp w16 v i8mm:          4.09x       5.15x        2.86x

regular w64 v neon:          2.74x       3.11x        1.53x
regular w64 v dotprod:       2.63x       3.30x        1.84x
regular w64 v i8mm:          3.31x       3.73x        1.84x

  sharp w64 v neon:          2.01x       2.29x        1.16x
  sharp w64 v dotprod:       2.61x       3.27x        1.83x
  sharp w64 v i8mm:          3.29x       3.68x        1.84x
2024-04-26 14:04:18 +02:00
Arpad Panyik fbf23637ce AArch64: Simplify DotProd path of 2D subpel filters
Simplify the DotProd code path of the 2D (horizontal-vertical) subpel
filters. It contains some instruction reordering and some macro
simplifications to be more similar to the upcoming i8mm version.

These changes have negligible effect on performance.

Cortex-A510:
mc_8tap_regular_w2_hv_8bpc_dotprod:   8.3769 ->  8.3380
mc_8tap_sharp_w2_hv_8bpc_dotprod:     9.5441 ->  9.5457
mc_8tap_regular_w4_hv_8bpc_dotprod:   8.3422 ->  8.3444
mc_8tap_sharp_w4_hv_8bpc_dotprod:     9.5441 ->  9.5367
mc_8tap_regular_w8_hv_8bpc_dotprod:   9.9852 ->  9.9666
mc_8tap_sharp_w8_hv_8bpc_dotprod:    12.5554 -> 12.5314

Cortex-A55:
mc_8tap_regular_w2_hv_8bpc_dotprod:  6.4504  ->  6.4892
mc_8tap_sharp_w2_hv_8bpc_dotprod:    7.5732  ->  7.6078
mc_8tap_regular_w4_hv_8bpc_dotprod:  6.5088  ->  6.4760
mc_8tap_sharp_w4_hv_8bpc_dotprod:    7.5796  ->  7.5763
mc_8tap_regular_w8_hv_8bpc_dotprod:  9.3384  ->  9.3078
mc_8tap_sharp_w8_hv_8bpc_dotprod:   11.1159  -> 11.1401

Cortex-A78:
mc_8tap_regular_w2_hv_8bpc_dotprod:  1.4122  ->  1.4250
mc_8tap_sharp_w2_hv_8bpc_dotprod:    1.7696  ->  1.7821
mc_8tap_regular_w4_hv_8bpc_dotprod:  1.4243  ->  1.4243
mc_8tap_sharp_w4_hv_8bpc_dotprod:    1.7866  ->  1.7863
mc_8tap_regular_w8_hv_8bpc_dotprod:  2.5304  ->  2.5171
mc_8tap_sharp_w8_hv_8bpc_dotprod:    3.0815  ->  3.0632

Cortex-X1:
mc_8tap_regular_w2_hv_8bpc_dotprod:  0.8195  ->  0.8194
mc_8tap_sharp_w2_hv_8bpc_dotprod:    1.0092  ->  1.0081
mc_8tap_regular_w4_hv_8bpc_dotprod:  0.8197  ->  0.8166
mc_8tap_sharp_w4_hv_8bpc_dotprod:    1.0089  ->  1.0068
mc_8tap_regular_w8_hv_8bpc_dotprod:  1.5230  ->  1.5166
mc_8tap_sharp_w8_hv_8bpc_dotprod:    1.8683  ->  1.8625
2024-04-25 17:02:09 +02:00
Arpad Panyik a40301b33f AArch64: Simplify loads in *hv_filter* of DotProd path
Simplify the load sequences in *hv_filter* functions (ldr + add -> ld1)
to be more uniform and smaller. Performance is not affected.
2024-04-25 17:02:09 +02:00
Arpad Panyik b0685c387d AArch64: Simplify TBL usage in 2D DotProd filters
Simplify the TBL usage in the small block size (2, 4) parts of the 2D
(horizontal-vertical) put subpel filters. The 2-register TBLs are
replaced with the 1-register form because we only need the lower
64 bits of the result and it can be extracted from only one source
register. Performance is not affected by this change.
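
A scalar C model of the two TBL forms may help show why the replacement is
safe. It assumes the architectural behaviour that out-of-range indices yield
zero; the function names are hypothetical.

```c
#include <stdint.h>

/* TBL, 1-register form: indices 0..15 select a byte from src,
 * any other index produces 0. */
static void tbl1(uint8_t dst[16], const uint8_t src[16],
                 const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = idx[i] < 16 ? src[idx[i]] : 0;
}

/* TBL, 2-register form: indices 0..31 select a byte from the
 * concatenation of src0 and src1, any other index produces 0. */
static void tbl2(uint8_t dst[16], const uint8_t src0[16],
                 const uint8_t src1[16], const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++) {
        if (idx[i] < 16)
            dst[i] = src0[idx[i]];
        else if (idx[i] < 32)
            dst[i] = src1[idx[i] - 16];
        else
            dst[i] = 0;
    }
}
```

When the indices that feed the lower 8 result bytes all fall in 0..15, the
lower 64 bits of both forms are identical, so the second table register is not
needed.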
2024-04-25 17:02:09 +02:00
Arpad Panyik ad7938d517 AArch64: Simplify DotProd path of horizontal subpel filters
Simplify the inner loops of the DotProd code path of horizontal
subpel filters to avoid using 2-register TBL instructions. The
store part of block size 16 of the horizontal put case is also
simplified (str + add -> st1). This patch can improve performance
mostly on small cores like Cortex-A510 and newer. Other CPUs are
mostly unaffected.

Cortex-A510:
mct_8tap_sharp_w16_h_8bpc_dotprod:  2.77x -> 3.13x
mct_8tap_sharp_w32_h_8bpc_dotprod:  2.32x -> 2.56x

Cortex-A55:
mct_8tap_sharp_w16_h_8bpc_dotprod:  3.89x -> 3.89x
mct_8tap_sharp_w32_h_8bpc_dotprod:  3.35x -> 3.35x

Cortex-A715:
mct_8tap_sharp_w16_h_8bpc_dotprod:  3.79x -> 3.78x
mct_8tap_sharp_w32_h_8bpc_dotprod:  3.30x -> 3.30x

Cortex-A78:
mct_8tap_sharp_w16_h_8bpc_dotprod:  4.30x -> 4.31x
mct_8tap_sharp_w32_h_8bpc_dotprod:  3.79x -> 3.80x

Cortex-X3:
mct_8tap_sharp_w16_h_8bpc_dotprod:  4.74x -> 4.75x
mct_8tap_sharp_w32_h_8bpc_dotprod:  3.89x -> 3.91x

Cortex-X1:
mct_8tap_sharp_w16_h_8bpc_dotprod:  4.61x -> 4.62x
mct_8tap_sharp_w32_h_8bpc_dotprod:  3.67x -> 3.66x
2024-04-25 16:59:53 +02:00
Arpad Panyik 317a94c6bb AArch64: Simplify DotProd path of vertical subpel filters
Simplify the accumulator initializations of the DotProd code path of
vertical subpel filters. This also makes it possible for some CPUs to
use zero latency vector register moves. The load is also simplified
(ldr + add -> ld1) in the inner loop of vertical filter for block
size 16.
2024-04-25 16:59:13 +02:00
Arpad Panyik 7eee4a2059 AArch64: Add \dot parameter to filter_8tap_fn macro
Add \dot parameter to filter_8tap_fn macro in preparation for
extending it with an i8mm code path. This patch also contains string
fixes and some instruction reorderings along with some register
renaming to make it more uniform. These changes don't affect
performance but simplify the code a bit.
2024-04-25 16:58:11 +02:00
Arpad Panyik 9d77b6336a AArch64: Add DotProd support for convolutions
Add an Armv8.4-A DotProd code path for standard bitdepth convolutions.
Only horizontal-vertical (HV) convolutions have 6-tap specialisations
of their vertical passes. All other convolutions are 4- or 8-tap
filters which fit well with the 4-element SDOT instruction.

Benchmarks show a 7-29% FPS increase depending on the input video
and the CPU used.

This patch will increase the .text by around 6.5 KiB.

Performance highly depends on the SDOT and MLA throughput ratio, as
can be seen in the vertical filter cases. Small cores are also
affected by the TBL execution latencies.

Relative performance to the C reference on some CPUs:

                          A76      A78       X1      A55
regular w4 hv neon:      5.52x    5.78x   10.75x    8.27x
regular w4 hv dotprod:   7.94x    8.49x   16.84x    8.09x
sharp w4 hv neon:        5.27x    5.22x    9.06x    7.87x
sharp w4 hv dotprod:     6.61x    6.73x   12.64x    6.89x

regular w8 hv neon:      1.95x    2.19x    2.56x    3.16x
regular w8 hv dotprod:   3.23x    2.81x    3.20x    3.26x
sharp w8 hv neon:        1.61x    1.79x    2.05x    2.72x
sharp w8 hv dotprod:     2.72x    2.29x    2.66x    2.76x

regular w16 hv neon:     1.63x    2.04x    2.16x    2.73x
regular w16 hv dotprod:  2.72x    2.57x    2.67x    2.80x
sharp w16 hv neon:       1.33x    1.67x    1.74x    2.34x
sharp w16 hv dotprod:    2.31x    2.14x    2.26x    2.39x

regular w32 hv neon:     1.48x    1.92x    1.94x    2.51x
regular w32 hv dotprod:  2.49x    2.40x    2.33x    2.58x
sharp w32 hv neon:       1.21x    1.56x    1.53x    2.14x
sharp w32 hv dotprod:    2.12x    2.02x    2.00x    2.22x

regular w64 hv neon:     1.42x    1.87x    1.85x    2.40x
regular w64 hv dotprod:  2.40x    2.32x    2.21x    2.46x
sharp w64 hv neon:       1.16x    1.52x    1.46x    2.04x
sharp w64 hv dotprod:    2.02x    1.96x    1.90x    2.11x

regular w128 hv neon:    1.39x    1.84x    1.80x    2.27x
regular w128 hv dotprod: 2.33x    2.28x    2.14x    2.35x
sharp w128 hv neon:      1.14x    1.50x    1.42x    1.94x
sharp w128 hv dotprod:   1.98x    1.93x    1.84x    2.03x

regular w8 h neon:       2.61x    3.20x    3.51x    3.55x
regular w8 h dotprod:    4.43x    5.17x    6.26x    4.30x
sharp w8 h neon:         2.01x    2.80x    2.89x    3.12x
sharp w8 h dotprod:      4.42x    5.16x    6.27x    4.28x

regular w16 h neon:      2.17x    3.13x    2.92x    3.35x
regular w16 h dotprod:   4.38x    4.27x    4.53x    3.90x
sharp w16 h neon:        1.74x    2.65x    2.48x    2.92x
sharp w16 h dotprod:     4.33x    4.27x    4.53x    3.91x

regular w64 h neon:      1.92x    2.82x    2.39x    2.96x
regular w64 h dotprod:   3.68x    3.60x    3.40x    3.18x
sharp w64 h neon:        1.47x    2.33x    2.05x    2.54x
sharp w64 h dotprod:     3.68x    3.60x    3.40x    3.17x

regular w4 v neon:       5.39x    7.38x   10.27x   11.41x
regular w4 v dotprod:    9.46x   14.15x   18.72x    9.84x
sharp w4 v neon:         4.51x    6.39x    8.17x   10.70x
sharp w4 v dotprod:      9.35x   14.20x   18.63x    9.78x

regular w16 v neon:      3.03x    4.03x    4.65x    6.28x
regular w16 v dotprod:   4.64x    3.75x    4.78x    3.89x
sharp w16 v neon:        2.29x    3.09x    3.44x    5.52x
sharp w16 v dotprod:     4.62x    3.74x    4.77x    3.89x

regular w64 v neon:      2.17x    3.14x    3.19x    4.46x
regular w64 v dotprod:   3.43x    3.00x    3.31x    2.74x
sharp w64 v neon:        1.61x    2.42x    2.34x    3.89x
sharp w64 v dotprod:     3.38x    3.00x    3.29x    2.73x
2024-04-11 19:03:58 +02:00
Arpad Panyik and Martin Storsjö 932b323c3e AArch64: Specialise HBD Neon convolutions for 6-tap filters
The 8-tap sub-pel filters used for motion vector interpolation are:
regular, smooth, sharp. The regular and smooth filter kernels are
zero-padded, so they are effectively 6-tap filters (some of them are
5-tap or even 4-tap).

This patch specialises the high bit-depth versions of put_8tap_neon
and prep_8tap_neon functions for 6-tap filters, avoiding a lot of
redundant work to multiply by and add zero. Wherever the sharp
filtering is used, the 8-tap path is always selected.
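
The idea behind the 6-tap specialisation can be sketched in scalar C. The
coefficient layout and function names below are illustrative assumptions; in
particular, the coefficient check in `is_6tap` is only a stand-in, since the
patch selects the path by filter type.

```c
#include <stdint.h>

/* Generic 8-tap filter over high bit-depth (16-bit) samples. */
static int32_t filter_8tap(const uint16_t src[8], const int8_t coef[8])
{
    int32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (int32_t)src[i] * coef[i];
    return sum;
}

/* 6-tap specialisation: skips the outer taps. Only valid when
 * coef[0] and coef[7] are zero (a zero-padded kernel). */
static int32_t filter_6tap(const uint16_t src[8], const int8_t coef[8])
{
    int32_t sum = 0;
    for (int i = 1; i < 7; i++)
        sum += (int32_t)src[i] * coef[i];
    return sum;
}

/* Illustrative check for a zero-padded kernel. */
static int is_6tap(const int8_t coef[8])
{
    return coef[0] == 0 && coef[7] == 0;
}
```

For a zero-padded kernel both functions return the same value, while the 6-tap
one performs two fewer multiply-accumulates and needs two fewer input samples
per output.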

Benchmarks can show a 0.5-10.8% FPS uplift, depending heavily on the
input video source. Binary size increase is ~8.5 KiB.
2024-03-05 11:45:55 +00:00
Arpad Panyik b0a329d6a6 AArch64: Optimize 6-tap SBD HV Neon convolution
Optimize the 6-tap standard bit-depth horizontal-vertical combined
convolution to avoid unnecessary reads and horizontal convolution
steps at the beginning and end of the algorithm. This also saves some
instructions in the final binary.

Performance of this function increases by up to 5.5% depending on
block size.
2024-03-05 11:25:33 +00:00
Arpad Panyik and Martin Storsjö acc1121d2f Extend Arm and AArch64 run-time CPU feature detection
Add run-time CPU feature detection for DotProd, i8mm, SVE and SVE2.
SVE and SVE2 are AArch64-only features.
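
On Linux, one common way to do this kind of detection is to decode the hwcap
words reported by the kernel. The sketch below is an assumption-laden
illustration, not the code from this patch: the `SK_` names are hypothetical,
and the bit positions restate the Linux AArch64 values from `<asm/hwcap.h>` so
that the example compiles on any host.

```c
#include <stdint.h>

/* Linux AArch64 hwcap bits, restated locally (assumed to match
 * <asm/hwcap.h>); the SK_ prefix marks them as sketch-only names. */
#define SK_HWCAP_ASIMDDP (1UL << 20)   /* DotProd, in AT_HWCAP  */
#define SK_HWCAP_SVE     (1UL << 22)   /* SVE,     in AT_HWCAP  */
#define SK_HWCAP2_SVE2   (1UL << 1)    /* SVE2,    in AT_HWCAP2 */
#define SK_HWCAP2_I8MM   (1UL << 13)   /* i8mm,    in AT_HWCAP2 */

/* Hypothetical feature flags returned to the caller. */
enum {
    SK_FLAG_DOTPROD = 1 << 0,
    SK_FLAG_I8MM    = 1 << 1,
    SK_FLAG_SVE     = 1 << 2,
    SK_FLAG_SVE2    = 1 << 3,
};

/* Translate the two hwcap words into the sketch's feature flags. */
static unsigned sk_decode_hwcaps(unsigned long hwcap, unsigned long hwcap2)
{
    unsigned flags = 0;
    if (hwcap  & SK_HWCAP_ASIMDDP) flags |= SK_FLAG_DOTPROD;
    if (hwcap2 & SK_HWCAP2_I8MM)   flags |= SK_FLAG_I8MM;
    if (hwcap  & SK_HWCAP_SVE)     flags |= SK_FLAG_SVE;
    if (hwcap2 & SK_HWCAP2_SVE2)   flags |= SK_FLAG_SVE2;
    return flags;
}
```

On an AArch64 Linux system the two inputs would typically come from
`getauxval(AT_HWCAP)` and `getauxval(AT_HWCAP2)` in `<sys/auxv.h>`; other
operating systems expose the same information through different interfaces.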
2024-02-28 16:32:28 +00:00
Arpad Panyik f1d42ae8f1 AArch64: Enable benchmarks for 8-tap sharp filters
The 6-tap sub-pel filter specialisation uses different code paths for
sharp (8-tap) and regular/smooth (6-tap) filtering kernels.

This patch enables benchmarking for the different code paths.
2024-02-22 08:58:17 +01:00
Arpad Panyik e51f4377fb AArch64: Specialise Neon convolutions for 6-tap filters
The 8-tap sub-pel filters used for motion vector interpolation are:
regular, smooth, sharp. The regular and smooth filter kernels are
zero-padded, so they are effectively 6-tap filters (some of them are
5-tap or even 4-tap).

This patch specialises the put_8tap_neon and prep_8tap_neon functions
for 6-tap filters, avoiding a lot of redundant work to multiply by
and add zero. Wherever the sharp filtering is used, the 8-tap path
is always selected.

Benchmarking this on a broad range of recent CPUs shows a 7-15% FPS
uplift.

Get raw sample video:
https://ultravideo.fi/video/Bosphorus_1920x1080_120fps_420_8bit_YUV_RAW.7z

Encode using:
aomenc --good --cpu-used=5 -w 1920 -h 1080 --bit-depth=8 --ivf -o Bosphorus_1080p_8bit.ivf Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m
2024-02-22 08:58:17 +01:00