dav1d

x/dav1d

mirror of https://code.videolan.org/videolan/dav1d synced 2026-06-11 04:03:05 +00:00

Author	SHA1	Message	Date
Arpad Panyik and Martin Storsjö	62501cc7db	AArch64: Optimize ipred_smooth_8bpc_neon Optimize ipred_smooth_8bpc_neon by using simpler arithmetic operations and removing the jump table. Relative runtime after this patch on some Cortex CPUs: ipred_smooth: w4 w8 w16 w32 w64 Cortex-A55: 1.041x 0.839x 0.705x 0.765x 0.802x Cortex-A510: 1.055x 0.880x 0.669x 0.694x 0.729x Cortex-A520: 1.113x 0.922x 0.659x 0.737x 0.783x Cortex-A76: 0.763x 0.733x 0.608x 0.707x 0.791x Cortex-A78: 0.840x 0.712x 0.704x 0.748x 0.786x Cortex-A715: 0.814x 0.655x 0.798x 0.837x 0.858x Cortex-A725: 0.813x 0.653x 0.791x 0.830x 0.854x Cortex-X1: 0.825x 0.686x 0.667x 0.729x 0.756x Cortex-X3: 0.865x 0.617x 0.649x 0.674x 0.688x Cortex-X925: 0.825x 0.677x 0.641x 0.686x 0.700x	2026-05-26 12:30:26 +00:00
Arpad Panyik	dbed372b70	AArch64: Optimize ipred_smooth_v_8bpc_neon further Optimize ipred_smooth_v_8bpc_neon even further using a vertical inner loop for w >= 16 cases. Relative runtime after this patch on some Cortex CPUs: ipred_smooth_v: w4 w8 w16 w32 w64 Cortex-A55: 0.985x 0.981x 0.810x 0.873x 0.907x Cortex-A510: 0.966x 0.951x 0.950x 1.013x 1.047x Cortex-A520: 0.924x 0.924x 0.890x 0.984x 1.030x Cortex-A76: 0.978x 1.036x 0.899x 0.919x 0.918x Cortex-A78: 0.997x 0.993x 0.986x 0.972x 0.983x Cortex-A710: 1.002x 0.973x 0.984x 0.958x 1.002x Cortex-A715: 1.073x 1.049x 1.005x 1.018x 1.012x Cortex-A720: 1.001x 1.004x 0.990x 1.007x 1.008x Cortex-A725: 1.002x 1.001x 0.985x 1.007x 1.006x Cortex-X1: 0.996x 1.077x 0.927x 0.962x 0.970x Cortex-X2: 1.012x 0.989x 0.881x 0.971x 0.981x Cortex-X3: 1.006x 1.034x 0.841x 0.966x 0.962x Cortex-X4: 1.020x 1.022x 0.915x 0.964x 0.985x Cortex-X925: 1.000x 0.947x 0.936x 0.982x 0.996x	2026-05-20 13:48:18 +02:00
Arpad Panyik	a38236491a	AArch64: Optimize ipred_smooth_h_8bpc_neon further Optimize ipred_smooth_h_8bpc_neon even further using vertical inner loop for w >= 16 cases. Reorder instructions in the w = 4 handler for Small CPUs. Relative runtime after this patch on some Cortex CPUs: ipred_smooth_h: w4 w8 w16 w32 w64 Cortex-A55: 0.964x 1.003x 0.891x 0.979x 1.030x Cortex-A510: 0.952x 0.936x 0.928x 1.004x 1.050x Cortex-A520: 0.921x 0.925x 0.921x 0.995x 1.032x Cortex-A76: 0.993x 1.005x 0.977x 0.995x 0.996x Cortex-A78: 0.991x 0.998x 1.042x 0.978x 1.015x Cortex-A710: 1.020x 0.966x 1.015x 1.015x 1.008x Cortex-A715: 1.026x 1.051x 1.039x 1.007x 1.024x Cortex-A720: 0.954x 0.999x 1.018x 0.999x 1.020x Cortex-A725: 0.962x 1.000x 1.018x 1.000x 1.021x Cortex-X1: 1.019x 0.993x 0.924x 0.983x 0.989x Cortex-X2: 1.013x 0.991x 0.872x 0.964x 1.023x Cortex-X3: 1.030x 0.996x 0.840x 0.953x 1.024x Cortex-X4: 1.026x 1.005x 0.952x 0.970x 0.986x Cortex-X925: 1.000x 0.980x 0.865x 0.899x 0.892x	2026-05-20 13:44:22 +02:00
Arpad Panyik and Martin Storsjö	51b67010e2	AArch64: Optimize ipred_smooth_v_8bpc_neon Optimize ipred_smooth_v_8bpc_neon by using simpler arithmetic operations and removing the jump table. Relative runtime after this patch on some Cortex CPUs: ipred_smooth_v: w4 w8 w16 w32 w64 Cortex-A55: 1.025x 0.847x 0.821x 0.830x 0.852x Cortex-A510: 1.017x 0.923x 0.915x 0.883x 0.840x Cortex-A520: 1.080x 0.972x 0.999x 0.934x 0.876x Cortex-A76: 0.818x 0.575x 0.599x 0.723x 0.744x Cortex-A78: 0.782x 0.571x 0.595x 0.641x 0.685x Cortex-A715: 0.801x 0.586x 0.593x 0.651x 0.694x Cortex-A725: 0.801x 0.579x 0.596x 0.649x 0.692x Cortex-X1: 0.782x 0.560x 0.553x 0.623x 0.682x Cortex-X3: 0.792x 0.594x 0.526x 0.526x 0.604x Cortex-X925: 0.757x 0.678x 0.525x 0.554x 0.577x	2026-05-06 20:18:03 +00:00
Arpad Panyik and Martin Storsjö	4db1a05aad	AArch64: Optimize ipred_smooth_h_8bpc_neon Optimize ipred_smooth_h_8bpc_neon using simpler arithmetic operations. Relative runtime after this patch on some Cortex CPUs: ipred_smooth_h: w4 w8 w16 w32 w64 Cortex-A55: 1.015x 0.857x 0.819x 0.835x 0.862x Cortex-A510: 0.988x 0.860x 0.915x 0.879x 0.837x Cortex-A520: 0.999x 0.883x 0.967x 0.929x 0.873x Cortex-A76: 0.804x 0.637x 0.517x 0.573x 0.613x Cortex-A78: 0.800x 0.586x 0.548x 0.639x 0.640x Cortex-A715: 0.722x 0.642x 0.563x 0.627x 0.646x Cortex-A725: 0.710x 0.639x 0.567x 0.622x 0.645x Cortex-X1: 0.758x 0.570x 0.565x 0.548x 0.557x Cortex-X3: 0.789x 0.589x 0.528x 0.563x 0.571x Cortex-X925: 0.855x 0.739x 0.541x 0.551x 0.567x	2026-05-06 20:18:03 +00:00
Arpad Panyik	c5726277ff	AArch64: Optimize ipred_h_8bpc_neon Optimize ipred_h_8bpc_neon using simpler stores and simpler indexing. Relative runtime after this patch on some Cortex CPUs: ipred_h: w4 w8 w16 w32 w64 Cortex-A55: 1.054x 1.054x 0.978x 1.149x 1.097x Cortex-A510: 0.455x 0.970x 0.973x 1.010x 1.002x Cortex-A520: 0.973x 0.975x 0.979x 1.002x 1.000x Cortex-A76: 0.791x 0.934x 0.912x 1.010x 0.999x Cortex-A78: 0.771x 0.933x 0.957x 0.519x 0.510x Cortex-A715: 0.838x 0.860x 0.893x 0.585x 0.661x Cortex-A720: 0.839x 0.860x 0.892x 0.580x 0.659x Cortex-A725: 0.809x 0.837x 0.871x 0.580x 0.660x Cortex-X1: 0.973x 0.982x 0.989x 0.498x 0.660x Cortex-X3: 0.971x 0.992x 0.987x 0.495x 0.661x Cortex-X925: 0.950x 1.000x 1.000x 0.474x 0.655x	2026-04-16 16:02:28 +02:00
Arpad Panyik	47e2607e6c	AArch64: Optimize ipred_v_8bpc_neon Optimize the width = 4 case of ipred_v_8bpc_neon by using simple stores instead of the lane stores which can improve performance on some CPUs. Relative runtime after this patch on some Cortex CPUs: ipred_v: w4 Cortex-A55: 1.041x Cortex-A510: 0.297x Cortex-A520: 0.748x Cortex-A76: 0.866x Cortex-A78: 0.856x Cortex-A715: 0.874x Cortex-A720: 0.875x Cortex-A725: 0.868x Cortex-X1: 1.013x Cortex-X3: 1.000x Cortex-X925: 1.000x	2026-04-15 17:37:46 +02:00
Arpad Panyik	edb16889d1	AArch64: Add Neon implementation of load_tmvs This patch adds a vectorised variant of the mv_projection calculation and a faster initialisation of motion vectors for load_tmvs_neon. Checkasm uplifts after this patch on some Neoverse and Cortex CPU cores compared to the C reference compiled with GCC-13 and Clang-19: GCC Clang AWS Graviton 4: 1.62x 1.59x Cortex-X4: 1.45x 1.46x Cortex-X3: 1.68x 1.69x Cortex-X1: 1.55x 1.52x Cortex-A720: 1.54x 1.57x Cortex-A715: 1.47x 1.55x Cortex-A78: 1.21x 1.18x Cortex-A76: 1.38x 1.37x Cortex-A72: 1.08x 1.11x Cortex-A520: 0.97x 1.18x Cortex-A510: 0.99x 1.14x Cortex-A55: 1.16x 1.23x This patch increases the .text size by ~660 bytes, but the new function is about 0.5 KiB smaller than the reference implementation.	2025-01-09 14:59:31 +01:00
Arpad Panyik and Martin Storsjö	82e9155c75	AArch64: Trim Armv8.0 Neon path of 6-tap and 8-tap MC functions There are some instruction sequences we could merge after the lane load/store patch (`ec5c3052cf`). This change will simplify the loading of filter weights to save 288 bytes in the Armv8.0 Neon path of 6-tap and 8-tap MC functions.	2024-09-12 11:31:07 +00:00
Arpad Panyik and Martin Storsjö	ec5c3052cf	AArch64: Optimize lane load/store in MC functions Partial register writes can create long dependency chains, which can reduce performance on out-of-order CPUs. This patch removes most of these kinds of problems in MC functions by filling the full register before other lane loading instructions. Most lane extracting stores can also be optimized using FP scalar stores when the 0th lane would be extracted. Relative runtime of micro benchmarks after this patch on some Neoverse and Cortex CPU cores: 8bpc neon V2 V1 X3 X1 A715 A78 A76 avg w8: 0.942x 1.030x 0.936x 0.935x 1.000x 0.877x 0.976x w_avg w8: 0.908x 0.913x 0.919x 0.914x 0.999x 0.905x 0.910x mask w8: 0.937x 0.905x 0.929x 0.907x 1.009x 0.921x 0.868x w_mask 420 w4: 0.969x 0.968x 0.951x 0.962x 0.995x 0.976x 0.958x w_mask 420 w8: 0.979x 0.935x 0.936x 0.935x 0.996x 0.948x 0.959x blend w4: 0.721x 0.841x 0.764x 0.822x 0.772x 0.826x 0.883x blend w8: 0.692x 0.733x 0.686x 0.730x 0.828x 0.723x 0.762x blend h w2: 0.738x 0.776x 0.746x 0.775x 0.683x 0.827x 0.851x blend h w4: 0.858x 0.942x 0.880x 0.933x 0.784x 0.924x 0.965x blend h w8: 0.804x 0.807x 0.806x 0.805x 0.814x 0.810x 0.748x blend v w2: 0.898x 0.931x 0.903x 0.949x 0.784x 0.867x 0.875x blend v w4: 0.935x 0.905x 0.933x 0.922x 0.763x 0.777x 0.807x blend v w8: 0.803x 0.802x 0.804x 0.815x 0.674x 0.677x 0.678x 16bpc neon V2 V1 X3 X1 A715 A78 A76 avg w4: 0.899x 0.967x 0.897x 0.948x 1.002x 0.901x 0.884x w_avg w4: 0.952x 0.951x 0.936x 0.946x 0.997x 0.937x 0.925x mask w4: 0.893x 0.958x 0.887x 0.948x 1.003x 0.938x 0.934x w_mask 420 w4: 0.933x 0.932x 0.932x 0.939x 1.000x 0.910x 0.955x w_mask 420 w8: 0.966x 0.962x 0.967x 0.961x 1.000x 0.990x 1.010x blend w4: 0.367x 0.361x 0.370x 0.352x 0.418x 0.394x 0.476x blend h w2: 0.365x 0.445x 0.369x 0.437x 0.416x 0.576x 0.699x blend h w4: 0.343x 0.402x 0.342x 0.398x 0.418x 0.525x 0.603x blend v w2: 0.464x 0.460x 0.460x 0.447x 0.494x 0.446x 0.503x blend v w4: 0.432x 0.424x 0.437x 0.416x 0.433x 0.427x 
0.534x blend v w8: 0.936x 0.847x 0.949x 0.848x 1.007x 0.811x 0.785x bilinear 8bpc neon V2 V1 X3 X1 A715 A78 A76 mct w4 0: 0.982x 0.983x 0.955x 1.029x 0.784x 0.817x 0.814x mc w2 h: 0.277x 0.333x 0.275x 0.325x 0.299x 0.435x 0.518x mct w4 h: 0.835x 0.862x 0.814x 0.887x 1.074x 0.899x 0.884x mc w2 v: 0.887x 0.966x 0.894x 0.945x 0.808x 0.953x 0.997x mc w4 v: 0.762x 0.899x 0.766x 0.867x 0.695x 0.915x 1.017x mct w4 v: 0.700x 0.812x 0.740x 0.777x 0.777x 0.824x 0.853x mc w2 hv: 0.928x 0.985x 0.929x 0.978x 0.789x 0.969x 1.010x mct w4 hv: 0.887x 0.913x 0.912x 0.920x 1.001x 0.922x 0.937x bilinear 16bpc neon V2 V1 X3 X1 A715 A78 A76 mc w2 0: 0.991x 1.032x 0.993x 0.970x 0.878x 0.925x 0.999x mct w4 0: 0.811x 0.730x 0.797x 0.680x 0.808x 0.711x 0.805x mc w4 h: 0.885x 0.901x 0.895x 0.905x 1.003x 0.909x 0.910x mct w4 h: 0.902x 0.914x 0.898x 0.896x 1.000x 0.897x 0.934x mc w2 v: 0.888x 0.966x 0.913x 0.955x 0.824x 0.958x 1.005x mc w4 v: 0.897x 0.894x 0.903x 0.902x 1.001x 0.895x 0.895x mct w4 v: 0.924x 0.908x 0.921x 0.901x 1.001x 0.904x 0.918x mc w4 hv: 0.927x 0.925x 0.924x 0.933x 1.000x 0.936x 0.959x mct w4 hv: 0.923x 0.944x 0.923x 0.944x 0.999x 0.931x 0.956x 8tap 8bpc neon V2 V1 X3 X1 A715 A78 A76 mct regular w4 0: 0.829x 0.854x 0.735x 0.861x 0.769x 0.766x 0.840x mc regular w2 h: 0.984x 1.008x 0.983x 1.012x 0.986x 0.989x 0.995x mc sharp w2 h: 0.987x 1.008x 0.986x 1.011x 0.985x 0.989x 0.995x mc regular w4 h: 0.907x 0.911x 0.916x 0.908x 0.997x 0.936x 0.932x mc sharp w4 h: 0.916x 0.914x 0.918x 0.913x 0.999x 0.939x 0.905x mct regular w4 h: 0.992x 0.979x 0.993x 0.971x 1.000x 0.986x 0.976x mct sharp w4 h: 0.991x 0.979x 0.989x 0.984x 1.001x 0.979x 0.983x mc regular w2 v: 1.002x 1.001x 1.005x 1.000x 1.000x 0.998x 0.983x mc sharp w2 v: 1.005x 1.001x 1.009x 0.998x 0.994x 0.997x 0.989x mc regular w4 v: 0.985x 0.998x 0.991x 0.998x 1.000x 1.000x 0.983x mc sharp w4 v: 1.005x 1.002x 1.006x 1.002x 0.998x 0.991x 0.999x mct regular w4 v: 0.966x 0.967x 0.961x 0.974x 0.996x 0.954x 0.982x mct sharp w4 v: 
0.970x 0.944x 0.967x 0.944x 0.997x 0.951x 0.966x mc regular w2 hv: 0.993x 0.993x 0.994x 0.987x 0.993x 0.985x 0.999x mc sharp w2 hv: 0.994x 0.996x 0.992x 0.998x 0.997x 0.999x 0.999x mc regular w4 hv: 0.964x 0.958x 0.964x 0.960x 0.982x 0.938x 0.958x mc sharp w4 hv: 0.982x 0.981x 0.980x 0.982x 0.995x 0.986x 0.941x mct regular w4 hv: 0.993x 0.994x 0.992x 0.994x 0.996x 0.992x 0.988x mct sharp w4 hv: 0.993x 0.996x 0.991x 0.996x 0.954x 0.992x 1.011x 8tap 16bpc neon V2 V1 X3 X1 A715 A78 A76 mc regular w2 0: 0.869x 1.059x 0.874x 0.956x 0.883x 0.932x 1.000x mct regular w4 0: 0.348x 0.369x 0.354x 0.377x 0.560x 0.409x 0.648x mc regular w2 h: 0.996x 0.988x 0.992x 0.985x 0.989x 0.991x 1.006x mc sharp w2 h: 0.996x 0.989x 0.979x 0.991x 0.987x 0.988x 0.997x mc regular w4 h: 0.957x 0.937x 0.957x 0.948x 0.961x 0.927x 0.994x mc sharp w4 h: 0.966x 0.940x 0.962x 0.954x 0.985x 0.929x 0.970x mct regular w4 h: 0.922x 0.942x 0.932x 0.933x 1.007x 0.938x 0.905x mct sharp w4 h: 0.919x 0.943x 0.919x 0.931x 0.971x 0.943x 0.929x mc regular w2 v: 1.000x 0.997x 1.001x 1.003x 1.001x 0.999x 0.984x mc sharp w2 v: 1.000x 0.999x 1.000x 0.999x 1.000x 1.000x 0.993x mc regular w4 v: 0.936x 0.941x 0.936x 0.939x 0.999x 0.928x 0.981x mc sharp w4 v: 0.955x 0.961x 0.949x 0.956x 0.999x 0.947x 0.953x mct regular w4 v: 0.977x 0.966x 0.979x 0.968x 0.990x 0.972x 0.972x mct sharp w4 v: 0.973x 0.965x 0.981x 0.963x 0.994x 0.977x 0.974x mc regular w2 hv: 0.995x 1.001x 0.995x 0.995x 0.995x 1.000x 0.981x mc sharp w2 hv: 0.993x 1.012x 0.993x 0.988x 0.996x 0.992x 1.008x mc regular w4 hv: 0.938x 0.943x 0.939x 0.943x 0.986x 0.943x 0.997x mc sharp w4 hv: 0.969x 0.959x 0.970x 0.974x 0.986x 0.993x 0.997x mct regular w4 hv: 0.942x 0.970x 0.951x 0.960x 0.977x 0.958x 1.018x mct sharp w4 hv: 0.923x 0.958x 0.934x 0.955x 0.973x 0.946x 0.986x	2024-09-06 11:40:46 +03:00
Arpad Panyik and Martin Storsjö	a992a9bede	AArch64: Optimize Armv8.0 Neon path of SBD H/HV 6-tap filters The 6-tap horizontal and the horizontal parts of 6-tap HV subpel filters can be further improved by some pointer arithmetic and saving some instructions (EXTs) in their data rearrangement codes. Relative runtime of micro benchmarks after this patch on Cortex CPU cores: SBD mct h X1 A78 A76 A72 A55 regular w8: 0.878x 0.894x 0.990x 0.923x 0.944x regular w16: 0.962x 0.931x 0.943x 0.949x 0.949x regular w32: 0.937x 0.937x 0.972x 0.938x 0.947x regular w64: 0.920x 0.965x 0.992x 0.936x 0.944x SBD mct hv X1 A78 A76 A72 A55 regular w8: 0.931x 0.970x 0.951x 0.950x 0.971x regular w16: 0.940x 0.971x 0.941x 0.952x 0.967x regular w32: 0.943x 0.972x 0.946x 0.961x 0.974x regular w64: 0.943x 0.973x 0.952x 0.944x 0.975x	2024-09-06 08:08:08 +00:00
Arpad Panyik and Martin Storsjö	2d808de191	AArch64: Optimize Armv8.0 Neon path of HBD HV 6-tap filters The horizontal parts of 6-tap HV subpel filters can be further improved by some pointer arithmetic and saving some instructions (EXTs) in their data rearrangement codes. Relative runtime of micro benchmarks after this patch on Cortex CPU cores: HBD mct hv X1 A78 A76 A72 A55 regular w8: 0.952x 0.989x 0.924x 0.973x 0.976x regular w16: 0.961x 0.993x 0.928x 0.952x 0.971x regular w32: 0.964x 0.996x 0.930x 0.973x 0.972x regular w64: 0.963x 0.997x 0.930x 0.969x 0.974x	2024-09-06 07:50:38 +00:00
Arpad Panyik and Martin Storsjö	93339ce857	AArch64: Optimize Armv8.0 Neon path of HBD horizontal 6-tap filters The 6-tap horizontal subpel filters can be further improved by some pointer arithmetic and saving some instructions (EXTs) in their data rearrangement codes. Relative runtime of micro benchmarks after this patch on some Cortex CPU cores: regular: X1 A78 A76 A55 mc w8: 0.915x 0.937x 0.900x 0.982x mc w16: 0.917x 0.947x 0.911x 0.971x mc w32: 0.914x 0.938x 0.873x 0.961x mc w64: 0.918x 0.932x 0.882x 0.964x	2024-09-06 07:38:18 +00:00
Arpad Panyik and Martin Storsjö	109b24277b	AArch64: Optimize Armv8.0 Neon path of HBD horizontal filters The reduction parts of the horizontal HBD MC filters use SRSHL+SQXTUN+SRSHL instruction sequences. In the horizontal case this can be rewritten using a single SQSHRUN instruction with an additional rounding value (34 for 10-bit and 40 for 12-bit). Relative runtime of micro benchmarks after this patch on some Cortex CPU cores: regular: X1 A78 A76 A55 mc w2: 0.847x 0.864x 0.822x 0.859x mc w4: 0.889x 0.994x 0.868x 0.917x mc w8: 0.857x 0.911x 0.915x 0.978x mc w16: 0.890x 0.982x 0.868x 0.974x mc w32: 0.904x 0.991x 0.873x 0.967x mc w64: 0.919x 1.003x 0.860x 0.970x	2024-09-06 07:38:18 +00:00
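The merged rounding values quoted in `109b24277b` can be checked with scalar arithmetic. The following is a minimal sketch in C, not the assembly from the patch. It assumes the original sequence is a rounding right shift by 2 bits (10-bit) or 4 bits (12-bit), a saturation to unsigned 16 bits, and a second rounding right shift by 4 or 2 bits; these shift amounts are not given in the commit message and were chosen because they reproduce the stated constants 34 and 40. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rounding right shift by n bits (scalar stand-in for a rounding vector
 * shift). Assumes >> on negative values is an arithmetic shift. */
static int32_t round_shift(int32_t v, int n) {
    return (v + (1 << (n - 1))) >> n;
}

/* Saturate to the unsigned 16-bit range, as the narrowing steps do. */
static int32_t sat_u16(int32_t v) {
    return v < 0 ? 0 : v > 65535 ? 65535 : v;
}

/* Two-step reduction: round-shift by s1, saturate, round-shift by s2. */
static int32_t reduce_two_step(int32_t sum, int s1, int s2) {
    return round_shift(sat_u16(round_shift(sum, s1)), s2);
}

/* Rounding value that folds both rounding steps into one addition. */
static int32_t merged_rounding(int s1, int s2) {
    return (1 << (s1 - 1)) + (1 << (s1 + s2 - 1));
}

/* One-step reduction: add the merged rounding value, then perform a single
 * truncating shift with unsigned saturation. */
static int32_t reduce_one_step(int32_t sum, int s1, int s2) {
    return sat_u16((sum + merged_rounding(s1, s2)) >> (s1 + s2));
}

/* Returns 1 if both reductions agree for every sum in [lo, hi]. */
static int reductions_agree(int32_t lo, int32_t hi, int s1, int s2) {
    for (int32_t sum = lo; sum <= hi; sum++)
        if (reduce_two_step(sum, s1, s2) != reduce_one_step(sum, s1, s2))
            return 0;
    return 1;
}

/* Self-check over ranges where the intermediate value stays below 65536. */
static void check_merged_rounding(void) {
    assert(merged_rounding(2, 4) == 34); /* assumed 10-bit split: 2 + 32 */
    assert(merged_rounding(4, 2) == 40); /* assumed 12-bit split: 8 + 32 */
    assert(reductions_agree(-200000, 200000, 2, 4));
    assert(reductions_agree(-800000, 800000, 4, 2));
}
```

The two forms only agree while the intermediate result of the first shift does not hit the upper saturation bound, which is why the checked ranges are limited; the sketch does not establish what range the real filter sums occupy.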
Arpad Panyik and Martin Storsjö	472b31f838	AArch64: SVE MS armasm64 fix of HBD subpel filters MS armasm64 cannot compile some SVE instructions with immediate operands, e.g.: `sub z0.h, z0.h, #8192`. The proper form is: `sub z0.h, z0.h, #32, lsl #8`. This patch contains the needed fixes.	2024-08-22 19:33:06 +00:00
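The two spellings in `472b31f838` denote the same value, since 32 shifted left by 8 is 8192. A small C sketch of that split follows, assuming the immediate is an unsigned 8-bit value with an optional left shift by 8; the type and function names are hypothetical and only illustrate the arithmetic, not any assembler's internals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Immediate in the explicit "#imm8{, lsl #8}" spelling. */
typedef struct {
    uint8_t imm8; /* unsigned 8-bit payload */
    bool lsl8;    /* true if the payload is shifted left by 8 */
} SveAddSubImm;

/* Tries to express `value` as imm8 or imm8 << 8.
 * Returns false if neither form can represent it. */
static bool sve_split_imm(uint32_t value, SveAddSubImm *out) {
    if (value <= 0xff) {
        out->imm8 = (uint8_t)value;
        out->lsl8 = false;
        return true;
    }
    if ((value & 0xff) == 0 && (value >> 8) <= 0xff) {
        out->imm8 = (uint8_t)(value >> 8);
        out->lsl8 = true;
        return true;
    }
    return false;
}

/* Self-check using the immediate from the commit message. */
static void check_sve_split_imm(void) {
    SveAddSubImm imm;
    assert(sve_split_imm(8192, &imm));
    assert(imm.imm8 == 32 && imm.lsl8);   /* #8192 == #32, lsl #8 */
    assert(sve_split_imm(200, &imm) && !imm.lsl8);
    assert(!sve_split_imm(8193, &imm));   /* low byte set: not encodable */
}
```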
Arpad Panyik and Martin Storsjö	01558f3f66	AArch64: Add HBD subpel filters using 128-bit SVE2 Add an Armv9.0-A SVE2 code path for high bitdepth convolutions. Only 2D convolutions have 6-tap specialisations of their vertical passes. All other convolutions are 4- or 8-tap filters which fit well with the 4-element 16-bit SDOT instruction of SVE2. This patch renames HBD prep/put_neon to prep/put_16bpc_neon and exports put_16bpc_neon. Benchmarks show up to 17% FPS increase depending on the input video and the CPU used. This patch will increase the .text by around 8 KiB. Relative performance to the C reference on some Cortex-A/X CPUs: regular A715 A720 X3 X4 A510 A520 w4 hv neon: 3.93x 4.10x 5.21x 5.17x 3.57x 5.27x w4 hv sve2: 4.99x 5.14x 6.00x 6.05x 4.33x 3.99x w8 hv neon: 1.72x 1.67x 1.98x 2.18x 2.95x 2.94x w8 hv sve2: 2.12x 2.29x 2.52x 2.62x 2.60x 2.60x w16 hv neon: 1.59x 1.53x 1.83x 1.89x 2.35x 2.24x w16 hv sve2: 1.94x 2.12x 2.33x 2.18x 2.06x 2.06x w32 hv neon: 1.49x 1.50x 1.66x 1.76x 2.10x 2.16x w32 hv sve2: 1.81x 2.09x 2.11x 2.09x 1.84x 1.87x w64 hv neon: 1.52x 1.50x 1.55x 1.71x 1.95x 2.05x w64 hv sve2: 1.84x 2.08x 1.97x 1.98x 1.74x 1.77x w4 h neon: 5.35x 5.47x 7.39x 5.78x 3.92x 5.19x w4 h sve2: 7.91x 8.35x 11.95x 10.33x 5.81x 5.42x w8 h neon: 4.49x 4.43x 6.50x 4.87x 7.18x 6.17x w8 h sve2: 6.09x 6.22x 9.59x 7.70x 7.89x 6.83x w16 h neon: 2.53x 2.52x 2.34x 1.86x 2.71x 2.75x w16 h sve2: 3.41x 3.47x 3.53x 3.25x 2.89x 2.96x w32 h neon: 2.07x 2.08x 1.97x 1.56x 2.17x 2.21x w32 h sve2: 2.76x 2.84x 2.94x 2.75x 2.24x 2.29x w64 h neon: 1.86x 1.86x 1.76x 1.41x 1.87x 1.88x w64 h sve2: 2.47x 2.54x 2.65x 2.46x 1.94x 1.94x w4 v neon: 5.22x 5.17x 6.36x 5.60x 4.23x 7.30x w4 v sve2: 5.86x 5.90x 7.81x 7.16x 4.86x 4.15x w8 v neon: 4.83x 4.79x 6.96x 6.45x 4.74x 8.40x w8 v sve2: 5.25x 5.23x 7.76x 6.79x 4.84x 4.13x w16 v neon: 2.59x 2.60x 2.93x 2.47x 1.80x 4.16x w16 v sve2: 2.85x 2.88x 3.36x 2.73x 1.86x 2.00x w32 v neon: 2.12x 2.13x 2.33x 2.03x 1.34x 3.11x w32 v sve2: 2.36x 2.40x 2.73x 
2.32x 1.41x 1.48x w64 v neon: 1.94x 1.92x 2.02x 1.78x 1.12x 2.59x w64 v sve2: 2.16x 2.15x 2.37x 2.03x 1.17x 1.22x w4 0 neon: 1.75x 1.71x 1.44x 1.56x 3.18x 2.87x w4 0 sve2: 4.28x 4.39x 5.72x 6.42x 5.50x 4.68x w8 0 neon: 3.05x 3.04x 4.44x 4.64x 3.84x 3.52x w8 0 sve2: 3.85x 3.80x 5.45x 6.01x 4.92x 4.26x w16 0 neon: 2.92x 2.93x 3.82x 3.23x 4.58x 4.44x w16 0 sve2: 4.29x 4.27x 4.25x 4.15x 5.58x 5.29x w32 0 neon: 2.73x 2.76x 3.50x 2.67x 4.44x 4.26x w32 0 sve2: 4.09x 4.10x 3.75x 3.39x 5.67x 5.22x w64 0 neon: 2.73x 2.70x 3.27x 3.14x 4.57x 4.68x w64 0 sve2: 4.06x 3.97x 3.54x 3.18x 6.36x 6.25x sharp A715 A720 X3 X4 A510 A520 w4 hv neon: 3.54x 3.64x 4.43x 4.45x 3.03x 4.72x w4 hv sve2: 4.30x 4.55x 5.38x 5.26x 4.04x 3.76x w8 hv neon: 1.30x 1.25x 1.51x 1.60x 2.44x 2.43x w8 hv sve2: 1.86x 2.06x 2.09x 2.18x 2.37x 2.39x w16 hv neon: 1.19x 1.16x 1.43x 1.36x 1.95x 1.98x w16 hv sve2: 1.68x 1.91x 1.94x 1.84x 1.89x 1.94x w32 hv neon: 1.13x 1.12x 1.30x 1.29x 1.75x 1.81x w32 hv sve2: 1.58x 1.84x 1.75x 1.74x 1.70x 1.76x w64 hv neon: 1.13x 1.13x 1.21x 1.25x 1.65x 1.69x w64 hv sve2: 1.57x 1.84x 1.62x 1.67x 1.62x 1.65x w4 h neon: 5.38x 5.49x 7.46x 5.74x 3.93x 5.23x w4 h sve2: 7.86x 8.37x 11.99x 10.38x 5.81x 5.40x w8 h neon: 3.46x 3.49x 5.36x 4.64x 6.40x 5.62x w8 h sve2: 5.95x 6.23x 9.61x 7.76x 7.86x 6.89x w16 h neon: 1.99x 1.97x 2.07x 1.91x 2.43x 2.51x w16 h sve2: 3.42x 3.46x 3.75x 3.23x 2.89x 2.98x w32 h neon: 1.67x 1.62x 1.66x 1.63x 1.95x 2.01x w32 h sve2: 2.86x 2.84x 2.94x 2.72x 2.21x 2.29x w64 h neon: 1.45x 1.45x 1.51x 1.48x 1.69x 1.70x w64 h sve2: 2.47x 2.54x 2.64x 2.46x 1.93x 1.95x w4 v neon: 4.07x 4.01x 5.15x 4.74x 3.38x 6.56x w4 v sve2: 5.88x 5.86x 7.81x 7.15x 4.85x 4.39x w8 v neon: 3.64x 3.59x 5.38x 4.92x 3.59x 7.23x w8 v sve2: 5.23x 5.19x 7.77x 6.66x 4.81x 4.13x w16 v neon: 1.93x 1.95x 2.25x 1.92x 1.35x 3.46x w16 v sve2: 2.85x 2.88x 3.36x 2.71x 1.86x 1.94x w32 v neon: 1.57x 1.58x 1.78x 1.60x 1.01x 2.67x w32 v sve2: 2.36x 2.39x 2.73x 2.35x 1.41x 1.50x w64 v neon: 1.44x 1.42x 1.54x 
1.43x 0.85x 2.19x w64 v sve2: 2.17x 2.15x 2.37x 2.06x 1.18x 1.25x	2024-08-22 12:52:56 +00:00
Arpad Panyik	713c076d80	AArch64: Add USMMLA impl. for SBD 6-tap H/HV filters Add 6-tap variant of standard bit-depth horizontal subpel filters using the Armv8.6 I8MM USMMLA matrix multiply instruction. This patch also extends the HV filter with 6-tap horizontal pass using USMMLA. Benchmarks show up to 6-7% FPS increase depending on the input video and the CPU used. This patch will increase the .text by around 1.2 KiB. Relative runtime of micro benchmarks after this patch on Neoverse and Cortex CPU cores: regular V2 V1 X3 A720 A715 A520 A510 w8 hv: 0.860x 0.895x 0.870x 0.896x 0.896x 0.938x 0.936x w16 hv: 0.829x 0.886x 0.865x 0.908x 0.906x 0.946x 0.944x w32 hv: 0.837x 0.883x 0.862x 0.914x 0.915x 0.953x 0.949x w64 hv: 0.840x 0.883x 0.862x 0.914x 0.914x 0.955x 0.952x w8 h: 0.746x 0.754x 0.747x 0.723x 0.724x 0.874x 0.866x w16 h: 0.749x 0.764x 0.745x 0.731x 0.731x 0.858x 0.852x w32 h: 0.739x 0.754x 0.738x 0.729x 0.729x 0.839x 0.837x w64 h: 0.736x 0.749x 0.733x 0.725x 0.726x 0.847x 0.836x	2024-08-21 23:41:48 +02:00
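To illustrate why a 6-tap filter maps onto a 2x8 by 8x2 matrix multiply, here is a scalar C model of USMMLA together with one possible data layout. The layout (second sample row offset by two samples, second filter row shifted by one column) is an assumption made for this sketch and is not described in the commit message; the function names and the filter coefficients are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of USMMLA on 128-bit vectors: `a` holds a 2x8 matrix of
 * unsigned bytes, `b` a 2x8 matrix of signed bytes and `acc` a 2x2 matrix
 * of int32, all row-major. acc[i][j] += dot(row i of a, row j of b). */
static void usmmla(int32_t acc[4], const uint8_t a[16], const int8_t b[16]) {
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            for (int k = 0; k < 8; k++)
                acc[2 * i + j] += a[8 * i + k] * b[8 * j + k];
}

/* Four consecutive 6-tap outputs from one USMMLA. Reads src[0..9]. */
static void filter6_x4_usmmla(int32_t out[4], const uint8_t *src,
                              const int8_t f[6]) {
    uint8_t a[16];
    int8_t b[16] = { 0 };
    memcpy(a, src, 8);         /* row 0: samples 0..7 */
    memcpy(a + 8, src + 2, 8); /* row 1: samples 2..9 */
    for (int k = 0; k < 6; k++) {
        b[k] = f[k];           /* row 0: taps in columns 0..5 */
        b[8 + k + 1] = f[k];   /* row 1: taps in columns 1..6 */
    }
    out[0] = out[1] = out[2] = out[3] = 0;
    usmmla(out, a, b);         /* out[n] is the filter output at position n */
}

/* Straightforward reference: one 6-tap output starting at src[0]. */
static int32_t filter6_ref(const uint8_t *src, const int8_t f[6]) {
    int32_t sum = 0;
    for (int k = 0; k < 6; k++)
        sum += src[k] * f[k];
    return sum;
}

/* Self-check with arbitrary samples and arbitrary illustrative taps. */
static void check_filter6_usmmla(void) {
    const uint8_t src[10] = { 10, 200, 35, 0, 255, 128, 7, 99, 64, 250 };
    const int8_t f[6] = { 3, -16, 77, 77, -16, 3 };
    int32_t out[4];
    filter6_x4_usmmla(out, src, f);
    for (int n = 0; n < 4; n++)
        assert(out[n] == filter6_ref(src + n, f));
}
```

With this layout the two zero-padded columns of each filter row are what make six taps fit into the eight-element rows.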
Arpad Panyik	287e90a3a6	AArch64: Fix typo in SBD 6-tap 2D/HV subpel filter The macro parameter \xmy of filter_8tap_fn was used incorrectly as a pointer instead of \lsrc. They refer to the same register but in different contexts.	2024-08-12 19:41:45 +02:00
Arpad Panyik	2355eeb8f2	AArch64: Move constants of DotProd subpel filters to .rodata The constants used for the subpel filters were placed in the .text section for simplicity and peak performance, but this does not work on systems with execute only .text sections (e.g.: OpenBSD). The performance cost of moving the constants to the .rodata section is small and mostly within the measurable noise.	2024-06-26 11:20:43 +02:00
Arpad Panyik	92f592ed10	AArch64: Fix potential out of bounds access in DotProd H/HV filters The DotProd/I8MM horizontal and HV/2D subpel filters used a -4 offset for sampling instead of -3 to be better aligned in some cases. This resulted in an out-of-bounds access, which led to crashes. This patch fixes it.	2024-06-05 23:22:36 +02:00
Arpad Panyik and Martin Storsjö	d835c6bf69	AArch64: Optimize prep_neon function Optimize the widening copy part of subpel filters (the prep_neon function). In this patch we combine widening shifts with widening multiplications in the inner loops to get maximum throughput. The change will increase .text by 36 bytes. Relative performance of micro benchmarks (lower is better): Cortex-A55: mct_w4: 0.795x mct_w8: 0.913x mct_w16: 0.912x mct_w32: 0.838x mct_w64: 1.025x mct_w128: 1.002x Cortex-A510: mct_w4: 0.760x mct_w8: 0.636x mct_w16: 0.640x mct_w32: 0.854x mct_w64: 0.864x mct_w128: 0.995x Cortex-A72: mct_w4: 0.616x mct_w8: 0.854x mct_w16: 0.756x mct_w32: 1.052x mct_w64: 1.044x mct_w128: 0.702x Cortex-A76: mct_w4: 0.837x mct_w8: 0.797x mct_w16: 0.841x mct_w32: 0.804x mct_w64: 0.948x mct_w128: 0.904x Cortex-A78: mct_w16: 0.542x mct_w32: 0.725x mct_w64: 0.741x mct_w128: 0.745x Cortex-A715: mct_w16: 0.561x mct_w32: 0.720x mct_w64: 0.740x mct_w128: 0.748x Cortex-X1: mct_w32: 0.886x mct_w64: 0.882x mct_w128: 0.917x Cortex-X3: mct_w32: 0.835x mct_w64: 0.803x mct_w128: 0.808x	2024-05-14 15:07:10 +00:00
Arpad Panyik and Martin Storsjö	f0e779bc2a	AArch64: Optimize jump table calculation of prep_neon Save a complex arithmetic instruction in the jump table address calculation of prep_neon function.	2024-05-14 15:07:10 +00:00
Arpad Panyik and Martin Storsjö	1790e1329d	AArch64: Optimize BTI landing pads of prep_neon Move the BTI landing pads out of the inner loops of prep_neon function. Only the width=4 and width=8 cases are affected. If BTI is enabled, moving AARCH64_VALID_JUMP_TARGET out of the inner loops gives better execution speed on Cortex-A510 relative to the original (lower is better): w4: 0.969x w8: 0.722x Out-of-order cores are not affected.	2024-05-14 15:07:10 +00:00
Arpad Panyik	8141546da9	AArch64: Optimize put_neon function Optimize the copy part of subpel filters (the put_neon function). For small block sizes (<16) the usage of general purpose registers is usually the best way to do the copy. Relative performance of micro benchmarks (lower is better): Cortex-A55: w2: 0.991x w4: 0.992x w8: 0.999x w16: 0.875x w32: 0.775x w64: 0.914x w128: 0.998x Cortex-A510: w2: 0.159x w4: 0.080x w8: 0.583x w16: 0.588x w32: 0.966x w64: 1.111x w128: 0.957x Cortex-A76: w2: 0.903x w4: 0.683x w8: 0.944x w16: 0.948x w32: 0.919x w64: 0.855x w128: 0.991x Cortex-A78: w32: 0.867x w64: 0.820x w128: 1.011x Cortex-A715: w32: 0.834x w64: 0.778x w128: 1.000x Cortex-X1: w32: 0.809x w64: 0.762x w128: 1.000x Cortex-X3: w32: 0.733x w64: 0.720x w128: 0.999x	2024-05-13 16:52:21 +02:00
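As a rough picture of the general-purpose-register copy mentioned in `8141546da9`, the sketch below copies a width-4, 8-bit block using one 32-bit scalar per row. It is a C illustration only, not the assembly from the patch; the function name, the parameters and the choice of width 4 are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a block of 4 x h 8-bit pixels, one 32-bit scalar per row. The
 * fixed-size memcpy calls are typically compiled to a single 32-bit load
 * and store through a general-purpose register. */
static void put_w4(uint8_t *dst, ptrdiff_t dst_stride,
                   const uint8_t *src, ptrdiff_t src_stride, int h) {
    for (int y = 0; y < h; y++) {
        uint32_t row;
        memcpy(&row, src, sizeof(row)); /* load 4 pixels */
        memcpy(dst, &row, sizeof(row)); /* store 4 pixels */
        src += src_stride;
        dst += dst_stride;
    }
}

/* Self-check: copy a 4x2 block between buffers with different strides. */
static void check_put_w4(void) {
    const uint8_t src[16] = { 1, 2, 3, 4, 9, 9, 9, 9,
                              5, 6, 7, 8, 9, 9, 9, 9 };
    uint8_t dst[8] = { 0 };
    put_w4(dst, 4, src, 8, 2);
    for (int i = 0; i < 8; i++)
        assert(dst[i] == i + 1);
}
```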
Arpad Panyik	645d1f9fd2	AArch64: Optimize jump table calculation of put_neon Save a complex arithmetic instruction in the jump table address calculation of put_neon function.	2024-05-13 16:50:56 +02:00
Arpad Panyik	83452c6e3f	AArch64: Optimize BTI landing pads of put_neon Move the BTI landing pads out of the inner loops of put_neon function, the only exception is the width=16 case where it is already outside of the loops. When BTI is enabled, the relative performance of omitting the AARCH64_VALID_JUMP_TARGET from the inner loops on Cortex-A510 (lower is better): w2: 0.981x w4: 0.991x w8: 0.612x w32: 0.687x w64: 0.813x w128: 0.892x Out-of-order CPUs are mostly unaffected.	2024-05-13 16:27:30 +02:00
Arpad Panyik and Jean-Baptiste Kempf	a6d57b1140	AArch64: Optimize the init of DotProd+ 2D subpel filters Removed some unnecessary vector register copies from the initial horizontal filter parts of the HV subpel filters. The performance improvements are better for the smaller filter block sizes. The narrowing shifts were also rewritten at the end of the filter8 because it was only beneficial for the Cortex-A55 among the DotProd capable CPU cores. On other out-of-order or newer CPUs the UZP1+SHRN instruction combination is better. Relative performance of micro benchmarks (lower is better): Cortex-A55: mct regular w4: 0.980x mct regular w8: 1.007x mct regular w16: 1.007x mct sharp w4: 0.983x mct sharp w8: 1.012x mct sharp w16: 1.005x Cortex-A510: mct regular w4: 0.935x mct regular w8: 0.984x mct regular w16: 0.986x mct sharp w4: 0.927x mct sharp w8: 0.983x mct sharp w16: 0.987x Cortex-A78: mct regular w4: 0.974x mct regular w8: 0.988x mct regular w16: 0.991x mct sharp w4: 0.971x mct sharp w8: 0.987x mct sharp w16: 0.979x Cortex-A715: mct regular w4: 0.958x mct regular w8: 0.993x mct regular w16: 0.998x mct sharp w4: 0.974x mct sharp w8: 0.991x mct sharp w16: 0.997x Cortex-X1: mct regular w4: 0.983x mct regular w8: 0.993x mct regular w16: 0.996x mct sharp w4: 0.974x mct sharp w8: 0.990x mct sharp w16: 0.995x Cortex-X3: mct regular w4: 0.953x mct regular w8: 0.993x mct regular w16: 0.997x mct sharp w4: 0.981x mct sharp w8: 0.993x mct sharp w16: 0.995x	2024-05-12 14:33:03 +00:00
Arpad Panyik	643195f546	AArch64: Optimize 2D i8mm subpel filters Rewrite the accumulator initializations of the horizontal part of the 2D filters with zero register fills. It can improve the performance on out-of-order CPUs which can fill vector registers by zero with zero latency. Zeroed accumulators imply the usage of the rounding shifts at the end of filters. The only exception is the very short hv_filter4, where the longer latency of rounding shift could decrease the performance. The filter8 function uses a different (alternating) dot product computation order for DotProd+ feature level, it gives a better overall performance for out-of-order and some in-order CPU cores. The i8mm version does not need to use bias for the loaded samples, so a different instruction scheduling is beneficial mostly affecting the order of TBL instructions in the 8-tap case. Relative performance of micro benchmarks (lower is better): Cortex-X3: mct_8tap_regular_w16_hv_8bpc_i8mm: 0.982x mct_8tap_sharp_w16_hv_8bpc_i8mm: 0.979x mct_8tap_regular_w8_hv_8bpc_i8mm: 0.972x mct_8tap_sharp_w8_hv_8bpc_i8mm: 0.969x mct_8tap_regular_w4_hv_8bpc_i8mm: 0.942x mct_8tap_sharp_w4_hv_8bpc_i8mm: 0.935x mc_8tap_regular_w16_hv_8bpc_i8mm: 0.988x mc_8tap_sharp_w16_hv_8bpc_i8mm: 0.982x mc_8tap_regular_w8_hv_8bpc_i8mm: 0.981x mc_8tap_sharp_w8_hv_8bpc_i8mm: 0.975x mc_8tap_regular_w4_hv_8bpc_i8mm: 0.998x mc_8tap_sharp_w4_hv_8bpc_i8mm: 0.996x mc_8tap_regular_w2_hv_8bpc_i8mm: 1.006x mc_8tap_sharp_w2_hv_8bpc_i8mm: 0.993x Cortex-A715: mct_8tap_regular_w16_hv_8bpc_i8mm: 0.883x mct_8tap_sharp_w16_hv_8bpc_i8mm: 0.931x mct_8tap_regular_w8_hv_8bpc_i8mm: 0.882x mct_8tap_sharp_w8_hv_8bpc_i8mm: 0.928x mct_8tap_regular_w4_hv_8bpc_i8mm: 0.969x mct_8tap_sharp_w4_hv_8bpc_i8mm: 0.934x mc_8tap_regular_w16_hv_8bpc_i8mm: 0.881x mc_8tap_sharp_w16_hv_8bpc_i8mm: 0.925x mc_8tap_regular_w8_hv_8bpc_i8mm: 0.879x mc_8tap_sharp_w8_hv_8bpc_i8mm: 0.925x mc_8tap_regular_w4_hv_8bpc_i8mm: 0.917x mc_8tap_sharp_w4_hv_8bpc_i8mm: 0.976x 
mc_8tap_regular_w2_hv_8bpc_i8mm: 0.915x mc_8tap_sharp_w2_hv_8bpc_i8mm: 0.972x Cortex-A510: mct_8tap_regular_w16_hv_8bpc_i8mm: 0.994x mct_8tap_sharp_w16_hv_8bpc_i8mm: 0.949x mct_8tap_regular_w8_hv_8bpc_i8mm: 0.987x mct_8tap_sharp_w8_hv_8bpc_i8mm: 0.947x mct_8tap_regular_w4_hv_8bpc_i8mm: 1.002x mct_8tap_sharp_w4_hv_8bpc_i8mm: 0.999x mc_8tap_regular_w16_hv_8bpc_i8mm: 0.989x mc_8tap_sharp_w16_hv_8bpc_i8mm: 1.003x mc_8tap_regular_w8_hv_8bpc_i8mm: 0.986x mc_8tap_sharp_w8_hv_8bpc_i8mm: 1.000x mc_8tap_regular_w4_hv_8bpc_i8mm: 1.007x mc_8tap_sharp_w4_hv_8bpc_i8mm: 1.000x mc_8tap_regular_w2_hv_8bpc_i8mm: 1.005x mc_8tap_sharp_w2_hv_8bpc_i8mm: 1.000x	2024-05-09 09:53:05 +02:00
Arpad Panyik	b2eca1aca7	AArch64: Optimize vertical i8mm subpel filters Replace the accumulator initializations of the vertical subpel filters with register fills by zeros (which are usually zero latency operations in this feature class), this implies the usage of rounding shifts at the end in the prep cases. Out-of-order CPU cores can benefit from this change. The width=16 case uses a simpler register duplication scheme that relies on MOV instructions for the subsequent shuffles. This approach uses a different register to load the data into for better instruction scheduling and data dependency chain. Relative performance of micro benchmarks (lower is better): Cortex-X3: mct_8tap_sharp_w16_v_8bpc_i8mm: 0.910x mct_8tap_sharp_w8_v_8bpc_i8mm: 0.986x mc_8tap_sharp_w16_v_8bpc_i8mm: 0.864x mc_8tap_sharp_w8_v_8bpc_i8mm: 0.882x mc_8tap_sharp_w4_v_8bpc_i8mm: 0.933x mc_8tap_sharp_w2_v_8bpc_i8mm: 0.926x Cortex-A715: mct_8tap_sharp_w16_v_8bpc_i8mm: 0.855x mct_8tap_sharp_w8_v_8bpc_i8mm: 0.784x mct_8tap_sharp_w4_v_8bpc_i8mm: 1.069x mc_8tap_sharp_w16_v_8bpc_i8mm: 0.850x mc_8tap_sharp_w8_v_8bpc_i8mm: 0.779x mc_8tap_sharp_w4_v_8bpc_i8mm: 0.971x mc_8tap_sharp_w2_v_8bpc_i8mm: 0.975x Cortex-A510: mct_8tap_sharp_w16_v_8bpc_i8mm: 1.001x mct_8tap_sharp_w8_v_8bpc_i8mm: 0.979x mct_8tap_sharp_w4_v_8bpc_i8mm: 0.998x mc_8tap_sharp_w16_v_8bpc_i8mm: 0.998x mc_8tap_sharp_w8_v_8bpc_i8mm: 1.004x mc_8tap_sharp_w4_v_8bpc_i8mm: 1.003x mc_8tap_sharp_w2_v_8bpc_i8mm: 0.996x	2024-05-08 23:28:52 +02:00
Arpad Panyik and Jean-Baptiste Kempf	d1bdf4f1ff	AArch64: Optimize horizontal i8mm prep filters Replace the accumulator initializations of the horizontal prep filters with register fills by zeros. Most i8mm capable CPUs can do these with zero latency, but we also need to use rounding shifts at the end of the filter. We can see better performance with this change on out-of-order CPUs. Relative performance of micro benchmarks (lower is better): Cortex-X3: mct_8tap_sharp_w32_h_8bpc_i8mm: 0.914x mct_8tap_sharp_w16_h_8bpc_i8mm: 0.906x mct_8tap_sharp_w8_h_8bpc_i8mm: 0.877x Cortex-A715: mct_8tap_sharp_w32_h_8bpc_i8mm: 0.819x mct_8tap_sharp_w16_h_8bpc_i8mm: 0.805x mct_8tap_sharp_w8_h_8bpc_i8mm: 0.779x Cortex-A510: mct_8tap_sharp_w32_h_8bpc_i8mm: 0.999x mct_8tap_sharp_w16_h_8bpc_i8mm: 1.001x mct_8tap_sharp_w8_h_8bpc_i8mm: 0.996x mct_8tap_sharp_w4_h_8bpc_i8mm: 0.915x	2024-05-08 20:16:13 +00:00
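The trade-off described in `d1bdf4f1ff` (zeroed accumulators require a rounding shift at the end) can be shown in scalar C. This is a sketch under assumptions: the 8-tap shape, the shift amount and the coefficients are illustrative only, and the function names are hypothetical. In vector code the second variant would use a rounding shift instruction in place of the explicit addition.

```c
#include <assert.h>
#include <stdint.h>

/* Variant 1: preload the accumulator with the rounding constant, then
 * finish with a plain (truncating) right shift. */
static int32_t filter8_bias_init(const uint8_t *src, const int8_t f[8],
                                 int shift) {
    int32_t acc = 1 << (shift - 1);
    for (int k = 0; k < 8; k++)
        acc += src[k] * f[k];
    return acc >> shift;
}

/* Variant 2: start from a zeroed accumulator and finish with a rounding
 * right shift, written out here as add-then-shift. */
static int32_t filter8_zero_init(const uint8_t *src, const int8_t f[8],
                                 int shift) {
    int32_t acc = 0;
    for (int k = 0; k < 8; k++)
        acc += src[k] * f[k];
    return (acc + (1 << (shift - 1))) >> shift;
}

/* Self-check: both variants give the same result at every position. */
static void check_zero_init(void) {
    const uint8_t src[12] = { 0, 255, 17, 90, 200, 3, 128, 64, 31, 250, 1, 77 };
    const int8_t f[8] = { -1, 3, -7, 37, 37, -7, 3, -1 }; /* arbitrary taps */
    for (int x = 0; x + 8 <= 12; x++)
        assert(filter8_bias_init(src + x, f, 2) ==
               filter8_zero_init(src + x, f, 2));
}
```

The results are identical by construction; the difference lies only in where the rounding constant enters, which is what lets the accumulator start as an all-zero register.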
Arpad Panyik	1776c45a08	AArch64: Add basic i8mm support for convolutions Add an Armv8.6-A i8mm code path for standard bitdepth convolutions. Only horizontal-vertical (HV) convolutions have 6-tap specialisations of their vertical passes. All other convolutions are 4- or 8-tap filters which fit well with the 4-element USDOT instruction. Benchmarks show 4-9% FPS increase relative to the Armv8.4-A code path depending on the input video and the CPU used. This patch will increase the .text by around 5.7 KiB. Relative performance to the C reference on some Cortex CPU cores:
                          Cortex-A715  Cortex-X3  Cortex-A510
regular w4 hv neon:             7.20x     11.20x        4.40x
regular w4 hv dotprod:         12.77x     18.35x        6.21x
regular w4 hv i8mm:            14.50x     21.42x        6.16x
sharp w4 hv neon:               6.24x      9.77x        3.96x
sharp w4 hv dotprod:            9.76x     14.02x        5.20x
sharp w4 hv i8mm:              10.84x     16.09x        5.42x
regular w8 hv neon:             2.17x      2.46x        3.17x
regular w8 hv dotprod:          3.04x      3.11x        3.03x
regular w8 hv i8mm:             3.57x      3.40x        3.27x
sharp w8 hv neon:               1.72x      1.93x        2.75x
sharp w8 hv dotprod:            2.49x      2.54x        2.62x
sharp w8 hv i8mm:               2.80x      2.79x        2.70x
regular w16 hv neon:            1.90x      2.17x        2.02x
regular w16 hv dotprod:         2.59x      2.64x        1.93x
regular w16 hv i8mm:            3.01x      2.85x        2.05x
sharp w16 hv neon:              1.51x      1.72x        1.74x
sharp w16 hv dotprod:           2.17x      2.22x        1.70x
sharp w16 hv i8mm:              2.42x      2.42x        1.72x
regular w32 hv neon:            1.80x      1.96x        1.81x
regular w32 hv dotprod:         2.43x      2.36x        1.74x
regular w32 hv i8mm:            2.83x      2.51x        1.83x
sharp w32 hv neon:              1.42x      1.54x        1.56x
sharp w32 hv dotprod:           2.07x      2.00x        1.55x
sharp w32 hv i8mm:              2.29x      2.16x        1.55x
regular w64 hv neon:            1.82x      1.89x        1.70x
regular w64 hv dotprod:         2.43x      2.25x        1.65x
regular w64 hv i8mm:            2.84x      2.39x        1.73x
sharp w64 hv neon:              1.43x      1.47x        1.49x
sharp w64 hv dotprod:           2.08x      1.91x        1.49x
sharp w64 hv i8mm:              2.30x      2.07x        1.48x
regular w128 hv neon:           1.77x      1.84x        1.75x
regular w128 hv dotprod:        2.37x      2.18x        1.70x
regular w128 hv i8mm:           2.76x      2.33x        1.78x
sharp w128 hv neon:             1.40x      1.45x        1.42x
sharp w128 hv dotprod:          2.04x      1.87x        1.43x
sharp w128 hv i8mm:             2.24x      2.02x        1.42x
regular w8 h neon:              3.16x      3.51x        3.43x
regular w8 h dotprod:           4.97x      7.43x        4.95x
regular w8 h i8mm:              7.28x     10.38x        5.69x
sharp w8 h neon:                2.71x      2.77x        3.10x
sharp w8 h dotprod:             4.92x      7.14x        4.94x
sharp w8 h i8mm:                7.21x     10.11x        5.70x
regular w16 h neon:             2.79x      2.76x        3.53x
regular w16 h dotprod:          3.81x      4.77x        3.13x
regular w16 h i8mm:             5.21x      6.04x        3.56x
sharp w16 h neon:               2.31x      2.38x        3.12x
sharp w16 h dotprod:            3.80x      4.74x        3.13x
sharp w16 h i8mm:               5.20x      5.98x        3.56x
regular w64 h neon:             2.49x      2.46x        2.94x
regular w64 h dotprod:          3.17x      3.60x        2.41x
regular w64 h i8mm:             4.22x      4.40x        2.72x
sharp w64 h neon:               2.07x      2.06x        2.60x
sharp w64 h dotprod:            3.16x      3.58x        2.40x
sharp w64 h i8mm:               4.20x      4.38x        2.71x
regular w8 v neon:              6.11x      8.05x        4.07x
regular w8 v dotprod:           5.45x      8.15x        4.01x
regular w8 v i8mm:              7.30x      9.46x        4.19x
sharp w8 v neon:                4.23x      5.46x        3.09x
sharp w8 v dotprod:             5.43x      7.96x        4.01x
sharp w8 v i8mm:                7.26x      9.12x        4.19x
regular w16 v neon:             3.44x      4.33x        2.40x
regular w16 v dotprod:          3.20x      4.53x        2.85x
regular w16 v i8mm:             4.09x      5.27x        2.87x
sharp w16 v neon:               2.50x      3.14x        1.82x
sharp w16 v dotprod:            3.20x      4.52x        2.86x
sharp w16 v i8mm:               4.09x      5.15x        2.86x
regular w64 v neon:             2.74x      3.11x        1.53x
regular w64 v dotprod:          2.63x      3.30x        1.84x
regular w64 v i8mm:             3.31x      3.73x        1.84x
sharp w64 v neon:               2.01x      2.29x        1.16x
sharp w64 v dotprod:            2.61x      3.27x        1.83x
sharp w64 v i8mm:               3.29x      3.68x        1.84x
	2024-04-26 14:04:18 +02:00
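Why 4- and 8-tap filters "fit well" with a 4-element USDOT can be seen in a scalar model: each 32-bit lane accumulates four unsigned-by-signed 8-bit products, so a 4-tap filter is one step and an 8-tap filter is two. The following is a sketch of the arithmetic only, not the assembly from 1776c45a08; the function names and the coefficients used to exercise it are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 32-bit lane of USDOT:
 * acc += sum of four (unsigned 8-bit) * (signed 8-bit) products. */
static int32_t usdot_lane(int32_t acc, const uint8_t u[4], const int8_t s[4]) {
    for (int i = 0; i < 4; i++) acc += (int32_t)u[i] * (int32_t)s[i];
    return acc;
}

/* A 4-tap filter maps onto a single USDOT step. */
static int32_t filter4_usdot(const uint8_t px[4], const int8_t f[4]) {
    return usdot_lane(0, px, f);
}

/* An 8-tap filter maps onto two chained USDOT steps. Pixels stay unsigned,
 * so no range adjustment is needed before the multiply. */
static int32_t filter8_usdot(const uint8_t px[8], const int8_t f[8]) {
    int32_t acc = usdot_lane(0, px, f);
    return usdot_lane(acc, px + 4, f + 4);
}

/* Straightforward reference for comparison. */
static int32_t filter8_ref(const uint8_t px[8], const int8_t f[8]) {
    int32_t sum = 0;
    for (int i = 0; i < 8; i++) sum += (int32_t)px[i] * (int32_t)f[i];
    assert(sum == filter8_usdot(px, f));
    return sum;
}
```

A 6-tap filter would leave two of the eight multiply slots unused in this scheme, which is consistent with the commit keeping 6-tap specialisations only for the vertical pass of the HV case.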
Arpad Panyik	fbf23637ce	AArch64: Simplify DotProd path of 2D subpel filters Simplify the DotProd code path of the 2D (horizontal-vertical) subpel filters. It contains some instruction reordering and some macro simplifications to be more similar to the upcoming i8mm version. These changes have negligible effect on performance.
Cortex-A510:
  mc_8tap_regular_w2_hv_8bpc_dotprod: 8.3769 -> 8.3380
  mc_8tap_sharp_w2_hv_8bpc_dotprod: 9.5441 -> 9.5457
  mc_8tap_regular_w4_hv_8bpc_dotprod: 8.3422 -> 8.3444
  mc_8tap_sharp_w4_hv_8bpc_dotprod: 9.5441 -> 9.5367
  mc_8tap_regular_w8_hv_8bpc_dotprod: 9.9852 -> 9.9666
  mc_8tap_sharp_w8_hv_8bpc_dotprod: 12.5554 -> 12.5314
Cortex-A55:
  mc_8tap_regular_w2_hv_8bpc_dotprod: 6.4504 -> 6.4892
  mc_8tap_sharp_w2_hv_8bpc_dotprod: 7.5732 -> 7.6078
  mc_8tap_regular_w4_hv_8bpc_dotprod: 6.5088 -> 6.4760
  mc_8tap_sharp_w4_hv_8bpc_dotprod: 7.5796 -> 7.5763
  mc_8tap_regular_w8_hv_8bpc_dotprod: 9.3384 -> 9.3078
  mc_8tap_sharp_w8_hv_8bpc_dotprod: 11.1159 -> 11.1401
Cortex-A78:
  mc_8tap_regular_w2_hv_8bpc_dotprod: 1.4122 -> 1.4250
  mc_8tap_sharp_w2_hv_8bpc_dotprod: 1.7696 -> 1.7821
  mc_8tap_regular_w4_hv_8bpc_dotprod: 1.4243 -> 1.4243
  mc_8tap_sharp_w4_hv_8bpc_dotprod: 1.7866 -> 1.7863
  mc_8tap_regular_w8_hv_8bpc_dotprod: 2.5304 -> 2.5171
  mc_8tap_sharp_w8_hv_8bpc_dotprod: 3.0815 -> 3.0632
Cortex-X1:
  mc_8tap_regular_w2_hv_8bpc_dotprod: 0.8195 -> 0.8194
  mc_8tap_sharp_w2_hv_8bpc_dotprod: 1.0092 -> 1.0081
  mc_8tap_regular_w4_hv_8bpc_dotprod: 0.8197 -> 0.8166
  mc_8tap_sharp_w4_hv_8bpc_dotprod: 1.0089 -> 1.0068
  mc_8tap_regular_w8_hv_8bpc_dotprod: 1.5230 -> 1.5166
  mc_8tap_sharp_w8_hv_8bpc_dotprod: 1.8683 -> 1.8625
	2024-04-25 17:02:09 +02:00
Arpad Panyik	a40301b33f	AArch64: Simplify loads in hv_filter of DotProd path Simplify the load sequences in hv_filter functions (ldr + add -> ld1) to be more uniform and smaller. Performance is not affected.	2024-04-25 17:02:09 +02:00
Arpad Panyik	b0685c387d	AArch64: Simplify TBL usage in 2D DotProd filters Simplify the TBL usage in the small block size (2, 4) parts of the 2D (horizontal-vertical) put subpel filters. The 2-register TBLs are replaced with the 1-register form because we only need the lower 64 bits of the result, and they can be extracted from a single source register. Performance is not affected by this change.	2024-04-25 17:02:09 +02:00
Arpad Panyik	ad7938d517	AArch64: Simplify DotProd path of horizontal subpel filters Simplify the inner loops of the DotProd code path of horizontal subpel filters to avoid using 2-register TBL instructions. The store part of block size 16 of the horizontal put case is also simplified (str + add -> st1). This patch can improve performance mostly on small cores like Cortex-A510 and newer. Other CPUs are mostly unaffected.
Cortex-A510:
  mct_8tap_sharp_w16_h_8bpc_dotprod: 2.77x -> 3.13x
  mct_8tap_sharp_w32_h_8bpc_dotprod: 2.32x -> 2.56x
Cortex-A55:
  mct_8tap_sharp_w16_h_8bpc_dotprod: 3.89x -> 3.89x
  mct_8tap_sharp_w32_h_8bpc_dotprod: 3.35x -> 3.35x
Cortex-A715:
  mct_8tap_sharp_w16_h_8bpc_dotprod: 3.79x -> 3.78x
  mct_8tap_sharp_w32_h_8bpc_dotprod: 3.30x -> 3.30x
Cortex-A78:
  mct_8tap_sharp_w16_h_8bpc_dotprod: 4.30x -> 4.31x
  mct_8tap_sharp_w32_h_8bpc_dotprod: 3.79x -> 3.80x
Cortex-X3:
  mct_8tap_sharp_w16_h_8bpc_dotprod: 4.74x -> 4.75x
  mct_8tap_sharp_w32_h_8bpc_dotprod: 3.89x -> 3.91x
Cortex-X1:
  mct_8tap_sharp_w16_h_8bpc_dotprod: 4.61x -> 4.62x
  mct_8tap_sharp_w32_h_8bpc_dotprod: 3.67x -> 3.66x
	2024-04-25 16:59:53 +02:00
Arpad Panyik	317a94c6bb	AArch64: Simplify DotProd path of vertical subpel filters Simplify the accumulator initializations of the DotProd code path of vertical subpel filters. This also makes it possible for some CPUs to use zero-latency vector register moves. The load is also simplified (ldr + add -> ld1) in the inner loop of the vertical filter for block size 16.	2024-04-25 16:59:13 +02:00
Arpad Panyik	7eee4a2059	AArch64: Add \dot parameter to filter_8tap_fn macro Add \dot parameter to filter_8tap_fn macro in preparation for extending it with an i8mm code path. This patch also contains string fixes, some instruction reordering and some register renaming to make the code more uniform. These changes don't affect performance but simplify the code a bit.	2024-04-25 16:58:11 +02:00
Arpad Panyik	9d77b6336a	AArch64: Add DotProd support for convolutions Add an Armv8.4-A DotProd code path for standard bitdepth convolutions. Only horizontal-vertical (HV) convolutions have 6-tap specialisations of their vertical passes. All other convolutions are 4- or 8-tap filters which fit well with the 4-element SDOT instruction. Benchmarks show a 7-29% FPS increase depending on the input video and the CPU used. This patch will increase the .text by around 6.5 KiB. Performance highly depends on the SDOT and MLA throughput ratio; this can be seen in the vertical filter cases. Small cores are also affected by the TBL execution latencies. Relative performance to the C reference on some CPUs:
                             A76     A78      X1     A55
regular w4 hv neon:        5.52x   5.78x  10.75x   8.27x
regular w4 hv dotprod:     7.94x   8.49x  16.84x   8.09x
sharp w4 hv neon:          5.27x   5.22x   9.06x   7.87x
sharp w4 hv dotprod:       6.61x   6.73x  12.64x   6.89x
regular w8 hv neon:        1.95x   2.19x   2.56x   3.16x
regular w8 hv dotprod:     3.23x   2.81x   3.20x   3.26x
sharp w8 hv neon:          1.61x   1.79x   2.05x   2.72x
sharp w8 hv dotprod:       2.72x   2.29x   2.66x   2.76x
regular w16 hv neon:       1.63x   2.04x   2.16x   2.73x
regular w16 hv dotprod:    2.72x   2.57x   2.67x   2.80x
sharp w16 hv neon:         1.33x   1.67x   1.74x   2.34x
sharp w16 hv dotprod:      2.31x   2.14x   2.26x   2.39x
regular w32 hv neon:       1.48x   1.92x   1.94x   2.51x
regular w32 hv dotprod:    2.49x   2.40x   2.33x   2.58x
sharp w32 hv neon:         1.21x   1.56x   1.53x   2.14x
sharp w32 hv dotprod:      2.12x   2.02x   2.00x   2.22x
regular w64 hv neon:       1.42x   1.87x   1.85x   2.40x
regular w64 hv dotprod:    2.40x   2.32x   2.21x   2.46x
sharp w64 hv neon:         1.16x   1.52x   1.46x   2.04x
sharp w64 hv dotprod:      2.02x   1.96x   1.90x   2.11x
regular w128 hv neon:      1.39x   1.84x   1.80x   2.27x
regular w128 hv dotprod:   2.33x   2.28x   2.14x   2.35x
sharp w128 hv neon:        1.14x   1.50x   1.42x   1.94x
sharp w128 hv dotprod:     1.98x   1.93x   1.84x   2.03x
regular w8 h neon:         2.61x   3.20x   3.51x   3.55x
regular w8 h dotprod:      4.43x   5.17x   6.26x   4.30x
sharp w8 h neon:           2.01x   2.80x   2.89x   3.12x
sharp w8 h dotprod:        4.42x   5.16x   6.27x   4.28x
regular w16 h neon:        2.17x   3.13x   2.92x   3.35x
regular w16 h dotprod:     4.38x   4.27x   4.53x   3.90x
sharp w16 h neon:          1.74x   2.65x   2.48x   2.92x
sharp w16 h dotprod:       4.33x   4.27x   4.53x   3.91x
regular w64 h neon:        1.92x   2.82x   2.39x   2.96x
regular w64 h dotprod:     3.68x   3.60x   3.40x   3.18x
sharp w64 h neon:          1.47x   2.33x   2.05x   2.54x
sharp w64 h dotprod:       3.68x   3.60x   3.40x   3.17x
regular w4 v neon:         5.39x   7.38x  10.27x  11.41x
regular w4 v dotprod:      9.46x  14.15x  18.72x   9.84x
sharp w4 v neon:           4.51x   6.39x   8.17x  10.70x
sharp w4 v dotprod:        9.35x  14.20x  18.63x   9.78x
regular w16 v neon:        3.03x   4.03x   4.65x   6.28x
regular w16 v dotprod:     4.64x   3.75x   4.78x   3.89x
sharp w16 v neon:          2.29x   3.09x   3.44x   5.52x
sharp w16 v dotprod:       4.62x   3.74x   4.77x   3.89x
regular w64 v neon:        2.17x   3.14x   3.19x   4.46x
regular w64 v dotprod:     3.43x   3.00x   3.31x   2.74x
sharp w64 v neon:          1.61x   2.42x   2.34x   3.89x
sharp w64 v dotprod:       3.38x   3.00x   3.29x   2.73x
	2024-04-11 19:03:58 +02:00
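SDOT multiplies signed 8-bit values, while 8-bit pixels are unsigned, so the inputs have to be brought into signed range somehow. One common way is to subtract 128 from every pixel and add `128 * sum(coefficients)` back through the accumulator's initial value. The scalar sketch below shows that identity; it is an illustration under that assumption, with hypothetical function names, and is not a transcription of 9d77b6336a.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 32-bit lane of SDOT:
 * acc += sum of four signed 8-bit * signed 8-bit products. */
static int32_t sdot_lane(int32_t acc, const int8_t a[4], const int8_t b[4]) {
    for (int i = 0; i < 4; i++) acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* 8-tap filter on unsigned pixels using two SDOT steps.
 * Each pixel p is replaced by (p - 128), which fits in int8_t, and the
 * accumulator starts at 128 * sum(f) to cancel the shift:
 *   sum(f[i] * (p[i] - 128)) + 128 * sum(f[i]) == sum(f[i] * p[i]) */
static int32_t filter8_sdot(const uint8_t px[8], const int8_t f[8]) {
    int8_t shifted[8];
    int32_t coeff_sum = 0;
    for (int i = 0; i < 8; i++) {
        shifted[i] = (int8_t)(px[i] - 128);
        coeff_sum += f[i];
    }
    int32_t acc = 128 * coeff_sum;          /* bias correction */
    acc = sdot_lane(acc, shifted, f);
    acc = sdot_lane(acc, shifted + 4, f + 4);
    return acc;
}

/* Straightforward reference for comparison. */
static int32_t filter8_ref(const uint8_t px[8], const int8_t f[8]) {
    int32_t sum = 0;
    for (int i = 0; i < 8; i++) sum += (int32_t)px[i] * (int32_t)f[i];
    assert(sum == filter8_sdot(px, f));
    return sum;
}
```

In this model the correction term depends only on the filter coefficients, so it can be computed once per filter rather than once per pixel.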
Arpad Panyik and Martin Storsjö	932b323c3e	AArch64: Specialise HBD Neon convolutions for 6-tap filters The 8-tap sub-pel filters used for motion vector interpolation are: regular, smooth, sharp. The regular and smooth filter kernels are zero-padded, so they are effectively 6-tap filters (some of them are 5-tap or even 4-tap). This patch specialises the high bit-depth versions of put_8tap_neon and prep_8tap_neon functions for 6-tap filters, avoiding a lot of redundant work to multiply by and add zero. Wherever sharp filtering is used, the 8-tap path will always be selected. Benchmarks can show a 0.5-10.8% FPS uplift, depending heavily on the input video source. Binary size increase is ~8.5 KiB.	2024-03-05 11:45:55 +00:00
Arpad Panyik	b0a329d6a6	AArch64: Optimize 6-tap SBD HV Neon convolution Optimize the 6-tap standard bit-depth horizontal-vertical combined convolution to avoid unnecessary reads and horizontal convolution steps at the beginning and end of the algorithm. This also saves some instructions in the final binary. Performance of this function increases by up to 5.5% depending on block size.	2024-03-05 11:25:33 +00:00
Arpad Panyik and Martin Storsjö	acc1121d2f	Extend Arm and AArch64 run-time CPU feature detection Add run-time CPU feature detection for DotProd, i8mm, SVE and SVE2. SVE and SVE2 are AArch64-only features.	2024-02-28 16:32:28 +00:00
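On Linux, one way to perform this kind of run-time detection on AArch64 is to read the `AT_HWCAP` and `AT_HWCAP2` auxiliary vector entries with `getauxval()`. The sketch below covers only that mechanism and only AArch64; the `CPU_FLAG_*` names and function names are hypothetical and are not dav1d's identifiers, and the hwcap bit positions are restated locally (prefixed `SK_`) so the mapping can be exercised on any host.

```c
#include <assert.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#endif

/* Hypothetical flag names for this sketch. */
enum {
    CPU_FLAG_DOTPROD = 1 << 0,
    CPU_FLAG_I8MM    = 1 << 1,
    CPU_FLAG_SVE     = 1 << 2,
    CPU_FLAG_SVE2    = 1 << 3,
};

/* Linux AArch64 hwcap bit positions, restated for the sketch. */
#define SK_HWCAP_ASIMDDP (1UL << 20) /* DotProd, in AT_HWCAP  */
#define SK_HWCAP_SVE     (1UL << 22) /* SVE,     in AT_HWCAP  */
#define SK_HWCAP2_SVE2   (1UL << 1)  /* SVE2,    in AT_HWCAP2 */
#define SK_HWCAP2_I8MM   (1UL << 13) /* i8mm,    in AT_HWCAP2 */

/* Pure mapping from raw hwcap words to feature flags. */
static unsigned flags_from_hwcap(unsigned long hwcap, unsigned long hwcap2) {
    unsigned flags = 0;
    if (hwcap  & SK_HWCAP_ASIMDDP) flags |= CPU_FLAG_DOTPROD;
    if (hwcap2 & SK_HWCAP2_I8MM)   flags |= CPU_FLAG_I8MM;
    if (hwcap  & SK_HWCAP_SVE)     flags |= CPU_FLAG_SVE;
    if (hwcap2 & SK_HWCAP2_SVE2)   flags |= CPU_FLAG_SVE2;
    return flags;
}

/* Queries the running system; returns 0 where this sketch has no probe. */
static unsigned detect_cpu_flags(void) {
#if defined(__linux__) && defined(__aarch64__)
    return flags_from_hwcap(getauxval(AT_HWCAP), getauxval(AT_HWCAP2));
#else
    return 0; /* other platforms need their own query mechanism */
#endif
}

/* Usage example: bit 20 of AT_HWCAP alone maps to the DotProd flag. */
static void example(void) {
    assert(flags_from_hwcap(1UL << 20, 0) == CPU_FLAG_DOTPROD);
    (void)detect_cpu_flags();
}
```

Keeping the bit-to-flag mapping separate from the system query makes the mapping testable on machines that lack the features.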
Arpad Panyik	f1d42ae8f1	AArch64: Enable benchmarks for 8-tap sharp filters The 6-tap sub-pel filter specialisation uses different code paths for sharp (8-tap) and regular/smooth (6-tap) filtering kernels. This patch enables benchmarking for the different code paths.	2024-02-22 08:58:17 +01:00
Arpad Panyik	e51f4377fb	AArch64: Specialise Neon convolutions for 6-tap filters The 8-tap sub-pel filters used for motion vector interpolation are: regular, smooth, sharp. The regular and smooth filter kernels are zero-padded, so they are effectively 6-tap filters (some of them are 5-tap or even 4-tap). This patch specialises the put_8tap_neon and prep_8tap_neon functions for 6-tap filters, avoiding a lot of redundant work to multiply by and add zero. Wherever sharp filtering is used, the 8-tap path will always be selected. Benchmarking this on a broad range of recent CPUs shows a 7-15% FPS uplift. Get raw sample video: https://ultravideo.fi/video/Bosphorus_1920x1080_120fps_420_8bit_YUV_RAW.7z Encode using: aomenc --good --cpu-used=5 -w 1920 -h 1080 --bit-depth=8 --ivf -o Bosphorus_1080p_8bit.ivf Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m	2024-02-22 08:58:17 +01:00
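The 6-tap specialisation relies on a simple identity: when the two outer coefficients of an 8-tap kernel are zero, dropping them does not change the result. A minimal scalar sketch in C follows; the kernel coefficients and function names are made up for illustration and are not taken from dav1d's filter tables.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical zero-padded kernel: taps 0 and 7 are zero, so only taps
 * 1..6 contribute. The coefficients sum to 128 and are illustrative only. */
static const int8_t kernel8[8] = { 0, 2, -14, 76, 76, -14, 2, 0 };

/* Returns 1 when the outer taps are zero, i.e. the 6-tap path is valid. */
static int is_6tap(const int8_t f[8]) {
    return f[0] == 0 && f[7] == 0;
}

/* Generic 8-tap filter: src[i] is the pixel aligned with tap i. */
static int filter_8tap(const uint8_t src[8], const int8_t f[8]) {
    int sum = 0;
    for (int i = 0; i < 8; i++) sum += f[i] * src[i];
    return sum;
}

/* 6-tap specialisation: skips the two multiplications by zero. */
static int filter_6tap(const uint8_t src[8], const int8_t f[8]) {
    assert(is_6tap(f)); /* only valid for zero-padded kernels */
    int sum = 0;
    for (int i = 1; i < 7; i++) sum += f[i] * src[i];
    return sum;
}

/* Dispatch in the spirit of the commit: kernels with non-zero outer taps
 * (such as sharp ones) keep the full 8-tap path. */
static int filter_auto(const uint8_t src[8], const int8_t f[8]) {
    return is_6tap(f) ? filter_6tap(src, f) : filter_8tap(src, f);
}
```

For the pixel row `{10, 20, 30, 40, 50, 60, 70, 80}` both paths yield 5760 with this kernel; the 6-tap path simply performs two fewer multiply-accumulates per output sample and never reads the two outermost pixels.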