12 Commits
Author SHA1 Message Date
DROOdotFOO and Ramiro Polla cc7c567920 swscale/aarch64/yuv2rgb_neon: add BE 16bpp output formats
BE counterparts to the LE paths in 2e142e52ae; pack adds rev16 before
store. nv12/nv21 paths are added but bench-only (no C ref, same as
2e142e52ae).
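
As a scalar illustration of why one byte swap is enough to turn the LE pack into a BE one, here is a minimal C sketch. The helper names (`pack_rgb565`, `swap16`, `store_rgb565be`) are hypothetical and do not appear in the commit; the NEON code does the equivalent per 16-bit lane with `rev16`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model only: pack 8-bit R, G, B into one RGB565 value (R in the high bits). */
static uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Swap the two bytes of a 16-bit value, which is what rev16 does in each 16-bit lane. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Hypothetical helper: write one pixel as rgb565be, independent of host endianness. */
static void store_rgb565be(uint8_t *dst, uint8_t r, uint8_t g, uint8_t b)
{
    assert(dst != NULL);
    /* After the swap, the low byte of `px` is the original high byte. */
    uint16_t px = swap16(pack_rgb565(r, g, b));
    dst[0] = (uint8_t)(px & 0xff); /* original high byte goes first */
    dst[1] = (uint8_t)(px >> 8);   /* original low byte goes second */
}
```

Writing the swapped value low byte first is what a little-endian 16-bit store does, so the BE path can reuse the LE store unchanged.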

Test Name                              A55-gcc           M1-clang             A76-gcc
-------------------------------------------------------------------------------------
yuv420p_rgb565be_1920_neon    15086.1 ( 3.91x)    5507.0 ( 4.34x)    19229.1 ( 2.02x)
yuv420p_bgr565be_1920_neon    15291.7 ( 3.84x)    5476.9 ( 4.37x)    19229.4 ( 2.02x)
yuv420p_rgb555be_1920_neon    15091.5 ( 3.67x)    5569.0 ( 3.97x)    19229.3 ( 1.90x)
yuv420p_bgr555be_1920_neon    15298.6 ( 3.62x)    5600.6 ( 3.98x)    19228.8 ( 1.90x)
yuv422p_rgb565be_1920_neon    16862.3 ( 4.00x)    6378.8 ( 4.64x)    22110.3 ( 2.07x)
yuv422p_bgr565be_1920_neon    17139.3 ( 3.93x)    6448.1 ( 4.50x)    22104.1 ( 2.07x)
yuv422p_rgb555be_1920_neon    16853.3 ( 3.98x)    6468.8 ( 4.12x)    22106.4 ( 1.98x)
yuv422p_bgr555be_1920_neon    17202.2 ( 3.89x)    6467.0 ( 4.12x)    22110.2 ( 1.98x)
yuva420p_rgb565be_1920_neon   15050.2 ( 3.92x)    5452.5 ( 4.39x)    19229.5 ( 2.02x)
yuva420p_bgr565be_1920_neon   15346.6 ( 3.84x)    5462.4 ( 4.36x)    19228.9 ( 2.02x)
yuva420p_rgb555be_1920_neon   15050.8 ( 3.69x)    5463.3 ( 3.95x)    19228.6 ( 1.90x)
yuva420p_bgr555be_1920_neon   15352.8 ( 3.61x)    5543.6 ( 3.89x)    19228.6 ( 1.90x)

Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: DROOdotFOO <drew@axol.io>
2026-06-10 17:54:20 +00:00
DROOdotFOO and Ramiro Polla 7ab5aebc08 swscale/yuv2rgb: add explicit BE/LE 565/555 cases
ff_yuv2rgb_get_func_ptr() now returns the C reference for explicit
BE/LE 16bpp formats, not only the NE alias.

Signed-off-by: DROOdotFOO <drew@axol.io>
2026-06-10 17:54:20 +00:00
DROOdotFOO and Ramiro Polla d9e2239f3c swscale/aarch64/yuv2rgb_neon: 2 lines at a time, yuva420p
Alpha is full resolution, so each row loads its own 16 alpha bytes
via process_row's \rsrcA arg.

Test Name                                A55-gcc            M1-clang             A76-gcc
----------------------------------------------------------------------------------------
yuva420p_to_argb_neon            22607.6 (1.16x)        39.2 (1.24x)     13631.6 (1.12x)
yuva420p_to_rgba_neon            22608.2 (1.16x)        38.3 (1.21x)     13912.8 (1.12x)
yuva420p_to_abgr_neon            23074.6 (1.16x)        38.8 (1.22x)     14492.1 (1.08x)
yuva420p_to_bgra_neon            23079.7 (1.16x)        39.9 (1.19x)     14472.6 (1.08x)

Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: DROOdotFOO <drew@axol.io>
2026-06-06 19:38:40 +02:00
DROOdotFOO and Ramiro Polla 4b6f7c2a05 swscale/aarch64/yuv2rgb_neon: 2 lines at a time, rgb16
pack_rgb16_2l uses v26-v29 as scratch (luma temps, dead by then)
instead of v20-v23, so v20-v25 chroma survives the pack step. A
.error trips if yuva420p hits rgb16 (v28/v29 would clobber alpha);
the dispatcher routes that combination through yuv420p anyway.

Test Name                                A55-gcc            M1-clang             A76-gcc
----------------------------------------------------------------------------------------
nv12_to_rgb565le_neon            28531.9 (1.12x)        46.8 (1.28x)     19252.9 (1.09x)
nv12_to_bgr565le_neon            29018.1 (1.12x)        48.1 (1.17x)     19252.0 (1.09x)
nv12_to_rgb555le_neon            28531.3 (1.12x)        47.2 (1.24x)     19253.6 (1.09x)
nv12_to_bgr555le_neon            29012.1 (1.12x)        45.8 (1.22x)     19252.5 (1.09x)
nv21_to_rgb565le_neon            28532.3 (1.12x)        48.4 (1.15x)     19430.0 (1.09x)
nv21_to_bgr565le_neon            29013.8 (1.12x)        47.2 (1.21x)     19428.8 (1.09x)
nv21_to_rgb555le_neon            28533.3 (1.12x)        49.7 (1.16x)     19430.5 (1.09x)
nv21_to_bgr555le_neon            29011.4 (1.12x)        48.5 (1.18x)     19428.7 (1.09x)
yuv420p_to_rgb565le_neon         28351.9 (1.11x)        46.4 (1.18x)     19635.3 (1.08x)
yuv420p_to_bgr565le_neon         28831.8 (1.11x)        50.8 (1.09x)     19634.5 (1.08x)
yuv420p_to_rgb555le_neon         28351.3 (1.11x)        46.3 (1.23x)     19634.2 (1.08x)
yuv420p_to_bgr555le_neon         28829.1 (1.11x)        46.5 (1.21x)     19634.3 (1.08x)
yuva420p_to_rgb565le_neon        28349.5 (1.11x)        51.2 (1.06x)     19634.7 (1.08x)
yuva420p_to_bgr565le_neon        28833.1 (1.11x)        48.6 (1.17x)     19633.9 (1.08x)
yuva420p_to_rgb555le_neon        28351.6 (1.11x)        47.8 (1.16x)     19635.2 (1.08x)
yuva420p_to_bgr555le_neon        28831.5 (1.11x)        46.4 (1.14x)     19634.8 (1.08x)

Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: DROOdotFOO <drew@axol.io>
2026-06-06 19:38:40 +02:00
DROOdotFOO and Ramiro Polla dad212060c swscale/aarch64/yuv2rgb_neon: 2 lines at a time, gbrp
Six dst pointers exhaust the caller-saved registers; spill x19/x20.
yuva420p_to_gbrp_neon is routed through the yuv420p path by the
dispatcher (gbrp has no alpha channel).

Test Name                                A55-gcc            M1-clang             A76-gcc
----------------------------------------------------------------------------------------
nv12_to_gbrp_neon                20017.8 (1.15x)        32.8 (1.34x)     10658.0 (1.27x)
nv21_to_gbrp_neon                20020.9 (1.15x)        32.5 (1.36x)     10691.1 (1.26x)
yuv420p_to_gbrp_neon             19856.3 (1.14x)        31.4 (1.34x)     10348.0 (1.37x)
yuva420p_to_gbrp_neon            19859.8 (1.14x)        30.9 (1.27x)     10350.9 (1.37x)

Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: DROOdotFOO <drew@axol.io>
2026-06-06 19:38:40 +02:00
DROOdotFOO and Ramiro Polla 4bfe7efd0c swscale/aarch64/yuv2rgb_neon: 2 lines at a time, packed RGB
Vertically-subsampled inputs (nv12, nv21, yuv420p) share a chroma
row across two output rows; compute the chroma -> RGB offsets once
and apply to both luma rows. Covers argb/rgba/abgr/bgra/rgb24/bgr24.
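
The sharing described above can be modelled in scalar C roughly as follows. This is a sketch under stated assumptions, not the NEON implementation: the names (`ChromaOffsets`, `chroma_offsets`, `two_rows_to_rgb24`) are hypothetical, and the coefficients are illustrative full-range BT.601 values in 16.16 fixed point, whereas libswscale takes its coefficients from the configured colorspace.

```c
#include <assert.h>
#include <stdint.h>

/* Chroma contribution to R, G and B for one (U, V) sample. */
typedef struct { int r, g, b; } ChromaOffsets;

static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Assumed coefficients: full-range BT.601, scaled by 65536. */
static ChromaOffsets chroma_offsets(uint8_t u, uint8_t v)
{
    int cb = u - 128, cr = v - 128;
    ChromaOffsets o;
    o.r = (91881 * cr) / 65536;
    o.g = (-22554 * cb - 46802 * cr) / 65536;
    o.b = (116130 * cb) / 65536;
    return o;
}

static void put_rgb24(uint8_t *dst, uint8_t y, const ChromaOffsets *o)
{
    dst[0] = clip_u8(y + o->r);
    dst[1] = clip_u8(y + o->g);
    dst[2] = clip_u8(y + o->b);
}

/* Convert two luma rows that share one 4:2:0 chroma row (width must be even). */
static void two_rows_to_rgb24(const uint8_t *y0, const uint8_t *y1,
                              const uint8_t *u, const uint8_t *v,
                              uint8_t *dst0, uint8_t *dst1, int width)
{
    assert(width % 2 == 0);
    for (int x = 0; x < width; x += 2) {
        /* Computed once per chroma sample, then reused for the 2x2 luma block. */
        ChromaOffsets o = chroma_offsets(u[x / 2], v[x / 2]);
        put_rgb24(dst0 + 3 * x,       y0[x],     &o);
        put_rgb24(dst0 + 3 * (x + 1), y0[x + 1], &o);
        put_rgb24(dst1 + 3 * x,       y1[x],     &o);
        put_rgb24(dst1 + 3 * (x + 1), y1[x + 1], &o);
    }
}
```

A one-row-at-a-time converter would call `chroma_offsets` twice for the same chroma sample, once per output row; the saving comes from dropping the second call.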

Test Name                                A55-gcc            M1-clang             A76-gcc
----------------------------------------------------------------------------------------
nv12_to_argb_neon                21647.2 (1.16x)        40.1 (1.24x)     13813.3 (1.16x)
nv12_to_rgba_neon                21653.7 (1.16x)        40.8 (1.32x)     14105.0 (1.13x)
nv12_to_abgr_neon                22122.2 (1.15x)        40.3 (1.27x)     14100.2 (1.16x)
nv12_to_bgra_neon                22121.6 (1.15x)        39.6 (1.24x)     14125.9 (1.16x)
nv12_to_rgb24_neon               19842.0 (1.18x)        33.4 (1.28x)     12868.9 (1.17x)
nv12_to_bgr24_neon               20318.0 (1.18x)        34.6 (1.23x)     12868.8 (1.17x)
nv21_to_argb_neon                21648.5 (1.16x)        41.0 (1.29x)     13978.5 (1.14x)
nv21_to_rgba_neon                21653.0 (1.16x)        41.3 (1.21x)     14173.5 (1.11x)
nv21_to_abgr_neon                22120.6 (1.15x)        41.1 (1.20x)     14505.4 (1.14x)
nv21_to_bgra_neon                22120.8 (1.15x)        41.0 (1.22x)     14520.1 (1.14x)
nv21_to_rgb24_neon               19830.5 (1.19x)        35.1 (1.28x)     12832.4 (1.17x)
nv21_to_bgr24_neon               20317.1 (1.18x)        34.6 (1.27x)     12833.1 (1.17x)
yuv420p_to_argb_neon             21450.2 (1.15x)        39.2 (1.19x)     14118.3 (1.12x)
yuv420p_to_rgba_neon             21447.2 (1.15x)        38.8 (1.24x)     14326.0 (1.14x)
yuv420p_to_abgr_neon             21927.0 (1.15x)        38.9 (1.25x)     14826.6 (1.13x)
yuv420p_to_bgra_neon             21930.8 (1.15x)        41.4 (1.18x)     14822.9 (1.13x)
yuv420p_to_rgb24_neon            19365.5 (1.17x)        33.5 (1.25x)     13291.8 (1.16x)
yuv420p_to_bgr24_neon            19848.8 (1.16x)        34.1 (1.35x)     13292.8 (1.16x)

Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: DROOdotFOO <drew@axol.io>
2026-06-06 19:38:40 +02:00
DROOdotFOO and Ramiro Polla 11b1721b11 swscale/aarch64/yuv2rgb_neon: reorder params, unify signature
Pass src[]/srcStride[] as arrays (x5/x6), move y_offset/y_coeff into
register args (w2/w3). Only int-after-pointer stack args remain, so
Apple and AAPCS64 lay them out identically; every __APPLE__ is gone.
nv12/nv21/yuv420p/yuv422p/yuva420p share one signature.

Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: DROOdotFOO <drew@axol.io>
2026-06-06 19:38:40 +02:00
DROOdotFOO and Ramiro Polla 8dbc729950 swscale/aarch64/yuv2rgb_neon: name registers
the loop body. .text byte-identical.

Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: DROOdotFOO <drew@axol.io>
2026-06-06 19:38:40 +02:00
DROOdotFOO and Ramiro Polla e0fa641240 swscale/aarch64/yuv2rgb_neon: chroma-preserve compute_rgb
Macro writes per-luma sums into the destination registers, leaving
v20-v25 (chroma -> RGB offsets) intact for the 2-line callers. Takes
bare register names. compute_rgba and compute_rgba_alpha follow suit.

Single-row callers reload v20-v25 each iteration via
chroma_to_rgb_offsets, so the change is a no-op for them: Apple M1
width=1920 mean -0.54% across 55 paths, within bench noise.

Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: DROOdotFOO <drew@axol.io>
2026-06-06 19:38:40 +02:00
DROOdotFOO 30595cbc5d swscale/aarch64/yuv2rgb_neon: aggregate 16bpp predicates
The six .ifc cascades that gate 16bpp behavior in yuv2rgb_neon.S
(linesize padding in three load_args macros, d8/d9 save/restore,
main-loop pack dispatch) all branch on the same four output formats.
Aggregate the predicate into four GAS .set symbols emitted once per
declare_func via a new set_rgb16_predicates macro:

  rgb16   - 1 for *565le and *555le outputs; 0 otherwise
  r_first - 1 for rgb*le (R high); 0 for bgr*le (B high)
  gshift  - 2 for 565, 3 for 555 (passed as pack_rgb16's g_shr)
  hshift  - 11 for 565, 10 for 555 (passed as pack_rgb16's high_shl)

Call sites become a flat ".if rgb16" gate (five places) plus a 2-way
".if r_first" inside ".if rgb16" for the pack dispatch (one place).
.if/.endif count drops from 46 to 33; -88/+49 lines net.
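
To make the four symbols concrete, here is a scalar C model of how they could drive the pack step. It is a reading of the descriptions above, not a transcription of the assembly: `model_predicates` and `model_pack` are hypothetical names, the zero values for non-16bpp formats are an assumption, and the packing formula is inferred from the `g_shr` and `high_shl` roles.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    int rgb16;   /* 1 for *565le and *555le outputs, 0 otherwise */
    int r_first; /* 1 when R sits in the high bits (rgb*le), 0 for bgr*le */
    int gshift;  /* right shift applied to G: 2 for 565, 3 for 555 */
    int hshift;  /* left shift of the high component: 11 for 565, 10 for 555 */
} Rgb16Predicates;

/* Derive all four predicates once from the output format name. */
static Rgb16Predicates model_predicates(const char *ofmt)
{
    Rgb16Predicates p = {0, 0, 0, 0}; /* assumed defaults for non-16bpp formats */
    int is565 = strstr(ofmt, "565le") != NULL;
    int is555 = strstr(ofmt, "555le") != NULL;
    if (!is565 && !is555)
        return p;
    p.rgb16   = 1;
    p.r_first = strncmp(ofmt, "rgb", 3) == 0;
    p.gshift  = is565 ? 2 : 3;
    p.hshift  = is565 ? 11 : 10;
    return p;
}

/* Scalar equivalent of one packed 16bpp pixel. */
static uint16_t model_pack(uint8_t r, uint8_t g, uint8_t b, Rgb16Predicates p)
{
    assert(p.rgb16);
    uint8_t hi = p.r_first ? r : b;
    uint8_t lo = p.r_first ? b : r;
    return (uint16_t)(((hi >> 3) << p.hshift) |
                      ((g >> p.gshift) << 5) |
                      (lo >> 3));
}
```

In this model the format name is inspected in one place only, and every later decision reads a precomputed flag, which is the same shape as replacing repeated .ifc cascades with ".if rgb16" and ".if r_first".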

Pure source-level refactor: the full object disassembly is byte-for-byte
identical to the pre-refactor build (MD5 2a6ac497cabc81849e0c80ec0fde0550
on Apple M1, clang). checkasm --test=sw_yuv2rgb 110/110, full checkasm
7657/7657.

Signed-off-by: DROOdotFOO <drew@axol.io>
2026-05-26 19:26:28 +02:00
DROOdotFOO and Martin Storsjö 2e142e52ae swscale/aarch64: add NEON yuv->rgb16 fast paths
Add NEON unscaled converters for {yuv420p, yuv422p, yuva420p, nv12, nv21}
to {rgb565le, bgr565le, rgb555le, bgr555le}.

The 16bpp packing uses v8/v9 as the output accumulator. Since AAPCS-64
requires d8-d15 to be callee-saved, declare_func now wraps a
stp d8, d9 / ldp d8, d9 around 16bpp paths only (gated by .ifc on the
output format). Pattern matches libswscale/aarch64/hscale.S.

yuva420p -> 16bpp drops alpha and routes through the yuv420p wrappers,
mirroring how yuva420p -> rgb24/bgr24 already work in tree.

Speedup vs C at width=1920 on Apple M1 (checkasm --bench):

  | input    | rgb565le | bgr565le | rgb555le | bgr555le |
  |----------|----------|----------|----------|----------|
  | yuv420p  | 3.69x    | 3.68x    | 3.28x    | 3.31x    |
  | yuv422p  | 4.70x    | 4.70x    | 4.32x    | 4.35x    |
  | yuva420p | 3.67x    | 3.66x    | 3.32x    | 3.27x    |

NEON cycles are ~48 for planar and ~50.5 for semi-planar across all
four outputs. yuv422p shows the biggest speedup because its C
reference is the most expensive. 555 ratios trail 565 because the C
reference is faster for 555 (one fewer mask bit); NEON cycles are the
same. nv12/nv21 are bench-only (see the preceding checkasm commit) and
run at the same ~50.5 cycles.

This only handles the little endian forms of the 16 bit RGB formats.

Verified with checkasm --test=sw_yuv2rgb (110/110) and the full
checkasm regression (7657/7657) on Apple M1.

Signed-off-by: DROOdotFOO <drew@axol.io>
2026-05-22 10:03:07 +00:00
DROOdotFOO and Martin Storsjö 34501921fd tests/checkasm/sw_yuv2rgb: cover nv12 and nv21
The previous chroma stride formula (width >> log2_chroma_w) is correct
for planar yuv but wrong for semi-planar nv12/nv21, where the UV plane
is interleaved at width bytes per row (width/2 UV pairs of 2 bytes
each). Use av_image_get_linesize() so the test feeds a valid stride to
libswscale regardless of input format; for the existing planar suites
the value is unchanged.
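
The stride difference can be checked with a small stand-alone calculation. The helper below is hypothetical and only models 8-bit formats with an evenly divisible width; the test itself relies on av_image_get_linesize() rather than anything like this.

```c
#include <assert.h>

/* Hypothetical helper: bytes per row of the first chroma plane, 8-bit samples.
 * `interleaved` is 1 for nv12/nv21 (U and V share one plane), 0 for planar yuv. */
static int chroma_linesize(int width, int log2_chroma_w, int interleaved)
{
    assert(width > 0 && log2_chroma_w >= 0);
    assert(width % (1 << log2_chroma_w) == 0); /* sketch assumes no partial sample */
    int samples = width >> log2_chroma_w;      /* chroma samples per row */
    return samples * (interleaved ? 2 : 1);    /* a UV pair takes 2 bytes */
}
```

For width 1920 with log2_chroma_w = 1, the planar case gives 960 bytes, matching the old formula, while the interleaved case gives 1920 bytes, twice what the old formula supplied.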

With the stride fixed, add nv12 and nv21 to check_yuv2rgb() so the
upcoming NEON 16bpp paths get bench coverage. ff_get_unscaled_swscale
does not wire a C yuv2rgb fast path for these inputs, so the suites
report bench-only (no correctness reference); they still run clobber
detection and cycle counts.

Signed-off-by: DROOdotFOO <drew@axol.io>
2026-05-22 10:03:07 +00:00