ffmpeg

124,952 Commits 39 Branches 423 Tags

12 Commits

Author	SHA1	Message	Date
DROOdotFOO and Ramiro Polla	cc7c567920	swscale/aarch64/yuv2rgb_neon: add BE 16bpp output formats	2026-06-10 17:54:20 +00:00

    BE counterparts to the LE paths in 2e142e52ae; pack adds rev16 before
    store. nv12/nv21 paths are added but bench-only (no C ref, same as
    `2e142e52ae`).

    Test Name                      A55-gcc            M1-clang           A76-gcc
    -------------------------------------------------------------------------------------
    yuv420p_rgb565be_1920_neon     15086.1 ( 3.91x)   5507.0 ( 4.34x)   19229.1 ( 2.02x)
    yuv420p_bgr565be_1920_neon     15291.7 ( 3.84x)   5476.9 ( 4.37x)   19229.4 ( 2.02x)
    yuv420p_rgb555be_1920_neon     15091.5 ( 3.67x)   5569.0 ( 3.97x)   19229.3 ( 1.90x)
    yuv420p_bgr555be_1920_neon     15298.6 ( 3.62x)   5600.6 ( 3.98x)   19228.8 ( 1.90x)
    yuv422p_rgb565be_1920_neon     16862.3 ( 4.00x)   6378.8 ( 4.64x)   22110.3 ( 2.07x)
    yuv422p_bgr565be_1920_neon     17139.3 ( 3.93x)   6448.1 ( 4.50x)   22104.1 ( 2.07x)
    yuv422p_rgb555be_1920_neon     16853.3 ( 3.98x)   6468.8 ( 4.12x)   22106.4 ( 1.98x)
    yuv422p_bgr555be_1920_neon     17202.2 ( 3.89x)   6467.0 ( 4.12x)   22110.2 ( 1.98x)
    yuva420p_rgb565be_1920_neon    15050.2 ( 3.92x)   5452.5 ( 4.39x)   19229.5 ( 2.02x)
    yuva420p_bgr565be_1920_neon    15346.6 ( 3.84x)   5462.4 ( 4.36x)   19228.9 ( 2.02x)
    yuva420p_rgb555be_1920_neon    15050.8 ( 3.69x)   5463.3 ( 3.95x)   19228.6 ( 1.90x)
    yuva420p_bgr555be_1920_neon    15352.8 ( 3.61x)   5543.6 ( 3.89x)   19228.6 ( 1.90x)

    Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
    Signed-off-by: DROOdotFOO <drew@axol.io>
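
The "rev16 before store" step in cc7c567920 swaps the two bytes of every 16-bit lane, so a little-endian store leaves big-endian pixels in memory. A scalar C sketch of that idea follows. The helper names `bswap16` and `store_be16` are hypothetical and are not taken from the FFmpeg tree; this only models the byte order, not the NEON code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Swap the two bytes of a 16-bit value: the scalar equivalent of what
 * rev16 does to each 16-bit lane of a vector register. */
static uint16_t bswap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Hypothetical helper: write an already packed 16bpp pixel in big-endian
 * byte order, independent of the host's endianness. */
static void store_be16(uint8_t *dst, uint16_t px)
{
    assert(dst != NULL);
    dst[0] = (uint8_t)(px >> 8);   /* most significant byte first */
    dst[1] = (uint8_t)(px & 0xff);
}
```

On a little-endian host, storing `bswap16(px)` as a native 16-bit value should produce the same two bytes as `store_be16(dst, px)`.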
DROOdotFOO and Ramiro Polla	7ab5aebc08	swscale/yuv2rgb: add explicit BE/LE 565/555 cases	2026-06-10 17:54:20 +00:00

    ff_yuv2rgb_get_func_ptr() now returns the C reference for explicit
    BE/LE 16bpp formats, not only the NE alias.

    Signed-off-by: DROOdotFOO <drew@axol.io>
DROOdotFOO and Ramiro Polla	d9e2239f3c	swscale/aarch64/yuv2rgb_neon: 2 lines at a time, yuva420p	2026-06-06 19:38:40 +02:00

    Alpha is full resolution, so each row loads its own 16 alpha bytes via
    process_row's \rsrcA arg.

    Test Name                A55-gcc           M1-clang       A76-gcc
    ----------------------------------------------------------------------------------------
    yuva420p_to_argb_neon    22607.6 (1.16x)   39.2 (1.24x)   13631.6 (1.12x)
    yuva420p_to_rgba_neon    22608.2 (1.16x)   38.3 (1.21x)   13912.8 (1.12x)
    yuva420p_to_abgr_neon    23074.6 (1.16x)   38.8 (1.22x)   14492.1 (1.08x)
    yuva420p_to_bgra_neon    23079.7 (1.16x)   39.9 (1.19x)   14472.6 (1.08x)

    Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
    Signed-off-by: DROOdotFOO <drew@axol.io>
DROOdotFOO and Ramiro Polla	4b6f7c2a05	swscale/aarch64/yuv2rgb_neon: 2 lines at a time, rgb16	2026-06-06 19:38:40 +02:00

    pack_rgb16_2l uses v26-v29 as scratch (luma temps, dead by then)
    instead of v20-v23, so v20-v25 chroma survives the pack step. A .error
    trips if yuva420p hits rgb16 (v28/v29 would clobber alpha); the
    dispatcher routes that combination through yuv420p anyway.

    Test Name                    A55-gcc           M1-clang       A76-gcc
    ----------------------------------------------------------------------------------------
    nv12_to_rgb565le_neon        28531.9 (1.12x)   46.8 (1.28x)   19252.9 (1.09x)
    nv12_to_bgr565le_neon        29018.1 (1.12x)   48.1 (1.17x)   19252.0 (1.09x)
    nv12_to_rgb555le_neon        28531.3 (1.12x)   47.2 (1.24x)   19253.6 (1.09x)
    nv12_to_bgr555le_neon        29012.1 (1.12x)   45.8 (1.22x)   19252.5 (1.09x)
    nv21_to_rgb565le_neon        28532.3 (1.12x)   48.4 (1.15x)   19430.0 (1.09x)
    nv21_to_bgr565le_neon        29013.8 (1.12x)   47.2 (1.21x)   19428.8 (1.09x)
    nv21_to_rgb555le_neon        28533.3 (1.12x)   49.7 (1.16x)   19430.5 (1.09x)
    nv21_to_bgr555le_neon        29011.4 (1.12x)   48.5 (1.18x)   19428.7 (1.09x)
    yuv420p_to_rgb565le_neon     28351.9 (1.11x)   46.4 (1.18x)   19635.3 (1.08x)
    yuv420p_to_bgr565le_neon     28831.8 (1.11x)   50.8 (1.09x)   19634.5 (1.08x)
    yuv420p_to_rgb555le_neon     28351.3 (1.11x)   46.3 (1.23x)   19634.2 (1.08x)
    yuv420p_to_bgr555le_neon     28829.1 (1.11x)   46.5 (1.21x)   19634.3 (1.08x)
    yuva420p_to_rgb565le_neon    28349.5 (1.11x)   51.2 (1.06x)   19634.7 (1.08x)
    yuva420p_to_bgr565le_neon    28833.1 (1.11x)   48.6 (1.17x)   19633.9 (1.08x)
    yuva420p_to_rgb555le_neon    28351.6 (1.11x)   47.8 (1.16x)   19635.2 (1.08x)
    yuva420p_to_bgr555le_neon    28831.5 (1.11x)   46.4 (1.14x)   19634.8 (1.08x)

    Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
    Signed-off-by: DROOdotFOO <drew@axol.io>
DROOdotFOO and Ramiro Polla	dad212060c	swscale/aarch64/yuv2rgb_neon: 2 lines at a time, gbrp	2026-06-06 19:38:40 +02:00

    Six dst pointers exhaust the caller-saved registers; spill x19/x20.
    yuva420p_to_gbrp_neon is routed through the yuv420p path by the
    dispatcher (gbrp has no alpha channel).

    Test Name                A55-gcc           M1-clang       A76-gcc
    ----------------------------------------------------------------------------------------
    nv12_to_gbrp_neon        20017.8 (1.15x)   32.8 (1.34x)   10658.0 (1.27x)
    nv21_to_gbrp_neon        20020.9 (1.15x)   32.5 (1.36x)   10691.1 (1.26x)
    yuv420p_to_gbrp_neon     19856.3 (1.14x)   31.4 (1.34x)   10348.0 (1.37x)
    yuva420p_to_gbrp_neon    19859.8 (1.14x)   30.9 (1.27x)   10350.9 (1.37x)

    Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
    Signed-off-by: DROOdotFOO <drew@axol.io>
DROOdotFOO and Ramiro Polla	4bfe7efd0c	swscale/aarch64/yuv2rgb_neon: 2 lines at a time, packed RGB	2026-06-06 19:38:40 +02:00

    Vertically-subsampled inputs (nv12, nv21, yuv420p) share a chroma row
    across two output rows; compute the chroma -> RGB offsets once and
    apply to both luma rows. Covers argb/rgba/abgr/bgra/rgb24/bgr24.

    Test Name                A55-gcc           M1-clang       A76-gcc
    ----------------------------------------------------------------------------------------
    nv12_to_argb_neon        21647.2 (1.16x)   40.1 (1.24x)   13813.3 (1.16x)
    nv12_to_rgba_neon        21653.7 (1.16x)   40.8 (1.32x)   14105.0 (1.13x)
    nv12_to_abgr_neon        22122.2 (1.15x)   40.3 (1.27x)   14100.2 (1.16x)
    nv12_to_bgra_neon        22121.6 (1.15x)   39.6 (1.24x)   14125.9 (1.16x)
    nv12_to_rgb24_neon       19842.0 (1.18x)   33.4 (1.28x)   12868.9 (1.17x)
    nv12_to_bgr24_neon       20318.0 (1.18x)   34.6 (1.23x)   12868.8 (1.17x)
    nv21_to_argb_neon        21648.5 (1.16x)   41.0 (1.29x)   13978.5 (1.14x)
    nv21_to_rgba_neon        21653.0 (1.16x)   41.3 (1.21x)   14173.5 (1.11x)
    nv21_to_abgr_neon        22120.6 (1.15x)   41.1 (1.20x)   14505.4 (1.14x)
    nv21_to_bgra_neon        22120.8 (1.15x)   41.0 (1.22x)   14520.1 (1.14x)
    nv21_to_rgb24_neon       19830.5 (1.19x)   35.1 (1.28x)   12832.4 (1.17x)
    nv21_to_bgr24_neon       20317.1 (1.18x)   34.6 (1.27x)   12833.1 (1.17x)
    yuv420p_to_argb_neon     21450.2 (1.15x)   39.2 (1.19x)   14118.3 (1.12x)
    yuv420p_to_rgba_neon     21447.2 (1.15x)   38.8 (1.24x)   14326.0 (1.14x)
    yuv420p_to_abgr_neon     21927.0 (1.15x)   38.9 (1.25x)   14826.6 (1.13x)
    yuv420p_to_bgra_neon     21930.8 (1.15x)   41.4 (1.18x)   14822.9 (1.13x)
    yuv420p_to_rgb24_neon    19365.5 (1.17x)   33.5 (1.25x)   13291.8 (1.16x)
    yuv420p_to_bgr24_neon    19848.8 (1.16x)   34.1 (1.35x)   13292.8 (1.16x)

    Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
    Signed-off-by: DROOdotFOO <drew@axol.io>
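
The idea behind the "2 lines at a time" commits (4bfe7efd0c and the ones that follow it) can be modelled in scalar C: with 4:2:0 input, one U/V pair covers a 2x2 block of luma, so the chroma contribution to R, G and B is computed once and reused for both rows. This is only a sketch. The type and function names are hypothetical, and the fixed-point constants are common BT.601 limited-range approximations chosen for illustration, not the coefficient tables swscale actually passes to the NEON code.

```c
#include <assert.h>
#include <stdint.h>

/* Per-chroma-sample contributions to R, G and B (8.8 fixed point). */
typedef struct { int r, g, b; } ChromaOffsets;

static ChromaOffsets chroma_offsets(uint8_t u, uint8_t v)
{
    int d = u - 128, e = v - 128;
    ChromaOffsets o = { 409 * e, -100 * d - 208 * e, 516 * d };
    return o;
}

/* Drop the 8 fractional bits and clamp to 0..255. Negative values are
 * clamped before shifting to avoid right-shifting a negative int. */
static uint8_t shift_clip(int v)
{
    if (v < 0)
        return 0;
    v >>= 8;
    return v > 255 ? 255 : (uint8_t)v;
}

static void luma_to_rgb24(uint8_t *dst, uint8_t y, ChromaOffsets o)
{
    int c = 298 * (y - 16) + 128; /* scaled luma plus rounding term */
    dst[0] = shift_clip(c + o.r);
    dst[1] = shift_clip(c + o.g);
    dst[2] = shift_clip(c + o.b);
}

/* Convert two luma rows that share a single chroma row. */
static void yuv420_2rows_to_rgb24(uint8_t *dst0, uint8_t *dst1,
                                  const uint8_t *y0, const uint8_t *y1,
                                  const uint8_t *u, const uint8_t *v,
                                  int width)
{
    assert(width > 0 && width % 2 == 0);
    for (int x = 0; x < width; x += 2) {
        /* Computed once per 2x2 block, applied to four luma samples. */
        ChromaOffsets o = chroma_offsets(u[x / 2], v[x / 2]);
        luma_to_rgb24(dst0 + 3 * x,       y0[x],     o);
        luma_to_rgb24(dst0 + 3 * (x + 1), y0[x + 1], o);
        luma_to_rgb24(dst1 + 3 * x,       y1[x],     o);
        luma_to_rgb24(dst1 + 3 * (x + 1), y1[x + 1], o);
    }
}
```

A one-row-at-a-time version would call `chroma_offsets` twice for the same U/V pair; the saving in the sketch is that second call, which is the cost the commit message describes avoiding.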
DROOdotFOO and Ramiro Polla	11b1721b11	swscale/aarch64/yuv2rgb_neon: reorder params, unify signature	2026-06-06 19:38:40 +02:00

    Pass src[]/srcStride[] as arrays (x5/x6), move y_offset/y_coeff into
    register args (w2/w3). Only int-after-pointer stack args remain, so
    Apple and AAPCS64 lay them out identically; every __APPLE__ is gone.
    nv12/nv21/yuv420p/yuv422p/yuva420p share one signature.

    Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
    Signed-off-by: DROOdotFOO <drew@axol.io>
DROOdotFOO and Ramiro Polla	8dbc729950	swscale/aarch64/yuv2rgb_neon: name registers the loop body. .text byte-identical. Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com> Signed-off-by: DROOdotFOO <drew@axol.io>	2026-06-06 19:38:40 +02:00
DROOdotFOO and Ramiro Polla	e0fa641240	swscale/aarch64/yuv2rgb_neon: chroma-preserve compute_rgb	2026-06-06 19:38:40 +02:00

    Macro writes per-luma sums into the destination registers, leaving
    v20-v25 (chroma -> RGB offsets) intact for the 2-line callers. Takes
    bare register names. compute_rgba and compute_rgba_alpha follow suit.

    Single-row callers reload v20-v25 each iteration via
    chroma_to_rgb_offsets, so the change is a no-op for them: Apple M1
    width=1920 mean -0.54% across 55 paths, within bench noise.

    Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
    Signed-off-by: DROOdotFOO <drew@axol.io>
DROOdotFOO	30595cbc5d	swscale/aarch64/yuv2rgb_neon: aggregate 16bpp predicates	2026-05-26 19:26:28 +02:00

    The six .ifc cascades that gate 16bpp behavior in yuv2rgb_neon.S
    (linesize padding in three load_args macros, d8/d9 save/restore,
    main-loop pack dispatch) all branch on the same four output formats.
    Aggregate the predicate into four GAS .set symbols emitted once per
    declare_func via a new set_rgb16_predicates macro:

      rgb16   - 1 for 565le and 555le outputs; 0 otherwise
      r_first - 1 for rgble (R high); 0 for bgrle (B high)
      gshift  - 2 for 565, 3 for 555 (passed as pack_rgb16's g_shr)
      hshift  - 11 for 565, 10 for 555 (passed as pack_rgb16's high_shl)

    Call sites become a flat ".if rgb16" gate (five places) plus a 2-way
    ".if r_first" inside ".if rgb16" for the pack dispatch (one place).
    .if/.endif count drops from 46 to 33; -88/+49 lines net.

    Pure source-level refactor: the full object disassembly is
    byte-for-byte identical to the pre-refactor build (MD5
    2a6ac497cabc81849e0c80ec0fde0550 on Apple M1, clang). checkasm
    --test=sw_yuv2rgb 110/110, full checkasm 7657/7657.

    Signed-off-by: DROOdotFOO <drew@axol.io>
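
To make the four predicates from 30595cbc5d concrete, here is a scalar C model of a 16bpp pack driven by them. The symbol names and their values come from the commit message; the struct, the C function, and in particular the way the shifts are combined (5-bit high and low fields, green placed at bit 5) are assumptions made for illustration, not a transcription of the pack_rgb16 macro.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C mirror of the GAS .set symbols named in the commit. */
typedef struct {
    int rgb16;   /* 1 for 565/555 outputs, 0 otherwise */
    int r_first; /* 1: R in the high bits, 0: B in the high bits */
    int gshift;  /* right shift for 8-bit G: 2 (565) or 3 (555) */
    int hshift;  /* left shift for the high field: 11 (565) or 10 (555) */
} Rgb16Predicates;

/* Assumed packing: high and low channels keep their top 5 bits, green
 * keeps 6 bits (565) or 5 bits (555) and always starts at bit 5. */
static uint16_t pack_rgb16(uint8_t r, uint8_t g, uint8_t b, Rgb16Predicates p)
{
    assert(p.rgb16);
    uint8_t high = p.r_first ? r : b;
    uint8_t low  = p.r_first ? b : r;
    return (uint16_t)(((high >> 3) << p.hshift) |
                      ((g >> p.gshift) << 5) |
                      (low >> 3));
}
```

Under these assumptions, `{1, 1, 2, 11}` describes rgb565 and `{1, 0, 3, 10}` describes bgr555, so one pack routine parameterised by two shifts and one ordering flag covers all four formats.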
DROOdotFOO and Martin Storsjö	2e142e52ae	swscale/aarch64: add NEON yuv->rgb16 fast paths	2026-05-22 10:03:07 +00:00

    Add NEON unscaled converters for {yuv420p, yuv422p, yuva420p, nv12,
    nv21} to {rgb565le, bgr565le, rgb555le, bgr555le}.

    The 16bpp packing uses v8/v9 as the output accumulator. Since AAPCS-64
    requires d8-d15 to be callee-saved, declare_func now wraps a
    stp d8, d9 / ldp d8, d9 around 16bpp paths only (gated by .ifc on the
    output format). Pattern matches libswscale/aarch64/hscale.S.

    yuva420p -> 16bpp drops alpha and routes through the yuv420p wrappers,
    mirroring how yuva420p -> rgb24/bgr24 already work in tree.

    Speedup vs C at width=1920 on Apple M1 (checkasm --bench):

    | input    | rgb565le | bgr565le | rgb555le | bgr555le |
    |----------|----------|----------|----------|----------|
    | yuv420p  | 3.69x    | 3.68x    | 3.28x    | 3.31x    |
    | yuv422p  | 4.70x    | 4.70x    | 4.32x    | 4.35x    |
    | yuva420p | 3.67x    | 3.66x    | 3.32x    | 3.27x    |

    NEON cycles are ~48 for planar and ~50.5 for semi-planar across all
    four outputs. yuv422p shows the biggest speedup because its C
    reference is the most expensive. 555 ratios trail 565 because the C
    reference is faster for 555 (one fewer mask bit); NEON cycles are the
    same. nv12/nv21 are bench-only (see the preceding checkasm commit) and
    run at the same ~50.5 cycles.

    This only handles the little endian forms of the 16 bit RGB formats.

    Verified with checkasm --test=sw_yuv2rgb (110/110) and the full
    checkasm regression (7657/7657) on Apple M1.

    Signed-off-by: DROOdotFOO <drew@axol.io>
DROOdotFOO and Martin Storsjö	34501921fd	tests/checkasm/sw_yuv2rgb: cover nv12 and nv21	2026-05-22 10:03:07 +00:00

    The previous chroma stride formula (width >> log2_chroma_w) is correct
    for planar yuv but wrong for semi-planar nv12/nv21, where the UV plane
    is interleaved at width bytes per row (width/2 UV pairs of 2 bytes
    each). Use av_image_get_linesize() so the test feeds a valid stride to
    libswscale regardless of input format; for the existing planar suites
    the value is unchanged.

    With the stride fixed, add nv12 and nv21 to check_yuv2rgb() so the
    upcoming NEON 16bpp paths get bench coverage. ff_get_unscaled_swscale
    does not wire a C yuv2rgb fast path for these inputs, so the suites
    report bench-only (no correctness reference); they still run clobber
    detection and cycle counts.

    Signed-off-by: DROOdotFOO <drew@axol.io>
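
The stride problem described in 34501921fd comes down to a factor of two. The sketch below is a standalone model of the arithmetic for 8-bit formats; `chroma_linesize_8bit` is a hypothetical helper written for this illustration and is not av_image_get_linesize(), which derives the value from the pixel format descriptor instead of taking a semi-planar flag.

```c
#include <assert.h>

/* Hypothetical model: minimum byte stride of the first chroma plane for
 * an 8-bit YUV format. semi_planar is nonzero for nv12/nv21-style
 * layouts where U and V are interleaved in one plane. */
static int chroma_linesize_8bit(int width, int log2_chroma_w, int semi_planar)
{
    assert(width > 0 && log2_chroma_w >= 0);
    /* Chroma samples per row, rounded up (identical to
     * width >> log2_chroma_w when width is a multiple of the
     * subsampling factor). */
    int samples = (width + (1 << log2_chroma_w) - 1) >> log2_chroma_w;
    /* Planar: one byte per sample. Semi-planar: a U byte and a V byte. */
    return semi_planar ? 2 * samples : samples;
}
```

For width 1920 with 2:1 horizontal subsampling, the planar result matches the old formula, while the semi-planar result is twice as large, which is the mismatch the commit fixes.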