ff_yuv2rgb_get_func_ptr() now returns the C reference for explicit
BE/LE 16bpp formats, not only the NE alias.
Signed-off-by: DROOdotFOO <drew@axol.io>
Pass src[]/srcStride[] as arrays (x5/x6), move y_offset/y_coeff into
register args (w2/w3). Only int-after-pointer stack args remain, so
Apple and AAPCS64 lay them out identically; every __APPLE__ is gone.
nv12/nv21/yuv420p/yuv422p/yuva420p share one signature.
Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: DROOdotFOO <drew@axol.io>
Macro writes per-luma sums into the destination registers, leaving
v20-v25 (chroma -> RGB offsets) intact for the 2-line callers. Takes
bare register names. compute_rgba and compute_rgba_alpha follow suit.
Single-row callers reload v20-v25 each iteration via
chroma_to_rgb_offsets, so the change is a no-op for them: on Apple M1
at width=1920 the mean change is -0.54% across 55 paths, within bench
noise.
Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: DROOdotFOO <drew@axol.io>
The six .ifc cascades that gate 16bpp behavior in yuv2rgb_neon.S
(linesize padding in three load_args macros, d8/d9 save/restore,
main-loop pack dispatch) all branch on the same four output formats.
Aggregate the predicate into four GAS .set symbols emitted once per
declare_func via a new set_rgb16_predicates macro:
  rgb16   - 1 for *565le and *555le outputs; 0 otherwise
  r_first - 1 for rgb*le (R high); 0 for bgr*le (B high)
  gshift  - 2 for 565, 3 for 555 (passed as pack_rgb16's g_shr)
  hshift  - 11 for 565, 10 for 555 (passed as pack_rgb16's high_shl)
Call sites become a flat ".if rgb16" gate (five places) plus a 2-way
".if r_first" inside ".if rgb16" for the pack dispatch (one place).
.if/.endif count drops from 46 to 33; -88/+49 lines net.
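
For readers less familiar with the 16bpp layouts, the four predicates map
onto a scalar pack roughly as in the C sketch below. It is an illustration
only: rgb16_params, rgb16_params_for() and pack_rgb16_scalar() are
hypothetical names, not the NEON macros, and the sketch assumes 8-bit
R/G/B inputs reduced by plain truncation (no dithering or rounding).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar mirror of the four .set symbols. */
struct rgb16_params {
    int rgb16;   /* 1 for *565le / *555le outputs, 0 otherwise */
    int r_first; /* 1 when R sits in the high bits (rgb*le)    */
    int gshift;  /* right shift applied to 8-bit G: 2 or 3     */
    int hshift;  /* left shift of the high channel: 11 or 10   */
};

/* Derive the predicates from an output format name. */
static struct rgb16_params rgb16_params_for(const char *ofmt)
{
    struct rgb16_params p = { 0, 0, 0, 0 };
    int is565 = !strcmp(ofmt, "rgb565le") || !strcmp(ofmt, "bgr565le");
    int is555 = !strcmp(ofmt, "rgb555le") || !strcmp(ofmt, "bgr555le");

    if (!is565 && !is555)
        return p; /* not a 16bpp output: all predicates stay 0 */
    p.rgb16   = 1;
    p.r_first = !strncmp(ofmt, "rgb", 3);
    p.gshift  = is565 ? 2 : 3;
    p.hshift  = is565 ? 11 : 10;
    return p;
}

/* Pack one pixel: high channel | green | low channel. */
static uint16_t pack_rgb16_scalar(struct rgb16_params p,
                                  uint8_t r, uint8_t g, uint8_t b)
{
    uint16_t hi  = (p.r_first ? r : b) >> 3; /* 5 bits                */
    uint16_t lo  = (p.r_first ? b : r) >> 3; /* 5 bits                */
    uint16_t mid = g >> p.gshift;            /* 6 bits (565) or 5 (555) */

    return (uint16_t)((hi << p.hshift) | (mid << 5) | lo);
}
```

With these definitions, pure red (255, 0, 0) packs to 0xF800 for rgb565le
and to 0x001F for bgr565le; only r_first differs between the two.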
Pure source-level refactor: the full object disassembly is byte-for-byte
identical to the pre-refactor build (MD5 2a6ac497cabc81849e0c80ec0fde0550
on Apple M1, clang). checkasm --test=sw_yuv2rgb 110/110, full checkasm
7657/7657.
Signed-off-by: DROOdotFOO <drew@axol.io>
Add NEON unscaled converters for {yuv420p, yuv422p, yuva420p, nv12, nv21}
to {rgb565le, bgr565le, rgb555le, bgr555le}.
The 16bpp packing uses v8/v9 as the output accumulator. Since AAPCS64
requires d8-d15 to be callee-saved, declare_func now wraps a
stp d8, d9 / ldp d8, d9 pair around the 16bpp paths only (gated by .ifc
on the output format). The pattern matches libswscale/aarch64/hscale.S.
yuva420p -> 16bpp drops alpha and routes through the yuv420p wrappers,
mirroring how yuva420p -> rgb24/bgr24 already work in tree.
Speedup vs C at width=1920 on Apple M1 (checkasm --bench):
| input    | rgb565le | bgr565le | rgb555le | bgr555le |
|----------|----------|----------|----------|----------|
| yuv420p  | 3.69x    | 3.68x    | 3.28x    | 3.31x    |
| yuv422p  | 4.70x    | 4.70x    | 4.32x    | 4.35x    |
| yuva420p | 3.67x    | 3.66x    | 3.32x    | 3.27x    |
NEON cycles are ~48 for planar and ~50.5 for semi-planar across all
four outputs. yuv422p shows the biggest speedup because its C
reference is the most expensive. 555 ratios trail 565 because the C
reference is faster for 555 (one fewer mask bit); NEON cycles are the
same. nv12/nv21 are bench-only (see the preceding checkasm commit) and
run at the same ~50.5 cycles.
This only handles the little-endian forms of the 16-bit RGB formats.
Verified with checkasm --test=sw_yuv2rgb (110/110) and the full
checkasm regression (7657/7657) on Apple M1.
Signed-off-by: DROOdotFOO <drew@axol.io>
The previous chroma stride formula (width >> log2_chroma_w) is correct
for planar yuv but wrong for semi-planar nv12/nv21, where the UV plane
is interleaved at width bytes per row (width/2 UV pairs of 2 bytes
each). Use av_image_get_linesize() so the test feeds a valid stride to
libswscale regardless of input format; for the existing planar suites
the value is unchanged.
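
The difference between the two stride rules can be sketched in plain C.
This is a standalone illustration, not the checkasm code:
chroma_linesize_old() and chroma_linesize_sketch() are hypothetical
helpers, and the second only approximates what av_image_get_linesize()
reports for plane 1 of an 8-bit format (bytes per chroma position times
the chroma width, rounded up).

```c
/* Old test formula: chroma width in samples, used directly as bytes. */
static int chroma_linesize_old(int width, int log2_chroma_w)
{
    return width >> log2_chroma_w;
}

/*
 * Hypothetical stand-in for av_image_get_linesize() on plane 1.
 * step is the number of bytes stored per chroma position in that plane:
 * 1 for a planar U or V plane, 2 for interleaved UV (nv12/nv21).
 */
static int chroma_linesize_sketch(int width, int log2_chroma_w, int step)
{
    /* Round the subsampled width up so odd widths keep their last pair. */
    int chroma_w = (width + (1 << log2_chroma_w) - 1) >> log2_chroma_w;

    return step * chroma_w;
}
```

For width=1920 with 4:2:0 subsampling, the old formula and the planar
case (step 1) both give 960 bytes, while the interleaved case (step 2)
gives 1920 bytes, which is the stride nv12/nv21 actually need.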
With the stride fixed, add nv12 and nv21 to check_yuv2rgb() so the
upcoming NEON 16bpp paths get bench coverage. ff_get_unscaled_swscale()
does not wire a C yuv2rgb fast path for these inputs, so the suites
report bench-only (no correctness reference); they still run clobber
detection and cycle counts.
Signed-off-by: DROOdotFOO <drew@axol.io>