124952 Commits
Author SHA1 Message Date
Niklas Haas 02a168a576 swscale/uops: keep track of input range during op translation
Needed for the FMA decision logic.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-06-09 18:27:20 +02:00
Niklas Haas 3f9219d605 swscale/uops: add SwsUOpFlags to ff_sws_ops_translate()
These will be used to e.g. enable extra uops during translation.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-06-09 18:27:20 +02:00
Niklas Haas b7a80a9f0d swscale/ops_backend: delete ops-based C backend
And make uops_backend.c the new reference.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-06-09 18:27:20 +02:00
Niklas Haas 100ce4ac41 tests/checkasm/sw_ops: rewrite using uops_macros.h
This ensures 100% coverage of all uop primitives by generating the set of
tests exactly from the list of seen primitives, using the uops macros.

There are some annoying quirks still because of the fact that we have to
essentially "untranslate" the UOPs back to SwsOps that result back in the
intended uop after the translation, but overall it's not too bad and still
much better than the status quo of hand-rolling the list of test cases.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-06-09 18:27:20 +02:00
Niklas Haas 636b9eda75 swscale/ops_tmpl_float: allow arbitrary values for 1x1 dither
Removes the 1x1 dither fast path, mirroring the previous commit.

This is not really needed nor useful but it will make the transition to
the uops architecture slightly easier, as 1x1 dither gets reinterpreted
as SWS_UOP_ADD there.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-06-09 18:27:20 +02:00
Niklas Haas ca8774b9d6 swscale/x86: remove broken and unnecessary 1x1 dither fast path
This is broken because it fails to check dither.y_offset[] to determine if
dithering for a channel is requested or not.

This is unnecessary because the generic dither code already jumps over unused
components, which is cheap enough not to worry about this special case for
now.

This code will, in any case, soon be replaced by a uops_macros.h-derived
approach. This commit is only needed as a stopgap to make checkasm continue
working after the sws_uops refactor.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-06-09 18:27:20 +02:00
Niklas Haas 19652a83a2 swscale/x86/ops_include: use %assign instead of %xdefine
For numeric 1/0 constants. As an aside, fix the broken comment.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-06-09 18:27:20 +02:00
Niklas Haas b328e152a4 swscale/x86: move entry points to ops_common.asm
As well as the packed shuffle solver. These don't really interact with
the rest of the code in ops_int.asm, which is, by name at least, intended for
integer op kernels.

More importantly, these functions will be shared with the uops rewrite.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-06-09 18:27:20 +02:00
Niklas Haas c5c9c6d996 swscale/x86: rename ops_common.asm to ops_include.asm
Signed-off-by: Niklas Haas <git@haasn.dev>
2026-06-09 18:27:20 +02:00
Niklas Haas 8118e964bb swscale/uops: auto-generate reference C backend from uops_macros.h
Instead of choosing by hand which kernels to implement, this rewrite focuses
on leveraging the power of uops_macros.h to auto-generate all needed kernels.
This not only simplifies maintenance, but also improves performance.

I have decided to develop the replacement backend as a separate file, under
a separate prefix, for the explicit purpose of being able to verify the
correctness of the rewrite using the current backend as a checkasm reference.

The code for the kernels themselves has been largely copied from the old
C backend, modified slightly to conform to the uop template style. This does
result in some code duplication, but a following commit will clean it up.
I nonetheless want to preserve this commit for bisection purposes, to ensure
we have one commit that contains both backends side-by-side.

Overall speedup=1.182x faster, min=0.197x max=3.450x

The big slowdowns are flukes caused by tiny deviations in the runtime of
a noop memcpy conversion.

As a nice side benefit, the compiled binary is now also ~10% smaller, and
the code ~50% smaller.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-06-09 18:27:20 +02:00
Niklas Haas 1e268fbedf swscale/ops_chain: add uop-based helpers to assemble SwsOpChain
This will eventually replace the existing op_match() and
ff_sws_op_compile_tables(), but I've decided to introduce it separately first
so that I can incrementally update the backends to use the new API, at the
cost of some temporary code duplication.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-06-09 18:27:20 +02:00
Niklas Haas adaf142647 swscale/uops: generate uop helper macros
This follows the same approach as is used currently by ops_entries_aarch64,
except I decided to have the generation logic live directly in uops.c
to allow re-using internal helpers and move it closer to the other helpers
that depend on the exact set of uops and their fields.

Unlike libswscale/tests/sws_ops.c, we make an effort to actually test all
relevant flag combinations, since these can affect the generated op lists.

I will use these macros to auto-generate both the C template-based kernels,
as well as the entire x86 backend, in the near future, hence their excessive
flexibility.

Re-use the libswscale/tests/sws_ops.c that we already compile. We could put it
in its own file but this is just as convenient, and it's easily moved anyways.
Having it be a FATE test ensures that it is always up-to-date.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-06-09 18:27:20 +02:00
Niklas Haas 8ad7cc6ccd swscale/tests/sws_ops: also print/test micro-op list
Tests for changes or regressions in the generated micro-ops. This will be
instrumental in my development of the micro-ops optimizer, and my plans to
phase out some of the macro-op optimization passes in favor of doing those
optimizations on the uop level instead.

 rgb24 16x16 -> rgb24 16x32:
   [ u8 +++X] SWS_OP_READ         : 3 elem(s) packed >> 0
     min: {0 0 0 _}, max: {255 255 255 _}
   [ u8 ...X] SWS_OP_FILTER_V     : 16 -> 32 bilinear (2 taps)
     min: {0 0 0 _}, max: {255 255 255 _}
   [f32 ...X] SWS_OP_DITHER       : 16x16 matrix + {0 3 2 -1}
     min: {1/512 1/512 1/512 _}, max: {255.998047 255.998047 255.998047 _}
   [f32 ...X] SWS_OP_MIN          : x <= {255 255 255 _}
     min: {1/512 1/512 1/512 _}, max: {255 255 255 _}
   [f32 +++X] SWS_OP_CONVERT      : f32 -> u8
     min: {0 0 0 _}, max: {255 255 255 _}
   [ u8 XXXX] SWS_OP_WRITE        : 3 elem(s) packed >> 0
     (X = unused, z = byteswapped, + = exact, 0 = zero)
  Retrying with split passes:
   [ u8 +++X] SWS_OP_READ         : 3 elem(s) packed >> 0
     min: {0 0 0 _}, max: {255 255 255 _}
   [ u8 XXXX] SWS_OP_WRITE        : 3 elem(s) planar >> 0
     (X = unused, z = byteswapped, + = exact, 0 = zero)
+ translated micro-ops:
+    u8_read_packed_xyz
+    u8_write_planar_xyz
  Sub-pass #1:
   [ u8 ...X] SWS_OP_READ         : 3 elem(s) planar >> 0 + 2 tap bilinear filter (V)
     min: {0 0 0 _}, max: {255 255 255 _}
   [f32 ...X] SWS_OP_DITHER       : 16x16 matrix + {0 3 2 -1}
     min: {1/512 1/512 1/512 _}, max: {255.998047 255.998047 255.998047 _}
   [f32 ...X] SWS_OP_MIN          : x <= {255 255 255 _}
     min: {1/512 1/512 1/512 _}, max: {255 255 255 _}
   [f32 +++X] SWS_OP_CONVERT      : f32 -> u8
     min: {0 0 0 _}, max: {255 255 255 _}
   [ u8 XXXX] SWS_OP_WRITE        : 3 elem(s) packed >> 0
     (X = unused, z = byteswapped, + = exact, 0 = zero)
+ translated micro-ops:
+    u8_read_planar_fv_xyz
+    f32_dither_xyz_0_3_2_16x16
+    f32_min_xyz
+    f32_to_u8_xyz
+    u8_write_packed_xyz
...

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-06-09 01:11:01 +02:00
Niklas Haas 3a7331d311 swscale/ops: remove unused function ff_sws_enum_ops()
Users can trivially recreate this logic anyways.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-06-09 01:10:57 +02:00
Niklas Haas 6b75166758 swscale/tests/sws_ops: minor cleanup / consistency
Clean up after the previous revert.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-06-09 01:10:54 +02:00
Niklas Haas dcfe3d3b90 Revert "swscale/tests/sws_ops: add option for summarizing all operation patterns"
This reverts commit f76aa4e408.

This is no longer needed once we switch to uops_macros.h, which will do the
same thing except better.
2026-06-09 01:10:49 +02:00
Niklas Haas aaf6a52fe6 swscale/uops: add uop translation logic
This will replace the fuzzy matching logic in op_match() that is used by the
C and x86 implementations, as well as the translation to AARCH64_OP_* that is
used by the NEON asmgen backend.

Down the line, this function will also take a set of flags to enable
backend-specific kernels like FMA variants, but I also decided to keep it
initially simple to ease the transition.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-06-09 01:10:39 +02:00
Niklas Haas dc88bcdf8c swscale/uops: add uop definitions
Taken from AARCH64_OP_*, but generalized/simplified a bit and updated to add
missing op types, especially for special cases that already have dedicated
implementations on x86.

This initial definition is kept intentionally simple and close to SwsOp, to
make it easier to port the existing ops backends to the new infrastructure.
However, in the future, this will be refactored dramatically - distinctions
like convert vs expand will cease to exist on the SwsOp level, and will
instead be introduced by separate optimization passes on the uops level.

SWS_UOP_LINEAR in particular will most likely be broken up into multiple
uops. I also took this opportunity to redefine the mask in a more useful way.

I decided to split up SWS_OP_CONVERT as well, because it was making x86
codegen unnecessarily difficult due to the strong interaction between exact
pixel sizes.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-06-09 01:09:34 +02:00
Niklas Haas ae6f3ce02c swscale/uops: split off from ops.h
Forming what will be the start of a larger helper file for backend-internal
translation of higher-level ops into lower level kernels. This header file
needs to be includable from independent source files, as it will be used to
provide definitions for build-time code generation (e.g. ops_asmgen.c), so
it must be self-contained.

Pulling in all of ops.h from uops.h would be too large a dependency, since
ops.h pulls in graph.h, refstruct, bprint, etc. It's easier to start from a
fresh file that is documented as being usable at compile time.

For now, just declare the common types that will be needed by the uops layer.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-06-08 18:29:02 +02:00
Niklas Haas 48a42b5f21 configure: add -P to $CC_E flag
This suppresses the addition of #line directives in the preprocessed output,
which is what we want when we're invoking the hostcc just to preprocess some
files. (Currently, this variable is only used for configure-internal checks
anyways, but I want to use it to preprocess a NASM file)

On MSVC/Intel, /EP is the equivalent syntax, though we use -EP instead for
consistency.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-06-08 18:24:45 +02:00
haoyuLiu and michaelni 6028720d70 avfilter/zmq: initialize send_buf before shared cleanup on parse failure
Found-by: VulnForge Security Research Team
Reported-by: Cloud-LHY <haoyuliu@clouditera.com>
2026-06-08 02:20:53 +00:00
Jun Zhao and Jun Zhao c75701a62f lavf/mov: read multi-valued metadata tags
When a metadata tag (e.g. ©ART) contains multiple values, either as
multiple 'data' child atoms within one tag or as multiple sibling tag
atoms with the same key, only the first value was read.

Fix by joining multiple values with semicolons using AV_DICT_APPEND,
consistent with Ogg Vorbis Comment handling in oggparsevorbis.c, and
reusing the existing 'goto retry' loop that covr already uses.
Also add the missing atom.size -= str_size to correctly track remaining
bytes in the tag atom, matching the covr path.
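
The joining behaviour described above can be sketched in isolation. This is a minimal illustration on plain C strings, not the mov.c code: `append_value()` is a hypothetical helper, and the real change goes through the metadata dictionary with AV_DICT_APPEND rather than a fixed buffer.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of libavformat): append `value` to the
 * NUL-terminated string in `dst`, inserting ';' between entries.
 * Returns 0 on success, -1 if the result would not fit in dst_size bytes.
 */
static int append_value(char *dst, size_t dst_size, const char *value)
{
    size_t used = strlen(dst);
    size_t len  = strlen(value);
    size_t need = len + (used ? 1 : 0);   /* value plus optional separator */

    /* Room is needed for `need` bytes and the terminating NUL. */
    if (used >= dst_size || need >= dst_size - used)
        return -1;

    if (used)
        dst[used++] = ';';
    memcpy(dst + used, value, len + 1);   /* copies the NUL as well */
    return 0;
}
```

Calling it twice on an empty buffer with "Alice" and "Bob" leaves "Alice;Bob" in the buffer, which is the single joined value that a later remux would write back.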

Limitation: on remux the joined string is written back as a single
value, same lossy behavior as Ogg Vorbis. Lossless round-trip would
require AV_DICT_MULTIKEY support throughout the metadata pipeline.

Fix #22367
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-06-08 02:18:32 +00:00
Michael Niedermayer 5622d515e8 tools/target_dec_fuzzer: reduce 4XM max pixels to avoid timeout
Fixes: Timeout
Fixes: 511356573/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_FOURXM_fuzzer-5010010110492672
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2026-06-08 01:21:12 +00:00
Jun Zhao and Jun Zhao cfa3ceac7a lavc/hevc: add aarch64 NEON for angular modes 10 and 26
Add NEON-optimized implementations for HEVC angular intra prediction
modes 10 (pure horizontal) and 26 (pure vertical) at 8-bit depth.

Mode 10 (Horizontal):
- Broadcasts left[y] to fill each row using ld2r/ld4r for efficiency
- Applies edge smoothing for luma blocks smaller than 32x32

Mode 26 (Vertical):
- Copies top reference row to all output rows
- Applies edge smoothing for luma blocks smaller than 32x32

Edge smoothing uses uhsub+usqadd to compute the filtered result
directly in 8-bit, avoiding widening to 16-bit intermediates.
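
Leaving the edge smoothing aside, the two modes reduce to a row broadcast and a row copy. The scalar sketch below only illustrates that behaviour at 8-bit depth; the function names and signatures are hypothetical and do not match the pred_angular wrappers or the NEON code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mode 10 (pure horizontal): row y is filled with the value left[y]. */
static void pred_horizontal_sketch(uint8_t *dst, ptrdiff_t stride,
                                   const uint8_t *left, int size)
{
    for (int y = 0; y < size; y++)
        memset(dst + y * stride, left[y], (size_t)size);
}

/* Mode 26 (pure vertical): every row is a copy of the top reference row. */
static void pred_vertical_sketch(uint8_t *dst, ptrdiff_t stride,
                                 const uint8_t *top, int size)
{
    for (int y = 0; y < size; y++)
        memcpy(dst + y * stride, top, (size_t)size);
}
```

The luma edge smoothing for blocks smaller than 32x32 is deliberately omitted here; it would adjust the first row or column after the fill.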

The C pred_angular wrappers are made non-static with ff_ prefix to
allow the NEON dispatch to fall back to C for modes not yet optimized.
This will be reverted once all angular modes are implemented.

Note: since pred_angular[] is a per-size function pointer (not
per-mode), checkasm benchmarks will show '_neon' for all 33 modes
even though only modes 10/26 are truly accelerated; unoptimized
modes show ~1.0x speedup as they pass through the NEON wrapper to
the C fallback with negligible overhead.

Speedup over C on Apple M4 (checkasm --bench, 15-run average):

  Mode 10 (Horizontal):
    4x4: 4.66x    8x8: 5.80x    16x16: 16.86x    32x32: 24.89x

  Mode 26 (Vertical):
    4x4: 1.16x    8x8: 1.83x    16x16: 2.45x    32x32: 4.50x

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-06-07 23:29:33 +00:00
Jun Zhao and Jun Zhao 3d71b9ec93 tests/checkasm: hevc_pred: use pixel helpers for diagnostic output
Replace plain memcmp+fail() with checkasm_check_pixel_padded() for
DC, planar, and angular prediction tests. Use PIXEL_RECT for output
buffers instead of flat arrays.

This enables:
- Detailed per-pixel difference output when run with 'checkasm -v'
- Detection of out-of-bounds writes beyond the NxN block area
- Padding violation reporting (writes past block boundary)

Previously, a test failure would only report "FAILED" with no
information about which pixels were wrong, making assembly debugging
difficult. Follows the pattern established in 4d4b301e4a (checkasm:
hevc_pel: Use helpers for checking for writes out of bounds).

Suggested-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-06-07 23:29:33 +00:00
Jun Zhao and James Almer 3ec0f14f7d avcodec/h264_ps: set default SAR, remove stale workaround
Set sps->vui.sar to {0,1} (unspecified) before the VUI parsing
block, matching the HEVC pattern at hevc_ps.c.  The old
zero-init-to-1 workaround is now unreachable and is removed.

Suggested-by: James Almer <jamrial@gmail.com>
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-06-07 18:55:16 +00:00
Jun Zhao and James Almer e598463b3d avcodec/h2645_vui: interpret a degenerate SAR as unspecified
Per ITU-T H.264 (ISO/IEC 14496-10) Annex E.2.1 and ITU-T H.265
(ISO/IEC 23008-2) Annex E.3.1, when sar_width or sar_height is zero
the sample aspect ratio shall be considered unspecified. Internally
ffmpeg represents an unspecified SAR as 0/1, while fractions with a
zero denominator are not handled properly (den=0 is silently changed
to den=1 in h264_ps.c, turning an invalid 20480/0 into a "valid" but
impossibly extreme 20480/1); so we bridge the gap by replacing x/0
with 0/1 at the VUI parsing layer.

An av_log warning is added so an invalid SAR in the bitstream is
diagnosed rather than silently overwritten.
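
As a rough sketch of the rule described above, assuming a stand-in `Rational` type in place of AVRational and `fprintf()` in place of av_log (both chosen only so the example compiles on its own; `sanitize_sar()` is a hypothetical name):

```c
#include <stdio.h>

/* Stand-in for AVRational, used only to keep the sketch self-contained. */
typedef struct Rational {
    int num;
    int den;
} Rational;

/*
 * Hypothetical helper: a SAR whose width or height is zero is treated as
 * unspecified and returned as 0/1, with a warning so the invalid bitstream
 * value is reported rather than silently replaced.
 */
static Rational sanitize_sar(int sar_width, int sar_height)
{
    Rational sar = { sar_width, sar_height };

    if (sar_width == 0 || sar_height == 0) {
        fprintf(stderr, "Invalid SAR %d/%d, treating as unspecified\n",
                sar_width, sar_height);
        sar.num = 0;
        sar.den = 1;
    }
    return sar;
}
```

With this rule, the 20480/0 value mentioned above comes out as 0/1, while a well-formed ratio such as 16/11 passes through unchanged.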

This fixes a problem with some video files provided by the game
OddBallers when executed with Wine/Proton, which report SAR 20480/0.

Based on patch by Giovanni Mascellani <gmascellani@codeweavers.com>.
Fixes: ticket #23321

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
2026-06-07 18:55:16 +00:00
Andreas Rheinhardt bb49197ede avcodec/liboapvenc: Remove dimension change check
If this were to be checked, it should be checked generically,
not in every single encoder.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-06-07 17:53:44 +02:00
Andreas Rheinhardt 0faa43ae6c avcodec/liboapvenc: Use av_image_copy2() to avoid cast
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-06-07 17:53:44 +02:00
Andreas Rheinhardt bf47563bd8 avcodec/liboapvenc: Remove always-false checks
Already checked in encode_preinit_video().

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-06-07 17:53:44 +02:00
Andreas Rheinhardt 80ea2d1487 avcodec/liboapvenc: Return directly when possible
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-06-07 17:53:44 +02:00
Andreas Rheinhardt 67855a7234 avcodec/liboapvenc: Use av_unreachable for unreachable default cases
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-06-07 17:53:44 +02:00
Andreas Rheinhardt 9791c4d183 avcodec/liboapvenc: Don't set AVCodec.pix_fmts directly
Instead use CODEC_PIXFMTS. Avoids deprecation warnings
from Clang and simplifies the removal of AVCodec.pix_fmts.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-06-07 17:53:44 +02:00
James Almer d1faab734d avcodec/dcadec: map Lw/Rw to FLC/FRC
Some 7.1 DTS files seem to signal Lw/Rw channels that the decoder has been
mapping to SL/SR, despite the macro for the mask being called 7_1_WIDE.
This resulted in said samples reporting the same native layout as actual 7.1
samples with Lsr/Rsr/Lss/Rss (mapped to BL/BR/SL/SR).

If we were to be strict, Lw/Rw would map to WR/WL, but that would result in an
unusual native layout. Instead, let's map them to FLC/FRC, which will result in
the more common 7.1(wide) native layout.

Signed-off-by: James Almer <jamrial@gmail.com>
2026-06-07 10:24:42 -03:00
Niklas Haas and Niklas Haas 3137d337fe tests/checkasm/sw_ops: use new checkasm_set_func_variant()
The current approach of re-testing the C reference for every backend
separately leads to both confusing output (e.g. having an extra redundant
`memcpy_c` line for every op, even those not implemented by the memcpy
backend), as well as a lot of unnecessary wasted time re-testing and
re-benching the same C variant for every backend.

This new API function lets us test the C function only a single time, while
simultaneously having all of the other backends implicitly compare themselves
against the C reference.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-06-07 09:24:23 +00:00
Michael Niedermayer 04e2341056 avcodec/adpcm: fix signed integer overflow in get_nb_samples()
Fixes: signed integer overflow: 314572800 * 8 cannot be represented in type 'int'

Tighten the guard to INT_MAX/14, which covers the largest expansion
factor used in the function currently.

Found-by: Jiale Yao <19888972804@163.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2026-06-07 02:57:25 +00:00
Michael Niedermayer 0a8d961388 avformat/matroskadec: avoid signed overflow in DASH cue time differences
Fixes: 493466409/clusterfuzz-testcase-minimized-ffmpeg_dem_WEBM_DASH_MANIFEST_fuzzer-6150181551931392
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2026-06-07 02:56:44 +00:00
Ramiro Polla f8af47b640 build: remove checkasm_header_config_generated.h on distclean
Forgotten in 4569ab7eaa.
2026-06-06 23:58:13 +02:00
David Korczynski and michaelni 1e9984772b avcodec/fastaudio: reject subframes count whose * 256 product overflows 32-bit
fastaudio_decode() computes
    subframes = pkt->size / (40 * channels);
    frame->nb_samples = subframes * 256;
both as 32-bit signed multiplications. When pkt->size is large enough
to make subframes >= 2^24, the second multiplication overflows the
signed int range and frame->nb_samples wraps to a small value.
ff_get_buffer() then sizes the audio plane for that wrapped sample
count, while the decoder loop at line 152 still iterates the full
(unwrapped) subframes count, performing a 1024-byte memcpy per
subframe per channel. The 27th iteration (or first iteration with
nb_samples=0) writes one byte past the per-plane allocation,
yielding the ASan heap-buffer-overflow WRITE at
libavcodec/fastaudio.c:171 reported as ANT-2026-03891.

Reject the subframes value whose *256 product would overflow before
performing the multiplication. The bound INT_MAX / 256 (= 8388607)
keeps the existing two's-complement semantics of every reachable
input and rejects only the configurations that would have wrapped.
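
The check-before-multiply pattern can be sketched as follows. This is a standalone illustration, not the decoder code: `fastaudio_nb_samples()` is a hypothetical helper, and returning -1 stands in for whatever error code the decoder actually uses.

```c
#include <limits.h>

/*
 * Hypothetical helper: compute the sample count for a packet, or return -1
 * if subframes * 256 would not be representable in an int.
 */
static int fastaudio_nb_samples(int pkt_size, int channels)
{
    /* Keep 40 * channels itself from overflowing (assumption of this sketch). */
    if (pkt_size < 0 || channels <= 0 || channels > INT_MAX / 40)
        return -1;

    int subframes = pkt_size / (40 * channels);

    /* Compare against the bound first, so the product below cannot wrap. */
    if (subframes > INT_MAX / 256)
        return -1;

    return subframes * 256;
}
```

For the mono reproducer size of 671_088_680 bytes, subframes is 16777217, which exceeds INT_MAX / 256 on a 32-bit int, so the sketch rejects it instead of letting the sample count wrap.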

Reproducer: a crafted AVI declaring one mono audio chunk of
671_088_680 bytes (sparse) with the decoder forced via
'ffmpeg -c:a fastaudio -i evil.avi'.

Found-by: Anthropic agents; validated and reported by Ada Logics.

Signed-off-by: David Korczynski <david@adalogics.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2026-06-06 21:32:08 +00:00
Michael Niedermayer 7c7ca349bc avcodec/vc2enc_dwt: avoid signed overflow in the 5/3 and Haar DWT
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2026-06-06 19:07:39 +00:00
Michael Niedermayer 5f91556215 avcodec/vc2enc_dwt: avoid signed overflow in the 9/7 DWT lifting
Fixes: 490488944/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_VC2_fuzzer-5310290362433536
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2026-06-06 19:07:39 +00:00
DROOdotFOO and Ramiro Polla d9e2239f3c swscale/aarch64/yuv2rgb_neon: 2 lines at a time, yuva420p
Alpha is full resolution, so each row loads its own 16 alpha bytes
via process_row's \rsrcA arg.

Test Name                                A55-gcc            M1-clang             A76-gcc
----------------------------------------------------------------------------------------
yuva420p_to_argb_neon            22607.6 (1.16x)        39.2 (1.24x)     13631.6 (1.12x)
yuva420p_to_rgba_neon            22608.2 (1.16x)        38.3 (1.21x)     13912.8 (1.12x)
yuva420p_to_abgr_neon            23074.6 (1.16x)        38.8 (1.22x)     14492.1 (1.08x)
yuva420p_to_bgra_neon            23079.7 (1.16x)        39.9 (1.19x)     14472.6 (1.08x)

Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: DROOdotFOO <drew@axol.io>
2026-06-06 19:38:40 +02:00
DROOdotFOO and Ramiro Polla 4b6f7c2a05 swscale/aarch64/yuv2rgb_neon: 2 lines at a time, rgb16
pack_rgb16_2l uses v26-v29 as scratch (luma temps, dead by then)
instead of v20-v23, so v20-v25 chroma survives the pack step. A
.error trips if yuva420p hits rgb16 (v28/v29 would clobber alpha);
the dispatcher routes that combination through yuv420p anyway.

Test Name                                A55-gcc            M1-clang             A76-gcc
----------------------------------------------------------------------------------------
nv12_to_rgb565le_neon            28531.9 (1.12x)        46.8 (1.28x)     19252.9 (1.09x)
nv12_to_bgr565le_neon            29018.1 (1.12x)        48.1 (1.17x)     19252.0 (1.09x)
nv12_to_rgb555le_neon            28531.3 (1.12x)        47.2 (1.24x)     19253.6 (1.09x)
nv12_to_bgr555le_neon            29012.1 (1.12x)        45.8 (1.22x)     19252.5 (1.09x)
nv21_to_rgb565le_neon            28532.3 (1.12x)        48.4 (1.15x)     19430.0 (1.09x)
nv21_to_bgr565le_neon            29013.8 (1.12x)        47.2 (1.21x)     19428.8 (1.09x)
nv21_to_rgb555le_neon            28533.3 (1.12x)        49.7 (1.16x)     19430.5 (1.09x)
nv21_to_bgr555le_neon            29011.4 (1.12x)        48.5 (1.18x)     19428.7 (1.09x)
yuv420p_to_rgb565le_neon         28351.9 (1.11x)        46.4 (1.18x)     19635.3 (1.08x)
yuv420p_to_bgr565le_neon         28831.8 (1.11x)        50.8 (1.09x)     19634.5 (1.08x)
yuv420p_to_rgb555le_neon         28351.3 (1.11x)        46.3 (1.23x)     19634.2 (1.08x)
yuv420p_to_bgr555le_neon         28829.1 (1.11x)        46.5 (1.21x)     19634.3 (1.08x)
yuva420p_to_rgb565le_neon        28349.5 (1.11x)        51.2 (1.06x)     19634.7 (1.08x)
yuva420p_to_bgr565le_neon        28833.1 (1.11x)        48.6 (1.17x)     19633.9 (1.08x)
yuva420p_to_rgb555le_neon        28351.6 (1.11x)        47.8 (1.16x)     19635.2 (1.08x)
yuva420p_to_bgr555le_neon        28831.5 (1.11x)        46.4 (1.14x)     19634.8 (1.08x)

Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: DROOdotFOO <drew@axol.io>
2026-06-06 19:38:40 +02:00
DROOdotFOO and Ramiro Polla dad212060c swscale/aarch64/yuv2rgb_neon: 2 lines at a time, gbrp
Six dst pointers exhaust the caller-saved registers; spill x19/x20.
yuva420p_to_gbrp_neon is routed through the yuv420p path by the
dispatcher (gbrp has no alpha channel).

Test Name                                A55-gcc            M1-clang             A76-gcc
----------------------------------------------------------------------------------------
nv12_to_gbrp_neon                20017.8 (1.15x)        32.8 (1.34x)     10658.0 (1.27x)
nv21_to_gbrp_neon                20020.9 (1.15x)        32.5 (1.36x)     10691.1 (1.26x)
yuv420p_to_gbrp_neon             19856.3 (1.14x)        31.4 (1.34x)     10348.0 (1.37x)
yuva420p_to_gbrp_neon            19859.8 (1.14x)        30.9 (1.27x)     10350.9 (1.37x)

Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: DROOdotFOO <drew@axol.io>
2026-06-06 19:38:40 +02:00
DROOdotFOO and Ramiro Polla 4bfe7efd0c swscale/aarch64/yuv2rgb_neon: 2 lines at a time, packed RGB
Vertically-subsampled inputs (nv12, nv21, yuv420p) share a chroma
row across two output rows; compute the chroma -> RGB offsets once
and apply to both luma rows. Covers argb/rgba/abgr/bgra/rgb24/bgr24.
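
The idea of sharing one set of chroma offsets between two rows can be modelled in scalar C. Everything in this sketch is an assumption made for illustration: the function name is hypothetical, and the coefficients are approximate full-range BT.601 values scaled by 256, whereas the NEON code takes its coefficients from the conversion context.

```c
#include <stdint.h>

/* Clamp an intermediate value to the 8-bit range. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/*
 * Hypothetical scalar model: one U/V pair is shared by the luma sample y0
 * (upper row) and y1 (lower row) of a vertically subsampled input.
 */
static void yuv_pair_to_rgb24(uint8_t y0, uint8_t y1, uint8_t u, uint8_t v,
                              uint8_t rgb0[3], uint8_t rgb1[3])
{
    int cu = u - 128;
    int cv = v - 128;

    /* Chroma -> RGB offsets, computed once per chroma sample. */
    int r_off = (359 * cv) / 256;
    int g_off = (-88 * cu - 183 * cv) / 256;
    int b_off = (454 * cu) / 256;

    /* The same offsets are applied to both luma rows. */
    rgb0[0] = clip_u8(y0 + r_off);
    rgb0[1] = clip_u8(y0 + g_off);
    rgb0[2] = clip_u8(y0 + b_off);
    rgb1[0] = clip_u8(y1 + r_off);
    rgb1[1] = clip_u8(y1 + g_off);
    rgb1[2] = clip_u8(y1 + b_off);
}
```

The saving comes from the three offset computations running once per pair of output rows instead of once per row.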

Test Name                                A55-gcc            M1-clang             A76-gcc
----------------------------------------------------------------------------------------
nv12_to_argb_neon                21647.2 (1.16x)        40.1 (1.24x)     13813.3 (1.16x)
nv12_to_rgba_neon                21653.7 (1.16x)        40.8 (1.32x)     14105.0 (1.13x)
nv12_to_abgr_neon                22122.2 (1.15x)        40.3 (1.27x)     14100.2 (1.16x)
nv12_to_bgra_neon                22121.6 (1.15x)        39.6 (1.24x)     14125.9 (1.16x)
nv12_to_rgb24_neon               19842.0 (1.18x)        33.4 (1.28x)     12868.9 (1.17x)
nv12_to_bgr24_neon               20318.0 (1.18x)        34.6 (1.23x)     12868.8 (1.17x)
nv21_to_argb_neon                21648.5 (1.16x)        41.0 (1.29x)     13978.5 (1.14x)
nv21_to_rgba_neon                21653.0 (1.16x)        41.3 (1.21x)     14173.5 (1.11x)
nv21_to_abgr_neon                22120.6 (1.15x)        41.1 (1.20x)     14505.4 (1.14x)
nv21_to_bgra_neon                22120.8 (1.15x)        41.0 (1.22x)     14520.1 (1.14x)
nv21_to_rgb24_neon               19830.5 (1.19x)        35.1 (1.28x)     12832.4 (1.17x)
nv21_to_bgr24_neon               20317.1 (1.18x)        34.6 (1.27x)     12833.1 (1.17x)
yuv420p_to_argb_neon             21450.2 (1.15x)        39.2 (1.19x)     14118.3 (1.12x)
yuv420p_to_rgba_neon             21447.2 (1.15x)        38.8 (1.24x)     14326.0 (1.14x)
yuv420p_to_abgr_neon             21927.0 (1.15x)        38.9 (1.25x)     14826.6 (1.13x)
yuv420p_to_bgra_neon             21930.8 (1.15x)        41.4 (1.18x)     14822.9 (1.13x)
yuv420p_to_rgb24_neon            19365.5 (1.17x)        33.5 (1.25x)     13291.8 (1.16x)
yuv420p_to_bgr24_neon            19848.8 (1.16x)        34.1 (1.35x)     13292.8 (1.16x)

Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: DROOdotFOO <drew@axol.io>
2026-06-06 19:38:40 +02:00
DROOdotFOO and Ramiro Polla 11b1721b11 swscale/aarch64/yuv2rgb_neon: reorder params, unify signature
Pass src[]/srcStride[] as arrays (x5/x6), move y_offset/y_coeff into
register args (w2/w3). Only int-after-pointer stack args remain, so
Apple and AAPCS64 lay them out identically; every __APPLE__ is gone.
nv12/nv21/yuv420p/yuv422p/yuva420p share one signature.

Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: DROOdotFOO <drew@axol.io>
2026-06-06 19:38:40 +02:00
DROOdotFOO and Ramiro Polla 8dbc729950 swscale/aarch64/yuv2rgb_neon: name registers
the loop body. .text byte-identical.

Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: DROOdotFOO <drew@axol.io>
2026-06-06 19:38:40 +02:00
DROOdotFOO and Ramiro Polla e0fa641240 swscale/aarch64/yuv2rgb_neon: chroma-preserve compute_rgb
Macro writes per-luma sums into the destination registers, leaving
v20-v25 (chroma -> RGB offsets) intact for the 2-line callers. Takes
bare register names. compute_rgba and compute_rgba_alpha follow suit.

Single-row callers reload v20-v25 each iteration via
chroma_to_rgb_offsets, so the change is a no-op for them: Apple M1
width=1920 mean -0.54% across 55 paths, within bench noise.

Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: DROOdotFOO <drew@axol.io>
2026-06-06 19:38:40 +02:00
Michael Niedermayer b99c6fc8c3 avformat/dashdec: Fail with any inner stream count being 0
Fixes: ada-3-poc.mpd

Found-by: Claude and Ada Logics. This issue was found by Anthropic from using agents to study security of open source projects, and I am from Ada Logics helping validate the found issues and report to maintainers.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2026-06-06 17:29:17 +00:00
Ramiro Polla 025d6330a5 swscale/aarch64/ops_asmgen: fold impl pointer increment into loading of continuation address
This commit reduces every kernel by one instruction, for example:
 function ff_sws_clear_8_u16_0001_neon, export=1, jumpable=1
-        ldr             x0, [x1]                        // SwsFuncPtr cont = impl->cont;
         ldr             q16, [x1, #16]                  // v128 clear_vec = impl->priv.v128;
+        ldr             x0, [x1], #32                   // SwsFuncPtr cont = (impl++)->cont;
         dup             v0.8h, v16.h[0]                 // vl[0] = broadcast(clear_vec[0])
-        add             x1, x1, #32                     // impl += 1;
         br              x0                              // jump to cont
 endfunc

A55: Overall speedup=1.066x faster, min=0.881x max=1.288x
A76: Overall speedup=1.012x faster, min=0.570x max=1.546x

The large min/max differences are due to pathological branch miss cases
that happen either before or after this commit.

Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
2026-06-06 11:54:14 +00:00