x/ffmpeg

mirror of https://git.ffmpeg.org/ffmpeg.git synced 2026-06-11 08:13:06 +00:00

Author	SHA1	Message	Date
Niklas Haas	02a168a576	swscale/uops: keep track of input range during op translation Needed for the FMA decision logic. Signed-off-by: Niklas Haas <git@haasn.dev>	2026-06-09 18:27:20 +02:00
Niklas Haas	3f9219d605	swscale/uops: add SwsUOpFlags to ff_sws_ops_translate() These will be used to e.g. enable extra uops during translation. Signed-off-by: Niklas Haas <git@haasn.dev>	2026-06-09 18:27:20 +02:00
Niklas Haas	b7a80a9f0d	swscale/ops_backend: delete ops-based C backend And make uops_backend.c the new reference. Signed-off-by: Niklas Haas <git@haasn.dev>	2026-06-09 18:27:20 +02:00
Niklas Haas	100ce4ac41	tests/checkasm/sw_ops: rewrite using uops_macros.h This ensures 100% coverage of all uop primitives by generating the set of tests exactly from the list of seen primitives, using the uops macros. There are some annoying quirks still because of the fact that we have to essentially "untranslate" the UOPs back to SwsOps that result back in the intended uop after the translation, but overall it's not too bad and still much better than the status quo of hand-rolling the list of test cases. Signed-off-by: Niklas Haas <git@haasn.dev>	2026-06-09 18:27:20 +02:00
Niklas Haas	636b9eda75	swscale/ops_tmpl_float: allow arbitrary values for 1x1 dither Removes the 1x1 dither fast path, mirroring the previous commit. This is not really needed nor useful but it will make the transition to the uops architecture slightly easier, as 1x1 dither gets reinterpreted as SWS_UOP_ADD there. Signed-off-by: Niklas Haas <git@haasn.dev>	2026-06-09 18:27:20 +02:00
Niklas Haas	ca8774b9d6	swscale/x86: remove broken and unnecessary 1x1 dither fast path This is broken because it fails to check dither.y_offset[] to determine if dithering for a channel is requested or not. This is unnecessary because the generic dither code already jumps over unused components, which is cheap enough not to worry about this special case for now. This code will, in any case, soon be replaced by a uops_macros.h-derived approach. This commit is only needed as a stopgap to make checkasm continue working after the sws_uops refactor. Signed-off-by: Niklas Haas <git@haasn.dev>	2026-06-09 18:27:20 +02:00
Niklas Haas	19652a83a2	swscale/x86/ops_include: use %assign instead of %xdefine For numeric 1/0 constants. As an aside, fix the broken comment. Signed-off-by: Niklas Haas <git@haasn.dev>	2026-06-09 18:27:20 +02:00
Niklas Haas	b328e152a4	swscale/x86: move entry points to ops_common.asm As well as the packed shuffle solver. These don't really interact with the rest of the code in ops_int.asm, which is, by name at least, intended for integer op kernels. More importantly, these functions will be shared with the uops rewrite. Signed-off-by: Niklas Haas <git@haasn.dev>	2026-06-09 18:27:20 +02:00
Niklas Haas	c5c9c6d996	swscale/x86: rename ops_common.asm to ops_include.asm Signed-off-by: Niklas Haas <git@haasn.dev>	2026-06-09 18:27:20 +02:00
Niklas Haas	8118e964bb	swscale/uops: auto-generate reference C backend from uops_macros.h Instead of choosing by hand which kernels to implement, this rewrite focuses on leveraging the power of uops_macros.h to auto-generate all needed kernels. This not only simplifies maintenance, but also improves performance. I have decided to develop the replacement backend as a separate file, under a separate prefix, for the explicit purpose of being able to verify the correctness of the rewrite using the current backend as a checkasm reference. The code for the kernels themselves has been largely copied from the old C backend, modified slightly to conform to the uop template style. This does result in some code duplication, but a following commit will clean it up. I nonetheless want to preserve this commit for bisection purposes, to ensure we have one commit that contains both backends side-by-side. Overall speedup=1.182x faster, min=0.197x max=3.450x The big slowdowns are flukes caused by tiny deviations in the runtime of a noop memcpy conversion. As a nice side benefit, the compiled binary is now also ~10% smaller, and the code ~50% smaller. Signed-off-by: Niklas Haas <git@haasn.dev>	2026-06-09 18:27:20 +02:00
Niklas Haas	1e268fbedf	swscale/ops_chain: add uop-based helpers to assemble SwsOpChain This will eventually replace the existing op_match() and ff_sws_op_compile_tables(), but I've decided to introduce it separately first so that I can incrementally update the backends to use the new API, at the cost of some temporary code duplication. Signed-off-by: Niklas Haas <git@haasn.dev>	2026-06-09 18:27:20 +02:00
Niklas Haas	adaf142647	swscale/uops: generate uop helper macros This follows the same approach as is used currently by ops_entries_aarch64, except I decided to have the generation logic live directly in uops.c to allow re-using internal helpers and move it closer to the other helpers that depend on the exact set of uops and their fields. Unlike libswscale/tests/sws_ops.c, we make an effort to actually test all relevant flag combinations, since these can affect the generated op lists. I will use these macros to auto-generate both the C template-based kernels, as well as the entire x86 backend, in the near future, hence their excessive flexibility. Re-use the libswscale/tests/sws_ops.c that we already compile. We could put it in its own file but this is just as convenient, and it's easily moved anyways. Having it be a FATE test ensures that it is always up-to-date. Signed-off-by: Niklas Haas <git@haasn.dev>	2026-06-09 18:27:20 +02:00
Niklas Haas	8ad7cc6ccd	swscale/tests/sws_ops: also print/test micro-op list Tests for changes or regressions in the generated micro-ops. This will be instrumental in my development of the micro-ops optimizer, and my plans to phase out some of the macro-op optimization passes in favor of doing those optimizations on the uop level instead. rgb24 16x16 -> rgb24 16x32: [ u8 +++X] SWS_OP_READ : 3 elem(s) packed >> 0 min: {0 0 0 _}, max: {255 255 255 _} [ u8 ...X] SWS_OP_FILTER_V : 16 -> 32 bilinear (2 taps) min: {0 0 0 _}, max: {255 255 255 _} [f32 ...X] SWS_OP_DITHER : 16x16 matrix + {0 3 2 -1} min: {1/512 1/512 1/512 _}, max: {255.998047 255.998047 255.998047 _} [f32 ...X] SWS_OP_MIN : x <= {255 255 255 _} min: {1/512 1/512 1/512 _}, max: {255 255 255 _} [f32 +++X] SWS_OP_CONVERT : f32 -> u8 min: {0 0 0 _}, max: {255 255 255 _} [ u8 XXXX] SWS_OP_WRITE : 3 elem(s) packed >> 0 (X = unused, z = byteswapped, + = exact, 0 = zero) Retrying with split passes: [ u8 +++X] SWS_OP_READ : 3 elem(s) packed >> 0 min: {0 0 0 _}, max: {255 255 255 _} [ u8 XXXX] SWS_OP_WRITE : 3 elem(s) planar >> 0 (X = unused, z = byteswapped, + = exact, 0 = zero) + translated micro-ops: + u8_read_packed_xyz + u8_write_planar_xyz Sub-pass #1: [ u8 ...X] SWS_OP_READ : 3 elem(s) planar >> 0 + 2 tap bilinear filter (V) min: {0 0 0 _}, max: {255 255 255 _} [f32 ...X] SWS_OP_DITHER : 16x16 matrix + {0 3 2 -1} min: {1/512 1/512 1/512 _}, max: {255.998047 255.998047 255.998047 _} [f32 ...X] SWS_OP_MIN : x <= {255 255 255 _} min: {1/512 1/512 1/512 _}, max: {255 255 255 _} [f32 +++X] SWS_OP_CONVERT : f32 -> u8 min: {0 0 0 _}, max: {255 255 255 _} [ u8 XXXX] SWS_OP_WRITE : 3 elem(s) packed >> 0 (X = unused, z = byteswapped, + = exact, 0 = zero) + translated micro-ops: + u8_read_planar_fv_xyz + f32_dither_xyz_0_3_2_16x16 + f32_min_xyz + f32_to_u8_xyz + u8_write_packed_xyz ... Signed-off-by: Niklas Haas <git@haasn.dev>	2026-06-09 01:11:01 +02:00
Niklas Haas	3a7331d311	swscale/ops: remove unused function ff_sws_enum_ops() Users can trivially recreate this logic anyways. Signed-off-by: Niklas Haas <git@haasn.dev>	2026-06-09 01:10:57 +02:00
Niklas Haas	6b75166758	swscale/tests/sws_ops: minor cleanup / consistency Clean up after the previous revert. Signed-off-by: Niklas Haas <git@haasn.dev>	2026-06-09 01:10:54 +02:00
Niklas Haas	dcfe3d3b90	Revert "swscale/tests/sws_ops: add option for summarizing all operation patterns" This reverts commit `f76aa4e408`. This is no longer needed once we switch to uops_macros.h, which will do the same thing except better.	2026-06-09 01:10:49 +02:00
Niklas Haas	aaf6a52fe6	swscale/uops: add uop translation logic This will replace the fuzzy matching logic in op_match() that is used by the C and x86 implementations, as well as the translation to AARCH64_OP_* that is used by the NEON asmgen backend. Down the line, this function will also take a set of flags to enable backend-specific kernels like FMA variants, but I also decided to keep it initially simple to ease the transition. Signed-off-by: Niklas Haas <git@haasn.dev>	2026-06-09 01:10:39 +02:00
Niklas Haas	dc88bcdf8c	swscale/uops: add uop definitions Taken from AARCH64_OP_*, but generalized/simplified a bit and updated to add missing op types, especially for special cases that already have dedicated implementations on x86. This initial definition is kept intentionally simple and close to SwsOp, to make it easier to port the existing ops backends to the new infrastructure. However, in the future, this will be refactored dramatically - distinctions like convert vs expand will cease to exist on the SwsOp level, and will instead be introduced by separate optimization passes on the uops level. SWS_UOP_LINEAR in particular will most likely be broken up into multiple uops. I also took this opportunity to redefine the mask in a more useful way. I decided to split up SWS_OP_CONVERT as well, because it was making x86 codegen unnecessarily difficult due to the strong interaction between exact pixel sizes. Signed-off-by: Niklas Haas <git@haasn.dev>	2026-06-09 01:09:34 +02:00
Niklas Haas	ae6f3ce02c	swscale/uops: split off from ops.h Forming what will be the start of a larger helper file for backend-internal translation of higher-level ops into lower level kernels. This header file needs to be includable from independent source files, as it will be used to provide definitions for build-time code generation (e.g. ops_asmgen.c), so it must be self-contained. Pulling in all of ops.h from uops.h would be too large a dependency, since ops.h pulls in graph.h, refstruct, bprint, etc. It's easier to start from a fresh file that is documented as being usable at compile time. For now, just declare the common types that will be needed by the uops layer. Signed-off-by: Niklas Haas <git@haasn.dev>	2026-06-08 18:29:02 +02:00
Niklas Haas	48a42b5f21	configure: add -P to $CC_E flag This suppresses the addition of #line directives in the preprocessed output, which is what we want when we're invoking the hostcc just to preprocess some files. (Currently, this variable is only used for configure-internal checks anyways, but I want to use it to preprocess a NASM file) On MSVC/Intel, /EP is the equivalent syntax, though we use -EP instead for consistency. Signed-off-by: Niklas Haas <git@haasn.dev>	2026-06-08 18:24:45 +02:00
haoyuLiu and michaelni	6028720d70	avfilter/zmq: initialize send_buf before shared cleanup on parse failure Found-by: VulnForge Security Research Team Reported-by: Cloud-LHY <haoyuliu@clouditera.com>	2026-06-08 02:20:53 +00:00
Jun Zhao and Jun Zhao	c75701a62f	lavf/mov: read multi-valued metadata tags When a metadata tag (e.g. ©ART) contains multiple values, either as multiple 'data' child atoms within one tag or as multiple sibling tag atoms with the same key, only the first value was read. Fix by joining multiple values with semicolons using AV_DICT_APPEND, consistent with Ogg Vorbis Comment handling in oggparsevorbis.c, and reusing the existing 'goto retry' loop that covr already uses. Also add the missing atom.size -= str_size to correctly track remaining bytes in the tag atom, matching the covr path. Limitation: on remux the joined string is written back as a single value, same lossy behavior as Ogg Vorbis. Lossless round-trip would require AV_DICT_MULTIKEY support throughout the metadata pipeline. Fix #22367 Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-06-08 02:18:32 +00:00
Michael Niedermayer	5622d515e8	tools/target_dec_fuzzer: reduce 4XM max pixels to avoid timeout Fixes: Timeout Fixes: 511356573/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_FOURXM_fuzzer-5010010110492672 Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2026-06-08 01:21:12 +00:00
Jun Zhao and Jun Zhao	cfa3ceac7a	lavc/hevc: add aarch64 NEON for angular modes 10 and 26 Add NEON-optimized implementations for HEVC angular intra prediction modes 10 (pure horizontal) and 26 (pure vertical) at 8-bit depth. Mode 10 (Horizontal): - Broadcasts left[y] to fill each row using ld2r/ld4r for efficiency - Applies edge smoothing for luma blocks smaller than 32x32 Mode 26 (Vertical): - Copies top reference row to all output rows - Applies edge smoothing for luma blocks smaller than 32x32 Edge smoothing uses uhsub+usqadd to compute the filtered result directly in 8-bit, avoiding widening to 16-bit intermediates. The C pred_angular wrappers are made non-static with ff_ prefix to allow the NEON dispatch to fall back to C for modes not yet optimized. This will be reverted once all angular modes are implemented. Note: since pred_angular[] is a per-size function pointer (not per-mode), checkasm benchmarks will show '_neon' for all 33 modes even though only modes 10/26 are truly accelerated; unoptimized modes show ~1.0x speedup as they pass through the NEON wrapper to the C fallback with negligible overhead. Speedup over C on Apple M4 (checkasm --bench, 15-run average): Mode 10 (Horizontal): 4x4: 4.66x 8x8: 5.80x 16x16: 16.86x 32x32: 24.89x Mode 26 (Vertical): 4x4: 1.16x 8x8: 1.83x 16x16: 2.45x 32x32: 4.50x Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-06-07 23:29:33 +00:00
Jun Zhao and Jun Zhao	3d71b9ec93	tests/checkasm: hevc_pred: use pixel helpers for diagnostic output Replace plain memcmp+fail() with checkasm_check_pixel_padded() for DC, planar, and angular prediction tests. Use PIXEL_RECT for output buffers instead of flat arrays. This enables: - Detailed per-pixel difference output when run with 'checkasm -v' - Detection of out-of-bounds writes beyond the NxN block area - Padding violation reporting (writes past block boundary) Previously, a test failure would only report "FAILED" with no information about which pixels were wrong, making assembly debugging difficult. Follows the pattern established in `4d4b301e4a` (checkasm: hevc_pel: Use helpers for checking for writes out of bounds). Suggested-by: Martin Storsjö <martin@martin.st> Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-06-07 23:29:33 +00:00
Jun Zhao and James Almer	3ec0f14f7d	avcodec/h264_ps: set default SAR, remove stale workaround Set sps->vui.sar to {0,1} (unspecified) before the VUI parsing block, matching the HEVC pattern at hevc_ps.c. The old zero-init-to-1 workaround is now unreachable and is removed. Suggested-by: James Almer <jamrial@gmail.com> Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-06-07 18:55:16 +00:00
Jun Zhao and James Almer	e598463b3d	avcodec/h2645_vui: interpret a degenerate SAR as unspecified Per ITU-T H.264 (ISO/IEC 14496-10) Annex E.2.1 and ITU-T H.265 (ISO/IEC 23008-2) Annex E.3.1, when sar_width or sar_height is zero the sample aspect ratio shall be considered unspecified. Internally ffmpeg represents an unspecified SAR as 0/1, while fractions with a zero denominator are not handled properly (den=0 is silently changed to den=1 in h264_ps.c, turning an invalid 20480/0 into a "valid" but impossibly extreme 20480/1); so we bridge the gap by replacing x/0 with 0/1 at the VUI parsing layer. An av_log warning is added so an invalid SAR in the bitstream is diagnosed rather than silently overwritten. This fixes a problem with some video files provided by game OddBallers when executed with Wine/Proton, which report SAR 20480/0. Based on patch by Giovanni Mascellani <gmascellani@codeweavers.com>. Fixes: ticket #23321 Signed-off-by: Jun Zhao <barryjzhao@tencent.com>	2026-06-07 18:55:16 +00:00
Andreas Rheinhardt	bb49197ede	avcodec/liboapvenc: Remove dimension change check If this were to be checked, it should be checked generically, not in every single encoder. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-06-07 17:53:44 +02:00
Andreas Rheinhardt	0faa43ae6c	avcodec/liboapvenc: Use av_image_copy2() to avoid cast Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-06-07 17:53:44 +02:00
Andreas Rheinhardt	bf47563bd8	avcodec/liboapvenc: Remove always-false checks Already checked in encode_preinit_video(). Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-06-07 17:53:44 +02:00
Andreas Rheinhardt	80ea2d1487	avcodec/liboapvenc: Return directly when possible Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-06-07 17:53:44 +02:00
Andreas Rheinhardt	67855a7234	avcodec/liboapvenc: Use av_unreachable for unreachable default cases Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-06-07 17:53:44 +02:00
Andreas Rheinhardt	9791c4d183	avcodec/liboapvenc: Don't set AVCodec.pix_fmts directly Instead use CODEC_PIXFMTS. Avoids deprecation warnings from Clang and simplifies the removal of AVCodec.pix_fmts. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-06-07 17:53:44 +02:00
James Almer	d1faab734d	avcodec/dcadec: map Lw/Rw to FLC/FRC Some 7.1 DTS files seem to signal Lw/Rw channels that the decoder has been mapping to SL/SR, despite the macro for the mask being called 7_1_WIDE. This resulted in said samples reporting the same native layout as actual 7.1 samples with Lsr/Rsr/Lss/Rss (mapped to BL/BR/SL/SR). If we were to be strict, Lw/Rw would map to WR/WL, but that would result in an unusual native layout. Instead, let's map them to FLC/FRC, which will result in the more common 7.1(wide) native layout. Signed-off-by: James Almer <jamrial@gmail.com>	2026-06-07 10:24:42 -03:00
Niklas Haas and Niklas Haas	3137d337fe	tests/checkasm/sw_ops: use new checkasm_set_func_variant() The current approach of re-testing the C reference for every backend separately leads to both confusing output (e.g. having an extra redundant `memcpy_c` line for every op, even those not implemented by the memcpy backend), as well as a lot of unnecessary wasted time re-testing and re-benching the same C variant for every backend. This new API function lets us test the C function only a single time, while simultaneously having all of the other backends implicitly compare themselves against the C reference. Signed-off-by: Niklas Haas <git@haasn.dev>	2026-06-07 09:24:23 +00:00
Michael Niedermayer	04e2341056	avcodec/adpcm: fix signed integer overflow in get_nb_samples() Fixes: signed integer overflow: 314572800 * 8 cannot be represented in type 'int' Tighten the guard to INT_MAX/14, which covers the largest expansion factor used in the function currently. Found-by: Jiale Yao <19888972804@163.com> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2026-06-07 02:57:25 +00:00
Michael Niedermayer	0a8d961388	avformat/matroskadec: avoid signed overflow in DASH cue time differences Fixes: 493466409/clusterfuzz-testcase-minimized-ffmpeg_dem_WEBM_DASH_MANIFEST_fuzzer-6150181551931392 Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2026-06-07 02:56:44 +00:00
Ramiro Polla	f8af47b640	build: remove checkasm_header_config_generated.h on distclean Forgotten in `4569ab7eaa`.	2026-06-06 23:58:13 +02:00
David Korczynski and michaelni	1e9984772b	avcodec/fastaudio: reject subframes count whose * 256 product overflows 32-bit fastaudio_decode() computes subframes = pkt->size / (40 * channels); frame->nb_samples = subframes * 256; both as 32-bit signed multiplications. When pkt->size is large enough to make subframes >= 2^24, the second multiplication overflows the signed int range and frame->nb_samples wraps to a small value. ff_get_buffer() then sizes the audio plane for that wrapped sample count, while the decoder loop at line 152 still iterates the full (unwrapped) subframes count, performing a 1024-byte memcpy per subframe per channel. The 27th iteration (or first iteration with nb_samples=0) writes one byte past the per-plane allocation, yielding the ASan heap-buffer-overflow WRITE at libavcodec/fastaudio.c:171 reported as ANT-2026-03891. Reject the subframes value whose *256 product would overflow before performing the multiplication. The bound INT_MAX / 256 (= 8388607) keeps the existing two's-complement semantics of every reachable input and rejects only the configurations that would have wrapped. Reproducer: a crafted AVI declaring one mono audio chunk of 671_088_680 bytes (sparse) with the decoder forced via 'ffmpeg -c:a fastaudio -i evil.avi'. Found-by: Anthropic agents; validated and reported by Ada Logics. Signed-off-by: David Korczynski <david@adalogics.com> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2026-06-06 21:32:08 +00:00
Michael Niedermayer	7c7ca349bc	avcodec/vc2enc_dwt: avoid signed overflow in the 5/3 and Haar DWT Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2026-06-06 19:07:39 +00:00
Michael Niedermayer	5f91556215	avcodec/vc2enc_dwt: avoid signed overflow in the 9/7 DWT lifting Fixes: 490488944/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_VC2_fuzzer-5310290362433536 Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2026-06-06 19:07:39 +00:00
DROOdotFOO and Ramiro Polla	d9e2239f3c	swscale/aarch64/yuv2rgb_neon: 2 lines at a time, yuva420p Alpha is full resolution, so each row loads its own 16 alpha bytes via process_row's \rsrcA arg. Test Name A55-gcc M1-clang A76-gcc ---------------------------------------------------------------------------------------- yuva420p_to_argb_neon 22607.6 (1.16x) 39.2 (1.24x) 13631.6 (1.12x) yuva420p_to_rgba_neon 22608.2 (1.16x) 38.3 (1.21x) 13912.8 (1.12x) yuva420p_to_abgr_neon 23074.6 (1.16x) 38.8 (1.22x) 14492.1 (1.08x) yuva420p_to_bgra_neon 23079.7 (1.16x) 39.9 (1.19x) 14472.6 (1.08x) Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com> Signed-off-by: DROOdotFOO <drew@axol.io>	2026-06-06 19:38:40 +02:00
DROOdotFOO and Ramiro Polla	4b6f7c2a05	swscale/aarch64/yuv2rgb_neon: 2 lines at a time, rgb16 pack_rgb16_2l uses v26-v29 as scratch (luma temps, dead by then) instead of v20-v23, so v20-v25 chroma survives the pack step. A .error trips if yuva420p hits rgb16 (v28/v29 would clobber alpha); the dispatcher routes that combination through yuv420p anyway. Test Name A55-gcc M1-clang A76-gcc ---------------------------------------------------------------------------------------- nv12_to_rgb565le_neon 28531.9 (1.12x) 46.8 (1.28x) 19252.9 (1.09x) nv12_to_bgr565le_neon 29018.1 (1.12x) 48.1 (1.17x) 19252.0 (1.09x) nv12_to_rgb555le_neon 28531.3 (1.12x) 47.2 (1.24x) 19253.6 (1.09x) nv12_to_bgr555le_neon 29012.1 (1.12x) 45.8 (1.22x) 19252.5 (1.09x) nv21_to_rgb565le_neon 28532.3 (1.12x) 48.4 (1.15x) 19430.0 (1.09x) nv21_to_bgr565le_neon 29013.8 (1.12x) 47.2 (1.21x) 19428.8 (1.09x) nv21_to_rgb555le_neon 28533.3 (1.12x) 49.7 (1.16x) 19430.5 (1.09x) nv21_to_bgr555le_neon 29011.4 (1.12x) 48.5 (1.18x) 19428.7 (1.09x) yuv420p_to_rgb565le_neon 28351.9 (1.11x) 46.4 (1.18x) 19635.3 (1.08x) yuv420p_to_bgr565le_neon 28831.8 (1.11x) 50.8 (1.09x) 19634.5 (1.08x) yuv420p_to_rgb555le_neon 28351.3 (1.11x) 46.3 (1.23x) 19634.2 (1.08x) yuv420p_to_bgr555le_neon 28829.1 (1.11x) 46.5 (1.21x) 19634.3 (1.08x) yuva420p_to_rgb565le_neon 28349.5 (1.11x) 51.2 (1.06x) 19634.7 (1.08x) yuva420p_to_bgr565le_neon 28833.1 (1.11x) 48.6 (1.17x) 19633.9 (1.08x) yuva420p_to_rgb555le_neon 28351.6 (1.11x) 47.8 (1.16x) 19635.2 (1.08x) yuva420p_to_bgr555le_neon 28831.5 (1.11x) 46.4 (1.14x) 19634.8 (1.08x) Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com> Signed-off-by: DROOdotFOO <drew@axol.io>	2026-06-06 19:38:40 +02:00
DROOdotFOO and Ramiro Polla	dad212060c	swscale/aarch64/yuv2rgb_neon: 2 lines at a time, gbrp Six dst pointers exhaust the caller-saved registers; spill x19/x20. yuva420p_to_gbrp_neon is routed through the yuv420p path by the dispatcher (gbrp has no alpha channel). Test Name A55-gcc M1-clang A76-gcc ---------------------------------------------------------------------------------------- nv12_to_gbrp_neon 20017.8 (1.15x) 32.8 (1.34x) 10658.0 (1.27x) nv21_to_gbrp_neon 20020.9 (1.15x) 32.5 (1.36x) 10691.1 (1.26x) yuv420p_to_gbrp_neon 19856.3 (1.14x) 31.4 (1.34x) 10348.0 (1.37x) yuva420p_to_gbrp_neon 19859.8 (1.14x) 30.9 (1.27x) 10350.9 (1.37x) Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com> Signed-off-by: DROOdotFOO <drew@axol.io>	2026-06-06 19:38:40 +02:00
DROOdotFOO and Ramiro Polla	4bfe7efd0c	swscale/aarch64/yuv2rgb_neon: 2 lines at a time, packed RGB Vertically-subsampled inputs (nv12, nv21, yuv420p) share a chroma row across two output rows; compute the chroma -> RGB offsets once and apply to both luma rows. Covers argb/rgba/abgr/bgra/rgb24/bgr24. Test Name A55-gcc M1-clang A76-gcc ---------------------------------------------------------------------------------------- nv12_to_argb_neon 21647.2 (1.16x) 40.1 (1.24x) 13813.3 (1.16x) nv12_to_rgba_neon 21653.7 (1.16x) 40.8 (1.32x) 14105.0 (1.13x) nv12_to_abgr_neon 22122.2 (1.15x) 40.3 (1.27x) 14100.2 (1.16x) nv12_to_bgra_neon 22121.6 (1.15x) 39.6 (1.24x) 14125.9 (1.16x) nv12_to_rgb24_neon 19842.0 (1.18x) 33.4 (1.28x) 12868.9 (1.17x) nv12_to_bgr24_neon 20318.0 (1.18x) 34.6 (1.23x) 12868.8 (1.17x) nv21_to_argb_neon 21648.5 (1.16x) 41.0 (1.29x) 13978.5 (1.14x) nv21_to_rgba_neon 21653.0 (1.16x) 41.3 (1.21x) 14173.5 (1.11x) nv21_to_abgr_neon 22120.6 (1.15x) 41.1 (1.20x) 14505.4 (1.14x) nv21_to_bgra_neon 22120.8 (1.15x) 41.0 (1.22x) 14520.1 (1.14x) nv21_to_rgb24_neon 19830.5 (1.19x) 35.1 (1.28x) 12832.4 (1.17x) nv21_to_bgr24_neon 20317.1 (1.18x) 34.6 (1.27x) 12833.1 (1.17x) yuv420p_to_argb_neon 21450.2 (1.15x) 39.2 (1.19x) 14118.3 (1.12x) yuv420p_to_rgba_neon 21447.2 (1.15x) 38.8 (1.24x) 14326.0 (1.14x) yuv420p_to_abgr_neon 21927.0 (1.15x) 38.9 (1.25x) 14826.6 (1.13x) yuv420p_to_bgra_neon 21930.8 (1.15x) 41.4 (1.18x) 14822.9 (1.13x) yuv420p_to_rgb24_neon 19365.5 (1.17x) 33.5 (1.25x) 13291.8 (1.16x) yuv420p_to_bgr24_neon 19848.8 (1.16x) 34.1 (1.35x) 13292.8 (1.16x) Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com> Signed-off-by: DROOdotFOO <drew@axol.io>	2026-06-06 19:38:40 +02:00
DROOdotFOO and Ramiro Polla	11b1721b11	swscale/aarch64/yuv2rgb_neon: reorder params, unify signature Pass src[]/srcStride[] as arrays (x5/x6), move y_offset/y_coeff into register args (w2/w3). Only int-after-pointer stack args remain, so Apple and AAPCS64 lay them out identically; every __APPLE__ is gone. nv12/nv21/yuv420p/yuv422p/yuva420p share one signature. Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com> Signed-off-by: DROOdotFOO <drew@axol.io>	2026-06-06 19:38:40 +02:00
DROOdotFOO and Ramiro Polla	8dbc729950	swscale/aarch64/yuv2rgb_neon: name registers the loop body. .text byte-identical. Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com> Signed-off-by: DROOdotFOO <drew@axol.io>	2026-06-06 19:38:40 +02:00
DROOdotFOO and Ramiro Polla	e0fa641240	swscale/aarch64/yuv2rgb_neon: chroma-preserve compute_rgb Macro writes per-luma sums into the destination registers, leaving v20-v25 (chroma -> RGB offsets) intact for the 2-line callers. Takes bare register names. compute_rgba and compute_rgba_alpha follow suit. Single-row callers reload v20-v25 each iteration via chroma_to_rgb_offsets, so the change is a no-op for them: Apple M1 width=1920 mean -0.54% across 55 paths, within bench noise. Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com> Signed-off-by: DROOdotFOO <drew@axol.io>	2026-06-06 19:38:40 +02:00
Michael Niedermayer	b99c6fc8c3	avformat/dashdec: Fail with any inner stream count being 0 Fixes: ada-3-poc.mpd Found-by: Claude and Ada Logics. This issue was found by Anthropic from using agents to study security of open source projects, and I am from Ada Logics helping validate the found issues and report to maintainers. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2026-06-06 17:29:17 +00:00
Ramiro Polla	025d6330a5	swscale/aarch64/ops_asmgen: fold impl pointer increment into loading of continuation address This commit reduces every kernel by one instruction, for example: function ff_sws_clear_8_u16_0001_neon, export=1, jumpable=1 - ldr x0, [x1] // SwsFuncPtr cont = impl->cont; ldr q16, [x1, #16] // v128 clear_vec = impl->priv.v128; + ldr x0, [x1], #32 // SwsFuncPtr cont = (impl++)->cont; dup v0.8h, v16.h[0] // vl[0] = broadcast(clear_vec[0]) - add x1, x1, #32 // impl += 1; br x0 // jump to cont endfunc A55: Overall speedup=1.066x faster, min=0.881x max=1.288x A76: Overall speedup=1.012x faster, min=0.570x max=1.546x The large min/max differences are due to pathological branch miss cases that happen either before or after this commit. Sponsored-by: Sovereign Tech Fund Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>	2026-06-06 11:54:14 +00:00
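
The overflow guard described in commit 1e9984772b (avcodec/fastaudio) can be sketched in isolation as follows. This is a minimal illustration, not the FFmpeg patch itself: the function name `checked_nb_samples`, its return convention, and the extra sanity checks on `channels` are assumptions made for the example. Only the `pkt->size / (40 * channels)` and `subframes * 256` arithmetic and the `INT_MAX / 256` bound come from the commit message.

```c
#include <limits.h>

/* Hypothetical helper (not part of FFmpeg): computes the sample count for a
 * packet, refusing inputs whose subframes * 256 product would not fit in a
 * signed int. Returns 0 on success and -1 on rejection. */
int checked_nb_samples(int pkt_size, int channels, int *nb_samples)
{
    /* Assumed sanity checks so that 40 * channels itself cannot overflow. */
    if (pkt_size < 0 || channels <= 0 || channels > INT_MAX / 40)
        return -1;

    int subframes = pkt_size / (40 * channels);

    /* The bound named in the commit message: INT_MAX / 256 == 8388607
     * when int is 32 bits wide. */
    if (subframes > INT_MAX / 256)
        return -1;

    *nb_samples = subframes * 256; /* cannot overflow after the check */
    return 0;
}
```

With the reproducer size quoted in the commit message (671,088,680 bytes, mono), `subframes` is 16,777,217, which exceeds `INT_MAX / 256`, so the sketch rejects the packet before the multiplication is performed.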

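The degenerate-SAR handling described in commit e598463b3d (avcodec/h2645_vui) amounts to a small normalization step, sketched below under stated assumptions. The `Rational` struct and the `sanitize_sar` name are hypothetical stand-ins so the example compiles on its own (FFmpeg uses `AVRational`), and the warning the commit adds via av_log is omitted. Treating a zero width the same as a zero height follows the H.264/H.265 wording quoted in the commit message; the commit itself only spells out the x/0 case.

```c
/* Hypothetical stand-in for a rational type; FFmpeg itself uses AVRational. */
typedef struct Rational {
    int num;
    int den;
} Rational;

/* Hypothetical helper: maps a sample aspect ratio with a zero width or
 * height to 0/1, the value the commit message says FFmpeg uses internally
 * for "unspecified". Any other ratio is passed through unchanged. */
Rational sanitize_sar(int sar_width, int sar_height)
{
    Rational sar = { sar_width, sar_height };

    if (sar_width == 0 || sar_height == 0) {
        sar.num = 0;
        sar.den = 1;
    }
    return sar;
}
```

For the 20480/0 value mentioned in the commit message, this sketch yields 0/1 rather than the 20480/1 that the old den=0 to den=1 workaround produced.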