This ensures 100% coverage of all uop primitives by generating the set of
tests exactly from the list of seen primitives, using the uops macros.
There are still some annoying quirks, because we essentially have to
"untranslate" the uops back into SwsOps that translate to the intended uop
again, but overall it's not too bad and still much better than the status
quo of hand-rolling the list of test cases.
Signed-off-by: Niklas Haas <git@haasn.dev>
Removes the 1x1 dither fast path, mirroring the previous commit.
This is not really needed or useful, but it will make the transition to
the uops architecture slightly easier, as 1x1 dither gets reinterpreted
as SWS_UOP_ADD there.
Signed-off-by: Niklas Haas <git@haasn.dev>
This is broken because it fails to check dither.y_offset[] to determine if
dithering for a channel is requested or not.
This is unnecessary because the generic dither code already jumps over unused
components, which is cheap enough not to worry about this special case for
now.
This code will, in any case, soon be replaced by a uops_macros.h-derived
approach. This commit is only needed as a stopgap to make checkasm continue
working after the sws_uops refactor.
Signed-off-by: Niklas Haas <git@haasn.dev>
As well as the packed shuffle solver. These don't really interact with
the rest of the code in ops_int.asm, which is, by name at least, intended for
integer op kernels.
More importantly, these functions will be shared with the uops rewrite.
Signed-off-by: Niklas Haas <git@haasn.dev>
Instead of choosing by hand which kernels to implement, this rewrite focuses
on leveraging the power of uops_macros.h to auto-generate all needed kernels.
This not only simplifies maintenance, but also improves performance.
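As a rough illustration of the general technique (not the actual contents of
uops_macros.h), a single list macro can be expanded several times to stamp
out both the kernels and a matching lookup table. Every name below
(DEMO_UOP_LIST, demo_kernel_*, demo_table, demo_find) is hypothetical and
only serves the sketch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical primitive list: X(name, expression over pixel x and constant c) */
#define DEMO_UOP_LIST(X)              \
    X(add, (uint8_t)(x + c))          \
    X(sub, (uint8_t)(x - c))          \
    X(shl, (uint8_t)(x << (c & 7)))

/* Expansion 1: one kernel function per list entry */
#define DEMO_DEFINE_KERNEL(name, expr)                                 \
    static void demo_kernel_##name(uint8_t *buf, size_t n, uint8_t c)  \
    {                                                                  \
        for (size_t i = 0; i < n; i++) {                               \
            const uint8_t x = buf[i];                                  \
            buf[i] = expr;                                             \
        }                                                              \
    }
DEMO_UOP_LIST(DEMO_DEFINE_KERNEL)

typedef struct DemoEntry {
    const char *name;
    void (*fn)(uint8_t *buf, size_t n, uint8_t c);
} DemoEntry;

/* Expansion 2: a table that cannot get out of sync with the kernels */
#define DEMO_TABLE_ENTRY(name, expr) { #name, demo_kernel_##name },
static const DemoEntry demo_table[] = { DEMO_UOP_LIST(DEMO_TABLE_ENTRY) };
#define DEMO_TABLE_SIZE (sizeof(demo_table) / sizeof(demo_table[0]))

/* Look up a generated kernel by name; returns NULL if it does not exist */
static const DemoEntry *demo_find(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < DEMO_TABLE_SIZE; i++)
        if (!strcmp(demo_table[i].name, name))
            return &demo_table[i];
    return NULL;
}
```

Adding a primitive to the list then yields both the kernel and its table
entry without touching any other code.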
I have decided to develop the replacement backend as a separate file, under
a separate prefix, for the explicit purpose of being able to verify the
correctness of the rewrite using the current backend as a checkasm reference.
The code for the kernels themselves has been largely copied from the old
C backend, modified slightly to conform to the uop template style. This does
result in some code duplication, but a following commit will clean it up.
I nonetheless want to preserve this commit for bisection purposes, to ensure
we have one commit that contains both backends side-by-side.
Overall speedup=1.182x faster, min=0.197x max=3.450x
The big slowdowns are flukes caused by tiny deviations in the runtime of
a noop memcpy conversion.
As a nice side benefit, the compiled binary is now also ~10% smaller, and
the code ~50% smaller.
Signed-off-by: Niklas Haas <git@haasn.dev>
This will eventually replace the existing op_match() and
ff_sws_op_compile_tables(), but I've decided to introduce it separately first
so that I can incrementally update the backends to use the new API, at the
cost of some temporary code duplication.
Signed-off-by: Niklas Haas <git@haasn.dev>
This follows the same approach as is used currently by ops_entries_aarch64,
except I decided to have the generation logic live directly in uops.c
to allow re-using internal helpers and move it closer to the other helpers
that depend on the exact set of uops and their fields.
Unlike libswscale/tests/sws_ops.c, we make an effort to actually test all
relevant flag combinations, since these can affect the generated op lists.
I will use these macros to auto-generate both the C template-based kernels,
as well as the entire x86 backend, in the near future, hence their excessive
flexibility.
Re-use the libswscale/tests/sws_ops.c that we already compile. We could put it
in its own file but this is just as convenient, and it's easily moved anyways.
Having it be a FATE test ensures that it is always up-to-date.
Signed-off-by: Niklas Haas <git@haasn.dev>
This will replace the fuzzy matching logic in op_match() that is used by the
C and x86 implementations, as well as the translation to AARCH64_OP_* that is
used by the NEON asmgen backend.
Down the line, this function will also take a set of flags to enable
backend-specific kernels like FMA variants, but I decided to keep it
simple initially to ease the transition.
Signed-off-by: Niklas Haas <git@haasn.dev>
Taken from AARCH64_OP_*, but generalized/simplified a bit and updated to add
missing op types, especially for special cases that already have dedicated
implementations on x86.
This initial definition is kept intentionally simple and close to SwsOp, to
make it easier to port the existing ops backends to the new infrastructure.
However, in the future, this will be refactored dramatically - distinctions
like convert vs expand will cease to exist on the SwsOp level, and will
instead be introduced by separate optimization passes on the uops level.
SWS_UOP_LINEAR in particular will most likely be broken up into multiple
uops. I also took this opportunity to redefine the mask in a more useful way.
I decided to split up SWS_OP_CONVERT as well, because it was making x86
codegen unnecessarily difficult due to the strong interaction between exact
pixel sizes.
Signed-off-by: Niklas Haas <git@haasn.dev>
Forming what will be the start of a larger helper file for backend-internal
translation of higher-level ops into lower level kernels. This header file
needs to be includable from independent source files, as it will be used to
provide definitions for build-time code generation (e.g. ops_asmgen.c), so
it must be self-contained.
Pulling in all of ops.h from uops.h would be too large a dependency, since
ops.h pulls in graph.h, refstruct, bprint, etc. It's easier to start from a
fresh file that is documented as being usable at compile time.
For now, just declare the common types that will be needed by the uops layer.
Signed-off-by: Niklas Haas <git@haasn.dev>
This suppresses the addition of #line directives in the preprocessed output,
which is what we want when we're invoking the hostcc just to preprocess some
files. (Currently, this variable is only used for configure-internal checks
anyways, but I want to use it to preprocess a NASM file)
On MSVC/Intel, /EP is the equivalent syntax, though we use -EP instead for
consistency.
Signed-off-by: Niklas Haas <git@haasn.dev>
Add NEON-optimized implementations for HEVC angular intra prediction
modes 10 (pure horizontal) and 26 (pure vertical) at 8-bit depth.
Mode 10 (Horizontal):
- Broadcasts left[y] to fill each row using ld2r/ld4r for efficiency
- Applies edge smoothing for luma blocks smaller than 32x32
Mode 26 (Vertical):
- Copies top reference row to all output rows
- Applies edge smoothing for luma blocks smaller than 32x32
Edge smoothing uses uhsub+usqadd to compute the filtered result
directly in 8-bit, avoiding widening to 16-bit intermediates.
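To see why staying in 8-bit loses nothing, here is a scalar C model of one
lane. It assumes the edge filter has the form
clip(base + ((a - b) >> 1)) and models uhsub as a truncating halving
subtract and usqadd as an unsigned-plus-signed saturating add. It is a
sketch of the arithmetic only, not the NEON code from this commit, and all
function names are made up for the example:

```c
#include <assert.h>
#include <stdint.h>

/* floor(d / 2), written without right-shifting a negative int */
static int floor_half(int d)
{
    return d >= 0 ? d / 2 : -((-d + 1) / 2);
}

/* Reference: widen to int, add the halved difference, clip to [0, 255] */
static uint8_t edge_ref(uint8_t base, uint8_t a, uint8_t b)
{
    int v = base + floor_half(a - b);
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Model of one uhsub lane: (a - b) >> 1, truncated to 8 bits */
static uint8_t model_uhsub(uint8_t a, uint8_t b)
{
    return (uint8_t)floor_half(a - b); /* range -128..127 fits in 8 bits */
}

/* Model of one usqadd lane: unsigned acc + signed addend, saturated */
static uint8_t model_usqadd(uint8_t acc, uint8_t addend_bits)
{
    int s = addend_bits >= 128 ? addend_bits - 256 : addend_bits;
    int v = acc + s;
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* The 8-bit-only path: no 16-bit intermediate is ever formed */
static uint8_t edge_8bit(uint8_t base, uint8_t a, uint8_t b)
{
    return model_usqadd(base, model_uhsub(a, b));
}

/* Exhaustively compare both paths over all 2^24 input triples */
static void edge_selfcheck(void)
{
    for (int base = 0; base < 256; base++)
        for (int a = 0; a < 256; a++)
            for (int b = 0; b < 256; b++)
                assert(edge_8bit(base, a, b) == edge_ref(base, a, b));
}
```

The halved difference always lies in -128..127, so it survives truncation
to 8 bits, and the saturating add then performs the clip.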
The C pred_angular wrappers are made non-static with ff_ prefix to
allow the NEON dispatch to fall back to C for modes not yet optimized.
This will be reverted once all angular modes are implemented.
Note: since pred_angular[] is a per-size function pointer (not
per-mode), checkasm benchmarks will show '_neon' for all 33 modes
even though only modes 10/26 are truly accelerated; unoptimized
modes show ~1.0x speedup as they pass through the NEON wrapper to
the C fallback with negligible overhead.
Speedup over C on Apple M4 (checkasm --bench, 15-run average):
Mode 10 (Horizontal):
4x4: 4.66x 8x8: 5.80x 16x16: 16.86x 32x32: 24.89x
Mode 26 (Vertical):
4x4: 1.16x 8x8: 1.83x 16x16: 2.45x 32x32: 4.50x
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
Replace plain memcmp+fail() with checkasm_check_pixel_padded() for
DC, planar, and angular prediction tests. Use PIXEL_RECT for output
buffers instead of flat arrays.
This enables:
- Detailed per-pixel difference output when run with 'checkasm -v'
- Detection of out-of-bounds writes beyond the NxN block area
- Padding violation reporting (writes past block boundary)
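The idea behind padded checking can be sketched in plain C as follows. This
is not the checkasm implementation; DemoRect, DEMO_PAD, DEMO_GUARD and the
helper functions are hypothetical names used only to illustrate surrounding
the block with guard bytes and verifying them afterwards:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX   32   /* largest supported block size */
#define DEMO_PAD   8    /* guard pixels on every side (arbitrary choice) */
#define DEMO_GUARD 0xA5 /* sentinel byte written into the padding */

typedef struct DemoRect {
    uint8_t data[(DEMO_MAX + 2 * DEMO_PAD) * (DEMO_MAX + 2 * DEMO_PAD)];
    ptrdiff_t stride;
    int size; /* the block under test is size x size */
} DemoRect;

/* Fill the whole buffer, block included, with the sentinel value */
static void demo_rect_init(DemoRect *r, int size)
{
    assert(size > 0 && size <= DEMO_MAX);
    memset(r->data, DEMO_GUARD, sizeof(r->data));
    r->stride = size + 2 * DEMO_PAD;
    r->size   = size;
}

/* Pointer to the top-left pixel of the block, inside the padding */
static uint8_t *demo_rect_pixels(DemoRect *r)
{
    return r->data + DEMO_PAD * r->stride + DEMO_PAD;
}

/* Count guard bytes outside the block that no longer hold the sentinel */
static int demo_rect_count_violations(const DemoRect *r)
{
    const int full = r->size + 2 * DEMO_PAD;
    int bad = 0;
    for (int y = 0; y < full; y++) {
        for (int x = 0; x < full; x++) {
            int inside = y >= DEMO_PAD && y < DEMO_PAD + r->size &&
                         x >= DEMO_PAD && x < DEMO_PAD + r->size;
            if (!inside && r->data[y * r->stride + x] != DEMO_GUARD)
                bad++;
        }
    }
    return bad;
}
```

A stray write that happens to store the sentinel value itself goes
unnoticed in this simplified version.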
Previously, a test failure would only report "FAILED" with no
information about which pixels were wrong, making assembly debugging
difficult. Follows the pattern established in 4d4b301e4a (checkasm:
hevc_pel: Use helpers for checking for writes out of bounds).
Suggested-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
Set sps->vui.sar to {0,1} (unspecified) before the VUI parsing
block, matching the HEVC pattern at hevc_ps.c. The old
zero-init-to-1 workaround is now unreachable and is removed.
Suggested-by: James Almer <jamrial@gmail.com>
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
Per ITU-T H.264 (ISO/IEC 14496-10) Annex E.2.1 and ITU-T H.265
(ISO/IEC 23008-2) Annex E.3.1, when sar_width or sar_height is zero
the sample aspect ratio shall be considered unspecified. Internally
ffmpeg represents an unspecified SAR as 0/1, while fractions with a
zero denominator are not handled properly (den=0 is silently changed
to den=1 in h264_ps.c, turning an invalid 20480/0 into a "valid" but
impossibly extreme 20480/1); so we bridge the gap by replacing x/0
with 0/1 at the VUI parsing layer.
An av_log warning is added so an invalid SAR in the bitstream is
diagnosed rather than silently overwritten.
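A minimal sketch of the intended normalization, assuming a plain
numerator/denominator pair. DemoRational and demo_sanitize_sar are
hypothetical stand-ins, fprintf replaces av_log so the example is
self-contained, and warning only in the x/0 case is my reading of the
description above rather than a copy of the actual code:

```c
#include <assert.h>
#include <stdio.h>

typedef struct DemoRational {
    int num, den;
} DemoRational;

/* Normalize a parsed SAR; returns 1 if an invalid x/0 value was replaced */
static int demo_sanitize_sar(DemoRational *sar)
{
    assert(sar != NULL);
    if (sar->den == 0) {
        /* Zero sar_height: diagnose it and treat the SAR as unspecified */
        fprintf(stderr, "Invalid SAR %d/%d, treating as unspecified\n",
                sar->num, sar->den);
        *sar = (DemoRational){ 0, 1 };
        return 1;
    }
    if (sar->num == 0) {
        /* Zero sar_width: also unspecified, store the canonical 0/1 */
        *sar = (DemoRational){ 0, 1 };
    }
    return 0;
}
```

With this, the problematic 20480/0 becomes 0/1, while a well-formed value
such as 16/9 passes through unchanged.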
This fixes a problem with some video files shipped with the game
OddBallers when it is run under Wine/Proton, which report SAR 20480/0.
Based on patch by Giovanni Mascellani <gmascellani@codeweavers.com>.
Fixes: ticket #23321
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
If this were to be checked, it should be checked generically,
not in every single encoder.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Instead use CODEC_PIXFMTS. Avoids deprecation warnings
from Clang and simplifies the removal of AVCodec.pix_fmts.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Some 7.1 DTS files seem to signal Lw/Rw channels that the decoder has been
mapping to SL/SR, despite the macro for the mask being called 7_1_WIDE.
This resulted in said samples reporting the same native layout as actual 7.1
samples with Lsr/Rsr/Lss/Rss (mapped to BL/BR/SL/SR).
If we were to be strict, Lw/Rw would map to WL/WR, but that would result in an
unusual native layout. Instead, let's map them to FLC/FRC, which will result in
the more common 7.1(wide) native layout.
Signed-off-by: James Almer <jamrial@gmail.com>
The current approach of re-testing the C reference for every backend
separately leads to both confusing output (e.g. having an extra redundant
`memcpy_c` line for every op, even those not implemented by the memcpy
backend), as well as a lot of unnecessary wasted time re-testing and
re-benching the same C variant for every backend.
This new API function lets us test the C function only a single time, while
simultaneously having all of the other backends implicitly compare themselves
against the C reference.
Signed-off-by: Niklas Haas <git@haasn.dev>
Fixes: signed integer overflow: 314572800 * 8 cannot be represented in type 'int'
Tighten the guard to INT_MAX/14, which covers the largest expansion
factor used in the function currently.
Found-by: Jiale Yao <19888972804@163.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
fastaudio_decode() computes
subframes = pkt->size / (40 * channels);
frame->nb_samples = subframes * 256;
both as 32-bit signed multiplications. When pkt->size is large enough
to make subframes >= 2^24, the second multiplication overflows the
signed int range and frame->nb_samples wraps to a small value.
ff_get_buffer() then sizes the audio plane for that wrapped sample
count, while the decoder loop at line 152 still iterates the full
(unwrapped) subframes count, performing a 1024-byte memcpy per
subframe per channel. The 27th iteration (or first iteration with
nb_samples=0) writes one byte past the per-plane allocation,
yielding the ASan heap-buffer-overflow WRITE at
libavcodec/fastaudio.c:171 reported as ANT-2026-03891.
Reject the subframes value whose *256 product would overflow before
performing the multiplication. The bound INT_MAX / 256 (= 8388607)
keeps the existing two's-complement semantics of every reachable
input and rejects only the configurations that would have wrapped.
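The guard amounts to checking the bound before multiplying. The following
is a standalone sketch, not the decoder code: demo_nb_samples is a
hypothetical helper, and -1 stands in for whatever error code the decoder
actually returns:

```c
#include <assert.h>
#include <limits.h>

/*
 * Compute the sample count for a packet, or return -1 if the inputs are
 * invalid or the result would not fit in a signed int.
 */
static int demo_nb_samples(int pkt_size, int channels)
{
    if (pkt_size < 0 || channels <= 0 || channels > INT_MAX / 40)
        return -1;

    int subframes = pkt_size / (40 * channels);

    /* Reject before multiplying: subframes * 256 must not exceed INT_MAX */
    if (subframes > INT_MAX / 256)
        return -1;

    return subframes * 256;
}
```

For the reproducer size of 671088680 bytes and one channel, subframes is
16777217, which exceeds INT_MAX / 256 and is therefore rejected.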
Reproducer: a crafted AVI declaring one mono audio chunk of
671_088_680 bytes (sparse) with the decoder forced via
'ffmpeg -c:a fastaudio -i evil.avi'.
Found-by: Anthropic agents; validated and reported by Ada Logics.
Signed-off-by: David Korczynski <david@adalogics.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Pass src[]/srcStride[] as arrays (x5/x6), move y_offset/y_coeff into
register args (w2/w3). Only int-after-pointer stack args remain, so
Apple and AAPCS64 lay them out identically; every __APPLE__ is gone.
nv12/nv21/yuv420p/yuv422p/yuva420p share one signature.
Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: DROOdotFOO <drew@axol.io>
Macro writes per-luma sums into the destination registers, leaving
v20-v25 (chroma -> RGB offsets) intact for the 2-line callers. Takes
bare register names. compute_rgba and compute_rgba_alpha follow suit.
Single-row callers reload v20-v25 each iteration via
chroma_to_rgb_offsets, so the change is a no-op for them: Apple M1
width=1920 mean -0.54% across 55 paths, within bench noise.
Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: DROOdotFOO <drew@axol.io>
Fixes: ada-3-poc.mpd
Found-by: Claude and Ada Logics. This issue was found by Anthropic from using agents to study security of open source projects, and I am from Ada Logics helping validate the found issues and report to maintainers.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This commit reduces every kernel by one instruction, for example:
function ff_sws_clear_8_u16_0001_neon, export=1, jumpable=1
- ldr x0, [x1] // SwsFuncPtr cont = impl->cont;
ldr q16, [x1, #16] // v128 clear_vec = impl->priv.v128;
+ ldr x0, [x1], #32 // SwsFuncPtr cont = (impl++)->cont;
dup v0.8h, v16.h[0] // vl[0] = broadcast(clear_vec[0])
- add x1, x1, #32 // impl += 1;
br x0 // jump to cont
endfunc
A55: Overall speedup=1.066x faster, min=0.881x max=1.288x
A76: Overall speedup=1.012x faster, min=0.570x max=1.546x
The large min/max differences are due to pathological branch miss cases
that happen either before or after this commit.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>