The aarch64 VP9 loopfilters actually violate aarch64 GCS
(Guarded Control Stack), even though we marked the code as GCS
compliant in 846746be4b.
This means that builds with GCS enabled, after that commit,
will crash when decoding VP9, on future hardware (or current
QEMU) that supports GCS. This also goes for ffmpeg version 8.1.1
where the GCS enabling was backported.
This matches the fix that was done for hevcdsp in
1f7ed8a78d.
This issue wasn't observed when running checkasm in QEMU; therefore,
I thought all GCS issues had been fixed by 846746be4b. (Had I tested
the full "make fate" with QEMU, the issue would have appeared, though.)
However, with the new checkasm, some of the GCS violations
do appear even in checkasm.
The reason is that the checkasm vp9 test intentionally crafts
input pixels that attempt to trigger all the individual
separate cases in each input buffer (in
randomize_loopfilter_buffers). This means that the checkasm
tests never exercise the early exit cases, which are the ones
that violate GCS.
With the new checkasm, the call to "bench_new" always runs
the code at least once, even if not benchmarking.
As the input buffers weren't reinitialized between the test
and "bench_new", the pixel differences now differ from the
initial setup, so that the code sometimes (often) ends up
hitting the early exit cases.
Ideally, the vp9 checkasm test would be repeated to cover all
cases of input buffers that allow early exits, in addition to
covering the case where one block combines all the different cases.
ff_yuv2rgb_get_func_ptr() now returns the C reference for explicit
BE/LE 16bpp formats, not only the NE alias.
Signed-off-by: DROOdotFOO <drew@axol.io>
This fails to compile with C23 standard attributes otherwise.
Technically only av_unused requires this, but move the other attributes
as well for consistency / future proofing.
Signed-off-by: Niklas Haas <git@haasn.dev>
Context:
1. In the case sps_subpic_info_present=0, there is a single subpicture
which includes the entire picture.
2. When sps_subpic_info_present=0, we might be using Reference Picture
Resampling (RPR), in which picture sizes might differ in the PPS,
rather than in the SPS.
Because of 2., we can't rely on the sequence-level variables
sps_subpic_width_minus1 and sps_subpic_height_minus1 to derive the
picture-level variable num_entry_points, as the picture might have a
different size to the picture used when deriving those sequence-level
variables.
Move window creation and event processing to a dedicated thread.
GetMessage only processes events from the calling thread's message
queue. Because gdigrab_read_header and gdigrab_read_packet don't run on
the same thread, the message queue was not being drained.
Fixes: #11539
Fall through to the existing cleanup so uops is freed on both the success
and failure paths.
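
A minimal sketch of the fall-through cleanup pattern described above. All
names here (ExampleUop, example_compile, example_build, example_free) are
hypothetical stand-ins for illustration and are not the actual code touched
by this commit:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the real uop type; not FFmpeg API. */
typedef struct ExampleUop { int opcode; } ExampleUop;

/* Counts frees so that the cleanup behaviour can be observed. */
static int example_free_count;

static void example_free(void *ptr)
{
    if (ptr)
        example_free_count++;
    free(ptr);
}

/* Hypothetical compile step that fails on request. */
static int example_compile(const ExampleUop *uops, int nb_uops, int fail)
{
    assert(uops && nb_uops > 0);
    return fail ? -EINVAL : 0;
}

static int example_build(int nb_uops, int fail)
{
    int ret;
    ExampleUop *uops = calloc(nb_uops, sizeof(*uops));
    if (!uops)
        return -ENOMEM;

    ret = example_compile(uops, nb_uops, fail);
    if (ret < 0)
        goto cleanup;

    ret = 0; /* success: no early return, fall through to cleanup */

cleanup:
    example_free(uops); /* reached on both the success and failure paths */
    return ret;
}
```

Calling example_build(4, 0) and example_build(4, 1) each release the array
exactly once; only the return value differs.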
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Use a call/ret pair instead of awkwardly exporting and then jumping
back to the return label.
This is similar to c29465bcb6, but for aarch64.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
Scaling ops were added to ff_sws_enum_op_lists() in 1d841635. But the
code that skipped scaling ops in convert_to_aarch64_impl() didn't
take into account that, in sws_ops_aarch64, the scaling ops
aren't folded into read ops.
Also updates libswscale/aarch64/ops_entries.c with the new entries.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
NV_ENC_CLOCK_TIMESTAMP_SET was changed in SDK 13.1: countingType was
replaced by countingTypeLSB and countingTypeMSB.
Signed-off-by: Diego de Souza <ddesouza@nvidia.com>
Signed-off-by: Timo Rothenpieler <timo@rothenpieler.org>
e13b0bb3ff x86: Skip the vzeroupper checks when built with MSVC
3cbf066c51 longjmp: Use raw arch defines for checking for x86_32
54817bd68f include: Fix a mismatched include guard comment
162f15c861 github: Build the UWP job as WINAPI_FAMILY_PHONE_APP
34d920e8bb arm/cpu: Avoid the Windows registry API in UWP builds
9a0cb83b69 utils: Avoid the GetStdHandle and GetConsoleScreenBufferInfo APIs in UWP builds
8d1609d583 ci: Test building ffmpeg with the latest checkasm
d93232845f ci: Test building latest dav1d and dav2d with the current checkasm
e15a8efbfc readme: Add me to the list of maintainers
01b4334a95 meson: Bump version to v1.3.0
git-subtree-dir: tests/checkasm/ext
git-subtree-split: e13b0bb3ff0935b7d2a1c2cc91163370f2cc8f40
The AVX2 15xM PFA FFT calls its second-dimension subtransform with dirty
YMM registers. That subtransform may be a legacy-SSE codelet (fft4 is SSE2
only), causing AVX<->SSE transition penalties. Clear the registers after
the first dimension, before the calls.
Detected with `sde64 -ast` FATE job.
Fixes: ace42cf581
Instead of SWS_UOP_PERMUTE/SWS_UOP_COPY.
No real measurable difference in performance (it just eliminates a few
practically free register renames), but definitely simpler.
Signed-off-by: Niklas Haas <git@haasn.dev>
This decomposes a swizzle mask into a series of optimal register-register
moves, using at most two temporary scratch registers.
This is a better match for ASM-style backends than the existing PERMUTE/COPY
uops that are designed for the needs of the C backend (or other backends which
either apply the swizzle mask directly or permute pointers).
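
To illustrate the general idea, here is a simplified sketch of turning a
swizzle mask into sequential register-register moves. It is not the
algorithm added by this commit: it makes no attempt to produce optimal
sequences, and it gets by with a single scratch register rather than two.
All names (ExMove, EX_TMP, ex_decompose_swizzle, ex_apply_moves) are
hypothetical. The mask is interpreted as dst[i] = src[mask[i]], which is an
assumption of this sketch:

```c
#include <assert.h>

#define EX_NB_REGS   4
#define EX_TMP       EX_NB_REGS                    /* scratch register index */
#define EX_MAX_MOVES (EX_NB_REGS + EX_NB_REGS / 2) /* one extra per cycle */

typedef struct ExMove { int dst, src; } ExMove;

/* Writes the moves to out[] (EX_MAX_MOVES entries) and returns their count. */
static int ex_decompose_swizzle(const int mask[EX_NB_REGS], ExMove *out)
{
    int src[EX_NB_REGS], pending[EX_NB_REGS];
    int nb_pending = 0, nb_out = 0;

    for (int i = 0; i < EX_NB_REGS; i++) {
        assert(mask[i] >= 0 && mask[i] < EX_NB_REGS);
        src[i]     = mask[i];
        pending[i] = src[i] != i; /* identity entries need no move */
        nb_pending += pending[i];
    }

    while (nb_pending > 0) {
        int emitted = 0;

        /* Emit every move whose destination no other pending move reads. */
        for (int i = 0; i < EX_NB_REGS; i++) {
            int still_read = 0;
            if (!pending[i])
                continue;
            for (int j = 0; j < EX_NB_REGS; j++)
                still_read |= pending[j] && j != i && src[j] == i;
            if (still_read)
                continue;
            out[nb_out++] = (ExMove) { i, src[i] };
            pending[i] = 0;
            nb_pending--;
            emitted = 1;
        }
        if (emitted)
            continue;

        /* Only cycles remain: save one register and redirect its readers. */
        for (int i = 0; i < EX_NB_REGS; i++) {
            if (!pending[i])
                continue;
            out[nb_out++] = (ExMove) { EX_TMP, i };
            for (int j = 0; j < EX_NB_REGS; j++)
                if (pending[j] && src[j] == i)
                    src[j] = EX_TMP;
            break;
        }
    }
    return nb_out;
}

/* Executes the moves on a register file whose last slot is the scratch. */
static void ex_apply_moves(int regs[EX_NB_REGS + 1], const ExMove *moves, int nb)
{
    for (int i = 0; i < nb; i++)
        regs[moves[i].dst] = regs[moves[i].src];
}
```

For the mask { 1, 0, 3, 2 } (two independent swaps), this sketch emits six
moves: three per swap, reusing the same scratch register for both.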
I originally had logic equivalent to this written in NASM macros, but it was
just such a complicated mess that I think it's better to rewrite it in C and
have the resulting metadata be an explicit part of the uop definition.
This commit only adds the uop; I'll update the x86 implementation in the
next step.
Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: Niklas Haas <git@haasn.dev>
The old x86 backend was the only backend that actually mutated the ops list.
With this gone, we can constify this parameter.
Signed-off-by: Niklas Haas <git@haasn.dev>
This is no longer needed now that both C and x86 are ported to uops.
The other ff_sws_setup_*() functions are still used by the aarch64 backend.
Signed-off-by: Niklas Haas <git@haasn.dev>
This is a ground-up refactor of the existing x86 ops code, using the new
uops macros to auto-generate every single kernel instance without guesswork.
While I was at it, I also cleaned up the file a bit and made sure we have only
a single, consistent way of writing/defining the kernels. This also gets rid
of some of the old boilerplate like decl_pattern.
Most kernels are trivial ports, but a few deserve attention or note:
- SWS_UOP_LINEAR is now generated more efficiently, thanks to the distinction
between 0/1/arbitrary components. I also rewrote the code to keep track of
whether the output was initialized yet or not, which lets us skip the
initial `xorps` and `addps` for the first component.
- SWS_UOP_PERMUTE is generated automatically by using some NASM logic to
detect permutation cycles and emit the minimal sequence of `mova`
instructions. SWS_UOP_COPY, on the other hand, is implemented naively. I
originally had a more complex implementation that could handle both, but
I decided it really isn't worth the complication just to save 2-3 cycles.
- SWS_UOP_SCALE now has a native 8-bit implementation, which is faster than
falling back to C code.
- SWS_UOP_SWAP_BYTES is no longer compiled as a type-agnostic pshufb; instead,
we hard-code the shuffle mask.
- SWS_UOP_DITHER is now much simpler and avoids branching entirely.
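
As a rough C illustration of the SWS_UOP_LINEAR point above (the real code
is NASM, and the names and the 4-entry row used here are assumptions made
only for this sketch), distinguishing 0/1/arbitrary coefficients and
tracking whether the accumulator is initialized looks like this:

```c
#include <assert.h>

/* Hypothetical helper: one output component of a linear op. */
static float ex_linear_row(const float coeff[4], const float in[4])
{
    float acc = 0.0f;
    int initialized = 0;

    assert(coeff && in);
    for (int i = 0; i < 4; i++) {
        float term;
        if (coeff[i] == 0.0f)
            continue; /* zero coefficient: nothing is emitted */
        /* coefficient of one: the multiply is skipped */
        term = coeff[i] == 1.0f ? in[i] : coeff[i] * in[i];
        /* the first term is a plain move, so no zeroing and no add */
        acc = initialized ? acc + term : term;
        initialized = 1;
    }
    return acc;
}
```

Note that skipping zero coefficients means a NaN or infinity in an unused
input does not propagate, unlike a naive multiply-accumulate over all four
entries.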
Signed-off-by: Niklas Haas <git@haasn.dev>
Rather than hard-coding a separate set of NASM macros, or generating them
with a separate function, we can just leverage the C preprocessor to generate
a NASM source file *from* the existing ops macros.
This is maybe a bit unorthodox, but it avoids unnecessary overhead from
re-generating the macros twice, avoids manual updating of the NASM macros,
and generally does not come with any real downside except being a bit ugly.
The main source of ugliness is the fact that the C preprocessor expands
everything into a single line, whereas NASM expects separate statements to
be on separate lines. Very fortunately, we can work around this by writing
another NASM macro to take its arguments and dump them onto multiple lines.
It may seem premature, but I went ahead and defined all the macros, since
it was easy enough to do.
I added the %include in this commit so that any build errors caused by
introducing this file surface in the same commit that introduces it.
Signed-off-by: Niklas Haas <git@haasn.dev>
The ops.h infrastructure currently hard-codes this as SWS_PIXEL_F32,
but I want to at least properly parametrize this in case we ever
decide to revisit this decision in the future. In particular, it
may become relevant for trivial kernels or kernels whose intermediates
are bounded, exact integers (which could possibly be output directly
as e.g. U16 or U32).
The FATE change is just because the filter op names gained a suffix.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Analog of SWS_UOP_READ_PLANAR_FV for FMA-enabled backends.
The logic for determining when we can safely use FMA is maybe a bit
obtuse, given that a `return type == SWS_PIXEL_U8` would have just done
the trick as well, but better to be safe than sorry, if we ever decide to
tune this constant in the future.
Signed-off-by: Niklas Haas <git@haasn.dev>
This is like SWS_UOP_LINEAR but parametrized by which matrix entries can use
FMA instead of bitexact IEEE mul/add instructions.
I decided to make these a separate uop to avoid bogging down the reference
backend with arch-specific details like FMA. However, I think FMA ops are quite
common/universal so I pre-emptively split it into its own separate flag rather
than defining something like SWS_UOP_FLAG_X86.
Signed-off-by: Niklas Haas <git@haasn.dev>
And SWS_BITEXACT|SWS_ACCURATE_RND, for completeness. This roughly doubles
the runtime of the uops macros generation. Let's hope it doesn't explode
further.
Signed-off-by: Niklas Haas <git@haasn.dev>
This list is currently empty but will be expanded by the following commit.
I briefly tested whether it would be worth avoiding the free/realloc on
the uops array, but found the performance difference to be negligible.
Signed-off-by: Niklas Haas <git@haasn.dev>