This fails to compile with C23 standard attributes otherwise.
Technically only av_unused requires this, but move the other attributes
as well for consistency / future proofing.
Signed-off-by: Niklas Haas <git@haasn.dev>
Instead of SWS_UOP_PERMUTE/SWS_UOP_COPY.
No real measurable difference in performance (it just eliminates a few
practically free register renames), but definitely simpler.
Signed-off-by: Niklas Haas <git@haasn.dev>
This decomposes a swizzle mask into a series of optimal register-register
moves, using at most two temporary scratch registers.
This is a better match for ASM-style backends than the existing PERMUTE/COPY
uops that are designed for the needs of the C backend (or other backends which
either apply the swizzle mask directly or permute pointers).
I originally had logic equivalent to this written in NASM macros, but it was
just such a complicated mess that I think it's better to rewrite it in C and
have the resulting metadata be an explicit part of the uop definition.
This commit only adds the uop, I'll update the x86 implementation in the
next step.
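For illustration only, the general idea of turning a swizzle mask into an ordered list of register-register moves can be sketched in C as below. This is not the code added by this commit: it is the textbook parallel-move sequentialization, it uses a single scratch register rather than two, and it makes no claim of emitting an optimal sequence. `NUM_REGS`, `TMP_REG`, `Move`, `decompose_swizzle()` and `apply_moves()` are all hypothetical names.

```c
#include <assert.h>

#define NUM_REGS  4
#define TMP_REG   NUM_REGS                 /* hypothetical scratch register index */
#define MAX_MOVES (NUM_REGS + NUM_REGS / 2)

typedef struct Move { int dst, src; } Move;

/* Does any still-pending move (other than the one writing d) read register d? */
static int is_read(const int *src, const int *pending, int d)
{
    for (int j = 0; j < NUM_REGS; j++)
        if (pending[j] && j != d && src[j] == d)
            return 1;
    return 0;
}

/* mask[i] names the register whose old value must end up in register i.
 * Writes the ordered moves to out[] and returns how many were emitted. */
int decompose_swizzle(const int mask[NUM_REGS], Move out[MAX_MOVES])
{
    int src[NUM_REGS], pending[NUM_REGS], remaining = 0, n = 0;

    for (int i = 0; i < NUM_REGS; i++) {
        assert(mask[i] >= 0 && mask[i] < NUM_REGS);
        src[i]     = mask[i];
        pending[i] = mask[i] != i;         /* identity entries need no move */
        remaining += pending[i];
    }

    while (remaining) {
        int progress = 0;

        /* Emit every move whose destination is no longer needed as a source */
        for (int d = 0; d < NUM_REGS; d++) {
            if (!pending[d] || is_read(src, pending, d))
                continue;
            out[n++] = (Move) { d, src[d] };
            pending[d] = 0;
            remaining--;
            progress = 1;
        }
        if (progress)
            continue;

        /* Only cycles are left: park one register in the scratch register
         * and redirect its reader there, which unblocks that cycle */
        for (int d = 0; d < NUM_REGS; d++) {
            if (!pending[d])
                continue;
            out[n++] = (Move) { TMP_REG, d };
            for (int j = 0; j < NUM_REGS; j++)
                if (pending[j] && src[j] == d)
                    src[j] = TMP_REG;
            break;
        }
    }
    return n;
}

/* Replay the moves on a register file with NUM_REGS + 1 entries */
void apply_moves(int regs[NUM_REGS + 1], const Move *moves, int n)
{
    for (int i = 0; i < n; i++)
        regs[moves[i].dst] = regs[moves[i].src];
}
```

In this sketch, entries that merely duplicate a register are emitted first, since nothing reads their destination; each remaining cycle then costs one extra move through the scratch register.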
Co-authored-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: Niklas Haas <git@haasn.dev>
The old x86 backend was the only backend that actually mutated the ops list.
With this gone, we can constify this parameter.
Signed-off-by: Niklas Haas <git@haasn.dev>
This is no longer needed now that both C and x86 are ported to uops.
The other ff_sws_setup_*() functions are still used by the aarch64 backend.
Signed-off-by: Niklas Haas <git@haasn.dev>
This is a ground-up refactor of the existing x86 ops code, using the new
uops macros to auto-generate every single kernel instance without guesswork.
While I was at it, I also cleaned up the file a bit and made sure we have only
a single, consistent way of writing/defining the kernels. This also gets rid
of some of the old boilerplate like decl_pattern.
Most kernels are trivial ports, but a few deserve attention or note:
- SWS_UOP_LINEAR is now generated more efficiently, thanks to the distinction
between 0/1/arbitrary components. I also rewrote the code to keep track of
whether the output was initialized yet or not, which lets us skip the
initial `xorps` and `addps` for the first component.
- SWS_UOP_PERMUTE is generated automatically by using some NASM logic to
detect permutation cycles and emit the minimal sequence of `mova`
instructions. SWS_UOP_COPY, on the other hand, is implemented naively. I
originally had a more complex implementation that could handle both, but
I decided it really isn't worth the complication just to save 2-3 cycles.
- SWS_UOP_SCALE now has a native 8-bit implementation, which is faster than
falling back to C code.
- SWS_UOP_SWAP_BYTES is no longer compiled as a type-agnostic pshufb, instead
we hard-code the shuffle mask
- SWS_UOP_DITHER is now much simpler and avoids branching etc. entirely
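As a rough C illustration of the SWS_UOP_LINEAR point above (the real kernels are NASM, and this is not their code), the sketch below evaluates one matrix row while distinguishing zero, unit and arbitrary coefficients, and tracks whether the accumulator has been initialized so that the first term is a plain move instead of a zero-fill followed by an add. `linear_row()` and `OpCount` are hypothetical names; the counters exist only to make the saved operations visible.

```c
#include <assert.h>
#include <stddef.h>

typedef struct OpCount { int muls, adds; } OpCount;

/* Computes out = sum(coef[i] * in[i]) + offset for a single matrix row.
 * Note: skipping zero coefficients assumes finite inputs (0 * Inf is NaN). */
float linear_row(const float *coef, const float *in, size_t n,
                 float offset, OpCount *count)
{
    float out = 0.0f;
    int initialized = 0;

    for (size_t i = 0; i < n; i++) {
        float term;
        if (coef[i] == 0.0f)
            continue;                 /* zero entry: contributes nothing */
        if (coef[i] == 1.0f) {
            term = in[i];             /* unit entry: no multiply needed */
        } else {
            term = coef[i] * in[i];   /* arbitrary entry */
            count->muls++;
        }
        if (initialized) {
            out += term;
            count->adds++;
        } else {
            out = term;               /* first term: move, no zero-init + add */
            initialized = 1;
        }
    }

    if (offset != 0.0f) {
        if (initialized) {
            out += offset;
            count->adds++;
        } else {
            out = offset;
        }
    }
    return out;
}
```

For the row {1, 0, 2} this needs one multiply and one add, where a naive loop would spend three of each plus the initial zeroing.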
Signed-off-by: Niklas Haas <git@haasn.dev>
Rather than hard-coding a separate set of NASM macros, or generating them
with a separate function, we can just leverage the C preprocessor to generate
a NASM source file *from* the existing ops macros.
This is maybe a bit unorthodox, but it avoids unnecessary overhead from
re-generating the macros twice, avoids manual updating of the NASM macros,
and generally does not come with any real downside except being a bit ugly.
The main source of ugliness is the fact that the C preprocessor expands
everything into a single line, whereas NASM expects separate statements to
be on separate lines. Very fortunately, we can work around this by writing
another NASM macro to take its arguments and dump them onto multiple lines.
It may seem premature, but I went ahead and defined all the macros, since
it was easy enough to do.
I added the %include in this commit so that any build errors caused by this
file surface in the same commit that introduces it.
Signed-off-by: Niklas Haas <git@haasn.dev>
The ops.h infrastructure currently hard-codes this as SWS_PIXEL_F32,
but I want to at least properly parametrize this in case we ever
decide to revisit this decision in the future. In particular, it
may become relevant for trivial kernels or kernels whose intermediates
are bounded, exact integers (which could possibly be output directly
as e.g. U16 or U32).
The FATE change is just because the filter op names gained a suffix.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Analog of SWS_UOP_READ_PLANAR_FV for FMA-enabled backends.
The logic for determining when we can safely use FMA is maybe a bit
obtuse, given that a `return type == SWS_PIXEL_U8` would have just done
the trick as well, but better to be safe than sorry, if we ever decide to
tune this constant in the future.
Signed-off-by: Niklas Haas <git@haasn.dev>
This is like SWS_UOP_LINEAR but parametrized by which matrix entries can use
FMA instead of bitexact IEEE mul/add instructions.
I decided to make these a separate uop to avoid bogging down the reference
backend with arch-specific details like FMA. However, I think FMA ops are quite
common/universal so I pre-emptively split it into its own separate flag rather
than defining something like SWS_UOP_FLAG_X86.
Signed-off-by: Niklas Haas <git@haasn.dev>
And SWS_BITEXACT|SWS_ACCURATE_RND, for completeness. This roughly doubles
the runtime of the uops macros generation. Let's hope it doesn't explode
further.
Signed-off-by: Niklas Haas <git@haasn.dev>
This list is currently empty but will be expanded by the following commit.
I briefly tested whether it would be worth avoiding the free/realloc on
the uops array, but found the performance difference to be negligible.
Signed-off-by: Niklas Haas <git@haasn.dev>
This ensures 100% coverage of all uop primitives by generating the set of
tests exactly from the list of seen primitives, using the uops macros.
There are some annoying quirks still because of the fact that we have to
essentially "untranslate" the uops back into SwsOps that yield the
intended uop again after translation, but overall it's not too bad and still
much better than the status quo of hand-rolling the list of test cases.
Signed-off-by: Niklas Haas <git@haasn.dev>
Removes the 1x1 dither fast path, mirroring the previous commit.
This is not really needed nor useful but it will make the transition to
the uops architecture slightly easier, as 1x1 dither gets reinterpreted
as SWS_UOP_ADD there.
Signed-off-by: Niklas Haas <git@haasn.dev>
This is broken because it fails to check dither.y_offset[] to determine if
dithering for a channel is requested or not.
This is unnecessary because the generic dither code already jumps over unused
components, which is cheap enough not to worry about this special case for
now.
This code will, in any case, soon be replaced by a uops_macros.h-derived
approach. This commit is only needed as a stopgap to make checkasm continue
working after the sws_uops refactor.
Signed-off-by: Niklas Haas <git@haasn.dev>
As well as the packed shuffle solver. These don't really interact with
the rest of the code in ops_int.asm, which is, by name at least, intended for
integer op kernels.
More importantly, these functions will be shared with the uops rewrite.
Signed-off-by: Niklas Haas <git@haasn.dev>
Instead of choosing by hand which kernels to implement, this rewrite focuses
on leveraging the power of uops_macros.h to auto-generate all needed kernels.
This not only simplifies maintenance, but also improves performance.
I have decided to develop the replacement backend as a separate file, under
a separate prefix, for the explicit purpose of being able to verify the
correctness of the rewrite using the current backend as a checkasm reference.
The code for the kernels themselves has been largely copied from the old
C backend, modified slightly to conform to the uop template style. This does
result in some code duplication, but a following commit will clean it up.
I nonetheless want to preserve this commit for bisection purposes, to ensure
we have one commit that contains both backends side-by-side.
Overall speedup=1.182x faster, min=0.197x max=3.450x
The big slowdowns are flukes caused by tiny deviations in the runtime of
a noop memcpy conversion.
As a nice side benefit, the compiled binary is now also ~10% smaller, and
the code ~50% smaller.
Signed-off-by: Niklas Haas <git@haasn.dev>
This will eventually replace the existing op_match() and
ff_sws_op_compile_tables(), but I've decided to introduce it separately first
so that I can incrementally update the backends to use the new API, at the
cost of some temporary code duplication.
Signed-off-by: Niklas Haas <git@haasn.dev>
This follows the same approach as is used currently by ops_entries_aarch64,
except I decided to have the generation logic live directly in uops.c
to allow re-using internal helpers and move it closer to the other helpers
that depend on the exact set of uops and their fields.
Unlike libswscale/tests/sws_ops.c, we make an effort to actually test all
relevant flag combinations, since these can affect the generated op lists.
I will use these macros to auto-generate both the C template-based kernels,
as well as the entire x86 backend, in the near future, hence their excessive
flexibility.
Re-use the libswscale/tests/sws_ops.c that we already compile. We could put it
in its own file but this is just as convenient, and it's easily moved anyways.
Having it be a FATE test ensures that it is always up-to-date.
Signed-off-by: Niklas Haas <git@haasn.dev>
This will replace the fuzzy matching logic in op_match() that is used by the
C and x86 implementations, as well as the translation to AARCH64_OP_* that is
used by the NEON asmgen backend.
Down the line, this function will also take a set of flags to enable
backend-specific kernels like FMA variants, but I also decided to keep it
initially simple to ease the transition.
Signed-off-by: Niklas Haas <git@haasn.dev>
Taken from AARCH64_OP_*, but generalized/simplified a bit and updated to add
missing op types, especially for special cases that already have dedicated
implementations on x86.
This initial definition is kept intentionally simple and close to SwsOp, to
make it easier to port the existing ops backends to the new infrastructure.
However, in the future, this will be refactored dramatically - distinctions
like convert vs expand will cease to exist on the SwsOp level, and will
instead be introduced by separate optimization passes on the uops level.
SWS_UOP_LINEAR in particular will most likely be broken up into multiple
uops. I also took this opportunity to redefine the mask in a more useful way.
I decided to split up SWS_OP_CONVERT as well, because it was making x86
codegen unnecessarily difficult due to the strong interaction between exact
pixel sizes.
Signed-off-by: Niklas Haas <git@haasn.dev>
Forming what will be the start of a larger helper file for backend-internal
translation of higher-level ops into lower level kernels. This header file
needs to be includable from independent source files, as it will be used to
provide definitions for build-time code generation (e.g. ops_asmgen.c), so
it must be self-contained.
Pulling in all of ops.h from uops.h would be too large a dependency, since
ops.h pulls in graph.h, refstruct, bprint, etc. It's easier to start from a
fresh file that is documented as being usable at compile time.
For now, just declare the common types that will be needed by the uops layer.
Signed-off-by: Niklas Haas <git@haasn.dev>
This suppresses the addition of #line directives in the preprocessed output,
which is what we want when we're invoking the hostcc just to preprocess some
files. (Currently, this variable is only used for configure-internal checks
anyways, but I want to use it to preprocess a NASM file)
On MSVC/Intel, /EP is the equivalent syntax, though we use -EP instead for
consistency.
Signed-off-by: Niklas Haas <git@haasn.dev>
The current approach of re-testing the C reference for every backend
separately leads to both confusing output (e.g. having an extra redundant
`memcpy_c` line for every op, even those not implemented by the memcpy
backend), as well as a lot of unnecessary wasted time re-testing and
re-benching the same C variant for every backend.
This new API function lets us test the C function only a single time, while
simultaneously having all of the other backends implicitly compare themselves
against the C reference.
Signed-off-by: Niklas Haas <git@haasn.dev>
These have horrible support in legacy swscale; in particular, they break the
pixel range (limited vs full) when converting to yuva444p, resulting in SSIM
errors like:
uyva 96x96 -> grayf32le 96x96, SSIM={Y=0.997654 U=1.000000 V=1.000000 A=1.000000} loss=1.876414e-03
loss 1.876414e-03 is worse by 1.864254e-03, expected loss 1.215935e-05
(The ops-based backend gets a 100% bit-exact roundtrip here)
Signed-off-by: Niklas Haas <git@haasn.dev>
Uses the internal ff_sws_test_pixfmt_backend() to test for format support
on the concrete backend that's in-use for the auxiliary/main conversions,
respectively, while taking into account the -backends and -api options.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
When the user passes multiple backends (e.g. SWS_BACKEND_ALL), the
static check in sws_setup_frame() might have succeeded for the ops
backend but not the legacy backend, so we need to properly restrict
the legacy backend implementation function as well. Otherwise, this
may trigger internal errors / AVERROR(EINVAL) inside sws_init_context().
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This will effectively disable the cache but allows the cache layer to verify
cached files against the original input file. Useful only for debugging
the shared cache protocol itself, as file corruption can already be caught by
the CRC check.
Decided to split this off from the previous commit in case we
ever want to revert it, since it does double the overhead of the spacemap
as well as adding extra overhead to both the read and write path.
Bump the cache version to 2 to reflect the changed disk format.
This adds a new protocol shared:URI which is distinct from the existing
`cache:` in that it is explicitly designed to be thread-safe and cross-process,
enabling multiple ffmpeg processes (or multiple ffmpeg decoders within the same
process) to share a single cache file, for e.g. a remote HTTP stream. As such,
it uses a radically different internal design.
To facilitate zero-knowledge cross-process interoperability, the cache file
itself is just a memory-mapped representation of the underlying file data,
which has the side benefit that the resulting cache file will contain a
working copy of the streamed file (assuming the stream was read to
completion).
To keep track of which regions are cached and which are not, we use a
secondary file that contains a minimal header along with a static bytemap of
blocks within the file. This secondary file is also used to store metadata
such as the filesize, if known, as well as marking "failed" blocks.
Both files can grow dynamically in order to accommodate larger/growing files,
and can be atomically updated (through the use of shared space maps). I have
extensively checked the space map initialization and update code for race
conditions, and I believe the current design to be solid.
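To make the block bytemap idea concrete, here is a minimal C sketch of how per-block state could be tracked and claimed atomically. The state names, the 64 KiB block size and every function name are assumptions made for illustration; the actual on-disk layout and update protocol are whatever the patch itself defines. The sketch also assumes that lock-free byte atomics behave correctly on memory shared between processes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-block states, one byte per block */
enum { BLOCK_EMPTY = 0, BLOCK_PENDING, BLOCK_CACHED, BLOCK_FAILED };

#define BLOCK_SHIFT 16   /* hypothetical block size of 64 KiB */

/* Map a byte offset in the stream to its block index */
size_t block_index(uint64_t offset)
{
    return (size_t) (offset >> BLOCK_SHIFT);
}

/* Try to become the one writer responsible for fetching this block.
 * Exactly one caller can win the EMPTY -> PENDING transition. */
bool block_try_claim(atomic_uchar *map, size_t idx)
{
    unsigned char expected = BLOCK_EMPTY;
    return atomic_compare_exchange_strong(&map[idx], &expected, BLOCK_PENDING);
}

/* Publish the outcome once the block's data has been written (or not) */
void block_finish(atomic_uchar *map, size_t idx, bool ok)
{
    atomic_store(&map[idx], ok ? BLOCK_CACHED : BLOCK_FAILED);
}

/* Readers may only use the data file for blocks marked as cached */
bool block_is_cached(atomic_uchar *map, size_t idx)
{
    return atomic_load(&map[idx]) == BLOCK_CACHED;
}
```

In such a scheme a reader that loses the claim would wait for, or re-check, the block state instead of fetching the same range a second time.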
It is, to some extent, the user's responsibility to ensure that the
same URI is not used for different streams, as we rely on the URI to uniquely
identify the cache files. That said, we use a cryptographic hash with
sufficient collision resistance to protect against possible abuse. The lack of
any implicit default on `-cache_dir` also means that `shared:` can't be enabled
via URL injection to possibly access random files on the disk (or intentionally
leak content from other streams with similar URIs, even if the cryptographic
hash function is broken).
If the input is expected to grow, we shouldn't make any assumptions about
the file size. This matches e.g. the behavior of streamed protocols like
chunked HTTP, which similarly return ENOSYS for streams of unknown size.
Sponsored-by: nxtedition AB
Signed-off-by: Niklas Haas <git@haasn.dev>
This matches the behavior of e.g. the pipe: protocol, which returns ENOSYS
on account of ffurl_seek() not being implemented.
The previous behavior of returning s->filesize directly is almost surely a
bug, as s->filesize is UINT64_MAX when never initialized.
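The pitfall can be shown with a small stand-alone C sketch. `report_size()` and `FILESIZE_UNKNOWN` are hypothetical names, and plain `-ENOSYS` stands in for FFmpeg's AVERROR(ENOSYS), which has that value on POSIX systems.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Sentinel for "size never initialized", as described above */
#define FILESIZE_UNKNOWN UINT64_MAX

/* Report the stream size, or a negative error code if it is unknown.
 * Returning the sentinel directly would, once converted to a signed
 * 64-bit value, typically come out as -1, which callers cannot tell
 * apart from an unrelated error. */
int64_t report_size(uint64_t filesize)
{
    if (filesize == FILESIZE_UNKNOWN)
        return -ENOSYS;
    return (int64_t) filesize;
}
```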
Sponsored-by: nxtedition AB
Signed-off-by: Niklas Haas <git@haasn.dev>
The checkasm tool originated in x264. It was later rewritten and
modernized for FFmpeg (and relicensed to LGPL). For the dav1d
project, it was relicensed again to 2-clause BSD (with permission
from the relevant authors).
The FFmpeg and dav1d implementations of checkasm have since evolved
independently (with some amount of ported code between the two,
with relicensing permission where relevant).
To synchronize the development, and to make it possible to easily
adopt checkasm in other projects, it has been split out into a
standalone project/library on its own, developed at
https://code.videolan.org/videolan/checkasm/.
That version has all the features of checkasm in both FFmpeg and
dav1d, and has got a number of extra improvements on top:
- More/fixed tests (e.g. properly clobbering high bits of 32-bit registers
on most platforms),
- Vastly improved overall performance / runtime for benchmarking, due
primarily to the ability to scale the runtime of each test to that test's
complexity.
- Much more robust statistical analysis of benchmarking results; including
robust outlier rejection, an estimation of the histogram, and the ability
to report the variance / stddev in addition to the (trimmed) mean.
- Interactive HTML and JSON output formats in addition to CSV/TSV.
- More readable and user-friendly output across the board, especially for
failures and data dumps (e.g. also showing errors inside padding bytes).
- Better cross-platform support, including dynamic fallback of timer
implementations on ARM platforms, a better RISC-V and AArch64 harness,
and more.
On AArch64, it tests which timer out of pmccntr_el0, Linux perf,
macOS kperf, cntvct_el0 is available, without the user needing to
configure things, and falls back on clock_gettime if none of
them can be used. This means one automatically gets the best
available timer, if userspace access to pmccntr_el0 has been
unlocked with a kernel module, or if one has permission to use
the perf API, or if the cntvct_el0 is exact enough to be useful.
On AArch64 macOS, there is now a test harness that catches clobbered
registers and stack clobbering, like on other platforms.
- An option for setting affinity, for benchmarking on heterogeneous
core systems. (On Linux, this is already easily done through
taskset, but on Windows, the checkasm built-in option makes it
possible there as well, and portably.)
- Printing of the tested CPU core name, where possible.
To integrate this external implementation of checkasm into FFmpeg,
without having to build libcheckasm as an external library, the upstream
sources are added as a git subtree, and integrated into the FFmpeg
build system as a foreign source.
For the long and storied history of how we arrived at this solution,
see: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/22546
The relevant config headers for checkasm are generated by configure,
and the sources are built as part of the main ffmpeg build. The
upstream sources, while they use meson as primary build system,
are structured to make it easy to build as part of a foreign build
system.
The existing testcases are mostly kept untouched (only three minor
changes are required, in crc.c, sw_ops.c and vp8dsp.c), while the
majority of the logic from checkasm.c, checkasm.h and the arch
specific assembly files are removed, replaced with the external
implementation.
Co-Authored-By: Martin Storsjö <martin@martin.st>
Signed-off-by: Niklas Haas <git@haasn.dev>
To reproduce this commit, run:
$ git subtree add --squash --prefix=tests/checkasm/ext \
https://code.ffmpeg.org/FFmpeg/checkasm.git master
To update at a later point in time, replace `add` by `pull`
Pre-emptively exclude the external checkasm sources. Split off from the
following merge commit to make the history easier to follow.
Signed-off-by: Niklas Haas <git@haasn.dev>
Not only is this duplicating code, but it also hard-codes a reference to
`checkasm_lfg`, which I want to eliminate in the interest of being able to
switch out the checkasm implementation.
The test data size is quite large, so re-setting up unused data is eating up
quite a significant amount of CPU time.
This commit cuts execution time of sw_ops in half.
Signed-off-by: Niklas Haas <git@haasn.dev>
If the user passes `-backends all` but without `-flags unstable`, then the
default/legacy backend will be picked unless it doesn't support a given
pixel format.
This allows gradually opting into the new code to handle more pixel formats
than what the legacy backend currently supports, without disturbing the
predictable output/behavior.
Signed-off-by: Niklas Haas <git@haasn.dev>
This allows constraining the set of available backends. This serves as a
better replacement for the "unstable" flag, which is a bit ambiguous. Allows
users to, for example, opt into the memcpy or x86 backend, while excluding
e.g. the upcoming JIT backends.
Signed-off-by: Niklas Haas <git@haasn.dev>
This writes 4 bytes but in SSE4 mode only produces 2 bytes per vector. We
can avoid over-writing by using the appropriately sized register.
Reproducible by:
make libswscale/tests/swscale
libswscale/tests/swscale -dst monob -unscaled 1 -flags unstable -align_src 1 -align_dst 1
Signed-off-by: Niklas Haas <git@haasn.dev>
These loops were both assuming that `h` lines need to be copied; but this
varies. First of all, for plane subsampling; but more importantly, when
vertically scaling, the input line count may be substantially lower than the
actual line count.
This fixes an out-of-bounds read/write when vertically upscaling with a tail
buffer.
Verifiable via e.g.:
make libswscale/tests/swscale
valgrind -- libswscale/tests/swscale -s 63x63 -src yuv444p -dst rgb24 \
-flags unstable -align_src 1 -align_dst 1
(As well as the SSIM scores, which drop from ~e-5 to ~e-3 without this fix)
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
libplacebo versions before v365 passed .flags = 0 when retrieving the queues
from imported Vulkan devices, so we have to error out in the case of a mismatch
to avoid undefined behavior (Vulkan spec).
See-Also: https://code.videolan.org/videolan/libplacebo/-/merge_requests/856
Sponsored-by: nxtedition AB
Signed-off-by: Niklas Haas <git@haasn.dev>
These are needed for interop with e.g. libplacebo, which needs to know the
correct flags to call vkGetDeviceQueue2.
Signed-off-by: Niklas Haas <git@haasn.dev>
This is a departure from the conventional idea of decoders always outputting
data as fast as possible. Instead, this allows decoders to be throttled in the
same way filter graphs can be.
This comes into play when e.g. a demuxer is feeding into two decoders, but
only one of the two decoders is actually currently needed (e.g. due to
A/V misalignment). In that case, what typically happens is that the unneeded
decoder also decodes all frames, and then piles them up on the "buffersrc"
filter's downstream link (growing indefinitely).
Another issue this solves manifests when e.g. a single demuxer is feeding many
decoders that all try to feed frames to the same filter graph. In this case,
all decoders run as fast as possible, leading to lock contention on the
filter graph input queue; resulting in (again) many frames piling up on the
buffersrc (or downstream filters) for the unneeded inputs that are not actually
the bottleneck, while the input that's actually undersatisfied can end up
starved for CPU time, possibly for long enough to exhaust memory limits. The
normal rate limiting fails to apply in this scenario because all decoders share
a single demuxer, and are hence rate-limited only by the demuxer speed; whereas
the demuxer is not choked because from the PoV of the scheduler, the filter
graph is simply not getting enough frames.
In a more general sense, there's a philosophical argument to be made here.
Since a decoder is typically also a decompressor, it produces more data than
it consumes. So, in a sense, it's also acting like a type of producer, in
the same way that a filter graph can produce more outputs than inputs.
Solve all of these issues by allowing decoders to be output-choked, which
gives the scheduler control over when decoders are allowed to output frames.
This does mean we have to add some sort of internal packet queue, because the
decoder thread may need to continue *accepting* upstream packets from the
demuxer (or else we risk stalling the demuxer), but defer the actual decoding
by placing them inside an internal "overflow" queue.
This effectively simulates a sort of "filter graph"-type semantics but
for the decoder queue.
This overflow logic is fairly self-contained inside `sch_dec_receive`, though
it is quite nontrivial. I have added as much documentation as is hopefully
needed to understand the logic.
Importantly, we cannot simply unlimit the decoder input thread queue because
the demuxer relies on backpressure from the decoder to rate limit itself. (Note
that demuxers may only be active if there is at least one downstream decoder
that is also active, so we always have at least one decoder providing
backpressure)
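A heavily simplified, single-threaded C model of the overflow idea described above: while the decoder is output-choked, packets are still accepted but parked in an internal FIFO, and they are decoded in their original order once the choke is lifted. None of this mirrors the real `sch_dec_receive` logic. Every name is hypothetical, packets are plain integers, and the fixed capacity is an artifact of using a static array, not a statement about how the real queue is sized.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define OVERFLOW_CAP 8    /* hypothetical capacity of the overflow FIFO */
#define MAX_DECODED  32   /* size of the log used to observe decode order */

typedef struct Decoder {
    bool   choked;                  /* set by the (simulated) scheduler */
    int    overflow[OVERFLOW_CAP];  /* deferred packets, oldest at head */
    size_t head, count;
    int    decoded[MAX_DECODED];    /* packets in the order they were decoded */
    size_t num_decoded;
} Decoder;

/* Stand-in for the actual decoding work */
static void decode(Decoder *d, int pkt)
{
    assert(d->num_decoded < MAX_DECODED);
    d->decoded[d->num_decoded++] = pkt;
}

/* Decode everything that was deferred, oldest first */
static void drain_overflow(Decoder *d)
{
    while (d->count) {
        decode(d, d->overflow[d->head]);
        d->head = (d->head + 1) % OVERFLOW_CAP;
        d->count--;
    }
}

/* Accept a packet from upstream. Returns false only if the packet could
 * not be accepted because the overflow FIFO is full. */
bool dec_receive(Decoder *d, int pkt)
{
    if (d->choked) {
        if (d->count == OVERFLOW_CAP)
            return false;
        d->overflow[(d->head + d->count) % OVERFLOW_CAP] = pkt;
        d->count++;
        return true;                /* accepted, decoding deferred */
    }
    drain_overflow(d);              /* keep packets in arrival order */
    decode(d, pkt);
    return true;
}

/* Called by the (simulated) scheduler to choke or unchoke the output */
void dec_set_choked(Decoder *d, bool choked)
{
    d->choked = choked;
    if (!choked)
        drain_overflow(d);
}
```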
Sponsored-by: nxtedition AB
Signed-off-by: Niklas Haas <git@haasn.dev>
When a filter is choked, but upstream threads are trying to write to its input,
this can result in the filter's input queue getting stuck. Normally, the
unchoke_downstream() logic would prevent this from happening, since the
filter would itself get unchoked as a result of upstream decoders receiving
pressure from the demuxer.
However, upcoming changes to this logic will require weakening this upstream
unchoking logic, so preventing the deadlock in a more elegant way helps with
making the code more robust.
Sponsored-by: nxtedition AB
Signed-off-by: Niklas Haas <git@haasn.dev>
Exactly what it says on the tin. There is some ambiguity as to whether this
should also prevent reading from a *choked*, as opposed to empty, queue, but
I think it makes sense to consider them equivalent, as I struggle to think
of a use case where it would be beneficial to allow draining a queue that
was explicitly choked by the upstream (to e.g. prevent further reads).
Sponsored-by: nxtedition AB
Signed-off-by: Niklas Haas <git@haasn.dev>
schedule_update_locked() is supposed to be a no-op when `sch->terminate`
is 1. However, there is a TOCTOU error here, where a different thread may
currently be executing schedule_update_locked(), having successfully passed
the sch->terminate check but without actually updating the choke status.
This does not matter for the current code, but will matter with the following
commit, where it creates the theoretical possibility of a race where sch_stop()
is trying to choke the demuxers (and unchoke the decoders) while
schedule_update_locked() is simultaneously trying to choke the decoders,
leading to a deadlock if the last decoder is left choked and unable to
propagate EOF downstream.
The cleanest solution is to just take the scheduler lock while updating the
choke status here. This ensures that any other schedule_update_locked() calls
will have completed.
Sponsored-by: nxtedition AB
Signed-off-by: Niklas Haas <git@haasn.dev>
Instead of awkwardly looping over the type, just split this up into
multiple loops. The reduction in complexity seems worth the loss in conciseness
to me, and more importantly, this allows us to easily add more waiter types.
Sponsored-by: nxtedition AB
Signed-off-by: Niklas Haas <git@haasn.dev>
The code made the fundamental assumption that over-read into the padding
bytes is okay to do; because the most that can happen is that those pixel
values end up corrupted, which doesn't affect any adjacent pixels.
However, this is not true for SWS_OP_FILTER_H, because this operation
fundamentally mixes together horizontal pixels. Normally, this was fine,
because the filter weights for those pixels are set to 0, and 0 * x = 0.
However, that is not true for floating point inputs, which can contain
Infinity; and 0 * Infinity = NaN, thus corrupting the entire pixel.
Solve it by specifically preventing over-read when it would be unsafe.
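The failure mode is easy to reproduce in isolation. The C sketch below is a scalar stand-in for a horizontal filter tap loop, not the actual kernel: `filter_taps_naive()` multiplies every tap even when its weight is zero, while `filter_taps_bounded()` never reads past the last valid sample. Both function names and the `valid` parameter are hypothetical.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Weighted sum over all taps, including zero-weight taps that land on
 * padding. If the padding holds Infinity, 0 * Inf poisons the sum. */
float filter_taps_naive(const float *src, const float *weights, size_t taps)
{
    float sum = 0.0f;
    for (size_t i = 0; i < taps; i++)
        sum += weights[i] * src[i];
    return sum;
}

/* Same sum, but samples at index >= valid are never read at all */
float filter_taps_bounded(const float *src, size_t valid,
                          const float *weights, size_t taps)
{
    float sum = 0.0f;
    for (size_t i = 0; i < taps && i < valid; i++)
        sum += weights[i] * src[i];
    return sum;
}
```

With two valid samples followed by Infinity in the padding and weights {0.5, 0.5, 0, 0}, the naive version returns NaN while the bounded one returns the expected average.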
Signed-off-by: Niklas Haas <git@haasn.dev>
This ensures that the ops printing path goes through the same code as the
actual ops dispatch backend, including all sub-passes etc.
Signed-off-by: Niklas Haas <git@haasn.dev>
Allows the uops macro generation code to not actually compile any passes.
More generally, this could be used to e.g. test if an op list is supported by
a backend without actually creating the passes.
The `bool first` change is needed because the `input == prev` check no longer
works if we don't actually compile any passes.
Signed-off-by: Niklas Haas <git@haasn.dev>
This will be used eventually when I rewrite checkasm/sw_ops to re-use the
code in ops_dispatch.c instead of hand-rolling the execution layer.
Signed-off-by: Niklas Haas <git@haasn.dev>
This function actually lives in ops_dispatch.c, and doesn't really make
sense in ops.h anymore. We should also move some stuff out of ops_internal.h,
which doesn't depend on any external ops stuff, here.
This allows the backend/compilation-related stuff to co-exist more nicely.
Signed-off-by: Niklas Haas <git@haasn.dev>
Using the configured scaler from the SwsContext implicitly. This does affect
the output of libswscale/tests/sws_ops.c, which now prints about 4x as much
data (taking roughly 4x as long, but still within a second on my machine).
We can make this process a lot faster by forcing SWS_SCALE_POINT as the
scaler, which skips calculating any actual filter weights in favor of
generating a trivial 1-tap filter.
Signed-off-by: Niklas Haas <git@haasn.dev>
The only difference here is an extra ff_sws_add_filters() call, which is
a no-op because src w/h = dst w/h = 16.
Signed-off-by: Niklas Haas <git@haasn.dev>
This no longer accesses prev/next as a result of the `unused` removal, so
the signature can be simplified to just take the op directly.
Signed-off-by: Niklas Haas <git@haasn.dev>
We have other op types that skip checking the data even in non-flexible mode,
so there is a precedent for just leaving out `flexible` for such kernels.
Signed-off-by: Niklas Haas <git@haasn.dev>
Mirroring the precedent established by the other SwsOp-generating functions.
This allows us to re-use it for the uops macro generator.
Signed-off-by: Niklas Haas <git@haasn.dev>