Use a call/ret pair instead of awkwardly exporting and then jumping
back to the return label.
This is similar to c29465bcb6, but for aarch64.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
Scaling ops were added to ff_sws_enum_op_lists() in 1d841635. But the
code that skipped scaling ops in convert_to_aarch64_impl() wasn't
taking into consideration that, in sws_ops_aarch64, the scaling ops
aren't folded into read ops.
Also updates libswscale/aarch64/ops_entries.c with the new entries.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
This commit reduces every kernel by one instruction, for example:
function ff_sws_clear_8_u16_0001_neon, export=1, jumpable=1
- ldr x0, [x1] // SwsFuncPtr cont = impl->cont;
ldr q16, [x1, #16] // v128 clear_vec = impl->priv.v128;
+ ldr x0, [x1], #32 // SwsFuncPtr cont = (impl++)->cont;
dup v0.8h, v16.h[0] // vl[0] = broadcast(clear_vec[0])
- add x1, x1, #32 // impl += 1;
br x0 // jump to cont
endfunc
A55: Overall speedup=1.066x faster, min=0.881x max=1.288x
A76: Overall speedup=1.012x faster, min=0.570x max=1.546x
The large min/max differences are due to pathological branch miss cases
that happen either before or after this commit.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The fate test for unscaled conversions (fate-sws-unscaled) does not
test the filtering (scaling) paths.
This commit adds a test for all the scaling paths for the new swscale
code, but only runs 2% of the tests (otherwise this test alone would
take about two and a half minutes on a modern x86_64 machine).
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
This is more explicit than -flags unstable, because it also excludes
any pixel formats that are only handled by the legacy code.
Sponsored-by: Sovereign Tech Fund
Co-authored-by: Niklas Haas <git@haasn.dev>
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
yaf32be input wasn't being properly filtered out of the simple copy
condition when converting to grayf32be, which caused half the luma
plane to be interleaved with half the alpha plane, instead of just
copying the luma plane.
This fixes:
$ ./libswscale/tests/swscale -scaler none -src yaf32be -dst grayf32be
yaf32be 96x96 -> grayf32be 96x96, flags=0x0 dither=1 scaler=0/0, SSIM={Y=0.866842 U=1.000000 V=1.000000 A=1.000000} loss=1.065262e-01
yaf32be 96x96 -> grayf32be 96x96, flags=0x0 dither=1 scaler=0/0
loss 1.065262e-01 is WORSE by 1.065202e-01, expected loss 6.020069e-06
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
This was triggering an assertion in x86:
$ ./libswscale/tests/swscale -scaler none -backends unstable -src rgb24 -dst yaf32be
Assertion c->srcBpc == 16 failed at src/libswscale/x86/swscale.c:637
Aborted
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
This fixes:
$ ./libswscale/tests/swscale -scaler none -backends unstable -src rgb24 -dst yaf32be
... and exposes another bug in x86 when -cpuflags 0 is not passed,
which will be fixed in the next commit.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The -unscaled parameter has been removed in favour of "-scaler none".
Some legacy scalers cannot be selected with these options (e.g. SWS_X
and SWS_FAST_BILINEAR). To test these, the -flags parameter should be
used instead.
This option sets the scaler/scaler_sub fields in SwsContext. There is a
comment about these fields in struct SwsContext:
Note: Does not affect the legacy (stateful) API.
This comment is not entirely correct, since scaler/scaler_sub are taken
into consideration to select the algorithm, but that doesn't update the
flags field, which is still used to select implementations:
libswscale/x86/swscale.c:574: if (c->opts.flags & SWS_FAST_BILINEAR && c->canMMXEXTBeUsed) {
libswscale/ppc/swscale_vsx.c:2033: if (c->opts.flags & SWS_FAST_BILINEAR && c->opts.dst_w >= c->opts.src_w && c->chrDstW >= c->chrSrcW) {
libswscale/swscale_unscaled.c:2465: && (!needsDither || (c->opts.flags&(SWS_FAST_BILINEAR|SWS_POINT))))
libswscale/swscale_unscaled.c:2650: if (c->opts.flags&(SWS_FAST_BILINEAR|SWS_POINT)) {
libswscale/utils.c:1279: && !(sws->flags & SWS_FAST_BILINEAR)
libswscale/utils.c:1388: (flags & SWS_FAST_BILINEAR)))
libswscale/utils.c:1417: && (flags & SWS_FAST_BILINEAR)) {
libswscale/utils.c:1437: if (flags & SWS_FAST_BILINEAR) {
libswscale/utils.c:1648: if (c->canMMXEXTBeUsed && (flags & SWS_FAST_BILINEAR)) {
libswscale/swscale.c:678: if (c->opts.flags & SWS_FAST_BILINEAR) {
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The "-legacy 1" option was added in 101a2f6fc6 to run the main
conversion with the legacy scaler. This was done by forcing the use of
the legacy API. This way, it was possible to pass "-flags unstable" and
still ensure the legacy scaler path was being taken.
New legacy-related parameters will be added to the test tool, so it
makes sense to rename the -legacy option to reflect what it was
actually doing.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The formats added in e93de9948d keep the values in the most significant
bits of the uint16_t, and packed30togbra10() and gbr16ptopacked30()
weren't taking into consideration the shift field from AVComponentDescriptor.
Reproducible with:
$ ./libswscale/tests/swscale -unscaled 1 -src gbrp10msbbe -dst x2rgb10le
$ ./libswscale/tests/swscale -unscaled 1 -src x2rgb10le -dst gbrp10msbbe
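As a minimal sketch of what honouring the shift field means (the helper
names below are hypothetical and are not the swscale functions; only the
shift and depth fields come from AVComponentDescriptor), a 10-bit value
kept in the most significant bits of a uint16_t has shift 6:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: extract a component that is stored 'shift' bits
 * above the least significant bit of a 16-bit word. */
static uint16_t unpack_component(uint16_t word, int shift, int depth)
{
    assert(depth > 0 && shift >= 0 && shift + depth <= 16);
    return (uint16_t)((word >> shift) & ((1u << depth) - 1));
}

/* Hypothetical helper: store a component back at the same position. */
static uint16_t pack_component(uint16_t value, int shift, int depth)
{
    assert(depth > 0 && shift >= 0 && shift + depth <= 16);
    return (uint16_t)((value & ((1u << depth) - 1)) << shift);
}
```

With shift 0 (the LSB-aligned layout) both helpers reduce to a plain
mask, which is why code that ignores the shift only works for those
formats.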
The fix from 5fa2a65c11 introduced a regression for non-native-endian
formats (such as rgb565be on a little-endian system).
Reproducible with:
$ ./libswscale/tests/swscale -unscaled 1 -src rgb565be -dst rgb24
Also:
$ ./ffmpeg_g -i /opt/samples/jpegls/128.jls -vf "scale=size=512x512,format=rgb24,scale=flags=neighbor,format=rgb565be" -f rawvideo -vframes 1 -y rgb565be.raw
$ magick -size 512x512 -endian MSB RGB565:rgb565be.raw output.png
$ ./ffplay_g output.png
(note: don't use ffmpeg to convert from rgb565be.raw to output.png for
the test above, since it would hit the same bug and cancel out the error)
When running with "-v 0", the test parameters were not being printed,
which made it hard to track down which conversion the error referred
to.
Now the test parameters are logged with av_log() when a loss error
happens.
The -p, -flags, and -unscaled options all affected the decision to
select a subsample of the tests to run. When specifying -p 0.1, about
57% of the tests would run instead of the expected 10%.
This commit fixes this by separating -p from -flags and -unscaled.
The makeinfo_html variable wasn't being disabled when the makeinfo test
failed, which prevented texi2html from being probed.
Fixes 589da160b2.
Found-by: Luke Jolliffe <luke.jolliffe@bbc.co.uk>
Only the process functions are entered via an indirect _call_ from C.
The kernel functions and process_return are dispatched to by indirect
_branches_ instead (continuation-passing style design).
Make use of the recently added "jumpable" parameter to the function
macro in libavutil/aarch64/asm.S to fix these functions when BTI is
enabled.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The function macro emits AARCH64_VALID_CALL_TARGET for exported symbols,
marking them as valid destinations for indirect _calls_. Functions that
are reached by indirect _branches_ (i.e. tail-call dispatch chains
where the link register is not set) require AARCH64_VALID_JUMP_TARGET
instead.
This commit adds a "jumpable" parameter to the function macro that, when
set, emits AARCH64_VALID_JUMP_TARGET instead of AARCH64_VALID_CALL_TARGET.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
This commit pieces together the previous few commits to implement the
NEON backend for sws_ops.
In essence, a tool which runs on the target (sws_ops_aarch64) is used
to enumerate all the functions that the backend needs to implement. The
list it generates is stored in the repository (ops_entries.c).
The list from above is used at build time by a code generator tool
(ops_asmgen) to implement all the sws_ops functions the NEON backend
supports, and generate a lookup function in C to retrieve the assembly
function pointers.
At runtime, the NEON backend fetches the function pointers to the
assembly functions and chains them together in a continuation-passing
style design, similar to the x86 backend.
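A rough C analogue of the continuation-passing chain described above,
for illustration only: the types, kernel names and integer payload are
hypothetical and do not match the real swscale structures or the
generated assembly. Each kernel reads its continuation and its private
constant from its own entry, then passes the next entry to the
continuation:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the per-kernel entry and function type. */
typedef struct Impl Impl;
typedef void (*KernelFn)(const Impl *impl, int *value);

struct Impl {
    KernelFn cont; /* kernel to continue to after this one */
    int      priv; /* private constant of this kernel */
};

static void kernel_add(const Impl *impl, int *value)
{
    KernelFn cont = impl->cont;
    assert(cont != NULL);
    *value += impl->priv;
    cont(impl + 1, value); /* continue with the next entry */
}

static void kernel_mul(const Impl *impl, int *value)
{
    KernelFn cont = impl->cont;
    assert(cont != NULL);
    *value *= impl->priv;
    cont(impl + 1, value);
}

/* Ends the chain by simply returning to the original caller. */
static void kernel_return(const Impl *impl, int *value)
{
    (void)impl;
    (void)value;
}

static int run_chain(int input)
{
    /* Entry i belongs to kernel i and names kernel i + 1. */
    const Impl chain[] = {
        { kernel_mul,    3 }, /* kernel_add: add 3, continue to mul */
        { kernel_return, 2 }, /* kernel_mul: multiply by 2, then return */
        { NULL,          0 }, /* entry seen by kernel_return (unused) */
    };
    int value = input;
    kernel_add(chain, &value);
    return value;
}
```

Here run_chain(4) computes (4 + 3) * 2. In the assembly the
continuation is reached with an indirect branch rather than a C call.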
The following speedup is observed from legacy swscale to NEON:
A520: Overall speedup=3.780x faster, min=0.137x max=91.928x
A720: Overall speedup=4.129x faster, min=0.234x max=92.424x
And the following from the C sws_ops implementation to NEON:
A520: Overall speedup=5.513x faster, min=0.927x max=14.169x
A720: Overall speedup=4.786x faster, min=0.585x max=20.157x
The slowdowns from legacy to NEON are the same for C/x86. Mostly low
bit-depth conversions that did not perform dithering in legacy.
The 0.585x outlier from C to NEON is gbrpf32le -> gbrapf32le, which is
mostly memcpy with the C implementation. All other conversions are
better.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The NEON sws_ops backend follows the same continuation-passing style
design as the x86 backend.
Unlike the C and x86 backends, which implement the various operation
functions through the use of templates and preprocessor macros, the
NEON backend uses a build-time code generator, which is introduced by
this commit.
This code generator has two modes of operation:
-ops:
Generates an assembly file in GNU assembler syntax targeting AArch64,
which implements all the sws_ops functions the NEON backend supports.
-lookup:
Generates a C function with a hierarchical condition chain that
returns the pointer to one of the functions generated above, based on
a given set of parameters derived from SwsOp.
This is the core of the NEON sws_ops backend.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The runtime assembler interface provides an instruction-level IR and
builder API tailored to the needs of the swscale dynamic pipeline.
It is not meant to be a general purpose assembler interface.
Currently only a static file backend, which emits GNU assembler text,
has been implemented. In the future, this interface will be used to
write functions dynamically at runtime.
This code will be compiled both for runtime usage to generate optimized
functions and for build-time usage to generate static assembly files.
Therefore, it must not depend on internal FFmpeg libraries.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The NEON sws_ops backend will use a build-time code generator for the
various operation functions it needs to implement. This build time code
generator (ops_asmgen) will need a list of the operations that must be
implemented. This commit adds a tool (sws_ops_aarch64) that generates
such a list (ops_entries.c).
The list is generated by iterating over all possible conversion
combinations and collecting the parameters for each NEON assembly
function that has to be implemented, defined by a unique set of
parameters derived from SwsOp. Whenever swscale evolves, with improved
optimization passes, new pixel formats, or improvements to the backend
itself, this file (ops_entries.c) should be regenerated by running:
$ make sws_ops_entries_aarch64
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The legacy scaler is no longer implicitly used to generate a reference
to perform comparisons for every conversion. It is now up to the user
to generate a reference file and use it as input for a separate run to
perform comparisons.
It is now possible to compare against previous runs of the graph-based
scaler, for example to test for newer optimizations.
This reduces the overall time necessary to obtain speedup numbers from
the legacy scaler to the graph-based scaler (or any other comparison,
for that matter) since the reference only needs to be run once.
For example, to check the speedup between the legacy scaler and the
graph-based scaler:
./libswscale/tests/swscale [...] -bench 50 -legacy 1 > legacy_ref.txt
./libswscale/tests/swscale [...] -bench 50 -ref legacy_ref.txt
If no -ref file is specified, we are assuming that we are generating a
reference file, and therefore all information is printed (including
ssim/loss, and benchmarks if -bench is used).
If a -ref file is specified, the output printed depends on whether we
are testing for correctness (ssim/loss only) or benchmarking (time/
speedup only, along with overall speedup).
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
This emphasizes the order of magnitude of the loss, which is what is
important for us.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The reference file uses the same format as the output that this tool
itself prints to stdout.
Malformed reference files cause an error, with a more descriptive error
message. Running a subset of the reference conversions is still
supported through -src and/or -dst.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The test results (along with SSIM) are printed to stdout again so that
the output can be parsed by -ref.
Benchmark results have also been added to the output.
We still need to re-run the reference tests to perform benchmarks, but
this will be simplified in the next few commits.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The conversion parameters, ssim/loss, and benchmark results will
eventually be merged into the same output line.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The low bit depth workaround code is duplicated in this commit, but the
other occurrence will be removed in a few commits, so I see no reason
to factor it out.
The legacy scaler still has some conversions that give results much
worse than the expected loss, but we still want them as reference, so
we don't trigger expected loss errors on conversions with the legacy
scaler.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
We will eventually be able to select between running the new graph-based
scaler or the legacy scaler.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
Support for input and output formats is already checked in run_self_tests().
This reverts commit a22faeb992.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The ref->src conversion only needs to be performed once per source
pixel format.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
This prevents the propagation of dither_error across frames, and should
also improve reproducibility across platforms.
Also remove setting of flags for sws_src_dst early on, since it will
inevitably be overwritten during the tests.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
Remove dimension checks originally added to please static analysis
tools. There is little reason to have arbitrary limits in this
developer test tool. The reference files are under the user's control.
This reverts f70a651b3f and c0f0bec2f2.
Legacy swscale may overwrite the pixel formats in the context (see
handle_formats() in libswscale/utils.c). This may lead to an issue
where, when sws_frame_start() allocates a new frame, it uses the wrong
pixel format.
Instead of fixing the issue in swscale, just make sure dst is always
allocated prior to calling the legacy scaler.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The pixel format for the process loops has already been checked to be
valid at this point.
The switch added in e4abfb8e51 returns AVERROR(EINVAL) in the default
case without calling ff_sws_op_chain_free(chain), but there's no need
to free it since we mark this branch as unreachable.
This gives more information about each operation and helps catch issues
earlier on.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
Instead of unescaping the entire image data buffer in advance, and then
having to perform heuristics to skip over where the restart markers
would have been, unescape the image data for each restart marker
individually.
The initialization code was only being run when mb_y was 0, so it could
just as well be moved out of the loop.
I haven't been able to find a bayer sample that has restart markers to
check whether vpred should be reinitialized at every restart. It would
seem logical that it should, but I have left this out until we find a
sample that does have restart markers.
Use naming for SOS header fields from ISO/IEC 10918-1's non-lossless
mode of operation in ff_mjpeg_decode_sos() instead of mixing JPEG-LS
and lossless names. Each decode function still keeps its correct name
for each field.
For hwaccel, find_marker() was being used to skip over the image data,
which could include multiple restart markers.
For MJPEG-B and THP, the field size was already correct since the image
data was already unescaped.
For the rest (mjpeg and jpegls), the buffer was being incremented by
the unescaped_buf_size, which could be smaller than the actual buffer
size.
Now the buffer is correctly incremented in all cases.
For non-jpegls:
Changes the behaviour to be more in line with IJG's reference implementation:
- optional 0xFF fill bytes in a stuffed zero byte sequence (which is an
invalid pattern according to the standard) are now discarded:
"FF (FF)? 00" => "FF" instead of "FF 00"
- sequences with optional 0xFF fill bytes and a marker are no longer copied:
"FF (FF)? XX" => "" instead of "FF XX"
- a trailing 0xFF byte is no longer issued when a valid "0xFF 0xXX" marker
is found:
"FF XX" => "" instead of "FF"
For jpegls:
Changes the behaviour to be more in line with IJG's (non-jpegls) reference
implementation, similar to the changes above:
- optional 0xFF fill bytes in a stuffed zero bit sequence (which is an
invalid pattern according to the standard) are now discarded:
"FF (FF)? 0b0xxxxxxx" => "FF 0bxxxxxxx" instead of "FF 7F XX"
- sequences with optional 0xFF fill bytes and a marker are no longer copied:
"FF (FF)? 0b1xxxxxxx" => "" instead of "FF 7F"
Unescaping for jpegls is now done in one pass instead of two. The first
pass used to detect the length of the buffer, while the second pass would
copy up to the detected length.
Note that jpegls restart markers are still not supported.
There is also a speed up with the new implementations, mostly due to the
usage of memchr() as suggested by Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
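
The non-jpegls rules listed above, combined with a memchr() scan, can
be sketched roughly as follows. This is an illustration under stated
assumptions, not the code from this commit: unescape_segment() and its
signature are hypothetical, any number of fill bytes is accepted, the
function stops in front of the first marker, and a truncated 0xFF
sequence at the end of the input is left unconsumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: unescape entropy-coded data up to the next
 * marker. dst must have room for 'size' bytes. Returns the number of
 * bytes written; *consumed is the number of input bytes processed. */
static size_t unescape_segment(uint8_t *dst, const uint8_t *src,
                               size_t size, size_t *consumed)
{
    const uint8_t *p = src, *end = src + size;
    size_t out = 0;

    assert(dst && src && consumed);
    while (p < end) {
        /* Copy everything up to the next 0xFF in one go. */
        const uint8_t *ff = memchr(p, 0xFF, (size_t)(end - p));
        size_t run = (size_t)((ff ? ff : end) - p);
        const uint8_t *q;

        memcpy(dst + out, p, run);
        out += run;
        p   += run;
        if (!ff)
            break;

        /* Skip optional 0xFF fill bytes after the first 0xFF. */
        q = p + 1;
        while (q < end && *q == 0xFF)
            q++;
        if (q == end)
            break;             /* truncated sequence: stop before it */
        if (*q == 0x00) {
            dst[out++] = 0xFF; /* "FF (FF)? 00" => "FF" */
            p = q + 1;
        } else {
            break;             /* "FF (FF)? XX" marker: copy nothing */
        }
    }
    *consumed = (size_t)(p - src);
    return out;
}
```

Because the function returns at the marker and reports how much input
it consumed, a caller could invoke it once per restart interval.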