For whatever reason the names of the gamma and delta parameters
have been switched in a few of the warp8x8 asm implementations.
This is a bit confusing, so fix things by switching them back.
This change is purely cosmetic; the output binary is identical.
It previously used 'pixel' which is typedefed to uint8_t in files
that aren't bitdepth-templated, but those are indices and not
pixels so that was just confusing and misleading.
* Use the same approach as AVX2 of using floating-point reciprocal
instructions to replace dav1d_sgr_x_by_x[] table lookups.
* Optimize clipping of p-values in the 10bpc code.
* Rename some macros to clarify their functionality.
* Implement various minor tweaks.
Instead of using gathers we can calculate the value of
sgr_x_by_x[min(z, 255)] by doing 256 / (z + 1) in floating-point
with some clipping for z == 0 and z >= 255.
As the required precision of the division is fairly small it can be
performed using an approximate reciprocal, which is significantly
faster than a regular division.
Gather instructions are slow on all AMD CPUs, and on most Intel
CPUs ever since µcode updates were issued as a workaround for
the Gather Data Sampling side channel vulnerability.
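As a scalar illustration of the calculation described above, the sketch below replaces the table lookup with a floating-point division and clips the two edge cases. The function name `sgr_x_by_x_approx` and the round-to-nearest step are assumptions made for illustration. The asm uses approximate reciprocal instructions on whole vectors and may round differently.

```c
#include <stdint.h>

/* Hypothetical scalar stand-in for a sgr_x_by_x[min(z, 255)] lookup.
 * Assumes the result is 256 / (z + 1) rounded to nearest, with the
 * end points clipped to 255 (z == 0) and 0 (z >= 255). */
static uint8_t sgr_x_by_x_approx(uint32_t z)
{
    if (z == 0)
        return 255; /* 256 would not fit in 8 bits */
    if (z >= 255)
        return 0;   /* clip the tail of the range */
    float x = 256.0f / (float)(z + 1);
    return (uint8_t)(x + 0.5f); /* round to nearest */
}
```

A vectorized version would apply the same clipping with min/max instructions rather than branches.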
The conditions for when to (re)allocate those buffers are identical,
so they can be merged into a single branch.
The allocation of the buffers themselves can also be combined to
reduce the number of allocation calls.
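A minimal sketch of the merged form, assuming two buffers that always share the same element count. The `BufPair` type and `buf_pair_resize()` are hypothetical names, not the actual code.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical pair of buffers that are always resized together. */
typedef struct {
    int16_t *a; /* first buffer, n elements, owns the allocation */
    uint8_t *b; /* second buffer, n elements, lives in a's block */
    size_t n;   /* current capacity in elements */
} BufPair;

/* Returns 0 on success, -1 on allocation failure. */
static int buf_pair_resize(BufPair *p, size_t n)
{
    if (n == p->n)
        return 0; /* single shared (re)allocation condition */
    free(p->a);   /* one free() releases both buffers */
    p->a = malloc(n * (sizeof(*p->a) + sizeof(*p->b)));
    if (!p->a) {
        p->b = NULL;
        p->n = 0;
        return -1;
    }
    p->b = (uint8_t *)(p->a + n); /* b starts right after a */
    p->n = n;
    return 0;
}
```

Placing the buffer with the stricter alignment requirement first keeps both buffers correctly aligned without extra padding.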
The number of nested macros caused by having to support SSE2 makes
the code very difficult to maintain and modify. It is also of
questionable value considering most other asm requires SSSE3.
Both POSIX and the C standard place several environmental limits on
setjmp() invocations, with essentially anything beyond comparing the
return value with a constant as a simple branch condition being UB.
We were previously performing a function call using the setjmp()
return value as an argument, which is technically not allowed
even though it happened to work correctly in practice.
Some systems may loosen those restrictions and allow for more
flexible usage, but we shouldn't be relying on that.
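The difference can be sketched as follows. The function names are hypothetical; the point is only that setjmp() appears as the operand of a comparison with a constant that forms the whole branch condition.

```c
#include <setjmp.h>

static jmp_buf env;

static void fail(void)
{
    longjmp(env, 1); /* makes the setjmp() call below return 1 */
}

/* Hypothetical example: returns 0 if no jump occurred, 1 otherwise. */
static int run_checked(int should_fail)
{
    /* Conforming: setjmp() compared with an integer constant, with the
     * result being the entire controlling expression of the if.
     * Not conforming: handle(setjmp(env)) or int r = setjmp(env). */
    if (setjmp(env) != 0)
        return 1;
    if (should_fail)
        fail();
    return 0;
}
```

If the return value needs to be passed on, it can be branched on first and a constant passed from inside each branch.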
It was originally disabled due to older meson versions mixing the output
of 'meson test -v' from different tests, which made the log difficult to
read. Newer versions, however, cache the output from each test as it runs
and prints it in one contiguous block, so that's no longer an issue.
We can simply use the regular mv contexts for intra frames.
They are mutually exclusive, and the dmv contexts were already
discarded and replaced with default contexts on frame completion.
Attempt to finish writing the current frame before exiting to avoid
ending up with a partially written frame at the end of the output file.
Only try catching a signal once, falling back to the default behavior
of exiting immediately the second time a given signal is raised.
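One way to get this behavior is sketched below, under the assumption that the main loop polls a flag between frames. `stop_requested`, `on_signal()` and `install_handler()` are illustrative names only.

```c
#include <signal.h>

/* Set by the handler, polled by the (hypothetical) frame writing loop. */
static volatile sig_atomic_t stop_requested;

static void on_signal(int sig)
{
    stop_requested = 1;
    /* Restore the default action so that a second signal of the same
     * kind terminates the process immediately. */
    signal(sig, SIG_DFL);
}

static void install_handler(void)
{
    signal(SIGINT, on_signal);
}
```

The loop would finish writing the current frame, then check `stop_requested` and exit cleanly if it is set.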
The refmvs_block struct is only 12 bytes in size, but it's accessed
using 16-byte unaligned loads in asm.
In order to avoid reading past the end of the allocated buffer
we therefore need to pad the allocation size by 4 bytes.
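The arithmetic can be sketched like this. `Block12` is a hypothetical 12-byte stand-in for refmvs_block, and `padded_alloc_size()` is an illustrative helper, not the actual allocation code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 12-byte stand-in for refmvs_block. */
typedef struct {
    uint8_t data[12];
} Block12;

#define LOAD_SIZE 16 /* width of the unaligned loads used by the asm */

/* Number of bytes to allocate for n blocks so that a LOAD_SIZE-byte
 * load starting at the last block stays inside the allocation. */
static size_t padded_alloc_size(size_t n)
{
    return n * sizeof(Block12) + (LOAD_SIZE - sizeof(Block12));
}
```

The padding is a fixed 4 bytes regardless of n, since only the load of the final element can overrun.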
Prints a list of cpuflags available for the current architecture.
Flags which are supported on the current system will be printed in
green, and flags which are unsupported in red with a ~ prefix.
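A rough sketch of the formatting, using the standard SGR escape sequences for green (32), red (31) and reset (0). `format_cpuflag()` is a hypothetical helper written for this example.

```c
#include <stddef.h>
#include <stdio.h>

/* Formats one flag name into buf: green if supported, red with a '~'
 * prefix otherwise. Returns the snprintf() result. */
static int format_cpuflag(char *buf, size_t size,
                          const char *name, int supported)
{
    return supported
        ? snprintf(buf, size, "\x1b[32m%s\x1b[0m", name)
        : snprintf(buf, size, "\x1b[31m~%s\x1b[0m", name);
}
```

The '~' prefix keeps the two states distinguishable when the output is not colored.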
Skip the overhead of shifting ones into the LSB in the common case;
that's only required for the EOB padding. In practice this means we
only have to invert bits once during the refill process instead of
twice in every call to msac functions.
Also make some improvements to the refill asm, mainly involving
keeping partially inserted bytes at the end instead of clearing them.
* Process the entire buffer to get better coverage of eob handling.
* Use a more reasonable buffer size.
* Ignore trailing dif bits to allow for more implementation flexibility.
Only print the paths relative to the argon directory. This avoids
excessive terminal line wrapping due to long path names which
otherwise interferes with the '\r' usage for progress reporting.
This allows for the use of standard VT100 escape codes for text coloring,
which simplifies things by eliminating a bunch of Windows-specific code.
This is only supported since Windows 10. Things will still run on
older systems, just without colored text output.
Reduces memory usage (by 3 kB per sb128 for 4:2:0) when decoding
streams with subsampled chroma when frame threading is enabled.
This also simplifies the logic for calculating cbi indices.
Both entropy decoding and reconstruction access the elements in
the same order, so calculating block x/y positions is redundant
and we can instead just store values sequentially and increase
the pointer by one every time it's accessed.
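The sequential access pattern can be sketched as a pair of cursor helpers. `BlockInfo`, `store_next()` and `load_next()` are hypothetical names, and the actual record layout is not shown here.

```c
#include <stdint.h>

/* Hypothetical per-block record. */
typedef struct {
    uint8_t value;
} BlockInfo;

/* Pass 1 (entropy decoding): append records in decode order. */
static void store_next(BlockInfo **cursor, uint8_t value)
{
    (*cursor)->value = value;
    (*cursor)++;
}

/* Pass 2 (reconstruction): read them back in the same order. */
static uint8_t load_next(const BlockInfo **cursor)
{
    return ((*cursor)++)->value;
}
```

Neither pass needs to know the block position, only that both visit the blocks in the same order.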
Pack two indices into each byte instead of storing them separately.
Reduces memory usage by up to 16 kB per sb128 in streams that use
screen content tools when frame threading is enabled, at the cost
of some additional computational overhead for packing/unpacking.
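A minimal sketch of the packing, assuming each index fits in 4 bits and that the first index of a pair goes into the low nibble. The nibble order and the function names are assumptions for illustration.

```c
#include <stdint.h>

/* Packs n 4-bit indices from src into dst, two per byte.
 * Assumes every index is < 16 and n is even. */
static void pack_indices(uint8_t *dst, const uint8_t *src, int n)
{
    for (int i = 0; i < n; i += 2)
        dst[i >> 1] = (uint8_t)(src[i] | (src[i + 1] << 4));
}

/* Inverse of pack_indices(): expands n / 2 bytes into n indices. */
static void unpack_indices(uint8_t *dst, const uint8_t *src, int n)
{
    for (int i = 0; i < n; i += 2) {
        dst[i]     = src[i >> 1] & 15;
        dst[i + 1] = src[i >> 1] >> 4;
    }
}
```

The packed buffer is half the size of the unpacked one, which is where the memory saving comes from.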
Only one of the sign or no-sign 4:4:4 tables is ever used for
any given wedge index, so there's no point in having both.
Reduces the table size by around 50 kB.
Replace pointers with 16-bit relative offsets and remove entries
for unused block sizes (only 8x8..32x32 are relevant).
Reduces the table size by around 17 kB.
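The general technique can be sketched as below. The `blob` contents, `entry_offset[]` and `get_entry()` are made up for the example and do not reflect the actual table layout.

```c
#include <stdint.h>

/* Hypothetical data blob holding three variable-length entries. */
static const uint8_t blob[] = {
    10, 11, 12,     /* entry 0 at offset 0 */
    20, 21,         /* entry 1 at offset 3 */
    30, 31, 32, 33, /* entry 2 at offset 5 */
};

/* 2 bytes per entry instead of sizeof(void *). */
static const uint16_t entry_offset[] = { 0, 3, 5 };

static const uint8_t *get_entry(int idx)
{
    return blob + entry_offset[idx];
}
```

On 64-bit targets this shrinks each table entry from 8 bytes to 2, at the cost of one extra addition per lookup.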
Always-enabled basic sanity checks in API functions are reasonable,
but within internal functions assert() is more appropriate when
it comes to checking for "should never happen" conditions.
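The distinction can be sketched with a hypothetical public function and an internal helper. `api_sum()` and `sum_internal()` are invented for this example and are not part of any real API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Internal helper: callers guarantee valid arguments, so a violation is
 * a bug and is only checked when assertions are enabled (no NDEBUG). */
static int sum_internal(const int *v, int n)
{
    assert(v != NULL && n >= 0);
    int s = 0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Public entry point: arguments come from outside the library, so they
 * are always validated and an error code is returned on misuse. */
int api_sum(const int *v, int n, int *out)
{
    if (!v || !out || n < 0)
        return -EINVAL;
    *out = sum_internal(v, n);
    return 0;
}
```

Release builds compiled with NDEBUG pay nothing for the internal check, while the API boundary stays protected.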