Ronald S. Bultje
583e8e02eb
tools/dav1d: initialize elapsed
...
Based on the following comment on IRC:
"<aconz2> the `elapsed` variable in main() is read uninitialized in
synchronize and makes the first frametime with --frametime incorrect
I think. Should be initialized to 0"
Confirmed that after initializing to zero, the first line in the file
generated by --frametime is reasonable.
2025-07-01 08:26:31 -04:00
Ronald S. Bultje
ca83ee6d9d
itx: restrict number of columns iterated over based on EOB
2024-06-17 12:44:34 -04:00
Ronald S. Bultje
f1c518901b
Increase timeout multiplier for aarch64/riscv64/la64-qemu CI jobs
...
They have been failing occasionally lately.
2024-04-13 09:53:54 -04:00
Ronald S. Bultje
7f5d3492f6
picture.c: rename picture_alloc_with_edges() to picture_alloc()
...
The allocated picture has no edges and is not expected to have any
edges, so the _with_edges() suffix was misleading. Fixes #415.
2024-02-02 10:57:31 -05:00
Ronald S. Bultje
18b6ed7008
Verify ref frame results after decoding completion
...
This fixes an issue with frame threading: a reference frame could
successfully progress to a particular sbrow and signal that, and a frame
using it as a reference could pick up that signal and decode
successfully, even though the reference itself might still fail decoding
at a later stage.
2024-02-02 09:14:52 -05:00
Ronald S. Bultje
6d33d1796b
Check for trailing marker/zero bits for tile data
...
Fixes #385.
2024-02-02 09:14:35 -05:00
Ronald S. Bultje
ceeb535d94
qm: derive more tables at runtime
...
This reduces binary size from ~50kb to ~35kb. Ideas provided by Yu-Chen
(Eric) Sun and Ryan Lei from Meta.
2024-01-03 13:42:40 -05:00
Ronald S. Bultje
47107e384b
deblock_avx512: convert byte-shifts to gf2p8affineqb
2023-10-05 17:24:34 +00:00
Ronald S. Bultje
ad0f3e6a4b
x86: add AVX512-IceLake implementation of HBD 64x64 DCT^2
...
Also implement "fast3" path for pass2.dct64 (where 1/8th of the
coefficients are non-zero), which affects 32x64 as well as 64x64.
Before:
inv_txfm_add_32x64_dct_dct_1_10bpc_c: 51008.6 ( 1.00x)
inv_txfm_add_32x64_dct_dct_1_10bpc_sse4: 3351.9 (15.22x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx2: 1419.5 (35.93x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx512icl: 744.8 (68.49x)
After:
inv_txfm_add_32x64_dct_dct_1_10bpc_c: 51019.5 ( 1.00x)
inv_txfm_add_32x64_dct_dct_1_10bpc_sse4: 3276.1 (15.57x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx2: 1420.7 (35.91x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx512icl: 668.3 (76.34x)
(Not sure why the SSE4 speed changed.)
And speed for 64x64:
inv_txfm_add_64x64_dct_dct_0_10bpc_c: 3506.9 ( 1.00x)
inv_txfm_add_64x64_dct_dct_0_10bpc_sse4: 535.6 ( 6.55x)
inv_txfm_add_64x64_dct_dct_0_10bpc_avx2: 223.5 (15.69x)
inv_txfm_add_64x64_dct_dct_0_10bpc_avx512icl: 252.4 (13.89x)
inv_txfm_add_64x64_dct_dct_1_10bpc_c: 108353.7 ( 1.00x)
inv_txfm_add_64x64_dct_dct_1_10bpc_sse4: 6551.9 (16.54x)
inv_txfm_add_64x64_dct_dct_1_10bpc_avx2: 2876.8 (37.66x)
inv_txfm_add_64x64_dct_dct_1_10bpc_avx512icl: 1310.1 (82.70x)
inv_txfm_add_64x64_dct_dct_2_10bpc_c: 108347.6 ( 1.00x)
inv_txfm_add_64x64_dct_dct_2_10bpc_sse4: 7985.4 (13.57x)
inv_txfm_add_64x64_dct_dct_2_10bpc_avx2: 3561.8 (30.42x)
inv_txfm_add_64x64_dct_dct_2_10bpc_avx512icl: 1962.6 (55.20x)
inv_txfm_add_64x64_dct_dct_3_10bpc_c: 108455.5 ( 1.00x)
inv_txfm_add_64x64_dct_dct_3_10bpc_sse4: 9709.0 (11.17x)
inv_txfm_add_64x64_dct_dct_3_10bpc_avx2: 4220.5 (25.70x)
inv_txfm_add_64x64_dct_dct_3_10bpc_avx512icl: 2991.1 (36.26x)
inv_txfm_add_64x64_dct_dct_4_10bpc_c: 108349.9 ( 1.00x)
inv_txfm_add_64x64_dct_dct_4_10bpc_sse4: 11048.0 ( 9.81x)
inv_txfm_add_64x64_dct_dct_4_10bpc_avx2: 4898.1 (22.12x)
inv_txfm_add_64x64_dct_dct_4_10bpc_avx512icl: 3108.1 (34.86x)
2023-04-20 12:08:42 +00:00
Ronald S. Bultje
68d7a76d08
x86: add AVX512-IceLake implementation of HBD 64x32 DCT^2
...
inv_txfm_add_64x32_dct_dct_0_10bpc_c: 1760.6 ( 1.00x)
inv_txfm_add_64x32_dct_dct_0_10bpc_sse4: 271.1 ( 6.49x)
inv_txfm_add_64x32_dct_dct_0_10bpc_avx2: 121.3 (14.52x)
inv_txfm_add_64x32_dct_dct_0_10bpc_avx512icl: 116.3 (15.14x)
inv_txfm_add_64x32_dct_dct_1_10bpc_c: 66507.4 ( 1.00x)
inv_txfm_add_64x32_dct_dct_1_10bpc_sse4: 3712.4 (17.91x)
inv_txfm_add_64x32_dct_dct_1_10bpc_avx2: 1830.5 (36.33x)
inv_txfm_add_64x32_dct_dct_1_10bpc_avx512icl: 805.4 (82.58x)
inv_txfm_add_64x32_dct_dct_2_10bpc_c: 66491.6 ( 1.00x)
inv_txfm_add_64x32_dct_dct_2_10bpc_sse4: 5325.3 (12.49x)
inv_txfm_add_64x32_dct_dct_2_10bpc_avx2: 2578.5 (25.79x)
inv_txfm_add_64x32_dct_dct_2_10bpc_avx512icl: 1394.5 (47.68x)
inv_txfm_add_64x32_dct_dct_3_10bpc_c: 66490.2 ( 1.00x)
inv_txfm_add_64x32_dct_dct_3_10bpc_sse4: 6418.5 (10.36x)
inv_txfm_add_64x32_dct_dct_3_10bpc_avx2: 3305.6 (20.11x)
inv_txfm_add_64x32_dct_dct_3_10bpc_avx512icl: 2571.5 (25.86x)
inv_txfm_add_64x32_dct_dct_4_10bpc_c: 66508.6 ( 1.00x)
inv_txfm_add_64x32_dct_dct_4_10bpc_sse4: 8671.2 ( 7.67x)
inv_txfm_add_64x32_dct_dct_4_10bpc_avx2: 4054.2 (16.40x)
inv_txfm_add_64x32_dct_dct_4_10bpc_avx512icl: 2691.6 (24.71x)
2023-04-18 11:01:53 -04:00
Ronald S. Bultje
0b809a9281
x86: add AVX512-IceLake implementation of HBD 64x16 DCT^2
...
inv_txfm_add_64x16_dct_dct_0_10bpc_c: 892.0 ( 1.00x)
inv_txfm_add_64x16_dct_dct_0_10bpc_sse4: 131.5 ( 6.78x)
inv_txfm_add_64x16_dct_dct_0_10bpc_avx2: 63.4 (14.07x)
inv_txfm_add_64x16_dct_dct_0_10bpc_avx512icl: 56.8 (15.71x)
inv_txfm_add_64x16_dct_dct_1_10bpc_c: 29253.7 ( 1.00x)
inv_txfm_add_64x16_dct_dct_1_10bpc_sse4: 1639.7 (17.84x)
inv_txfm_add_64x16_dct_dct_1_10bpc_avx2: 1106.8 (26.43x)
inv_txfm_add_64x16_dct_dct_1_10bpc_avx512icl: 532.9 (54.89x)
inv_txfm_add_64x16_dct_dct_2_10bpc_c: 29249.8 ( 1.00x)
inv_txfm_add_64x16_dct_dct_2_10bpc_sse4: 3065.6 ( 9.54x)
inv_txfm_add_64x16_dct_dct_2_10bpc_avx2: 1791.0 (16.33x)
inv_txfm_add_64x16_dct_dct_2_10bpc_avx512icl: 1108.0 (26.40x)
inv_txfm_add_64x16_dct_dct_3_10bpc_c: 29269.1 ( 1.00x)
inv_txfm_add_64x16_dct_dct_3_10bpc_sse4: 3738.2 ( 7.83x)
inv_txfm_add_64x16_dct_dct_3_10bpc_avx2: 1790.9 (16.34x)
inv_txfm_add_64x16_dct_dct_3_10bpc_avx512icl: 1203.8 (24.31x)
inv_txfm_add_64x16_dct_dct_4_10bpc_c: 29337.7 ( 1.00x)
inv_txfm_add_64x16_dct_dct_4_10bpc_sse4: 3749.7 ( 7.82x)
inv_txfm_add_64x16_dct_dct_4_10bpc_avx2: 1791.0 (16.38x)
inv_txfm_add_64x16_dct_dct_4_10bpc_avx512icl: 1203.8 (24.37x)
2023-04-13 10:36:38 -04:00
Ronald S. Bultje
6ae5766724
x86: add AVX512-IceLake implementation of HBD 32x64 DCT^2
...
inv_txfm_add_32x64_dct_dct_0_10bpc_c: 1783.5 ( 1.00x)
inv_txfm_add_32x64_dct_dct_0_10bpc_sse4: 243.3 ( 7.33x)
inv_txfm_add_32x64_dct_dct_0_10bpc_avx2: 119.1 (14.97x)
inv_txfm_add_32x64_dct_dct_0_10bpc_avx512icl: 142.6 (12.50x)
inv_txfm_add_32x64_dct_dct_1_10bpc_c: 50422.5 ( 1.00x)
inv_txfm_add_32x64_dct_dct_1_10bpc_sse4: 2880.5 (17.50x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx2: 1423.4 (35.43x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx512icl: 741.6 (67.99x)
inv_txfm_add_32x64_dct_dct_2_10bpc_c: 50433.6 ( 1.00x)
inv_txfm_add_32x64_dct_dct_2_10bpc_sse4: 4015.1 (12.56x)
inv_txfm_add_32x64_dct_dct_2_10bpc_avx2: 1767.7 (28.53x)
inv_txfm_add_32x64_dct_dct_2_10bpc_avx512icl: 960.8 (52.49x)
inv_txfm_add_32x64_dct_dct_3_10bpc_c: 50422.2 ( 1.00x)
inv_txfm_add_32x64_dct_dct_3_10bpc_sse4: 4500.5 (11.20x)
inv_txfm_add_32x64_dct_dct_3_10bpc_avx2: 2111.7 (23.88x)
inv_txfm_add_32x64_dct_dct_3_10bpc_avx512icl: 1777.1 (28.37x)
inv_txfm_add_32x64_dct_dct_4_10bpc_c: 50444.2 ( 1.00x)
inv_txfm_add_32x64_dct_dct_4_10bpc_sse4: 5592.8 ( 9.02x)
inv_txfm_add_32x64_dct_dct_4_10bpc_avx2: 2458.1 (20.52x)
inv_txfm_add_32x64_dct_dct_4_10bpc_avx512icl: 1867.2 (27.02x)
2023-04-12 19:16:21 -04:00
Ronald S. Bultje
5aa3b38f98
x86: add AVX512-IceLake implementation of HBD 16x64 DCT^2
...
nop: 39.4
inv_txfm_add_16x64_dct_dct_0_10bpc_c: 2208.0 ( 1.00x)
inv_txfm_add_16x64_dct_dct_0_10bpc_sse4: 133.5 (16.54x)
inv_txfm_add_16x64_dct_dct_0_10bpc_avx2: 71.3 (30.98x)
inv_txfm_add_16x64_dct_dct_0_10bpc_avx512icl: 102.0 (21.66x)
inv_txfm_add_16x64_dct_dct_1_10bpc_c: 25757.0 ( 1.00x)
inv_txfm_add_16x64_dct_dct_1_10bpc_sse4: 1366.1 (18.85x)
inv_txfm_add_16x64_dct_dct_1_10bpc_avx2: 657.6 (39.17x)
inv_txfm_add_16x64_dct_dct_1_10bpc_avx512icl: 378.9 (67.98x)
inv_txfm_add_16x64_dct_dct_2_10bpc_c: 25771.0 ( 1.00x)
inv_txfm_add_16x64_dct_dct_2_10bpc_sse4: 1739.7 (14.81x)
inv_txfm_add_16x64_dct_dct_2_10bpc_avx2: 772.1 (33.38x)
inv_txfm_add_16x64_dct_dct_2_10bpc_avx512icl: 469.3 (54.92x)
inv_txfm_add_16x64_dct_dct_3_10bpc_c: 25775.7 ( 1.00x)
inv_txfm_add_16x64_dct_dct_3_10bpc_sse4: 1968.1 (13.10x)
inv_txfm_add_16x64_dct_dct_3_10bpc_avx2: 886.5 (29.08x)
inv_txfm_add_16x64_dct_dct_3_10bpc_avx512icl: 662.6 (38.90x)
inv_txfm_add_16x64_dct_dct_4_10bpc_c: 25745.9 ( 1.00x)
inv_txfm_add_16x64_dct_dct_4_10bpc_sse4: 2330.9 (11.05x)
inv_txfm_add_16x64_dct_dct_4_10bpc_avx2: 1008.5 (25.53x)
inv_txfm_add_16x64_dct_dct_4_10bpc_avx512icl: 662.3 (38.88x)
2023-04-08 11:47:31 +00:00
Ronald S. Bultje
b6bd4007cc
lib.c: re-order so all code accessing f->* is grouped together
2022-03-09 15:31:04 -05:00
Ronald S. Bultje
b6bec5b453
lib.c: consider a cached_error as a valid output picture
...
Fixes #277.
2022-03-08 11:37:00 -05:00
Ronald S. Bultje and Henrik Gramner
5fb9f3a460
CI: add threaded tests to avx512icl instance
2022-03-07 20:23:29 +00:00
Ronald S. Bultje
e3f4c70006
lib.c: clear cf after seeking
...
Fixes #390.
2022-03-07 14:38:27 +00:00
Ronald S. Bultje
8ccdf0f6b9
task_thread: use EINVAL/ENOMEM instead of -1 for f->task_thread.retval
2022-02-17 08:46:22 -05:00
Ronald S. Bultje
2a00fb6d47
Forward frame-thread decoding errors back to user thread
2022-02-17 08:46:19 -05:00
Ronald S. Bultje
00d4715ca2
decode.c: remove dead assignment
2022-02-16 18:00:27 -05:00
Ronald S. Bultje
239c951f2e
decode.c: fix return value on bitstream decoding errors
...
Change ENOMEM into EINVAL, since at this point memory allocation
errors don't occur, and bitstream decoding errors are not fatal.
2022-02-16 18:00:20 -05:00
Ronald S. Bultje
cae2c4f0bd
tools/dav1d: fix infinite loop on corrupt bitstreams
...
Unref data after decoding failure to prevent re-entering the loop
with the same data.
2022-02-16 18:00:09 -05:00
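The loop fix described in the commit above can be illustrated with a minimal sketch. This is not the tools/dav1d code: `Packet`, `fake_send()`, `packet_unref()` and `drain()` are hypothetical stand-ins, and the assumption is that a failed decode call leaves the input data unconsumed, so the loop only terminates if the caller drops the data itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer of input data. */
typedef struct { const unsigned char *ptr; size_t sz; } Packet;

/* Hypothetical decoder call: consumes the whole packet on success. A packet
 * starting with 0xFF counts as corrupt: -EINVAL is returned and nothing is
 * consumed. */
static int fake_send(Packet *p)
{
    if (p->sz && p->ptr[0] == 0xFF) return -EINVAL;
    p->sz = 0;
    return 0;
}

/* Hypothetical unref: drop the data so it is not offered again. */
static void packet_unref(Packet *p) { p->ptr = NULL; p->sz = 0; }

/* Feed the packet until it is empty. Returns the number of iterations used,
 * or -1 if max_iter was exceeded (the "infinite loop" case, bounded here). */
static int drain(Packet *p, int unref_on_error, int max_iter)
{
    assert(p != NULL);
    int n = 0;
    while (p->sz > 0) {
        if (n++ == max_iter) return -1;
        if (fake_send(p) < 0 && unref_on_error)
            packet_unref(p);   /* without this, the same data is retried forever */
    }
    return n;
}
```

With `unref_on_error` set, a corrupt packet is dropped after one failed attempt. Without it, the loop keeps re-submitting the same bytes until the iteration cap is hit.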
Ronald S. Bultje
2131a2cdaf
Fix typo in EINVAL comparison
2022-02-09 13:21:28 +00:00
Ronald S. Bultje
a363374a83
tools/dav1d: continue on recoverable bitstream decoding errors
...
Fixes inconsistent output frame count depending on --threads=X value
for the sample in #244.
2022-02-07 13:39:58 -05:00
Ronald S. Bultje and James Almer
f984447637
Output only latest spatial layer if --alllayers 0
...
Right now, --alllayers 0 will only output operating points that exactly
match the largest one in the sequence header. However, in certain cases,
the largest one might not be available, and a smaller one should be
returned to the user instead.
This matches update_frame_buffers() in aomdec to output only the latest
frame if --alllayers 0 is specified.
Signed-off-by: James Almer <jamrial@gmail.com>
2022-02-03 19:37:56 -03:00
Ronald S. Bultje
45e8f2f5f8
Fix indentation
2022-01-18 11:25:04 -05:00
Ronald S. Bultje
9a691b3131
add --inloopfilters to enable/disable postfilters dynamically
...
(To be used alongside --filmgrain.)
Addresses part of #310.
2022-01-14 16:27:42 -05:00
Ronald S. Bultje
b562b7f648
Set default framedelay to min(8, ceil(sqrt(n_threads)))
...
This reduces memory usage significantly. Fixes #375.
2022-01-10 14:49:11 +00:00
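The default from the commit title above, min(8, ceil(sqrt(n_threads))), can be worked through with a small sketch. The helper name `default_frame_delay()` is hypothetical and this is not the dav1d implementation. It computes the ceiling of the square root with an integer loop, which is an illustrative choice that avoids floating point.

```c
#include <assert.h>

/* Hypothetical helper: min(8, ceil(sqrt(n_threads))) in integer arithmetic. */
static unsigned default_frame_delay(unsigned n_threads)
{
    assert(n_threads >= 1);
    unsigned d = 1;
    /* Find the smallest d with d * d >= n_threads, i.e. ceil(sqrt(n_threads)),
     * and stop early once the cap of 8 is reached. */
    while (d < 8 && d * d < n_threads)
        d++;
    return d;
}
```

Under this formula, 4 threads give a delay of 2, 16 threads give 4, and anything above 49 threads is capped at 8.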
Ronald S. Bultje
068697556f
Add interface to output invisible (alt-ref) frames
...
Addresses part of #310.
2022-01-07 22:04:24 +00:00
Ronald S. Bultje
36beb8185d
Add option to write each frame to separate output file
...
For per-file yuv/y4m writes, this can be automatically specified
using e.g. -o file_%w_%h_%5n.yuv/y4m. --muxer=framemd5 -o - --quiet
will accomplish the same for per-frame md5sums.
Addresses part of #310.
2022-01-06 18:50:09 +00:00
Ronald S. Bultje
2337127cec
Mark failed-to-decode frames as incomplete when --maxframedelay=1
...
Credit to oss-fuzz.
2021-11-12 07:56:05 -05:00
Ronald S. Bultje
c7a5b90001
Fix wrong assignment if stride or sbh change, but stride * sbh doesn't
...
Credit to oss-fuzz.
2021-11-11 07:29:05 -05:00
Ronald S. Bultje
c7f8c8276b
Clear clobbered coefficient array when flushing after seek
...
Fixes #369.
2021-09-13 09:37:11 -04:00
Ronald S. Bultje
d9c01c34dc
Fix formatting string
2021-09-11 10:47:20 -04:00
Ronald S. Bultje and Jean-Baptiste Kempf
eae65df192
Fix memleak
...
Credit to Oss-Fuzz.
2021-09-05 08:19:55 +00:00
Ronald S. Bultje
12156a507b
x86/itx: 64x64 inverse dct transforms hbd/sse4
2021-08-17 14:48:20 -04:00
Ronald S. Bultje
80bfd416d7
x86/itx: 64x32 inverse dct transforms hbd/sse4
2021-08-17 14:48:14 -04:00
Ronald S. Bultje
01466edf2e
x86/itx: 64x16 inverse dct transforms hbd/sse4
2021-08-17 14:35:59 -04:00
Ronald S. Bultje
be788c6319
x86/itx: 32x64 inverse dct transforms hbd/sse4
2021-08-17 14:35:59 -04:00
Ronald S. Bultje
db6455e479
x86/itx: 16x64 inverse dct transforms hbd/sse4
2021-08-17 14:35:52 -04:00
Ronald S. Bultje
78d4c87851
itx/x86: rewrite .transpose4x8packed so it uses only m0-3,4&6
...
And same for .transpose4x8packed_hi.
2021-08-12 15:20:03 -04:00
Ronald S. Bultje
ec9ecba1e6
itx/x86: replace idct8x8.transpose with idct8x4.transpose4x8packed
2021-08-12 15:20:03 -04:00
Ronald S. Bultje
59770564c0
x86/itx: add 1/sqrt(2) (rect2) multiply macro
2021-08-12 15:20:01 -04:00
Ronald S. Bultje
5455e8250c
x86/itx: share pass2 loop between {16,32}x32 dct^2 functions
2021-08-12 14:47:14 -04:00
Ronald S. Bultje
9cf9d4a613
x86/itx: combine .write_8x8 and .round{1,2,3,4} into a single function
2021-08-12 14:01:45 -04:00
Ronald S. Bultje
7050f0581d
x86/itx: combine .write_8x4 and .round{1,2} into a single function
2021-08-12 14:01:45 -04:00
Ronald S. Bultje
a5cea27ce9
x86/itx: split dct/adst/identity pass=2 implementations for 16x8
...
This simplifies the code a bit, and allows sharing the dct pass=2
implementation with 32x8.
2021-08-12 14:01:45 -04:00
Ronald S. Bultje and Jean-Baptiste Kempf
86b03c3cbe
x86/itx: 32x32 inverse dct transforms hbd/sse4
2021-08-12 16:56:40 +00:00
Ronald S. Bultje and Jean-Baptiste Kempf
59b3fe6c50
x86/itx: 32x16 inverse dct transforms hbd/sse4
2021-08-12 16:56:40 +00:00
Ronald S. Bultje and Jean-Baptiste Kempf
2974828a25
x86/itx: 32x8 inverse dct transforms hbd/sse4
2021-08-12 16:56:40 +00:00
Ronald S. Bultje and Jean-Baptiste Kempf
de6603a207
x86/itx: 16x32 inverse dct transforms hbd/sse4
2021-08-12 16:56:40 +00:00
Ronald S. Bultje and Jean-Baptiste Kempf
072eb21430
x86/itx: 8x32 inverse dct transforms hbd/sse4
2021-08-12 16:56:40 +00:00
Ronald S. Bultje
b119e71dc5
x86/itx: merge pass=2 rounding and writing operations
2021-08-10 09:06:27 -04:00
Ronald S. Bultje and Jean-Baptiste Kempf
ec18f047ca
x86/itx: 32x{8,16,32} & {8,16}x32 idtx transforms hbd/sse4
2021-08-10 11:33:18 +00:00
Ronald S. Bultje
a5f32330e4
x86/itx: replace .transpose8x8 with 2 calls to .transpose4x8packed
2021-08-08 17:50:09 -04:00
Ronald S. Bultje
b34244599c
x86/itx: document third argument in INV_TXFM_WxH_FN macros
2021-08-04 10:46:41 -04:00
Ronald S. Bultje
7edb1a7ed5
x86/itx: 16x16 inverse transforms hbd/sse4
2021-08-02 18:17:32 -04:00
Ronald S. Bultje
bcc994514c
x86/itx: 16x8 inverse transforms hbd/sse4
2021-08-02 18:17:16 -04:00
Ronald S. Bultje
ac8fa32a06
x86/itx: 16x4 inverse transforms hbd/sse4
2021-08-02 18:16:04 -04:00
Ronald S. Bultje
e266f9fa40
x86/itx: 8x16 inverse transforms hbd/sse4
2021-07-28 09:13:38 -04:00
Ronald S. Bultje
d5c0831297
x86/itx: 8x8 inverse transforms hbd/sse4
2021-07-28 09:13:32 -04:00
Ronald S. Bultje
a804d43004
x86/itx: add eob-based fast path to 4x16 hbd/sse4 itx
2021-07-28 09:10:14 -04:00
Ronald S. Bultje
e7228e8013
x86/itx: add eob-based fast path to 4x8 hbd/sse4 itx
2021-07-28 09:10:14 -04:00
Ronald S. Bultje
999a1c4d2a
x86/itx: 8x4 inverse transforms hbd/sse4
2021-07-28 09:10:14 -04:00
Ronald S. Bultje
ba183d230c
x86/itx: 4x16 inverse transforms hbd/sse4
2021-07-28 09:10:10 -04:00
Ronald S. Bultje
755364cbc6
x86/itx: 4x8 inverse transforms hbd/sse4
2021-07-21 11:12:28 -04:00
Ronald S. Bultje and Jean-Baptiste Kempf
c719d4a4e1
x86/filmgrain: add fguv_32x32xn_i444 HBD/AVX2
2021-07-20 12:23:15 +00:00
Ronald S. Bultje and Jean-Baptiste Kempf
cc0e2d5f2d
x86/filmgrain: add fguv_32x32xn_i422 HBD/AVX2
2021-07-20 12:23:15 +00:00
Ronald S. Bultje and Jean-Baptiste Kempf
8f858c2385
x86/filmgrain: add fguv_32x32xn_i422/444 HBD/SSSE3
2021-07-20 12:23:15 +00:00
Ronald S. Bultje and Jean-Baptiste Kempf
42978746f4
x86/itx: change function signatures of itx_4x4 to 0 GPRs
...
The wrapper function already backs up GPRs, and declaring 7 here means
we will back up/restore twice on x86-32.
2021-07-19 13:09:20 +00:00
Ronald S. Bultje
1944317ea6
x86/filmgrain: simplify post-horizontal filter blending
2021-07-16 17:51:17 -04:00
Ronald S. Bultje and Jean-Baptiste Kempf
73db537834
x86/filmgrain: add generate_grain_uv_i422/i444 HBD AVX2 & SSSE3
2021-07-15 15:07:22 +00:00
Ronald S. Bultje and Jean-Baptiste Kempf
35aa1c226b
x86/filmgrain: make fguv_i420_32x32xn HBD/SSSE3 32bit-compatible
2021-07-14 17:44:21 +00:00
Ronald S. Bultje and Jean-Baptiste Kempf
6235cdf16e
x86/filmgrain: make fgy_32x32xn HBD/SSSE3 32bit-compatible
2021-07-14 17:44:21 +00:00
Ronald S. Bultje
7e6fc8b040
x86/film_grain: make generate_grain_y/uv_420 32-bit compatible
2021-07-12 13:36:48 +00:00
Ronald S. Bultje and Jean-Baptiste Kempf
33180d8f6f
x86/deblock: make hbd/ssse3 implementations 32bit-compatible
2021-07-06 21:45:20 +00:00
Ronald S. Bultje
da98a8d562
x86/deblock_avx2: use vpblendvb instead of pand/pandn/por in flat16/8/6
2021-07-05 07:40:27 -04:00
Ronald S. Bultje
0aca76c3b7
x86/deblock_hbd_avx2: use vpblendvb instead of pand/pandn/por in flat16/8/6
2021-07-05 07:40:24 -04:00
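The substitution in the two deblock commits above can be modelled per byte in scalar C. This is a sketch of the idea only, not the assembly: the function names are hypothetical, and it assumes the select masks are full-byte compare results (0x00 or 0xFF), which is the case where a three-instruction and/andnot/or sequence and a single byte blend give the same result.

```c
#include <assert.h>
#include <stdint.h>

/* pand/pandn/por style: three bitwise operations per select. */
static uint8_t select_and_or(uint8_t a, uint8_t b, uint8_t mask)
{
    return (uint8_t)((a & mask) | (b & (uint8_t)~mask));
}

/* Byte-blend style, modelled per byte: only the top bit of the mask byte
 * decides whether a or b is picked. */
static uint8_t select_blend(uint8_t a, uint8_t b, uint8_t mask)
{
    return (mask & 0x80) ? a : b;
}

/* Exhaustively confirm both forms agree when the mask is 0x00 or 0xFF. */
static void check_full_byte_masks(void)
{
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++) {
            assert(select_and_or((uint8_t)a, (uint8_t)b, 0xFF) ==
                   select_blend((uint8_t)a, (uint8_t)b, 0xFF));
            assert(select_and_or((uint8_t)a, (uint8_t)b, 0x00) ==
                   select_blend((uint8_t)a, (uint8_t)b, 0x00));
        }
}
```

For partial masks such as 0x80 the two forms differ, so the replacement is only valid where the mask bytes are known to be all ones or all zeros.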
Ronald S. Bultje
af16b652aa
Add SSSE3 HBD filmgrain assembly optimizations
2021-06-15 09:49:02 -04:00
Ronald S. Bultje
f7043e4742
Add 10/12-bit deblock SSSE3 implementation
...
Currently 64-bit only.
2021-06-11 12:06:15 -04:00
Ronald S. Bultje
1156c0442a
mc: add HBD/SSSE3 mc.emu_edge optimizations
2021-06-09 23:21:44 +00:00
Ronald S. Bultje
e00e741161
checkasm: allow 1 >= h >= 2 in fgy_32x32xn unit test
2021-06-05 20:22:44 +00:00
Ronald S. Bultje
a8b13fc110
Do avx2/hbd scaling*grain multiplication in 16bit instead of 32bit
2021-06-04 19:39:00 +00:00
Ronald S. Bultje
d16ddb34aa
x86: add 10/12-bpc AVX2 version of mc.emu_edge
2021-05-11 08:02:21 -04:00
Ronald S. Bultje and Henrik Gramner
3a6630707e
x86: Add high bitdepth filmgrain AVX2 asm
2021-05-10 20:41:23 +02:00
Ronald S. Bultje and Henrik Gramner
24b1a4adb3
x86: Add high bitdepth loopfilter AVX2 asm
2021-05-05 00:25:55 +02:00
Ronald S. Bultje and Henrik Gramner
87aa815cfa
x86: Add high bitdepth cdef AVX2 asm
2021-05-05 00:25:55 +02:00
Ronald S. Bultje
47daa4df33
Accumulate leb128 value using uint64_t as intermediate type
...
The shift-amount can be up to 56, and left-shifting 32-bit integers
by values >=32 is undefined behaviour. Therefore, use 64-bit integers
instead. Also slightly rewrite so we only call dav1d_get_bits() once
for the combined more|bits value, and mask the relevant portions
out instead of reading twice. Lastly, move the overflow check out of
the loop (as suggested by @wtc).
Fixes #341.
2020-06-22 21:10:55 -04:00
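The approach in the leb128 commit above can be sketched as follows. This is an assumption-laden illustration, not the dav1d source: it reads from a plain byte buffer instead of dav1d's bit reader, and the name `read_leb128()` and its signature are hypothetical. It shows the three points from the message: a 64-bit accumulator, a single read per byte with the flag and payload masked out, and the overflow check placed after the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode an unsigned LEB128 value of at most 8 bytes.
 * Returns the number of bytes consumed, or 0 on error (truncated input,
 * missing terminator, or a value that does not fit in 32 bits). */
static size_t read_leb128(const uint8_t *buf, size_t len, uint32_t *out)
{
    assert(buf != NULL && out != NULL);
    uint64_t val = 0;   /* 64-bit accumulator: shifts reach 49 in this sketch,
                         * which would be undefined on a 32-bit integer */
    size_t i = 0;
    int more;

    do {
        if (i == len || i == 8) return 0;
        const unsigned v = buf[i];   /* one read: more-flag plus 7 payload bits */
        more = v & 0x80;
        val |= (uint64_t)(v & 0x7f) << (7 * i);
        i++;
    } while (more);

    if (val > UINT32_MAX) return 0;  /* overflow check, outside the loop */
    *out = (uint32_t)val;
    return i;
}
```

For example, the bytes 0xE5 0x8E 0x26 decode to 624485 in three bytes, while a five-byte encoding whose value needs 33 bits is rejected by the final check.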
Ronald S. Bultje
41cd4199f1
Skip loop restoration cache buffer resize for too-small buffers
...
Fixes crashes in dav1d_resize_{avx2,ssse3} on very small resolutions
with super_res enabled but skipped because the width is too small.
2020-04-02 02:44:27 +02:00
Ronald S. Bultje
4687c4696f
x86: add SSSE3 versions for filmgrain.fguv_32x32xn[422/444]
...
fguv_32x32xn_8bpc_420_csfl0_c: 14568.2
fguv_32x32xn_8bpc_420_csfl0_ssse3: 1162.3
fguv_32x32xn_8bpc_420_csfl1_c: 10682.0
fguv_32x32xn_8bpc_420_csfl1_ssse3: 910.3
fguv_32x32xn_8bpc_422_csfl0_c: 16370.5
fguv_32x32xn_8bpc_422_csfl0_ssse3: 1202.6
fguv_32x32xn_8bpc_422_csfl1_c: 11333.8
fguv_32x32xn_8bpc_422_csfl1_ssse3: 958.8
fguv_32x32xn_8bpc_444_csfl0_c: 12950.1
fguv_32x32xn_8bpc_444_csfl0_ssse3: 1133.6
fguv_32x32xn_8bpc_444_csfl1_c: 8806.7
fguv_32x32xn_8bpc_444_csfl1_ssse3: 731.0
2020-04-01 10:50:56 -04:00
Ronald S. Bultje
b73acaa894
x86: use btc instead of xor+test or 32byte alignment in fgy_32x32xn_ssse3
2020-04-01 10:50:22 -04:00
Ronald S. Bultje
275e91de9e
x86: add AVX2 versions for filmgrain.fguv_32x32xn[422/444]
...
fguv_32x32xn_8bpc_420_csfl0_c: 14568.2
fguv_32x32xn_8bpc_420_csfl0_avx2: 940.2
fguv_32x32xn_8bpc_420_csfl1_c: 10682.0
fguv_32x32xn_8bpc_420_csfl1_avx2: 783.3
fguv_32x32xn_8bpc_422_csfl0_c: 16370.5
fguv_32x32xn_8bpc_422_csfl0_avx2: 1557.3
fguv_32x32xn_8bpc_422_csfl1_c: 11333.8
fguv_32x32xn_8bpc_422_csfl1_avx2: 902.1
fguv_32x32xn_8bpc_444_csfl0_c: 12950.1
fguv_32x32xn_8bpc_444_csfl0_avx2: 822.9
fguv_32x32xn_8bpc_444_csfl1_c: 8806.7
fguv_32x32xn_8bpc_444_csfl1_avx2: 708.2
2020-04-01 10:50:02 -04:00
Ronald S. Bultje
fcc94fa905
x86: use btc instead of xor+test in fgy_32x32xn_avx2
2020-04-01 10:49:41 -04:00
Ronald S. Bultje
4dd943156d
x86: don't use vptest in SSSE3 version
...
This is the VEX (AVX) encoded variant for the SSE4 instruction ptest,
so emulate it using pmovmskb in the SSSE3 version.
2020-03-31 10:26:08 -04:00
Ronald S. Bultje
e308ae49b3
x86: add SSSE3 version of mc.resize()
...
resize_8bpc_c: 1613670.2
resize_8bpc_ssse3: 110469.5
resize_8bpc_avx2: 93580.6
2020-03-31 13:19:55 +02:00
Ronald S. Bultje
9e36b9b001
x86: add AVX2 version of mc.resize()
...
resize_8bpc_c: 1637609.7
resize_8bpc_avx2: 95162.6
2020-03-31 13:19:55 +02:00
Ronald S. Bultje
862e5bc773
checkasm: add test for mc.resize()
2020-03-31 13:19:55 +02:00
Ronald S. Bultje
aa1866f2ba
Invert src_w/h argument in mc.resize()
2020-03-31 13:19:55 +02:00
Ronald S. Bultje
8fd5dc3a5c
Make dav1d_resize_filter[] negative so it fits in int8_t
2020-03-31 13:19:55 +02:00
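The reasoning behind the sign flip in the commit above can be shown with a small sketch. The tap row and the helper `apply_negated_filter()` are illustrative assumptions, not the contents of dav1d_resize_filter[]. The point is that a tap of 128 exceeds the int8_t maximum of 127, whereas -128 fits, so the coefficients can be stored negated and the sum negated back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative 8-tap row, stored negated: an identity tap of 128 would not
 * fit in int8_t (max 127), but -128 does. */
static const int8_t neg_taps[8] = { 0, 0, 0, -128, 0, 0, 0, 0 };

/* Hypothetical helper: apply negated taps to 8 pixels, undo the negation,
 * then normalize by 7 bits with rounding and clamp to 8-bit range. */
static int apply_negated_filter(const int8_t taps[8], const uint8_t px[8])
{
    assert(taps != NULL && px != NULL);
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += taps[i] * px[i];
    sum = -sum;                 /* undo the sign flip of the stored taps */
    if (sum < 0) sum = 0;       /* clamp before shifting a negative value */
    const int v = (sum + 64) >> 7;
    return v > 255 ? 255 : v;
}
```

With the identity row, the fourth input pixel passes through unchanged, which confirms that negating the stored taps does not change the filter output.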
Ronald S. Bultje
7f2833a991
x86: add AVX2 SIMD for ipred.cfl_ac[444]
...
cfl_ac_444_w4_8bpc_c: 499.1
cfl_ac_444_w4_8bpc_ssse3: 24.3
cfl_ac_444_w4_8bpc_avx2: 28.9
cfl_ac_444_w8_8bpc_c: 1240.2
cfl_ac_444_w8_8bpc_ssse3: 47.4
cfl_ac_444_w8_8bpc_avx2: 34.9
cfl_ac_444_w16_8bpc_c: 1785.7
cfl_ac_444_w16_8bpc_ssse3: 86.7
cfl_ac_444_w16_8bpc_avx2: 54.6
cfl_ac_444_w32_8bpc_c: 4343.5
cfl_ac_444_w32_8bpc_ssse3: 236.5
cfl_ac_444_w32_8bpc_avx2: 113.6
2020-03-25 22:39:28 +01:00