100 Commits
Author SHA1 Message Date
Ronald S. Bultje 583e8e02eb tools/dav1d: initialize elapsed
Based on the following comment on IRC:
"<aconz2> the `elapsed` variable in main() is read uninitialized in
 synchronize and makes the first frametime with --frametime incorrect
 I think. Should be initialized to 0"

Confirmed that after initializing to zero, the first line in the file
generated by --frametime is reasonable.
2025-07-01 08:26:31 -04:00
Ronald S. Bultje ca83ee6d9d itx: restrict number of columns iterated over based on EOB 2024-06-17 12:44:34 -04:00
Ronald S. Bultje f1c518901b Increase timeout multiplier for aarch64/riscv64/la64-qemu CI jobs
They have been failing occasionally lately.
2024-04-13 09:53:54 -04:00
Ronald S. Bultje 7f5d3492f6 picture.c: rename picture_alloc_with_edges() to picture_alloc()
The allocated picture has no edges and is not expected to have any
edges, so the _with_edges() suffix was misleading. Fixes #415.
2024-02-02 10:57:31 -05:00
Ronald S. Bultje 18b6ed7008 Verify ref frame results after decoding completion
This fixes the issue where - when frame threading is active - that
a reference could successfully progress to a particular sbrow and
signal that, have that picked up by a frame it serves as a reference
for, which therefore decodes successfully, even though the reference
might fail decoding at a later stage.
2024-02-02 09:14:52 -05:00
Ronald S. Bultje 6d33d1796b Check for trailing marker/zero bits for tile data
Fixes #385.
2024-02-02 09:14:35 -05:00
Ronald S. Bultje ceeb535d94 qm: derive more tables at runtime
This reduces binary size from ~50kb to ~35kb. Ideas provided by Yu-Chen
(Eric) Sun and Ryan Lei from Meta.
2024-01-03 13:42:40 -05:00
Ronald S. Bultje 47107e384b deblock_avx512: convert byte-shifts to gf2p8affineqb 2023-10-05 17:24:34 +00:00
Ronald S. Bultje ad0f3e6a4b x86: add AVX512-IceLake implementation of HBD 64x64 DCT^2
Also implement "fast3" path for pass2.dct64 (where 1/8th of the
coefficients are non-zero), which affects 32x64 as well as 64x64.

Before:
inv_txfm_add_32x64_dct_dct_1_10bpc_c:          51008.6 ( 1.00x)
inv_txfm_add_32x64_dct_dct_1_10bpc_sse4:        3351.9 (15.22x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx2:        1419.5 (35.93x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx512icl:    744.8 (68.49x)

After:
inv_txfm_add_32x64_dct_dct_1_10bpc_c:          51019.5 ( 1.00x)
inv_txfm_add_32x64_dct_dct_1_10bpc_sse4:        3276.1 (15.57x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx2:        1420.7 (35.91x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx512icl:    668.3 (76.34x)

(Not sure why the SSE4 speed changed.)

And speed for 64x64:
inv_txfm_add_64x64_dct_dct_0_10bpc_c:           3506.9 ( 1.00x)
inv_txfm_add_64x64_dct_dct_0_10bpc_sse4:         535.6 ( 6.55x)
inv_txfm_add_64x64_dct_dct_0_10bpc_avx2:         223.5 (15.69x)
inv_txfm_add_64x64_dct_dct_0_10bpc_avx512icl:    252.4 (13.89x)
inv_txfm_add_64x64_dct_dct_1_10bpc_c:         108353.7 ( 1.00x)
inv_txfm_add_64x64_dct_dct_1_10bpc_sse4:        6551.9 (16.54x)
inv_txfm_add_64x64_dct_dct_1_10bpc_avx2:        2876.8 (37.66x)
inv_txfm_add_64x64_dct_dct_1_10bpc_avx512icl:   1310.1 (82.70x)
inv_txfm_add_64x64_dct_dct_2_10bpc_c:         108347.6 ( 1.00x)
inv_txfm_add_64x64_dct_dct_2_10bpc_sse4:        7985.4 (13.57x)
inv_txfm_add_64x64_dct_dct_2_10bpc_avx2:        3561.8 (30.42x)
inv_txfm_add_64x64_dct_dct_2_10bpc_avx512icl:   1962.6 (55.20x)
inv_txfm_add_64x64_dct_dct_3_10bpc_c:         108455.5 ( 1.00x)
inv_txfm_add_64x64_dct_dct_3_10bpc_sse4:        9709.0 (11.17x)
inv_txfm_add_64x64_dct_dct_3_10bpc_avx2:        4220.5 (25.70x)
inv_txfm_add_64x64_dct_dct_3_10bpc_avx512icl:   2991.1 (36.26x)
inv_txfm_add_64x64_dct_dct_4_10bpc_c:         108349.9 ( 1.00x)
inv_txfm_add_64x64_dct_dct_4_10bpc_sse4:       11048.0 ( 9.81x)
inv_txfm_add_64x64_dct_dct_4_10bpc_avx2:        4898.1 (22.12x)
inv_txfm_add_64x64_dct_dct_4_10bpc_avx512icl:   3108.1 (34.86x)
2023-04-20 12:08:42 +00:00
Ronald S. Bultje 68d7a76d08 x86: add AVX512-IceLake implementation of HBD 64x32 DCT^2
inv_txfm_add_64x32_dct_dct_0_10bpc_c:           1760.6 ( 1.00x)
inv_txfm_add_64x32_dct_dct_0_10bpc_sse4:         271.1 ( 6.49x)
inv_txfm_add_64x32_dct_dct_0_10bpc_avx2:         121.3 (14.52x)
inv_txfm_add_64x32_dct_dct_0_10bpc_avx512icl:    116.3 (15.14x)
inv_txfm_add_64x32_dct_dct_1_10bpc_c:          66507.4 ( 1.00x)
inv_txfm_add_64x32_dct_dct_1_10bpc_sse4:        3712.4 (17.91x)
inv_txfm_add_64x32_dct_dct_1_10bpc_avx2:        1830.5 (36.33x)
inv_txfm_add_64x32_dct_dct_1_10bpc_avx512icl:    805.4 (82.58x)
inv_txfm_add_64x32_dct_dct_2_10bpc_c:          66491.6 ( 1.00x)
inv_txfm_add_64x32_dct_dct_2_10bpc_sse4:        5325.3 (12.49x)
inv_txfm_add_64x32_dct_dct_2_10bpc_avx2:        2578.5 (25.79x)
inv_txfm_add_64x32_dct_dct_2_10bpc_avx512icl:   1394.5 (47.68x)
inv_txfm_add_64x32_dct_dct_3_10bpc_c:          66490.2 ( 1.00x)
inv_txfm_add_64x32_dct_dct_3_10bpc_sse4:        6418.5 (10.36x)
inv_txfm_add_64x32_dct_dct_3_10bpc_avx2:        3305.6 (20.11x)
inv_txfm_add_64x32_dct_dct_3_10bpc_avx512icl:   2571.5 (25.86x)
inv_txfm_add_64x32_dct_dct_4_10bpc_c:          66508.6 ( 1.00x)
inv_txfm_add_64x32_dct_dct_4_10bpc_sse4:        8671.2 ( 7.67x)
inv_txfm_add_64x32_dct_dct_4_10bpc_avx2:        4054.2 (16.40x)
inv_txfm_add_64x32_dct_dct_4_10bpc_avx512icl:   2691.6 (24.71x)
2023-04-18 11:01:53 -04:00
Ronald S. Bultje 0b809a9281 x86: add AVX512-IceLake implementation of HBD 64x16 DCT^2
inv_txfm_add_64x16_dct_dct_0_10bpc_c:            892.0 ( 1.00x)
inv_txfm_add_64x16_dct_dct_0_10bpc_sse4:         131.5 ( 6.78x)
inv_txfm_add_64x16_dct_dct_0_10bpc_avx2:          63.4 (14.07x)
inv_txfm_add_64x16_dct_dct_0_10bpc_avx512icl:     56.8 (15.71x)
inv_txfm_add_64x16_dct_dct_1_10bpc_c:          29253.7 ( 1.00x)
inv_txfm_add_64x16_dct_dct_1_10bpc_sse4:        1639.7 (17.84x)
inv_txfm_add_64x16_dct_dct_1_10bpc_avx2:        1106.8 (26.43x)
inv_txfm_add_64x16_dct_dct_1_10bpc_avx512icl:    532.9 (54.89x)
inv_txfm_add_64x16_dct_dct_2_10bpc_c:          29249.8 ( 1.00x)
inv_txfm_add_64x16_dct_dct_2_10bpc_sse4:        3065.6 ( 9.54x)
inv_txfm_add_64x16_dct_dct_2_10bpc_avx2:        1791.0 (16.33x)
inv_txfm_add_64x16_dct_dct_2_10bpc_avx512icl:   1108.0 (26.40x)
inv_txfm_add_64x16_dct_dct_3_10bpc_c:          29269.1 ( 1.00x)
inv_txfm_add_64x16_dct_dct_3_10bpc_sse4:        3738.2 ( 7.83x)
inv_txfm_add_64x16_dct_dct_3_10bpc_avx2:        1790.9 (16.34x)
inv_txfm_add_64x16_dct_dct_3_10bpc_avx512icl:   1203.8 (24.31x)
inv_txfm_add_64x16_dct_dct_4_10bpc_c:          29337.7 ( 1.00x)
inv_txfm_add_64x16_dct_dct_4_10bpc_sse4:        3749.7 ( 7.82x)
inv_txfm_add_64x16_dct_dct_4_10bpc_avx2:        1791.0 (16.38x)
inv_txfm_add_64x16_dct_dct_4_10bpc_avx512icl:   1203.8 (24.37x)
2023-04-13 10:36:38 -04:00
Ronald S. Bultje 6ae5766724 x86: add AVX512-IceLake implementation of HBD 32x64 DCT^2
inv_txfm_add_32x64_dct_dct_0_10bpc_c:           1783.5 ( 1.00x)
inv_txfm_add_32x64_dct_dct_0_10bpc_sse4:         243.3 ( 7.33x)
inv_txfm_add_32x64_dct_dct_0_10bpc_avx2:         119.1 (14.97x)
inv_txfm_add_32x64_dct_dct_0_10bpc_avx512icl:    142.6 (12.50x)
inv_txfm_add_32x64_dct_dct_1_10bpc_c:          50422.5 ( 1.00x)
inv_txfm_add_32x64_dct_dct_1_10bpc_sse4:        2880.5 (17.50x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx2:        1423.4 (35.43x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx512icl:    741.6 (67.99x)
inv_txfm_add_32x64_dct_dct_2_10bpc_c:          50433.6 ( 1.00x)
inv_txfm_add_32x64_dct_dct_2_10bpc_sse4:        4015.1 (12.56x)
inv_txfm_add_32x64_dct_dct_2_10bpc_avx2:        1767.7 (28.53x)
inv_txfm_add_32x64_dct_dct_2_10bpc_avx512icl:    960.8 (52.49x)
inv_txfm_add_32x64_dct_dct_3_10bpc_c:          50422.2 ( 1.00x)
inv_txfm_add_32x64_dct_dct_3_10bpc_sse4:        4500.5 (11.20x)
inv_txfm_add_32x64_dct_dct_3_10bpc_avx2:        2111.7 (23.88x)
inv_txfm_add_32x64_dct_dct_3_10bpc_avx512icl:   1777.1 (28.37x)
inv_txfm_add_32x64_dct_dct_4_10bpc_c:          50444.2 ( 1.00x)
inv_txfm_add_32x64_dct_dct_4_10bpc_sse4:        5592.8 ( 9.02x)
inv_txfm_add_32x64_dct_dct_4_10bpc_avx2:        2458.1 (20.52x)
inv_txfm_add_32x64_dct_dct_4_10bpc_avx512icl:   1867.2 (27.02x)
2023-04-12 19:16:21 -04:00
Ronald S. Bultje 5aa3b38f98 x86: add AVX512-IceLake implementation of HBD 16x64 DCT^2
nop:                                              39.4
inv_txfm_add_16x64_dct_dct_0_10bpc_c:           2208.0 ( 1.00x)
inv_txfm_add_16x64_dct_dct_0_10bpc_sse4:         133.5 (16.54x)
inv_txfm_add_16x64_dct_dct_0_10bpc_avx2:          71.3 (30.98x)
inv_txfm_add_16x64_dct_dct_0_10bpc_avx512icl:    102.0 (21.66x)
inv_txfm_add_16x64_dct_dct_1_10bpc_c:          25757.0 ( 1.00x)
inv_txfm_add_16x64_dct_dct_1_10bpc_sse4:        1366.1 (18.85x)
inv_txfm_add_16x64_dct_dct_1_10bpc_avx2:         657.6 (39.17x)
inv_txfm_add_16x64_dct_dct_1_10bpc_avx512icl:    378.9 (67.98x)
inv_txfm_add_16x64_dct_dct_2_10bpc_c:          25771.0 ( 1.00x)
inv_txfm_add_16x64_dct_dct_2_10bpc_sse4:        1739.7 (14.81x)
inv_txfm_add_16x64_dct_dct_2_10bpc_avx2:         772.1 (33.38x)
inv_txfm_add_16x64_dct_dct_2_10bpc_avx512icl:    469.3 (54.92x)
inv_txfm_add_16x64_dct_dct_3_10bpc_c:          25775.7 ( 1.00x)
inv_txfm_add_16x64_dct_dct_3_10bpc_sse4:        1968.1 (13.10x)
inv_txfm_add_16x64_dct_dct_3_10bpc_avx2:         886.5 (29.08x)
inv_txfm_add_16x64_dct_dct_3_10bpc_avx512icl:    662.6 (38.90x)
inv_txfm_add_16x64_dct_dct_4_10bpc_c:          25745.9 ( 1.00x)
inv_txfm_add_16x64_dct_dct_4_10bpc_sse4:        2330.9 (11.05x)
inv_txfm_add_16x64_dct_dct_4_10bpc_avx2:        1008.5 (25.53x)
inv_txfm_add_16x64_dct_dct_4_10bpc_avx512icl:    662.3 (38.88x)
2023-04-08 11:47:31 +00:00
Ronald S. Bultje b6bd4007cc lib.c: re-order so all code accessing f->* is grouped together 2022-03-09 15:31:04 -05:00
Ronald S. Bultje b6bec5b453 lib.c: consider a cached_error as a valid output picture
Fixes #277.
2022-03-08 11:37:00 -05:00
Ronald S. Bultje and Henrik Gramner 5fb9f3a460 CI: add threaded tests to avx512icl instance 2022-03-07 20:23:29 +00:00
Ronald S. Bultje e3f4c70006 lib.c: clear cf after seeking
Fixes #390.
2022-03-07 14:38:27 +00:00
Ronald S. Bultje 8ccdf0f6b9 task_thread: use EINVAL/ENOMEM instead of -1 for f->task_thread.retval 2022-02-17 08:46:22 -05:00
Ronald S. Bultje 2a00fb6d47 Forward frame-thread decoding errors back to user thread 2022-02-17 08:46:19 -05:00
Ronald S. Bultje 00d4715ca2 decode.c: remove dead assignment 2022-02-16 18:00:27 -05:00
Ronald S. Bultje 239c951f2e decode.c: fix return value on bitstream decoding errors
Change ENOMEM into EINVAL, since at this point memory allocation
errors don't occur, and bitstream decoding errors are not fatal.
2022-02-16 18:00:20 -05:00
Ronald S. Bultje cae2c4f0bd tools/dav1d: fix infinite loop on corrupt bitstreams
Unref data after decoding failure to prevent re-entering the loop
with the same data.
2022-02-16 18:00:09 -05:00
Ronald S. Bultje 2131a2cdaf Fix typo in EINVAL comparison 2022-02-09 13:21:28 +00:00
Ronald S. Bultje a363374a83 tools/dav1d: continue on recoverable bitstream decoding errors
Fixes inconsistent output frame count depending on --threads=X value
for the sample in #244.
2022-02-07 13:39:58 -05:00
Ronald S. Bultje and James Almer f984447637 Output only latest spatial layer if --alllayers 0
Right now, --alllayers 0 will only output operating points that exactly
match the largest one in the sequence header. However, in certain cases,
the largest one might not be available, and a smaller one should be
returned to the user instead.

This matches update_frame_buffers() in aomdec to output only the latest
frame if --alllayers 0 is specified.

Signed-off-by: James Almer <jamrial@gmail.com>
2022-02-03 19:37:56 -03:00
Ronald S. Bultje 45e8f2f5f8 Fix indentation 2022-01-18 11:25:04 -05:00
Ronald S. Bultje 9a691b3131 add --inloopfilters to enable/disable postfilters dynamically
(To be used alongside --filmgrain.)

Addresses part of #310.
2022-01-14 16:27:42 -05:00
Ronald S. Bultje b562b7f648 Set default framedelay to min(8, ceil(sqrt(n_threads)))
This reduces memory usage significantly. Fixes #375.
2022-01-10 14:49:11 +00:00
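The default named in this commit, min(8, ceil(sqrt(n_threads))), can be sketched in integer arithmetic as below. This is only an illustration of the formula from the commit message, not dav1d's actual code; the helper name default_frame_delay is hypothetical.

```c
/* Hypothetical helper (not dav1d's code): returns min(8, ceil(sqrt(n_threads))).
 * The loop finds the smallest d with d * d >= n_threads, capped at 8,
 * which avoids floating point and a libm dependency. */
static unsigned default_frame_delay(unsigned n_threads)
{
    unsigned d = 0;
    while (d < 8 && d * d < n_threads)
        d++;
    return d;
}
```

Under this formula, 4 threads give a delay of 2, 16 threads give 4, and anything from 50 threads upward is capped at 8.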
Ronald S. Bultje 068697556f Add interface to output invisible (alt-ref) frames
Addresses part of #310.
2022-01-07 22:04:24 +00:00
Ronald S. Bultje 36beb8185d Add option to write each frame to separate output file
For per-file yuv/y4m writes, this can be automatically specified
using e.g. -o file_%w_%h_%5n.yuv/y4m. --muxer=framemd5 -o - --quiet
will accomplish the same for per-frame md5sums.

Addresses part of #310.
2022-01-06 18:50:09 +00:00
Ronald S. Bultje 2337127cec Mark failed-to-decode frames as incomplete when --maxframedelay=1
Credit to oss-fuzz.
2021-11-12 07:56:05 -05:00
Ronald S. Bultje c7a5b90001 Fix wrong assignment if stride or sbh change, but stride * sbh don't
Credit to oss-fuzz.
2021-11-11 07:29:05 -05:00
Ronald S. Bultje c7f8c8276b Clear clobbered coefficient array when flushing after seek
Fixes #369.
2021-09-13 09:37:11 -04:00
Ronald S. Bultje d9c01c34dc Fix formatting string 2021-09-11 10:47:20 -04:00
Ronald S. Bultje and Jean-Baptiste Kempf eae65df192 Fix memleak
Credit to Oss-Fuzz.
2021-09-05 08:19:55 +00:00
Ronald S. Bultje 12156a507b x86/itx: 64x64 inverse dct transforms hbd/sse4 2021-08-17 14:48:20 -04:00
Ronald S. Bultje 80bfd416d7 x86/itx: 64x32 inverse dct transforms hbd/sse4 2021-08-17 14:48:14 -04:00
Ronald S. Bultje 01466edf2e x86/itx: 64x16 inverse dct transforms hbd/sse4 2021-08-17 14:35:59 -04:00
Ronald S. Bultje be788c6319 x86/itx: 32x64 inverse dct transforms hbd/sse4 2021-08-17 14:35:59 -04:00
Ronald S. Bultje db6455e479 x86/itx: 16x64 inverse dct transforms hbd/sse4 2021-08-17 14:35:52 -04:00
Ronald S. Bultje 78d4c87851 itx/x86: rewrite .transpose4x8packed so it uses only m0-3,4&6
And same for .transpose4x8packed_hi.
2021-08-12 15:20:03 -04:00
Ronald S. Bultje ec9ecba1e6 itx/x86: replace idct8x8.transpose with idct8x4.transpose4x8packed 2021-08-12 15:20:03 -04:00
Ronald S. Bultje 59770564c0 x86/itx: add 1/sqrt(2) (rect2) multiply macro 2021-08-12 15:20:01 -04:00
Ronald S. Bultje 5455e8250c x86/itx: share pass2 loop between {16,32}x32 dct^2 functions 2021-08-12 14:47:14 -04:00
Ronald S. Bultje 9cf9d4a613 x86/itx: combine .write_8x8 and .round{1,2,3,4} into a single function 2021-08-12 14:01:45 -04:00
Ronald S. Bultje 7050f0581d x86/itx: combine .write_8x4 and .round{1,2} into a single function 2021-08-12 14:01:45 -04:00
Ronald S. Bultje a5cea27ce9 x86/itx: split dct/adst/identity pass=2 implementations for 16x8
This simplifies the code a bit, and allows sharing the dct pass=2
implementation with 32x8.
2021-08-12 14:01:45 -04:00
Ronald S. Bultje and Jean-Baptiste Kempf 86b03c3cbe x86/itx: 32x32 inverse dct transforms hbd/sse4 2021-08-12 16:56:40 +00:00
Ronald S. Bultje and Jean-Baptiste Kempf 59b3fe6c50 x86/itx: 32x16 inverse dct transforms hbd/sse4 2021-08-12 16:56:40 +00:00
Ronald S. Bultje and Jean-Baptiste Kempf 2974828a25 x86/itx: 32x8 inverse dct transforms hbd/sse4 2021-08-12 16:56:40 +00:00
Ronald S. Bultje and Jean-Baptiste Kempf de6603a207 x86/itx: 16x32 inverse dct transforms hbd/sse4 2021-08-12 16:56:40 +00:00
Ronald S. Bultje and Jean-Baptiste Kempf 072eb21430 x86/itx: 8x32 inverse dct transforms hbd/sse4 2021-08-12 16:56:40 +00:00
Ronald S. Bultje b119e71dc5 x86/itx: merge pass=2 rounding and writing operations 2021-08-10 09:06:27 -04:00
Ronald S. Bultje and Jean-Baptiste Kempf ec18f047ca x86/itx: 32x{8,16,32} & {8,16}x32 idtx transforms hbd/sse4 2021-08-10 11:33:18 +00:00
Ronald S. Bultje a5f32330e4 x86/itx: replace .transpose8x8 with 2 calls to .transpose4x8packed 2021-08-08 17:50:09 -04:00
Ronald S. Bultje b34244599c x86/itx: document third argument in INV_TXFM_WxH_FN macros 2021-08-04 10:46:41 -04:00
Ronald S. Bultje 7edb1a7ed5 x86/itx: 16x16 inverse transforms hbd/sse4 2021-08-02 18:17:32 -04:00
Ronald S. Bultje bcc994514c x86/itx: 16x8 inverse transforms hbd/sse4 2021-08-02 18:17:16 -04:00
Ronald S. Bultje ac8fa32a06 x86/itx: 16x4 inverse transforms hbd/sse4 2021-08-02 18:16:04 -04:00
Ronald S. Bultje e266f9fa40 x86/itx: 8x16 inverse transforms hbd/sse4 2021-07-28 09:13:38 -04:00
Ronald S. Bultje d5c0831297 x86/itx: 8x8 inverse transforms hbd/sse4 2021-07-28 09:13:32 -04:00
Ronald S. Bultje a804d43004 x86/itx: add eob-based fast path to 4x16 hbd/sse4 itx 2021-07-28 09:10:14 -04:00
Ronald S. Bultje e7228e8013 x86/itx: add eob-based fast path to 4x8 hbd/sse4 itx 2021-07-28 09:10:14 -04:00
Ronald S. Bultje 999a1c4d2a x86/itx: 8x4 inverse transforms hbd/sse4 2021-07-28 09:10:14 -04:00
Ronald S. Bultje ba183d230c x86/itx: 4x16 inverse transforms hbd/sse4 2021-07-28 09:10:10 -04:00
Ronald S. Bultje 755364cbc6 x86/itx: 4x8 inverse transforms hbd/sse4 2021-07-21 11:12:28 -04:00
Ronald S. Bultje and Jean-Baptiste Kempf c719d4a4e1 x86/filmgrain: add fguv_32x32xn_i444 HBD/AVX2 2021-07-20 12:23:15 +00:00
Ronald S. Bultje and Jean-Baptiste Kempf cc0e2d5f2d x86/filmgrain: add fguv_32x32xn_i422 HBD/AVX2 2021-07-20 12:23:15 +00:00
Ronald S. Bultje and Jean-Baptiste Kempf 8f858c2385 x86/filmgrain: add fguv_32x32xn_i422/444 HBD/SSSE3 2021-07-20 12:23:15 +00:00
Ronald S. Bultje and Jean-Baptiste Kempf 42978746f4 x86/itx: change function signatures of itx_4x4 to 0 GPRs
The wrapper function already backs up GPRs, and declaring 7 here means
we will backup/restore twice on x86-32.
2021-07-19 13:09:20 +00:00
Ronald S. Bultje 1944317ea6 x86/filmgrain: simplify post-horizontal filter blending 2021-07-16 17:51:17 -04:00
Ronald S. Bultje and Jean-Baptiste Kempf 73db537834 x86/filmgrain: add generate_grain_uv_i422/i444 HBD AVX2 & SSSE3 2021-07-15 15:07:22 +00:00
Ronald S. Bultje and Jean-Baptiste Kempf 35aa1c226b x86/filmgrain: make fguv_i420_32x32xn HBD/SSSE3 32bit-compatible 2021-07-14 17:44:21 +00:00
Ronald S. Bultje and Jean-Baptiste Kempf 6235cdf16e x86/filmgrain: make fgy_32x32xn HBD/SSSE3 32bit-compatible 2021-07-14 17:44:21 +00:00
Ronald S. Bultje 7e6fc8b040 x86/film_grain: make generate_grain_y/uv_420 32-bit compatible 2021-07-12 13:36:48 +00:00
Ronald S. Bultje and Jean-Baptiste Kempf 33180d8f6f x86/deblock: make hbd/ssse3 implementations 32bit-compatible 2021-07-06 21:45:20 +00:00
Ronald S. Bultje da98a8d562 x86/deblock_avx2: use vpblendvb instead of pand/pandn/por in flat16/8/6 2021-07-05 07:40:27 -04:00
Ronald S. Bultje 0aca76c3b7 x86/deblock_hbd_avx2: use vpblendvb instead of pand/pandn/por in flat16/8/6 2021-07-05 07:40:24 -04:00
Ronald S. Bultje af16b652aa Add SSSE3 HBD filmgrain assembly optimizations 2021-06-15 09:49:02 -04:00
Ronald S. Bultje f7043e4742 Add 10/12-bit deblock SSSE3 implementation
Currently 64-bit only.
2021-06-11 12:06:15 -04:00
Ronald S. Bultje 1156c0442a mc: add HBD/SSSE3 mc.emu_edge optimizations 2021-06-09 23:21:44 +00:00
Ronald S. Bultje e00e741161 checkasm: allow 1 >= h >= 2 in fgy_32x32xn unit test 2021-06-05 20:22:44 +00:00
Ronald S. Bultje a8b13fc110 Do avx2/hbd scaling*grain multiplication in 16bit instead of 32bit 2021-06-04 19:39:00 +00:00
Ronald S. Bultje d16ddb34aa x86: add 10/12-bpc AVX2 version of mc.emu_edge 2021-05-11 08:02:21 -04:00
Ronald S. Bultje and Henrik Gramner 3a6630707e x86: Add high bitdepth filmgrain AVX2 asm 2021-05-10 20:41:23 +02:00
Ronald S. Bultje and Henrik Gramner 24b1a4adb3 x86: Add high bitdepth loopfilter AVX2 asm 2021-05-05 00:25:55 +02:00
Ronald S. Bultje and Henrik Gramner 87aa815cfa x86: Add high bitdepth cdef AVX2 asm 2021-05-05 00:25:55 +02:00
Ronald S. Bultje 47daa4df33 Accumulate leb128 value using uint64_t as intermediate type
The shift-amount can be up to 56, and left-shifting 32-bit integers
by values >=32 is undefined behaviour. Therefore, use 64-bit integers
instead. Also slightly rewrite so we only call dav1d_get_bits() once
for the combined more|bits value, and mask the relevant portions
out instead of reading twice. Lastly, move the overflow check out of
the loop (as suggested by @wtc)

Fixes #341.
2020-06-22 21:10:55 -04:00
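The approach described in this commit (a 64-bit accumulator, with the overflow check performed once after the loop) can be sketched roughly as follows. This is a simplified illustration, not dav1d's code: it reads from a plain byte buffer instead of calling dav1d_get_bits(), and the function name read_leb128 and its signature are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not dav1d's code): decode one leb128 value from buf.
 * Returns 0 on success and -1 on truncated input or a value above 32 bits. */
static int read_leb128(const uint8_t *buf, size_t size,
                       uint32_t *out, size_t *consumed)
{
    uint64_t val = 0;    /* 64-bit accumulator: shifts of 32 or more stay defined */
    unsigned shift = 0;
    unsigned more;
    size_t pos = 0;

    do {
        if (pos >= size)
            return -1;                   /* ran out of input */
        const unsigned v = buf[pos++];
        more = v & 0x80;                 /* continuation flag */
        val |= (uint64_t)(v & 0x7f) << shift;
        shift += 7;
    } while (more && shift < 56);

    /* Single overflow check, outside the loop. */
    if (more || val > UINT32_MAX)
        return -1;

    *out = (uint32_t)val;
    *consumed = pos;
    return 0;
}
```

With a 32-bit accumulator, the fifth payload byte would already be shifted by 28 and could push bits past bit 31, and later bytes would need shifts of 35 or more, which is undefined for 32-bit operands in C. Widening to uint64_t keeps every shift in this loop well defined.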
Ronald S. Bultje 41cd4199f1 Skip loop restoration cache buffer resize for too-small buffers
Fixes crashes in dav1d_resize_{avx2,ssse3} on very small resolutions
with super_res enabled but skipped because the width is too small.
2020-04-02 02:44:27 +02:00
Ronald S. Bultje 4687c4696f x86: add SSSE3 versions for filmgrain.fguv_32x32xn[422/444]
fguv_32x32xn_8bpc_420_csfl0_c: 14568.2
fguv_32x32xn_8bpc_420_csfl0_ssse3: 1162.3
fguv_32x32xn_8bpc_420_csfl1_c: 10682.0
fguv_32x32xn_8bpc_420_csfl1_ssse3: 910.3
fguv_32x32xn_8bpc_422_csfl0_c: 16370.5
fguv_32x32xn_8bpc_422_csfl0_ssse3: 1202.6
fguv_32x32xn_8bpc_422_csfl1_c: 11333.8
fguv_32x32xn_8bpc_422_csfl1_ssse3: 958.8
fguv_32x32xn_8bpc_444_csfl0_c: 12950.1
fguv_32x32xn_8bpc_444_csfl0_ssse3: 1133.6
fguv_32x32xn_8bpc_444_csfl1_c: 8806.7
fguv_32x32xn_8bpc_444_csfl1_ssse3: 731.0
2020-04-01 10:50:56 -04:00
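The benchmark lines in this entry list raw timings without the speedup ratio that other entries in this log print in parentheses. A minimal sketch for deriving it, assuming the ratio is simply the C timing divided by the SIMD timing (which is consistent with the parenthesised values elsewhere in this log); the helper name speedup is hypothetical:

```c
/* Hypothetical helper: how many times faster the SIMD version is than
 * the C reference, given two timings from the same benchmark run. */
static double speedup(double c_time, double simd_time)
{
    return c_time / simd_time;
}
```

For example, speedup(14568.2, 1162.3) applied to the fguv_32x32xn_8bpc_420_csfl0 pair above gives roughly 12.5x.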
Ronald S. Bultje b73acaa894 x86: use btc instead of xor+test or 32byte alignment in fgy_32x32xn_ssse3 2020-04-01 10:50:22 -04:00
Ronald S. Bultje 275e91de9e x86: add AVX2 versions for filmgrain.fguv_32x32xn[422/444]
fguv_32x32xn_8bpc_420_csfl0_c: 14568.2
fguv_32x32xn_8bpc_420_csfl0_avx2: 940.2
fguv_32x32xn_8bpc_420_csfl1_c: 10682.0
fguv_32x32xn_8bpc_420_csfl1_avx2: 783.3
fguv_32x32xn_8bpc_422_csfl0_c: 16370.5
fguv_32x32xn_8bpc_422_csfl0_avx2: 1557.3
fguv_32x32xn_8bpc_422_csfl1_c: 11333.8
fguv_32x32xn_8bpc_422_csfl1_avx2: 902.1
fguv_32x32xn_8bpc_444_csfl0_c: 12950.1
fguv_32x32xn_8bpc_444_csfl0_avx2: 822.9
fguv_32x32xn_8bpc_444_csfl1_c: 8806.7
fguv_32x32xn_8bpc_444_csfl1_avx2: 708.2
2020-04-01 10:50:02 -04:00
Ronald S. Bultje fcc94fa905 x86: use btc instead of xor+test in fgy_32x32xn_avx2 2020-04-01 10:49:41 -04:00
Ronald S. Bultje 4dd943156d x86: don't use vptest in SSSE3 version
This is the VEX (AVX) encoded variant for the SSE4 instruction ptest,
so emulate it using pmovmskb in the SSSE3 version.
2020-03-31 10:26:08 -04:00
Ronald S. Bultje e308ae49b3 x86: add SSSE3 version of mc.resize()
resize_8bpc_c: 1613670.2
resize_8bpc_ssse3: 110469.5
resize_8bpc_avx2: 93580.6
2020-03-31 13:19:55 +02:00
Ronald S. Bultje 9e36b9b001 x86: add AVX2 version of mc.resize()
resize_8bpc_c: 1637609.7
resize_8bpc_avx2: 95162.6
2020-03-31 13:19:55 +02:00
Ronald S. Bultje 862e5bc773 checkasm: add test for mc.resize() 2020-03-31 13:19:55 +02:00
Ronald S. Bultje aa1866f2ba Invert src_w/h argument in mc.resize() 2020-03-31 13:19:55 +02:00
Ronald S. Bultje 8fd5dc3a5c Make dav1d_resize_filter[] negative so it fits in int8_t 2020-03-31 13:19:55 +02:00
Ronald S. Bultje 7f2833a991 x86: add AVX2 SIMD for ipred.cfl_ac[444]
cfl_ac_444_w4_8bpc_c: 499.1
cfl_ac_444_w4_8bpc_ssse3: 24.3
cfl_ac_444_w4_8bpc_avx2: 28.9
cfl_ac_444_w8_8bpc_c: 1240.2
cfl_ac_444_w8_8bpc_ssse3: 47.4
cfl_ac_444_w8_8bpc_avx2: 34.9
cfl_ac_444_w16_8bpc_c: 1785.7
cfl_ac_444_w16_8bpc_ssse3: 86.7
cfl_ac_444_w16_8bpc_avx2: 54.6
cfl_ac_444_w32_8bpc_c: 4343.5
cfl_ac_444_w32_8bpc_ssse3: 236.5
cfl_ac_444_w32_8bpc_avx2: 113.6
2020-03-25 22:39:28 +01:00