97 Commits
Author SHA1 Message Date
Victorien Le Couviour--Tuffet f995e1fbf9 threading: Schedule TILE tasks for all passes at once
Closes #465.
2026-04-27 21:09:28 +02:00
victorien 575af25859 flush: Reset f->task_thread.error
f->task_thread.error can be set during flushing, not resetting this can
lead to c->task_thread.first being increased after a frame has already been
submitted post flushing. That's fine if it happens on the very first frame,
but on any subsequent frame it results in a wrong frame ordering.
Since a non-first frame is then considered the first one, its tasks can't
execute (they depend on a truly previous frame that is now considered as
coming after it), and c->task_thread.cur gets increased past that frame with
no way of being reset, eventually leading to a hang.
2024-11-28 17:56:13 +01:00
Victorien Le Couviour--Tuffet a500abb750 x86: Add refmvs.load_tmvs asm 2023-06-30 21:34:31 +02:00
Victorien Le Couviour--Tuffet f89dbc0717 threading: Fix a race on task_thread.init_done
Fixes a race where the tasks inserted by the init task could all be
executed, signaling frame completion and letting another frame start
before init_done could be set by the aforementioned init task. That task
then sets it, preventing the init task of the new frame from being executed.

This then caused an assert to trigger further down in the task picking loop.
Credits to Oss-Fuzz.
2023-05-04 14:59:07 +02:00
Victorien Le Couviour--Tuffet 8c731791c7 checkasm: Improve mv generation for refmvs.save_tmvs 2023-03-23 15:44:03 +01:00
Victorien Le Couviour--Tuffet 16c943484e x86: Add refmvs.save_tmvs AVX-512 (Ice Lake) asm 2023-03-16 16:09:46 +01:00
Victorien Le Couviour--Tuffet 7d23ec4a04 x86: Add refmvs.save_tmvs SSSE3 asm 2023-03-13 15:19:35 +00:00
Victorien Le Couviour--Tuffet c77fb1f016 x86: Optimize refmvs.save_tmvs AVX2 asm
Process 2 blocks per iteration instead of 4.

Credits to gramner@twoorioles.com.
2023-03-13 15:19:35 +00:00
Victorien Le Couviour--Tuffet cf617fdae0 threading: Ensure passing the correct retval to decode_frame_exit
We must reload error just before calling dav1d_decode_frame_exit, as
it may have become stale between the last load and that call.
This can result in crashes since we signal a seemingly successfully decoded
frame, when it's not.
Reloading error within the frame done condition's body ensures a non-stale
value, as we use 'f->task_thread.task_counter == 0' to ensure all other
threads / tasks have already completed when entering it. In other words, only
the last thread still working on this frame can execute this code, after
all other threads have returned to doing something else.
2023-03-13 13:54:36 +01:00
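The reload described in the commit above can be pictured with a minimal C11 sketch. This is not the dav1d code: `frame_state`, `finish_task` and the 0 / -1 return convention are hypothetical stand-ins for `f->task_thread.task_counter`, `f->task_thread.error` and the value handed to `dav1d_decode_frame_exit`.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the per-frame threading state. */
typedef struct {
    atomic_int task_counter; /* tasks still running for this frame */
    atomic_int error;        /* nonzero once any task has failed */
} frame_state;

/*
 * Called by a worker when one of its tasks for this frame ends.
 * Returns 1 and stores the frame's final result in *retval if this was
 * the last task, 0 otherwise (*retval is then left untouched).
 */
static int finish_task(frame_state *f, int *retval) {
    /* atomic_fetch_sub returns the previous value, so 1 means "now zero". */
    if (atomic_fetch_sub(&f->task_counter, 1) == 1) {
        /* Reload error here: no other task can still change it. */
        *retval = atomic_load(&f->error) ? -1 : 0;
        return 1;
    }
    return 0;
}
```

Loading `error` before the counter check and reusing that value inside the branch would reintroduce the stale read the commit describes.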
Victorien Le Couviour--Tuffet 6b8438b193 x86: Add refmvs.save_tmvs AVX2 asm 2023-03-06 13:36:22 +00:00
Victorien Le Couviour--Tuffet 0d9fe4ea65 refmvs: Add refmvs_load/save_tmvs to dsp interface 2023-03-06 13:36:22 +00:00
Victorien Le Couviour--Tuffet 19167a2c93 refmvs: Pack refmvs_temporal_block struct
Pack the 5 bytes of data to improve memory usage and performance.
2023-03-06 13:36:22 +00:00
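As a rough illustration of what packing 5 bytes of data means, the sketch below compares a padded and a packed layout. The type and field names are hypothetical (the real refmvs_temporal_block definition is not shown in this log), and the `packed` attribute is GCC/Clang syntax.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 4-byte motion vector, also addressable as one 32-bit word. */
typedef union {
    struct { int16_t y, x; } c;
    uint32_t n;
} mv_sketch;

/* Unpacked: padded up to the union's alignment (8 bytes on common ABIs). */
typedef struct {
    mv_sketch mv;
    int8_t ref;
} tmv_unpacked;

/* Packed: exactly 4 + 1 = 5 bytes per entry (GCC/Clang attribute syntax). */
typedef struct __attribute__((packed)) {
    mv_sketch mv;
    int8_t ref;
} tmv_packed;

/* Bytes needed to store n packed entries. */
static size_t tmv_packed_bytes(size_t n) {
    return n * sizeof(tmv_packed);
}
```

With the packed layout, an array of entries has no padding between elements, so more entries fit in each cache line.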
Victorien Le Couviour--Tuffet 9b4b244810 drain: Properly fix a desync between next and first
The code in dav1d_drain_picture could result in a desync between
c->task_thread.first (oldest submitted frame) and c->frame_thread.next (first
frame to retrieve and/or next submit location).
As we loop through drain, we always increment next, but first only if the
frame has data. If the frame is visible we return. The problem arises when
we encounter one or more invisible frames and the next entries haven't been
fed yet: we then keep looping, increasing next but not first, as these
entries have no data.

We should always return once we have encountered data (a visible or
invisible decoded frame): for visible frames, the code already returns; for
invisible ones, we can store a boolean indicating that we drained at least
one frame. Whenever we reach an empty entry after that, we return (all
subsequent entries are guaranteed to be empty anyway), incrementing neither
next nor first.
This has the effect of inserting the next frame at the first free spot
(which is much better than the weird skips it's doing now).

So basically, c->frame_thread.next could skip some (empty) entries.
Now it's contiguous.

Fixes #416.
2023-02-10 15:11:32 +01:00
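The drain logic described above can be modelled with a small single-threaded sketch. It is a simplification under stated assumptions, not the code of dav1d_drain_picture: `slot`, `ring` and `drain` are hypothetical names, and locking, error handling and picture references are left out.

```c
/* Hypothetical model of one frame-delay slot. */
typedef struct {
    int has_data; /* a decoded frame is waiting in this slot */
    int visible;  /* the frame should be returned to the caller */
} slot;

typedef struct {
    slot *slots;
    unsigned n;     /* number of slots (frame delay) */
    unsigned next;  /* next slot to retrieve from / submit to */
    unsigned first; /* oldest submitted frame */
} ring;

/* Drains until a visible frame is found. Returns 1 if one was output. */
static int drain(ring *r) {
    int drained = 0;
    for (unsigned i = 0; i < r->n; i++) {
        slot *s = &r->slots[r->next];
        if (s->has_data) {
            r->first = (r->first + 1) % r->n;
            drained = 1;
        } else if (drained) {
            /* Empty entry after real data: stop, touching neither index. */
            break;
        }
        r->next = (r->next + 1) % r->n;
        if (s->has_data) {
            s->has_data = 0;
            if (s->visible) return 1;
        }
    }
    return 0;
}
```

In this model, an invisible frame followed by empty slots leaves next and first equal after the call, and a leading empty slot (the 7-frames-in-8-slots case) is still skipped so the oldest frame can be reached.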
Victorien Le Couviour--Tuffet 3f19ece69f Revert "Fix mismatch between first and next in drain"
This reverts commit a51b6ce417.

We can't increment first when no data is there, otherwise we might do it
while the first frame is not yet decoded, messing up the ordering. Imagine
a framedelay of 8 and a file with 7 frames: we feed 7 frames over 8
slots, next now points to [7] (an empty entry), and we start draining because
of EOF. We do need next to be incremented to reach the first frame ([0]) so
it can be output, and only then should first be incremented too.

Fixes #418.
2023-02-09 16:36:57 +01:00
Victorien Le Couviour--Tuffet a51b6ce417 Fix mismatch between first and next in drain
Fixes #416.
2023-01-26 12:49:36 +01:00
Victorien Le Couviour--Tuffet 8f16314dba threading: Add a pending list for async task insertion 2022-10-27 13:03:22 +00:00
Victorien Le Couviour--Tuffet 3e7886db54 threading: Fix a race around frame completion (frame-mt)
The completion of the first frame to decode while an async reset
request on that same frame is pending will render it stale. The
processing of such a stale request is likely to result in a hang.

One reason this happens is the skip condition at the beginning of
reset_task_cur().
=> Consume the async request before that check.

Another reason is several threads producing async reset requests in
parallel: an async request for the first frame could cascade through the
other threads (other frames) during completion of that frame, and thus
not be caught by the last synchronous reset_task_cur() after
signaling the main thread and before releasing the lock.
=> To solve this we need to add protections at the racy locations. That
means after we increase first, before returning from
reset_task_cur_async(), and after consuming the async request.
2022-10-20 14:23:30 +02:00
Victorien Le Couviour--Tuffet 6680d26f30 threading: Limit the progress bitfields to the used size
Store the used size instead of the allocated size.

The used size can be smaller than the allocated size, which results in
a wrong computation of the linear progress from the frame_progress
bitfield.
2022-09-08 14:50:25 +02:00
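To see why the scanned size matters, here is a hedged sketch of deriving a linear progress value from a bitfield. It is not the dav1d implementation: `linear_progress`, the one-bit-per-row layout and the 32-bit word size are assumptions made for illustration.

```c
#include <stdint.h>

/* Index of the lowest set bit; v must be nonzero. */
static int lowest_set_bit(uint32_t v) {
    int i = 0;
    while (!(v & 1u)) { v >>= 1; i++; }
    return i;
}

/*
 * Number of rows completed contiguously from row 0, where bit k of the
 * bitfield means "row k is done". Only used_words words are scanned.
 */
static int linear_progress(const uint32_t *bits, int used_words) {
    for (int idx = 0; idx < used_words; idx++) {
        const uint32_t pending = ~bits[idx];
        if (pending) return idx * 32 + lowest_set_bit(pending);
    }
    return used_words * 32;
}
```

If the scan used the allocated word count instead, leftover bits beyond the used range could be counted, overstating progress; this is one way such a mismatch could show up.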
Victorien Le Couviour--Tuffet 895fed08e1 checkasm: Add short options 2022-09-02 17:15:18 +02:00
Victorien Le Couviour--Tuffet 713a4f4e50 checkasm: Add pattern matching to --test 2022-09-02 17:15:18 +02:00
Victorien Le Couviour--Tuffet a63a7c9674 checkasm: Remove pattern matching from --bench
The pattern matching feature has been improved and is now performed
under the new --function parameter, rendering this one obsolete.
2022-09-02 17:15:18 +02:00
Victorien Le Couviour--Tuffet d5d37926b6 checkasm: Add a --function option
Allows running checkasm only for functions matching a given pattern.
2022-09-02 17:15:18 +02:00
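The commit message does not spell out the pattern syntax. Assuming a glob-like syntax where `*` matches any run of characters and `?` matches a single one, filtering function names could look like the sketch below; `wild_match` is a hypothetical helper, not a function taken from checkasm.

```c
/*
 * Returns 1 if str matches pat, 0 otherwise.
 * Assumed syntax: '*' matches any (possibly empty) run of characters,
 * '?' matches exactly one character, anything else matches itself.
 */
static int wild_match(const char *str, const char *pat) {
    for (; *pat; str++, pat++) {
        if (*pat == '*') {
            while (*pat == '*') pat++;   /* collapse consecutive stars */
            if (!*pat) return 1;         /* trailing star matches the rest */
            for (; *str; str++)          /* try every possible split point */
                if (wild_match(str, pat)) return 1;
            return 0;
        }
        if (!*str) return 0;             /* pattern longer than string */
        if (*pat != '?' && *pat != *str) return 0;
    }
    return !*str;                        /* both must end together */
}
```

Under that assumption, a pattern such as `mc_scaled_*_avx2` would select only the AVX2 variants of the scaled MC functions.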
Victorien Le Couviour--Tuffet a3a55b1849 threading: Fix copy_lpf_progress initialization
The copy_lpf_progress bitfield might not be fully cleared when size goes
down.

Credit to Oss-Fuzz.
2022-08-30 17:31:28 +02:00
Victorien Le Couviour--Tuffet 9717802d01 checkasm/lpf: Use operating dimensions
Fixes use of uninitialized value.
2022-06-13 14:00:59 +02:00
Victorien Le Couviour--Tuffet and Ronald S. Bultje b4f9eac858 checkasm: Fix uninitialized variable
fg_data->num_y_points is used in generate_grain_uv, but is only set
after the call: move the initialization above.
2022-05-31 16:28:34 +00:00
Victorien Le Couviour--Tuffet ebeaac6d60 Fix typo
Insert missing space.
2022-05-25 19:10:00 +02:00
Victorien Le Couviour--Tuffet and Henrik Gramner b4d70152fd Fix delayed_fg.scaling alignment for AVX-512 2022-03-07 19:27:08 +00:00
Victorien Le Couviour--Tuffet 402b54fcae Integrate film grain in the task threading system 2022-02-28 18:20:48 +01:00
Victorien Le Couviour--Tuffet 4a52aa4790 x86: Add mc.resize AVX-512 (Ice Lake) asm
resize_8bpc_c: 542599.0
resize_8bpc_ssse3: 87635.4
resize_8bpc_avx2: 67401.1
resize_8bpc_avx512icl: 50263.6

resize_16bpc_c: 573438.9
resize_16bpc_ssse3: 121505.2
resize_16bpc_avx2: 83293.4
resize_16bpc_avx512icl: 77974.8
2022-01-24 15:37:16 +01:00
Victorien Le Couviour--Tuffet 1cdde64f82 Run the init tasks for all frames first 2022-01-24 13:16:54 +01:00
Victorien Le Couviour--Tuffet a8f3124a6c Split the frame init task
Allows running most of dav1d_decode_frame_init unconditionally by putting the
CDF and subsequent initializations in a separate task.
2022-01-24 13:16:54 +01:00
Victorien Le Couviour--Tuffet 1e3f0bea39 Move ENTROPY_PROGRESS task up 2022-01-19 20:17:31 +01:00
Victorien Le Couviour--Tuffet 6aaeeea689 Fix current frame selector wrapping condition
This could cause a desync between first and cur, which results in
skipping a frame, halting the decoding.
In the current state of the project, this desync typically doesn't last
"long enough" to trigger the bug, as some frames would reset cur back.
In order to trigger it, one needs to call reset_task_cur() on the last
frame; this would be the call after insertion of the INIT task (during
dav1d_task_frame_init).
This doesn't happen, as we would normally pick a task from a previous frame
already in the queue.
2022-01-19 20:17:31 +01:00
Victorien Le Couviour--Tuffet 5919517ff6 x86: Add high bitdepth mc(t)_scaled SSSE3 asm
mc_scaled_8tap_regular_w2_16bpc_c: 737.7
mc_scaled_8tap_regular_w2_16bpc_ssse3: 151.7
mc_scaled_8tap_regular_w2_16bpc_avx2: 141.2
mc_scaled_8tap_regular_w2_dy1_16bpc_c: 660.3
mc_scaled_8tap_regular_w2_dy1_16bpc_ssse3: 80.8
mc_scaled_8tap_regular_w2_dy1_16bpc_avx2: 73.2
mc_scaled_8tap_regular_w2_dy2_16bpc_c: 884.9
mc_scaled_8tap_regular_w2_dy2_16bpc_ssse3: 101.6
mc_scaled_8tap_regular_w2_dy2_16bpc_avx2: 87.2
mc_scaled_8tap_regular_w4_16bpc_c: 1356.3
mc_scaled_8tap_regular_w4_16bpc_ssse3: 172.3
mc_scaled_8tap_regular_w4_16bpc_avx2: 172.5
mc_scaled_8tap_regular_w4_dy1_16bpc_c: 1244.9
mc_scaled_8tap_regular_w4_dy1_16bpc_ssse3: 125.7
mc_scaled_8tap_regular_w4_dy1_16bpc_avx2: 96.1
mc_scaled_8tap_regular_w4_dy2_16bpc_c: 1665.6
mc_scaled_8tap_regular_w4_dy2_16bpc_ssse3: 150.2
mc_scaled_8tap_regular_w4_dy2_16bpc_avx2: 112.8
mc_scaled_8tap_regular_w8_16bpc_c: 2536.5
mc_scaled_8tap_regular_w8_16bpc_ssse3: 383.4
mc_scaled_8tap_regular_w8_16bpc_avx2: 256.2
mc_scaled_8tap_regular_w8_dy1_16bpc_c: 2331.8
mc_scaled_8tap_regular_w8_dy1_16bpc_ssse3: 350.0
mc_scaled_8tap_regular_w8_dy1_16bpc_avx2: 214.0
mc_scaled_8tap_regular_w8_dy2_16bpc_c: 3169.6
mc_scaled_8tap_regular_w8_dy2_16bpc_ssse3: 395.7
mc_scaled_8tap_regular_w8_dy2_16bpc_avx2: 265.7
mc_scaled_8tap_regular_w16_16bpc_c: 6384.6
mc_scaled_8tap_regular_w16_16bpc_ssse3: 1004.4
mc_scaled_8tap_regular_w16_16bpc_avx2: 665.0
mc_scaled_8tap_regular_w16_dy1_16bpc_c: 6103.4
mc_scaled_8tap_regular_w16_dy1_16bpc_ssse3: 896.3
mc_scaled_8tap_regular_w16_dy1_16bpc_avx2: 544.2
mc_scaled_8tap_regular_w16_dy2_16bpc_c: 8584.5
mc_scaled_8tap_regular_w16_dy2_16bpc_ssse3: 1049.0
mc_scaled_8tap_regular_w16_dy2_16bpc_avx2: 695.1
mc_scaled_8tap_regular_w32_16bpc_c: 19672.8
mc_scaled_8tap_regular_w32_16bpc_ssse3: 3204.3
mc_scaled_8tap_regular_w32_16bpc_avx2: 2109.6
mc_scaled_8tap_regular_w32_dy1_16bpc_c: 15964.6
mc_scaled_8tap_regular_w32_dy1_16bpc_ssse3: 2634.5
mc_scaled_8tap_regular_w32_dy1_16bpc_avx2: 1555.8
mc_scaled_8tap_regular_w32_dy2_16bpc_c: 24156.9
mc_scaled_8tap_regular_w32_dy2_16bpc_ssse3: 3217.3
mc_scaled_8tap_regular_w32_dy2_16bpc_avx2: 2088.8
mc_scaled_8tap_regular_w64_16bpc_c: 74356.3
mc_scaled_8tap_regular_w64_16bpc_ssse3: 11225.9
mc_scaled_8tap_regular_w64_16bpc_avx2: 7434.7
mc_scaled_8tap_regular_w64_dy1_16bpc_c: 60080.9
mc_scaled_8tap_regular_w64_dy1_16bpc_ssse3: 8912.8
mc_scaled_8tap_regular_w64_dy1_16bpc_avx2: 5222.2
mc_scaled_8tap_regular_w64_dy2_16bpc_c: 88891.4
mc_scaled_8tap_regular_w64_dy2_16bpc_ssse3: 10824.8
mc_scaled_8tap_regular_w64_dy2_16bpc_avx2: 7086.3
mc_scaled_8tap_regular_w128_16bpc_c: 171633.3
mc_scaled_8tap_regular_w128_16bpc_ssse3: 27089.3
mc_scaled_8tap_regular_w128_16bpc_avx2: 17998.2
mc_scaled_8tap_regular_w128_dy1_16bpc_c: 164399.9
mc_scaled_8tap_regular_w128_dy1_16bpc_ssse3: 24694.1
mc_scaled_8tap_regular_w128_dy1_16bpc_avx2: 14711.2
mc_scaled_8tap_regular_w128_dy2_16bpc_c: 244865.3
mc_scaled_8tap_regular_w128_dy2_16bpc_ssse3: 30599.1
mc_scaled_8tap_regular_w128_dy2_16bpc_avx2: 20341.1

mct_scaled_8tap_regular_w4_16bpc_c: 946.2
mct_scaled_8tap_regular_w4_16bpc_ssse3: 117.5
mct_scaled_8tap_regular_w4_16bpc_avx2: 112.5
mct_scaled_8tap_regular_w4_dy1_16bpc_c: 886.1
mct_scaled_8tap_regular_w4_dy1_16bpc_ssse3: 100.5
mct_scaled_8tap_regular_w4_dy1_16bpc_avx2: 76.8
mct_scaled_8tap_regular_w4_dy2_16bpc_c: 1170.1
mct_scaled_8tap_regular_w4_dy2_16bpc_ssse3: 117.6
mct_scaled_8tap_regular_w4_dy2_16bpc_avx2: 87.9
mct_scaled_8tap_regular_w8_16bpc_c: 2784.2
mct_scaled_8tap_regular_w8_16bpc_ssse3: 408.5
mct_scaled_8tap_regular_w8_16bpc_avx2: 280.3
mct_scaled_8tap_regular_w8_dy1_16bpc_c: 2530.5
mct_scaled_8tap_regular_w8_dy1_16bpc_ssse3: 358.2
mct_scaled_8tap_regular_w8_dy1_16bpc_avx2: 227.1
mct_scaled_8tap_regular_w8_dy2_16bpc_c: 3525.0
mct_scaled_8tap_regular_w8_dy2_16bpc_ssse3: 425.6
mct_scaled_8tap_regular_w8_dy2_16bpc_avx2: 283.6
mct_scaled_8tap_regular_w16_16bpc_c: 6773.8
mct_scaled_8tap_regular_w16_16bpc_ssse3: 1054.6
mct_scaled_8tap_regular_w16_16bpc_avx2: 696.4
mct_scaled_8tap_regular_w16_dy1_16bpc_c: 6418.0
mct_scaled_8tap_regular_w16_dy1_16bpc_ssse3: 938.7
mct_scaled_8tap_regular_w16_dy1_16bpc_avx2: 584.5
mct_scaled_8tap_regular_w16_dy2_16bpc_c: 9432.4
mct_scaled_8tap_regular_w16_dy2_16bpc_ssse3: 1125.3
mct_scaled_8tap_regular_w16_dy2_16bpc_avx2: 753.1
mct_scaled_8tap_regular_w32_16bpc_c: 26028.8
mct_scaled_8tap_regular_w32_16bpc_ssse3: 4128.4
mct_scaled_8tap_regular_w32_16bpc_avx2: 2748.4
mct_scaled_8tap_regular_w32_dy1_16bpc_c: 21604.3
mct_scaled_8tap_regular_w32_dy1_16bpc_ssse3: 3312.4
mct_scaled_8tap_regular_w32_dy1_16bpc_avx2: 2051.1
mct_scaled_8tap_regular_w32_dy2_16bpc_c: 32844.3
mct_scaled_8tap_regular_w32_dy2_16bpc_ssse3: 4102.9
mct_scaled_8tap_regular_w32_dy2_16bpc_avx2: 2741.6
mct_scaled_8tap_regular_w64_16bpc_c: 49101.8
mct_scaled_8tap_regular_w64_16bpc_ssse3: 8758.9
mct_scaled_8tap_regular_w64_16bpc_avx2: 5822.2
mct_scaled_8tap_regular_w64_dy1_16bpc_c: 53557.7
mct_scaled_8tap_regular_w64_dy1_16bpc_ssse3: 8469.7
mct_scaled_8tap_regular_w64_dy1_16bpc_avx2: 5264.3
mct_scaled_8tap_regular_w64_dy2_16bpc_c: 83379.7
mct_scaled_8tap_regular_w64_dy2_16bpc_ssse3: 10623.7
mct_scaled_8tap_regular_w64_dy2_16bpc_avx2: 7164.0
mct_scaled_8tap_regular_w128_16bpc_c: 163182.2
mct_scaled_8tap_regular_w128_16bpc_ssse3: 26452.9
mct_scaled_8tap_regular_w128_16bpc_avx2: 18402.2
mct_scaled_8tap_regular_w128_dy1_16bpc_c: 148199.8
mct_scaled_8tap_regular_w128_dy1_16bpc_ssse3: 23584.9
mct_scaled_8tap_regular_w128_dy1_16bpc_avx2: 14808.1
mct_scaled_8tap_regular_w128_dy2_16bpc_c: 234702.2
mct_scaled_8tap_regular_w128_dy2_16bpc_ssse3: 29653.8
mct_scaled_8tap_regular_w128_dy2_16bpc_avx2: 20042.4
2022-01-12 19:08:56 +01:00
Victorien Le Couviour--Tuffet 42ad602ddd x86: Add 8-bit mc(t)_scaled SSSE3 32-bit asm
mc_scaled_8tap_regular_w2_8bpc_c: 1070.7
mc_scaled_8tap_regular_w2_8bpc_ssse3: 253.0
mc_scaled_8tap_regular_w2_dy1_8bpc_c: 1079.9
mc_scaled_8tap_regular_w2_dy1_8bpc_ssse3: 114.8
mc_scaled_8tap_regular_w2_dy2_8bpc_c: 1466.1
mc_scaled_8tap_regular_w2_dy2_8bpc_ssse3: 145.7
mc_scaled_8tap_regular_w4_8bpc_c: 1965.4
mc_scaled_8tap_regular_w4_8bpc_ssse3: 251.4
mc_scaled_8tap_regular_w4_dy1_8bpc_c: 1989.4
mc_scaled_8tap_regular_w4_dy1_8bpc_ssse3: 166.1
mc_scaled_8tap_regular_w4_dy2_8bpc_c: 2728.8
mc_scaled_8tap_regular_w4_dy2_8bpc_ssse3: 163.4
mc_scaled_8tap_regular_w8_8bpc_c: 3670.1
mc_scaled_8tap_regular_w8_8bpc_ssse3: 477.0
mc_scaled_8tap_regular_w8_dy1_8bpc_c: 3651.1
mc_scaled_8tap_regular_w8_dy1_8bpc_ssse3: 464.8
mc_scaled_8tap_regular_w8_dy2_8bpc_c: 5079.6
mc_scaled_8tap_regular_w8_dy2_8bpc_ssse3: 494.0
mc_scaled_8tap_regular_w16_8bpc_c: 8366.9
mc_scaled_8tap_regular_w16_8bpc_ssse3: 1197.4
mc_scaled_8tap_regular_w16_dy1_8bpc_c: 9088.5
mc_scaled_8tap_regular_w16_dy1_8bpc_ssse3: 1212.6
mc_scaled_8tap_regular_w16_dy2_8bpc_c: 13166.1
mc_scaled_8tap_regular_w16_dy2_8bpc_ssse3: 1301.4
mc_scaled_8tap_regular_w32_8bpc_c: 29883.7
mc_scaled_8tap_regular_w32_8bpc_ssse3: 3990.3
mc_scaled_8tap_regular_w32_dy1_8bpc_c: 23404.1
mc_scaled_8tap_regular_w32_dy1_8bpc_ssse3: 3617.4
mc_scaled_8tap_regular_w32_dy2_8bpc_c: 36248.3
mc_scaled_8tap_regular_w32_dy2_8bpc_ssse3: 3949.3
mc_scaled_8tap_regular_w64_8bpc_c: 57228.6
mc_scaled_8tap_regular_w64_8bpc_ssse3: 9359.4
mc_scaled_8tap_regular_w64_dy1_8bpc_c: 87271.8
mc_scaled_8tap_regular_w64_dy1_8bpc_ssse3: 12472.7
mc_scaled_8tap_regular_w64_dy2_8bpc_c: 135050.9
mc_scaled_8tap_regular_w64_dy2_8bpc_ssse3: 13585.4
mc_scaled_8tap_regular_w128_8bpc_c: 219123.0
mc_scaled_8tap_regular_w128_8bpc_ssse3: 31867.7
mc_scaled_8tap_regular_w128_dy1_8bpc_c: 240143.3
mc_scaled_8tap_regular_w128_dy1_8bpc_ssse3: 35275.7
mc_scaled_8tap_regular_w128_dy2_8bpc_c: 376357.7
mc_scaled_8tap_regular_w128_dy2_8bpc_ssse3: 39411.4

mct_scaled_8tap_regular_w4_8bpc_c: 1178.7
mct_scaled_8tap_regular_w4_8bpc_ssse3: 176.8
mct_scaled_8tap_regular_w4_dy1_8bpc_c: 1354.8
mct_scaled_8tap_regular_w4_dy1_8bpc_ssse3: 131.5
mct_scaled_8tap_regular_w4_dy2_8bpc_c: 1832.2
mct_scaled_8tap_regular_w4_dy2_8bpc_ssse3: 123.0
mct_scaled_8tap_regular_w8_8bpc_c: 3547.6
mct_scaled_8tap_regular_w8_8bpc_ssse3: 526.0
mct_scaled_8tap_regular_w8_dy1_8bpc_c: 3683.8
mct_scaled_8tap_regular_w8_dy1_8bpc_ssse3: 513.8
mct_scaled_8tap_regular_w8_dy2_8bpc_c: 5260.7
mct_scaled_8tap_regular_w8_dy2_8bpc_ssse3: 566.1
mct_scaled_8tap_regular_w16_8bpc_c: 8424.5
mct_scaled_8tap_regular_w16_8bpc_ssse3: 1340.0
mct_scaled_8tap_regular_w16_dy1_8bpc_c: 9515.8
mct_scaled_8tap_regular_w16_dy1_8bpc_ssse3: 1337.0
mct_scaled_8tap_regular_w16_dy2_8bpc_c: 14247.3
mct_scaled_8tap_regular_w16_dy2_8bpc_ssse3: 1492.7
mct_scaled_8tap_regular_w32_8bpc_c: 32059.9
mct_scaled_8tap_regular_w32_8bpc_ssse3: 5177.5
mct_scaled_8tap_regular_w32_dy1_8bpc_c: 32557.6
mct_scaled_8tap_regular_w32_dy1_8bpc_ssse3: 4889.9
mct_scaled_8tap_regular_w32_dy2_8bpc_c: 50844.2
mct_scaled_8tap_regular_w32_dy2_8bpc_ssse3: 5667.1
mct_scaled_8tap_regular_w64_8bpc_c: 59903.1
mct_scaled_8tap_regular_w64_8bpc_ssse3: 10453.6
mct_scaled_8tap_regular_w64_dy1_8bpc_c: 80298.8
mct_scaled_8tap_regular_w64_dy1_8bpc_ssse3: 12597.8
mct_scaled_8tap_regular_w64_dy2_8bpc_c: 127244.8
mct_scaled_8tap_regular_w64_dy2_8bpc_ssse3: 14677.9
mct_scaled_8tap_regular_w128_8bpc_c: 280097.0
mct_scaled_8tap_regular_w128_8bpc_ssse3: 41989.3
mct_scaled_8tap_regular_w128_dy1_8bpc_c: 208913.2
mct_scaled_8tap_regular_w128_dy1_8bpc_ssse3: 35525.2
mct_scaled_8tap_regular_w128_dy2_8bpc_c: 341367.6
mct_scaled_8tap_regular_w128_dy2_8bpc_ssse3: 41449.0
2021-12-13 14:27:00 +01:00
Victorien Le Couviour--Tuffet 3fd2ad938a Fix a leak when threading is active
Credit to Oss-Fuzz.
2021-11-01 15:14:21 +01:00
Victorien Le Couviour--Tuffet f7e0d4c032 Remove lpf_stride parameter from LR filters 2021-10-29 22:18:20 +02:00
Victorien Le Couviour--Tuffet 609fbaba84 Allow CDEF and LR to run sbrows in parallel 2021-10-29 22:18:20 +02:00
Victorien Le Couviour--Tuffet 8e6d5214a3 CI: Add tests for negative stride 2021-10-29 22:18:05 +02:00
Victorien Le Couviour--Tuffet and Henrik Gramner 82d6d950a2 x86: Add deblock loop filters AVX-512 (Ice Lake) asm 2021-10-18 14:49:05 +00:00
Victorien Le Couviour--Tuffet 5991883dc6 x86: Add high bitdepth mc(t)_scaled AVX2 asm 2021-09-20 13:47:49 +02:00
Victorien Le Couviour--Tuffet 69ff474a7f Revert "Group lr_lpf_line re-allocation with lr_mask_sz"
This reverts commit e53314177a.

Causes issues when the sample has both 8 and 16 bit content.

Credit to Oss-Fuzz.
2021-09-10 19:39:05 +02:00
Victorien Le Couviour--Tuffet 833c818b87 Minor consistency fixes, purely cosmetic 2021-09-09 13:42:04 +00:00
Victorien Le Couviour--Tuffet 976b9e4965 Fix a potential hang when dav1d_submit_frame fails
Credit to Oss-Fuzz.
2021-09-09 13:42:04 +00:00
Victorien Le Couviour--Tuffet e53314177a Group lr_lpf_line re-allocation with lr_mask_sz 2021-09-07 17:13:44 +02:00
Victorien Le Couviour--Tuffet 159215a82d Fix lr_lpf_line re-allocation check
Credit to Oss-Fuzz.
2021-09-07 17:13:41 +02:00
Victorien Le Couviour--Tuffet and Ronald S. Bultje 753eef833b Merge the 3 threading models into a single one
Merges the 3 threading parameters into a single `--threads=` argument.
Frame threading can still be controlled via the `--framedelay=` argument.
Internally, the threading model is now a global thread/task pool design.

Co-authored-by: Ronald S. Bultje <rsbultje@gmail.com>
2021-09-03 16:06:31 +00:00
Victorien Le Couviour--Tuffet b1adba65c9 x86: Add high bitdepth mc.resize SSSE3 asm
resize_16bpc_ssse3: 141122.7
resize_16bpc_avx2: 105971.7
2021-08-12 12:14:49 +02:00
Victorien Le Couviour--Tuffet e647a54db9 x86: Fix minor things in mc.resize_8bpc_ssse3
- number of gpr and xmm regs in use
- some cosmetics (no need to specify x for xmm regs on SSSE3)
- a comment with wrong registers (unedited copy from AVX2 code)
2021-08-12 12:14:48 +02:00
Victorien Le Couviour--Tuffet e479e4a942 x86: Add high bitdepth mc.resize AVX2 asm
resize_8bpc_avx2: 82986.1
resize_16bpc_avx2: 103896.7
2021-08-12 12:14:48 +02:00
Victorien Le Couviour--Tuffet b7f5503159 x86: Add minor improvement to mc.resize_8bpc_avx2
Simplify some gpr extract and sign extend operations.
2021-08-12 12:14:48 +02:00
Victorien Le Couviour--Tuffet 356387f6f6 x86: Add bpc suffix to mc functions 2021-08-12 12:14:48 +02:00
Victorien Le Couviour--Tuffet fe903da5b8 x86: Rewrite sgr8 SSSE3 asm
Old:
sgr_3x3_8bpc_ssse3: 140121.1
sgr_3x3_8bpc_avx2: 72965.4
sgr_5x5_8bpc_ssse3: 89859.1
sgr_5x5_8bpc_avx2: 48881.9
sgr_mix_8bpc_ssse3: 236626.5
sgr_mix_8bpc_avx2: 110552.6

New:
sgr_3x3_8bpc_ssse3: 117294.4
sgr_3x3_8bpc_avx2: 72243.5
sgr_5x5_8bpc_ssse3: 79929.6
sgr_5x5_8bpc_avx2: 49798.4
sgr_mix_8bpc_ssse3: 184183.9
sgr_mix_8bpc_avx2: 109771.7
2021-08-03 14:58:45 +00:00
Victorien Le Couviour--Tuffet 935175daa7 x86: Add minor improvements to sgr16 SSSE3 asm
Old:
sgr_5x5_10bpc_ssse3: 87026.6
sgr_5x5_10bpc_avx2: 51864.5
sgr_mix_10bpc_ssse3: 205460.2
sgr_mix_10bpc_avx2: 122199.7

New:
sgr_5x5_10bpc_ssse3: 84786.5
sgr_5x5_10bpc_avx2: 51651.3
sgr_mix_10bpc_ssse3: 202722.2
sgr_mix_10bpc_avx2: 122340.0
2021-08-03 14:58:45 +00:00
Victorien Le Couviour--Tuffet 513fd90c26 x86: Add high bitdepth (10-bit) sgr SSSE3 asm 2021-07-12 07:40:23 +00:00
Victorien Le Couviour--Tuffet 12f170c437 x86: Add minor improvements to wiener16 SSSE3 asm 2021-07-12 07:40:23 +00:00
Victorien Le Couviour--Tuffet 193db389e9 x86: Add high bitdepth wiener filter SSSE3 asm 2021-06-09 14:15:31 +02:00
Victorien Le Couviour--Tuffet and Henrik Gramner dc7cdc0b58 x86: Add high bitdepth pal_pred AVX2 asm 2021-05-04 22:39:17 +02:00
Victorien Le Couviour--Tuffet and Henrik Gramner 0d42b3030b x86: Add high bitdepth ipred_cfl_ac_422 AVX2 asm 2021-05-04 17:00:07 +02:00
Victorien Le Couviour--Tuffet and Henrik Gramner ec5e93eecd x86: Add high bitdepth ipred_cfl_ac_420 AVX2 asm 2021-05-04 17:00:07 +02:00
Victorien Le Couviour--Tuffet and Henrik Gramner de6813f92c x86: Add high bitdepth ipred_filter AVX2 asm 2021-05-04 17:00:07 +02:00
Victorien Le Couviour--Tuffet and Jean-Baptiste Kempf 8b1a96e481 Fix potential deadlock
If the postfilter tasks allocation fails, a deadlock would occur.
2021-02-05 23:54:58 +01:00
Victorien Le Couviour--Tuffet 288ed4b8ec dav1dplay: Add pause and seek features 2021-02-01 11:18:04 +01:00
Victorien Le Couviour--Tuffet 549086e4d3 Add post-filters threading model 2021-01-28 15:08:10 +01:00
Victorien Le Couviour--Tuffet 4db73f115e tests: Refactor seek_stress decoding functions 2021-01-28 15:08:10 +01:00
Victorien Le Couviour--Tuffet 66c8a1ec28 fuzzer: Remove redundant flush
Calling dav1d_close already takes care of flushing the internal state.
Calling it just before is superfluous.
2021-01-28 15:08:10 +01:00
Victorien Le Couviour--Tuffet 5686e8355c tests/seek_stress: Reduce the number of iterations 2021-01-21 09:54:50 +01:00
Victorien Le Couviour--Tuffet 05d05f9776 CI: Run the seek stress test 2021-01-18 13:59:26 +01:00
Victorien Le Couviour--Tuffet 63a918b487 tests: Add a seek stress test
Closes #203.
2021-01-18 13:58:30 +01:00
Victorien Le Couviour--Tuffet 493d2b9157 input/ivf: Add seeking capability 2021-01-15 14:56:23 +01:00
Victorien Le Couviour--Tuffet a40d3b5f0f Abort frame decoding properly on reference error
This could cause a frame waiting on the current one to not be notified
on error.

Fixes #351.
2020-10-21 14:37:12 +02:00
Victorien Le Couviour--Tuffet 06f12a8995 x86: Add {put/prep}_{8tap/bilin} SSSE3 asm (64-bit) 2020-08-06 15:34:40 +02:00
Victorien Le Couviour--Tuffet 652e5b38b0 x86: Minor changes to MC scaled AVX2 asm 2020-08-05 12:25:53 +02:00
Victorien Le Couviour--Tuffet a75ee78bd9 x86: Add put/prep_bilin_scaled AVX2 asm
Bilin scaled being very rarely used, add a new table entry to
mc_subpel_filters, and jump to the put/prep_8tap_scaled code.

AVX2 performance is obviously the same as for the 8tap code. The speedup is
much smaller though, as the C code is a true bilinear codepath,
auto-vectorized. Yet, the AVX2 performance is always better.
2020-06-18 11:37:00 +02:00
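Reusing the 8tap code for bilinear works because a bilinear filter can be written as an 8-tap kernel with only its two center taps set. The sketch below shows that equivalence under assumptions: 1/16-pel fractions, taps summing to 64, and round-to-nearest. The function names are hypothetical, and the actual mc_subpel_filters values and rounding in dav1d are not taken from this log.

```c
#include <stdint.h>

/* Fill an 8-tap kernel equivalent to bilinear interpolation at frac/16. */
static void bilin_as_8tap(int8_t taps[8], int frac) {
    for (int i = 0; i < 8; i++) taps[i] = 0;
    taps[3] = (int8_t)(64 - 4 * frac); /* weight of src[0] */
    taps[4] = (int8_t)(4 * frac);      /* weight of src[1] */
}

/* Generic 8-tap filter; taps[3] weights src[0], so src[-3..4] is read. */
static int filter_8tap(const uint8_t *src, const int8_t taps[8]) {
    int sum = 0;
    for (int i = 0; i < 8; i++) sum += taps[i] * src[i - 3];
    return (sum + 32) >> 6;
}

/* Direct bilinear interpolation between src[0] and src[1]. */
static int bilinear(const uint8_t *src, int frac) {
    return ((16 - frac) * src[0] + frac * src[1] + 8) >> 4;
}
```

Both paths give the same result for every fraction from 0 to 15, since the 8-tap sum is exactly four times the bilinear sum and the rounding terms scale accordingly. The generic path still performs eight multiplications per sample, which is consistent with the commit's remark that the speedup over a true bilinear C path is smaller.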
Victorien Le Couviour--Tuffet ea74e3d513 x86: Add prep_8tap_scaled AVX2 asm
mct_scaled_8tap_regular_w4_8bpc_c: 872.1
mct_scaled_8tap_regular_w4_8bpc_avx2: 125.6
mct_scaled_8tap_regular_w4_dy1_8bpc_c: 886.3
mct_scaled_8tap_regular_w4_dy1_8bpc_avx2: 84.0
mct_scaled_8tap_regular_w4_dy2_8bpc_c: 1189.1
mct_scaled_8tap_regular_w4_dy2_8bpc_avx2: 84.7

mct_scaled_8tap_regular_w8_8bpc_c: 2261.0
mct_scaled_8tap_regular_w8_8bpc_avx2: 306.2
mct_scaled_8tap_regular_w8_dy1_8bpc_c: 2189.9
mct_scaled_8tap_regular_w8_dy1_8bpc_avx2: 233.8
mct_scaled_8tap_regular_w8_dy2_8bpc_c: 3060.3
mct_scaled_8tap_regular_w8_dy2_8bpc_avx2: 282.8

mct_scaled_8tap_regular_w16_8bpc_c: 4335.3
mct_scaled_8tap_regular_w16_8bpc_avx2: 680.7
mct_scaled_8tap_regular_w16_dy1_8bpc_c: 5137.2
mct_scaled_8tap_regular_w16_dy1_8bpc_avx2: 578.6
mct_scaled_8tap_regular_w16_dy2_8bpc_c: 7878.4
mct_scaled_8tap_regular_w16_dy2_8bpc_avx2: 774.6

mct_scaled_8tap_regular_w32_8bpc_c: 17871.9
mct_scaled_8tap_regular_w32_8bpc_avx2: 2954.8
mct_scaled_8tap_regular_w32_dy1_8bpc_c: 18594.7
mct_scaled_8tap_regular_w32_dy1_8bpc_avx2: 2073.9
mct_scaled_8tap_regular_w32_dy2_8bpc_c: 28696.0
mct_scaled_8tap_regular_w32_dy2_8bpc_avx2: 2852.1

mct_scaled_8tap_regular_w64_8bpc_c: 46967.5
mct_scaled_8tap_regular_w64_8bpc_avx2: 7527.5
mct_scaled_8tap_regular_w64_dy1_8bpc_c: 45564.2
mct_scaled_8tap_regular_w64_dy1_8bpc_avx2: 5262.9
mct_scaled_8tap_regular_w64_dy2_8bpc_c: 72793.3
mct_scaled_8tap_regular_w64_dy2_8bpc_avx2: 7535.9

mct_scaled_8tap_regular_w128_8bpc_c: 111190.8
mct_scaled_8tap_regular_w128_8bpc_avx2: 19386.8
mct_scaled_8tap_regular_w128_dy1_8bpc_c: 122625.0
mct_scaled_8tap_regular_w128_dy1_8bpc_avx2: 15376.1
mct_scaled_8tap_regular_w128_dy2_8bpc_c: 197120.6
mct_scaled_8tap_regular_w128_dy2_8bpc_avx2: 21871.0
2020-06-18 11:37:00 +02:00
Victorien Le Couviour--Tuffet 22fb8a42a1 x86: Adapt SSSE3 prep_8tap to SSE2
---------------------
x86_64:
------------------------------------------
mct_8tap_regular_w4_h_8bpc_c: 302.3
mct_8tap_regular_w4_h_8bpc_sse2: 47.3
mct_8tap_regular_w4_h_8bpc_ssse3: 19.5
---------------------
mct_8tap_regular_w8_h_8bpc_c: 745.5
mct_8tap_regular_w8_h_8bpc_sse2: 235.2
mct_8tap_regular_w8_h_8bpc_ssse3: 70.4
---------------------
mct_8tap_regular_w16_h_8bpc_c: 1844.3
mct_8tap_regular_w16_h_8bpc_sse2: 755.6
mct_8tap_regular_w16_h_8bpc_ssse3: 225.9
---------------------
mct_8tap_regular_w32_h_8bpc_c: 6685.5
mct_8tap_regular_w32_h_8bpc_sse2: 2954.4
mct_8tap_regular_w32_h_8bpc_ssse3: 795.8
---------------------
mct_8tap_regular_w64_h_8bpc_c: 15633.5
mct_8tap_regular_w64_h_8bpc_sse2: 7120.4
mct_8tap_regular_w64_h_8bpc_ssse3: 1900.4
---------------------
mct_8tap_regular_w128_h_8bpc_c: 37772.1
mct_8tap_regular_w128_h_8bpc_sse2: 17698.1
mct_8tap_regular_w128_h_8bpc_ssse3: 4665.5
------------------------------------------
mct_8tap_regular_w4_v_8bpc_c: 306.5
mct_8tap_regular_w4_v_8bpc_sse2: 71.7
mct_8tap_regular_w4_v_8bpc_ssse3: 37.9
---------------------
mct_8tap_regular_w8_v_8bpc_c: 923.3
mct_8tap_regular_w8_v_8bpc_sse2: 168.7
mct_8tap_regular_w8_v_8bpc_ssse3: 71.3
---------------------
mct_8tap_regular_w16_v_8bpc_c: 3040.1
mct_8tap_regular_w16_v_8bpc_sse2: 505.1
mct_8tap_regular_w16_v_8bpc_ssse3: 199.7
---------------------
mct_8tap_regular_w32_v_8bpc_c: 12354.8
mct_8tap_regular_w32_v_8bpc_sse2: 1942.0
mct_8tap_regular_w32_v_8bpc_ssse3: 714.2
---------------------
mct_8tap_regular_w64_v_8bpc_c: 29427.9
mct_8tap_regular_w64_v_8bpc_sse2: 4637.4
mct_8tap_regular_w64_v_8bpc_ssse3: 1829.2
---------------------
mct_8tap_regular_w128_v_8bpc_c: 72756.9
mct_8tap_regular_w128_v_8bpc_sse2: 11301.0
mct_8tap_regular_w128_v_8bpc_ssse3: 5020.6
------------------------------------------
mct_8tap_regular_w4_hv_8bpc_c: 876.9
mct_8tap_regular_w4_hv_8bpc_sse2: 171.7
mct_8tap_regular_w4_hv_8bpc_ssse3: 112.2
---------------------
mct_8tap_regular_w8_hv_8bpc_c: 2215.1
mct_8tap_regular_w8_hv_8bpc_sse2: 730.2
mct_8tap_regular_w8_hv_8bpc_ssse3: 330.9
---------------------
mct_8tap_regular_w16_hv_8bpc_c: 6075.5
mct_8tap_regular_w16_hv_8bpc_sse2: 2252.1
mct_8tap_regular_w16_hv_8bpc_ssse3: 973.4
---------------------
mct_8tap_regular_w32_hv_8bpc_c: 22182.7
mct_8tap_regular_w32_hv_8bpc_sse2: 7692.6
mct_8tap_regular_w32_hv_8bpc_ssse3: 3599.8
---------------------
mct_8tap_regular_w64_hv_8bpc_c: 50876.8
mct_8tap_regular_w64_hv_8bpc_sse2: 18499.6
mct_8tap_regular_w64_hv_8bpc_ssse3: 8815.6
---------------------
mct_8tap_regular_w128_hv_8bpc_c: 122926.3
mct_8tap_regular_w128_hv_8bpc_sse2: 45120.0
mct_8tap_regular_w128_hv_8bpc_ssse3: 22085.7
------------------------------------------
2020-06-11 12:37:36 +02:00
Victorien Le Couviour--Tuffet 83956bf10e x86: Adapt SSSE3 prep_bilin to SSE2
---------------------
x86_64:
------------------------------------------
mct_bilinear_w4_h_8bpc_c: 98.9
mct_bilinear_w4_h_8bpc_sse2: 30.2
mct_bilinear_w4_h_8bpc_ssse3: 11.5
---------------------
mct_bilinear_w8_h_8bpc_c: 175.3
mct_bilinear_w8_h_8bpc_sse2: 57.0
mct_bilinear_w8_h_8bpc_ssse3: 19.7
---------------------
mct_bilinear_w16_h_8bpc_c: 396.2
mct_bilinear_w16_h_8bpc_sse2: 179.3
mct_bilinear_w16_h_8bpc_ssse3: 50.9
---------------------
mct_bilinear_w32_h_8bpc_c: 1311.2
mct_bilinear_w32_h_8bpc_sse2: 718.8
mct_bilinear_w32_h_8bpc_ssse3: 243.9
---------------------
mct_bilinear_w64_h_8bpc_c: 2892.7
mct_bilinear_w64_h_8bpc_sse2: 1746.0
mct_bilinear_w64_h_8bpc_ssse3: 568.0
---------------------
mct_bilinear_w128_h_8bpc_c: 7192.6
mct_bilinear_w128_h_8bpc_sse2: 4339.8
mct_bilinear_w128_h_8bpc_ssse3: 1619.2
------------------------------------------
mct_bilinear_w4_v_8bpc_c: 129.7
mct_bilinear_w4_v_8bpc_sse2: 26.6
mct_bilinear_w4_v_8bpc_ssse3: 16.7
---------------------
mct_bilinear_w8_v_8bpc_c: 233.3
mct_bilinear_w8_v_8bpc_sse2: 55.0
mct_bilinear_w8_v_8bpc_ssse3: 24.7
---------------------
mct_bilinear_w16_v_8bpc_c: 498.9
mct_bilinear_w16_v_8bpc_sse2: 146.0
mct_bilinear_w16_v_8bpc_ssse3: 54.2
---------------------
mct_bilinear_w32_v_8bpc_c: 1562.2
mct_bilinear_w32_v_8bpc_sse2: 560.6
mct_bilinear_w32_v_8bpc_ssse3: 201.0
---------------------
mct_bilinear_w64_v_8bpc_c: 3221.3
mct_bilinear_w64_v_8bpc_sse2: 1380.6
mct_bilinear_w64_v_8bpc_ssse3: 499.3
---------------------
mct_bilinear_w128_v_8bpc_c: 7357.7
mct_bilinear_w128_v_8bpc_sse2: 3439.0
mct_bilinear_w128_v_8bpc_ssse3: 1489.1
------------------------------------------
mct_bilinear_w4_hv_8bpc_c: 185.0
mct_bilinear_w4_hv_8bpc_sse2: 54.5
mct_bilinear_w4_hv_8bpc_ssse3: 22.1
---------------------
mct_bilinear_w8_hv_8bpc_c: 377.8
mct_bilinear_w8_hv_8bpc_sse2: 104.3
mct_bilinear_w8_hv_8bpc_ssse3: 35.8
---------------------
mct_bilinear_w16_hv_8bpc_c: 1159.4
mct_bilinear_w16_hv_8bpc_sse2: 311.0
mct_bilinear_w16_hv_8bpc_ssse3: 106.3
---------------------
mct_bilinear_w32_hv_8bpc_c: 4436.2
mct_bilinear_w32_hv_8bpc_sse2: 1230.7
mct_bilinear_w32_hv_8bpc_ssse3: 400.7
---------------------
mct_bilinear_w64_hv_8bpc_c: 10627.7
mct_bilinear_w64_hv_8bpc_sse2: 2934.2
mct_bilinear_w64_hv_8bpc_ssse3: 957.2
---------------------
mct_bilinear_w128_hv_8bpc_c: 26048.9
mct_bilinear_w128_hv_8bpc_sse2: 7590.3
mct_bilinear_w128_hv_8bpc_ssse3: 2947.0
------------------------------------------
2020-06-11 12:37:36 +02:00
Victorien Le Couviour--Tuffet a755541faa x86: Add put_8tap_scaled AVX2 asm
mc_scaled_8tap_regular_w2_8bpc_c: 764.4
mc_scaled_8tap_regular_w2_8bpc_avx2: 191.3
mc_scaled_8tap_regular_w2_dy1_8bpc_c: 705.8
mc_scaled_8tap_regular_w2_dy1_8bpc_avx2: 89.5
mc_scaled_8tap_regular_w2_dy2_8bpc_c: 964.0
mc_scaled_8tap_regular_w2_dy2_8bpc_avx2: 120.3

mc_scaled_8tap_regular_w4_8bpc_c: 1355.7
mc_scaled_8tap_regular_w4_8bpc_avx2: 180.9
mc_scaled_8tap_regular_w4_dy1_8bpc_c: 1233.2
mc_scaled_8tap_regular_w4_dy1_8bpc_avx2: 115.3
mc_scaled_8tap_regular_w4_dy2_8bpc_c: 1707.6
mc_scaled_8tap_regular_w4_dy2_8bpc_avx2: 117.9

mc_scaled_8tap_regular_w8_8bpc_c: 2483.2
mc_scaled_8tap_regular_w8_8bpc_avx2: 294.8
mc_scaled_8tap_regular_w8_dy1_8bpc_c: 2166.4
mc_scaled_8tap_regular_w8_dy1_8bpc_avx2: 222.0
mc_scaled_8tap_regular_w8_dy2_8bpc_c: 3133.7
mc_scaled_8tap_regular_w8_dy2_8bpc_avx2: 292.6

mc_scaled_8tap_regular_w16_8bpc_c: 5239.2
mc_scaled_8tap_regular_w16_8bpc_avx2: 729.9
mc_scaled_8tap_regular_w16_dy1_8bpc_c: 5156.5
mc_scaled_8tap_regular_w16_dy1_8bpc_avx2: 602.2
mc_scaled_8tap_regular_w16_dy2_8bpc_c: 8018.4
mc_scaled_8tap_regular_w16_dy2_8bpc_avx2: 783.1

mc_scaled_8tap_regular_w32_8bpc_c: 14745.0
mc_scaled_8tap_regular_w32_8bpc_avx2: 2205.0
mc_scaled_8tap_regular_w32_dy1_8bpc_c: 14862.3
mc_scaled_8tap_regular_w32_dy1_8bpc_avx2: 1721.3
mc_scaled_8tap_regular_w32_dy2_8bpc_c: 23607.6
mc_scaled_8tap_regular_w32_dy2_8bpc_avx2: 2325.7

mc_scaled_8tap_regular_w64_8bpc_c: 54891.7
mc_scaled_8tap_regular_w64_8bpc_avx2: 8351.4
mc_scaled_8tap_regular_w64_dy1_8bpc_c: 50249.0
mc_scaled_8tap_regular_w64_dy1_8bpc_avx2: 5864.4
mc_scaled_8tap_regular_w64_dy2_8bpc_c: 79400.1
mc_scaled_8tap_regular_w64_dy2_8bpc_avx2: 8295.7

mc_scaled_8tap_regular_w128_8bpc_c: 121046.8
mc_scaled_8tap_regular_w128_8bpc_avx2: 21809.1
mc_scaled_8tap_regular_w128_dy1_8bpc_c: 133720.4
mc_scaled_8tap_regular_w128_dy1_8bpc_avx2: 16197.8
mc_scaled_8tap_regular_w128_dy2_8bpc_c: 218774.8
mc_scaled_8tap_regular_w128_dy2_8bpc_avx2: 22993.1
2020-06-01 15:30:36 +02:00
Victorien Le Couviour--Tuffet 98ed9be69b Fix MC masks alignment for sizes >= 64 for AVX-512
Those need to be aligned when w*h >= 64, as we will try to load by 64 bytes.

(also realigns the 4x4 masks to 16 as a 32-byte alignment is unnecessary)
2020-04-16 11:43:08 +02:00
Victorien Le Couviour--Tuffet 604d93c5f7 x86: Split AVX2 / AVX-512 CDEF into dedicated files 2020-04-07 16:21:53 +02:00
Victorien Le Couviour--Tuffet 95068df6a6 x86: Add cdef_filter_{4,8}x8 AVX-512 (Ice Lake) asm
cdef_filter_4x8_8bpc_avx2: 54.0
cdef_filter_4x8_8bpc_avx512icl: 35.5
=> +52.1%

cdef_filter_8x8_8bpc_avx2: 71.0
cdef_filter_8x8_8bpc_avx512icl: 49.0
=> +44.9%
2020-04-07 16:10:44 +02:00
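The "=> +N%" figures in the cdef_filter AVX-512 commit above appear to be the ratio of the AVX2 timing to the AVX-512 timing, minus one. A minimal sketch that reproduces them, assuming the checkasm numbers are timings where lower is better (the helper name `speedup_percent` is hypothetical, not part of dav1d or checkasm):

```python
def speedup_percent(baseline: float, optimized: float) -> float:
    """Return how much faster `optimized` is than `baseline`, in percent.

    Assumes both values are timings (lower is better).
    """
    return (baseline / optimized - 1.0) * 100.0


# (avx2, avx512icl) timings copied from the commit message above.
timings = {
    "cdef_filter_4x8_8bpc": (54.0, 35.5),
    "cdef_filter_8x8_8bpc": (71.0, 49.0),
}

for name, (avx2, avx512icl) in timings.items():
    print(f"{name}: +{speedup_percent(avx2, avx512icl):.1f}%")
# prints:
# cdef_filter_4x8_8bpc: +52.1%
# cdef_filter_8x8_8bpc: +44.9%
```

The same helper can be applied to any of the before/after or c/sse2/ssse3 pairs in this log.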
Victorien Le Couviour--Tuffet 71f27407dd x86: add some explanatory comment to wiener_filter_h
Explains how the clipping to the range defined in the spec works.
2020-04-03 14:21:36 +02:00
Victorien Le Couviour--Tuffet 22080aa30c x86: optimize cdef_filter_{4x{4,8},8x8}_avx2
Add 2 separate code paths for pri/sec strengths equal to 0.
Having both strengths not equal to 0 is uncommon; branching to skip
unnecessary computations is therefore beneficial.

------------------------------------------
before: cdef_filter_4x4_8bpc_avx2: 93.8
 after: cdef_filter_4x4_8bpc_avx2: 71.7
---------------------
before: cdef_filter_4x8_8bpc_avx2: 161.5
 after: cdef_filter_4x8_8bpc_avx2: 116.3
---------------------
before: cdef_filter_8x8_8bpc_avx2: 221.8
 after: cdef_filter_8x8_8bpc_avx2: 156.4
------------------------------------------
2020-02-24 11:23:20 +01:00
Victorien Le Couviour--Tuffet 1bd078c2e5 x86: add a separate fully edged case to cdef_filter_avx2
---------------------
fully edged blocks perf
------------------------------------------
before: cdef_filter_4x4_8bpc_avx2: 91.0
 after: cdef_filter_4x4_8bpc_avx2: 75.7
---------------------
before: cdef_filter_4x8_8bpc_avx2: 154.6
 after: cdef_filter_4x8_8bpc_avx2: 131.8
---------------------
before: cdef_filter_8x8_8bpc_avx2: 214.1
 after: cdef_filter_8x8_8bpc_avx2: 195.9
------------------------------------------
2020-02-24 11:23:20 +01:00
Victorien Le Couviour--Tuffet e706fac9cf x86: add prep_8tap AVX512 asm 2020-01-20 11:42:53 +01:00
Victorien Le Couviour--Tuffet b83cb9643b x86: replace "mov hb, Xb" by "movzx hd, Xb" in MC
It's a little easier for the CPU to simply overwrite a 32-bit reg rather
than writing its low 8 bits while conserving bits 8 to 31.
In order to do that it actually fetches those bits, merges them into a
32-bit value, and writes that back to the 32-bit GPR.

As those are always cleared, perform a zero extend mov to dword instead.
2020-01-20 11:18:07 +01:00
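The movzx commit above relies on the two instructions producing the same register value whenever bits 8 to 31 are already clear. A small sketch of that equivalence, modelling only the architectural result with integers (the function names `mov_byte` and `movzx_dword` are hypothetical, and the model says nothing about actual CPU timing):

```python
def mov_byte(reg32: int, src8: int) -> int:
    """Model `mov hb, Xb`: replace bits 0-7 and keep bits 8-31 of the old value.

    The result depends on the previous register contents.
    """
    return (reg32 & 0xFFFFFF00) | (src8 & 0xFF)


def movzx_dword(src8: int) -> int:
    """Model `movzx hd, Xb`: zero-extend the byte to 32 bits.

    The result does not depend on the previous register contents.
    """
    return src8 & 0xFF


# With the upper bits already clear (the case described in the commit
# message), both forms leave the same value in the register.
old_h = 0x00000040
assert mov_byte(old_h, 0x10) == movzx_dword(0x10) == 0x10

# They differ only when the upper bits were set beforehand.
assert mov_byte(0xABCD0040, 0x10) == 0xABCD0010
assert movzx_dword(0x10) == 0x00000010
```

Because `movzx_dword` takes no old register value as input, the model also makes the removed dependency visible in the function signature.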
Victorien Le Couviour--Tuffet and Ronald S. Bultje 5462c2a80d x86: add prep_bilin AVX512 asm
------------------------------------------
mct_bilinear_w4_0_8bpc_avx2:      3.8
mct_bilinear_w4_0_8bpc_avx512icl: 3.7
---------------------
mct_bilinear_w8_0_8bpc_avx2:      5.0
mct_bilinear_w8_0_8bpc_avx512icl: 4.8
---------------------
mct_bilinear_w16_0_8bpc_avx2:      8.5
mct_bilinear_w16_0_8bpc_avx512icl: 7.1
---------------------
mct_bilinear_w32_0_8bpc_avx2:      29.5
mct_bilinear_w32_0_8bpc_avx512icl: 17.1
---------------------
mct_bilinear_w64_0_8bpc_avx2:      68.1
mct_bilinear_w64_0_8bpc_avx512icl: 34.7
---------------------
mct_bilinear_w128_0_8bpc_avx2:      180.5
mct_bilinear_w128_0_8bpc_avx512icl: 138.0
------------------------------------------
mct_bilinear_w4_h_8bpc_avx2:      4.0
mct_bilinear_w4_h_8bpc_avx512icl: 3.9
---------------------
mct_bilinear_w8_h_8bpc_avx2:      5.3
mct_bilinear_w8_h_8bpc_avx512icl: 5.0
---------------------
mct_bilinear_w16_h_8bpc_avx2:      11.7
mct_bilinear_w16_h_8bpc_avx512icl:  7.5
---------------------
mct_bilinear_w32_h_8bpc_avx2:      41.8
mct_bilinear_w32_h_8bpc_avx512icl: 20.3
---------------------
mct_bilinear_w64_h_8bpc_avx2:      94.9
mct_bilinear_w64_h_8bpc_avx512icl: 35.0
---------------------
mct_bilinear_w128_h_8bpc_avx2:      240.1
mct_bilinear_w128_h_8bpc_avx512icl: 143.8
------------------------------------------
mct_bilinear_w4_v_8bpc_avx2:      4.1
mct_bilinear_w4_v_8bpc_avx512icl: 4.0
---------------------
mct_bilinear_w8_v_8bpc_avx2:      6.0
mct_bilinear_w8_v_8bpc_avx512icl: 5.4
---------------------
mct_bilinear_w16_v_8bpc_avx2:      10.3
mct_bilinear_w16_v_8bpc_avx512icl:  8.9
---------------------
mct_bilinear_w32_v_8bpc_avx2:      29.5
mct_bilinear_w32_v_8bpc_avx512icl: 25.9
---------------------
mct_bilinear_w64_v_8bpc_avx2:      64.3
mct_bilinear_w64_v_8bpc_avx512icl: 41.3
---------------------
mct_bilinear_w128_v_8bpc_avx2:      198.2
mct_bilinear_w128_v_8bpc_avx512icl: 139.6
------------------------------------------
mct_bilinear_w4_hv_8bpc_avx2:      5.6
mct_bilinear_w4_hv_8bpc_avx512icl: 5.2
---------------------
mct_bilinear_w8_hv_8bpc_avx2:      8.3
mct_bilinear_w8_hv_8bpc_avx512icl: 7.0
---------------------
mct_bilinear_w16_hv_8bpc_avx2:      19.4
mct_bilinear_w16_hv_8bpc_avx512icl: 12.1
---------------------
mct_bilinear_w32_hv_8bpc_avx2:      69.1
mct_bilinear_w32_hv_8bpc_avx512icl: 32.5
---------------------
mct_bilinear_w64_hv_8bpc_avx2:      164.4
mct_bilinear_w64_hv_8bpc_avx512icl:  71.1
---------------------
mct_bilinear_w128_hv_8bpc_avx2:      405.2
mct_bilinear_w128_hv_8bpc_avx512icl: 193.1
------------------------------------------
2020-01-09 14:56:42 +01:00
Victorien Le Couviour--Tuffet and Ronald S. Bultje 40891aab9b x86: add avx512icl cpu flag to x86inc.asm 2020-01-09 14:56:42 +01:00
Victorien Le Couviour--Tuffet and Ronald S. Bultje 430967a627 checkasm: x86: ensure all SIMD lanes are turned on at all times
YMM and ZMM registers on x86 are turned off to save power when they haven't
been used for some period of time. When they are used there will be a
"warmup" period during which performance will be reduced and inconsistent
which is problematic when trying to benchmark individual functions.

Periodically issue "dummy" instructions that use those registers to
prevent them from being powered down. The end result is more consistent
benchmark results.

Credits to Henrik Gramner's commit
1878c7f2af0a9c73e291488209109782c428cfcf from x264.
2020-01-09 14:56:42 +01:00
Victorien Le Couviour--Tuffet 36d615d120 x86: adapt SSSE3 wiener filter to SSE2
Also slightly optimized the 32-bit SSSE3, especially by the removal of
an XMM store/load.

---------------------
x86_64:
------------------------------------------
wiener_chroma_8bpc_c: 193155.1
wiener_chroma_8bpc_sse2: 48973.4
wiener_chroma_8bpc_ssse3: 31486.3
---------------------
wiener_luma_8bpc_c: 192787.5
wiener_luma_8bpc_sse2: 48674.9
wiener_luma_8bpc_ssse3: 30446.3
------------------------------------------

---------------------
x86_32:
------------------------------------------
wiener_chroma_8bpc_c: 309861.0
wiener_chroma_8bpc_sse2: 52345.9
wiener_chroma_8bpc_ssse3: 32983.2
---------------------
wiener_luma_8bpc_c: 317909.1
wiener_luma_8bpc_sse2: 52522.1
wiener_luma_8bpc_ssse3: 33323.1
------------------------------------------
2019-10-24 20:42:52 +02:00
Victorien Le Couviour--Tuffet 4866abab1f x86: adapt SSSE3 warp_affine_8x8{,t} to SSE2
---------------------
x86_64:
------------------------------------------
warp_8x8_8bpc_c: 1761.5
warp_8x8_8bpc_sse2: 583.0
warp_8x8_8bpc_ssse3: 329.3
---------------------
warp_8x8t_8bpc_c: 1694.3
warp_8x8t_8bpc_sse2: 577.6
warp_8x8t_8bpc_ssse3: 334.1
------------------------------------------

---------------------
x86_32:
------------------------------------------
warp_8x8_8bpc_c: 1842.6
warp_8x8_8bpc_sse2: 677.1
warp_8x8_8bpc_ssse3: 394.9
---------------------
warp_8x8t_8bpc_c: 1741.1
warp_8x8t_8bpc_sse2: 648.5
warp_8x8t_8bpc_ssse3: 372.6
------------------------------------------
2019-10-24 20:42:52 +02:00
Victorien Le Couviour--Tuffet and Henrik Gramner 477905413d x86inc: fix LOAD_MM_PERMUTATION for AVX512
Pre-permuting the registers in INIT_*MM avx512 (AVX512_MM_PERMUTATION)
is redundant. It causes the register mapping to be the same as without
is redondant. It causes the register mapping to be the same as without
the initial AVX512_MM_PERMUTATION, with the user SWAPs applied.

For example...

INIT_YMM avx512
SWAP m0, m16
SAVE_MM_PERMUTATION
; do whatever
LOAD_MM_PERMUTATION

... would result in m0 mapping to ymm16 instead of ymm0 and m1 to ymm1
instead of ymm17.
2019-10-21 20:21:38 +02:00
Victorien Le Couviour--Tuffet 3e9f967640 x86: adapt SSSE3 cdef_filter_{4x4,4x8,8x8} to SSE2
---------------------
x86_64:
------------------------------------------
cdef_filter_4x4_8bpc_c: 1376.0
cdef_filter_4x4_8bpc_sse2: 177.6
cdef_filter_4x4_8bpc_ssse3: 132.5
---------------------
cdef_filter_4x8_8bpc_c: 2725.0
cdef_filter_4x8_8bpc_sse2: 327.6
cdef_filter_4x8_8bpc_ssse3: 234.9
---------------------
cdef_filter_8x8_8bpc_c: 5938.8
cdef_filter_8x8_8bpc_sse2: 556.8
cdef_filter_8x8_8bpc_ssse3: 388.1
------------------------------------------

---------------------
x86_32:
------------------------------------------
cdef_filter_4x4_8bpc_c: 1569.5
cdef_filter_4x4_8bpc_sse2: 201.9
cdef_filter_4x4_8bpc_ssse3: 162.3
---------------------
cdef_filter_4x8_8bpc_c: 3141.6
cdef_filter_4x8_8bpc_sse2: 368.3
cdef_filter_4x8_8bpc_ssse3: 283.4
---------------------
cdef_filter_8x8_8bpc_c: 6534.5
cdef_filter_8x8_8bpc_sse2: 666.7
cdef_filter_8x8_8bpc_ssse3: 503.5
------------------------------------------
2019-10-18 11:05:11 +02:00
Victorien Le Couviour--Tuffet 11b7250644 tools: fix SSE2 cpu masking 2019-10-16 10:45:54 +02:00
Victorien Le Couviour--Tuffet a91a03b0e1 x86: add warp_affine SSE4 and SSSE3 asm
------------------------------------------
x86_64: warp_8x8_8bpc_c: 1773.4
x86_32: warp_8x8_8bpc_c: 1740.4
----------
x86_64: warp_8x8_8bpc_ssse3: 317.5
x86_32: warp_8x8_8bpc_ssse3: 378.4
----------
x86_64: warp_8x8_8bpc_sse4: 303.7
x86_32: warp_8x8_8bpc_sse4: 367.7
----------
x86_64: warp_8x8_8bpc_avx2: 224.9
---------------------
---------------------
x86_64: warp_8x8t_8bpc_c: 1664.6
x86_32: warp_8x8t_8bpc_c: 1674.0
----------
x86_64: warp_8x8t_8bpc_ssse3: 320.7
x86_32: warp_8x8t_8bpc_ssse3: 379.5
----------
x86_64: warp_8x8t_8bpc_sse4: 304.8
x86_32: warp_8x8t_8bpc_sse4: 369.8
----------
x86_64: warp_8x8t_8bpc_avx2: 228.5
------------------------------------------
2019-09-30 15:40:43 +02:00
Victorien Le Couviour--Tuffet c0865f35c7 x86: add 32-bit support to SSSE3 deblock lpf
------------------------------------------
x86_64: lpf_h_sb_uv_w4_8bpc_c: 430.6
x86_32: lpf_h_sb_uv_w4_8bpc_c: 788.6
x86_64: lpf_h_sb_uv_w4_8bpc_ssse3: 322.0
x86_32: lpf_h_sb_uv_w4_8bpc_ssse3: 302.4
---------------------
x86_64: lpf_h_sb_uv_w6_8bpc_c: 981.9
x86_32: lpf_h_sb_uv_w6_8bpc_c: 1579.6
x86_64: lpf_h_sb_uv_w6_8bpc_ssse3: 421.5
x86_32: lpf_h_sb_uv_w6_8bpc_ssse3: 431.6
---------------------
x86_64: lpf_h_sb_y_w4_8bpc_c: 3001.7
x86_32: lpf_h_sb_y_w4_8bpc_c: 7021.3
x86_64: lpf_h_sb_y_w4_8bpc_ssse3: 466.3
x86_32: lpf_h_sb_y_w4_8bpc_ssse3: 564.7
---------------------
x86_64: lpf_h_sb_y_w8_8bpc_c: 4457.7
x86_32: lpf_h_sb_y_w8_8bpc_c: 3657.8
x86_64: lpf_h_sb_y_w8_8bpc_ssse3: 818.9
x86_32: lpf_h_sb_y_w8_8bpc_ssse3: 927.9
---------------------
x86_64: lpf_h_sb_y_w16_8bpc_c: 1967.9
x86_32: lpf_h_sb_y_w16_8bpc_c: 3343.5
x86_64: lpf_h_sb_y_w16_8bpc_ssse3: 1836.7
x86_32: lpf_h_sb_y_w16_8bpc_ssse3: 1975.0
---------------------
x86_64: lpf_v_sb_uv_w4_8bpc_c: 369.4
x86_32: lpf_v_sb_uv_w4_8bpc_c: 793.6
x86_64: lpf_v_sb_uv_w4_8bpc_ssse3: 110.9
x86_32: lpf_v_sb_uv_w4_8bpc_ssse3: 133.0
---------------------
x86_64: lpf_v_sb_uv_w6_8bpc_c: 769.6
x86_32: lpf_v_sb_uv_w6_8bpc_c: 1576.7
x86_64: lpf_v_sb_uv_w6_8bpc_ssse3: 222.2
x86_32: lpf_v_sb_uv_w6_8bpc_ssse3: 232.2
---------------------
x86_64: lpf_v_sb_y_w4_8bpc_c: 772.4
x86_32: lpf_v_sb_y_w4_8bpc_c: 2596.5
x86_64: lpf_v_sb_y_w4_8bpc_ssse3: 179.8
x86_32: lpf_v_sb_y_w4_8bpc_ssse3: 234.7
---------------------
x86_64: lpf_v_sb_y_w8_8bpc_c: 1660.2
x86_32: lpf_v_sb_y_w8_8bpc_c: 3979.9
x86_64: lpf_v_sb_y_w8_8bpc_ssse3: 468.3
x86_32: lpf_v_sb_y_w8_8bpc_ssse3: 580.9
---------------------
x86_64: lpf_v_sb_y_w16_8bpc_c: 1889.6
x86_32: lpf_v_sb_y_w16_8bpc_c: 4728.7
x86_64: lpf_v_sb_y_w16_8bpc_ssse3: 1142.0
x86_32: lpf_v_sb_y_w16_8bpc_ssse3: 1174.8
------------------------------------------
2019-09-19 12:07:23 +02:00
Victorien Le Couviour--Tuffet beda6e0d1c build: fix meson deprecation warning
The 'build_' prefix is reserved by meson; using it will become an error in
the future, as indicated by a warning when configuring the build dir.

Closes #285.
2019-07-02 14:02:40 +02:00