Victorien Le Couviour--Tuffet
f995e1fbf9
threading: Schedule TILE tasks for all passes at once
...
Closes #465.
2026-04-27 21:09:28 +02:00
victorien
575af25859
flush: Reset f->task_thread.error
...
f->task_thread.error can be set during flushing. Not resetting it can
lead to c->task_thread.first being increased after a frame has already been
submitted post flushing. That's fine if it happens on the very first frame,
but on any subsequent frame it results in a wrong frame ordering.
Since a non-first frame is then considered to be the first one, its tasks
can't execute (they depend on a truly previous frame that is now considered
to come after it), and c->task_thread.cur is increased past that frame with
no way of being reset, eventually leading to a hang.
2024-11-28 17:56:13 +01:00
Victorien Le Couviour--Tuffet
a500abb750
x86: Add refmvs.load_tmvs asm
2023-06-30 21:34:31 +02:00
Victorien Le Couviour--Tuffet
f89dbc0717
threading: Fix a race on task_thread.init_done
...
Fixes a race where the tasks inserted by the init task could all be
executed and signal frame completion, letting another frame start
before init_done was set by that init task. The init task then
sets it, preventing the init task of the new frame from being executed.
This then triggered an assert further down the task picking loop.
Credits to Oss-Fuzz.
2023-05-04 14:59:07 +02:00
Victorien Le Couviour--Tuffet
8c731791c7
checkasm: Improve mv generation for refmvs.save_tmvs
2023-03-23 15:44:03 +01:00
Victorien Le Couviour--Tuffet
16c943484e
x86: Add refmvs.save_tmvs AVX-512 (Ice Lake) asm
2023-03-16 16:09:46 +01:00
Victorien Le Couviour--Tuffet
7d23ec4a04
x86: Add refmvs.save_tmvs SSSE3 asm
2023-03-13 15:19:35 +00:00
Victorien Le Couviour--Tuffet
c77fb1f016
x86: Optimize refmvs.save_tmvs AVX2 asm
...
Process 2 blocks per iteration instead of 4.
Credits to gramner@twoorioles.com.
2023-03-13 15:19:35 +00:00
Victorien Le Couviour--Tuffet
cf617fdae0
threading: Ensure passing the correct retval to decode_frame_exit
...
We must reload error just before calling dav1d_decode_frame_exit, as
it may have become stale between the last load and that call.
This can result in crashes, since we would signal a successfully decoded
frame when decoding actually failed.
Reloading error within the frame done condition's body ensures a non-stale
value, as we use 'f->task_thread.task_counter == 0' to ensure all other
threads / tasks have already completed when entering it. In other words, only
the last thread still working on this frame can execute this code, after
all other threads have returned to doing something else.
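The reasoning above can be illustrated with a minimal sketch. It is not the
dav1d code: FrameTaskState and finish_task are hypothetical names, and the
counter is simplified to "tasks still outstanding for this frame". The point
it shows is that error is loaded only inside the branch that the last worker
can reach.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the per-frame threading state. */
typedef struct {
    atomic_int task_counter; /* tasks still outstanding for this frame */
    atomic_int error;        /* non-zero once any task has failed */
} FrameTaskState;

/* Called by a worker after it finishes one task of the frame.
 * Returns 1 and stores the final status in *retval if the caller was the
 * last worker on the frame, 0 otherwise (*retval is left untouched). */
static inline int finish_task(FrameTaskState *st, int *retval) {
    assert(st != NULL && retval != NULL);
    /* atomic_fetch_sub returns the previous value: 1 means we were last. */
    if (atomic_fetch_sub(&st->task_counter, 1) != 1)
        return 0;
    /* No other task of this frame can still be running here, so a fresh
     * load of error cannot become stale before it is used. */
    const int error = atomic_load(&st->error);
    *retval = error ? -1 : 0;
    return 1;
}
```

A value of error cached before the counter reached zero could miss a failure
reported by another worker in the meantime, which is the stale case the
commit message describes.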
2023-03-13 13:54:36 +01:00
Victorien Le Couviour--Tuffet
6b8438b193
x86: Add refmvs.save_tmvs AVX2 asm
2023-03-06 13:36:22 +00:00
Victorien Le Couviour--Tuffet
0d9fe4ea65
refmvs: Add refmvs_load/save_tmvs to dsp interface
2023-03-06 13:36:22 +00:00
Victorien Le Couviour--Tuffet
19167a2c93
refmvs: Pack refmvs_temporal_block struct
...
Pack the 5 bytes of data to reduce memory usage and improve performance.
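As a rough illustration of what packing buys, here is a sketch assuming a
4-byte motion vector followed by a 1-byte reference index. The type and
field names are hypothetical, and the packed attribute uses GCC/Clang
syntax; the real struct definition may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: two 16-bit motion vector components. */
typedef struct { int16_t y, x; } MvSketch;

/* Natural layout: the compiler pads the struct up to a multiple of the
 * int16_t alignment. */
typedef struct {
    MvSketch mv;
    int8_t ref;
} TemporalBlockPadded;

/* Packed layout (GCC/Clang attribute syntax): only the 5 payload bytes. */
typedef struct __attribute__((packed)) {
    MvSketch mv;
    int8_t ref;
} TemporalBlockPacked;

static_assert(sizeof(TemporalBlockPacked) == 5, "packed block should be 5 bytes");

/* Bytes needed to store n blocks in each layout. */
static inline size_t padded_bytes(size_t n) { return n * sizeof(TemporalBlockPadded); }
static inline size_t packed_bytes(size_t n) { return n * sizeof(TemporalBlockPacked); }
```

Fewer bytes per block means more blocks per cache line when the temporal
buffer is scanned, at the cost of unaligned accesses to the fields.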
2023-03-06 13:36:22 +00:00
Victorien Le Couviour--Tuffet
9b4b244810
drain: Properly fix a desync between next and first
...
The code in dav1d_drain_picture could result in a desync between
c->task_thread.first (oldest submitted frame) and c->frame_thread.next (first
frame to retrieve and/or next submit location).
As we loop through drain, we always increment next, but first only if the
frame has data. If the frame is visible we return. The problem arises when
we encounter one or more invisible frames and the following entries haven't
been fed yet: we then keep on looping, increasing next but not first, as
these entries have no data.
We should always return once we have encountered data (a visible or
invisible decoded frame): for visible frames the code already returns; for
invisible ones we can store a boolean indicating that we drained at least
one frame, and whenever we reach an empty entry after that, we return (all
subsequent entries are guaranteed to be empty anyway), incrementing neither
next nor first.
This has the effect of inserting the next frame at the first free spot
(which is much better than the weird skips it's doing now).
In short, c->frame_thread.next could previously skip some (empty) entries.
Now it's contiguous.
Fixes #416.
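The loop described in this message can be sketched as follows. This is a
simplified model, not the actual dav1d_drain_picture code: RingSketch,
drain_sketch and the SLOT_* states are hypothetical names, and a slot holds
only a state instead of a decoded frame.

```c
#include <assert.h>
#include <stddef.h>

enum { SLOT_EMPTY, SLOT_INVISIBLE, SLOT_VISIBLE };

/* Hypothetical model of the frame ring. */
typedef struct {
    int slots[8];
    unsigned n;     /* number of slots in use (1..8) */
    unsigned next;  /* next slot to retrieve from / submit to */
    unsigned first; /* oldest submitted frame */
} RingSketch;

/* Returns 1 if a visible frame was output, 0 otherwise. */
static inline int drain_sketch(RingSketch *r) {
    assert(r != NULL && r->n > 0 && r->n <= 8);
    int drained = 0;
    unsigned visited = 0;
    do {
        const unsigned idx = r->next;
        /* Empty entry after at least one drained frame: stop here and
         * leave both next and first untouched. */
        if (r->slots[idx] == SLOT_EMPTY && drained)
            break;
        const int state = r->slots[idx];
        r->next = (r->next + 1) % r->n;
        if (state != SLOT_EMPTY) {
            /* first only moves when the slot actually held data. */
            r->first = (r->first + 1) % r->n;
            r->slots[idx] = SLOT_EMPTY;
            drained = 1;
            if (state == SLOT_VISIBLE)
                return 1;
        }
    } while (++visited < r->n);
    return 0;
}
```

With two invisible frames followed by empty entries, the sketch stops at the
first empty slot with next and first both equal to 2, so the next submitted
frame lands in the first free spot. With 7 frames in 8 slots and next
pointing at the empty entry [7], next still advances past it to reach [0],
and first is only incremented once that frame is output.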
2023-02-10 15:11:32 +01:00
Victorien Le Couviour--Tuffet
3f19ece69f
Revert "Fix mismatch between first and next in drain"
...
This reverts commit a51b6ce417.
We can't increment first when no data is there, otherwise we might do it
while the first frame has not been decoded yet, messing up ordering. Imagine
a framedelay of 8 and a file with 7 frames: we feed 7 frames into 8 slots,
so next points to [7] (an empty entry), and we start draining because of EOF.
We do need next to be incremented to reach the first frame ([0]) so it can
be output, and only then is first incremented too.
Fixes #418.
2023-02-09 16:36:57 +01:00
Victorien Le Couviour--Tuffet
a51b6ce417
Fix mismatch between first and next in drain
...
Fixes #416.
2023-01-26 12:49:36 +01:00
Victorien Le Couviour--Tuffet
8f16314dba
threading: Add a pending list for async task insertion
2022-10-27 13:03:22 +00:00
Victorien Le Couviour--Tuffet
3e7886db54
threading: Fix a race around frame completion (frame-mt)
...
The completion of the first frame to decode while an async reset
request on that same frame is pending will render it stale. The
processing of such a stale request is likely to result in a hang.
One reason this happens is the skip condition at the beginning of
reset_task_cur().
=> Consume the async request before that check.
Another reason is several threads producing async reset requests in
parallel: an async request for the first frame could cascade through the
other threads (other frames) during completion of that frame, so it is
not caught by the last synchronous reset_task_cur() after
signaling the main thread and before releasing the lock.
=> To solve this we need to add protections at the racy locations, that
is: after we increase first, before returning from
reset_task_cur_async(), and after consuming the async request.
2022-10-20 14:23:30 +02:00
Victorien Le Couviour--Tuffet
6680d26f30
threading: Limit the progress bitfields to the used size
...
Store the used size instead of the allocated size.
The used size can be smaller than the allocated size, which results in
a wrong computation of the linear progress from the frame_progress
bitfield.
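To illustrate why the bound matters, here is a sketch assuming one bit per
row, 32 rows per word, and "linear progress" meaning the number of
consecutive completed rows starting at row 0. The function name and the
bit-by-bit scan are illustrative only; the actual computation in the
decoder may be organized differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of consecutive completed rows, starting at row 0.
 * bits:   progress bitfield, bit set = row done
 * n_rows: rows actually in use for this frame (the "used size") */
static inline unsigned linear_progress(const uint32_t *bits, unsigned n_rows) {
    assert(bits != NULL);
    unsigned row = 0;
    /* Stop at the first unfinished row or at the end of the used range. */
    while (row < n_rows && ((bits[row >> 5] >> (row & 31)) & 1))
        row++;
    return row;
}
```

If the scan were bounded by the allocated size instead, bits left over in
words beyond the used range (for example from an earlier, larger frame)
could be counted and the result would exceed the number of rows the frame
really has.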
2022-09-08 14:50:25 +02:00
Victorien Le Couviour--Tuffet
895fed08e1
checkasm: Add short options
2022-09-02 17:15:18 +02:00
Victorien Le Couviour--Tuffet
713a4f4e50
checkasm: Add pattern matching to --test
2022-09-02 17:15:18 +02:00
Victorien Le Couviour--Tuffet
a63a7c9674
checkasm: Remove pattern matching from --bench
...
The pattern matching feature has been improved and is now performed
under the new --function parameter, rendering this one obsolete.
2022-09-02 17:15:18 +02:00
Victorien Le Couviour--Tuffet
d5d37926b6
checkasm: Add a --function option
...
Allows running checkasm only for functions matching a given pattern.
2022-09-02 17:15:18 +02:00
Victorien Le Couviour--Tuffet
a3a55b1849
threading: Fix copy_lpf_progress initialization
...
The copy_lpf_progress bitfield might not be fully cleared when size goes
down.
Credit to Oss-Fuzz.
2022-08-30 17:31:28 +02:00
Victorien Le Couviour--Tuffet
9717802d01
checkasm/lpf: Use operating dimensions
...
Fixes use of uninitialized value.
2022-06-13 14:00:59 +02:00
Victorien Le Couviour--Tuffet and Ronald S. Bultje
b4f9eac858
checkasm: Fix uninitialized variable
...
fg_data->num_y_points is used in generate_grain_uv, but is only set
after the call: move the initialization above.
2022-05-31 16:28:34 +00:00
Victorien Le Couviour--Tuffet
ebeaac6d60
Fix typo
...
Insert missing space.
2022-05-25 19:10:00 +02:00
Victorien Le Couviour--Tuffet and Henrik Gramner
b4d70152fd
Fix delayed_fg.scaling alignment for AVX-512
2022-03-07 19:27:08 +00:00
Victorien Le Couviour--Tuffet
402b54fcae
Integrate film grain in the task threading system
2022-02-28 18:20:48 +01:00
Victorien Le Couviour--Tuffet
4a52aa4790
x86: Add mc.resize AVX-512 (Ice Lake) asm
...
resize_8bpc_c: 542599.0
resize_8bpc_ssse3: 87635.4
resize_8bpc_avx2: 67401.1
resize_8bpc_avx512icl: 50263.6
resize_16bpc_c: 573438.9
resize_16bpc_ssse3: 121505.2
resize_16bpc_avx2: 83293.4
resize_16bpc_avx512icl: 77974.8
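The relative gains can be read off these figures with a simple ratio,
assuming they are checkasm timings where lower is faster. The helper name
below is hypothetical.

```c
#include <assert.h>

/* Speedup of an optimized routine relative to a baseline, given timings
 * where lower is faster. */
static inline double speedup(double baseline, double optimized) {
    assert(optimized > 0.0);
    return baseline / optimized;
}
```

For 8bpc, speedup(542599.0, 50263.6) is about 10.8x over C and
speedup(67401.1, 50263.6) about 1.34x over AVX2; for 16bpc the same ratios
are about 7.4x and 1.07x.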
2022-01-24 15:37:16 +01:00
Victorien Le Couviour--Tuffet
1cdde64f82
Run the init tasks for all frames first
2022-01-24 13:16:54 +01:00
Victorien Le Couviour--Tuffet
a8f3124a6c
Split the frame init task
...
Allows most of dav1d_decode_frame_init to run unconditionally by putting the
CDF and subsequent initializations in a separate task.
2022-01-24 13:16:54 +01:00
Victorien Le Couviour--Tuffet
1e3f0bea39
Move ENTROPY_PROGRESS task up
2022-01-19 20:17:31 +01:00
Victorien Le Couviour--Tuffet
6aaeeea689
Fix current frame selector wrapping condition
...
This could cause a desync between first and cur, which results in
skipping a frame and halting the decoding.
In the current state of the project this desync typically doesn't last
"long enough" to trigger the bug, as some frames would set cur back.
In order to trigger it, one needs to call reset_task_cur() on the last
frame; this would be the call after insertion of the INIT task (during
dav1d_task_frame_init).
This doesn't happen, as we would normally pick a task from a previous frame
already in the queue.
2022-01-19 20:17:31 +01:00
Victorien Le Couviour--Tuffet
5919517ff6
x86: Add high bitdepth mc(t)_scaled SSSE3 asm
...
mc_scaled_8tap_regular_w2_16bpc_c: 737.7
mc_scaled_8tap_regular_w2_16bpc_ssse3: 151.7
mc_scaled_8tap_regular_w2_16bpc_avx2: 141.2
mc_scaled_8tap_regular_w2_dy1_16bpc_c: 660.3
mc_scaled_8tap_regular_w2_dy1_16bpc_ssse3: 80.8
mc_scaled_8tap_regular_w2_dy1_16bpc_avx2: 73.2
mc_scaled_8tap_regular_w2_dy2_16bpc_c: 884.9
mc_scaled_8tap_regular_w2_dy2_16bpc_ssse3: 101.6
mc_scaled_8tap_regular_w2_dy2_16bpc_avx2: 87.2
mc_scaled_8tap_regular_w4_16bpc_c: 1356.3
mc_scaled_8tap_regular_w4_16bpc_ssse3: 172.3
mc_scaled_8tap_regular_w4_16bpc_avx2: 172.5
mc_scaled_8tap_regular_w4_dy1_16bpc_c: 1244.9
mc_scaled_8tap_regular_w4_dy1_16bpc_ssse3: 125.7
mc_scaled_8tap_regular_w4_dy1_16bpc_avx2: 96.1
mc_scaled_8tap_regular_w4_dy2_16bpc_c: 1665.6
mc_scaled_8tap_regular_w4_dy2_16bpc_ssse3: 150.2
mc_scaled_8tap_regular_w4_dy2_16bpc_avx2: 112.8
mc_scaled_8tap_regular_w8_16bpc_c: 2536.5
mc_scaled_8tap_regular_w8_16bpc_ssse3: 383.4
mc_scaled_8tap_regular_w8_16bpc_avx2: 256.2
mc_scaled_8tap_regular_w8_dy1_16bpc_c: 2331.8
mc_scaled_8tap_regular_w8_dy1_16bpc_ssse3: 350.0
mc_scaled_8tap_regular_w8_dy1_16bpc_avx2: 214.0
mc_scaled_8tap_regular_w8_dy2_16bpc_c: 3169.6
mc_scaled_8tap_regular_w8_dy2_16bpc_ssse3: 395.7
mc_scaled_8tap_regular_w8_dy2_16bpc_avx2: 265.7
mc_scaled_8tap_regular_w16_16bpc_c: 6384.6
mc_scaled_8tap_regular_w16_16bpc_ssse3: 1004.4
mc_scaled_8tap_regular_w16_16bpc_avx2: 665.0
mc_scaled_8tap_regular_w16_dy1_16bpc_c: 6103.4
mc_scaled_8tap_regular_w16_dy1_16bpc_ssse3: 896.3
mc_scaled_8tap_regular_w16_dy1_16bpc_avx2: 544.2
mc_scaled_8tap_regular_w16_dy2_16bpc_c: 8584.5
mc_scaled_8tap_regular_w16_dy2_16bpc_ssse3: 1049.0
mc_scaled_8tap_regular_w16_dy2_16bpc_avx2: 695.1
mc_scaled_8tap_regular_w32_16bpc_c: 19672.8
mc_scaled_8tap_regular_w32_16bpc_ssse3: 3204.3
mc_scaled_8tap_regular_w32_16bpc_avx2: 2109.6
mc_scaled_8tap_regular_w32_dy1_16bpc_c: 15964.6
mc_scaled_8tap_regular_w32_dy1_16bpc_ssse3: 2634.5
mc_scaled_8tap_regular_w32_dy1_16bpc_avx2: 1555.8
mc_scaled_8tap_regular_w32_dy2_16bpc_c: 24156.9
mc_scaled_8tap_regular_w32_dy2_16bpc_ssse3: 3217.3
mc_scaled_8tap_regular_w32_dy2_16bpc_avx2: 2088.8
mc_scaled_8tap_regular_w64_16bpc_c: 74356.3
mc_scaled_8tap_regular_w64_16bpc_ssse3: 11225.9
mc_scaled_8tap_regular_w64_16bpc_avx2: 7434.7
mc_scaled_8tap_regular_w64_dy1_16bpc_c: 60080.9
mc_scaled_8tap_regular_w64_dy1_16bpc_ssse3: 8912.8
mc_scaled_8tap_regular_w64_dy1_16bpc_avx2: 5222.2
mc_scaled_8tap_regular_w64_dy2_16bpc_c: 88891.4
mc_scaled_8tap_regular_w64_dy2_16bpc_ssse3: 10824.8
mc_scaled_8tap_regular_w64_dy2_16bpc_avx2: 7086.3
mc_scaled_8tap_regular_w128_16bpc_c: 171633.3
mc_scaled_8tap_regular_w128_16bpc_ssse3: 27089.3
mc_scaled_8tap_regular_w128_16bpc_avx2: 17998.2
mc_scaled_8tap_regular_w128_dy1_16bpc_c: 164399.9
mc_scaled_8tap_regular_w128_dy1_16bpc_ssse3: 24694.1
mc_scaled_8tap_regular_w128_dy1_16bpc_avx2: 14711.2
mc_scaled_8tap_regular_w128_dy2_16bpc_c: 244865.3
mc_scaled_8tap_regular_w128_dy2_16bpc_ssse3: 30599.1
mc_scaled_8tap_regular_w128_dy2_16bpc_avx2: 20341.1
mct_scaled_8tap_regular_w4_16bpc_c: 946.2
mct_scaled_8tap_regular_w4_16bpc_ssse3: 117.5
mct_scaled_8tap_regular_w4_16bpc_avx2: 112.5
mct_scaled_8tap_regular_w4_dy1_16bpc_c: 886.1
mct_scaled_8tap_regular_w4_dy1_16bpc_ssse3: 100.5
mct_scaled_8tap_regular_w4_dy1_16bpc_avx2: 76.8
mct_scaled_8tap_regular_w4_dy2_16bpc_c: 1170.1
mct_scaled_8tap_regular_w4_dy2_16bpc_ssse3: 117.6
mct_scaled_8tap_regular_w4_dy2_16bpc_avx2: 87.9
mct_scaled_8tap_regular_w8_16bpc_c: 2784.2
mct_scaled_8tap_regular_w8_16bpc_ssse3: 408.5
mct_scaled_8tap_regular_w8_16bpc_avx2: 280.3
mct_scaled_8tap_regular_w8_dy1_16bpc_c: 2530.5
mct_scaled_8tap_regular_w8_dy1_16bpc_ssse3: 358.2
mct_scaled_8tap_regular_w8_dy1_16bpc_avx2: 227.1
mct_scaled_8tap_regular_w8_dy2_16bpc_c: 3525.0
mct_scaled_8tap_regular_w8_dy2_16bpc_ssse3: 425.6
mct_scaled_8tap_regular_w8_dy2_16bpc_avx2: 283.6
mct_scaled_8tap_regular_w16_16bpc_c: 6773.8
mct_scaled_8tap_regular_w16_16bpc_ssse3: 1054.6
mct_scaled_8tap_regular_w16_16bpc_avx2: 696.4
mct_scaled_8tap_regular_w16_dy1_16bpc_c: 6418.0
mct_scaled_8tap_regular_w16_dy1_16bpc_ssse3: 938.7
mct_scaled_8tap_regular_w16_dy1_16bpc_avx2: 584.5
mct_scaled_8tap_regular_w16_dy2_16bpc_c: 9432.4
mct_scaled_8tap_regular_w16_dy2_16bpc_ssse3: 1125.3
mct_scaled_8tap_regular_w16_dy2_16bpc_avx2: 753.1
mct_scaled_8tap_regular_w32_16bpc_c: 26028.8
mct_scaled_8tap_regular_w32_16bpc_ssse3: 4128.4
mct_scaled_8tap_regular_w32_16bpc_avx2: 2748.4
mct_scaled_8tap_regular_w32_dy1_16bpc_c: 21604.3
mct_scaled_8tap_regular_w32_dy1_16bpc_ssse3: 3312.4
mct_scaled_8tap_regular_w32_dy1_16bpc_avx2: 2051.1
mct_scaled_8tap_regular_w32_dy2_16bpc_c: 32844.3
mct_scaled_8tap_regular_w32_dy2_16bpc_ssse3: 4102.9
mct_scaled_8tap_regular_w32_dy2_16bpc_avx2: 2741.6
mct_scaled_8tap_regular_w64_16bpc_c: 49101.8
mct_scaled_8tap_regular_w64_16bpc_ssse3: 8758.9
mct_scaled_8tap_regular_w64_16bpc_avx2: 5822.2
mct_scaled_8tap_regular_w64_dy1_16bpc_c: 53557.7
mct_scaled_8tap_regular_w64_dy1_16bpc_ssse3: 8469.7
mct_scaled_8tap_regular_w64_dy1_16bpc_avx2: 5264.3
mct_scaled_8tap_regular_w64_dy2_16bpc_c: 83379.7
mct_scaled_8tap_regular_w64_dy2_16bpc_ssse3: 10623.7
mct_scaled_8tap_regular_w64_dy2_16bpc_avx2: 7164.0
mct_scaled_8tap_regular_w128_16bpc_c: 163182.2
mct_scaled_8tap_regular_w128_16bpc_ssse3: 26452.9
mct_scaled_8tap_regular_w128_16bpc_avx2: 18402.2
mct_scaled_8tap_regular_w128_dy1_16bpc_c: 148199.8
mct_scaled_8tap_regular_w128_dy1_16bpc_ssse3: 23584.9
mct_scaled_8tap_regular_w128_dy1_16bpc_avx2: 14808.1
mct_scaled_8tap_regular_w128_dy2_16bpc_c: 234702.2
mct_scaled_8tap_regular_w128_dy2_16bpc_ssse3: 29653.8
mct_scaled_8tap_regular_w128_dy2_16bpc_avx2: 20042.4
2022-01-12 19:08:56 +01:00
Victorien Le Couviour--Tuffet
42ad602ddd
x86: Add 8-bit mc(t)_scaled SSSE3 32-bit asm
...
mc_scaled_8tap_regular_w2_8bpc_c: 1070.7
mc_scaled_8tap_regular_w2_8bpc_ssse3: 253.0
mc_scaled_8tap_regular_w2_dy1_8bpc_c: 1079.9
mc_scaled_8tap_regular_w2_dy1_8bpc_ssse3: 114.8
mc_scaled_8tap_regular_w2_dy2_8bpc_c: 1466.1
mc_scaled_8tap_regular_w2_dy2_8bpc_ssse3: 145.7
mc_scaled_8tap_regular_w4_8bpc_c: 1965.4
mc_scaled_8tap_regular_w4_8bpc_ssse3: 251.4
mc_scaled_8tap_regular_w4_dy1_8bpc_c: 1989.4
mc_scaled_8tap_regular_w4_dy1_8bpc_ssse3: 166.1
mc_scaled_8tap_regular_w4_dy2_8bpc_c: 2728.8
mc_scaled_8tap_regular_w4_dy2_8bpc_ssse3: 163.4
mc_scaled_8tap_regular_w8_8bpc_c: 3670.1
mc_scaled_8tap_regular_w8_8bpc_ssse3: 477.0
mc_scaled_8tap_regular_w8_dy1_8bpc_c: 3651.1
mc_scaled_8tap_regular_w8_dy1_8bpc_ssse3: 464.8
mc_scaled_8tap_regular_w8_dy2_8bpc_c: 5079.6
mc_scaled_8tap_regular_w8_dy2_8bpc_ssse3: 494.0
mc_scaled_8tap_regular_w16_8bpc_c: 8366.9
mc_scaled_8tap_regular_w16_8bpc_ssse3: 1197.4
mc_scaled_8tap_regular_w16_dy1_8bpc_c: 9088.5
mc_scaled_8tap_regular_w16_dy1_8bpc_ssse3: 1212.6
mc_scaled_8tap_regular_w16_dy2_8bpc_c: 13166.1
mc_scaled_8tap_regular_w16_dy2_8bpc_ssse3: 1301.4
mc_scaled_8tap_regular_w32_8bpc_c: 29883.7
mc_scaled_8tap_regular_w32_8bpc_ssse3: 3990.3
mc_scaled_8tap_regular_w32_dy1_8bpc_c: 23404.1
mc_scaled_8tap_regular_w32_dy1_8bpc_ssse3: 3617.4
mc_scaled_8tap_regular_w32_dy2_8bpc_c: 36248.3
mc_scaled_8tap_regular_w32_dy2_8bpc_ssse3: 3949.3
mc_scaled_8tap_regular_w64_8bpc_c: 57228.6
mc_scaled_8tap_regular_w64_8bpc_ssse3: 9359.4
mc_scaled_8tap_regular_w64_dy1_8bpc_c: 87271.8
mc_scaled_8tap_regular_w64_dy1_8bpc_ssse3: 12472.7
mc_scaled_8tap_regular_w64_dy2_8bpc_c: 135050.9
mc_scaled_8tap_regular_w64_dy2_8bpc_ssse3: 13585.4
mc_scaled_8tap_regular_w128_8bpc_c: 219123.0
mc_scaled_8tap_regular_w128_8bpc_ssse3: 31867.7
mc_scaled_8tap_regular_w128_dy1_8bpc_c: 240143.3
mc_scaled_8tap_regular_w128_dy1_8bpc_ssse3: 35275.7
mc_scaled_8tap_regular_w128_dy2_8bpc_c: 376357.7
mc_scaled_8tap_regular_w128_dy2_8bpc_ssse3: 39411.4
mct_scaled_8tap_regular_w4_8bpc_c: 1178.7
mct_scaled_8tap_regular_w4_8bpc_ssse3: 176.8
mct_scaled_8tap_regular_w4_dy1_8bpc_c: 1354.8
mct_scaled_8tap_regular_w4_dy1_8bpc_ssse3: 131.5
mct_scaled_8tap_regular_w4_dy2_8bpc_c: 1832.2
mct_scaled_8tap_regular_w4_dy2_8bpc_ssse3: 123.0
mct_scaled_8tap_regular_w8_8bpc_c: 3547.6
mct_scaled_8tap_regular_w8_8bpc_ssse3: 526.0
mct_scaled_8tap_regular_w8_dy1_8bpc_c: 3683.8
mct_scaled_8tap_regular_w8_dy1_8bpc_ssse3: 513.8
mct_scaled_8tap_regular_w8_dy2_8bpc_c: 5260.7
mct_scaled_8tap_regular_w8_dy2_8bpc_ssse3: 566.1
mct_scaled_8tap_regular_w16_8bpc_c: 8424.5
mct_scaled_8tap_regular_w16_8bpc_ssse3: 1340.0
mct_scaled_8tap_regular_w16_dy1_8bpc_c: 9515.8
mct_scaled_8tap_regular_w16_dy1_8bpc_ssse3: 1337.0
mct_scaled_8tap_regular_w16_dy2_8bpc_c: 14247.3
mct_scaled_8tap_regular_w16_dy2_8bpc_ssse3: 1492.7
mct_scaled_8tap_regular_w32_8bpc_c: 32059.9
mct_scaled_8tap_regular_w32_8bpc_ssse3: 5177.5
mct_scaled_8tap_regular_w32_dy1_8bpc_c: 32557.6
mct_scaled_8tap_regular_w32_dy1_8bpc_ssse3: 4889.9
mct_scaled_8tap_regular_w32_dy2_8bpc_c: 50844.2
mct_scaled_8tap_regular_w32_dy2_8bpc_ssse3: 5667.1
mct_scaled_8tap_regular_w64_8bpc_c: 59903.1
mct_scaled_8tap_regular_w64_8bpc_ssse3: 10453.6
mct_scaled_8tap_regular_w64_dy1_8bpc_c: 80298.8
mct_scaled_8tap_regular_w64_dy1_8bpc_ssse3: 12597.8
mct_scaled_8tap_regular_w64_dy2_8bpc_c: 127244.8
mct_scaled_8tap_regular_w64_dy2_8bpc_ssse3: 14677.9
mct_scaled_8tap_regular_w128_8bpc_c: 280097.0
mct_scaled_8tap_regular_w128_8bpc_ssse3: 41989.3
mct_scaled_8tap_regular_w128_dy1_8bpc_c: 208913.2
mct_scaled_8tap_regular_w128_dy1_8bpc_ssse3: 35525.2
mct_scaled_8tap_regular_w128_dy2_8bpc_c: 341367.6
mct_scaled_8tap_regular_w128_dy2_8bpc_ssse3: 41449.0
2021-12-13 14:27:00 +01:00
Victorien Le Couviour--Tuffet
3fd2ad938a
Fix a leak when threading is active
...
Credit to Oss-Fuzz.
2021-11-01 15:14:21 +01:00
Victorien Le Couviour--Tuffet
f7e0d4c032
Remove lpf_stride parameter from LR filters
2021-10-29 22:18:20 +02:00
Victorien Le Couviour--Tuffet
609fbaba84
Allow CDEF and LR to run sbrows in parallel
2021-10-29 22:18:20 +02:00
Victorien Le Couviour--Tuffet
8e6d5214a3
CI: Add tests for negative stride
2021-10-29 22:18:05 +02:00
Victorien Le Couviour--Tuffet and Henrik Gramner
82d6d950a2
x86: Add deblock loop filters AVX-512 (Ice Lake) asm
2021-10-18 14:49:05 +00:00
Victorien Le Couviour--Tuffet
5991883dc6
x86: Add high bitdepth mc(t)_scaled AVX2 asm
2021-09-20 13:47:49 +02:00
Victorien Le Couviour--Tuffet
69ff474a7f
Revert "Group lr_lpf_line re-allocation with lr_mask_sz"
...
This reverts commit e53314177a.
Causes issues when the sample has both 8 and 16 bit content.
Credit to Oss-Fuzz.
2021-09-10 19:39:05 +02:00
Victorien Le Couviour--Tuffet
833c818b87
Minor consistency fixes, purely cosmetic
2021-09-09 13:42:04 +00:00
Victorien Le Couviour--Tuffet
976b9e4965
Fix a potential hang when dav1d_submit_frame fails
...
Credit to Oss-Fuzz.
2021-09-09 13:42:04 +00:00
Victorien Le Couviour--Tuffet
e53314177a
Group lr_lpf_line re-allocation with lr_mask_sz
2021-09-07 17:13:44 +02:00
Victorien Le Couviour--Tuffet
159215a82d
Fix lr_lpf_line re-allocation check
...
Credit to Oss-Fuzz.
2021-09-07 17:13:41 +02:00
Victorien Le Couviour--Tuffet and Ronald S. Bultje
753eef833b
Merge the 3 threading models into a single one
...
Merges the 3 threading parameters into a single `--threads=` argument.
Frame threading can still be controlled via the `--framedelay=` argument.
Internally, the threading model is now a global thread/task pool design.
Co-authored-by: Ronald S. Bultje <rsbultje@gmail.com>
2021-09-03 16:06:31 +00:00
Victorien Le Couviour--Tuffet
b1adba65c9
x86: Add high bitdepth mc.resize SSSE3 asm
...
resize_16bpc_ssse3: 141122.7
resize_16bpc_avx2: 105971.7
2021-08-12 12:14:49 +02:00
Victorien Le Couviour--Tuffet
e647a54db9
x86: Fix minor things in mc.resize_8bpc_ssse3
...
- number of gpr and xmm regs in use
- some cosmetics (no need to specify x for xmm regs on SSSE3)
- a comment with wrong registers (unedited copy from AVX2 code)
2021-08-12 12:14:48 +02:00
Victorien Le Couviour--Tuffet
e479e4a942
x86: Add high bitdepth mc.resize AVX2 asm
...
resize_8bpc_avx2: 82986.1
resize_16bpc_avx2: 103896.7
2021-08-12 12:14:48 +02:00
Victorien Le Couviour--Tuffet
b7f5503159
x86: Add minor improvement to mc.resize_8bpc_avx2
...
Simplify some gpr extract and sign extend operations.
2021-08-12 12:14:48 +02:00
Victorien Le Couviour--Tuffet
356387f6f6
x86: Add bpc suffix to mc functions
2021-08-12 12:14:48 +02:00
Victorien Le Couviour--Tuffet
fe903da5b8
x86: Rewrite sgr8 SSSE3 asm
...
Old:
sgr_3x3_8bpc_ssse3: 140121.1
sgr_3x3_8bpc_avx2: 72965.4
sgr_5x5_8bpc_ssse3: 89859.1
sgr_5x5_8bpc_avx2: 48881.9
sgr_mix_8bpc_ssse3: 236626.5
sgr_mix_8bpc_avx2: 110552.6
New:
sgr_3x3_8bpc_ssse3: 117294.4
sgr_3x3_8bpc_avx2: 72243.5
sgr_5x5_8bpc_ssse3: 79929.6
sgr_5x5_8bpc_avx2: 49798.4
sgr_mix_8bpc_ssse3: 184183.9
sgr_mix_8bpc_avx2: 109771.7
2021-08-03 14:58:45 +00:00
Victorien Le Couviour--Tuffet
935175daa7
x86: Add minor improvements to sgr16 SSSE3 asm
...
Old:
sgr_5x5_10bpc_ssse3: 87026.6
sgr_5x5_10bpc_avx2: 51864.5
sgr_mix_10bpc_ssse3: 205460.2
sgr_mix_10bpc_avx2: 122199.7
New:
sgr_5x5_10bpc_ssse3: 84786.5
sgr_5x5_10bpc_avx2: 51651.3
sgr_mix_10bpc_ssse3: 202722.2
sgr_mix_10bpc_avx2: 122340.0
2021-08-03 14:58:45 +00:00
Victorien Le Couviour--Tuffet
513fd90c26
x86: Add high bitdepth (10-bit) sgr SSSE3 asm
2021-07-12 07:40:23 +00:00
Victorien Le Couviour--Tuffet
12f170c437
x86: Add minor improvements to wiener16 SSSE3 asm
2021-07-12 07:40:23 +00:00
Victorien Le Couviour--Tuffet
193db389e9
x86: Add high bitdepth wiener filter SSSE3 asm
2021-06-09 14:15:31 +02:00
Victorien Le Couviour--Tuffet and Henrik Gramner
dc7cdc0b58
x86: Add high bitdepth pal_pred AVX2 asm
2021-05-04 22:39:17 +02:00
Victorien Le Couviour--Tuffet and Henrik Gramner
0d42b3030b
x86: Add high bitdepth ipred_cfl_ac_422 AVX2 asm
2021-05-04 17:00:07 +02:00
Victorien Le Couviour--Tuffet and Henrik Gramner
ec5e93eecd
x86: Add high bitdepth ipred_cfl_ac_420 AVX2 asm
2021-05-04 17:00:07 +02:00
Victorien Le Couviour--Tuffet and Henrik Gramner
de6813f92c
x86: Add high bitdepth ipred_filter AVX2 asm
2021-05-04 17:00:07 +02:00
Victorien Le Couviour--Tuffet and Jean-Baptiste Kempf
8b1a96e481
Fix potential deadlock
...
If the postfilter tasks allocation fails, a deadlock would occur.
2021-02-05 23:54:58 +01:00
Victorien Le Couviour--Tuffet
288ed4b8ec
dav1dplay: Add pause and seek features
2021-02-01 11:18:04 +01:00
Victorien Le Couviour--Tuffet
549086e4d3
Add post-filters threading model
2021-01-28 15:08:10 +01:00
Victorien Le Couviour--Tuffet
4db73f115e
tests: Refactor seek_stress decoding functions
2021-01-28 15:08:10 +01:00
Victorien Le Couviour--Tuffet
66c8a1ec28
fuzzer: Remove redundant flush
...
Calling dav1d_close already takes care of flushing the internal state.
Calling it just before is superfluous.
2021-01-28 15:08:10 +01:00
Victorien Le Couviour--Tuffet
5686e8355c
tests/seek_stress: Reduce the number of iterations
2021-01-21 09:54:50 +01:00
Victorien Le Couviour--Tuffet
05d05f9776
CI: Run the seek stress test
2021-01-18 13:59:26 +01:00
Victorien Le Couviour--Tuffet
63a918b487
tests: Add a seek stress test
...
Closes #203.
2021-01-18 13:58:30 +01:00
Victorien Le Couviour--Tuffet
493d2b9157
input/ivf: Add seeking capability
2021-01-15 14:56:23 +01:00
Victorien Le Couviour--Tuffet
a40d3b5f0f
Abort frame decoding properly on reference error
...
This could cause a frame waiting on the current one to not be notified
on error.
Fixes #351.
2020-10-21 14:37:12 +02:00
Victorien Le Couviour--Tuffet
06f12a8995
x86: Add {put/prep}_{8tap/bilin} SSSE3 asm (64-bit)
2020-08-06 15:34:40 +02:00
Victorien Le Couviour--Tuffet
652e5b38b0
x86: Minor changes to MC scaled AVX2 asm
2020-08-05 12:25:53 +02:00
Victorien Le Couviour--Tuffet
a75ee78bd9
x86: Add put/prep_bilin_scaled AVX2 asm
...
Since bilin scaled is very rarely used, add a new table entry to
mc_subpel_filters and jump to the put/prep_8tap_scaled code.
AVX2 performance is the same as for the 8tap code; the speedup is
much smaller though, as the C code is a true bilinear codepath that is
auto-vectorized. Still, the AVX2 code is always faster.
2020-06-18 11:37:00 +02:00
Victorien Le Couviour--Tuffet
ea74e3d513
x86: Add prep_8tap_scaled AVX2 asm
...
mct_scaled_8tap_regular_w4_8bpc_c: 872.1
mct_scaled_8tap_regular_w4_8bpc_avx2: 125.6
mct_scaled_8tap_regular_w4_dy1_8bpc_c: 886.3
mct_scaled_8tap_regular_w4_dy1_8bpc_avx2: 84.0
mct_scaled_8tap_regular_w4_dy2_8bpc_c: 1189.1
mct_scaled_8tap_regular_w4_dy2_8bpc_avx2: 84.7
mct_scaled_8tap_regular_w8_8bpc_c: 2261.0
mct_scaled_8tap_regular_w8_8bpc_avx2: 306.2
mct_scaled_8tap_regular_w8_dy1_8bpc_c: 2189.9
mct_scaled_8tap_regular_w8_dy1_8bpc_avx2: 233.8
mct_scaled_8tap_regular_w8_dy2_8bpc_c: 3060.3
mct_scaled_8tap_regular_w8_dy2_8bpc_avx2: 282.8
mct_scaled_8tap_regular_w16_8bpc_c: 4335.3
mct_scaled_8tap_regular_w16_8bpc_avx2: 680.7
mct_scaled_8tap_regular_w16_dy1_8bpc_c: 5137.2
mct_scaled_8tap_regular_w16_dy1_8bpc_avx2: 578.6
mct_scaled_8tap_regular_w16_dy2_8bpc_c: 7878.4
mct_scaled_8tap_regular_w16_dy2_8bpc_avx2: 774.6
mct_scaled_8tap_regular_w32_8bpc_c: 17871.9
mct_scaled_8tap_regular_w32_8bpc_avx2: 2954.8
mct_scaled_8tap_regular_w32_dy1_8bpc_c: 18594.7
mct_scaled_8tap_regular_w32_dy1_8bpc_avx2: 2073.9
mct_scaled_8tap_regular_w32_dy2_8bpc_c: 28696.0
mct_scaled_8tap_regular_w32_dy2_8bpc_avx2: 2852.1
mct_scaled_8tap_regular_w64_8bpc_c: 46967.5
mct_scaled_8tap_regular_w64_8bpc_avx2: 7527.5
mct_scaled_8tap_regular_w64_dy1_8bpc_c: 45564.2
mct_scaled_8tap_regular_w64_dy1_8bpc_avx2: 5262.9
mct_scaled_8tap_regular_w64_dy2_8bpc_c: 72793.3
mct_scaled_8tap_regular_w64_dy2_8bpc_avx2: 7535.9
mct_scaled_8tap_regular_w128_8bpc_c: 111190.8
mct_scaled_8tap_regular_w128_8bpc_avx2: 19386.8
mct_scaled_8tap_regular_w128_dy1_8bpc_c: 122625.0
mct_scaled_8tap_regular_w128_dy1_8bpc_avx2: 15376.1
mct_scaled_8tap_regular_w128_dy2_8bpc_c: 197120.6
mct_scaled_8tap_regular_w128_dy2_8bpc_avx2: 21871.0
2020-06-18 11:37:00 +02:00
Victorien Le Couviour--Tuffet
22fb8a42a1
x86: Adapt SSSE3 prep_8tap to SSE2
...
---------------------
x86_64:
------------------------------------------
mct_8tap_regular_w4_h_8bpc_c: 302.3
mct_8tap_regular_w4_h_8bpc_sse2: 47.3
mct_8tap_regular_w4_h_8bpc_ssse3: 19.5
---------------------
mct_8tap_regular_w8_h_8bpc_c: 745.5
mct_8tap_regular_w8_h_8bpc_sse2: 235.2
mct_8tap_regular_w8_h_8bpc_ssse3: 70.4
---------------------
mct_8tap_regular_w16_h_8bpc_c: 1844.3
mct_8tap_regular_w16_h_8bpc_sse2: 755.6
mct_8tap_regular_w16_h_8bpc_ssse3: 225.9
---------------------
mct_8tap_regular_w32_h_8bpc_c: 6685.5
mct_8tap_regular_w32_h_8bpc_sse2: 2954.4
mct_8tap_regular_w32_h_8bpc_ssse3: 795.8
---------------------
mct_8tap_regular_w64_h_8bpc_c: 15633.5
mct_8tap_regular_w64_h_8bpc_sse2: 7120.4
mct_8tap_regular_w64_h_8bpc_ssse3: 1900.4
---------------------
mct_8tap_regular_w128_h_8bpc_c: 37772.1
mct_8tap_regular_w128_h_8bpc_sse2: 17698.1
mct_8tap_regular_w128_h_8bpc_ssse3: 4665.5
------------------------------------------
mct_8tap_regular_w4_v_8bpc_c: 306.5
mct_8tap_regular_w4_v_8bpc_sse2: 71.7
mct_8tap_regular_w4_v_8bpc_ssse3: 37.9
---------------------
mct_8tap_regular_w8_v_8bpc_c: 923.3
mct_8tap_regular_w8_v_8bpc_sse2: 168.7
mct_8tap_regular_w8_v_8bpc_ssse3: 71.3
---------------------
mct_8tap_regular_w16_v_8bpc_c: 3040.1
mct_8tap_regular_w16_v_8bpc_sse2: 505.1
mct_8tap_regular_w16_v_8bpc_ssse3: 199.7
---------------------
mct_8tap_regular_w32_v_8bpc_c: 12354.8
mct_8tap_regular_w32_v_8bpc_sse2: 1942.0
mct_8tap_regular_w32_v_8bpc_ssse3: 714.2
---------------------
mct_8tap_regular_w64_v_8bpc_c: 29427.9
mct_8tap_regular_w64_v_8bpc_sse2: 4637.4
mct_8tap_regular_w64_v_8bpc_ssse3: 1829.2
---------------------
mct_8tap_regular_w128_v_8bpc_c: 72756.9
mct_8tap_regular_w128_v_8bpc_sse2: 11301.0
mct_8tap_regular_w128_v_8bpc_ssse3: 5020.6
------------------------------------------
mct_8tap_regular_w4_hv_8bpc_c: 876.9
mct_8tap_regular_w4_hv_8bpc_sse2: 171.7
mct_8tap_regular_w4_hv_8bpc_ssse3: 112.2
---------------------
mct_8tap_regular_w8_hv_8bpc_c: 2215.1
mct_8tap_regular_w8_hv_8bpc_sse2: 730.2
mct_8tap_regular_w8_hv_8bpc_ssse3: 330.9
---------------------
mct_8tap_regular_w16_hv_8bpc_c: 6075.5
mct_8tap_regular_w16_hv_8bpc_sse2: 2252.1
mct_8tap_regular_w16_hv_8bpc_ssse3: 973.4
---------------------
mct_8tap_regular_w32_hv_8bpc_c: 22182.7
mct_8tap_regular_w32_hv_8bpc_sse2: 7692.6
mct_8tap_regular_w32_hv_8bpc_ssse3: 3599.8
---------------------
mct_8tap_regular_w64_hv_8bpc_c: 50876.8
mct_8tap_regular_w64_hv_8bpc_sse2: 18499.6
mct_8tap_regular_w64_hv_8bpc_ssse3: 8815.6
---------------------
mct_8tap_regular_w128_hv_8bpc_c: 122926.3
mct_8tap_regular_w128_hv_8bpc_sse2: 45120.0
mct_8tap_regular_w128_hv_8bpc_ssse3: 22085.7
------------------------------------------
2020-06-11 12:37:36 +02:00
Victorien Le Couviour--Tuffet
83956bf10e
x86: Adapt SSSE3 prep_bilin to SSE2
...
---------------------
x86_64:
------------------------------------------
mct_bilinear_w4_h_8bpc_c: 98.9
mct_bilinear_w4_h_8bpc_sse2: 30.2
mct_bilinear_w4_h_8bpc_ssse3: 11.5
---------------------
mct_bilinear_w8_h_8bpc_c: 175.3
mct_bilinear_w8_h_8bpc_sse2: 57.0
mct_bilinear_w8_h_8bpc_ssse3: 19.7
---------------------
mct_bilinear_w16_h_8bpc_c: 396.2
mct_bilinear_w16_h_8bpc_sse2: 179.3
mct_bilinear_w16_h_8bpc_ssse3: 50.9
---------------------
mct_bilinear_w32_h_8bpc_c: 1311.2
mct_bilinear_w32_h_8bpc_sse2: 718.8
mct_bilinear_w32_h_8bpc_ssse3: 243.9
---------------------
mct_bilinear_w64_h_8bpc_c: 2892.7
mct_bilinear_w64_h_8bpc_sse2: 1746.0
mct_bilinear_w64_h_8bpc_ssse3: 568.0
---------------------
mct_bilinear_w128_h_8bpc_c: 7192.6
mct_bilinear_w128_h_8bpc_sse2: 4339.8
mct_bilinear_w128_h_8bpc_ssse3: 1619.2
------------------------------------------
mct_bilinear_w4_v_8bpc_c: 129.7
mct_bilinear_w4_v_8bpc_sse2: 26.6
mct_bilinear_w4_v_8bpc_ssse3: 16.7
---------------------
mct_bilinear_w8_v_8bpc_c: 233.3
mct_bilinear_w8_v_8bpc_sse2: 55.0
mct_bilinear_w8_v_8bpc_ssse3: 24.7
---------------------
mct_bilinear_w16_v_8bpc_c: 498.9
mct_bilinear_w16_v_8bpc_sse2: 146.0
mct_bilinear_w16_v_8bpc_ssse3: 54.2
---------------------
mct_bilinear_w32_v_8bpc_c: 1562.2
mct_bilinear_w32_v_8bpc_sse2: 560.6
mct_bilinear_w32_v_8bpc_ssse3: 201.0
---------------------
mct_bilinear_w64_v_8bpc_c: 3221.3
mct_bilinear_w64_v_8bpc_sse2: 1380.6
mct_bilinear_w64_v_8bpc_ssse3: 499.3
---------------------
mct_bilinear_w128_v_8bpc_c: 7357.7
mct_bilinear_w128_v_8bpc_sse2: 3439.0
mct_bilinear_w128_v_8bpc_ssse3: 1489.1
------------------------------------------
mct_bilinear_w4_hv_8bpc_c: 185.0
mct_bilinear_w4_hv_8bpc_sse2: 54.5
mct_bilinear_w4_hv_8bpc_ssse3: 22.1
---------------------
mct_bilinear_w8_hv_8bpc_c: 377.8
mct_bilinear_w8_hv_8bpc_sse2: 104.3
mct_bilinear_w8_hv_8bpc_ssse3: 35.8
---------------------
mct_bilinear_w16_hv_8bpc_c: 1159.4
mct_bilinear_w16_hv_8bpc_sse2: 311.0
mct_bilinear_w16_hv_8bpc_ssse3: 106.3
---------------------
mct_bilinear_w32_hv_8bpc_c: 4436.2
mct_bilinear_w32_hv_8bpc_sse2: 1230.7
mct_bilinear_w32_hv_8bpc_ssse3: 400.7
---------------------
mct_bilinear_w64_hv_8bpc_c: 10627.7
mct_bilinear_w64_hv_8bpc_sse2: 2934.2
mct_bilinear_w64_hv_8bpc_ssse3: 957.2
---------------------
mct_bilinear_w128_hv_8bpc_c: 26048.9
mct_bilinear_w128_hv_8bpc_sse2: 7590.3
mct_bilinear_w128_hv_8bpc_ssse3: 2947.0
------------------------------------------
2020-06-11 12:37:36 +02:00
Victorien Le Couviour--Tuffet
a755541faa
x86: Add put_8tap_scaled AVX2 asm
...
mc_scaled_8tap_regular_w2_8bpc_c: 764.4
mc_scaled_8tap_regular_w2_8bpc_avx2: 191.3
mc_scaled_8tap_regular_w2_dy1_8bpc_c: 705.8
mc_scaled_8tap_regular_w2_dy1_8bpc_avx2: 89.5
mc_scaled_8tap_regular_w2_dy2_8bpc_c: 964.0
mc_scaled_8tap_regular_w2_dy2_8bpc_avx2: 120.3
mc_scaled_8tap_regular_w4_8bpc_c: 1355.7
mc_scaled_8tap_regular_w4_8bpc_avx2: 180.9
mc_scaled_8tap_regular_w4_dy1_8bpc_c: 1233.2
mc_scaled_8tap_regular_w4_dy1_8bpc_avx2: 115.3
mc_scaled_8tap_regular_w4_dy2_8bpc_c: 1707.6
mc_scaled_8tap_regular_w4_dy2_8bpc_avx2: 117.9
mc_scaled_8tap_regular_w8_8bpc_c: 2483.2
mc_scaled_8tap_regular_w8_8bpc_avx2: 294.8
mc_scaled_8tap_regular_w8_dy1_8bpc_c: 2166.4
mc_scaled_8tap_regular_w8_dy1_8bpc_avx2: 222.0
mc_scaled_8tap_regular_w8_dy2_8bpc_c: 3133.7
mc_scaled_8tap_regular_w8_dy2_8bpc_avx2: 292.6
mc_scaled_8tap_regular_w16_8bpc_c: 5239.2
mc_scaled_8tap_regular_w16_8bpc_avx2: 729.9
mc_scaled_8tap_regular_w16_dy1_8bpc_c: 5156.5
mc_scaled_8tap_regular_w16_dy1_8bpc_avx2: 602.2
mc_scaled_8tap_regular_w16_dy2_8bpc_c: 8018.4
mc_scaled_8tap_regular_w16_dy2_8bpc_avx2: 783.1
mc_scaled_8tap_regular_w32_8bpc_c: 14745.0
mc_scaled_8tap_regular_w32_8bpc_avx2: 2205.0
mc_scaled_8tap_regular_w32_dy1_8bpc_c: 14862.3
mc_scaled_8tap_regular_w32_dy1_8bpc_avx2: 1721.3
mc_scaled_8tap_regular_w32_dy2_8bpc_c: 23607.6
mc_scaled_8tap_regular_w32_dy2_8bpc_avx2: 2325.7
mc_scaled_8tap_regular_w64_8bpc_c: 54891.7
mc_scaled_8tap_regular_w64_8bpc_avx2: 8351.4
mc_scaled_8tap_regular_w64_dy1_8bpc_c: 50249.0
mc_scaled_8tap_regular_w64_dy1_8bpc_avx2: 5864.4
mc_scaled_8tap_regular_w64_dy2_8bpc_c: 79400.1
mc_scaled_8tap_regular_w64_dy2_8bpc_avx2: 8295.7
mc_scaled_8tap_regular_w128_8bpc_c: 121046.8
mc_scaled_8tap_regular_w128_8bpc_avx2: 21809.1
mc_scaled_8tap_regular_w128_dy1_8bpc_c: 133720.4
mc_scaled_8tap_regular_w128_dy1_8bpc_avx2: 16197.8
mc_scaled_8tap_regular_w128_dy2_8bpc_c: 218774.8
mc_scaled_8tap_regular_w128_dy2_8bpc_avx2: 22993.1
2020-06-01 15:30:36 +02:00
Victorien Le Couviour--Tuffet
98ed9be69b
Fix MC masks alignment for sizes >= 64 for AVX-512
...
Those need to be 64-byte aligned when w*h >= 64, as they are loaded 64 bytes at a time.
(also realigns the 4x4 masks to 16 as a 32-byte alignment is unnecessary)
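As a minimal illustration of what such an alignment requirement means, the sketch below checks whether an address is a multiple of a power-of-two alignment and rounds an address up to the next aligned one. The helper names `is_aligned` and `align_up` are hypothetical and are not taken from the dav1d sources.

```python
def is_aligned(addr: int, alignment: int) -> bool:
    """Return True if addr is a multiple of alignment (a power of two)."""
    return addr & (alignment - 1) == 0


def align_up(addr: int, alignment: int) -> int:
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


# A 64-byte load needs its buffer to start on a 64-byte boundary.
print(is_aligned(128, 64))  # True
print(is_aligned(96, 64))   # False
print(align_up(96, 64))     # 128
```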
2020-04-16 11:43:08 +02:00
Victorien Le Couviour--Tuffet
604d93c5f7
x86: Split AVX2 / AVX-512 CDEF into dedicated files
2020-04-07 16:21:53 +02:00
Victorien Le Couviour--Tuffet
95068df6a6
x86: Add cdef_filter_{4,8}x8 AVX-512 (Ice Lake) asm
...
cdef_filter_4x8_8bpc_avx2: 54.0
cdef_filter_4x8_8bpc_avx512icl: 35.5
=> +52.1%
cdef_filter_8x8_8bpc_avx2: 71.0
cdef_filter_8x8_8bpc_avx512icl: 49.0
=> +44.9%
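The percentages above are consistent with dividing the older timing by the newer one and subtracting 1, assuming the checkasm figures are timings where lower is faster. A small sketch of that arithmetic (the function name `speedup_percent` is hypothetical):

```python
def speedup_percent(baseline: float, optimized: float) -> float:
    """Relative speedup of `optimized` over `baseline`, in percent.

    Both arguments are timings, so a smaller value means faster code.
    """
    return (baseline / optimized - 1.0) * 100.0


# Figures quoted in the commit message above.
print(f"{speedup_percent(54.0, 35.5):+.1f}%")  # +52.1%
print(f"{speedup_percent(71.0, 49.0):+.1f}%")  # +44.9%
```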
2020-04-07 16:10:44 +02:00
Victorien Le Couviour--Tuffet
71f27407dd
x86: add some explanatory comment to wiener_filter_h
...
Explains how the clipping to the range defined in the spec works.
2020-04-03 14:21:36 +02:00
Victorien Le Couviour--Tuffet
22080aa30c
x86: optimize cdef_filter_{4x{4,8},8x8}_avx2
...
Add 2 separate code paths for pri/sec strengths equal to 0.
Having both strengths not equal to 0 is uncommon, so branching to skip
unnecessary computations is beneficial.
------------------------------------------
before: cdef_filter_4x4_8bpc_avx2: 93.8
after: cdef_filter_4x4_8bpc_avx2: 71.7
---------------------
before: cdef_filter_4x8_8bpc_avx2: 161.5
after: cdef_filter_4x8_8bpc_avx2: 116.3
---------------------
before: cdef_filter_8x8_8bpc_avx2: 221.8
after: cdef_filter_8x8_8bpc_avx2: 156.4
------------------------------------------
2020-02-24 11:23:20 +01:00
Victorien Le Couviour--Tuffet
1bd078c2e5
x86: add a separate fully edged case to cdef_filter_avx2
...
---------------------
fully edged blocks perf
------------------------------------------
before: cdef_filter_4x4_8bpc_avx2: 91.0
after: cdef_filter_4x4_8bpc_avx2: 75.7
---------------------
before: cdef_filter_4x8_8bpc_avx2: 154.6
after: cdef_filter_4x8_8bpc_avx2: 131.8
---------------------
before: cdef_filter_8x8_8bpc_avx2: 214.1
after: cdef_filter_8x8_8bpc_avx2: 195.9
------------------------------------------
2020-02-24 11:23:20 +01:00
Victorien Le Couviour--Tuffet
e706fac9cf
x86: add prep_8tap AVX512 asm
2020-01-20 11:42:53 +01:00
Victorien Le Couviour--Tuffet
b83cb9643b
x86: replace "mov hb, Xb" by "movzx hd, Xb" in MC
...
It's a little easier for the CPU to simply overwrite a 32-bit reg than to
write its low 8 bits while preserving bits 8 to 31.
To do the latter, it actually fetches those bits, merges them into a 32-bit
value, and writes that back to the 32-bit GPR.
As those bits are always cleared, perform a zero-extending mov to dword instead.
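The equivalence this change relies on can be sketched with a small model of the two instructions' architectural results. It only models register contents, not the micro-architectural cost described above, and the function names are hypothetical.

```python
def mov_low8(reg32: int, src8: int) -> int:
    """Model of `mov hb, Xb`: replace bits 0-7 and keep bits 8-31 of reg32."""
    return (reg32 & 0xFFFFFF00) | (src8 & 0xFF)


def movzx_dword(src8: int) -> int:
    """Model of `movzx hd, Xb`: overwrite all 32 bits with the zero-extended byte."""
    return src8 & 0xFF


# When bits 8-31 of the destination are already zero, both forms agree.
print(hex(mov_low8(0x00000000, 0x7F)))  # 0x7f
print(hex(movzx_dword(0x7F)))           # 0x7f

# With non-zero upper bits the byte write has to merge with the old value.
print(hex(mov_low8(0x12345600, 0xAB)))  # 0x123456ab
```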
2020-01-20 11:18:07 +01:00
Victorien Le Couviour--Tuffet and Ronald S. Bultje
5462c2a80d
x86: add prep_bilin AVX512 asm
...
------------------------------------------
mct_bilinear_w4_0_8bpc_avx2: 3.8
mct_bilinear_w4_0_8bpc_avx512icl: 3.7
---------------------
mct_bilinear_w8_0_8bpc_avx2: 5.0
mct_bilinear_w8_0_8bpc_avx512icl: 4.8
---------------------
mct_bilinear_w16_0_8bpc_avx2: 8.5
mct_bilinear_w16_0_8bpc_avx512icl: 7.1
---------------------
mct_bilinear_w32_0_8bpc_avx2: 29.5
mct_bilinear_w32_0_8bpc_avx512icl: 17.1
---------------------
mct_bilinear_w64_0_8bpc_avx2: 68.1
mct_bilinear_w64_0_8bpc_avx512icl: 34.7
---------------------
mct_bilinear_w128_0_8bpc_avx2: 180.5
mct_bilinear_w128_0_8bpc_avx512icl: 138.0
------------------------------------------
mct_bilinear_w4_h_8bpc_avx2: 4.0
mct_bilinear_w4_h_8bpc_avx512icl: 3.9
---------------------
mct_bilinear_w8_h_8bpc_avx2: 5.3
mct_bilinear_w8_h_8bpc_avx512icl: 5.0
---------------------
mct_bilinear_w16_h_8bpc_avx2: 11.7
mct_bilinear_w16_h_8bpc_avx512icl: 7.5
---------------------
mct_bilinear_w32_h_8bpc_avx2: 41.8
mct_bilinear_w32_h_8bpc_avx512icl: 20.3
---------------------
mct_bilinear_w64_h_8bpc_avx2: 94.9
mct_bilinear_w64_h_8bpc_avx512icl: 35.0
---------------------
mct_bilinear_w128_h_8bpc_avx2: 240.1
mct_bilinear_w128_h_8bpc_avx512icl: 143.8
------------------------------------------
mct_bilinear_w4_v_8bpc_avx2: 4.1
mct_bilinear_w4_v_8bpc_avx512icl: 4.0
---------------------
mct_bilinear_w8_v_8bpc_avx2: 6.0
mct_bilinear_w8_v_8bpc_avx512icl: 5.4
---------------------
mct_bilinear_w16_v_8bpc_avx2: 10.3
mct_bilinear_w16_v_8bpc_avx512icl: 8.9
---------------------
mct_bilinear_w32_v_8bpc_avx2: 29.5
mct_bilinear_w32_v_8bpc_avx512icl: 25.9
---------------------
mct_bilinear_w64_v_8bpc_avx2: 64.3
mct_bilinear_w64_v_8bpc_avx512icl: 41.3
---------------------
mct_bilinear_w128_v_8bpc_avx2: 198.2
mct_bilinear_w128_v_8bpc_avx512icl: 139.6
------------------------------------------
mct_bilinear_w4_hv_8bpc_avx2: 5.6
mct_bilinear_w4_hv_8bpc_avx512icl: 5.2
---------------------
mct_bilinear_w8_hv_8bpc_avx2: 8.3
mct_bilinear_w8_hv_8bpc_avx512icl: 7.0
---------------------
mct_bilinear_w16_hv_8bpc_avx2: 19.4
mct_bilinear_w16_hv_8bpc_avx512icl: 12.1
---------------------
mct_bilinear_w32_hv_8bpc_avx2: 69.1
mct_bilinear_w32_hv_8bpc_avx512icl: 32.5
---------------------
mct_bilinear_w64_hv_8bpc_avx2: 164.4
mct_bilinear_w64_hv_8bpc_avx512icl: 71.1
---------------------
mct_bilinear_w128_hv_8bpc_avx2: 405.2
mct_bilinear_w128_hv_8bpc_avx512icl: 193.1
------------------------------------------
2020-01-09 14:56:42 +01:00
Victorien Le Couviour--Tuffet and Ronald S. Bultje
40891aab9b
x86: add avx512icl cpu flag to x86inc.asm
2020-01-09 14:56:42 +01:00
Victorien Le Couviour--Tuffet and Ronald S. Bultje
430967a627
checkasm: x86: ensure all SIMD lanes are turned on at all times
...
YMM and ZMM registers on x86 are turned off to save power when they haven't
been used for some period of time. When they are used there will be a
"warmup" period during which performance will be reduced and inconsistent
which is problematic when trying to benchmark individual functions.
Periodically issue "dummy" instructions that use those registers to
prevent them from being powered down. The end result is more consistent
benchmark results.
Credits to Henrik Gramner's commit
1878c7f2af0a9c73e291488209109782c428cfcf from x264.
2020-01-09 14:56:42 +01:00
Victorien Le Couviour--Tuffet
36d615d120
x86: adapt SSSE3 wiener filter to SSE2
...
Also slightly optimized the 32-bit SSSE3, especially by the removal of
an XMM store/load.
---------------------
x86_64:
------------------------------------------
wiener_chroma_8bpc_c: 193155.1
wiener_chroma_8bpc_sse2: 48973.4
wiener_chroma_8bpc_ssse3: 31486.3
---------------------
wiener_luma_8bpc_c: 192787.5
wiener_luma_8bpc_sse2: 48674.9
wiener_luma_8bpc_ssse3: 30446.3
------------------------------------------
---------------------
x86_32:
------------------------------------------
wiener_chroma_8bpc_c: 309861.0
wiener_chroma_8bpc_sse2: 52345.9
wiener_chroma_8bpc_ssse3: 32983.2
---------------------
wiener_luma_8bpc_c: 317909.1
wiener_luma_8bpc_sse2: 52522.1
wiener_luma_8bpc_ssse3: 33323.1
------------------------------------------
2019-10-24 20:42:52 +02:00
Victorien Le Couviour--Tuffet
4866abab1f
x86: adapt SSSE3 warp_affine_8x8{,t} to SSE2
...
---------------------
x86_64:
------------------------------------------
warp_8x8_8bpc_c: 1761.5
warp_8x8_8bpc_sse2: 583.0
warp_8x8_8bpc_ssse3: 329.3
---------------------
warp_8x8t_8bpc_c: 1694.3
warp_8x8t_8bpc_sse2: 577.6
warp_8x8t_8bpc_ssse3: 334.1
------------------------------------------
---------------------
x86_32:
------------------------------------------
warp_8x8_8bpc_c: 1842.6
warp_8x8_8bpc_sse2: 677.1
warp_8x8_8bpc_ssse3: 394.9
---------------------
warp_8x8t_8bpc_c: 1741.1
warp_8x8t_8bpc_sse2: 648.5
warp_8x8t_8bpc_ssse3: 372.6
------------------------------------------
2019-10-24 20:42:52 +02:00
Victorien Le Couviour--Tuffet and Henrik Gramner
477905413d
x86inc: fix LOAD_MM_PERMUTATION for AVX512
...
Pre-permuting the registers in INIT_*MM avx512 (AVX512_MM_PERMUTATION)
is redundant. It causes the register mapping to be the same as without
the initial AVX512_MM_PERMUTATION, with the user SWAPs applied.
For example...
INIT_YMM avx512
SWAP m0, m16
SAVE_MM_PERMUTATION
; do whatever
LOAD_MM_PERMUTATION
... would result in m0 mapping to ymm16 instead of ymm0 and m1 to ymm1
instead of ymm17.
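The two mappings described above can be reproduced with a small model. It assumes that AVX512_MM_PERMUTATION swaps m0..m15 with m16..m31 and represents a mapping as a list where index i holds the physical register number of mi. This models only the outcome described in the message, not the x86inc macros themselves, and all names are hypothetical.

```python
NUM_REGS = 32


def identity_mapping() -> list[int]:
    """m0 -> ymm0, m1 -> ymm1, ..., m31 -> ymm31."""
    return list(range(NUM_REGS))


def avx512_mm_permutation(mapping: list[int]) -> list[int]:
    """Assumed behavior: swap m0..m15 with m16..m31."""
    out = mapping[:]
    for i in range(16):
        out[i], out[i + 16] = out[i + 16], out[i]
    return out


def swap(mapping: list[int], a: int, b: int) -> list[int]:
    """Model of a user-issued `SWAP ma, mb`."""
    out = mapping[:]
    out[a], out[b] = out[b], out[a]
    return out


# Intended result: initial permutation, then the user's SWAP m0, m16.
intended = swap(avx512_mm_permutation(identity_mapping()), 0, 16)

# Outcome described in the message: as if the initial permutation were
# absent and only the user's SWAP had been applied.
described_bug = swap(identity_mapping(), 0, 16)

print(intended[0], intended[1])            # 0 17
print(described_bug[0], described_bug[1])  # 16 1
```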
2019-10-21 20:21:38 +02:00
Victorien Le Couviour--Tuffet
3e9f967640
x86: adapt SSSE3 cdef_filter_{4x4,4x8,8x8} to SSE2
...
---------------------
x86_64:
------------------------------------------
cdef_filter_4x4_8bpc_c: 1376.0
cdef_filter_4x4_8bpc_sse2: 177.6
cdef_filter_4x4_8bpc_ssse3: 132.5
---------------------
cdef_filter_4x8_8bpc_c: 2725.0
cdef_filter_4x8_8bpc_sse2: 327.6
cdef_filter_4x8_8bpc_ssse3: 234.9
---------------------
cdef_filter_8x8_8bpc_c: 5938.8
cdef_filter_8x8_8bpc_sse2: 556.8
cdef_filter_8x8_8bpc_ssse3: 388.1
------------------------------------------
---------------------
x86_32:
------------------------------------------
cdef_filter_4x4_8bpc_c: 1569.5
cdef_filter_4x4_8bpc_sse2: 201.9
cdef_filter_4x4_8bpc_ssse3: 162.3
---------------------
cdef_filter_4x8_8bpc_c: 3141.6
cdef_filter_4x8_8bpc_sse2: 368.3
cdef_filter_4x8_8bpc_ssse3: 283.4
---------------------
cdef_filter_8x8_8bpc_c: 6534.5
cdef_filter_8x8_8bpc_sse2: 666.7
cdef_filter_8x8_8bpc_ssse3: 503.5
------------------------------------------
2019-10-18 11:05:11 +02:00
Victorien Le Couviour--Tuffet
11b7250644
tools: fix SSE2 cpu masking
2019-10-16 10:45:54 +02:00
Victorien Le Couviour--Tuffet
a91a03b0e1
x86: add warp_affine SSE4 and SSSE3 asm
...
------------------------------------------
x86_64: warp_8x8_8bpc_c: 1773.4
x86_32: warp_8x8_8bpc_c: 1740.4
----------
x86_64: warp_8x8_8bpc_ssse3: 317.5
x86_32: warp_8x8_8bpc_ssse3: 378.4
----------
x86_64: warp_8x8_8bpc_sse4: 303.7
x86_32: warp_8x8_8bpc_sse4: 367.7
----------
x86_64: warp_8x8_8bpc_avx2: 224.9
---------------------
---------------------
x86_64: warp_8x8t_8bpc_c: 1664.6
x86_32: warp_8x8t_8bpc_c: 1674.0
----------
x86_64: warp_8x8t_8bpc_ssse3: 320.7
x86_32: warp_8x8t_8bpc_ssse3: 379.5
----------
x86_64: warp_8x8t_8bpc_sse4: 304.8
x86_32: warp_8x8t_8bpc_sse4: 369.8
----------
x86_64: warp_8x8t_8bpc_avx2: 228.5
------------------------------------------
2019-09-30 15:40:43 +02:00
Victorien Le Couviour--Tuffet
c0865f35c7
x86: add 32-bit support to SSSE3 deblock lpf
...
------------------------------------------
x86_64: lpf_h_sb_uv_w4_8bpc_c: 430.6
x86_32: lpf_h_sb_uv_w4_8bpc_c: 788.6
x86_64: lpf_h_sb_uv_w4_8bpc_ssse3: 322.0
x86_32: lpf_h_sb_uv_w4_8bpc_ssse3: 302.4
---------------------
x86_64: lpf_h_sb_uv_w6_8bpc_c: 981.9
x86_32: lpf_h_sb_uv_w6_8bpc_c: 1579.6
x86_64: lpf_h_sb_uv_w6_8bpc_ssse3: 421.5
x86_32: lpf_h_sb_uv_w6_8bpc_ssse3: 431.6
---------------------
x86_64: lpf_h_sb_y_w4_8bpc_c: 3001.7
x86_32: lpf_h_sb_y_w4_8bpc_c: 7021.3
x86_64: lpf_h_sb_y_w4_8bpc_ssse3: 466.3
x86_32: lpf_h_sb_y_w4_8bpc_ssse3: 564.7
---------------------
x86_64: lpf_h_sb_y_w8_8bpc_c: 4457.7
x86_32: lpf_h_sb_y_w8_8bpc_c: 3657.8
x86_64: lpf_h_sb_y_w8_8bpc_ssse3: 818.9
x86_32: lpf_h_sb_y_w8_8bpc_ssse3: 927.9
---------------------
x86_64: lpf_h_sb_y_w16_8bpc_c: 1967.9
x86_32: lpf_h_sb_y_w16_8bpc_c: 3343.5
x86_64: lpf_h_sb_y_w16_8bpc_ssse3: 1836.7
x86_32: lpf_h_sb_y_w16_8bpc_ssse3: 1975.0
---------------------
x86_64: lpf_v_sb_uv_w4_8bpc_c: 369.4
x86_32: lpf_v_sb_uv_w4_8bpc_c: 793.6
x86_64: lpf_v_sb_uv_w4_8bpc_ssse3: 110.9
x86_32: lpf_v_sb_uv_w4_8bpc_ssse3: 133.0
---------------------
x86_64: lpf_v_sb_uv_w6_8bpc_c: 769.6
x86_32: lpf_v_sb_uv_w6_8bpc_c: 1576.7
x86_64: lpf_v_sb_uv_w6_8bpc_ssse3: 222.2
x86_32: lpf_v_sb_uv_w6_8bpc_ssse3: 232.2
---------------------
x86_64: lpf_v_sb_y_w4_8bpc_c: 772.4
x86_32: lpf_v_sb_y_w4_8bpc_c: 2596.5
x86_64: lpf_v_sb_y_w4_8bpc_ssse3: 179.8
x86_32: lpf_v_sb_y_w4_8bpc_ssse3: 234.7
---------------------
x86_64: lpf_v_sb_y_w8_8bpc_c: 1660.2
x86_32: lpf_v_sb_y_w8_8bpc_c: 3979.9
x86_64: lpf_v_sb_y_w8_8bpc_ssse3: 468.3
x86_32: lpf_v_sb_y_w8_8bpc_ssse3: 580.9
---------------------
x86_64: lpf_v_sb_y_w16_8bpc_c: 1889.6
x86_32: lpf_v_sb_y_w16_8bpc_c: 4728.7
x86_64: lpf_v_sb_y_w16_8bpc_ssse3: 1142.0
x86_32: lpf_v_sb_y_w16_8bpc_ssse3: 1174.8
------------------------------------------
2019-09-19 12:07:23 +02:00
Victorien Le Couviour--Tuffet
beda6e0d1c
build: fix meson deprecation warning
...
'build_' prefix is reserved by meson, this will become an error in the
future, as indicated by a warning when configuring the build dir.
Closes #285 .
2019-07-02 14:02:40 +02:00