dav1d

x/dav1d

mirror of https://code.videolan.org/videolan/dav1d synced 2026-06-11 04:03:05 +00:00

Author	SHA1	Message	Date
Victorien Le Couviour--Tuffet	f995e1fbf9	threading: Schedule TILE tasks for all passes at once Closes #465.	2026-04-27 21:09:28 +02:00
victorien	575af25859	flush: Reset f->task_thread.error f->task_thread.error can be set during flushing, not resetting this can lead to c->task_thread.first being increased after having already submitted a frame post flushing. That's fine if it happens on the very first frame, but if that's the case on any subsequent frame it will incur a wrong frame ordering. Now that a non-first frame will be considered as such, its tasks won't be able to execute (since they depend on a truly previous frame considered as being after) and c->task_thread.cur will be increased past that frame, with no way of it being reset, eventually leading to a hang.	2024-11-28 17:56:13 +01:00
Victorien Le Couviour--Tuffet	a500abb750	x86: Add refmvs.load_tmvs asm	2023-06-30 21:34:31 +02:00
Victorien Le Couviour--Tuffet	f89dbc0717	threading: Fix a race on task_thread.init_done Fixes a race where the tasks inserted by the init one could all be executed, signaling frame completion, leading to another frame starting before init_done could be set by the aforementioned init task, which then sets it, preventing the init task of the new frame to be executed. This then caused an assert to trigger down the task picking loop. Credits to Oss-Fuzz.	2023-05-04 14:59:07 +02:00
Victorien Le Couviour--Tuffet	8c731791c7	checkasm: Improve mv generation for refmvs.save_tmvs	2023-03-23 15:44:03 +01:00
Victorien Le Couviour--Tuffet	16c943484e	x86: Add refmvs.save_tmvs AVX-512 (Ice Lake) asm	2023-03-16 16:09:46 +01:00
Victorien Le Couviour--Tuffet	7d23ec4a04	x86: Add refmvs.save_tmvs SSSE3 asm	2023-03-13 15:19:35 +00:00
Victorien Le Couviour--Tuffet	c77fb1f016	x86: Optimize refmvs.save_tmvs AVX2 asm Process 2 blocks per iteration instead of 4. Credits to gramner@twoorioles.com.	2023-03-13 15:19:35 +00:00
Victorien Le Couviour--Tuffet	cf617fdae0	threading: Ensure passing the correct retval to decode_frame_exit We must reload error just before calling dav1d_decode_frame_exit, as it may have become stale between the last load and that call. This can result in crashes since we signal a seemingly successfully decoded frame, when it's not. Reloading error within the frame done condition's body ensures a non-stale value, as we use 'f->task_thread.task_counter == 0' to ensure all other threads / tasks have already completed when entering it. In other words, only the last thread still working on this frame can execute this code, after all other threads have returned to doing something else.	2023-03-13 13:54:36 +01:00
Victorien Le Couviour--Tuffet	6b8438b193	x86: Add refmvs.save_tmvs AVX2 asm	2023-03-06 13:36:22 +00:00
Victorien Le Couviour--Tuffet	0d9fe4ea65	refmvs: Add refmvs_load/save_tmvs to dsp interface	2023-03-06 13:36:22 +00:00
Victorien Le Couviour--Tuffet	19167a2c93	refmvs: Pack refmvs_temporal_block struct Pack the 5 bytes of data to improve memory and perf.	2023-03-06 13:36:22 +00:00
Victorien Le Couviour--Tuffet	9b4b244810	drain: Properly fix a desync between next and first The code in dav1d_drain_picture could result in a desync between c->task_thread.first (oldest submitted frame) and c->frame_thread.next (first frame to retrieve and/or next submit location). As we loop through drain, we always increment next, but first only if the frame has data. If the frame is visible we return. The problem arises when encountering (an) invisible frame(s), and the next entries haven't been fed yet, we then keep on looping increasing next but not first, as these have no data. We should always return when we encountered data (visible or invisible decoded frame): for visible, the code already returns, for invisible, we can store a boolean indicating we drained at least one frame, whenever we reach an empty entry after that, we return (all subsequent entries are guaranteed to be empty anyway), not incrementing next nor first. This will have the effect to insert the next frame at the first free spot (which is much better than the weird skips it's doing now). So basically, c->frame_thread.next could skip some (empty) entries. Now it's contiguous. Fixes #416.	2023-02-10 15:11:32 +01:00
Victorien Le Couviour--Tuffet	3f19ece69f	Revert "Fix mismatch between first and next in drain" This reverts commit `a51b6ce417`. We can't increment first when no data is there, otherwise we might do it while the first frame was not yet decoded, messing up ordering: imagine having a framedelay of 8, and a file with 7 frames. We feed 7 frames over 8 slots, now next points to [7] (empty entry), and we start draining cause EOF. We do need next to be incremented to reach the first frame ([0]), so it can be outputted, and only then first too. Fixes #418.	2023-02-09 16:36:57 +01:00
Victorien Le Couviour--Tuffet	a51b6ce417	Fix mismatch between first and next in drain Fixes #416.	2023-01-26 12:49:36 +01:00
Victorien Le Couviour--Tuffet	8f16314dba	threading: Add a pending list for async task insertion	2022-10-27 13:03:22 +00:00
Victorien Le Couviour--Tuffet	3e7886db54	threading: Fix a race around frame completion (frame-mt) The completion of the first frame to decode while an async reset request on that same frame is pending will render it stale. The processing of such a stale request is likely to result in a hang. One reason this happens is the skip condition at the beginning of reset_task_cur(). => Consume the async request before that check. Another reason is several threads producing async reset requests in parallel: an async request for the first frame could cascade through the other threads (other frames) during completion of that frame, meaning not being caught by the last synchronous reset_task_cur() after signaling the main thread and before releasing the lock. => To solve this we need to add protections at the racy locations. That means after we increase first, before returning from reset_task_cur_async(), and after consuming the async request.	2022-10-20 14:23:30 +02:00
Victorien Le Couviour--Tuffet	6680d26f30	threading: Limit the progress bitfields to the used size Store the used size instead of the allocated size. The used size can be smaller than the allocated size, which results in a wrong computation of the linear progress from the frame_progress bitfield.	2022-09-08 14:50:25 +02:00
Victorien Le Couviour--Tuffet	895fed08e1	checkasm: Add short options	2022-09-02 17:15:18 +02:00
Victorien Le Couviour--Tuffet	713a4f4e50	checkasm: Add pattern matching to --test	2022-09-02 17:15:18 +02:00
Victorien Le Couviour--Tuffet	a63a7c9674	checkasm: Remove pattern matching from --bench The pattern matching feature has been improved and is now performed under the new --function parameter, rendering this one obsolete.	2022-09-02 17:15:18 +02:00
Victorien Le Couviour--Tuffet	d5d37926b6	checkasm: Add a --function option Allows to run checkasm only for functions matching a given pattern.	2022-09-02 17:15:18 +02:00
Victorien Le Couviour--Tuffet	a3a55b1849	threading: Fix copy_lpf_progress initialization The copy_lpf_progress bitfield might not be fully cleared when size goes down. Credit to Oss-Fuzz.	2022-08-30 17:31:28 +02:00
Victorien Le Couviour--Tuffet	9717802d01	checkasm/lpf: Use operating dimensions Fixes use of uninitialized value.	2022-06-13 14:00:59 +02:00
Victorien Le Couviour--Tuffet and Ronald S. Bultje	b4f9eac858	checkasm: Fix uninitialized variable fg_data->num_y_points is used in generate_grain_uv, but is only set after the call: move the initialization above.	2022-05-31 16:28:34 +00:00
Victorien Le Couviour--Tuffet	ebeaac6d60	Fix typo Insert missing space.	2022-05-25 19:10:00 +02:00
Victorien Le Couviour--Tuffet and Henrik Gramner	b4d70152fd	Fix delayed_fg.scaling alignment for AVX-512	2022-03-07 19:27:08 +00:00
Victorien Le Couviour--Tuffet	402b54fcae	Integrate film grain in the task threading system	2022-02-28 18:20:48 +01:00
Victorien Le Couviour--Tuffet	4a52aa4790	x86: Add mc.resize AVX-512 (Ice Lake) asm resize_8bpc_c: 542599.0 resize_8bpc_ssse3: 87635.4 resize_8bpc_avx2: 67401.1 resize_8bpc_avx512icl: 50263.6 resize_16bpc_c: 573438.9 resize_16bpc_ssse3: 121505.2 resize_16bpc_avx2: 83293.4 resize_16bpc_avx512icl: 77974.8	2022-01-24 15:37:16 +01:00
Victorien Le Couviour--Tuffet	1cdde64f82	Run the init tasks for all frames first	2022-01-24 13:16:54 +01:00
Victorien Le Couviour--Tuffet	a8f3124a6c	Split the frame init task Allows to run most of dav1d_decode_frame_init unconditionally by putting the CDF and subsequent initializations in a separate task.	2022-01-24 13:16:54 +01:00
Victorien Le Couviour--Tuffet	1e3f0bea39	Move ENTROPY_PROGRESS task up	2022-01-19 20:17:31 +01:00
Victorien Le Couviour--Tuffet	6aaeeea689	Fix current frame selector wrapping condition This could cause a desync between first and cur, which results in skipping a frame, halting the decoding. This desync typically doesn't occur "long enough" in the current state of the project to trigger the bug, as some frames would fix this cur back. In order to trigger this, one needs to call reset_task_cur() on the last frame, this would be the call post insertion of the INIT task (during dav1d_task_frame_init). This doesn't happen as we would normally pick a task from a previous frame already in the queue.	2022-01-19 20:17:31 +01:00
Victorien Le Couviour--Tuffet	5919517ff6	x86: Add high bitdepth mc(t)_scaled SSSE3 asm mc_scaled_8tap_regular_w2_16bpc_c: 737.7 mc_scaled_8tap_regular_w2_16bpc_ssse3: 151.7 mc_scaled_8tap_regular_w2_16bpc_avx2: 141.2 mc_scaled_8tap_regular_w2_dy1_16bpc_c: 660.3 mc_scaled_8tap_regular_w2_dy1_16bpc_ssse3: 80.8 mc_scaled_8tap_regular_w2_dy1_16bpc_avx2: 73.2 mc_scaled_8tap_regular_w2_dy2_16bpc_c: 884.9 mc_scaled_8tap_regular_w2_dy2_16bpc_ssse3: 101.6 mc_scaled_8tap_regular_w2_dy2_16bpc_avx2: 87.2 mc_scaled_8tap_regular_w4_16bpc_c: 1356.3 mc_scaled_8tap_regular_w4_16bpc_ssse3: 172.3 mc_scaled_8tap_regular_w4_16bpc_avx2: 172.5 mc_scaled_8tap_regular_w4_dy1_16bpc_c: 1244.9 mc_scaled_8tap_regular_w4_dy1_16bpc_ssse3: 125.7 mc_scaled_8tap_regular_w4_dy1_16bpc_avx2: 96.1 mc_scaled_8tap_regular_w4_dy2_16bpc_c: 1665.6 mc_scaled_8tap_regular_w4_dy2_16bpc_ssse3: 150.2 mc_scaled_8tap_regular_w4_dy2_16bpc_avx2: 112.8 mc_scaled_8tap_regular_w8_16bpc_c: 2536.5 mc_scaled_8tap_regular_w8_16bpc_ssse3: 383.4 mc_scaled_8tap_regular_w8_16bpc_avx2: 256.2 mc_scaled_8tap_regular_w8_dy1_16bpc_c: 2331.8 mc_scaled_8tap_regular_w8_dy1_16bpc_ssse3: 350.0 mc_scaled_8tap_regular_w8_dy1_16bpc_avx2: 214.0 mc_scaled_8tap_regular_w8_dy2_16bpc_c: 3169.6 mc_scaled_8tap_regular_w8_dy2_16bpc_ssse3: 395.7 mc_scaled_8tap_regular_w8_dy2_16bpc_avx2: 265.7 mc_scaled_8tap_regular_w16_16bpc_c: 6384.6 mc_scaled_8tap_regular_w16_16bpc_ssse3: 1004.4 mc_scaled_8tap_regular_w16_16bpc_avx2: 665.0 mc_scaled_8tap_regular_w16_dy1_16bpc_c: 6103.4 mc_scaled_8tap_regular_w16_dy1_16bpc_ssse3: 896.3 mc_scaled_8tap_regular_w16_dy1_16bpc_avx2: 544.2 mc_scaled_8tap_regular_w16_dy2_16bpc_c: 8584.5 mc_scaled_8tap_regular_w16_dy2_16bpc_ssse3: 1049.0 mc_scaled_8tap_regular_w16_dy2_16bpc_avx2: 695.1 mc_scaled_8tap_regular_w32_16bpc_c: 19672.8 mc_scaled_8tap_regular_w32_16bpc_ssse3: 3204.3 mc_scaled_8tap_regular_w32_16bpc_avx2: 2109.6 mc_scaled_8tap_regular_w32_dy1_16bpc_c: 15964.6 mc_scaled_8tap_regular_w32_dy1_16bpc_ssse3: 2634.5 
mc_scaled_8tap_regular_w32_dy1_16bpc_avx2: 1555.8 mc_scaled_8tap_regular_w32_dy2_16bpc_c: 24156.9 mc_scaled_8tap_regular_w32_dy2_16bpc_ssse3: 3217.3 mc_scaled_8tap_regular_w32_dy2_16bpc_avx2: 2088.8 mc_scaled_8tap_regular_w64_16bpc_c: 74356.3 mc_scaled_8tap_regular_w64_16bpc_ssse3: 11225.9 mc_scaled_8tap_regular_w64_16bpc_avx2: 7434.7 mc_scaled_8tap_regular_w64_dy1_16bpc_c: 60080.9 mc_scaled_8tap_regular_w64_dy1_16bpc_ssse3: 8912.8 mc_scaled_8tap_regular_w64_dy1_16bpc_avx2: 5222.2 mc_scaled_8tap_regular_w64_dy2_16bpc_c: 88891.4 mc_scaled_8tap_regular_w64_dy2_16bpc_ssse3: 10824.8 mc_scaled_8tap_regular_w64_dy2_16bpc_avx2: 7086.3 mc_scaled_8tap_regular_w128_16bpc_c: 171633.3 mc_scaled_8tap_regular_w128_16bpc_ssse3: 27089.3 mc_scaled_8tap_regular_w128_16bpc_avx2: 17998.2 mc_scaled_8tap_regular_w128_dy1_16bpc_c: 164399.9 mc_scaled_8tap_regular_w128_dy1_16bpc_ssse3: 24694.1 mc_scaled_8tap_regular_w128_dy1_16bpc_avx2: 14711.2 mc_scaled_8tap_regular_w128_dy2_16bpc_c: 244865.3 mc_scaled_8tap_regular_w128_dy2_16bpc_ssse3: 30599.1 mc_scaled_8tap_regular_w128_dy2_16bpc_avx2: 20341.1 mct_scaled_8tap_regular_w4_16bpc_c: 946.2 mct_scaled_8tap_regular_w4_16bpc_ssse3: 117.5 mct_scaled_8tap_regular_w4_16bpc_avx2: 112.5 mct_scaled_8tap_regular_w4_dy1_16bpc_c: 886.1 mct_scaled_8tap_regular_w4_dy1_16bpc_ssse3: 100.5 mct_scaled_8tap_regular_w4_dy1_16bpc_avx2: 76.8 mct_scaled_8tap_regular_w4_dy2_16bpc_c: 1170.1 mct_scaled_8tap_regular_w4_dy2_16bpc_ssse3: 117.6 mct_scaled_8tap_regular_w4_dy2_16bpc_avx2: 87.9 mct_scaled_8tap_regular_w8_16bpc_c: 2784.2 mct_scaled_8tap_regular_w8_16bpc_ssse3: 408.5 mct_scaled_8tap_regular_w8_16bpc_avx2: 280.3 mct_scaled_8tap_regular_w8_dy1_16bpc_c: 2530.5 mct_scaled_8tap_regular_w8_dy1_16bpc_ssse3: 358.2 mct_scaled_8tap_regular_w8_dy1_16bpc_avx2: 227.1 mct_scaled_8tap_regular_w8_dy2_16bpc_c: 3525.0 mct_scaled_8tap_regular_w8_dy2_16bpc_ssse3: 425.6 mct_scaled_8tap_regular_w8_dy2_16bpc_avx2: 283.6 mct_scaled_8tap_regular_w16_16bpc_c: 6773.8 
mct_scaled_8tap_regular_w16_16bpc_ssse3: 1054.6 mct_scaled_8tap_regular_w16_16bpc_avx2: 696.4 mct_scaled_8tap_regular_w16_dy1_16bpc_c: 6418.0 mct_scaled_8tap_regular_w16_dy1_16bpc_ssse3: 938.7 mct_scaled_8tap_regular_w16_dy1_16bpc_avx2: 584.5 mct_scaled_8tap_regular_w16_dy2_16bpc_c: 9432.4 mct_scaled_8tap_regular_w16_dy2_16bpc_ssse3: 1125.3 mct_scaled_8tap_regular_w16_dy2_16bpc_avx2: 753.1 mct_scaled_8tap_regular_w32_16bpc_c: 26028.8 mct_scaled_8tap_regular_w32_16bpc_ssse3: 4128.4 mct_scaled_8tap_regular_w32_16bpc_avx2: 2748.4 mct_scaled_8tap_regular_w32_dy1_16bpc_c: 21604.3 mct_scaled_8tap_regular_w32_dy1_16bpc_ssse3: 3312.4 mct_scaled_8tap_regular_w32_dy1_16bpc_avx2: 2051.1 mct_scaled_8tap_regular_w32_dy2_16bpc_c: 32844.3 mct_scaled_8tap_regular_w32_dy2_16bpc_ssse3: 4102.9 mct_scaled_8tap_regular_w32_dy2_16bpc_avx2: 2741.6 mct_scaled_8tap_regular_w64_16bpc_c: 49101.8 mct_scaled_8tap_regular_w64_16bpc_ssse3: 8758.9 mct_scaled_8tap_regular_w64_16bpc_avx2: 5822.2 mct_scaled_8tap_regular_w64_dy1_16bpc_c: 53557.7 mct_scaled_8tap_regular_w64_dy1_16bpc_ssse3: 8469.7 mct_scaled_8tap_regular_w64_dy1_16bpc_avx2: 5264.3 mct_scaled_8tap_regular_w64_dy2_16bpc_c: 83379.7 mct_scaled_8tap_regular_w64_dy2_16bpc_ssse3: 10623.7 mct_scaled_8tap_regular_w64_dy2_16bpc_avx2: 7164.0 mct_scaled_8tap_regular_w128_16bpc_c: 163182.2 mct_scaled_8tap_regular_w128_16bpc_ssse3: 26452.9 mct_scaled_8tap_regular_w128_16bpc_avx2: 18402.2 mct_scaled_8tap_regular_w128_dy1_16bpc_c: 148199.8 mct_scaled_8tap_regular_w128_dy1_16bpc_ssse3: 23584.9 mct_scaled_8tap_regular_w128_dy1_16bpc_avx2: 14808.1 mct_scaled_8tap_regular_w128_dy2_16bpc_c: 234702.2 mct_scaled_8tap_regular_w128_dy2_16bpc_ssse3: 29653.8 mct_scaled_8tap_regular_w128_dy2_16bpc_avx2: 20042.4	2022-01-12 19:08:56 +01:00
Victorien Le Couviour--Tuffet	42ad602ddd	x86: Add 8-bit mc(t)_scaled SSSE3 32-bit asm mc_scaled_8tap_regular_w2_8bpc_c: 1070.7 mc_scaled_8tap_regular_w2_8bpc_ssse3: 253.0 mc_scaled_8tap_regular_w2_dy1_8bpc_c: 1079.9 mc_scaled_8tap_regular_w2_dy1_8bpc_ssse3: 114.8 mc_scaled_8tap_regular_w2_dy2_8bpc_c: 1466.1 mc_scaled_8tap_regular_w2_dy2_8bpc_ssse3: 145.7 mc_scaled_8tap_regular_w4_8bpc_c: 1965.4 mc_scaled_8tap_regular_w4_8bpc_ssse3: 251.4 mc_scaled_8tap_regular_w4_dy1_8bpc_c: 1989.4 mc_scaled_8tap_regular_w4_dy1_8bpc_ssse3: 166.1 mc_scaled_8tap_regular_w4_dy2_8bpc_c: 2728.8 mc_scaled_8tap_regular_w4_dy2_8bpc_ssse3: 163.4 mc_scaled_8tap_regular_w8_8bpc_c: 3670.1 mc_scaled_8tap_regular_w8_8bpc_ssse3: 477.0 mc_scaled_8tap_regular_w8_dy1_8bpc_c: 3651.1 mc_scaled_8tap_regular_w8_dy1_8bpc_ssse3: 464.8 mc_scaled_8tap_regular_w8_dy2_8bpc_c: 5079.6 mc_scaled_8tap_regular_w8_dy2_8bpc_ssse3: 494.0 mc_scaled_8tap_regular_w16_8bpc_c: 8366.9 mc_scaled_8tap_regular_w16_8bpc_ssse3: 1197.4 mc_scaled_8tap_regular_w16_dy1_8bpc_c: 9088.5 mc_scaled_8tap_regular_w16_dy1_8bpc_ssse3: 1212.6 mc_scaled_8tap_regular_w16_dy2_8bpc_c: 13166.1 mc_scaled_8tap_regular_w16_dy2_8bpc_ssse3: 1301.4 mc_scaled_8tap_regular_w32_8bpc_c: 29883.7 mc_scaled_8tap_regular_w32_8bpc_ssse3: 3990.3 mc_scaled_8tap_regular_w32_dy1_8bpc_c: 23404.1 mc_scaled_8tap_regular_w32_dy1_8bpc_ssse3: 3617.4 mc_scaled_8tap_regular_w32_dy2_8bpc_c: 36248.3 mc_scaled_8tap_regular_w32_dy2_8bpc_ssse3: 3949.3 mc_scaled_8tap_regular_w64_8bpc_c: 57228.6 mc_scaled_8tap_regular_w64_8bpc_ssse3: 9359.4 mc_scaled_8tap_regular_w64_dy1_8bpc_c: 87271.8 mc_scaled_8tap_regular_w64_dy1_8bpc_ssse3: 12472.7 mc_scaled_8tap_regular_w64_dy2_8bpc_c: 135050.9 mc_scaled_8tap_regular_w64_dy2_8bpc_ssse3: 13585.4 mc_scaled_8tap_regular_w128_8bpc_c: 219123.0 mc_scaled_8tap_regular_w128_8bpc_ssse3: 31867.7 mc_scaled_8tap_regular_w128_dy1_8bpc_c: 240143.3 mc_scaled_8tap_regular_w128_dy1_8bpc_ssse3: 35275.7 mc_scaled_8tap_regular_w128_dy2_8bpc_c: 376357.7 
mc_scaled_8tap_regular_w128_dy2_8bpc_ssse3: 39411.4 mct_scaled_8tap_regular_w4_8bpc_c: 1178.7 mct_scaled_8tap_regular_w4_8bpc_ssse3: 176.8 mct_scaled_8tap_regular_w4_dy1_8bpc_c: 1354.8 mct_scaled_8tap_regular_w4_dy1_8bpc_ssse3: 131.5 mct_scaled_8tap_regular_w4_dy2_8bpc_c: 1832.2 mct_scaled_8tap_regular_w4_dy2_8bpc_ssse3: 123.0 mct_scaled_8tap_regular_w8_8bpc_c: 3547.6 mct_scaled_8tap_regular_w8_8bpc_ssse3: 526.0 mct_scaled_8tap_regular_w8_dy1_8bpc_c: 3683.8 mct_scaled_8tap_regular_w8_dy1_8bpc_ssse3: 513.8 mct_scaled_8tap_regular_w8_dy2_8bpc_c: 5260.7 mct_scaled_8tap_regular_w8_dy2_8bpc_ssse3: 566.1 mct_scaled_8tap_regular_w16_8bpc_c: 8424.5 mct_scaled_8tap_regular_w16_8bpc_ssse3: 1340.0 mct_scaled_8tap_regular_w16_dy1_8bpc_c: 9515.8 mct_scaled_8tap_regular_w16_dy1_8bpc_ssse3: 1337.0 mct_scaled_8tap_regular_w16_dy2_8bpc_c: 14247.3 mct_scaled_8tap_regular_w16_dy2_8bpc_ssse3: 1492.7 mct_scaled_8tap_regular_w32_8bpc_c: 32059.9 mct_scaled_8tap_regular_w32_8bpc_ssse3: 5177.5 mct_scaled_8tap_regular_w32_dy1_8bpc_c: 32557.6 mct_scaled_8tap_regular_w32_dy1_8bpc_ssse3: 4889.9 mct_scaled_8tap_regular_w32_dy2_8bpc_c: 50844.2 mct_scaled_8tap_regular_w32_dy2_8bpc_ssse3: 5667.1 mct_scaled_8tap_regular_w64_8bpc_c: 59903.1 mct_scaled_8tap_regular_w64_8bpc_ssse3: 10453.6 mct_scaled_8tap_regular_w64_dy1_8bpc_c: 80298.8 mct_scaled_8tap_regular_w64_dy1_8bpc_ssse3: 12597.8 mct_scaled_8tap_regular_w64_dy2_8bpc_c: 127244.8 mct_scaled_8tap_regular_w64_dy2_8bpc_ssse3: 14677.9 mct_scaled_8tap_regular_w128_8bpc_c: 280097.0 mct_scaled_8tap_regular_w128_8bpc_ssse3: 41989.3 mct_scaled_8tap_regular_w128_dy1_8bpc_c: 208913.2 mct_scaled_8tap_regular_w128_dy1_8bpc_ssse3: 35525.2 mct_scaled_8tap_regular_w128_dy2_8bpc_c: 341367.6 mct_scaled_8tap_regular_w128_dy2_8bpc_ssse3: 41449.0	2021-12-13 14:27:00 +01:00
Victorien Le Couviour--Tuffet	3fd2ad938a	Fix a leak when threading is active Credit to Oss-Fuzz.	2021-11-01 15:14:21 +01:00
Victorien Le Couviour--Tuffet	f7e0d4c032	Remove lpf_stride parameter from LR filters	2021-10-29 22:18:20 +02:00
Victorien Le Couviour--Tuffet	609fbaba84	Allow CDEF and LR to run sbrows in parallel	2021-10-29 22:18:20 +02:00
Victorien Le Couviour--Tuffet	8e6d5214a3	CI: Add tests for negative stride	2021-10-29 22:18:05 +02:00
Victorien Le Couviour--Tuffet and Henrik Gramner	82d6d950a2	x86: Add deblock loop filters AVX-512 (Ice Lake) asm	2021-10-18 14:49:05 +00:00
Victorien Le Couviour--Tuffet	5991883dc6	x86: Add high bitdepth mc(t)_scaled AVX2 asm	2021-09-20 13:47:49 +02:00
Victorien Le Couviour--Tuffet	69ff474a7f	Revert "Group lr_lpf_line re-allocation with lr_mask_sz" This reverts commit `e53314177a`. Causes issues when the sample has both 8 and 16 bit content. Credit to Oss-Fuzz.	2021-09-10 19:39:05 +02:00
Victorien Le Couviour--Tuffet	833c818b87	Minor consistency fixes, purely cosmetic	2021-09-09 13:42:04 +00:00
Victorien Le Couviour--Tuffet	976b9e4965	Fix a potential hang when dav1d_submit_frame fails Credit to Oss-Fuzz.	2021-09-09 13:42:04 +00:00
Victorien Le Couviour--Tuffet	e53314177a	Group lr_lpf_line re-allocation with lr_mask_sz	2021-09-07 17:13:44 +02:00
Victorien Le Couviour--Tuffet	159215a82d	Fix lr_lpf_line re-allocation check Credit to Oss-Fuzz.	2021-09-07 17:13:41 +02:00
Victorien Le Couviour--Tuffet and Ronald S. Bultje	753eef833b	Merge the 3 threading models into a single one Merges the 3 threading parameters into a single `--threads=` argument. Frame threading can still be controlled via the `--framedelay=` argument. Internally, the threading model is now a global thread/task pool design. Co-authored-by: Ronald S. Bultje <rsbultje@gmail.com>	2021-09-03 16:06:31 +00:00
Victorien Le Couviour--Tuffet	b1adba65c9	x86: Add high bitdepth mc.resize SSSE3 asm resize_16bpc_ssse3: 141122.7 resize_16bpc_avx2: 105971.7	2021-08-12 12:14:49 +02:00
Victorien Le Couviour--Tuffet	e647a54db9	x86: Fix minor things in mc.resize_8bpc_ssse3 - number of gpr and xmm regs in use - some cosmetics (no need to specify x for xmm regs on SSSE3) - a comment with wrong registers (unedited copy from AVX2 code)	2021-08-12 12:14:48 +02:00
Victorien Le Couviour--Tuffet	e479e4a942	x86: Add high bitdepth mc.resize AVX2 asm resize_8bpc_avx2: 82986.1 resize_16bpc_avx2: 103896.7	2021-08-12 12:14:48 +02:00
Victorien Le Couviour--Tuffet	b7f5503159	x86: Add minor improvement to mc.resize_8bpc_avx2 Simplify some gpr extract and sign extend operations.	2021-08-12 12:14:48 +02:00
Victorien Le Couviour--Tuffet	356387f6f6	x86: Add bpc suffix to mc functions	2021-08-12 12:14:48 +02:00
Victorien Le Couviour--Tuffet	fe903da5b8	x86: Rewrite sgr8 SSSE3 asm Old: sgr_3x3_8bpc_ssse3: 140121.1 sgr_3x3_8bpc_avx2: 72965.4 sgr_5x5_8bpc_ssse3: 89859.1 sgr_5x5_8bpc_avx2: 48881.9 sgr_mix_8bpc_ssse3: 236626.5 sgr_mix_8bpc_avx2: 110552.6 New: sgr_3x3_8bpc_ssse3: 117294.4 sgr_3x3_8bpc_avx2: 72243.5 sgr_5x5_8bpc_ssse3: 79929.6 sgr_5x5_8bpc_avx2: 49798.4 sgr_mix_8bpc_ssse3: 184183.9 sgr_mix_8bpc_avx2: 109771.7	2021-08-03 14:58:45 +00:00
Victorien Le Couviour--Tuffet	935175daa7	x86: Add minor improvements to sgr16 SSSE3 asm Old: sgr_5x5_10bpc_ssse3: 87026.6 sgr_5x5_10bpc_avx2: 51864.5 sgr_mix_10bpc_ssse3: 205460.2 sgr_mix_10bpc_avx2: 122199.7 New: sgr_5x5_10bpc_ssse3: 84786.5 sgr_5x5_10bpc_avx2: 51651.3 sgr_mix_10bpc_ssse3: 202722.2 sgr_mix_10bpc_avx2: 122340.0	2021-08-03 14:58:45 +00:00
Victorien Le Couviour--Tuffet	513fd90c26	x86: Add high bitdepth (10-bit) sgr SSSE3 asm	2021-07-12 07:40:23 +00:00
Victorien Le Couviour--Tuffet	12f170c437	x86: Add minor improvements to wiener16 SSSE3 asm	2021-07-12 07:40:23 +00:00
Victorien Le Couviour--Tuffet	193db389e9	x86: Add high bitdepth wiener filter SSSE3 asm	2021-06-09 14:15:31 +02:00
Victorien Le Couviour--Tuffet and Henrik Gramner	dc7cdc0b58	x86: Add high bitdepth pal_pred AVX2 asm	2021-05-04 22:39:17 +02:00
Victorien Le Couviour--Tuffet and Henrik Gramner	0d42b3030b	x86: Add high bitdepth ipred_cfl_ac_422 AVX2 asm	2021-05-04 17:00:07 +02:00
Victorien Le Couviour--Tuffet and Henrik Gramner	ec5e93eecd	x86: Add high bitdepth ipred_cfl_ac_420 AVX2 asm	2021-05-04 17:00:07 +02:00
Victorien Le Couviour--Tuffet and Henrik Gramner	de6813f92c	x86: Add high bitdepth ipred_filter AVX2 asm	2021-05-04 17:00:07 +02:00
Victorien Le Couviour--Tuffet and Jean-Baptiste Kempf	8b1a96e481	Fix potential deadlock If the postfilter tasks allocation fails, a deadlock would occur.	2021-02-05 23:54:58 +01:00
Victorien Le Couviour--Tuffet	288ed4b8ec	dav1dplay: Add pause and seek features	2021-02-01 11:18:04 +01:00
Victorien Le Couviour--Tuffet	549086e4d3	Add post-filters threading model	2021-01-28 15:08:10 +01:00
Victorien Le Couviour--Tuffet	4db73f115e	tests: Refactor seek_stress decoding functions	2021-01-28 15:08:10 +01:00
Victorien Le Couviour--Tuffet	66c8a1ec28	fuzzer: Remove redundant flush Calling dav1d_close already takes care of flushing the internal state. Calling it just before is superfluous.	2021-01-28 15:08:10 +01:00
Victorien Le Couviour--Tuffet	5686e8355c	tests/seek_stress: Reduce the number of iterations	2021-01-21 09:54:50 +01:00
Victorien Le Couviour--Tuffet	05d05f9776	CI: Run the seek stress test	2021-01-18 13:59:26 +01:00
Victorien Le Couviour--Tuffet	63a918b487	tests: Add a seek stress test Closes #203.	2021-01-18 13:58:30 +01:00
Victorien Le Couviour--Tuffet	493d2b9157	input/ivf: Add seeking capability	2021-01-15 14:56:23 +01:00
Victorien Le Couviour--Tuffet	a40d3b5f0f	Abort frame decoding properly on reference error This could cause a frame waiting on the current one to not be notified on error. Fixes #351.	2020-10-21 14:37:12 +02:00
Victorien Le Couviour--Tuffet	06f12a8995	x86: Add {put/prep}_{8tap/bilin} SSSE3 asm (64-bit)	2020-08-06 15:34:40 +02:00
Victorien Le Couviour--Tuffet	652e5b38b0	x86: Minor changes to MC scaled AVX2 asm	2020-08-05 12:25:53 +02:00
Victorien Le Couviour--Tuffet	a75ee78bd9	x86: Add put/prep_bilin_scaled AVX2 asm Bilin scaled being very rarely used, add a new table entry to mc_subpel_filters, and jump to the put/prep_8tap_scaled code. AVX2 performance is obviously the same as the 8tap code, the speed up is much smaller though, as the C code is a true bilinear codepath, auto-vectorized. Yet, the AVX2 performance are always better.	2020-06-18 11:37:00 +02:00
Victorien Le Couviour--Tuffet	ea74e3d513	x86: Add prep_8tap_scaled AVX2 asm mct_scaled_8tap_regular_w4_8bpc_c: 872.1 mct_scaled_8tap_regular_w4_8bpc_avx2: 125.6 mct_scaled_8tap_regular_w4_dy1_8bpc_c: 886.3 mct_scaled_8tap_regular_w4_dy1_8bpc_avx2: 84.0 mct_scaled_8tap_regular_w4_dy2_8bpc_c: 1189.1 mct_scaled_8tap_regular_w4_dy2_8bpc_avx2: 84.7 mct_scaled_8tap_regular_w8_8bpc_c: 2261.0 mct_scaled_8tap_regular_w8_8bpc_avx2: 306.2 mct_scaled_8tap_regular_w8_dy1_8bpc_c: 2189.9 mct_scaled_8tap_regular_w8_dy1_8bpc_avx2: 233.8 mct_scaled_8tap_regular_w8_dy2_8bpc_c: 3060.3 mct_scaled_8tap_regular_w8_dy2_8bpc_avx2: 282.8 mct_scaled_8tap_regular_w16_8bpc_c: 4335.3 mct_scaled_8tap_regular_w16_8bpc_avx2: 680.7 mct_scaled_8tap_regular_w16_dy1_8bpc_c: 5137.2 mct_scaled_8tap_regular_w16_dy1_8bpc_avx2: 578.6 mct_scaled_8tap_regular_w16_dy2_8bpc_c: 7878.4 mct_scaled_8tap_regular_w16_dy2_8bpc_avx2: 774.6 mct_scaled_8tap_regular_w32_8bpc_c: 17871.9 mct_scaled_8tap_regular_w32_8bpc_avx2: 2954.8 mct_scaled_8tap_regular_w32_dy1_8bpc_c: 18594.7 mct_scaled_8tap_regular_w32_dy1_8bpc_avx2: 2073.9 mct_scaled_8tap_regular_w32_dy2_8bpc_c: 28696.0 mct_scaled_8tap_regular_w32_dy2_8bpc_avx2: 2852.1 mct_scaled_8tap_regular_w64_8bpc_c: 46967.5 mct_scaled_8tap_regular_w64_8bpc_avx2: 7527.5 mct_scaled_8tap_regular_w64_dy1_8bpc_c: 45564.2 mct_scaled_8tap_regular_w64_dy1_8bpc_avx2: 5262.9 mct_scaled_8tap_regular_w64_dy2_8bpc_c: 72793.3 mct_scaled_8tap_regular_w64_dy2_8bpc_avx2: 7535.9 mct_scaled_8tap_regular_w128_8bpc_c: 111190.8 mct_scaled_8tap_regular_w128_8bpc_avx2: 19386.8 mct_scaled_8tap_regular_w128_dy1_8bpc_c: 122625.0 mct_scaled_8tap_regular_w128_dy1_8bpc_avx2: 15376.1 mct_scaled_8tap_regular_w128_dy2_8bpc_c: 197120.6 mct_scaled_8tap_regular_w128_dy2_8bpc_avx2: 21871.0	2020-06-18 11:37:00 +02:00
Victorien Le Couviour--Tuffet	22fb8a42a1	x86: Adapt SSSE3 prep_8tap to SSE2 --------------------- x86_64: ------------------------------------------ mct_8tap_regular_w4_h_8bpc_c: 302.3 mct_8tap_regular_w4_h_8bpc_sse2: 47.3 mct_8tap_regular_w4_h_8bpc_ssse3: 19.5 --------------------- mct_8tap_regular_w8_h_8bpc_c: 745.5 mct_8tap_regular_w8_h_8bpc_sse2: 235.2 mct_8tap_regular_w8_h_8bpc_ssse3: 70.4 --------------------- mct_8tap_regular_w16_h_8bpc_c: 1844.3 mct_8tap_regular_w16_h_8bpc_sse2: 755.6 mct_8tap_regular_w16_h_8bpc_ssse3: 225.9 --------------------- mct_8tap_regular_w32_h_8bpc_c: 6685.5 mct_8tap_regular_w32_h_8bpc_sse2: 2954.4 mct_8tap_regular_w32_h_8bpc_ssse3: 795.8 --------------------- mct_8tap_regular_w64_h_8bpc_c: 15633.5 mct_8tap_regular_w64_h_8bpc_sse2: 7120.4 mct_8tap_regular_w64_h_8bpc_ssse3: 1900.4 --------------------- mct_8tap_regular_w128_h_8bpc_c: 37772.1 mct_8tap_regular_w128_h_8bpc_sse2: 17698.1 mct_8tap_regular_w128_h_8bpc_ssse3: 4665.5 ------------------------------------------ mct_8tap_regular_w4_v_8bpc_c: 306.5 mct_8tap_regular_w4_v_8bpc_sse2: 71.7 mct_8tap_regular_w4_v_8bpc_ssse3: 37.9 --------------------- mct_8tap_regular_w8_v_8bpc_c: 923.3 mct_8tap_regular_w8_v_8bpc_sse2: 168.7 mct_8tap_regular_w8_v_8bpc_ssse3: 71.3 --------------------- mct_8tap_regular_w16_v_8bpc_c: 3040.1 mct_8tap_regular_w16_v_8bpc_sse2: 505.1 mct_8tap_regular_w16_v_8bpc_ssse3: 199.7 --------------------- mct_8tap_regular_w32_v_8bpc_c: 12354.8 mct_8tap_regular_w32_v_8bpc_sse2: 1942.0 mct_8tap_regular_w32_v_8bpc_ssse3: 714.2 --------------------- mct_8tap_regular_w64_v_8bpc_c: 29427.9 mct_8tap_regular_w64_v_8bpc_sse2: 4637.4 mct_8tap_regular_w64_v_8bpc_ssse3: 1829.2 --------------------- mct_8tap_regular_w128_v_8bpc_c: 72756.9 mct_8tap_regular_w128_v_8bpc_sse2: 11301.0 mct_8tap_regular_w128_v_8bpc_ssse3: 5020.6 ------------------------------------------ mct_8tap_regular_w4_hv_8bpc_c: 876.9 mct_8tap_regular_w4_hv_8bpc_sse2: 171.7 mct_8tap_regular_w4_hv_8bpc_ssse3: 112.2 --------------------- mct_8tap_regular_w8_hv_8bpc_c: 2215.1 mct_8tap_regular_w8_hv_8bpc_sse2: 730.2 mct_8tap_regular_w8_hv_8bpc_ssse3: 330.9 --------------------- mct_8tap_regular_w16_hv_8bpc_c: 6075.5 mct_8tap_regular_w16_hv_8bpc_sse2: 2252.1 mct_8tap_regular_w16_hv_8bpc_ssse3: 973.4 --------------------- mct_8tap_regular_w32_hv_8bpc_c: 22182.7 mct_8tap_regular_w32_hv_8bpc_sse2: 7692.6 mct_8tap_regular_w32_hv_8bpc_ssse3: 3599.8 --------------------- mct_8tap_regular_w64_hv_8bpc_c: 50876.8 mct_8tap_regular_w64_hv_8bpc_sse2: 18499.6 mct_8tap_regular_w64_hv_8bpc_ssse3: 8815.6 --------------------- mct_8tap_regular_w128_hv_8bpc_c: 122926.3 mct_8tap_regular_w128_hv_8bpc_sse2: 45120.0 mct_8tap_regular_w128_hv_8bpc_ssse3: 22085.7 ------------------------------------------	2020-06-11 12:37:36 +02:00
Victorien Le Couviour--Tuffet	83956bf10e	x86: Adapt SSSE3 prep_bilin to SSE2 --------------------- x86_64: ------------------------------------------ mct_bilinear_w4_h_8bpc_c: 98.9 mct_bilinear_w4_h_8bpc_sse2: 30.2 mct_bilinear_w4_h_8bpc_ssse3: 11.5 --------------------- mct_bilinear_w8_h_8bpc_c: 175.3 mct_bilinear_w8_h_8bpc_sse2: 57.0 mct_bilinear_w8_h_8bpc_ssse3: 19.7 --------------------- mct_bilinear_w16_h_8bpc_c: 396.2 mct_bilinear_w16_h_8bpc_sse2: 179.3 mct_bilinear_w16_h_8bpc_ssse3: 50.9 --------------------- mct_bilinear_w32_h_8bpc_c: 1311.2 mct_bilinear_w32_h_8bpc_sse2: 718.8 mct_bilinear_w32_h_8bpc_ssse3: 243.9 --------------------- mct_bilinear_w64_h_8bpc_c: 2892.7 mct_bilinear_w64_h_8bpc_sse2: 1746.0 mct_bilinear_w64_h_8bpc_ssse3: 568.0 --------------------- mct_bilinear_w128_h_8bpc_c: 7192.6 mct_bilinear_w128_h_8bpc_sse2: 4339.8 mct_bilinear_w128_h_8bpc_ssse3: 1619.2 ------------------------------------------ mct_bilinear_w4_v_8bpc_c: 129.7 mct_bilinear_w4_v_8bpc_sse2: 26.6 mct_bilinear_w4_v_8bpc_ssse3: 16.7 --------------------- mct_bilinear_w8_v_8bpc_c: 233.3 mct_bilinear_w8_v_8bpc_sse2: 55.0 mct_bilinear_w8_v_8bpc_ssse3: 24.7 --------------------- mct_bilinear_w16_v_8bpc_c: 498.9 mct_bilinear_w16_v_8bpc_sse2: 146.0 mct_bilinear_w16_v_8bpc_ssse3: 54.2 --------------------- mct_bilinear_w32_v_8bpc_c: 1562.2 mct_bilinear_w32_v_8bpc_sse2: 560.6 mct_bilinear_w32_v_8bpc_ssse3: 201.0 --------------------- mct_bilinear_w64_v_8bpc_c: 3221.3 mct_bilinear_w64_v_8bpc_sse2: 1380.6 mct_bilinear_w64_v_8bpc_ssse3: 499.3 --------------------- mct_bilinear_w128_v_8bpc_c: 7357.7 mct_bilinear_w128_v_8bpc_sse2: 3439.0 mct_bilinear_w128_v_8bpc_ssse3: 1489.1 ------------------------------------------ mct_bilinear_w4_hv_8bpc_c: 185.0 mct_bilinear_w4_hv_8bpc_sse2: 54.5 mct_bilinear_w4_hv_8bpc_ssse3: 22.1 --------------------- mct_bilinear_w8_hv_8bpc_c: 377.8 mct_bilinear_w8_hv_8bpc_sse2: 104.3 mct_bilinear_w8_hv_8bpc_ssse3: 35.8 --------------------- 
mct_bilinear_w16_hv_8bpc_c: 1159.4 mct_bilinear_w16_hv_8bpc_sse2: 311.0 mct_bilinear_w16_hv_8bpc_ssse3: 106.3 --------------------- mct_bilinear_w32_hv_8bpc_c: 4436.2 mct_bilinear_w32_hv_8bpc_sse2: 1230.7 mct_bilinear_w32_hv_8bpc_ssse3: 400.7 --------------------- mct_bilinear_w64_hv_8bpc_c: 10627.7 mct_bilinear_w64_hv_8bpc_sse2: 2934.2 mct_bilinear_w64_hv_8bpc_ssse3: 957.2 --------------------- mct_bilinear_w128_hv_8bpc_c: 26048.9 mct_bilinear_w128_hv_8bpc_sse2: 7590.3 mct_bilinear_w128_hv_8bpc_ssse3: 2947.0 ------------------------------------------	2020-06-11 12:37:36 +02:00
Victorien Le Couviour--Tuffet	a755541faa	x86: Add put_8tap_scaled AVX2 asm
mc_scaled_8tap_regular_w2_8bpc_c: 764.4
mc_scaled_8tap_regular_w2_8bpc_avx2: 191.3
mc_scaled_8tap_regular_w2_dy1_8bpc_c: 705.8
mc_scaled_8tap_regular_w2_dy1_8bpc_avx2: 89.5
mc_scaled_8tap_regular_w2_dy2_8bpc_c: 964.0
mc_scaled_8tap_regular_w2_dy2_8bpc_avx2: 120.3
mc_scaled_8tap_regular_w4_8bpc_c: 1355.7
mc_scaled_8tap_regular_w4_8bpc_avx2: 180.9
mc_scaled_8tap_regular_w4_dy1_8bpc_c: 1233.2
mc_scaled_8tap_regular_w4_dy1_8bpc_avx2: 115.3
mc_scaled_8tap_regular_w4_dy2_8bpc_c: 1707.6
mc_scaled_8tap_regular_w4_dy2_8bpc_avx2: 117.9
mc_scaled_8tap_regular_w8_8bpc_c: 2483.2
mc_scaled_8tap_regular_w8_8bpc_avx2: 294.8
mc_scaled_8tap_regular_w8_dy1_8bpc_c: 2166.4
mc_scaled_8tap_regular_w8_dy1_8bpc_avx2: 222.0
mc_scaled_8tap_regular_w8_dy2_8bpc_c: 3133.7
mc_scaled_8tap_regular_w8_dy2_8bpc_avx2: 292.6
mc_scaled_8tap_regular_w16_8bpc_c: 5239.2
mc_scaled_8tap_regular_w16_8bpc_avx2: 729.9
mc_scaled_8tap_regular_w16_dy1_8bpc_c: 5156.5
mc_scaled_8tap_regular_w16_dy1_8bpc_avx2: 602.2
mc_scaled_8tap_regular_w16_dy2_8bpc_c: 8018.4
mc_scaled_8tap_regular_w16_dy2_8bpc_avx2: 783.1
mc_scaled_8tap_regular_w32_8bpc_c: 14745.0
mc_scaled_8tap_regular_w32_8bpc_avx2: 2205.0
mc_scaled_8tap_regular_w32_dy1_8bpc_c: 14862.3
mc_scaled_8tap_regular_w32_dy1_8bpc_avx2: 1721.3
mc_scaled_8tap_regular_w32_dy2_8bpc_c: 23607.6
mc_scaled_8tap_regular_w32_dy2_8bpc_avx2: 2325.7
mc_scaled_8tap_regular_w64_8bpc_c: 54891.7
mc_scaled_8tap_regular_w64_8bpc_avx2: 8351.4
mc_scaled_8tap_regular_w64_dy1_8bpc_c: 50249.0
mc_scaled_8tap_regular_w64_dy1_8bpc_avx2: 5864.4
mc_scaled_8tap_regular_w64_dy2_8bpc_c: 79400.1
mc_scaled_8tap_regular_w64_dy2_8bpc_avx2: 8295.7
mc_scaled_8tap_regular_w128_8bpc_c: 121046.8
mc_scaled_8tap_regular_w128_8bpc_avx2: 21809.1
mc_scaled_8tap_regular_w128_dy1_8bpc_c: 133720.4
mc_scaled_8tap_regular_w128_dy1_8bpc_avx2: 16197.8
mc_scaled_8tap_regular_w128_dy2_8bpc_c: 218774.8
mc_scaled_8tap_regular_w128_dy2_8bpc_avx2: 22993.1	2020-06-01 15:30:36 +02:00
Victorien Le Couviour--Tuffet	98ed9be69b	Fix MC masks alignment for sizes >= 64 for AVX-512 Those need to be aligned when w*h >= 64, as we will try to load by 64 bytes. (also realigns the 4x4 masks to 16 as a 32-byte alignment is unnecessary)	2020-04-16 11:43:08 +02:00
Victorien Le Couviour--Tuffet	604d93c5f7	x86: Split AVX2 / AVX-512 CDEF into dedicated files	2020-04-07 16:21:53 +02:00
Victorien Le Couviour--Tuffet	95068df6a6	x86: Add cdef_filter_{4,8}x8 AVX-512 (Ice Lake) asm
cdef_filter_4x8_8bpc_avx2: 54.0
cdef_filter_4x8_8bpc_avx512icl: 35.5 => +52.1%
cdef_filter_8x8_8bpc_avx2: 71.0
cdef_filter_8x8_8bpc_avx512icl: 49.0 => +44.9%	2020-04-07 16:10:44 +02:00
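The percentages in the cdef_filter_{4,8}x8 AVX-512 message appear to be the ratio of the AVX2 timing to the AVX-512 timing, minus one. The sketch below reproduces them under that assumption; `speedup_percent` is a hypothetical helper name, not part of dav1d or checkasm.

```python
def speedup_percent(baseline: float, optimized: float) -> float:
    """Return how much faster `optimized` is than `baseline`, in percent.

    Both arguments are timings as printed in the commit message
    (lower is better); the unit cancels out in the ratio.
    """
    return (baseline / optimized - 1.0) * 100.0


# Timings copied from the commit message: (name, avx2, avx512icl).
results = [
    ("cdef_filter_4x8_8bpc", 54.0, 35.5),
    ("cdef_filter_8x8_8bpc", 71.0, 49.0),
]

for name, avx2, avx512icl in results:
    print(f"{name}: +{speedup_percent(avx2, avx512icl):.1f}%")
```

Rounded to one decimal place, this gives the same +52.1% and +44.9% as the message.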
Victorien Le Couviour--Tuffet	71f27407dd	x86: add some explanatory comment to wiener_filter_h Explains how the clipping to the range defined in the spec works.	2020-04-03 14:21:36 +02:00
Victorien Le Couviour--Tuffet	22080aa30c	x86: optimize cdef_filter_{4x{4,8},8x8}_avx2 Add 2 separate code paths for pri/sec strengths equal to 0. Having both strengths not equal to 0 is uncommon, so branching to skip unnecessary computations is beneficial.
------------------------------------------
before: cdef_filter_4x4_8bpc_avx2: 93.8
after: cdef_filter_4x4_8bpc_avx2: 71.7
---------------------
before: cdef_filter_4x8_8bpc_avx2: 161.5
after: cdef_filter_4x8_8bpc_avx2: 116.3
---------------------
before: cdef_filter_8x8_8bpc_avx2: 221.8
after: cdef_filter_8x8_8bpc_avx2: 156.4
------------------------------------------	2020-02-24 11:23:20 +01:00
Victorien Le Couviour--Tuffet	1bd078c2e5	x86: add a separate fully edged case to cdef_filter_avx2
---------------------
fully edged blocks perf
------------------------------------------
before: cdef_filter_4x4_8bpc_avx2: 91.0
after: cdef_filter_4x4_8bpc_avx2: 75.7
---------------------
before: cdef_filter_4x8_8bpc_avx2: 154.6
after: cdef_filter_4x8_8bpc_avx2: 131.8
---------------------
before: cdef_filter_8x8_8bpc_avx2: 214.1
after: cdef_filter_8x8_8bpc_avx2: 195.9
------------------------------------------	2020-02-24 11:23:20 +01:00
Victorien Le Couviour--Tuffet	e706fac9cf	x86: add prep_8tap AVX512 asm	2020-01-20 11:42:53 +01:00
Victorien Le Couviour--Tuffet	b83cb9643b	x86: replace "mov hb, Xb" by "movzx hd, Xb" in MC It's a little easier for the CPU to simply overwrite a 32-bit reg than to write its low 8 bits while preserving bits 8 to 31. To do the latter, it actually fetches those bits, merges them into a 32-bit value, and writes that back to the 32-bit GPR. As those bits are always cleared, perform a zero-extending mov to dword instead.	2020-01-20 11:18:07 +01:00
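The register semantics described in the "mov hb, Xb" message can be modeled with plain integers. This is only an illustrative sketch of the architectural result of each instruction, not of CPU timing; `mov_low_byte` and `movzx_dword` are hypothetical names introduced here.

```python
def mov_low_byte(dst32: int, src8: int) -> int:
    """Model `mov hb, Xb`: replace bits 0-7 and keep bits 8-31 of the
    destination, so the result depends on the old destination value."""
    return (dst32 & 0xFFFFFF00) | (src8 & 0xFF)


def movzx_dword(src8: int) -> int:
    """Model `movzx hd, Xb`: zero-extend the source byte to 32 bits,
    so the result depends only on the source."""
    return src8 & 0xFF


# When bits 8-31 of the destination are already clear, as the commit
# message says they always are, both forms produce the same value.
assert mov_low_byte(0x00000012, 0x34) == movzx_dword(0x34)

# With non-zero upper bits the byte move has to merge them in.
print(hex(mov_low_byte(0xABCD0012, 0x34)))
print(hex(movzx_dword(0x34)))
```

The model shows why the replacement is safe: the two forms differ only when the upper bits are set, which the message states does not happen in this code.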
Victorien Le Couviour--Tuffet and Ronald S. Bultje	5462c2a80d	x86: add prep_bilin AVX512 asm
------------------------------------------
mct_bilinear_w4_0_8bpc_avx2: 3.8
mct_bilinear_w4_0_8bpc_avx512icl: 3.7
---------------------
mct_bilinear_w8_0_8bpc_avx2: 5.0
mct_bilinear_w8_0_8bpc_avx512icl: 4.8
---------------------
mct_bilinear_w16_0_8bpc_avx2: 8.5
mct_bilinear_w16_0_8bpc_avx512icl: 7.1
---------------------
mct_bilinear_w32_0_8bpc_avx2: 29.5
mct_bilinear_w32_0_8bpc_avx512icl: 17.1
---------------------
mct_bilinear_w64_0_8bpc_avx2: 68.1
mct_bilinear_w64_0_8bpc_avx512icl: 34.7
---------------------
mct_bilinear_w128_0_8bpc_avx2: 180.5
mct_bilinear_w128_0_8bpc_avx512icl: 138.0
------------------------------------------
mct_bilinear_w4_h_8bpc_avx2: 4.0
mct_bilinear_w4_h_8bpc_avx512icl: 3.9
---------------------
mct_bilinear_w8_h_8bpc_avx2: 5.3
mct_bilinear_w8_h_8bpc_avx512icl: 5.0
---------------------
mct_bilinear_w16_h_8bpc_avx2: 11.7
mct_bilinear_w16_h_8bpc_avx512icl: 7.5
---------------------
mct_bilinear_w32_h_8bpc_avx2: 41.8
mct_bilinear_w32_h_8bpc_avx512icl: 20.3
---------------------
mct_bilinear_w64_h_8bpc_avx2: 94.9
mct_bilinear_w64_h_8bpc_avx512icl: 35.0
---------------------
mct_bilinear_w128_h_8bpc_avx2: 240.1
mct_bilinear_w128_h_8bpc_avx512icl: 143.8
------------------------------------------
mct_bilinear_w4_v_8bpc_avx2: 4.1
mct_bilinear_w4_v_8bpc_avx512icl: 4.0
---------------------
mct_bilinear_w8_v_8bpc_avx2: 6.0
mct_bilinear_w8_v_8bpc_avx512icl: 5.4
---------------------
mct_bilinear_w16_v_8bpc_avx2: 10.3
mct_bilinear_w16_v_8bpc_avx512icl: 8.9
---------------------
mct_bilinear_w32_v_8bpc_avx2: 29.5
mct_bilinear_w32_v_8bpc_avx512icl: 25.9
---------------------
mct_bilinear_w64_v_8bpc_avx2: 64.3
mct_bilinear_w64_v_8bpc_avx512icl: 41.3
---------------------
mct_bilinear_w128_v_8bpc_avx2: 198.2
mct_bilinear_w128_v_8bpc_avx512icl: 139.6
------------------------------------------
mct_bilinear_w4_hv_8bpc_avx2: 5.6
mct_bilinear_w4_hv_8bpc_avx512icl: 5.2
---------------------
mct_bilinear_w8_hv_8bpc_avx2: 8.3
mct_bilinear_w8_hv_8bpc_avx512icl: 7.0
---------------------
mct_bilinear_w16_hv_8bpc_avx2: 19.4
mct_bilinear_w16_hv_8bpc_avx512icl: 12.1
---------------------
mct_bilinear_w32_hv_8bpc_avx2: 69.1
mct_bilinear_w32_hv_8bpc_avx512icl: 32.5
---------------------
mct_bilinear_w64_hv_8bpc_avx2: 164.4
mct_bilinear_w64_hv_8bpc_avx512icl: 71.1
---------------------
mct_bilinear_w128_hv_8bpc_avx2: 405.2
mct_bilinear_w128_hv_8bpc_avx512icl: 193.1
------------------------------------------	2020-01-09 14:56:42 +01:00
Victorien Le Couviour--Tuffet and Ronald S. Bultje	40891aab9b	x86: add avx512icl cpu flag to x86inc.asm	2020-01-09 14:56:42 +01:00
Victorien Le Couviour--Tuffet and Ronald S. Bultje	430967a627	checkasm: x86: ensure all SIMD lanes are turned on at all times YMM and ZMM registers on x86 are turned off to save power when they haven't been used for some period of time. When they are used again, there is a "warmup" period during which performance is reduced and inconsistent, which is problematic when trying to benchmark individual functions. Periodically issue "dummy" instructions that use those registers to prevent them from being powered down. The end result is more consistent benchmark results. Credits to Henrik Gramner's commit 1878c7f2af0a9c73e291488209109782c428cfcf from x264.	2020-01-09 14:56:42 +01:00
Victorien Le Couviour--Tuffet	36d615d120	x86: adapt SSSE3 wiener filter to SSE2 Also slightly optimized the 32-bit SSSE3, especially by the removal of an XMM store/load.
---------------------
x86_64:
------------------------------------------
wiener_chroma_8bpc_c: 193155.1
wiener_chroma_8bpc_sse2: 48973.4
wiener_chroma_8bpc_ssse3: 31486.3
---------------------
wiener_luma_8bpc_c: 192787.5
wiener_luma_8bpc_sse2: 48674.9
wiener_luma_8bpc_ssse3: 30446.3
------------------------------------------
---------------------
x86_32:
------------------------------------------
wiener_chroma_8bpc_c: 309861.0
wiener_chroma_8bpc_sse2: 52345.9
wiener_chroma_8bpc_ssse3: 32983.2
---------------------
wiener_luma_8bpc_c: 317909.1
wiener_luma_8bpc_sse2: 52522.1
wiener_luma_8bpc_ssse3: 33323.1
------------------------------------------	2019-10-24 20:42:52 +02:00
Victorien Le Couviour--Tuffet	4866abab1f	x86: adapt SSSE3 warp_affine_8x8{,t} to SSE2
---------------------
x86_64:
------------------------------------------
warp_8x8_8bpc_c: 1761.5
warp_8x8_8bpc_sse2: 583.0
warp_8x8_8bpc_ssse3: 329.3
---------------------
warp_8x8t_8bpc_c: 1694.3
warp_8x8t_8bpc_sse2: 577.6
warp_8x8t_8bpc_ssse3: 334.1
------------------------------------------
---------------------
x86_32:
------------------------------------------
warp_8x8_8bpc_c: 1842.6
warp_8x8_8bpc_sse2: 677.1
warp_8x8_8bpc_ssse3: 394.9
---------------------
warp_8x8t_8bpc_c: 1741.1
warp_8x8t_8bpc_sse2: 648.5
warp_8x8t_8bpc_ssse3: 372.6
------------------------------------------	2019-10-24 20:42:52 +02:00
Victorien Le Couviour--Tuffet and Henrik Gramner	477905413d	x86inc: fix LOAD_MM_PERMUTATION for AVX512 Pre-permuting the registers in INIT_*MM avx512 (AVX512_MM_PERMUTATION) is redundant. It causes the register mapping to be the same as without the initial AVX512_MM_PERMUTATION, with the user SWAPs applied. For example...
    INIT_YMM avx512
    SWAP m0, m16
    SAVE_MM_PERMUTATION
    ; do whatever
    LOAD_MM_PERMUTATION
... would result in m0 mapping to ymm16 instead of ymm0 and m1 to ymm1 instead of ymm17.	2019-10-21 20:21:38 +02:00
Victorien Le Couviour--Tuffet	3e9f967640	x86: adapt SSSE3 cdef_filter_{4x4,4x8,8x8} to SSE2
---------------------
x86_64:
------------------------------------------
cdef_filter_4x4_8bpc_c: 1376.0
cdef_filter_4x4_8bpc_sse2: 177.6
cdef_filter_4x4_8bpc_ssse3: 132.5
---------------------
cdef_filter_4x8_8bpc_c: 2725.0
cdef_filter_4x8_8bpc_sse2: 327.6
cdef_filter_4x8_8bpc_ssse3: 234.9
---------------------
cdef_filter_8x8_8bpc_c: 5938.8
cdef_filter_8x8_8bpc_sse2: 556.8
cdef_filter_8x8_8bpc_ssse3: 388.1
------------------------------------------
---------------------
x86_32:
------------------------------------------
cdef_filter_4x4_8bpc_c: 1569.5
cdef_filter_4x4_8bpc_sse2: 201.9
cdef_filter_4x4_8bpc_ssse3: 162.3
---------------------
cdef_filter_4x8_8bpc_c: 3141.6
cdef_filter_4x8_8bpc_sse2: 368.3
cdef_filter_4x8_8bpc_ssse3: 283.4
---------------------
cdef_filter_8x8_8bpc_c: 6534.5
cdef_filter_8x8_8bpc_sse2: 666.7
cdef_filter_8x8_8bpc_ssse3: 503.5
------------------------------------------	2019-10-18 11:05:11 +02:00
Victorien Le Couviour--Tuffet	11b7250644	tools: fix SSE2 cpu masking	2019-10-16 10:45:54 +02:00
Victorien Le Couviour--Tuffet	a91a03b0e1	x86: add warp_affine SSE4 and SSSE3 asm
------------------------------------------
x86_64: warp_8x8_8bpc_c: 1773.4
x86_32: warp_8x8_8bpc_c: 1740.4
----------
x86_64: warp_8x8_8bpc_ssse3: 317.5
x86_32: warp_8x8_8bpc_ssse3: 378.4
----------
x86_64: warp_8x8_8bpc_sse4: 303.7
x86_32: warp_8x8_8bpc_sse4: 367.7
----------
x86_64: warp_8x8_8bpc_avx2: 224.9
---------------------
---------------------
x86_64: warp_8x8t_8bpc_c: 1664.6
x86_32: warp_8x8t_8bpc_c: 1674.0
----------
x86_64: warp_8x8t_8bpc_ssse3: 320.7
x86_32: warp_8x8t_8bpc_ssse3: 379.5
----------
x86_64: warp_8x8t_8bpc_sse4: 304.8
x86_32: warp_8x8t_8bpc_sse4: 369.8
----------
x86_64: warp_8x8t_8bpc_avx2: 228.5
------------------------------------------	2019-09-30 15:40:43 +02:00
Victorien Le Couviour--Tuffet	c0865f35c7	x86: add 32-bit support to SSSE3 deblock lpf
------------------------------------------
x86_64: lpf_h_sb_uv_w4_8bpc_c: 430.6
x86_32: lpf_h_sb_uv_w4_8bpc_c: 788.6
x86_64: lpf_h_sb_uv_w4_8bpc_ssse3: 322.0
x86_32: lpf_h_sb_uv_w4_8bpc_ssse3: 302.4
---------------------
x86_64: lpf_h_sb_uv_w6_8bpc_c: 981.9
x86_32: lpf_h_sb_uv_w6_8bpc_c: 1579.6
x86_64: lpf_h_sb_uv_w6_8bpc_ssse3: 421.5
x86_32: lpf_h_sb_uv_w6_8bpc_ssse3: 431.6
---------------------
x86_64: lpf_h_sb_y_w4_8bpc_c: 3001.7
x86_32: lpf_h_sb_y_w4_8bpc_c: 7021.3
x86_64: lpf_h_sb_y_w4_8bpc_ssse3: 466.3
x86_32: lpf_h_sb_y_w4_8bpc_ssse3: 564.7
---------------------
x86_64: lpf_h_sb_y_w8_8bpc_c: 4457.7
x86_32: lpf_h_sb_y_w8_8bpc_c: 3657.8
x86_64: lpf_h_sb_y_w8_8bpc_ssse3: 818.9
x86_32: lpf_h_sb_y_w8_8bpc_ssse3: 927.9
---------------------
x86_64: lpf_h_sb_y_w16_8bpc_c: 1967.9
x86_32: lpf_h_sb_y_w16_8bpc_c: 3343.5
x86_64: lpf_h_sb_y_w16_8bpc_ssse3: 1836.7
x86_32: lpf_h_sb_y_w16_8bpc_ssse3: 1975.0
---------------------
x86_64: lpf_v_sb_uv_w4_8bpc_c: 369.4
x86_32: lpf_v_sb_uv_w4_8bpc_c: 793.6
x86_64: lpf_v_sb_uv_w4_8bpc_ssse3: 110.9
x86_32: lpf_v_sb_uv_w4_8bpc_ssse3: 133.0
---------------------
x86_64: lpf_v_sb_uv_w6_8bpc_c: 769.6
x86_32: lpf_v_sb_uv_w6_8bpc_c: 1576.7
x86_64: lpf_v_sb_uv_w6_8bpc_ssse3: 222.2
x86_32: lpf_v_sb_uv_w6_8bpc_ssse3: 232.2
---------------------
x86_64: lpf_v_sb_y_w4_8bpc_c: 772.4
x86_32: lpf_v_sb_y_w4_8bpc_c: 2596.5
x86_64: lpf_v_sb_y_w4_8bpc_ssse3: 179.8
x86_32: lpf_v_sb_y_w4_8bpc_ssse3: 234.7
---------------------
x86_64: lpf_v_sb_y_w8_8bpc_c: 1660.2
x86_32: lpf_v_sb_y_w8_8bpc_c: 3979.9
x86_64: lpf_v_sb_y_w8_8bpc_ssse3: 468.3
x86_32: lpf_v_sb_y_w8_8bpc_ssse3: 580.9
---------------------
x86_64: lpf_v_sb_y_w16_8bpc_c: 1889.6
x86_32: lpf_v_sb_y_w16_8bpc_c: 4728.7
x86_64: lpf_v_sb_y_w16_8bpc_ssse3: 1142.0
x86_32: lpf_v_sb_y_w16_8bpc_ssse3: 1174.8
------------------------------------------	2019-09-19 12:07:23 +02:00
Victorien Le Couviour--Tuffet	beda6e0d1c	build: fix meson deprecation warning The 'build_' prefix is reserved by meson; using it will become an error in the future, as indicated by a warning when configuring the build dir. Closes #285.	2019-07-02 14:02:40 +02:00