zhoupeng and Hecai Yuan
7c63bb1b6e
Loongarch: Optimized emu_edge_c function by LSX
...
emu_edge_w4_8bpc_c: 9.0 ( 1.00x)
emu_edge_w4_8bpc_lsx: 6.7 ( 1.34x)
emu_edge_w8_8bpc_c: 12.9 ( 1.00x)
emu_edge_w8_8bpc_lsx: 9.2 ( 1.40x)
emu_edge_w16_8bpc_c: 20.0 ( 1.00x)
emu_edge_w16_8bpc_lsx: 16.3 ( 1.23x)
emu_edge_w32_8bpc_c: 44.6 ( 1.00x)
emu_edge_w32_8bpc_lsx: 33.3 ( 1.34x)
emu_edge_w64_8bpc_c: 79.9 ( 1.00x)
emu_edge_w64_8bpc_lsx: 66.2 ( 1.21x)
emu_edge_w128_8bpc_c: 193.9 ( 1.00x)
emu_edge_w128_8bpc_lsx: 197.8 ( 0.98x)
Change-Id: I180c94d311509740b03793419d5790a931532980
2024-09-30 06:37:00 +00:00
guxiwei and Hecai Yuan
e3101ddc8b
LoongArch64: Implement checked_call()
...
checkasm now calls the test function 'func_new' through
the wrapper 'checked_call' instead of calling it directly.
The wrapper checks that 'func_new' correctly saves and
restores the static (callee-saved) registers: it writes dirty
values to the static registers and, after calling 'func_new',
verifies that those values are unchanged.
Change-Id: Ia9290b55ab0f2dd87801f6fd175813d3f717d851
2024-09-30 06:37:00 +00:00
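The register check described in the checked_call() commit above can be modelled in a few lines. This is only an illustrative sketch, not the actual wrapper (which is written in LoongArch assembly): a Python dict stands in for the register file, the names s0..s8 are assumed from the LoongArch ABI's static-register naming, and `well_behaved` and `buggy` are hypothetical test functions.

```python
# Hypothetical model: a dict stands in for the static (callee-saved) registers.
# The real checked_call is assembly; names s0..s8 are assumed from the ABI.
STATIC_REGS = [f"s{i}" for i in range(9)]


def checked_call(func, regs, *args):
    """Call func and report which static registers it failed to preserve."""
    # Fill each static register with a distinct, recognizable dirty value.
    dirty = {name: 0xDEAD0000 + i for i, name in enumerate(STATIC_REGS)}
    regs.update(dirty)
    result = func(regs, *args)
    # After the call, every static register must still hold its dirty value.
    clobbered = [name for name in STATIC_REGS if regs[name] != dirty[name]]
    return result, clobbered


def well_behaved(regs, x):
    saved = regs["s0"]   # save the register before using it
    regs["s0"] = x * 2   # use s0 as scratch space
    out = regs["s0"] + 1
    regs["s0"] = saved   # restore it before returning
    return out


def buggy(regs, x):
    regs["s3"] = x       # clobbers s3 and never restores it
    return x + 1


print(checked_call(well_behaved, {}, 5))
print(checked_call(buggy, {}, 5))
```

A function that saves and restores what it uses yields an empty list of clobbered registers; one that overwrites a static register without restoring it is reported by name, which is the condition the wrapper is meant to catch.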
pengxu and Hecai Yuan
7f891597bf
Loongarch: Optimized ipred_filter 8bpc functions by LSX
...
intra_pred_filter_w4_8bpc_c: 17.9 ( 1.00x)
intra_pred_filter_w4_8bpc_lsx: 8.9 ( 2.00x)
intra_pred_filter_w8_8bpc_c: 55.3 ( 1.00x)
intra_pred_filter_w8_8bpc_lsx: 23.8 ( 2.33x)
intra_pred_filter_w16_8bpc_c: 109.4 ( 1.00x)
intra_pred_filter_w16_8bpc_lsx: 49.1 ( 2.23x)
intra_pred_filter_w32_8bpc_c: 270.2 ( 1.00x)
intra_pred_filter_w32_8bpc_lsx: 126.1 ( 2.14x)
Change-Id: Ic4c23cb1d54d5f8557c31cdfbbd54f8beaaa32c2
2024-09-30 06:37:00 +00:00
yuanhecai
f398bf968c
loongarch: Add some optimized itx functions for 8bpc
...
1. inv_txfm_add_dct_dct_32x16_8bpc_lsx
2. inv_txfm_add_dct_dct_32x8_8bpc_lsx
3. inv_txfm_add_dct_dct_64x32_8bpc_lsx
4. inv_txfm_add_adst_flipadst_16x16_8bpc_lsx
5. inv_txfm_add_flipadst_adst_16x16_8bpc_lsx
6. inv_txfm_add_adst_adst_16x16_8bpc_lasx
Relative speedup over C code:
inv_txfm_add_32x16_dct_dct_0_8bpc_c: 78.4 ( 1.00x)
inv_txfm_add_32x16_dct_dct_0_8bpc_lsx: 5.7 (13.81x)
inv_txfm_add_32x16_dct_dct_1_8bpc_c: 710.1 ( 1.00x)
inv_txfm_add_32x16_dct_dct_1_8bpc_lsx: 102.9 ( 6.90x)
inv_txfm_add_32x16_dct_dct_2_8bpc_c: 918.0 ( 1.00x)
inv_txfm_add_32x16_dct_dct_2_8bpc_lsx: 103.2 ( 8.90x)
inv_txfm_add_32x16_dct_dct_3_8bpc_c: 914.3 ( 1.00x)
inv_txfm_add_32x16_dct_dct_3_8bpc_lsx: 103.2 ( 8.86x)
inv_txfm_add_32x16_dct_dct_4_8bpc_c: 929.8 ( 1.00x)
inv_txfm_add_32x16_dct_dct_4_8bpc_lsx: 102.9 ( 9.03x)
inv_txfm_add_32x8_dct_dct_0_8bpc_c: 39.6 ( 1.00x)
inv_txfm_add_32x8_dct_dct_0_8bpc_lsx: 3.0 (13.10x)
inv_txfm_add_32x8_dct_dct_1_8bpc_c: 431.6 ( 1.00x)
inv_txfm_add_32x8_dct_dct_1_8bpc_lsx: 42.6 (10.13x)
inv_txfm_add_32x8_dct_dct_2_8bpc_c: 431.5 ( 1.00x)
inv_txfm_add_32x8_dct_dct_2_8bpc_lsx: 42.6 (10.13x)
inv_txfm_add_32x8_dct_dct_3_8bpc_c: 432.0 ( 1.00x)
inv_txfm_add_32x8_dct_dct_3_8bpc_lsx: 42.6 (10.14x)
inv_txfm_add_32x8_dct_dct_4_8bpc_c: 431.3 ( 1.00x)
inv_txfm_add_32x8_dct_dct_4_8bpc_lsx: 42.6 (10.13x)
inv_txfm_add_64x32_dct_dct_0_8bpc_c: 304.3 ( 1.00x)
inv_txfm_add_64x32_dct_dct_0_8bpc_lsx: 20.3 (15.01x)
inv_txfm_add_64x32_dct_dct_1_8bpc_c: 2743.1 ( 1.00x)
inv_txfm_add_64x32_dct_dct_1_8bpc_lsx: 270.9 (10.13x)
inv_txfm_add_64x32_dct_dct_2_8bpc_c: 3197.1 ( 1.00x)
inv_txfm_add_64x32_dct_dct_2_8bpc_lsx: 327.7 ( 9.76x)
inv_txfm_add_64x32_dct_dct_3_8bpc_c: 3638.3 ( 1.00x)
inv_txfm_add_64x32_dct_dct_3_8bpc_lsx: 383.7 ( 9.48x)
inv_txfm_add_64x32_dct_dct_4_8bpc_c: 4084.5 ( 1.00x)
inv_txfm_add_64x32_dct_dct_4_8bpc_lsx: 441.7 ( 9.25x)
inv_txfm_add_16x16_adst_flipadst_0_8bpc_c: 277.3 ( 1.00x)
inv_txfm_add_16x16_adst_flipadst_0_8bpc_lsx: 58.7 ( 4.72x)
inv_txfm_add_16x16_adst_flipadst_1_8bpc_c: 358.1 ( 1.00x)
inv_txfm_add_16x16_adst_flipadst_1_8bpc_lsx: 58.7 ( 6.10x)
inv_txfm_add_16x16_adst_flipadst_2_8bpc_c: 449.3 ( 1.00x)
inv_txfm_add_16x16_adst_flipadst_2_8bpc_lsx: 58.7 ( 7.65x)
inv_txfm_add_16x16_flipadst_adst_0_8bpc_c: 277.2 ( 1.00x)
inv_txfm_add_16x16_flipadst_adst_0_8bpc_lsx: 58.7 ( 4.72x)
inv_txfm_add_16x16_flipadst_adst_1_8bpc_c: 358.7 ( 1.00x)
inv_txfm_add_16x16_flipadst_adst_1_8bpc_lsx: 58.7 ( 6.11x)
inv_txfm_add_16x16_flipadst_adst_2_8bpc_c: 450.4 ( 1.00x)
inv_txfm_add_16x16_flipadst_adst_2_8bpc_lsx: 58.7 ( 7.67x)
inv_txfm_add_16x16_adst_adst_0_8bpc_c: 253.4 ( 1.00x)
inv_txfm_add_16x16_adst_adst_0_8bpc_lasx: 23.1 (10.98x)
inv_txfm_add_16x16_adst_adst_1_8bpc_c: 325.2 ( 1.00x)
inv_txfm_add_16x16_adst_adst_1_8bpc_lasx: 23.1 (14.08x)
inv_txfm_add_16x16_adst_adst_2_8bpc_c: 405.9 ( 1.00x)
inv_txfm_add_16x16_adst_adst_2_8bpc_lasx: 23.1 (17.56x)
Change-Id: Iaa5419a830c3308e2c4c9ac5b3699c3a971301ed
2024-09-30 06:37:00 +00:00
yuanhecai
13a857d056
loongarch: add LSX implementation of the itx_8bpc.add_16x8 series of functions
...
Relative speedup over C code:
inv_txfm_add_16x8_adst_adst_0_8bpc_c: 127.7 ( 1.00x)
inv_txfm_add_16x8_adst_adst_0_8bpc_lsx: 29.6 ( 4.32x)
inv_txfm_add_16x8_adst_adst_1_8bpc_c: 206.6 ( 1.00x)
inv_txfm_add_16x8_adst_adst_1_8bpc_lsx: 29.6 ( 6.98x)
inv_txfm_add_16x8_adst_adst_2_8bpc_c: 206.6 ( 1.00x)
inv_txfm_add_16x8_adst_adst_2_8bpc_lsx: 29.6 ( 6.99x)
inv_txfm_add_16x8_adst_dct_0_8bpc_c: 126.7 ( 1.00x)
inv_txfm_add_16x8_adst_dct_0_8bpc_lsx: 25.8 ( 4.91x)
inv_txfm_add_16x8_adst_dct_1_8bpc_c: 205.1 ( 1.00x)
inv_txfm_add_16x8_adst_dct_1_8bpc_lsx: 25.8 ( 7.94x)
inv_txfm_add_16x8_adst_dct_2_8bpc_c: 205.2 ( 1.00x)
inv_txfm_add_16x8_adst_dct_2_8bpc_lsx: 25.8 ( 7.94x)
inv_txfm_add_16x8_adst_flipadst_0_8bpc_c: 128.3 ( 1.00x)
inv_txfm_add_16x8_adst_flipadst_0_8bpc_lsx: 29.8 ( 4.30x)
inv_txfm_add_16x8_adst_flipadst_1_8bpc_c: 207.2 ( 1.00x)
inv_txfm_add_16x8_adst_flipadst_1_8bpc_lsx: 29.9 ( 6.94x)
inv_txfm_add_16x8_adst_flipadst_2_8bpc_c: 207.1 ( 1.00x)
inv_txfm_add_16x8_adst_flipadst_2_8bpc_lsx: 29.8 ( 6.94x)
inv_txfm_add_16x8_adst_identity_0_8bpc_c: 78.3 ( 1.00x)
inv_txfm_add_16x8_adst_identity_0_8bpc_lsx: 18.6 ( 4.21x)
inv_txfm_add_16x8_adst_identity_1_8bpc_c: 157.1 ( 1.00x)
inv_txfm_add_16x8_adst_identity_1_8bpc_lsx: 18.6 ( 8.45x)
inv_txfm_add_16x8_adst_identity_2_8bpc_c: 157.2 ( 1.00x)
inv_txfm_add_16x8_adst_identity_2_8bpc_lsx: 18.6 ( 8.46x)
inv_txfm_add_16x8_dct_adst_0_8bpc_c: 127.4 ( 1.00x)
inv_txfm_add_16x8_dct_adst_0_8bpc_lsx: 25.4 ( 5.02x)
inv_txfm_add_16x8_dct_adst_1_8bpc_c: 201.2 ( 1.00x)
inv_txfm_add_16x8_dct_adst_1_8bpc_lsx: 25.4 ( 7.93x)
inv_txfm_add_16x8_dct_adst_2_8bpc_c: 201.2 ( 1.00x)
inv_txfm_add_16x8_dct_adst_2_8bpc_lsx: 25.4 ( 7.93x)
inv_txfm_add_16x8_dct_dct_0_8bpc_c: 21.8 ( 1.00x)
inv_txfm_add_16x8_dct_dct_0_8bpc_lsx: 2.1 (10.52x)
inv_txfm_add_16x8_dct_dct_1_8bpc_c: 200.2 ( 1.00x)
inv_txfm_add_16x8_dct_dct_1_8bpc_lsx: 21.6 ( 9.28x)
inv_txfm_add_16x8_dct_dct_2_8bpc_c: 200.2 ( 1.00x)
inv_txfm_add_16x8_dct_dct_2_8bpc_lsx: 21.6 ( 9.28x)
inv_txfm_add_16x8_dct_flipadst_0_8bpc_c: 127.2 ( 1.00x)
inv_txfm_add_16x8_dct_flipadst_0_8bpc_lsx: 25.6 ( 4.96x)
inv_txfm_add_16x8_dct_flipadst_1_8bpc_c: 201.2 ( 1.00x)
inv_txfm_add_16x8_dct_flipadst_1_8bpc_lsx: 25.7 ( 7.84x)
inv_txfm_add_16x8_dct_flipadst_2_8bpc_c: 201.7 ( 1.00x)
inv_txfm_add_16x8_dct_flipadst_2_8bpc_lsx: 25.7 ( 7.86x)
inv_txfm_add_16x8_dct_identity_0_8bpc_c: 77.3 ( 1.00x)
inv_txfm_add_16x8_dct_identity_0_8bpc_lsx: 14.5 ( 5.35x)
inv_txfm_add_16x8_dct_identity_1_8bpc_c: 151.2 ( 1.00x)
inv_txfm_add_16x8_dct_identity_1_8bpc_lsx: 14.5 (10.46x)
inv_txfm_add_16x8_dct_identity_2_8bpc_c: 151.5 ( 1.00x)
inv_txfm_add_16x8_dct_identity_2_8bpc_lsx: 14.5 (10.48x)
inv_txfm_add_16x8_flipadst_adst_0_8bpc_c: 128.5 ( 1.00x)
inv_txfm_add_16x8_flipadst_adst_0_8bpc_lsx: 29.7 ( 4.32x)
inv_txfm_add_16x8_flipadst_adst_1_8bpc_c: 207.3 ( 1.00x)
inv_txfm_add_16x8_flipadst_adst_1_8bpc_lsx: 29.7 ( 6.97x)
inv_txfm_add_16x8_flipadst_adst_2_8bpc_c: 207.4 ( 1.00x)
inv_txfm_add_16x8_flipadst_adst_2_8bpc_lsx: 29.7 ( 6.98x)
inv_txfm_add_16x8_flipadst_dct_0_8bpc_c: 126.8 ( 1.00x)
inv_txfm_add_16x8_flipadst_dct_0_8bpc_lsx: 25.9 ( 4.90x)
inv_txfm_add_16x8_flipadst_dct_1_8bpc_c: 204.8 ( 1.00x)
inv_txfm_add_16x8_flipadst_dct_1_8bpc_lsx: 25.9 ( 7.92x)
inv_txfm_add_16x8_flipadst_dct_2_8bpc_c: 205.4 ( 1.00x)
inv_txfm_add_16x8_flipadst_dct_2_8bpc_lsx: 25.9 ( 7.94x)
inv_txfm_add_16x8_flipadst_flipadst_0_8bpc_c: 128.6 ( 1.00x)
inv_txfm_add_16x8_flipadst_flipadst_0_8bpc_lsx: 30.0 ( 4.29x)
inv_txfm_add_16x8_flipadst_flipadst_1_8bpc_c: 206.6 ( 1.00x)
inv_txfm_add_16x8_flipadst_flipadst_1_8bpc_lsx: 29.9 ( 6.90x)
inv_txfm_add_16x8_flipadst_flipadst_2_8bpc_c: 206.5 ( 1.00x)
inv_txfm_add_16x8_flipadst_flipadst_2_8bpc_lsx: 29.9 ( 6.90x)
inv_txfm_add_16x8_flipadst_identity_0_8bpc_c: 77.8 ( 1.00x)
inv_txfm_add_16x8_flipadst_identity_0_8bpc_lsx: 18.6 ( 4.18x)
inv_txfm_add_16x8_flipadst_identity_1_8bpc_c: 156.3 ( 1.00x)
inv_txfm_add_16x8_flipadst_identity_1_8bpc_lsx: 18.6 ( 8.40x)
inv_txfm_add_16x8_flipadst_identity_2_8bpc_c: 156.6 ( 1.00x)
inv_txfm_add_16x8_flipadst_identity_2_8bpc_lsx: 18.6 ( 8.42x)
inv_txfm_add_16x8_identity_adst_0_8bpc_c: 120.7 ( 1.00x)
inv_txfm_add_16x8_identity_adst_0_8bpc_lsx: 21.1 ( 5.71x)
inv_txfm_add_16x8_identity_adst_1_8bpc_c: 120.8 ( 1.00x)
inv_txfm_add_16x8_identity_adst_1_8bpc_lsx: 21.1 ( 5.71x)
inv_txfm_add_16x8_identity_adst_2_8bpc_c: 145.5 ( 1.00x)
inv_txfm_add_16x8_identity_adst_2_8bpc_lsx: 21.2 ( 6.88x)
inv_txfm_add_16x8_identity_dct_0_8bpc_c: 119.1 ( 1.00x)
inv_txfm_add_16x8_identity_dct_0_8bpc_lsx: 17.9 ( 6.67x)
inv_txfm_add_16x8_identity_dct_1_8bpc_c: 119.1 ( 1.00x)
inv_txfm_add_16x8_identity_dct_1_8bpc_lsx: 17.9 ( 6.67x)
inv_txfm_add_16x8_identity_dct_2_8bpc_c: 143.8 ( 1.00x)
inv_txfm_add_16x8_identity_dct_2_8bpc_lsx: 17.9 ( 8.06x)
inv_txfm_add_16x8_identity_flipadst_0_8bpc_c: 120.7 ( 1.00x)
inv_txfm_add_16x8_identity_flipadst_0_8bpc_lsx: 21.3 ( 5.66x)
inv_txfm_add_16x8_identity_flipadst_1_8bpc_c: 120.4 ( 1.00x)
inv_txfm_add_16x8_identity_flipadst_1_8bpc_lsx: 21.3 ( 5.65x)
inv_txfm_add_16x8_identity_flipadst_2_8bpc_c: 144.9 ( 1.00x)
inv_txfm_add_16x8_identity_flipadst_2_8bpc_lsx: 21.3 ( 6.80x)
inv_txfm_add_16x8_identity_identity_0_8bpc_c: 70.2 ( 1.00x)
inv_txfm_add_16x8_identity_identity_0_8bpc_lsx: 9.5 ( 7.38x)
inv_txfm_add_16x8_identity_identity_1_8bpc_c: 95.6 ( 1.00x)
inv_txfm_add_16x8_identity_identity_1_8bpc_lsx: 9.5 (10.06x)
inv_txfm_add_16x8_identity_identity_2_8bpc_c: 95.6 ( 1.00x)
inv_txfm_add_16x8_identity_identity_2_8bpc_lsx: 9.5 (10.06x)
Change-Id: If1e274cab0e8441297a1eb44bd86be580f4c8f62
2024-09-30 06:37:00 +00:00
yuanhecai
843f00e531
loongarch: opt inv_txfm_add_adst_dct/dct_dct/identity_identity_16x4_8bpc_lsx
...
Relative speedup over C code:
inv_txfm_add_16x4_adst_dct_0_8bpc_c: 61.7 ( 1.00x)
inv_txfm_add_16x4_adst_dct_0_8bpc_lsx: 17.8 ( 3.46x)
inv_txfm_add_16x4_adst_dct_1_8bpc_c: 96.2 ( 1.00x)
inv_txfm_add_16x4_adst_dct_1_8bpc_lsx: 17.8 ( 5.39x)
inv_txfm_add_16x4_adst_dct_2_8bpc_c: 96.2 ( 1.00x)
inv_txfm_add_16x4_adst_dct_2_8bpc_lsx: 17.8 ( 5.39x)
inv_txfm_add_16x4_dct_dct_0_8bpc_c: 10.8 ( 1.00x)
inv_txfm_add_16x4_dct_dct_0_8bpc_lsx: 0.9 (12.23x)
inv_txfm_add_16x4_dct_dct_1_8bpc_c: 94.5 ( 1.00x)
inv_txfm_add_16x4_dct_dct_1_8bpc_lsx: 13.6 ( 6.94x)
inv_txfm_add_16x4_dct_dct_2_8bpc_c: 94.7 ( 1.00x)
inv_txfm_add_16x4_dct_dct_2_8bpc_lsx: 13.6 ( 6.95x)
inv_txfm_add_16x4_identity_identity_0_8bpc_c: 42.1 ( 1.00x)
inv_txfm_add_16x4_identity_identity_0_8bpc_lsx: 5.1 ( 8.21x)
inv_txfm_add_16x4_identity_identity_1_8bpc_c: 53.0 ( 1.00x)
inv_txfm_add_16x4_identity_identity_1_8bpc_lsx: 5.1 (10.35x)
inv_txfm_add_16x4_identity_identity_2_8bpc_c: 53.0 ( 1.00x)
inv_txfm_add_16x4_identity_identity_2_8bpc_lsx: 5.1 (10.35x)
Change-Id: I0be4f77e381da390e300070337fff404dcdcb862
2024-09-30 06:37:00 +00:00
pengxu and Hecai Yuan
083cf424ff
Loongarch: Optimized cfl_pred_cfl, cfl_pred_cfl_128, cfl_pred_cfl_top and cfl_pred_cfl_left 8bpc functions by LSX
...
cfl_pred_cfl_128_w4_8bpc_c: 19.4 ( 1.00x)
cfl_pred_cfl_128_w4_8bpc_lsx: 4.2 ( 4.63x)
cfl_pred_cfl_128_w8_8bpc_c: 66.3 ( 1.00x)
cfl_pred_cfl_128_w8_8bpc_lsx: 7.3 ( 9.11x)
cfl_pred_cfl_128_w16_8bpc_c: 150.1 ( 1.00x)
cfl_pred_cfl_128_w16_8bpc_lsx: 14.4 (10.45x)
cfl_pred_cfl_128_w32_8bpc_c: 403.6 ( 1.00x)
cfl_pred_cfl_128_w32_8bpc_lsx: 34.7 (11.65x)
cfl_pred_cfl_left_w4_8bpc_c: 20.5 ( 1.00x)
cfl_pred_cfl_left_w4_8bpc_lsx: 4.4 ( 4.63x)
cfl_pred_cfl_left_w8_8bpc_c: 67.9 ( 1.00x)
cfl_pred_cfl_left_w8_8bpc_lsx: 7.6 ( 8.94x)
cfl_pred_cfl_left_w16_8bpc_c: 152.0 ( 1.00x)
cfl_pred_cfl_left_w16_8bpc_lsx: 14.6 (10.38x)
cfl_pred_cfl_left_w32_8bpc_c: 405.8 ( 1.00x)
cfl_pred_cfl_left_w32_8bpc_lsx: 35.0 (11.58x)
cfl_pred_cfl_top_w4_8bpc_c: 20.0 ( 1.00x)
cfl_pred_cfl_top_w4_8bpc_lsx: 4.4 ( 4.51x)
cfl_pred_cfl_top_w8_8bpc_c: 67.6 ( 1.00x)
cfl_pred_cfl_top_w8_8bpc_lsx: 7.5 ( 8.99x)
cfl_pred_cfl_top_w16_8bpc_c: 152.5 ( 1.00x)
cfl_pred_cfl_top_w16_8bpc_lsx: 14.6 (10.41x)
cfl_pred_cfl_top_w32_8bpc_c: 408.0 ( 1.00x)
cfl_pred_cfl_top_w32_8bpc_lsx: 35.2 (11.58x)
cfl_pred_cfl_w4_8bpc_c: 21.1 ( 1.00x)
cfl_pred_cfl_w4_8bpc_lsx: 4.8 ( 4.43x)
cfl_pred_cfl_w8_8bpc_c: 68.6 ( 1.00x)
cfl_pred_cfl_w8_8bpc_lsx: 7.9 ( 8.73x)
cfl_pred_cfl_w16_8bpc_c: 154.4 ( 1.00x)
cfl_pred_cfl_w16_8bpc_lsx: 15.0 (10.29x)
cfl_pred_cfl_w32_8bpc_c: 410.3 ( 1.00x)
cfl_pred_cfl_w32_8bpc_lsx: 35.6 (11.54x)
Change-Id: I4ec7cc71483298d28379bfbd824e97a0d74d0c23
2024-09-30 06:37:00 +00:00
pengxu and Hecai Yuan
3f6c845d81
Loongarch: Optimized pal_pred 8bpc functions by LSX
...
pal_pred_w4_8bpc_c: 3.0 ( 1.00x)
pal_pred_w4_8bpc_lsx: 0.6 ( 5.46x)
pal_pred_w8_8bpc_c: 8.8 ( 1.00x)
pal_pred_w8_8bpc_lsx: 0.9 ( 9.49x)
pal_pred_w16_8bpc_c: 26.0 ( 1.00x)
pal_pred_w16_8bpc_lsx: 1.9 (13.70x)
pal_pred_w32_8bpc_c: 60.6 ( 1.00x)
pal_pred_w32_8bpc_lsx: 4.0 (15.10x)
pal_pred_w64_8bpc_c: 146.9 ( 1.00x)
pal_pred_w64_8bpc_lsx: 9.2 (15.97x)
Change-Id: I5414f096a23b09c3a512e727b93fa22104d141f9
2024-09-30 06:37:00 +00:00
jinbo and Hecai Yuan
b26f315d00
loongarch: Add prep_8tap_8bpc_lsx
...
mct_8tap_regular_w4_0_8bpc_c: 3.7 ( 1.00x)
mct_8tap_regular_w4_0_8bpc_lsx: 0.9 ( 4.21x)
mct_8tap_regular_w4_h_8bpc_c: 15.7 ( 1.00x)
mct_8tap_regular_w4_h_8bpc_lsx: 1.7 ( 9.24x)
mct_8tap_regular_w4_hv_8bpc_c: 44.1 ( 1.00x)
mct_8tap_regular_w4_hv_8bpc_lsx: 6.3 ( 6.96x)
mct_8tap_regular_w4_v_8bpc_c: 19.8 ( 1.00x)
mct_8tap_regular_w4_v_8bpc_lsx: 2.4 ( 8.21x)
mct_8tap_regular_w8_0_8bpc_c: 10.5 ( 1.00x)
mct_8tap_regular_w8_0_8bpc_lsx: 1.3 ( 8.27x)
mct_8tap_regular_w8_h_8bpc_c: 47.2 ( 1.00x)
mct_8tap_regular_w8_h_8bpc_lsx: 6.2 ( 7.61x)
mct_8tap_regular_w8_hv_8bpc_c: 119.5 ( 1.00x)
mct_8tap_regular_w8_hv_8bpc_lsx: 18.9 ( 6.32x)
mct_8tap_regular_w8_v_8bpc_c: 60.5 ( 1.00x)
mct_8tap_regular_w8_v_8bpc_lsx: 5.4 (11.12x)
mct_8tap_regular_w16_0_8bpc_c: 28.8 ( 1.00x)
mct_8tap_regular_w16_0_8bpc_lsx: 2.8 (10.32x)
mct_8tap_regular_w16_h_8bpc_c: 151.9 ( 1.00x)
mct_8tap_regular_w16_h_8bpc_lsx: 19.8 ( 7.67x)
mct_8tap_regular_w16_hv_8bpc_c: 357.5 ( 1.00x)
mct_8tap_regular_w16_hv_8bpc_lsx: 57.6 ( 6.21x)
mct_8tap_regular_w16_v_8bpc_c: 195.6 ( 1.00x)
mct_8tap_regular_w16_v_8bpc_lsx: 16.9 (11.61x)
mct_8tap_regular_w32_0_8bpc_c: 104.6 ( 1.00x)
mct_8tap_regular_w32_0_8bpc_lsx: 11.6 ( 9.03x)
mct_8tap_regular_w32_h_8bpc_c: 596.3 ( 1.00x)
mct_8tap_regular_w32_h_8bpc_lsx: 77.8 ( 7.67x)
mct_8tap_regular_w32_hv_8bpc_c: 1329.0 ( 1.00x)
mct_8tap_regular_w32_hv_8bpc_lsx: 217.9 ( 6.10x)
mct_8tap_regular_w32_v_8bpc_c: 771.0 ( 1.00x)
mct_8tap_regular_w32_v_8bpc_lsx: 65.7 (11.73x)
mct_8tap_regular_w64_0_8bpc_c: 242.0 ( 1.00x)
mct_8tap_regular_w64_0_8bpc_lsx: 27.0 ( 8.95x)
mct_8tap_regular_w64_h_8bpc_c: 1455.9 ( 1.00x)
mct_8tap_regular_w64_h_8bpc_lsx: 186.9 ( 7.79x)
mct_8tap_regular_w64_hv_8bpc_c: 3221.7 ( 1.00x)
mct_8tap_regular_w64_hv_8bpc_lsx: 521.8 ( 6.17x)
mct_8tap_regular_w64_v_8bpc_c: 1836.1 ( 1.00x)
mct_8tap_regular_w64_v_8bpc_lsx: 158.2 (11.61x)
mct_8tap_regular_w128_0_8bpc_c: 629.0 ( 1.00x)
mct_8tap_regular_w128_0_8bpc_lsx: 66.3 ( 9.49x)
mct_8tap_regular_w128_h_8bpc_c: 3617.5 ( 1.00x)
mct_8tap_regular_w128_h_8bpc_lsx: 463.6 ( 7.80x)
mct_8tap_regular_w128_hv_8bpc_c: 7881.7 ( 1.00x)
mct_8tap_regular_w128_hv_8bpc_lsx: 1290.3 ( 6.11x)
mct_8tap_regular_w128_v_8bpc_c: 4552.9 ( 1.00x)
mct_8tap_regular_w128_v_8bpc_lsx: 391.1 (11.64x)
Change-Id: I8c6046e4bd6c1fb19d5712234abece0355fb77fa
2024-09-30 06:37:00 +00:00
zhoupeng and Hecai Yuan
ce45ebdef4
Loongarch: Optimized blend_h_c function by LSX/LASX
...
blend_h_w2_8bpc_c: 3.8 ( 1.00x)
blend_h_w2_8bpc_lsx: 1.9 ( 1.98x)
blend_h_w2_8bpc_lasx: 1.9 ( 1.98x)
blend_h_w4_8bpc_c: 6.4 ( 1.00x)
blend_h_w4_8bpc_lsx: 1.8 ( 3.49x)
blend_h_w4_8bpc_lasx: 1.8 ( 3.49x)
blend_h_w8_8bpc_c: 11.6 ( 1.00x)
blend_h_w8_8bpc_lsx: 1.8 ( 6.45x)
blend_h_w8_8bpc_lasx: 1.8 ( 6.48x)
blend_h_w16_8bpc_c: 21.5 ( 1.00x)
blend_h_w16_8bpc_lsx: 2.1 (10.47x)
blend_h_w16_8bpc_lasx: 2.1 (10.48x)
blend_h_w32_8bpc_c: 41.9 ( 1.00x)
blend_h_w32_8bpc_lsx: 3.8 (11.08x)
blend_h_w32_8bpc_lasx: 3.9 (10.67x)
blend_h_w64_8bpc_c: 82.0 ( 1.00x)
blend_h_w64_8bpc_lsx: 6.9 (11.89x)
blend_h_w64_8bpc_lasx: 4.6 (17.93x)
blend_h_w128_8bpc_c: 202.3 ( 1.00x)
blend_h_w128_8bpc_lsx: 16.4 (12.30x)
blend_h_w128_8bpc_lasx: 11.4 (17.77x)
Change-Id: I6d6599ccbaba8a62a629c4a52254b2369dba60f6
2024-09-30 06:37:00 +00:00
zhoupeng and Hecai Yuan
5319278dbe
Loongarch: Optimized blend_c/blend_v_c functions by LSX
...
blend_v_w2_8bpc_c: 5.7 ( 1.00x)
blend_v_w2_8bpc_lsx: 3.6 ( 1.60x)
blend_v_w4_8bpc_c: 22.8 ( 1.00x)
blend_v_w4_8bpc_lsx: 7.1 ( 3.20x)
blend_v_w8_8bpc_c: 40.2 ( 1.00x)
blend_v_w8_8bpc_lsx: 7.1 ( 5.63x)
blend_v_w16_8bpc_c: 74.6 ( 1.00x)
blend_v_w16_8bpc_lsx: 8.1 ( 9.26x)
blend_v_w32_8bpc_c: 144.0 ( 1.00x)
blend_v_w32_8bpc_lsx: 13.3 (10.83x)
blend_w4_8bpc_c: 4.9 ( 1.00x)
blend_w4_8bpc_lsx: 1.9 ( 2.49x)
blend_w8_8bpc_c: 14.1 ( 1.00x)
blend_w8_8bpc_lsx: 3.2 ( 4.37x)
blend_w16_8bpc_c: 51.5 ( 1.00x)
blend_w16_8bpc_lsx: 7.9 ( 6.51x)
blend_w32_8bpc_c: 127.5 ( 1.00x)
blend_w32_8bpc_lsx: 19.6 ( 6.52x)
Change-Id: I95e2dbc1f0735688f5473687f1a7e8d37ffbe417
2024-09-30 06:37:00 +00:00
pengxu and Hecai Yuan
0b9c756f42
Loongarch: Optimized ipred_smooth, ipred_smooth_h and ipred_smooth_v 8bpc functions by LSX
...
intra_pred_smooth_h_w4_8bpc_c: 7.3 ( 1.00x)
intra_pred_smooth_h_w4_8bpc_lsx: 3.1 ( 2.36x)
intra_pred_smooth_h_w8_8bpc_c: 21.3 ( 1.00x)
intra_pred_smooth_h_w8_8bpc_lsx: 4.5 ( 4.71x)
intra_pred_smooth_h_w16_8bpc_c: 66.3 ( 1.00x)
intra_pred_smooth_h_w16_8bpc_lsx: 13.4 ( 4.96x)
intra_pred_smooth_h_w32_8bpc_c: 160.0 ( 1.00x)
intra_pred_smooth_h_w32_8bpc_lsx: 29.3 ( 5.46x)
intra_pred_smooth_h_w64_8bpc_c: 400.2 ( 1.00x)
intra_pred_smooth_h_w64_8bpc_lsx: 68.3 ( 5.86x)
intra_pred_smooth_v_w4_8bpc_c: 6.6 ( 1.00x)
intra_pred_smooth_v_w4_8bpc_lsx: 3.1 ( 2.10x)
intra_pred_smooth_v_w8_8bpc_c: 19.3 ( 1.00x)
intra_pred_smooth_v_w8_8bpc_lsx: 4.9 ( 3.95x)
intra_pred_smooth_v_w16_8bpc_c: 58.6 ( 1.00x)
intra_pred_smooth_v_w16_8bpc_lsx: 24.0 ( 2.44x)
intra_pred_smooth_v_w32_8bpc_c: 139.4 ( 1.00x)
intra_pred_smooth_v_w32_8bpc_lsx: 27.0 ( 5.17x)
intra_pred_smooth_v_w64_8bpc_c: 344.8 ( 1.00x)
intra_pred_smooth_v_w64_8bpc_lsx: 70.8 ( 4.87x)
intra_pred_smooth_w4_8bpc_c: 10.2 ( 1.00x)
intra_pred_smooth_w4_8bpc_lsx: 7.9 ( 1.30x)
intra_pred_smooth_w8_8bpc_c: 30.3 ( 1.00x)
intra_pred_smooth_w8_8bpc_lsx: 20.0 ( 1.51x)
intra_pred_smooth_w16_8bpc_c: 96.3 ( 1.00x)
intra_pred_smooth_w16_8bpc_lsx: 58.3 ( 1.65x)
intra_pred_smooth_w32_8bpc_c: 231.1 ( 1.00x)
intra_pred_smooth_w32_8bpc_lsx: 134.3 ( 1.72x)
intra_pred_smooth_w64_8bpc_c: 571.5 ( 1.00x)
intra_pred_smooth_w64_8bpc_lsx: 326.5 ( 1.75x)
Change-Id: I22b6c2dcf27c5393bba374b4fbe8879c0463f828
2024-09-30 06:37:00 +00:00
pengxu and Hecai Yuan
7463c2af64
Loongarch: Optimized ipred_paeth 8bpc function by LSX
...
intra_pred_paeth_w4_8bpc_c: 12.3 ( 1.00x)
intra_pred_paeth_w4_8bpc_lsx: 3.9 ( 3.12x)
intra_pred_paeth_w8_8bpc_c: 39.7 ( 1.00x)
intra_pred_paeth_w8_8bpc_lsx: 6.4 ( 6.20x)
intra_pred_paeth_w16_8bpc_c: 133.6 ( 1.00x)
intra_pred_paeth_w16_8bpc_lsx: 17.0 ( 7.85x)
intra_pred_paeth_w32_8bpc_c: 342.8 ( 1.00x)
intra_pred_paeth_w32_8bpc_lsx: 52.7 ( 6.50x)
intra_pred_paeth_w64_8bpc_c: 903.8 ( 1.00x)
intra_pred_paeth_w64_8bpc_lsx: 107.3 ( 8.42x)
Change-Id: I457bdb24fdd6b5400ec030bffbdd40c79d8165c1
2024-09-30 06:37:00 +00:00
pengxu and Hecai Yuan
3e9d80d831
Loongarch: Optimized ipred_h and ipred_v 8bpc function by LSX
...
intra_pred_h_w4_8bpc_c: 4.3 ( 1.00x)
intra_pred_h_w4_8bpc_lsx: 3.5 ( 1.21x)
intra_pred_h_w8_8bpc_c: 5.7 ( 1.00x)
intra_pred_h_w8_8bpc_lsx: 5.1 ( 1.11x)
intra_pred_h_w16_8bpc_c: 13.2 ( 1.00x)
intra_pred_h_w16_8bpc_lsx: 7.1 ( 1.86x)
intra_pred_h_w32_8bpc_c: 12.4 ( 1.00x)
intra_pred_h_w32_8bpc_lsx: 6.3 ( 1.96x)
intra_pred_h_w64_8bpc_c: 25.9 ( 1.00x)
intra_pred_h_w64_8bpc_lsx: 5.8 ( 4.44x)
intra_pred_v_w4_8bpc_c: 4.6 ( 1.00x)
intra_pred_v_w4_8bpc_lsx: 2.5 ( 1.85x)
intra_pred_v_w8_8bpc_c: 6.9 ( 1.00x)
intra_pred_v_w8_8bpc_lsx: 4.5 ( 1.53x)
intra_pred_v_w16_8bpc_c: 13.3 ( 1.00x)
intra_pred_v_w16_8bpc_lsx: 5.2 ( 2.56x)
intra_pred_v_w32_8bpc_c: 16.1 ( 1.00x)
intra_pred_v_w32_8bpc_lsx: 5.1 ( 3.13x)
intra_pred_v_w64_8bpc_c: 21.7 ( 1.00x)
intra_pred_v_w64_8bpc_lsx: 7.7 ( 2.80x)
Change-Id: I51b3dd13877315b9c1c64590c19f1ad38bfc4bdf
2024-09-30 06:37:00 +00:00
pengxu and Hecai Yuan
2a9cbcc2f3
Loongarch: Optimized ipred_dc, ipred_dc_128, ipred_dc_left and ipred_dc_top 8bpc functions by LSX
...
intra_pred_dc_w4_8bpc_c: 2.1 ( 1.00x)
intra_pred_dc_w4_8bpc_lsx: 1.3 ( 1.54x)
intra_pred_dc_w8_8bpc_c: 3.6 ( 1.00x)
intra_pred_dc_w8_8bpc_lsx: 3.7 ( 0.97x)
intra_pred_dc_w16_8bpc_c: 6.9 ( 1.00x)
intra_pred_dc_w16_8bpc_lsx: 7.8 ( 0.88x)
intra_pred_dc_w32_8bpc_c: 14.1 ( 1.00x)
intra_pred_dc_w32_8bpc_lsx: 7.1 ( 1.97x)
intra_pred_dc_w64_8bpc_c: 25.3 ( 1.00x)
intra_pred_dc_w64_8bpc_lsx: 7.4 ( 3.41x)
intra_pred_dc_128_w4_8bpc_c: 0.6 ( 1.00x)
intra_pred_dc_128_w4_8bpc_lsx: 0.8 ( 0.76x)
intra_pred_dc_128_w8_8bpc_c: 1.4 ( 1.00x)
intra_pred_dc_128_w8_8bpc_lsx: 3.2 ( 0.45x)
intra_pred_dc_128_w16_8bpc_c: 3.4 ( 1.00x)
intra_pred_dc_128_w16_8bpc_lsx: 7.3 ( 0.47x)
intra_pred_dc_128_w32_8bpc_c: 8.8 ( 1.00x)
intra_pred_dc_128_w32_8bpc_lsx: 6.4 ( 1.38x)
intra_pred_dc_128_w64_8bpc_c: 17.0 ( 1.00x)
intra_pred_dc_128_w64_8bpc_lsx: 6.2 ( 2.74x)
intra_pred_dc_left_w4_8bpc_c: 1.1 ( 1.00x)
intra_pred_dc_left_w4_8bpc_lsx: 1.1 ( 1.00x)
intra_pred_dc_left_w8_8bpc_c: 2.1 ( 1.00x)
intra_pred_dc_left_w8_8bpc_lsx: 3.4 ( 0.64x)
intra_pred_dc_left_w16_8bpc_c: 4.6 ( 1.00x)
intra_pred_dc_left_w16_8bpc_lsx: 7.5 ( 0.62x)
intra_pred_dc_left_w32_8bpc_c: 10.3 ( 1.00x)
intra_pred_dc_left_w32_8bpc_lsx: 7.8 ( 1.32x)
intra_pred_dc_left_w64_8bpc_c: 18.7 ( 1.00x)
intra_pred_dc_left_w64_8bpc_lsx: 6.6 ( 2.83x)
intra_pred_dc_top_w4_8bpc_c: 0.9 ( 1.00x)
intra_pred_dc_top_w4_8bpc_lsx: 0.8 ( 1.10x)
intra_pred_dc_top_w8_8bpc_c: 1.9 ( 1.00x)
intra_pred_dc_top_w8_8bpc_lsx: 2.8 ( 0.67x)
intra_pred_dc_top_w16_8bpc_c: 4.2 ( 1.00x)
intra_pred_dc_top_w16_8bpc_lsx: 5.5 ( 0.77x)
intra_pred_dc_top_w32_8bpc_c: 10.4 ( 1.00x)
intra_pred_dc_top_w32_8bpc_lsx: 6.7 ( 1.54x)
intra_pred_dc_top_w64_8bpc_c: 19.9 ( 1.00x)
intra_pred_dc_top_w64_8bpc_lsx: 6.9 ( 2.87x)
Change-Id: Ib5349e2430302da0424a474ce0fedc457439c761
2024-09-30 06:37:00 +00:00
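Several of the ipred_dc results above report a ratio below 1.00x, meaning the LSX version measured slower than C at that width. As a rough sketch, such entries can be picked out of checkasm-style output with a small parser. The regular expression and the helper name `slower_than_c` are assumptions made for this illustration and are not part of checkasm or dav1d.

```python
import re

# Matches lines such as "intra_pred_dc_w8_8bpc_lsx: 3.7 ( 0.97x)".
# The pattern is an assumption based on the lines shown in this log.
LINE_RE = re.compile(r"^(\w+):\s+([\d.]+)\s+\(\s*([\d.]+)x\)$")


def slower_than_c(lines):
    """Return (name, ratio) for every entry whose reported ratio is below 1.00x."""
    out = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and float(m.group(3)) < 1.0:
            out.append((m.group(1), float(m.group(3))))
    return out


# A few lines copied from the ipred_dc results above.
sample = [
    "intra_pred_dc_w4_8bpc_c: 2.1 ( 1.00x)",
    "intra_pred_dc_w4_8bpc_lsx: 1.3 ( 1.54x)",
    "intra_pred_dc_w8_8bpc_c: 3.6 ( 1.00x)",
    "intra_pred_dc_w8_8bpc_lsx: 3.7 ( 0.97x)",
    "intra_pred_dc_128_w8_8bpc_c: 1.4 ( 1.00x)",
    "intra_pred_dc_128_w8_8bpc_lsx: 3.2 ( 0.45x)",
]
print(slower_than_c(sample))
```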
pengxu and Hecai Yuan
62c47f3558
Loongarch: Optimized cdef_filter_block 4x4,4x8,8x8 8bpc function by LSX
...
cdef_filter_4x4_01_8bpc_c: 420.8 ( 1.00x)
cdef_filter_4x4_01_8bpc_lsx: 117.2 ( 3.59x)
cdef_filter_4x4_10_8bpc_c: 265.8 ( 1.00x)
cdef_filter_4x4_10_8bpc_lsx: 98.9 ( 2.69x)
cdef_filter_4x4_11_8bpc_c: 1036.2 ( 1.00x)
cdef_filter_4x4_11_8bpc_lsx: 169.6 ( 6.11x)
cdef_filter_4x8_01_8bpc_c: 802.6 ( 1.00x)
cdef_filter_4x8_01_8bpc_lsx: 206.1 ( 3.89x)
cdef_filter_4x8_10_8bpc_c: 489.1 ( 1.00x)
cdef_filter_4x8_10_8bpc_lsx: 167.4 ( 2.92x)
cdef_filter_4x8_11_8bpc_c: 2028.9 ( 1.00x)
cdef_filter_4x8_11_8bpc_lsx: 309.4 ( 6.56x)
cdef_filter_8x8_01_8bpc_c: 1562.2 ( 1.00x)
cdef_filter_8x8_01_8bpc_lsx: 295.3 ( 5.29x)
cdef_filter_8x8_10_8bpc_c: 949.4 ( 1.00x)
cdef_filter_8x8_10_8bpc_lsx: 207.6 ( 4.57x)
cdef_filter_8x8_11_8bpc_c: 4009.6 ( 1.00x)
cdef_filter_8x8_11_8bpc_lsx: 466.8 ( 8.59x)
Change-Id: I8cd43426a27055e18c44a7701fa50f8835c712be
2024-09-30 06:37:00 +00:00
jinbo and Hecai Yuan
fa7b72d082
Refine mc_put_8tap
...
The speedup over LSX is around 68% to 156%.
Change-Id: I0b39cd0e05e3cbd84fded121d29a91ea2a620f03
2024-09-30 06:37:00 +00:00
guxiwei and Hecai Yuan
02309b9f60
msac: Add msac_decode_bool_equi_lsx and msac_decode_hi_tok_lsx
...
The performance data is as follows:
msac_decode_bool_equi_c: 0.4 ( 1.00x)
msac_decode_bool_equi_lsx: 0.3 ( 1.07x)
msac_decode_hi_tok_c: 1.8 ( 1.00x)
msac_decode_hi_tok_lsx: 1.4 ( 1.27x)
Change-Id: Ic2f2678cf699bb22c579424af71ae2603e228482
2024-09-30 06:37:00 +00:00
pengxu and Hecai Yuan
2154425f70
Loongarch: Optimized cdef_find_dir_8bpc function by LSX
...
cdef_dir_8bpc_c: 28.8 ( 1.00x)
cdef_dir_8bpc_lsx: 19.1 ( 1.51x)
Change-Id: Ic7c1f32c5b1733b011f4c448cffc93f745b564f5
2024-09-30 06:37:00 +00:00
yuanhecai
f6ffdc90b3
loongarch: opt inv_txfm_add_identity_identity_8x32_8bpc_lsx
...
Relative speedup over C code:
inv_txfm_add_8x32_identity_identity_0_8bpc_c: 126.1 ( 1.00x)
inv_txfm_add_8x32_identity_identity_0_8bpc_lsx: 1.6 (78.59x)
inv_txfm_add_8x32_identity_identity_1_8bpc_c: 136.9 ( 1.00x)
inv_txfm_add_8x32_identity_identity_1_8bpc_lsx: 1.6 (85.31x)
inv_txfm_add_8x32_identity_identity_2_8bpc_c: 148.0 ( 1.00x)
inv_txfm_add_8x32_identity_identity_2_8bpc_lsx: 3.3 (45.47x)
inv_txfm_add_8x32_identity_identity_3_8bpc_c: 159.4 ( 1.00x)
inv_txfm_add_8x32_identity_identity_3_8bpc_lsx: 4.9 (32.78x)
inv_txfm_add_8x32_identity_identity_4_8bpc_c: 170.2 ( 1.00x)
inv_txfm_add_8x32_identity_identity_4_8bpc_lsx: 6.5 (26.17x)
Change-Id: Iabda6efcd8a17d26a205f90757dfea85af48848f
2024-09-30 06:37:00 +00:00
yuanhecai
5de878a4e1
loongarch: Minor improvement on identity4*, identity8* and dct32*
...
1. remove the code about identity8 in the 4x8/8x8/8x16 series
2. modify the code of the function dct_dct_8x32/32x32/64x64
3. modify the code about identity4 in the 4x4/4x8/8x4 series
After these modifications, the performance of the affected functions improved by 20%.
Change-Id: I1bc2e0fb25e508faf9fc220333460a99be3f5e49
2024-09-30 06:37:00 +00:00
yuanhecai
2fc656604b
loongarch: add LSX implementation of the itx_8bpc.add_8x16 series of functions
...
Relative speedup over C code:
inv_txfm_add_8x16_adst_adst_0_8bpc_c: 208.1
inv_txfm_add_8x16_adst_adst_0_8bpc_lsx: 31.3
inv_txfm_add_8x16_adst_adst_1_8bpc_c: 208.4
inv_txfm_add_8x16_adst_adst_1_8bpc_lsx: 31.3
inv_txfm_add_8x16_adst_adst_2_8bpc_c: 208.1
inv_txfm_add_8x16_adst_adst_2_8bpc_lsx: 31.3
inv_txfm_add_8x16_adst_dct_0_8bpc_c: 204.0
inv_txfm_add_8x16_adst_dct_0_8bpc_lsx: 27.2
inv_txfm_add_8x16_adst_dct_1_8bpc_c: 204.0
inv_txfm_add_8x16_adst_dct_1_8bpc_lsx: 27.2
inv_txfm_add_8x16_adst_dct_2_8bpc_c: 204.0
inv_txfm_add_8x16_adst_dct_2_8bpc_lsx: 27.2
inv_txfm_add_8x16_adst_flipadst_0_8bpc_c: 207.9
inv_txfm_add_8x16_adst_flipadst_0_8bpc_lsx: 31.3
inv_txfm_add_8x16_adst_flipadst_1_8bpc_c: 208.3
inv_txfm_add_8x16_adst_flipadst_1_8bpc_lsx: 31.3
inv_txfm_add_8x16_adst_flipadst_2_8bpc_c: 208.6
inv_txfm_add_8x16_adst_flipadst_2_8bpc_lsx: 31.3
inv_txfm_add_8x16_adst_identity_0_8bpc_c: 146.6
inv_txfm_add_8x16_adst_identity_0_8bpc_lsx: 21.8
inv_txfm_add_8x16_adst_identity_1_8bpc_c: 146.6
inv_txfm_add_8x16_adst_identity_1_8bpc_lsx: 21.8
inv_txfm_add_8x16_adst_identity_2_8bpc_c: 146.6
inv_txfm_add_8x16_adst_identity_2_8bpc_lsx: 21.8
inv_txfm_add_8x16_dct_adst_0_8bpc_c: 204.8
inv_txfm_add_8x16_dct_adst_0_8bpc_lsx: 26.2
inv_txfm_add_8x16_dct_adst_1_8bpc_c: 204.8
inv_txfm_add_8x16_dct_adst_1_8bpc_lsx: 26.1
inv_txfm_add_8x16_dct_adst_2_8bpc_c: 204.8
inv_txfm_add_8x16_dct_adst_2_8bpc_lsx: 26.2
inv_txfm_add_8x16_dct_dct_0_8bpc_c: 23.1
inv_txfm_add_8x16_dct_dct_0_8bpc_lsx: 2.3
inv_txfm_add_8x16_dct_dct_1_8bpc_c: 200.8
inv_txfm_add_8x16_dct_dct_1_8bpc_lsx: 21.9
inv_txfm_add_8x16_dct_dct_2_8bpc_c: 200.7
inv_txfm_add_8x16_dct_dct_2_8bpc_lsx: 21.9
inv_txfm_add_8x16_dct_flipadst_0_8bpc_c: 204.6
inv_txfm_add_8x16_dct_flipadst_0_8bpc_lsx: 26.3
inv_txfm_add_8x16_dct_flipadst_1_8bpc_c: 204.6
inv_txfm_add_8x16_dct_flipadst_1_8bpc_lsx: 26.3
inv_txfm_add_8x16_dct_flipadst_2_8bpc_c: 204.6
inv_txfm_add_8x16_dct_flipadst_2_8bpc_lsx: 26.3
inv_txfm_add_8x16_dct_identity_0_8bpc_c: 143.2
inv_txfm_add_8x16_dct_identity_0_8bpc_lsx: 16.7
inv_txfm_add_8x16_dct_identity_1_8bpc_c: 142.9
inv_txfm_add_8x16_dct_identity_1_8bpc_lsx: 16.7
inv_txfm_add_8x16_dct_identity_2_8bpc_c: 143.5
inv_txfm_add_8x16_dct_identity_2_8bpc_lsx: 16.7
inv_txfm_add_8x16_flipadst_adst_0_8bpc_c: 206.5
inv_txfm_add_8x16_flipadst_adst_0_8bpc_lsx: 31.3
inv_txfm_add_8x16_flipadst_adst_1_8bpc_c: 206.5
inv_txfm_add_8x16_flipadst_adst_1_8bpc_lsx: 31.3
inv_txfm_add_8x16_flipadst_adst_2_8bpc_c: 206.5
inv_txfm_add_8x16_flipadst_adst_2_8bpc_lsx: 31.3
inv_txfm_add_8x16_flipadst_dct_0_8bpc_c: 202.5
inv_txfm_add_8x16_flipadst_dct_0_8bpc_lsx: 26.8
inv_txfm_add_8x16_flipadst_dct_1_8bpc_c: 202.3
inv_txfm_add_8x16_flipadst_dct_1_8bpc_lsx: 26.8
inv_txfm_add_8x16_flipadst_dct_2_8bpc_c: 202.3
inv_txfm_add_8x16_flipadst_dct_2_8bpc_lsx: 26.8
inv_txfm_add_8x16_flipadst_flipadst_0_8bpc_c: 206.3
inv_txfm_add_8x16_flipadst_flipadst_0_8bpc_lsx: 31.3
inv_txfm_add_8x16_flipadst_flipadst_1_8bpc_c: 206.3
inv_txfm_add_8x16_flipadst_flipadst_1_8bpc_lsx: 31.3
inv_txfm_add_8x16_flipadst_flipadst_2_8bpc_c: 206.3
inv_txfm_add_8x16_flipadst_flipadst_2_8bpc_lsx: 31.3
inv_txfm_add_8x16_identity_adst_0_8bpc_c: 160.7
inv_txfm_add_8x16_identity_adst_0_8bpc_lsx: 21.8
inv_txfm_add_8x16_identity_adst_1_8bpc_c: 160.4
inv_txfm_add_8x16_identity_adst_1_8bpc_lsx: 21.8
inv_txfm_add_8x16_identity_adst_2_8bpc_c: 160.1
inv_txfm_add_8x16_identity_adst_2_8bpc_lsx: 21.8
inv_txfm_add_8x16_identity_dct_0_8bpc_c: 157.9
inv_txfm_add_8x16_identity_dct_0_8bpc_lsx: 17.7
inv_txfm_add_8x16_identity_dct_1_8bpc_c: 156.5
inv_txfm_add_8x16_identity_dct_1_8bpc_lsx: 17.7
inv_txfm_add_8x16_identity_dct_2_8bpc_c: 156.8
inv_txfm_add_8x16_identity_dct_2_8bpc_lsx: 17.7
inv_txfm_add_8x16_identity_flipadst_0_8bpc_c: 159.9
inv_txfm_add_8x16_identity_flipadst_0_8bpc_lsx: 21.8
inv_txfm_add_8x16_identity_flipadst_1_8bpc_c: 159.9
inv_txfm_add_8x16_identity_flipadst_1_8bpc_lsx: 21.8
inv_txfm_add_8x16_identity_flipadst_2_8bpc_c: 160.0
inv_txfm_add_8x16_identity_flipadst_2_8bpc_lsx: 21.8
inv_txfm_add_8x16_identity_identity_0_8bpc_c: 98.3
inv_txfm_add_8x16_identity_identity_0_8bpc_lsx: 12.3
inv_txfm_add_8x16_identity_identity_1_8bpc_c: 98.0
inv_txfm_add_8x16_identity_identity_1_8bpc_lsx: 12.3
inv_txfm_add_8x16_identity_identity_2_8bpc_c: 98.1
inv_txfm_add_8x16_identity_identity_2_8bpc_lsx: 12.3
Change-Id: Ida8d71e4eff782b9f81e0ad426eaa078b68528cf
2024-09-30 06:37:00 +00:00
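The 8x16 list above gives only the raw timings, without the "( N.NNx)" ratio column that the other entries in this log carry. Assuming the ratio is simply the C time divided by the LSX time, as the other entries suggest, it can be approximated as sketched below. The helper name `speedup` is hypothetical, and because the inputs are already rounded to one decimal, the results may differ slightly from what checkasm itself would have printed.

```python
def speedup(c_time, simd_time):
    """Relative speedup of the SIMD version over C, rounded to two decimals."""
    return round(c_time / simd_time, 2)


# Timings (C, LSX) copied from the 8x16 list above.
timings = {
    "inv_txfm_add_8x16_adst_adst_0_8bpc": (208.1, 31.3),
    "inv_txfm_add_8x16_dct_dct_0_8bpc": (23.1, 2.3),
    "inv_txfm_add_8x16_identity_identity_0_8bpc": (98.3, 12.3),
}

for name, (c, lsx) in timings.items():
    print(f"{name}: {speedup(c, lsx):.2f}x")
```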
yuanhecai
643ae52baa
loongarch: add LSX implementation of the itx_8bpc.add_4x16 series of functions
...
Relative speedup over C code:
inv_txfm_add_4x16_adst_adst_0_8bpc_c: 91.1
inv_txfm_add_4x16_adst_adst_0_8bpc_lsx: 18.2
inv_txfm_add_4x16_adst_adst_1_8bpc_c: 91.1
inv_txfm_add_4x16_adst_adst_1_8bpc_lsx: 18.2
inv_txfm_add_4x16_adst_adst_2_8bpc_c: 91.1
inv_txfm_add_4x16_adst_adst_2_8bpc_lsx: 18.2
inv_txfm_add_4x16_adst_dct_0_8bpc_c: 89.5
inv_txfm_add_4x16_adst_dct_0_8bpc_lsx: 14.3
inv_txfm_add_4x16_adst_dct_1_8bpc_c: 89.5
inv_txfm_add_4x16_adst_dct_1_8bpc_lsx: 14.3
inv_txfm_add_4x16_adst_dct_2_8bpc_c: 89.5
inv_txfm_add_4x16_adst_dct_2_8bpc_lsx: 14.3
inv_txfm_add_4x16_adst_flipadst_0_8bpc_c: 91.8
inv_txfm_add_4x16_adst_flipadst_0_8bpc_lsx: 18.2
inv_txfm_add_4x16_adst_flipadst_1_8bpc_c: 91.7
inv_txfm_add_4x16_adst_flipadst_1_8bpc_lsx: 18.2
inv_txfm_add_4x16_adst_flipadst_2_8bpc_c: 91.8
inv_txfm_add_4x16_adst_flipadst_2_8bpc_lsx: 18.2
inv_txfm_add_4x16_adst_identity_0_8bpc_c: 60.5
inv_txfm_add_4x16_adst_identity_0_8bpc_lsx: 6.3
inv_txfm_add_4x16_adst_identity_1_8bpc_c: 60.5
inv_txfm_add_4x16_adst_identity_1_8bpc_lsx: 6.3
inv_txfm_add_4x16_adst_identity_2_8bpc_c: 60.5
inv_txfm_add_4x16_adst_identity_2_8bpc_lsx: 6.3
inv_txfm_add_4x16_dct_adst_0_8bpc_c: 92.7
inv_txfm_add_4x16_dct_adst_0_8bpc_lsx: 18.4
inv_txfm_add_4x16_dct_adst_1_8bpc_c: 92.7
inv_txfm_add_4x16_dct_adst_1_8bpc_lsx: 18.4
inv_txfm_add_4x16_dct_adst_2_8bpc_c: 92.7
inv_txfm_add_4x16_dct_adst_2_8bpc_lsx: 18.4
inv_txfm_add_4x16_dct_dct_0_8bpc_c: 13.7
inv_txfm_add_4x16_dct_dct_0_8bpc_lsx: 1.9
inv_txfm_add_4x16_dct_dct_1_8bpc_c: 90.6
inv_txfm_add_4x16_dct_dct_1_8bpc_lsx: 14.5
inv_txfm_add_4x16_dct_dct_2_8bpc_c: 90.6
inv_txfm_add_4x16_dct_dct_2_8bpc_lsx: 14.5
inv_txfm_add_4x16_dct_flipadst_0_8bpc_c: 93.3
inv_txfm_add_4x16_dct_flipadst_0_8bpc_lsx: 18.6
inv_txfm_add_4x16_dct_flipadst_1_8bpc_c: 93.4
inv_txfm_add_4x16_dct_flipadst_1_8bpc_lsx: 18.6
inv_txfm_add_4x16_dct_flipadst_2_8bpc_c: 93.4
inv_txfm_add_4x16_dct_flipadst_2_8bpc_lsx: 18.6
inv_txfm_add_4x16_dct_identity_0_8bpc_c: 62.1
inv_txfm_add_4x16_dct_identity_0_8bpc_lsx: 6.5
inv_txfm_add_4x16_dct_identity_1_8bpc_c: 62.1
inv_txfm_add_4x16_dct_identity_1_8bpc_lsx: 6.5
inv_txfm_add_4x16_dct_identity_2_8bpc_c: 62.1
inv_txfm_add_4x16_dct_identity_2_8bpc_lsx: 6.5
inv_txfm_add_4x16_flipadst_adst_0_8bpc_c: 92.2
inv_txfm_add_4x16_flipadst_adst_0_8bpc_lsx: 18.1
inv_txfm_add_4x16_flipadst_adst_1_8bpc_c: 92.3
inv_txfm_add_4x16_flipadst_adst_1_8bpc_lsx: 18.1
inv_txfm_add_4x16_flipadst_adst_2_8bpc_c: 92.2
inv_txfm_add_4x16_flipadst_adst_2_8bpc_lsx: 18.1
inv_txfm_add_4x16_flipadst_dct_0_8bpc_c: 90.6
inv_txfm_add_4x16_flipadst_dct_0_8bpc_lsx: 14.3
inv_txfm_add_4x16_flipadst_dct_1_8bpc_c: 90.6
inv_txfm_add_4x16_flipadst_dct_1_8bpc_lsx: 14.3
inv_txfm_add_4x16_flipadst_dct_2_8bpc_c: 90.6
inv_txfm_add_4x16_flipadst_dct_2_8bpc_lsx: 14.3
inv_txfm_add_4x16_flipadst_flipadst_0_8bpc_c: 92.9
inv_txfm_add_4x16_flipadst_flipadst_0_8bpc_lsx: 18.2
inv_txfm_add_4x16_flipadst_flipadst_1_8bpc_c: 92.9
inv_txfm_add_4x16_flipadst_flipadst_1_8bpc_lsx: 18.2
inv_txfm_add_4x16_flipadst_flipadst_2_8bpc_c: 92.9
inv_txfm_add_4x16_flipadst_flipadst_2_8bpc_lsx: 18.2
inv_txfm_add_4x16_flipadst_identity_0_8bpc_c: 61.8
inv_txfm_add_4x16_flipadst_identity_0_8bpc_lsx: 6.3
inv_txfm_add_4x16_flipadst_identity_1_8bpc_c: 61.8
inv_txfm_add_4x16_flipadst_identity_1_8bpc_lsx: 6.3
inv_txfm_add_4x16_flipadst_identity_2_8bpc_c: 61.8
inv_txfm_add_4x16_flipadst_identity_2_8bpc_lsx: 6.3
inv_txfm_add_4x16_identity_adst_0_8bpc_c: 83.1
inv_txfm_add_4x16_identity_adst_0_8bpc_lsx: 17.8
inv_txfm_add_4x16_identity_adst_1_8bpc_c: 83.0
inv_txfm_add_4x16_identity_adst_1_8bpc_lsx: 17.8
inv_txfm_add_4x16_identity_adst_2_8bpc_c: 83.0
inv_txfm_add_4x16_identity_adst_2_8bpc_lsx: 17.8
inv_txfm_add_4x16_identity_dct_0_8bpc_c: 81.4
inv_txfm_add_4x16_identity_dct_0_8bpc_lsx: 13.9
inv_txfm_add_4x16_identity_dct_1_8bpc_c: 81.4
inv_txfm_add_4x16_identity_dct_1_8bpc_lsx: 13.9
inv_txfm_add_4x16_identity_dct_2_8bpc_c: 81.4
inv_txfm_add_4x16_identity_dct_2_8bpc_lsx: 13.9
inv_txfm_add_4x16_identity_flipadst_0_8bpc_c: 84.1
inv_txfm_add_4x16_identity_flipadst_0_8bpc_lsx: 17.8
inv_txfm_add_4x16_identity_flipadst_1_8bpc_c: 84.0
inv_txfm_add_4x16_identity_flipadst_1_8bpc_lsx: 17.8
inv_txfm_add_4x16_identity_flipadst_2_8bpc_c: 83.9
inv_txfm_add_4x16_identity_flipadst_2_8bpc_lsx: 17.8
inv_txfm_add_4x16_identity_identity_0_8bpc_c: 52.4
inv_txfm_add_4x16_identity_identity_0_8bpc_lsx: 5.5
inv_txfm_add_4x16_identity_identity_1_8bpc_c: 52.4
inv_txfm_add_4x16_identity_identity_1_8bpc_lsx: 5.5
inv_txfm_add_4x16_identity_identity_2_8bpc_c: 52.4
inv_txfm_add_4x16_identity_identity_2_8bpc_lsx: 5.5
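The list above gives paired timings rather than ratios; the speedup for each function is the C figure divided by the LSX figure. A minimal sketch, assuming lower timings are faster and using a few pairs copied from the list (the helper name `speedups` is hypothetical and not part of the project):

```python
import re

# A few timing pairs copied from the list above.
BENCH = """\
inv_txfm_add_4x16_adst_adst_0_8bpc_c: 91.1
inv_txfm_add_4x16_adst_adst_0_8bpc_lsx: 18.2
inv_txfm_add_4x16_dct_dct_0_8bpc_c: 13.7
inv_txfm_add_4x16_dct_dct_0_8bpc_lsx: 1.9
inv_txfm_add_4x16_identity_identity_0_8bpc_c: 52.4
inv_txfm_add_4x16_identity_identity_0_8bpc_lsx: 5.5
"""


def speedups(text, simd="lsx"):
    """Return {function: c_time / simd_time} for functions that have both timings."""
    pattern = re.compile(r"(\w+)_(c|%s):\s*([\d.]+)$" % re.escape(simd))
    times = {}
    for line in text.splitlines():
        m = pattern.match(line.strip())
        if m:
            # Group 1 is the function name without the _c / _lsx suffix.
            times.setdefault(m.group(1), {})[m.group(2)] = float(m.group(3))
    return {
        name: t["c"] / t[simd]
        for name, t in times.items()
        if "c" in t and simd in t
    }


for name, ratio in speedups(BENCH).items():
    print(f"{name}: {ratio:.2f}x")
```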
Change-Id: I36322071eeea45df9289f2b1d533ee937904aec2
2024-09-30 06:37:00 +00:00
yuanhecai
d60d93a55c
loongarch: add LSX implementation of the itx_8bpc.add_4x8 series of functions for 8bpc
...
Benchmark timings, C vs. LSX (lower is faster):
inv_txfm_add_4x8_adst_adst_0_8bpc_c: 43.8
inv_txfm_add_4x8_adst_adst_0_8bpc_lsx: 8.6
inv_txfm_add_4x8_adst_adst_1_8bpc_c: 43.8
inv_txfm_add_4x8_adst_adst_1_8bpc_lsx: 8.6
inv_txfm_add_4x8_adst_dct_0_8bpc_c: 43.0
inv_txfm_add_4x8_adst_dct_0_8bpc_lsx: 6.5
inv_txfm_add_4x8_adst_dct_1_8bpc_c: 43.0
inv_txfm_add_4x8_adst_dct_1_8bpc_lsx: 6.5
inv_txfm_add_4x8_adst_flipadst_0_8bpc_c: 44.1
inv_txfm_add_4x8_adst_flipadst_0_8bpc_lsx: 8.8
inv_txfm_add_4x8_adst_flipadst_1_8bpc_c: 44.1
inv_txfm_add_4x8_adst_flipadst_1_8bpc_lsx: 8.8
inv_txfm_add_4x8_adst_identity_0_8bpc_c: 31.3
inv_txfm_add_4x8_adst_identity_0_8bpc_lsx: 2.9
inv_txfm_add_4x8_adst_identity_1_8bpc_c: 31.3
inv_txfm_add_4x8_adst_identity_1_8bpc_lsx: 2.9
inv_txfm_add_4x8_dct_adst_0_8bpc_c: 46.3
inv_txfm_add_4x8_dct_adst_0_8bpc_lsx: 8.8
inv_txfm_add_4x8_dct_adst_1_8bpc_c: 46.3
inv_txfm_add_4x8_dct_adst_1_8bpc_lsx: 8.8
inv_txfm_add_4x8_dct_dct_0_8bpc_c: 7.3
inv_txfm_add_4x8_dct_dct_0_8bpc_lsx: 1.5
inv_txfm_add_4x8_dct_dct_1_8bpc_c: 45.7
inv_txfm_add_4x8_dct_dct_1_8bpc_lsx: 6.7
inv_txfm_add_4x8_dct_flipadst_0_8bpc_c: 46.7
inv_txfm_add_4x8_dct_flipadst_0_8bpc_lsx: 8.8
inv_txfm_add_4x8_dct_flipadst_1_8bpc_c: 46.7
inv_txfm_add_4x8_dct_flipadst_1_8bpc_lsx: 8.8
inv_txfm_add_4x8_dct_identity_0_8bpc_c: 33.8
inv_txfm_add_4x8_dct_identity_0_8bpc_lsx: 2.9
inv_txfm_add_4x8_dct_identity_1_8bpc_c: 33.8
inv_txfm_add_4x8_dct_identity_1_8bpc_lsx: 2.9
inv_txfm_add_4x8_flipadst_adst_0_8bpc_c: 44.0
inv_txfm_add_4x8_flipadst_adst_0_8bpc_lsx: 8.6
inv_txfm_add_4x8_flipadst_adst_1_8bpc_c: 43.9
inv_txfm_add_4x8_flipadst_adst_1_8bpc_lsx: 8.6
inv_txfm_add_4x8_flipadst_dct_0_8bpc_c: 43.3
inv_txfm_add_4x8_flipadst_dct_0_8bpc_lsx: 6.5
inv_txfm_add_4x8_flipadst_dct_1_8bpc_c: 43.4
inv_txfm_add_4x8_flipadst_dct_1_8bpc_lsx: 6.5
inv_txfm_add_4x8_flipadst_flipadst_0_8bpc_c: 44.4
inv_txfm_add_4x8_flipadst_flipadst_0_8bpc_lsx: 8.8
inv_txfm_add_4x8_flipadst_flipadst_1_8bpc_c: 44.4
inv_txfm_add_4x8_flipadst_flipadst_1_8bpc_lsx: 8.8
inv_txfm_add_4x8_flipadst_identity_0_8bpc_c: 31.5
inv_txfm_add_4x8_flipadst_identity_0_8bpc_lsx: 2.9
inv_txfm_add_4x8_flipadst_identity_1_8bpc_c: 31.5
inv_txfm_add_4x8_flipadst_identity_1_8bpc_lsx: 2.9
inv_txfm_add_4x8_identity_adst_0_8bpc_c: 38.9
inv_txfm_add_4x8_identity_adst_0_8bpc_lsx: 8.2
inv_txfm_add_4x8_identity_adst_1_8bpc_c: 38.9
inv_txfm_add_4x8_identity_adst_1_8bpc_lsx: 8.2
inv_txfm_add_4x8_identity_dct_0_8bpc_c: 38.1
inv_txfm_add_4x8_identity_dct_0_8bpc_lsx: 6.1
inv_txfm_add_4x8_identity_dct_1_8bpc_c: 38.1
inv_txfm_add_4x8_identity_dct_1_8bpc_lsx: 6.1
inv_txfm_add_4x8_identity_flipadst_0_8bpc_c: 39.2
inv_txfm_add_4x8_identity_flipadst_0_8bpc_lsx: 8.3
inv_txfm_add_4x8_identity_flipadst_1_8bpc_c: 39.2
inv_txfm_add_4x8_identity_flipadst_1_8bpc_lsx: 8.3
inv_txfm_add_4x8_identity_identity_0_8bpc_c: 26.4
inv_txfm_add_4x8_identity_identity_0_8bpc_lsx: 2.4
inv_txfm_add_4x8_identity_identity_1_8bpc_c: 26.4
inv_txfm_add_4x8_identity_identity_1_8bpc_lsx: 2.4
Change-Id: Ibbaeca98118774a261cf55afd581196d93ac2004
2024-09-30 06:37:00 +00:00
yuanhecai
74e0eeb5ec
loongarch: Optimize one function of the itx_8bpc.add_16x32 series
...
1. inv_txfm_add_dct_dct_16x32
Benchmark timings, C vs. LSX (lower is faster):
inv_txfm_add_16x32_dct_dct_0_8bpc_c: 63.4
inv_txfm_add_16x32_dct_dct_0_8bpc_lsx: 3.3
inv_txfm_add_16x32_dct_dct_1_8bpc_c: 687.0
inv_txfm_add_16x32_dct_dct_1_8bpc_lsx: 55.7
inv_txfm_add_16x32_dct_dct_2_8bpc_c: 686.4
inv_txfm_add_16x32_dct_dct_2_8bpc_lsx: 55.6
inv_txfm_add_16x32_dct_dct_3_8bpc_c: 686.4
inv_txfm_add_16x32_dct_dct_3_8bpc_lsx: 55.5
inv_txfm_add_16x32_dct_dct_4_8bpc_c: 686.4
inv_txfm_add_16x32_dct_dct_4_8bpc_lsx: 55.6
Change-Id: I9d22b8b3534b7ba17f6e85e42d08eb3165e2e8cb
2024-09-30 06:37:00 +00:00
MARBEAN and Jean-Baptiste Kempf
f2c3ccd6a6
meson: support the iOS platform
2024-09-21 07:10:06 +00:00
Cameron Cawley and Martin Storsjö
a7a40a3fde
Define __ARM_ARCH with older compilers
...
This is needed for GCC 4.7 and earlier, as well as Visual Studio 2022 version 17.9 and earlier.
2024-09-18 18:29:36 +00:00
Cameron Cawley and Martin Storsjö
8e993f4d0b
Support older ARM versions with checkasm
2024-09-18 18:29:36 +00:00
Luca Barbato
8d9b1e26b3
ppc: Factor out dc_only itx
2024-09-17 12:34:37 +00:00
Luca Barbato
75d3ad14f2
ppc: itx 16x4 pwr9
2024-09-17 12:34:37 +00:00
Luca Barbato
0bf331a1bb
ppc: itx 4x16 pwr9
...
Initial i32x4 version; can be used as a base for high bitdepth.
2024-09-17 12:34:37 +00:00
Luca Barbato
19e122ee38
ppc: Remove high bitdepth macros from the 8bit-only code
2024-09-17 12:34:37 +00:00
Luca Barbato
b1d847beb5
ppc: itx 8x8 pwr9
2024-09-17 12:34:37 +00:00
Luca Barbato
da51b12322
ppc: itx 4x8 and 8x4 pwr9
2024-09-17 12:34:37 +00:00
Luca Barbato
33b9d5141f
ppc: itx 4x4 pwr9
2024-09-17 12:34:37 +00:00
Jean-Baptiste Kempf
212359662d
NEWS: get ready for 1.5.0
2024-09-17 12:11:45 +00:00
Jean-Baptiste Kempf
bd875480a9
Update NEWS for 1.4.3
2024-09-17 12:11:45 +00:00
Michael Bradshaw and Ronald S. Bultje
dd32cd5027
Use #if HAVE_* instead of #ifdef HAVE_*
2024-09-12 20:40:08 +00:00
Arpad Panyik and Martin Storsjö
82e9155c75
AArch64: Trim Armv8.0 Neon path of 6-tap and 8-tap MC functions
...
There are some instruction sequences we could merge after the lane
load/store patch (ec5c3052cf).
This change will simplify the loading of filter weights to save 288
bytes in the Armv8.0 Neon path of 6-tap and 8-tap MC functions.
2024-09-12 11:31:07 +00:00
Kacper Michajłow
f4a0d7cb70
Remove dav1d/ prefix from dav1d.h
...
This is possible, because we no longer generate version.h at compile
time.
Reverts header change from 7629402bbd to
preserve the same behaviour as before.
2024-09-11 02:43:02 +02:00
Kacper Michajłow
74ccc93687
meson: don't generate version.h
...
Instead of generating version.h, move the so version there and parse it
in meson.
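As a rough illustration of reading a version out of a checked-in header instead of generating that header, the sketch below extracts three macros with a regular expression. The macro names, the values, and the helper `parse_version` are all assumptions made for illustration; the change itself does this parsing in the meson build files, not in Python.

```python
import re

# Hypothetical header contents; macro names and values are illustrative only.
HEADER = """\
#define EXAMPLE_API_VERSION_MAJOR 7
#define EXAMPLE_API_VERSION_MINOR 0
#define EXAMPLE_API_VERSION_PATCH 0
"""


def parse_version(header, prefix="EXAMPLE_API_VERSION_"):
    """Return (major, minor, patch) read from '#define <prefix><FIELD> <n>' lines."""
    parts = []
    for field in ("MAJOR", "MINOR", "PATCH"):
        m = re.search(
            r"^#define\s+%s%s\s+(\d+)\s*$" % (re.escape(prefix), field),
            header,
            re.MULTILINE,
        )
        if m is None:
            raise ValueError("missing macro: " + prefix + field)
        parts.append(int(m.group(1)))
    return tuple(parts)


print(".".join(str(p) for p in parse_version(HEADER)))
```

Keeping the numbers in one header means the build system and the C sources read the same definition, so no file has to be produced at compile time.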
2024-09-10 23:25:16 +02:00
Kyle Siefring and Ronald S. Bultje
4385e7e161
Improve density of group context setting macros
...
Shared object binary size reduction:
x86_64 : 16112 bytes
ARM64 : 16008 bytes
ARM64(+Os) : 21592 bytes
ARMv7(+Os+mthumb): 18480 bytes
Size reduction of symbols:
x86_64 : 15712 bytes
ARM64 : 18688 bytes
ARM64(+Os) : 18404 bytes
ARMv7(+Os+mthumb): 17322 bytes
Compiles were done with clang version 18.1.8 and symbol sizes were
obtained using nm on the shared object.
Provides speedups on older ARM64 CPUs with very little impact on other
CPUs.
Speedup:
c7i (skylake)
Nature1080p : x0.999
Chimera : x0.998
odroid C4
Nature1080p : x1.007
Chimera : x1.016
Models1080p : x1.005
MountainBike1080p: x1.009
Balloons1080p : x1.008
Raspberry Pi 4
Nature1080p : x1.005
Chimera : x0.999
Models1080p : x0.999
MountainBike1080p: x1.004
Balloons1080p : x1.003
Raspberry Pi 2 (Cortex-A7):
(using size optimized build)
Nature1080p : x1.003
Models1080p : x0.997
2024-09-06 22:12:56 +00:00
Martin Storsjö
166e1df543
tests: Add an option to dav1d_argon.bash for using a wrapper tool
...
This allows executing all the tools within e.g. valgrind.
This matches the "meson test --wrap <tool>" feature.
2024-09-06 20:32:45 +00:00
Kyle Siefring and Martin Storsjö
79db162487
AArch64: New method for calculating sgr table
...
For the 3x3 part, double the width of the vertical loop. This is done to
provide more latency in the new sgr calculation.
Initial (master):  Cortex A53        A55        A72        A73        A76   Apple M1
sgr_3x3_8bpc_neon:   387702.8   383154.2   295742.4   302100.1   185420.7      472.2
sgr_5x5_8bpc_neon:   261725.1   256919.8   194205.1   197585.6   128311.3      332.9
sgr_mix_8bpc_neon:   628085.0   593664.2   453551.8   450553.8   281956.0      711.2
Current:
sgr_3x3_8bpc_neon:   368331.4   363949.7   275499.0   272056.3   169614.4      432.7
sgr_5x5_8bpc_neon:   257866.7   255265.5   195962.5   199557.8   120481.3      319.2
sgr_mix_8bpc_neon:   598234.1   572896.4   418500.4   438910.7   258977.7      659.3
Include a minor improvement that gets rid of a dup instruction.
2024-09-06 09:04:24 +00:00
Arpad Panyik and Martin Storsjö
ec5c3052cf
AArch64: Optimize lane load/store in MC functions
...
Partial register writes can create long dependency chains, which can
reduce performance on out-of-order CPUs. This patch removes most of
these kinds of problems in MC functions by filling the full register
before other lane loading instructions.
Most lane extracting stores can also be optimized using FP scalar
stores when the 0th lane would be extracted.
Relative runtime of micro benchmarks after this patch on some Neoverse
and Cortex CPU cores:
8bpc neon V2 V1 X3 X1 A715 A78 A76
avg w8: 0.942x 1.030x 0.936x 0.935x 1.000x 0.877x 0.976x
w_avg w8: 0.908x 0.913x 0.919x 0.914x 0.999x 0.905x 0.910x
mask w8: 0.937x 0.905x 0.929x 0.907x 1.009x 0.921x 0.868x
w_mask 420 w4: 0.969x 0.968x 0.951x 0.962x 0.995x 0.976x 0.958x
w_mask 420 w8: 0.979x 0.935x 0.936x 0.935x 0.996x 0.948x 0.959x
blend w4: 0.721x 0.841x 0.764x 0.822x 0.772x 0.826x 0.883x
blend w8: 0.692x 0.733x 0.686x 0.730x 0.828x 0.723x 0.762x
blend h w2: 0.738x 0.776x 0.746x 0.775x 0.683x 0.827x 0.851x
blend h w4: 0.858x 0.942x 0.880x 0.933x 0.784x 0.924x 0.965x
blend h w8: 0.804x 0.807x 0.806x 0.805x 0.814x 0.810x 0.748x
blend v w2: 0.898x 0.931x 0.903x 0.949x 0.784x 0.867x 0.875x
blend v w4: 0.935x 0.905x 0.933x 0.922x 0.763x 0.777x 0.807x
blend v w8: 0.803x 0.802x 0.804x 0.815x 0.674x 0.677x 0.678x
16bpc neon V2 V1 X3 X1 A715 A78 A76
avg w4: 0.899x 0.967x 0.897x 0.948x 1.002x 0.901x 0.884x
w_avg w4: 0.952x 0.951x 0.936x 0.946x 0.997x 0.937x 0.925x
mask w4: 0.893x 0.958x 0.887x 0.948x 1.003x 0.938x 0.934x
w_mask 420 w4: 0.933x 0.932x 0.932x 0.939x 1.000x 0.910x 0.955x
w_mask 420 w8: 0.966x 0.962x 0.967x 0.961x 1.000x 0.990x 1.010x
blend w4: 0.367x 0.361x 0.370x 0.352x 0.418x 0.394x 0.476x
blend h w2: 0.365x 0.445x 0.369x 0.437x 0.416x 0.576x 0.699x
blend h w4: 0.343x 0.402x 0.342x 0.398x 0.418x 0.525x 0.603x
blend v w2: 0.464x 0.460x 0.460x 0.447x 0.494x 0.446x 0.503x
blend v w4: 0.432x 0.424x 0.437x 0.416x 0.433x 0.427x 0.534x
blend v w8: 0.936x 0.847x 0.949x 0.848x 1.007x 0.811x 0.785x
bilinear 8bpc neon V2 V1 X3 X1 A715 A78 A76
mct w4 0: 0.982x 0.983x 0.955x 1.029x 0.784x 0.817x 0.814x
mc w2 h: 0.277x 0.333x 0.275x 0.325x 0.299x 0.435x 0.518x
mct w4 h: 0.835x 0.862x 0.814x 0.887x 1.074x 0.899x 0.884x
mc w2 v: 0.887x 0.966x 0.894x 0.945x 0.808x 0.953x 0.997x
mc w4 v: 0.762x 0.899x 0.766x 0.867x 0.695x 0.915x 1.017x
mct w4 v: 0.700x 0.812x 0.740x 0.777x 0.777x 0.824x 0.853x
mc w2 hv: 0.928x 0.985x 0.929x 0.978x 0.789x 0.969x 1.010x
mct w4 hv: 0.887x 0.913x 0.912x 0.920x 1.001x 0.922x 0.937x
bilinear 16bpc neon V2 V1 X3 X1 A715 A78 A76
mc w2 0: 0.991x 1.032x 0.993x 0.970x 0.878x 0.925x 0.999x
mct w4 0: 0.811x 0.730x 0.797x 0.680x 0.808x 0.711x 0.805x
mc w4 h: 0.885x 0.901x 0.895x 0.905x 1.003x 0.909x 0.910x
mct w4 h: 0.902x 0.914x 0.898x 0.896x 1.000x 0.897x 0.934x
mc w2 v: 0.888x 0.966x 0.913x 0.955x 0.824x 0.958x 1.005x
mc w4 v: 0.897x 0.894x 0.903x 0.902x 1.001x 0.895x 0.895x
mct w4 v: 0.924x 0.908x 0.921x 0.901x 1.001x 0.904x 0.918x
mc w4 hv: 0.927x 0.925x 0.924x 0.933x 1.000x 0.936x 0.959x
mct w4 hv: 0.923x 0.944x 0.923x 0.944x 0.999x 0.931x 0.956x
8tap 8bpc neon V2 V1 X3 X1 A715 A78 A76
mct regular w4 0: 0.829x 0.854x 0.735x 0.861x 0.769x 0.766x 0.840x
mc regular w2 h: 0.984x 1.008x 0.983x 1.012x 0.986x 0.989x 0.995x
mc sharp w2 h: 0.987x 1.008x 0.986x 1.011x 0.985x 0.989x 0.995x
mc regular w4 h: 0.907x 0.911x 0.916x 0.908x 0.997x 0.936x 0.932x
mc sharp w4 h: 0.916x 0.914x 0.918x 0.913x 0.999x 0.939x 0.905x
mct regular w4 h: 0.992x 0.979x 0.993x 0.971x 1.000x 0.986x 0.976x
mct sharp w4 h: 0.991x 0.979x 0.989x 0.984x 1.001x 0.979x 0.983x
mc regular w2 v: 1.002x 1.001x 1.005x 1.000x 1.000x 0.998x 0.983x
mc sharp w2 v: 1.005x 1.001x 1.009x 0.998x 0.994x 0.997x 0.989x
mc regular w4 v: 0.985x 0.998x 0.991x 0.998x 1.000x 1.000x 0.983x
mc sharp w4 v: 1.005x 1.002x 1.006x 1.002x 0.998x 0.991x 0.999x
mct regular w4 v: 0.966x 0.967x 0.961x 0.974x 0.996x 0.954x 0.982x
mct sharp w4 v: 0.970x 0.944x 0.967x 0.944x 0.997x 0.951x 0.966x
mc regular w2 hv: 0.993x 0.993x 0.994x 0.987x 0.993x 0.985x 0.999x
mc sharp w2 hv: 0.994x 0.996x 0.992x 0.998x 0.997x 0.999x 0.999x
mc regular w4 hv: 0.964x 0.958x 0.964x 0.960x 0.982x 0.938x 0.958x
mc sharp w4 hv: 0.982x 0.981x 0.980x 0.982x 0.995x 0.986x 0.941x
mct regular w4 hv: 0.993x 0.994x 0.992x 0.994x 0.996x 0.992x 0.988x
mct sharp w4 hv: 0.993x 0.996x 0.991x 0.996x 0.954x 0.992x 1.011x
8tap 16bpc neon V2 V1 X3 X1 A715 A78 A76
mc regular w2 0: 0.869x 1.059x 0.874x 0.956x 0.883x 0.932x 1.000x
mct regular w4 0: 0.348x 0.369x 0.354x 0.377x 0.560x 0.409x 0.648x
mc regular w2 h: 0.996x 0.988x 0.992x 0.985x 0.989x 0.991x 1.006x
mc sharp w2 h: 0.996x 0.989x 0.979x 0.991x 0.987x 0.988x 0.997x
mc regular w4 h: 0.957x 0.937x 0.957x 0.948x 0.961x 0.927x 0.994x
mc sharp w4 h: 0.966x 0.940x 0.962x 0.954x 0.985x 0.929x 0.970x
mct regular w4 h: 0.922x 0.942x 0.932x 0.933x 1.007x 0.938x 0.905x
mct sharp w4 h: 0.919x 0.943x 0.919x 0.931x 0.971x 0.943x 0.929x
mc regular w2 v: 1.000x 0.997x 1.001x 1.003x 1.001x 0.999x 0.984x
mc sharp w2 v: 1.000x 0.999x 1.000x 0.999x 1.000x 1.000x 0.993x
mc regular w4 v: 0.936x 0.941x 0.936x 0.939x 0.999x 0.928x 0.981x
mc sharp w4 v: 0.955x 0.961x 0.949x 0.956x 0.999x 0.947x 0.953x
mct regular w4 v: 0.977x 0.966x 0.979x 0.968x 0.990x 0.972x 0.972x
mct sharp w4 v: 0.973x 0.965x 0.981x 0.963x 0.994x 0.977x 0.974x
mc regular w2 hv: 0.995x 1.001x 0.995x 0.995x 0.995x 1.000x 0.981x
mc sharp w2 hv: 0.993x 1.012x 0.993x 0.988x 0.996x 0.992x 1.008x
mc regular w4 hv: 0.938x 0.943x 0.939x 0.943x 0.986x 0.943x 0.997x
mc sharp w4 hv: 0.969x 0.959x 0.970x 0.974x 0.986x 0.993x 0.997x
mct regular w4 hv: 0.942x 0.970x 0.951x 0.960x 0.977x 0.958x 1.018x
mct sharp w4 hv: 0.923x 0.958x 0.934x 0.955x 0.973x 0.946x 0.986x
2024-09-06 11:40:46 +03:00
Arpad Panyik and Martin Storsjö
a992a9bede
AArch64: Optimize Armv8.0 Neon path of SBD H/HV 6-tap filters
...
The 6-tap horizontal and the horizontal parts of 6-tap HV subpel
filters can be further improved by some pointer arithmetic and saving
some instructions (EXTs) in their data rearrangement code.
Relative runtime of micro benchmarks after this patch on Cortex CPU
cores:
SBD mct h X1 A78 A76 A72 A55
regular w8: 0.878x 0.894x 0.990x 0.923x 0.944x
regular w16: 0.962x 0.931x 0.943x 0.949x 0.949x
regular w32: 0.937x 0.937x 0.972x 0.938x 0.947x
regular w64: 0.920x 0.965x 0.992x 0.936x 0.944x
SBD mct hv X1 A78 A76 A72 A55
regular w8: 0.931x 0.970x 0.951x 0.950x 0.971x
regular w16: 0.940x 0.971x 0.941x 0.952x 0.967x
regular w32: 0.943x 0.972x 0.946x 0.961x 0.974x
regular w64: 0.943x 0.973x 0.952x 0.944x 0.975x
2024-09-06 08:08:08 +00:00
Arpad Panyik and Martin Storsjö
2d808de191
AArch64: Optimize Armv8.0 Neon path of HBD HV 6-tap filters
...
The horizontal parts of 6-tap HV subpel filters can be further
improved by some pointer arithmetic and saving some instructions
(EXTs) in their data rearrangement code.
Relative runtime of micro benchmarks after this patch on Cortex CPU
cores:
HBD mct hv X1 A78 A76 A72 A55
regular w8: 0.952x 0.989x 0.924x 0.973x 0.976x
regular w16: 0.961x 0.993x 0.928x 0.952x 0.971x
regular w32: 0.964x 0.996x 0.930x 0.973x 0.972x
regular w64: 0.963x 0.997x 0.930x 0.969x 0.974x
2024-09-06 07:50:38 +00:00
Arpad Panyik and Martin Storsjö
93339ce857
AArch64: Optimize Armv8.0 Neon path of HBD horizontal 6-tap filters
...
The 6-tap horizontal subpel filters can be further improved by some
pointer arithmetic and saving some instructions (EXTs) in their data
rearrangement code.
Relative runtime of micro benchmarks after this patch on some Cortex
CPU cores:
regular: X1 A78 A76 A55
mc w8: 0.915x 0.937x 0.900x 0.982x
mc w16: 0.917x 0.947x 0.911x 0.971x
mc w32: 0.914x 0.938x 0.873x 0.961x
mc w64: 0.918x 0.932x 0.882x 0.964x
2024-09-06 07:38:18 +00:00
Arpad Panyik and Martin Storsjö
109b24277b
AArch64: Optimize Armv8.0 Neon path of HBD horizontal filters
...
The reduction parts of the horizontal HBD MC filters use SRSHL+SQXTUN+
SRSHL instruction sequences. In the horizontal case this can be
rewritten using a single SQSHRUN instruction with an additional
rounding value (34 for 10-bit and 40 for 12-bit).
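The folded rounding constant can be checked with scalar arithmetic. The sketch below models the original sequence as two rounding right shifts with a clamp to zero in between, and the replacement as one plain right shift of the sum plus a single constant. The shift splits (2 then 4 for 10-bit, 4 then 2 for 12-bit) are an assumption, chosen so that the total shift is 6 and the constants come out as the 34 and 40 quoted above; they are not stated in this message. Upper saturation and the vector instructions themselves are not modelled, and all function names are hypothetical.

```python
def rshift_round(x, n):
    """Rounding arithmetic right shift: add half of 2**n, then shift."""
    return (x + (1 << (n - 1))) >> n


def folded_rounding(s1, s2):
    """Single constant equal to both rounding terms, scaled to the input."""
    return (1 << (s1 - 1)) + (1 << (s1 + s2 - 1))


def two_stage(acc, s1, s2):
    """Round-shift by s1, clamp negatives to zero, then round-shift by s2."""
    return rshift_round(max(rshift_round(acc, s1), 0), s2)


def one_stage(acc, s1, s2):
    """Add the folded constant, shift once by s1 + s2, clamp negatives to zero."""
    return max((acc + folded_rounding(s1, s2)) >> (s1 + s2), 0)


# Assumed shift splits: (bitdepth, first shift, second shift).
for bitdepth, s1, s2 in ((10, 2, 4), (12, 4, 2)):
    # Exhaustive comparison over a range of accumulator values.
    assert all(
        two_stage(a, s1, s2) == one_stage(a, s1, s2)
        for a in range(-4096, 1 << 18)
    )
    print(bitdepth, folded_rounding(s1, s2))
```

The two forms agree because flooring shifts compose: rounding after the first shift by adding r2 is the same as adding r2 scaled by 2**s1 before it, so both rounding terms can be applied to the unshifted sum.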
Relative runtime of micro benchmarks after this patch on some Cortex
CPU cores:
regular: X1 A78 A76 A55
mc w2: 0.847x 0.864x 0.822x 0.859x
mc w4: 0.889x 0.994x 0.868x 0.917x
mc w8: 0.857x 0.911x 0.915x 0.978x
mc w16: 0.890x 0.982x 0.868x 0.974x
mc w32: 0.904x 0.991x 0.873x 0.967x
mc w64: 0.919x 1.003x 0.860x 0.970x
2024-09-06 07:38:18 +00:00
Cameron Cawley and Ronald S. Bultje
d268788467
Support using C11 aligned_alloc for dav1d_alloc_aligned
2024-09-05 12:36:00 +00:00