2799 Commits
Author SHA1 Message Date
zhoupeng and Hecai Yuan 7c63bb1b6e Loongarch: Optimized emu_edge_c function by LSX
emu_edge_w4_8bpc_c:        9.0 ( 1.00x)
emu_edge_w4_8bpc_lsx:      6.7 ( 1.34x)
emu_edge_w8_8bpc_c:       12.9 ( 1.00x)
emu_edge_w8_8bpc_lsx:      9.2 ( 1.40x)
emu_edge_w16_8bpc_c:       20.0 ( 1.00x)
emu_edge_w16_8bpc_lsx:     16.3 ( 1.23x)
emu_edge_w32_8bpc_c:       44.6 ( 1.00x)
emu_edge_w32_8bpc_lsx:     33.3 ( 1.34x)
emu_edge_w64_8bpc_c:       79.9 ( 1.00x)
emu_edge_w64_8bpc_lsx:     66.2 ( 1.21x)
emu_edge_w128_8bpc_c:      193.9 ( 1.00x)
emu_edge_w128_8bpc_lsx:    197.8 ( 0.98x)

Change-Id: I180c94d311509740b03793419d5790a931532980
2024-09-30 06:37:00 +00:00
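The ratio in parentheses in these listings is the C time divided by the SIMD time. A minimal sketch of how such lines could be parsed and the ratio recomputed follows; `parse_bench`, `speedups` and `SAMPLE` are illustrative names, not part of checkasm, and the line format is assumed from the listings in this log.

```python
import re

# Matches lines such as "emu_edge_w4_8bpc_lsx:      6.7 ( 1.34x)".
# The trailing "( N.NNx)" part is optional, because some listings
# in this log give timings only.
LINE_RE = re.compile(
    r"^(\w+)_(c|lsx|lasx):\s+([\d.]+)(?:\s+\(\s*([\d.]+)x\))?\s*$"
)

def parse_bench(text):
    """Return {function: {variant: time}} for every benchmark line in text."""
    results = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            name, variant, time = m.group(1), m.group(2), float(m.group(3))
            results.setdefault(name, {})[variant] = time
    return results

def speedups(results):
    """Return {(function, variant): c_time / variant_time} for non-C variants."""
    out = {}
    for name, times in results.items():
        if "c" not in times:
            continue  # no baseline to compare against
        for variant, t in times.items():
            if variant != "c" and t > 0:
                out[(name, variant)] = times["c"] / t
    return out

# Four lines copied from the emu_edge listing above.
SAMPLE = """\
emu_edge_w4_8bpc_c:        9.0 ( 1.00x)
emu_edge_w4_8bpc_lsx:      6.7 ( 1.34x)
emu_edge_w128_8bpc_c:      193.9 ( 1.00x)
emu_edge_w128_8bpc_lsx:    197.8 ( 0.98x)
"""

for (name, variant), ratio in speedups(parse_bench(SAMPLE)).items():
    print(f"{name} [{variant}]: {ratio:.2f}x")
```

Because the printed timings are rounded to one decimal place, a ratio recomputed this way can differ slightly from the one shown in a listing, most visibly for very small timings.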
guxiwei and Hecai Yuan e3101ddc8b LoongArch64: Implement checked_call()
checkasm now calls the function under test, 'func_new', through
the wrapper 'checked_call' instead of calling it directly.
The wrapper verifies that 'func_new' correctly saves and restores
the static (callee-saved) registers. It writes dirty values to the
static registers and, after 'func_new' returns, checks that those
registers still hold the same dirty values.

Change-Id: Ia9290b55ab0f2dd87801f6fd175813d3f717d851
2024-09-30 06:37:00 +00:00
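The real wrapper is written in LoongArch assembly and works on hardware registers. The idea can still be illustrated with a small simulation in which a dictionary stands in for the register file. Everything below (`STATIC_REGS`, `DIRTY`, `ClobberError`, `good_func`, `bad_func`, and this Python `checked_call`) is a hypothetical model for illustration, not the commit's code, and it covers only the general-purpose static registers s0 to s8.

```python
# Simulated register file: a dict stands in for the hardware registers.
STATIC_REGS = [f"s{i}" for i in range(9)]  # s0..s8
DIRTY = 0xDEADBEEF                          # arbitrary marker value

class ClobberError(Exception):
    """Raised when the callee fails to restore a static register."""

def checked_call(func, regs, *args):
    # 1. Write a known dirty value into every static register.
    expected = {}
    for i, name in enumerate(STATIC_REGS):
        regs[name] = expected[name] = DIRTY + i
    # 2. Call the function under test.
    result = func(regs, *args)
    # 3. Verify that every static register still holds its dirty value.
    clobbered = [n for n in STATIC_REGS if regs[n] != expected[n]]
    if clobbered:
        raise ClobberError("clobbered: " + ", ".join(clobbered))
    return result

def good_func(regs, x):
    saved = regs["s0"]   # save the static register before using it
    regs["s0"] = x * 2   # use it as scratch space
    out = regs["s0"] + 1
    regs["s0"] = saved   # restore it before returning
    return out

def bad_func(regs, x):
    regs["s1"] = x       # overwrites s1 and never restores it
    return x

print(checked_call(good_func, {}, 3))
try:
    checked_call(bad_func, {}, 5)
except ClobberError as err:
    print("detected:", err)
```

A function that saves and restores what it touches passes through the wrapper unnoticed, while one that leaves a static register modified is reported by name.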
pengxu and Hecai Yuan 7f891597bf Loongarch: Optimized ipred_filter 8bpc functions by LSX
intra_pred_filter_w4_8bpc_c:          17.9 ( 1.00x)
intra_pred_filter_w4_8bpc_lsx:         8.9 ( 2.00x)
intra_pred_filter_w8_8bpc_c:          55.3 ( 1.00x)
intra_pred_filter_w8_8bpc_lsx:        23.8 ( 2.33x)
intra_pred_filter_w16_8bpc_c:        109.4 ( 1.00x)
intra_pred_filter_w16_8bpc_lsx:       49.1 ( 2.23x)
intra_pred_filter_w32_8bpc_c:        270.2 ( 1.00x)
intra_pred_filter_w32_8bpc_lsx:      126.1 ( 2.14x)

Change-Id: Ic4c23cb1d54d5f8557c31cdfbbd54f8beaaa32c2
2024-09-30 06:37:00 +00:00
yuanhecai f398bf968c loongarch: Add some optimized itx functions for 8bpc
1. inv_txfm_add_dct_dct_32x16_8bpc_lsx
2. inv_txfm_add_dct_dct_32x8_8bpc_lsx
3. inv_txfm_add_dct_dct_64x32_8bpc_lsx
4. inv_txfm_add_adst_flipadst_16x16_8bpc_lsx
5. inv_txfm_add_flipadst_adst_16x16_8bpc_lsx
6. inv_txfm_add_adst_adst_16x16_8bpc_lasx

Relative speedup over C code:

inv_txfm_add_32x16_dct_dct_0_8bpc_c:                 78.4 ( 1.00x)
inv_txfm_add_32x16_dct_dct_0_8bpc_lsx:                5.7 (13.81x)
inv_txfm_add_32x16_dct_dct_1_8bpc_c:                710.1 ( 1.00x)
inv_txfm_add_32x16_dct_dct_1_8bpc_lsx:              102.9 ( 6.90x)
inv_txfm_add_32x16_dct_dct_2_8bpc_c:                918.0 ( 1.00x)
inv_txfm_add_32x16_dct_dct_2_8bpc_lsx:              103.2 ( 8.90x)
inv_txfm_add_32x16_dct_dct_3_8bpc_c:                914.3 ( 1.00x)
inv_txfm_add_32x16_dct_dct_3_8bpc_lsx:              103.2 ( 8.86x)
inv_txfm_add_32x16_dct_dct_4_8bpc_c:                929.8 ( 1.00x)
inv_txfm_add_32x16_dct_dct_4_8bpc_lsx:              102.9 ( 9.03x)

inv_txfm_add_32x8_dct_dct_0_8bpc_c:                  39.6 ( 1.00x)
inv_txfm_add_32x8_dct_dct_0_8bpc_lsx:                 3.0 (13.10x)
inv_txfm_add_32x8_dct_dct_1_8bpc_c:                 431.6 ( 1.00x)
inv_txfm_add_32x8_dct_dct_1_8bpc_lsx:                42.6 (10.13x)
inv_txfm_add_32x8_dct_dct_2_8bpc_c:                 431.5 ( 1.00x)
inv_txfm_add_32x8_dct_dct_2_8bpc_lsx:                42.6 (10.13x)
inv_txfm_add_32x8_dct_dct_3_8bpc_c:                 432.0 ( 1.00x)
inv_txfm_add_32x8_dct_dct_3_8bpc_lsx:                42.6 (10.14x)
inv_txfm_add_32x8_dct_dct_4_8bpc_c:                 431.3 ( 1.00x)
inv_txfm_add_32x8_dct_dct_4_8bpc_lsx:                42.6 (10.13x)

inv_txfm_add_64x32_dct_dct_0_8bpc_c:                304.3 ( 1.00x)
inv_txfm_add_64x32_dct_dct_0_8bpc_lsx:               20.3 (15.01x)
inv_txfm_add_64x32_dct_dct_1_8bpc_c:               2743.1 ( 1.00x)
inv_txfm_add_64x32_dct_dct_1_8bpc_lsx:              270.9 (10.13x)
inv_txfm_add_64x32_dct_dct_2_8bpc_c:               3197.1 ( 1.00x)
inv_txfm_add_64x32_dct_dct_2_8bpc_lsx:              327.7 ( 9.76x)
inv_txfm_add_64x32_dct_dct_3_8bpc_c:               3638.3 ( 1.00x)
inv_txfm_add_64x32_dct_dct_3_8bpc_lsx:              383.7 ( 9.48x)
inv_txfm_add_64x32_dct_dct_4_8bpc_c:               4084.5 ( 1.00x)
inv_txfm_add_64x32_dct_dct_4_8bpc_lsx:              441.7 ( 9.25x)

inv_txfm_add_16x16_adst_flipadst_0_8bpc_c:          277.3 ( 1.00x)
inv_txfm_add_16x16_adst_flipadst_0_8bpc_lsx:         58.7 ( 4.72x)
inv_txfm_add_16x16_adst_flipadst_1_8bpc_c:          358.1 ( 1.00x)
inv_txfm_add_16x16_adst_flipadst_1_8bpc_lsx:         58.7 ( 6.10x)
inv_txfm_add_16x16_adst_flipadst_2_8bpc_c:          449.3 ( 1.00x)
inv_txfm_add_16x16_adst_flipadst_2_8bpc_lsx:         58.7 ( 7.65x)

inv_txfm_add_16x16_flipadst_adst_0_8bpc_c:          277.2 ( 1.00x)
inv_txfm_add_16x16_flipadst_adst_0_8bpc_lsx:         58.7 ( 4.72x)
inv_txfm_add_16x16_flipadst_adst_1_8bpc_c:          358.7 ( 1.00x)
inv_txfm_add_16x16_flipadst_adst_1_8bpc_lsx:         58.7 ( 6.11x)
inv_txfm_add_16x16_flipadst_adst_2_8bpc_c:          450.4 ( 1.00x)
inv_txfm_add_16x16_flipadst_adst_2_8bpc_lsx:         58.7 ( 7.67x)

inv_txfm_add_16x16_adst_adst_0_8bpc_c:              253.4 ( 1.00x)
inv_txfm_add_16x16_adst_adst_0_8bpc_lasx:            23.1 (10.98x)
inv_txfm_add_16x16_adst_adst_1_8bpc_c:              325.2 ( 1.00x)
inv_txfm_add_16x16_adst_adst_1_8bpc_lasx:            23.1 (14.08x)
inv_txfm_add_16x16_adst_adst_2_8bpc_c:              405.9 ( 1.00x)
inv_txfm_add_16x16_adst_adst_2_8bpc_lasx:            23.1 (17.56x)

Change-Id: Iaa5419a830c3308e2c4c9ac5b3699c3a971301ed
2024-09-30 06:37:00 +00:00
yuanhecai 13a857d056 loongarch: add lsx implementation of itx_8bpc.add_16x8 series functions for 8bpc
Relative speedup over C code:

inv_txfm_add_16x8_adst_adst_0_8bpc_c:               127.7 ( 1.00x)
inv_txfm_add_16x8_adst_adst_0_8bpc_lsx:              29.6 ( 4.32x)
inv_txfm_add_16x8_adst_adst_1_8bpc_c:               206.6 ( 1.00x)
inv_txfm_add_16x8_adst_adst_1_8bpc_lsx:              29.6 ( 6.98x)
inv_txfm_add_16x8_adst_adst_2_8bpc_c:               206.6 ( 1.00x)
inv_txfm_add_16x8_adst_adst_2_8bpc_lsx:              29.6 ( 6.99x)
inv_txfm_add_16x8_adst_dct_0_8bpc_c:                126.7 ( 1.00x)
inv_txfm_add_16x8_adst_dct_0_8bpc_lsx:               25.8 ( 4.91x)
inv_txfm_add_16x8_adst_dct_1_8bpc_c:                205.1 ( 1.00x)
inv_txfm_add_16x8_adst_dct_1_8bpc_lsx:               25.8 ( 7.94x)
inv_txfm_add_16x8_adst_dct_2_8bpc_c:                205.2 ( 1.00x)
inv_txfm_add_16x8_adst_dct_2_8bpc_lsx:               25.8 ( 7.94x)
inv_txfm_add_16x8_adst_flipadst_0_8bpc_c:           128.3 ( 1.00x)
inv_txfm_add_16x8_adst_flipadst_0_8bpc_lsx:          29.8 ( 4.30x)
inv_txfm_add_16x8_adst_flipadst_1_8bpc_c:           207.2 ( 1.00x)
inv_txfm_add_16x8_adst_flipadst_1_8bpc_lsx:          29.9 ( 6.94x)
inv_txfm_add_16x8_adst_flipadst_2_8bpc_c:           207.1 ( 1.00x)
inv_txfm_add_16x8_adst_flipadst_2_8bpc_lsx:          29.8 ( 6.94x)
inv_txfm_add_16x8_adst_identity_0_8bpc_c:            78.3 ( 1.00x)
inv_txfm_add_16x8_adst_identity_0_8bpc_lsx:          18.6 ( 4.21x)
inv_txfm_add_16x8_adst_identity_1_8bpc_c:           157.1 ( 1.00x)
inv_txfm_add_16x8_adst_identity_1_8bpc_lsx:          18.6 ( 8.45x)
inv_txfm_add_16x8_adst_identity_2_8bpc_c:           157.2 ( 1.00x)
inv_txfm_add_16x8_adst_identity_2_8bpc_lsx:          18.6 ( 8.46x)
inv_txfm_add_16x8_dct_adst_0_8bpc_c:                127.4 ( 1.00x)
inv_txfm_add_16x8_dct_adst_0_8bpc_lsx:               25.4 ( 5.02x)
inv_txfm_add_16x8_dct_adst_1_8bpc_c:                201.2 ( 1.00x)
inv_txfm_add_16x8_dct_adst_1_8bpc_lsx:               25.4 ( 7.93x)
inv_txfm_add_16x8_dct_adst_2_8bpc_c:                201.2 ( 1.00x)
inv_txfm_add_16x8_dct_adst_2_8bpc_lsx:               25.4 ( 7.93x)
inv_txfm_add_16x8_dct_dct_0_8bpc_c:                  21.8 ( 1.00x)
inv_txfm_add_16x8_dct_dct_0_8bpc_lsx:                 2.1 (10.52x)
inv_txfm_add_16x8_dct_dct_1_8bpc_c:                 200.2 ( 1.00x)
inv_txfm_add_16x8_dct_dct_1_8bpc_lsx:                21.6 ( 9.28x)
inv_txfm_add_16x8_dct_dct_2_8bpc_c:                 200.2 ( 1.00x)
inv_txfm_add_16x8_dct_dct_2_8bpc_lsx:                21.6 ( 9.28x)
inv_txfm_add_16x8_dct_flipadst_0_8bpc_c:            127.2 ( 1.00x)
inv_txfm_add_16x8_dct_flipadst_0_8bpc_lsx:           25.6 ( 4.96x)
inv_txfm_add_16x8_dct_flipadst_1_8bpc_c:            201.2 ( 1.00x)
inv_txfm_add_16x8_dct_flipadst_1_8bpc_lsx:           25.7 ( 7.84x)
inv_txfm_add_16x8_dct_flipadst_2_8bpc_c:            201.7 ( 1.00x)
inv_txfm_add_16x8_dct_flipadst_2_8bpc_lsx:           25.7 ( 7.86x)
inv_txfm_add_16x8_dct_identity_0_8bpc_c:             77.3 ( 1.00x)
inv_txfm_add_16x8_dct_identity_0_8bpc_lsx:           14.5 ( 5.35x)
inv_txfm_add_16x8_dct_identity_1_8bpc_c:            151.2 ( 1.00x)
inv_txfm_add_16x8_dct_identity_1_8bpc_lsx:           14.5 (10.46x)
inv_txfm_add_16x8_dct_identity_2_8bpc_c:            151.5 ( 1.00x)
inv_txfm_add_16x8_dct_identity_2_8bpc_lsx:           14.5 (10.48x)
inv_txfm_add_16x8_flipadst_adst_0_8bpc_c:           128.5 ( 1.00x)
inv_txfm_add_16x8_flipadst_adst_0_8bpc_lsx:          29.7 ( 4.32x)
inv_txfm_add_16x8_flipadst_adst_1_8bpc_c:           207.3 ( 1.00x)
inv_txfm_add_16x8_flipadst_adst_1_8bpc_lsx:          29.7 ( 6.97x)
inv_txfm_add_16x8_flipadst_adst_2_8bpc_c:           207.4 ( 1.00x)
inv_txfm_add_16x8_flipadst_adst_2_8bpc_lsx:          29.7 ( 6.98x)
inv_txfm_add_16x8_flipadst_dct_0_8bpc_c:            126.8 ( 1.00x)
inv_txfm_add_16x8_flipadst_dct_0_8bpc_lsx:           25.9 ( 4.90x)
inv_txfm_add_16x8_flipadst_dct_1_8bpc_c:            204.8 ( 1.00x)
inv_txfm_add_16x8_flipadst_dct_1_8bpc_lsx:           25.9 ( 7.92x)
inv_txfm_add_16x8_flipadst_dct_2_8bpc_c:            205.4 ( 1.00x)
inv_txfm_add_16x8_flipadst_dct_2_8bpc_lsx:           25.9 ( 7.94x)
inv_txfm_add_16x8_flipadst_flipadst_0_8bpc_c:       128.6 ( 1.00x)
inv_txfm_add_16x8_flipadst_flipadst_0_8bpc_lsx:      30.0 ( 4.29x)
inv_txfm_add_16x8_flipadst_flipadst_1_8bpc_c:       206.6 ( 1.00x)
inv_txfm_add_16x8_flipadst_flipadst_1_8bpc_lsx:      29.9 ( 6.90x)
inv_txfm_add_16x8_flipadst_flipadst_2_8bpc_c:       206.5 ( 1.00x)
inv_txfm_add_16x8_flipadst_flipadst_2_8bpc_lsx:      29.9 ( 6.90x)
inv_txfm_add_16x8_flipadst_identity_0_8bpc_c:        77.8 ( 1.00x)
inv_txfm_add_16x8_flipadst_identity_0_8bpc_lsx:      18.6 ( 4.18x)
inv_txfm_add_16x8_flipadst_identity_1_8bpc_c:       156.3 ( 1.00x)
inv_txfm_add_16x8_flipadst_identity_1_8bpc_lsx:      18.6 ( 8.40x)
inv_txfm_add_16x8_flipadst_identity_2_8bpc_c:       156.6 ( 1.00x)
inv_txfm_add_16x8_flipadst_identity_2_8bpc_lsx:      18.6 ( 8.42x)
inv_txfm_add_16x8_identity_adst_0_8bpc_c:           120.7 ( 1.00x)
inv_txfm_add_16x8_identity_adst_0_8bpc_lsx:          21.1 ( 5.71x)
inv_txfm_add_16x8_identity_adst_1_8bpc_c:           120.8 ( 1.00x)
inv_txfm_add_16x8_identity_adst_1_8bpc_lsx:          21.1 ( 5.71x)
inv_txfm_add_16x8_identity_adst_2_8bpc_c:           145.5 ( 1.00x)
inv_txfm_add_16x8_identity_adst_2_8bpc_lsx:          21.2 ( 6.88x)
inv_txfm_add_16x8_identity_dct_0_8bpc_c:            119.1 ( 1.00x)
inv_txfm_add_16x8_identity_dct_0_8bpc_lsx:           17.9 ( 6.67x)
inv_txfm_add_16x8_identity_dct_1_8bpc_c:            119.1 ( 1.00x)
inv_txfm_add_16x8_identity_dct_1_8bpc_lsx:           17.9 ( 6.67x)
inv_txfm_add_16x8_identity_dct_2_8bpc_c:            143.8 ( 1.00x)
inv_txfm_add_16x8_identity_dct_2_8bpc_lsx:           17.9 ( 8.06x)
inv_txfm_add_16x8_identity_flipadst_0_8bpc_c:       120.7 ( 1.00x)
inv_txfm_add_16x8_identity_flipadst_0_8bpc_lsx:      21.3 ( 5.66x)
inv_txfm_add_16x8_identity_flipadst_1_8bpc_c:       120.4 ( 1.00x)
inv_txfm_add_16x8_identity_flipadst_1_8bpc_lsx:      21.3 ( 5.65x)
inv_txfm_add_16x8_identity_flipadst_2_8bpc_c:       144.9 ( 1.00x)
inv_txfm_add_16x8_identity_flipadst_2_8bpc_lsx:      21.3 ( 6.80x)
inv_txfm_add_16x8_identity_identity_0_8bpc_c:        70.2 ( 1.00x)
inv_txfm_add_16x8_identity_identity_0_8bpc_lsx:       9.5 ( 7.38x)
inv_txfm_add_16x8_identity_identity_1_8bpc_c:        95.6 ( 1.00x)
inv_txfm_add_16x8_identity_identity_1_8bpc_lsx:       9.5 (10.06x)
inv_txfm_add_16x8_identity_identity_2_8bpc_c:        95.6 ( 1.00x)
inv_txfm_add_16x8_identity_identity_2_8bpc_lsx:       9.5 (10.06x)

Change-Id: If1e274cab0e8441297a1eb44bd86be580f4c8f62
2024-09-30 06:37:00 +00:00
yuanhecai 843f00e531 loongarch: opt inv_txfm_add_adst_dct/dct_dct/identity_identity_16x4_8bpc_lsx
Relative speedup over C code:

inv_txfm_add_16x4_adst_dct_0_8bpc_c:                 61.7 ( 1.00x)
inv_txfm_add_16x4_adst_dct_0_8bpc_lsx:               17.8 ( 3.46x)
inv_txfm_add_16x4_adst_dct_1_8bpc_c:                 96.2 ( 1.00x)
inv_txfm_add_16x4_adst_dct_1_8bpc_lsx:               17.8 ( 5.39x)
inv_txfm_add_16x4_adst_dct_2_8bpc_c:                 96.2 ( 1.00x)
inv_txfm_add_16x4_adst_dct_2_8bpc_lsx:               17.8 ( 5.39x)
inv_txfm_add_16x4_dct_dct_0_8bpc_c:                  10.8 ( 1.00x)
inv_txfm_add_16x4_dct_dct_0_8bpc_lsx:                 0.9 (12.23x)
inv_txfm_add_16x4_dct_dct_1_8bpc_c:                  94.5 ( 1.00x)
inv_txfm_add_16x4_dct_dct_1_8bpc_lsx:                13.6 ( 6.94x)
inv_txfm_add_16x4_dct_dct_2_8bpc_c:                  94.7 ( 1.00x)
inv_txfm_add_16x4_dct_dct_2_8bpc_lsx:                13.6 ( 6.95x)
inv_txfm_add_16x4_identity_identity_0_8bpc_c:        42.1 ( 1.00x)
inv_txfm_add_16x4_identity_identity_0_8bpc_lsx:       5.1 ( 8.21x)
inv_txfm_add_16x4_identity_identity_1_8bpc_c:        53.0 ( 1.00x)
inv_txfm_add_16x4_identity_identity_1_8bpc_lsx:       5.1 (10.35x)
inv_txfm_add_16x4_identity_identity_2_8bpc_c:        53.0 ( 1.00x)
inv_txfm_add_16x4_identity_identity_2_8bpc_lsx:       5.1 (10.35x)

Change-Id: I0be4f77e381da390e300070337fff404dcdcb862
2024-09-30 06:37:00 +00:00
pengxu and Hecai Yuan 083cf424ff Loongarch: Optimized cfl_pred_cfl, cfl_pred_cfl_128, cfl_pred_cfl_top and cfl_pred_cfl_left 8bpc functions by LSX
cfl_pred_cfl_128_w4_8bpc_c:         19.4 ( 1.00x)
cfl_pred_cfl_128_w4_8bpc_lsx:        4.2 ( 4.63x)
cfl_pred_cfl_128_w8_8bpc_c:         66.3 ( 1.00x)
cfl_pred_cfl_128_w8_8bpc_lsx:        7.3 ( 9.11x)
cfl_pred_cfl_128_w16_8bpc_c:       150.1 ( 1.00x)
cfl_pred_cfl_128_w16_8bpc_lsx:      14.4 (10.45x)
cfl_pred_cfl_128_w32_8bpc_c:       403.6 ( 1.00x)
cfl_pred_cfl_128_w32_8bpc_lsx:      34.7 (11.65x)
cfl_pred_cfl_left_w4_8bpc_c:        20.5 ( 1.00x)
cfl_pred_cfl_left_w4_8bpc_lsx:       4.4 ( 4.63x)
cfl_pred_cfl_left_w8_8bpc_c:        67.9 ( 1.00x)
cfl_pred_cfl_left_w8_8bpc_lsx:       7.6 ( 8.94x)
cfl_pred_cfl_left_w16_8bpc_c:      152.0 ( 1.00x)
cfl_pred_cfl_left_w16_8bpc_lsx:     14.6 (10.38x)
cfl_pred_cfl_left_w32_8bpc_c:      405.8 ( 1.00x)
cfl_pred_cfl_left_w32_8bpc_lsx:     35.0 (11.58x)
cfl_pred_cfl_top_w4_8bpc_c:         20.0 ( 1.00x)
cfl_pred_cfl_top_w4_8bpc_lsx:        4.4 ( 4.51x)
cfl_pred_cfl_top_w8_8bpc_c:         67.6 ( 1.00x)
cfl_pred_cfl_top_w8_8bpc_lsx:        7.5 ( 8.99x)
cfl_pred_cfl_top_w16_8bpc_c:       152.5 ( 1.00x)
cfl_pred_cfl_top_w16_8bpc_lsx:      14.6 (10.41x)
cfl_pred_cfl_top_w32_8bpc_c:       408.0 ( 1.00x)
cfl_pred_cfl_top_w32_8bpc_lsx:      35.2 (11.58x)
cfl_pred_cfl_w4_8bpc_c:             21.1 ( 1.00x)
cfl_pred_cfl_w4_8bpc_lsx:            4.8 ( 4.43x)
cfl_pred_cfl_w8_8bpc_c:             68.6 ( 1.00x)
cfl_pred_cfl_w8_8bpc_lsx:            7.9 ( 8.73x)
cfl_pred_cfl_w16_8bpc_c:           154.4 ( 1.00x)
cfl_pred_cfl_w16_8bpc_lsx:          15.0 (10.29x)
cfl_pred_cfl_w32_8bpc_c:           410.3 ( 1.00x)
cfl_pred_cfl_w32_8bpc_lsx:          35.6 (11.54x)

Change-Id: I4ec7cc71483298d28379bfbd824e97a0d74d0c23
2024-09-30 06:37:00 +00:00
pengxu and Hecai Yuan 3f6c845d81 Loongarch: Optimized pal_pred 8bpc functions by LSX
pal_pred_w4_8bpc_c:         3.0 ( 1.00x)
pal_pred_w4_8bpc_lsx:       0.6 ( 5.46x)
pal_pred_w8_8bpc_c:         8.8 ( 1.00x)
pal_pred_w8_8bpc_lsx:       0.9 ( 9.49x)
pal_pred_w16_8bpc_c:       26.0 ( 1.00x)
pal_pred_w16_8bpc_lsx:      1.9 (13.70x)
pal_pred_w32_8bpc_c:       60.6 ( 1.00x)
pal_pred_w32_8bpc_lsx:      4.0 (15.10x)
pal_pred_w64_8bpc_c:      146.9 ( 1.00x)
pal_pred_w64_8bpc_lsx:      9.2 (15.97x)

Change-Id: I5414f096a23b09c3a512e727b93fa22104d141f9
2024-09-30 06:37:00 +00:00
jinbo and Hecai Yuan b26f315d00 loongarch: Add prep_8tap_8bpc_lsx
mct_8tap_regular_w4_0_8bpc_c:                        3.7 ( 1.00x)
mct_8tap_regular_w4_0_8bpc_lsx:                      0.9 ( 4.21x)
mct_8tap_regular_w4_h_8bpc_c:                       15.7 ( 1.00x)
mct_8tap_regular_w4_h_8bpc_lsx:                      1.7 ( 9.24x)
mct_8tap_regular_w4_hv_8bpc_c:                      44.1 ( 1.00x)
mct_8tap_regular_w4_hv_8bpc_lsx:                     6.3 ( 6.96x)
mct_8tap_regular_w4_v_8bpc_c:                       19.8 ( 1.00x)
mct_8tap_regular_w4_v_8bpc_lsx:                      2.4 ( 8.21x)
mct_8tap_regular_w8_0_8bpc_c:                       10.5 ( 1.00x)
mct_8tap_regular_w8_0_8bpc_lsx:                      1.3 ( 8.27x)
mct_8tap_regular_w8_h_8bpc_c:                       47.2 ( 1.00x)
mct_8tap_regular_w8_h_8bpc_lsx:                      6.2 ( 7.61x)
mct_8tap_regular_w8_hv_8bpc_c:                     119.5 ( 1.00x)
mct_8tap_regular_w8_hv_8bpc_lsx:                    18.9 ( 6.32x)
mct_8tap_regular_w8_v_8bpc_c:                       60.5 ( 1.00x)
mct_8tap_regular_w8_v_8bpc_lsx:                      5.4 (11.12x)
mct_8tap_regular_w16_0_8bpc_c:                      28.8 ( 1.00x)
mct_8tap_regular_w16_0_8bpc_lsx:                     2.8 (10.32x)
mct_8tap_regular_w16_h_8bpc_c:                     151.9 ( 1.00x)
mct_8tap_regular_w16_h_8bpc_lsx:                    19.8 ( 7.67x)
mct_8tap_regular_w16_hv_8bpc_c:                    357.5 ( 1.00x)
mct_8tap_regular_w16_hv_8bpc_lsx:                   57.6 ( 6.21x)
mct_8tap_regular_w16_v_8bpc_c:                     195.6 ( 1.00x)
mct_8tap_regular_w16_v_8bpc_lsx:                    16.9 (11.61x)
mct_8tap_regular_w32_0_8bpc_c:                     104.6 ( 1.00x)
mct_8tap_regular_w32_0_8bpc_lsx:                    11.6 ( 9.03x)
mct_8tap_regular_w32_h_8bpc_c:                     596.3 ( 1.00x)
mct_8tap_regular_w32_h_8bpc_lsx:                    77.8 ( 7.67x)
mct_8tap_regular_w32_hv_8bpc_c:                   1329.0 ( 1.00x)
mct_8tap_regular_w32_hv_8bpc_lsx:                  217.9 ( 6.10x)
mct_8tap_regular_w32_v_8bpc_c:                     771.0 ( 1.00x)
mct_8tap_regular_w32_v_8bpc_lsx:                    65.7 (11.73x)
mct_8tap_regular_w64_0_8bpc_c:                     242.0 ( 1.00x)
mct_8tap_regular_w64_0_8bpc_lsx:                    27.0 ( 8.95x)
mct_8tap_regular_w64_h_8bpc_c:                    1455.9 ( 1.00x)
mct_8tap_regular_w64_h_8bpc_lsx:                   186.9 ( 7.79x)
mct_8tap_regular_w64_hv_8bpc_c:                   3221.7 ( 1.00x)
mct_8tap_regular_w64_hv_8bpc_lsx:                  521.8 ( 6.17x)
mct_8tap_regular_w64_v_8bpc_c:                    1836.1 ( 1.00x)
mct_8tap_regular_w64_v_8bpc_lsx:                   158.2 (11.61x)
mct_8tap_regular_w128_0_8bpc_c:                    629.0 ( 1.00x)
mct_8tap_regular_w128_0_8bpc_lsx:                   66.3 ( 9.49x)
mct_8tap_regular_w128_h_8bpc_c:                   3617.5 ( 1.00x)
mct_8tap_regular_w128_h_8bpc_lsx:                  463.6 ( 7.80x)
mct_8tap_regular_w128_hv_8bpc_c:                  7881.7 ( 1.00x)
mct_8tap_regular_w128_hv_8bpc_lsx:                1290.3 ( 6.11x)
mct_8tap_regular_w128_v_8bpc_c:                   4552.9 ( 1.00x)
mct_8tap_regular_w128_v_8bpc_lsx:                  391.1 (11.64x)

Change-Id: I8c6046e4bd6c1fb19d5712234abece0355fb77fa
2024-09-30 06:37:00 +00:00
zhoupeng and Hecai Yuan ce45ebdef4 Loongarch: Optimized blend_h_c function by LSX/LASX
blend_h_w2_8bpc_c:                                   3.8 ( 1.00x)
blend_h_w2_8bpc_lsx:                                 1.9 ( 1.98x)
blend_h_w2_8bpc_lasx:                                1.9 ( 1.98x)
blend_h_w4_8bpc_c:                                   6.4 ( 1.00x)
blend_h_w4_8bpc_lsx:                                 1.8 ( 3.49x)
blend_h_w4_8bpc_lasx:                                1.8 ( 3.49x)
blend_h_w8_8bpc_c:                                  11.6 ( 1.00x)
blend_h_w8_8bpc_lsx:                                 1.8 ( 6.45x)
blend_h_w8_8bpc_lasx:                                1.8 ( 6.48x)
blend_h_w16_8bpc_c:                                 21.5 ( 1.00x)
blend_h_w16_8bpc_lsx:                                2.1 (10.47x)
blend_h_w16_8bpc_lasx:                               2.1 (10.48x)
blend_h_w32_8bpc_c:                                 41.9 ( 1.00x)
blend_h_w32_8bpc_lsx:                                3.8 (11.08x)
blend_h_w32_8bpc_lasx:                               3.9 (10.67x)
blend_h_w64_8bpc_c:                                 82.0 ( 1.00x)
blend_h_w64_8bpc_lsx:                                6.9 (11.89x)
blend_h_w64_8bpc_lasx:                               4.6 (17.93x)
blend_h_w128_8bpc_c:                               202.3 ( 1.00x)
blend_h_w128_8bpc_lsx:                              16.4 (12.30x)
blend_h_w128_8bpc_lasx:                             11.4 (17.77x)

Change-Id: I6d6599ccbaba8a62a629c4a52254b2369dba60f6
2024-09-30 06:37:00 +00:00
zhoupeng and Hecai Yuan 5319278dbe Loongarch: Optimized blend_c/blend_v_c function by LSX
blend_v_w2_8bpc_c:                                   5.7 ( 1.00x)
blend_v_w2_8bpc_lsx:                                 3.6 ( 1.60x)
blend_v_w4_8bpc_c:                                  22.8 ( 1.00x)
blend_v_w4_8bpc_lsx:                                 7.1 ( 3.20x)
blend_v_w8_8bpc_c:                                  40.2 ( 1.00x)
blend_v_w8_8bpc_lsx:                                 7.1 ( 5.63x)
blend_v_w16_8bpc_c:                                 74.6 ( 1.00x)
blend_v_w16_8bpc_lsx:                                8.1 ( 9.26x)
blend_v_w32_8bpc_c:                                144.0 ( 1.00x)
blend_v_w32_8bpc_lsx:                               13.3 (10.83x)
blend_w4_8bpc_c:                                     4.9 ( 1.00x)
blend_w4_8bpc_lsx:                                   1.9 ( 2.49x)
blend_w8_8bpc_c:                                    14.1 ( 1.00x)
blend_w8_8bpc_lsx:                                   3.2 ( 4.37x)
blend_w16_8bpc_c:                                   51.5 ( 1.00x)
blend_w16_8bpc_lsx:                                  7.9 ( 6.51x)
blend_w32_8bpc_c:                                  127.5 ( 1.00x)
blend_w32_8bpc_lsx:                                 19.6 ( 6.52x)

Change-Id: I95e2dbc1f0735688f5473687f1a7e8d37ffbe417
2024-09-30 06:37:00 +00:00
pengxu and Hecai Yuan 0b9c756f42 Loongarch: Optimized ipred_smooth, ipred_smooth_h and ipred_smooth_v 8bpc functions by LSX
intra_pred_smooth_h_w4_8bpc_c:         7.3 ( 1.00x)
intra_pred_smooth_h_w4_8bpc_lsx:       3.1 ( 2.36x)
intra_pred_smooth_h_w8_8bpc_c:        21.3 ( 1.00x)
intra_pred_smooth_h_w8_8bpc_lsx:       4.5 ( 4.71x)
intra_pred_smooth_h_w16_8bpc_c:       66.3 ( 1.00x)
intra_pred_smooth_h_w16_8bpc_lsx:     13.4 ( 4.96x)
intra_pred_smooth_h_w32_8bpc_c:      160.0 ( 1.00x)
intra_pred_smooth_h_w32_8bpc_lsx:     29.3 ( 5.46x)
intra_pred_smooth_h_w64_8bpc_c:      400.2 ( 1.00x)
intra_pred_smooth_h_w64_8bpc_lsx:     68.3 ( 5.86x)
intra_pred_smooth_v_w4_8bpc_c:         6.6 ( 1.00x)
intra_pred_smooth_v_w4_8bpc_lsx:       3.1 ( 2.10x)
intra_pred_smooth_v_w8_8bpc_c:        19.3 ( 1.00x)
intra_pred_smooth_v_w8_8bpc_lsx:       4.9 ( 3.95x)
intra_pred_smooth_v_w16_8bpc_c:       58.6 ( 1.00x)
intra_pred_smooth_v_w16_8bpc_lsx:     24.0 ( 2.44x)
intra_pred_smooth_v_w32_8bpc_c:      139.4 ( 1.00x)
intra_pred_smooth_v_w32_8bpc_lsx:     27.0 ( 5.17x)
intra_pred_smooth_v_w64_8bpc_c:      344.8 ( 1.00x)
intra_pred_smooth_v_w64_8bpc_lsx:     70.8 ( 4.87x)
intra_pred_smooth_w4_8bpc_c:          10.2 ( 1.00x)
intra_pred_smooth_w4_8bpc_lsx:         7.9 ( 1.30x)
intra_pred_smooth_w8_8bpc_c:          30.3 ( 1.00x)
intra_pred_smooth_w8_8bpc_lsx:        20.0 ( 1.51x)
intra_pred_smooth_w16_8bpc_c:         96.3 ( 1.00x)
intra_pred_smooth_w16_8bpc_lsx:       58.3 ( 1.65x)
intra_pred_smooth_w32_8bpc_c:        231.1 ( 1.00x)
intra_pred_smooth_w32_8bpc_lsx:      134.3 ( 1.72x)
intra_pred_smooth_w64_8bpc_c:        571.5 ( 1.00x)
intra_pred_smooth_w64_8bpc_lsx:      326.5 ( 1.75x)

Change-Id: I22b6c2dcf27c5393bba374b4fbe8879c0463f828
2024-09-30 06:37:00 +00:00
pengxu and Hecai Yuan 7463c2af64 Loongarch: Optimized ipred_paeth 8bpc function by LSX
intra_pred_paeth_w4_8bpc_c:          12.3 ( 1.00x)
intra_pred_paeth_w4_8bpc_lsx:         3.9 ( 3.12x)
intra_pred_paeth_w8_8bpc_c:          39.7 ( 1.00x)
intra_pred_paeth_w8_8bpc_lsx:         6.4 ( 6.20x)
intra_pred_paeth_w16_8bpc_c:        133.6 ( 1.00x)
intra_pred_paeth_w16_8bpc_lsx:       17.0 ( 7.85x)
intra_pred_paeth_w32_8bpc_c:        342.8 ( 1.00x)
intra_pred_paeth_w32_8bpc_lsx:       52.7 ( 6.50x)
intra_pred_paeth_w64_8bpc_c:        903.8 ( 1.00x)
intra_pred_paeth_w64_8bpc_lsx:      107.3 ( 8.42x)

Change-Id: I457bdb24fdd6b5400ec030bffbdd40c79d8165c1
2024-09-30 06:37:00 +00:00
pengxu and Hecai Yuan 3e9d80d831 Loongarch: Optimized ipred_h and ipred_v 8bpc function by LSX
intra_pred_h_w4_8bpc_c:               4.3 ( 1.00x)
intra_pred_h_w4_8bpc_lsx:             3.5 ( 1.21x)
intra_pred_h_w8_8bpc_c:               5.7 ( 1.00x)
intra_pred_h_w8_8bpc_lsx:             5.1 ( 1.11x)
intra_pred_h_w16_8bpc_c:             13.2 ( 1.00x)
intra_pred_h_w16_8bpc_lsx:            7.1 ( 1.86x)
intra_pred_h_w32_8bpc_c:             12.4 ( 1.00x)
intra_pred_h_w32_8bpc_lsx:            6.3 ( 1.96x)
intra_pred_h_w64_8bpc_c:             25.9 ( 1.00x)
intra_pred_h_w64_8bpc_lsx:            5.8 ( 4.44x)
intra_pred_v_w4_8bpc_c:               4.6 ( 1.00x)
intra_pred_v_w4_8bpc_lsx:             2.5 ( 1.85x)
intra_pred_v_w8_8bpc_c:               6.9 ( 1.00x)
intra_pred_v_w8_8bpc_lsx:             4.5 ( 1.53x)
intra_pred_v_w16_8bpc_c:             13.3 ( 1.00x)
intra_pred_v_w16_8bpc_lsx:            5.2 ( 2.56x)
intra_pred_v_w32_8bpc_c:             16.1 ( 1.00x)
intra_pred_v_w32_8bpc_lsx:            5.1 ( 3.13x)
intra_pred_v_w64_8bpc_c:             21.7 ( 1.00x)
intra_pred_v_w64_8bpc_lsx:            7.7 ( 2.80x)

Change-Id: I51b3dd13877315b9c1c64590c19f1ad38bfc4bdf
2024-09-30 06:37:00 +00:00
pengxu and Hecai Yuan 2a9cbcc2f3 Loongarch: Optimized ipred_dc, ipred_dc_128, ipred_dc_left and ipred_dc_top 8bpc functions by LSX
intra_pred_dc_w4_8bpc_c:              2.1 ( 1.00x)
intra_pred_dc_w4_8bpc_lsx:            1.3 ( 1.54x)
intra_pred_dc_w8_8bpc_c:              3.6 ( 1.00x)
intra_pred_dc_w8_8bpc_lsx:            3.7 ( 0.97x)
intra_pred_dc_w16_8bpc_c:             6.9 ( 1.00x)
intra_pred_dc_w16_8bpc_lsx:           7.8 ( 0.88x)
intra_pred_dc_w32_8bpc_c:            14.1 ( 1.00x)
intra_pred_dc_w32_8bpc_lsx:           7.1 ( 1.97x)
intra_pred_dc_w64_8bpc_c:            25.3 ( 1.00x)
intra_pred_dc_w64_8bpc_lsx:           7.4 ( 3.41x)
intra_pred_dc_128_w4_8bpc_c:          0.6 ( 1.00x)
intra_pred_dc_128_w4_8bpc_lsx:        0.8 ( 0.76x)
intra_pred_dc_128_w8_8bpc_c:          1.4 ( 1.00x)
intra_pred_dc_128_w8_8bpc_lsx:        3.2 ( 0.45x)
intra_pred_dc_128_w16_8bpc_c:         3.4 ( 1.00x)
intra_pred_dc_128_w16_8bpc_lsx:       7.3 ( 0.47x)
intra_pred_dc_128_w32_8bpc_c:         8.8 ( 1.00x)
intra_pred_dc_128_w32_8bpc_lsx:       6.4 ( 1.38x)
intra_pred_dc_128_w64_8bpc_c:        17.0 ( 1.00x)
intra_pred_dc_128_w64_8bpc_lsx:       6.2 ( 2.74x)
intra_pred_dc_left_w4_8bpc_c:         1.1 ( 1.00x)
intra_pred_dc_left_w4_8bpc_lsx:       1.1 ( 1.00x)
intra_pred_dc_left_w8_8bpc_c:         2.1 ( 1.00x)
intra_pred_dc_left_w8_8bpc_lsx:       3.4 ( 0.64x)
intra_pred_dc_left_w16_8bpc_c:        4.6 ( 1.00x)
intra_pred_dc_left_w16_8bpc_lsx:      7.5 ( 0.62x)
intra_pred_dc_left_w32_8bpc_c:       10.3 ( 1.00x)
intra_pred_dc_left_w32_8bpc_lsx:      7.8 ( 1.32x)
intra_pred_dc_left_w64_8bpc_c:       18.7 ( 1.00x)
intra_pred_dc_left_w64_8bpc_lsx:      6.6 ( 2.83x)
intra_pred_dc_top_w4_8bpc_c:          0.9 ( 1.00x)
intra_pred_dc_top_w4_8bpc_lsx:        0.8 ( 1.10x)
intra_pred_dc_top_w8_8bpc_c:          1.9 ( 1.00x)
intra_pred_dc_top_w8_8bpc_lsx:        2.8 ( 0.67x)
intra_pred_dc_top_w16_8bpc_c:         4.2 ( 1.00x)
intra_pred_dc_top_w16_8bpc_lsx:       5.5 ( 0.77x)
intra_pred_dc_top_w32_8bpc_c:        10.4 ( 1.00x)
intra_pred_dc_top_w32_8bpc_lsx:       6.7 ( 1.54x)
intra_pred_dc_top_w64_8bpc_c:        19.9 ( 1.00x)
intra_pred_dc_top_w64_8bpc_lsx:       6.9 ( 2.87x)

Change-Id: Ib5349e2430302da0424a474ce0fedc457439c761
2024-09-30 06:37:00 +00:00
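Several LSX variants in the ipred_dc listing above are slower than C at small widths (ratios below 1.00x). A minimal sketch of how such cases could be flagged from the timings follows. `TIMINGS` holds a handful of rows copied from that listing, and `regressions` is an illustrative helper, not part of checkasm.

```python
# (function, C time, LSX time), copied from the ipred_dc listing above.
TIMINGS = [
    ("intra_pred_dc_w4", 2.1, 1.3),
    ("intra_pred_dc_w8", 3.6, 3.7),
    ("intra_pred_dc_w16", 6.9, 7.8),
    ("intra_pred_dc_128_w8", 1.4, 3.2),
    ("intra_pred_dc_128_w16", 3.4, 7.3),
    ("intra_pred_dc_128_w64", 17.0, 6.2),
]

def regressions(timings, threshold=1.0):
    """Return (name, speedup) pairs with speedup below threshold, slowest first."""
    slow = [
        (name, c_time / simd_time)
        for name, c_time, simd_time in timings
        if c_time / simd_time < threshold
    ]
    return sorted(slow, key=lambda item: item[1])

for name, ratio in regressions(TIMINGS):
    print(f"{name}: {ratio:.2f}x of C speed")
```

The ratios are recomputed from timings rounded to one decimal place, so they may not match the listed ratios exactly, especially for sub-10 timings.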
pengxu and Hecai Yuan 62c47f3558 Loongarch: Optimized cdef_filter_block 4x4,4x8,8x8 8bpc function by LSX
cdef_filter_4x4_01_8bpc_c:      420.8 ( 1.00x)
cdef_filter_4x4_01_8bpc_lsx:    117.2 ( 3.59x)
cdef_filter_4x4_10_8bpc_c:      265.8 ( 1.00x)
cdef_filter_4x4_10_8bpc_lsx:     98.9 ( 2.69x)
cdef_filter_4x4_11_8bpc_c:     1036.2 ( 1.00x)
cdef_filter_4x4_11_8bpc_lsx:    169.6 ( 6.11x)
cdef_filter_4x8_01_8bpc_c:      802.6 ( 1.00x)
cdef_filter_4x8_01_8bpc_lsx:    206.1 ( 3.89x)
cdef_filter_4x8_10_8bpc_c:      489.1 ( 1.00x)
cdef_filter_4x8_10_8bpc_lsx:    167.4 ( 2.92x)
cdef_filter_4x8_11_8bpc_c:     2028.9 ( 1.00x)
cdef_filter_4x8_11_8bpc_lsx:    309.4 ( 6.56x)
cdef_filter_8x8_01_8bpc_c:     1562.2 ( 1.00x)
cdef_filter_8x8_01_8bpc_lsx:    295.3 ( 5.29x)
cdef_filter_8x8_10_8bpc_c:      949.4 ( 1.00x)
cdef_filter_8x8_10_8bpc_lsx:    207.6 ( 4.57x)
cdef_filter_8x8_11_8bpc_c:     4009.6 ( 1.00x)
cdef_filter_8x8_11_8bpc_lsx:    466.8 ( 8.59x)

Change-Id: I8cd43426a27055e18c44a7701fa50f8835c712be
2024-09-30 06:37:00 +00:00
jinbo and Hecai Yuan fa7b72d082 Refine mc_put_8tap
The speedup over the LSX version is roughly 68% to 156%.

Change-Id: I0b39cd0e05e3cbd84fded121d29a91ea2a620f03
2024-09-30 06:37:00 +00:00
guxiwei and Hecai Yuan 02309b9f60 msac: Add msac_decode_bool_equi_lsx and msac_decode_hi_tok_lsx
The performance data is as follows:
msac_decode_bool_equi_c:             0.4 ( 1.00x)
msac_decode_bool_equi_lsx:           0.3 ( 1.07x)
msac_decode_hi_tok_c:                1.8 ( 1.00x)
msac_decode_hi_tok_lsx:              1.4 ( 1.27x)

Change-Id: Ic2f2678cf699bb22c579424af71ae2603e228482
2024-09-30 06:37:00 +00:00
pengxu and Hecai Yuan 2154425f70 Loongarch: Optimized cdef_find_dir_8bpc function by LSX
cdef_dir_8bpc_c:                 28.8 ( 1.00x)
cdef_dir_8bpc_lsx:               19.1 ( 1.51x)

Change-Id: Ic7c1f32c5b1733b011f4c448cffc93f745b564f5
2024-09-30 06:37:00 +00:00
yuanhecai f6ffdc90b3 loongarch: opt inv_txfm_add_identity_identity_8x32_8bpc_lsx
Relative speedup over C code:

inv_txfm_add_8x32_identity_identity_0_8bpc_c:       126.1 ( 1.00x)
inv_txfm_add_8x32_identity_identity_0_8bpc_lsx:       1.6 (78.59x)
inv_txfm_add_8x32_identity_identity_1_8bpc_c:       136.9 ( 1.00x)
inv_txfm_add_8x32_identity_identity_1_8bpc_lsx:       1.6 (85.31x)
inv_txfm_add_8x32_identity_identity_2_8bpc_c:       148.0 ( 1.00x)
inv_txfm_add_8x32_identity_identity_2_8bpc_lsx:       3.3 (45.47x)
inv_txfm_add_8x32_identity_identity_3_8bpc_c:       159.4 ( 1.00x)
inv_txfm_add_8x32_identity_identity_3_8bpc_lsx:       4.9 (32.78x)
inv_txfm_add_8x32_identity_identity_4_8bpc_c:       170.2 ( 1.00x)
inv_txfm_add_8x32_identity_identity_4_8bpc_lsx:       6.5 (26.17x)

Change-Id: Iabda6efcd8a17d26a205f90757dfea85af48848f
2024-09-30 06:37:00 +00:00
yuanhecai 5de878a4e1 loongarch: Minor improvement on identity4*, identity8* and dct32*
1. Remove the identity8 code in the 4x8/8x8/8x16 series
2. Modify the dct_dct_8x32/32x32/64x64 functions
3. Modify the identity4 code in the 4x4/4x8/8x4 series

After these changes, function performance improves by 20%.

Change-Id: I1bc2e0fb25e508faf9fc220333460a99be3f5e49
2024-09-30 06:37:00 +00:00
yuanhecai 2fc656604b loongarch: add lsx implementation of itx_8bpc.add_8x16 series functions for 8bpc
Relative speedup over C code:

inv_txfm_add_8x16_adst_adst_0_8bpc_c: 208.1
inv_txfm_add_8x16_adst_adst_0_8bpc_lsx: 31.3
inv_txfm_add_8x16_adst_adst_1_8bpc_c: 208.4
inv_txfm_add_8x16_adst_adst_1_8bpc_lsx: 31.3
inv_txfm_add_8x16_adst_adst_2_8bpc_c: 208.1
inv_txfm_add_8x16_adst_adst_2_8bpc_lsx: 31.3
inv_txfm_add_8x16_adst_dct_0_8bpc_c: 204.0
inv_txfm_add_8x16_adst_dct_0_8bpc_lsx: 27.2
inv_txfm_add_8x16_adst_dct_1_8bpc_c: 204.0
inv_txfm_add_8x16_adst_dct_1_8bpc_lsx: 27.2
inv_txfm_add_8x16_adst_dct_2_8bpc_c: 204.0
inv_txfm_add_8x16_adst_dct_2_8bpc_lsx: 27.2
inv_txfm_add_8x16_adst_flipadst_0_8bpc_c: 207.9
inv_txfm_add_8x16_adst_flipadst_0_8bpc_lsx: 31.3
inv_txfm_add_8x16_adst_flipadst_1_8bpc_c: 208.3
inv_txfm_add_8x16_adst_flipadst_1_8bpc_lsx: 31.3
inv_txfm_add_8x16_adst_flipadst_2_8bpc_c: 208.6
inv_txfm_add_8x16_adst_flipadst_2_8bpc_lsx: 31.3
inv_txfm_add_8x16_adst_identity_0_8bpc_c: 146.6
inv_txfm_add_8x16_adst_identity_0_8bpc_lsx: 21.8
inv_txfm_add_8x16_adst_identity_1_8bpc_c: 146.6
inv_txfm_add_8x16_adst_identity_1_8bpc_lsx: 21.8
inv_txfm_add_8x16_adst_identity_2_8bpc_c: 146.6
inv_txfm_add_8x16_adst_identity_2_8bpc_lsx: 21.8
inv_txfm_add_8x16_dct_adst_0_8bpc_c: 204.8
inv_txfm_add_8x16_dct_adst_0_8bpc_lsx: 26.2
inv_txfm_add_8x16_dct_adst_1_8bpc_c: 204.8
inv_txfm_add_8x16_dct_adst_1_8bpc_lsx: 26.1
inv_txfm_add_8x16_dct_adst_2_8bpc_c: 204.8
inv_txfm_add_8x16_dct_adst_2_8bpc_lsx: 26.2
inv_txfm_add_8x16_dct_dct_0_8bpc_c: 23.1
inv_txfm_add_8x16_dct_dct_0_8bpc_lsx: 2.3
inv_txfm_add_8x16_dct_dct_1_8bpc_c: 200.8
inv_txfm_add_8x16_dct_dct_1_8bpc_lsx: 21.9
inv_txfm_add_8x16_dct_dct_2_8bpc_c: 200.7
inv_txfm_add_8x16_dct_dct_2_8bpc_lsx: 21.9
inv_txfm_add_8x16_dct_flipadst_0_8bpc_c: 204.6
inv_txfm_add_8x16_dct_flipadst_0_8bpc_lsx: 26.3
inv_txfm_add_8x16_dct_flipadst_1_8bpc_c: 204.6
inv_txfm_add_8x16_dct_flipadst_1_8bpc_lsx: 26.3
inv_txfm_add_8x16_dct_flipadst_2_8bpc_c: 204.6
inv_txfm_add_8x16_dct_flipadst_2_8bpc_lsx: 26.3
inv_txfm_add_8x16_dct_identity_0_8bpc_c: 143.2
inv_txfm_add_8x16_dct_identity_0_8bpc_lsx: 16.7
inv_txfm_add_8x16_dct_identity_1_8bpc_c: 142.9
inv_txfm_add_8x16_dct_identity_1_8bpc_lsx: 16.7
inv_txfm_add_8x16_dct_identity_2_8bpc_c: 143.5
inv_txfm_add_8x16_dct_identity_2_8bpc_lsx: 16.7
inv_txfm_add_8x16_flipadst_adst_0_8bpc_c: 206.5
inv_txfm_add_8x16_flipadst_adst_0_8bpc_lsx: 31.3
inv_txfm_add_8x16_flipadst_adst_1_8bpc_c: 206.5
inv_txfm_add_8x16_flipadst_adst_1_8bpc_lsx: 31.3
inv_txfm_add_8x16_flipadst_adst_2_8bpc_c: 206.5
inv_txfm_add_8x16_flipadst_adst_2_8bpc_lsx: 31.3
inv_txfm_add_8x16_flipadst_dct_0_8bpc_c: 202.5
inv_txfm_add_8x16_flipadst_dct_0_8bpc_lsx: 26.8
inv_txfm_add_8x16_flipadst_dct_1_8bpc_c: 202.3
inv_txfm_add_8x16_flipadst_dct_1_8bpc_lsx: 26.8
inv_txfm_add_8x16_flipadst_dct_2_8bpc_c: 202.3
inv_txfm_add_8x16_flipadst_dct_2_8bpc_lsx: 26.8
inv_txfm_add_8x16_flipadst_flipadst_0_8bpc_c: 206.3
inv_txfm_add_8x16_flipadst_flipadst_0_8bpc_lsx: 31.3
inv_txfm_add_8x16_flipadst_flipadst_1_8bpc_c: 206.3
inv_txfm_add_8x16_flipadst_flipadst_1_8bpc_lsx: 31.3
inv_txfm_add_8x16_flipadst_flipadst_2_8bpc_c: 206.3
inv_txfm_add_8x16_flipadst_flipadst_2_8bpc_lsx: 31.3
inv_txfm_add_8x16_identity_adst_0_8bpc_c: 160.7
inv_txfm_add_8x16_identity_adst_0_8bpc_lsx: 21.8
inv_txfm_add_8x16_identity_adst_1_8bpc_c: 160.4
inv_txfm_add_8x16_identity_adst_1_8bpc_lsx: 21.8
inv_txfm_add_8x16_identity_adst_2_8bpc_c: 160.1
inv_txfm_add_8x16_identity_adst_2_8bpc_lsx: 21.8
inv_txfm_add_8x16_identity_dct_0_8bpc_c: 157.9
inv_txfm_add_8x16_identity_dct_0_8bpc_lsx: 17.7
inv_txfm_add_8x16_identity_dct_1_8bpc_c: 156.5
inv_txfm_add_8x16_identity_dct_1_8bpc_lsx: 17.7
inv_txfm_add_8x16_identity_dct_2_8bpc_c: 156.8
inv_txfm_add_8x16_identity_dct_2_8bpc_lsx: 17.7
inv_txfm_add_8x16_identity_flipadst_0_8bpc_c: 159.9
inv_txfm_add_8x16_identity_flipadst_0_8bpc_lsx: 21.8
inv_txfm_add_8x16_identity_flipadst_1_8bpc_c: 159.9
inv_txfm_add_8x16_identity_flipadst_1_8bpc_lsx: 21.8
inv_txfm_add_8x16_identity_flipadst_2_8bpc_c: 160.0
inv_txfm_add_8x16_identity_flipadst_2_8bpc_lsx: 21.8
inv_txfm_add_8x16_identity_identity_0_8bpc_c: 98.3
inv_txfm_add_8x16_identity_identity_0_8bpc_lsx: 12.3
inv_txfm_add_8x16_identity_identity_1_8bpc_c: 98.0
inv_txfm_add_8x16_identity_identity_1_8bpc_lsx: 12.3
inv_txfm_add_8x16_identity_identity_2_8bpc_c: 98.1
inv_txfm_add_8x16_identity_identity_2_8bpc_lsx: 12.3
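
The timings above are listed per C/LSX pair without the ratio itself. A minimal sketch of deriving the speedup from such lines, assuming the `name: value` layout shown here and that lower timings are faster; the helper name `speedups` is hypothetical and not part of dav1d or checkasm:

```python
def speedups(report: str) -> dict[str, float]:
    """Map each function base name to its C/LSX timing ratio."""
    timings: dict[str, dict[str, float]] = {}
    for line in report.splitlines():
        if ":" not in line:
            continue  # skip blank or unrelated lines
        name, value = line.split(":", 1)
        # Names end in "_c" or "_lsx"; split that suffix off the base name.
        base, _, impl = name.strip().rpartition("_")
        timings.setdefault(base, {})[impl] = float(value)
    return {
        base: t["c"] / t["lsx"]
        for base, t in timings.items()
        if "c" in t and "lsx" in t
    }


# Four lines copied from the listing above.
sample = """\
inv_txfm_add_8x16_dct_dct_0_8bpc_c: 23.1
inv_txfm_add_8x16_dct_dct_0_8bpc_lsx: 2.3
inv_txfm_add_8x16_dct_dct_1_8bpc_c: 200.8
inv_txfm_add_8x16_dct_dct_1_8bpc_lsx: 21.9
"""

for base, ratio in speedups(sample).items():
    print(f"{base}: {ratio:.2f}x")
```

Because the printed timings are already rounded, ratios computed this way can differ slightly from ratios computed on the unrounded measurements.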

Change-Id: Ida8d71e4eff782b9f81e0ad426eaa078b68528cf
2024-09-30 06:37:00 +00:00
yuanhecai 643ae52baa loongarch: add LSX implementation of the itx_8bpc.add_4x16 series functions for 8bpc
Benchmark timings, C vs. LSX (lower is faster):

inv_txfm_add_4x16_adst_adst_0_8bpc_c: 91.1
inv_txfm_add_4x16_adst_adst_0_8bpc_lsx: 18.2
inv_txfm_add_4x16_adst_adst_1_8bpc_c: 91.1
inv_txfm_add_4x16_adst_adst_1_8bpc_lsx: 18.2
inv_txfm_add_4x16_adst_adst_2_8bpc_c: 91.1
inv_txfm_add_4x16_adst_adst_2_8bpc_lsx: 18.2
inv_txfm_add_4x16_adst_dct_0_8bpc_c: 89.5
inv_txfm_add_4x16_adst_dct_0_8bpc_lsx: 14.3
inv_txfm_add_4x16_adst_dct_1_8bpc_c: 89.5
inv_txfm_add_4x16_adst_dct_1_8bpc_lsx: 14.3
inv_txfm_add_4x16_adst_dct_2_8bpc_c: 89.5
inv_txfm_add_4x16_adst_dct_2_8bpc_lsx: 14.3
inv_txfm_add_4x16_adst_flipadst_0_8bpc_c: 91.8
inv_txfm_add_4x16_adst_flipadst_0_8bpc_lsx: 18.2
inv_txfm_add_4x16_adst_flipadst_1_8bpc_c: 91.7
inv_txfm_add_4x16_adst_flipadst_1_8bpc_lsx: 18.2
inv_txfm_add_4x16_adst_flipadst_2_8bpc_c: 91.8
inv_txfm_add_4x16_adst_flipadst_2_8bpc_lsx: 18.2
inv_txfm_add_4x16_adst_identity_0_8bpc_c: 60.5
inv_txfm_add_4x16_adst_identity_0_8bpc_lsx: 6.3
inv_txfm_add_4x16_adst_identity_1_8bpc_c: 60.5
inv_txfm_add_4x16_adst_identity_1_8bpc_lsx: 6.3
inv_txfm_add_4x16_adst_identity_2_8bpc_c: 60.5
inv_txfm_add_4x16_adst_identity_2_8bpc_lsx: 6.3
inv_txfm_add_4x16_dct_adst_0_8bpc_c: 92.7
inv_txfm_add_4x16_dct_adst_0_8bpc_lsx: 18.4
inv_txfm_add_4x16_dct_adst_1_8bpc_c: 92.7
inv_txfm_add_4x16_dct_adst_1_8bpc_lsx: 18.4
inv_txfm_add_4x16_dct_adst_2_8bpc_c: 92.7
inv_txfm_add_4x16_dct_adst_2_8bpc_lsx: 18.4
inv_txfm_add_4x16_dct_dct_0_8bpc_c: 13.7
inv_txfm_add_4x16_dct_dct_0_8bpc_lsx: 1.9
inv_txfm_add_4x16_dct_dct_1_8bpc_c: 90.6
inv_txfm_add_4x16_dct_dct_1_8bpc_lsx: 14.5
inv_txfm_add_4x16_dct_dct_2_8bpc_c: 90.6
inv_txfm_add_4x16_dct_dct_2_8bpc_lsx: 14.5
inv_txfm_add_4x16_dct_flipadst_0_8bpc_c: 93.3
inv_txfm_add_4x16_dct_flipadst_0_8bpc_lsx: 18.6
inv_txfm_add_4x16_dct_flipadst_1_8bpc_c: 93.4
inv_txfm_add_4x16_dct_flipadst_1_8bpc_lsx: 18.6
inv_txfm_add_4x16_dct_flipadst_2_8bpc_c: 93.4
inv_txfm_add_4x16_dct_flipadst_2_8bpc_lsx: 18.6
inv_txfm_add_4x16_dct_identity_0_8bpc_c: 62.1
inv_txfm_add_4x16_dct_identity_0_8bpc_lsx: 6.5
inv_txfm_add_4x16_dct_identity_1_8bpc_c: 62.1
inv_txfm_add_4x16_dct_identity_1_8bpc_lsx: 6.5
inv_txfm_add_4x16_dct_identity_2_8bpc_c: 62.1
inv_txfm_add_4x16_dct_identity_2_8bpc_lsx: 6.5
inv_txfm_add_4x16_flipadst_adst_0_8bpc_c: 92.2
inv_txfm_add_4x16_flipadst_adst_0_8bpc_lsx: 18.1
inv_txfm_add_4x16_flipadst_adst_1_8bpc_c: 92.3
inv_txfm_add_4x16_flipadst_adst_1_8bpc_lsx: 18.1
inv_txfm_add_4x16_flipadst_adst_2_8bpc_c: 92.2
inv_txfm_add_4x16_flipadst_adst_2_8bpc_lsx: 18.1
inv_txfm_add_4x16_flipadst_dct_0_8bpc_c: 90.6
inv_txfm_add_4x16_flipadst_dct_0_8bpc_lsx: 14.3
inv_txfm_add_4x16_flipadst_dct_1_8bpc_c: 90.6
inv_txfm_add_4x16_flipadst_dct_1_8bpc_lsx: 14.3
inv_txfm_add_4x16_flipadst_dct_2_8bpc_c: 90.6
inv_txfm_add_4x16_flipadst_dct_2_8bpc_lsx: 14.3
inv_txfm_add_4x16_flipadst_flipadst_0_8bpc_c: 92.9
inv_txfm_add_4x16_flipadst_flipadst_0_8bpc_lsx: 18.2
inv_txfm_add_4x16_flipadst_flipadst_1_8bpc_c: 92.9
inv_txfm_add_4x16_flipadst_flipadst_1_8bpc_lsx: 18.2
inv_txfm_add_4x16_flipadst_flipadst_2_8bpc_c: 92.9
inv_txfm_add_4x16_flipadst_flipadst_2_8bpc_lsx: 18.2
inv_txfm_add_4x16_flipadst_identity_0_8bpc_c: 61.8
inv_txfm_add_4x16_flipadst_identity_0_8bpc_lsx: 6.3
inv_txfm_add_4x16_flipadst_identity_1_8bpc_c: 61.8
inv_txfm_add_4x16_flipadst_identity_1_8bpc_lsx: 6.3
inv_txfm_add_4x16_flipadst_identity_2_8bpc_c: 61.8
inv_txfm_add_4x16_flipadst_identity_2_8bpc_lsx: 6.3
inv_txfm_add_4x16_identity_adst_0_8bpc_c: 83.1
inv_txfm_add_4x16_identity_adst_0_8bpc_lsx: 17.8
inv_txfm_add_4x16_identity_adst_1_8bpc_c: 83.0
inv_txfm_add_4x16_identity_adst_1_8bpc_lsx: 17.8
inv_txfm_add_4x16_identity_adst_2_8bpc_c: 83.0
inv_txfm_add_4x16_identity_adst_2_8bpc_lsx: 17.8
inv_txfm_add_4x16_identity_dct_0_8bpc_c: 81.4
inv_txfm_add_4x16_identity_dct_0_8bpc_lsx: 13.9
inv_txfm_add_4x16_identity_dct_1_8bpc_c: 81.4
inv_txfm_add_4x16_identity_dct_1_8bpc_lsx: 13.9
inv_txfm_add_4x16_identity_dct_2_8bpc_c: 81.4
inv_txfm_add_4x16_identity_dct_2_8bpc_lsx: 13.9
inv_txfm_add_4x16_identity_flipadst_0_8bpc_c: 84.1
inv_txfm_add_4x16_identity_flipadst_0_8bpc_lsx: 17.8
inv_txfm_add_4x16_identity_flipadst_1_8bpc_c: 84.0
inv_txfm_add_4x16_identity_flipadst_1_8bpc_lsx: 17.8
inv_txfm_add_4x16_identity_flipadst_2_8bpc_c: 83.9
inv_txfm_add_4x16_identity_flipadst_2_8bpc_lsx: 17.8
inv_txfm_add_4x16_identity_identity_0_8bpc_c: 52.4
inv_txfm_add_4x16_identity_identity_0_8bpc_lsx: 5.5
inv_txfm_add_4x16_identity_identity_1_8bpc_c: 52.4
inv_txfm_add_4x16_identity_identity_1_8bpc_lsx: 5.5
inv_txfm_add_4x16_identity_identity_2_8bpc_c: 52.4
inv_txfm_add_4x16_identity_identity_2_8bpc_lsx: 5.5

Change-Id: I36322071eeea45df9289f2b1d533ee937904aec2
2024-09-30 06:37:00 +00:00
yuanhecai d60d93a55c loongarch: add LSX implementation of the itx_8bpc.add_4x8 series functions for 8bpc
Benchmark timings, C vs. LSX (lower is faster):

inv_txfm_add_4x8_adst_adst_0_8bpc_c: 43.8
inv_txfm_add_4x8_adst_adst_0_8bpc_lsx: 8.6
inv_txfm_add_4x8_adst_adst_1_8bpc_c: 43.8
inv_txfm_add_4x8_adst_adst_1_8bpc_lsx: 8.6
inv_txfm_add_4x8_adst_dct_0_8bpc_c: 43.0
inv_txfm_add_4x8_adst_dct_0_8bpc_lsx: 6.5
inv_txfm_add_4x8_adst_dct_1_8bpc_c: 43.0
inv_txfm_add_4x8_adst_dct_1_8bpc_lsx: 6.5
inv_txfm_add_4x8_adst_flipadst_0_8bpc_c: 44.1
inv_txfm_add_4x8_adst_flipadst_0_8bpc_lsx: 8.8
inv_txfm_add_4x8_adst_flipadst_1_8bpc_c: 44.1
inv_txfm_add_4x8_adst_flipadst_1_8bpc_lsx: 8.8
inv_txfm_add_4x8_adst_identity_0_8bpc_c: 31.3
inv_txfm_add_4x8_adst_identity_0_8bpc_lsx: 2.9
inv_txfm_add_4x8_adst_identity_1_8bpc_c: 31.3
inv_txfm_add_4x8_adst_identity_1_8bpc_lsx: 2.9
inv_txfm_add_4x8_dct_adst_0_8bpc_c: 46.3
inv_txfm_add_4x8_dct_adst_0_8bpc_lsx: 8.8
inv_txfm_add_4x8_dct_adst_1_8bpc_c: 46.3
inv_txfm_add_4x8_dct_adst_1_8bpc_lsx: 8.8
inv_txfm_add_4x8_dct_dct_0_8bpc_c: 7.3
inv_txfm_add_4x8_dct_dct_0_8bpc_lsx: 1.5
inv_txfm_add_4x8_dct_dct_1_8bpc_c: 45.7
inv_txfm_add_4x8_dct_dct_1_8bpc_lsx: 6.7
inv_txfm_add_4x8_dct_flipadst_0_8bpc_c: 46.7
inv_txfm_add_4x8_dct_flipadst_0_8bpc_lsx: 8.8
inv_txfm_add_4x8_dct_flipadst_1_8bpc_c: 46.7
inv_txfm_add_4x8_dct_flipadst_1_8bpc_lsx: 8.8
inv_txfm_add_4x8_dct_identity_0_8bpc_c: 33.8
inv_txfm_add_4x8_dct_identity_0_8bpc_lsx: 2.9
inv_txfm_add_4x8_dct_identity_1_8bpc_c: 33.8
inv_txfm_add_4x8_dct_identity_1_8bpc_lsx: 2.9
inv_txfm_add_4x8_flipadst_adst_0_8bpc_c: 44.0
inv_txfm_add_4x8_flipadst_adst_0_8bpc_lsx: 8.6
inv_txfm_add_4x8_flipadst_adst_1_8bpc_c: 43.9
inv_txfm_add_4x8_flipadst_adst_1_8bpc_lsx: 8.6
inv_txfm_add_4x8_flipadst_dct_0_8bpc_c: 43.3
inv_txfm_add_4x8_flipadst_dct_0_8bpc_lsx: 6.5
inv_txfm_add_4x8_flipadst_dct_1_8bpc_c: 43.4
inv_txfm_add_4x8_flipadst_dct_1_8bpc_lsx: 6.5
inv_txfm_add_4x8_flipadst_flipadst_0_8bpc_c: 44.4
inv_txfm_add_4x8_flipadst_flipadst_0_8bpc_lsx: 8.8
inv_txfm_add_4x8_flipadst_flipadst_1_8bpc_c: 44.4
inv_txfm_add_4x8_flipadst_flipadst_1_8bpc_lsx: 8.8
inv_txfm_add_4x8_flipadst_identity_0_8bpc_c: 31.5
inv_txfm_add_4x8_flipadst_identity_0_8bpc_lsx: 2.9
inv_txfm_add_4x8_flipadst_identity_1_8bpc_c: 31.5
inv_txfm_add_4x8_flipadst_identity_1_8bpc_lsx: 2.9
inv_txfm_add_4x8_identity_adst_0_8bpc_c: 38.9
inv_txfm_add_4x8_identity_adst_0_8bpc_lsx: 8.2
inv_txfm_add_4x8_identity_adst_1_8bpc_c: 38.9
inv_txfm_add_4x8_identity_adst_1_8bpc_lsx: 8.2
inv_txfm_add_4x8_identity_dct_0_8bpc_c: 38.1
inv_txfm_add_4x8_identity_dct_0_8bpc_lsx: 6.1
inv_txfm_add_4x8_identity_dct_1_8bpc_c: 38.1
inv_txfm_add_4x8_identity_dct_1_8bpc_lsx: 6.1
inv_txfm_add_4x8_identity_flipadst_0_8bpc_c: 39.2
inv_txfm_add_4x8_identity_flipadst_0_8bpc_lsx: 8.3
inv_txfm_add_4x8_identity_flipadst_1_8bpc_c: 39.2
inv_txfm_add_4x8_identity_flipadst_1_8bpc_lsx: 8.3
inv_txfm_add_4x8_identity_identity_0_8bpc_c: 26.4
inv_txfm_add_4x8_identity_identity_0_8bpc_lsx: 2.4
inv_txfm_add_4x8_identity_identity_1_8bpc_c: 26.4
inv_txfm_add_4x8_identity_identity_1_8bpc_lsx: 2.4

Change-Id: Ibbaeca98118774a261cf55afd581196d93ac2004
2024-09-30 06:37:00 +00:00
yuanhecai 74e0eeb5ec loongarch: Optimize one function of the itx_8bpc.add_16x32 series
1. inv_txfm_add_dct_dct_16x32

Benchmark timings, C vs. LSX (lower is faster):

inv_txfm_add_16x32_dct_dct_0_8bpc_c: 63.4
inv_txfm_add_16x32_dct_dct_0_8bpc_lsx: 3.3
inv_txfm_add_16x32_dct_dct_1_8bpc_c: 687.0
inv_txfm_add_16x32_dct_dct_1_8bpc_lsx: 55.7
inv_txfm_add_16x32_dct_dct_2_8bpc_c: 686.4
inv_txfm_add_16x32_dct_dct_2_8bpc_lsx: 55.6
inv_txfm_add_16x32_dct_dct_3_8bpc_c: 686.4
inv_txfm_add_16x32_dct_dct_3_8bpc_lsx: 55.5
inv_txfm_add_16x32_dct_dct_4_8bpc_c: 686.4
inv_txfm_add_16x32_dct_dct_4_8bpc_lsx: 55.6

Change-Id: I9d22b8b3534b7ba17f6e85e42d08eb3165e2e8cb
2024-09-30 06:37:00 +00:00
MARBEAN and Jean-Baptiste Kempf f2c3ccd6a6 meson: support the iOS platform 2024-09-21 07:10:06 +00:00
Cameron Cawley and Martin Storsjö a7a40a3fde Define __ARM_ARCH with older compilers
This is needed for GCC 4.7 and earlier, as well as Visual Studio 2022 version 17.9 and earlier.
2024-09-18 18:29:36 +00:00
Cameron Cawley and Martin Storsjö 8e993f4d0b Support older ARM versions with checkasm 2024-09-18 18:29:36 +00:00
Luca Barbato 8d9b1e26b3 ppc: Factor out dc_only itx 2024-09-17 12:34:37 +00:00
Luca Barbato 75d3ad14f2 ppc: itx 16x4 pwr9 2024-09-17 12:34:37 +00:00
Luca Barbato 0bf331a1bb ppc: itx 4x16 pwr9
Initial i32x4 version; can be used as a base for high bitdepth.
2024-09-17 12:34:37 +00:00
Luca Barbato 19e122ee38 ppc: Remove high bitdepth macros from the 8bit-only code 2024-09-17 12:34:37 +00:00
Luca Barbato b1d847beb5 ppc: itx 8x8 pwr9 2024-09-17 12:34:37 +00:00
Luca Barbato da51b12322 ppc: itx 4x8 and 8x4 pwr9 2024-09-17 12:34:37 +00:00
Luca Barbato 33b9d5141f ppc: itx 4x4 pwr9 2024-09-17 12:34:37 +00:00
Jean-Baptiste Kempf 212359662d NEWS: get ready for 1.5.0 2024-09-17 12:11:45 +00:00
Jean-Baptiste Kempf bd875480a9 Update NEWS for 1.4.3 2024-09-17 12:11:45 +00:00
Michael Bradshaw and Ronald S. Bultje dd32cd5027 Use #if HAVE_* instead of #ifdef HAVE_* 2024-09-12 20:40:08 +00:00
Arpad Panyik and Martin Storsjö 82e9155c75 AArch64: Trim Armv8.0 Neon path of 6-tap and 8-tap MC functions
There are some instruction sequences we could merge after the lane
load/store patch (ec5c3052cf).

This change will simplify the loading of filter weights to save 288
bytes in the Armv8.0 Neon path of 6-tap and 8-tap MC functions.
2024-09-12 11:31:07 +00:00
Kacper Michajłow f4a0d7cb70 Remove dav1d/ prefix from dav1d.h
This is possible because we no longer generate version.h at compile
time.

Reverts header change from 7629402bbd to
preserve the same behaviour as before.
2024-09-11 02:43:02 +02:00
Kacper Michajłow 74ccc93687 meson: don't generate version.h
Instead of generating version.h, move the so version there and parse it
in meson.
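
The idea of keeping the version in a checked-in header and extracting it at configure time can be sketched as follows. This is an illustration in Python rather than Meson, and the macro names and values in `HEADER` are assumptions, not taken from this commit:

```python
import re

# Hypothetical header contents; macro names and values are illustrative only.
HEADER = """\
#define DAV1D_API_VERSION_MAJOR 7
#define DAV1D_API_VERSION_MINOR 0
#define DAV1D_API_VERSION_PATCH 0
"""


def parse_version(header_text: str, prefix: str = "DAV1D_API_VERSION_") -> str:
    """Extract MAJOR.MINOR.PATCH from #define lines in a C header."""
    pattern = re.compile(
        r"^\s*#\s*define\s+" + re.escape(prefix) + r"(MAJOR|MINOR|PATCH)\s+(\d+)",
        re.MULTILINE,
    )
    parts = {m.group(1): m.group(2) for m in pattern.finditer(header_text)}
    missing = [k for k in ("MAJOR", "MINOR", "PATCH") if k not in parts]
    if missing:
        raise ValueError(f"missing version macros: {missing}")
    return ".".join(parts[k] for k in ("MAJOR", "MINOR", "PATCH"))


print(parse_version(HEADER))
```

Parsing a static header keeps a single source of truth for the version without a generated file in the build directory.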
2024-09-10 23:25:16 +02:00
Kyle Siefring and Ronald S. Bultje 4385e7e161 Improve density of group context setting macros
Shared object binary size reduction:
x86_64           : 16112 bytes
ARM64            : 16008 bytes
ARM64(+Os)       : 21592 bytes
ARMv7(+Os+mthumb): 18480 bytes

Size reduction of symbols:
x86_64           : 15712 bytes
ARM64            : 18688 bytes
ARM64(+Os)       : 18404 bytes
ARMv7(+Os+mthumb): 17322 bytes

Builds were done with clang version 18.1.8, and symbol sizes were
obtained using nm on the shared object.
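
The commit does not say which nm options were used. A minimal sketch of totalling symbol sizes, assuming output in the `address size type name` layout that GNU `nm -S` prints; the sample data below is made up and is not real dav1d output:

```python
def total_symbol_size(nm_output: str) -> int:
    """Sum the size column of `nm -S`-style lines: address, size, type, name."""
    total = 0
    for line in nm_output.splitlines():
        fields = line.split()
        # Undefined or unsized symbols lack the size column, so skip short lines.
        if len(fields) < 4:
            continue
        total += int(fields[1], 16)  # sizes are printed in hexadecimal
    return total


# Made-up samples resembling `nm -S` output before and after a change.
before = """\
0000000000001000 0000000000000040 T example_fn_a
0000000000001040 0000000000000100 T example_fn_b
                 U malloc
"""
after = """\
0000000000001000 0000000000000030 T example_fn_a
0000000000001030 00000000000000c0 T example_fn_b
                 U malloc
"""

print(total_symbol_size(before) - total_symbol_size(after), "bytes saved")
```

Comparing the two totals gives a size reduction of symbols in the same sense as the figures listed above.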

Provides speedups on older ARM64 CPUs with very little impact on other
CPUs.

Speedup:

c7i (skylake)
 Nature1080p      : x0.999
 Chimera          : x0.998

odroid C4
 Nature1080p      : x1.007
 Chimera          : x1.016
 Models1080p      : x1.005
 MountainBike1080p: x1.009
 Balloons1080p    : x1.008

Raspberry Pi 4
 Nature1080p      : x1.005
 Chimera          : x0.999
 Models1080p      : x0.999
 MountainBike1080p: x1.004
 Balloons1080p    : x1.003

Raspberry Pi 2 (Cortex-A7):
 (using size optimized build)
 Nature1080p      : x1.003
 Models1080p      : x0.997
2024-09-06 22:12:56 +00:00
Martin Storsjö 166e1df543 tests: Add an option to dav1d_argon.bash for using a wrapper tool
This allows executing all the tools within e.g. valgrind.

This matches the "meson test --wrap <tool>" feature.
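
The wrapper mechanism amounts to prefixing each tool invocation with the wrapper's own argument list. A minimal sketch in Python rather than bash; the function name, binary path and file name are placeholders, and the actual option name in dav1d_argon.bash is not shown in this commit message:

```python
import shlex
from typing import List, Optional


def wrap_command(cmd: List[str], wrapper: Optional[str]) -> List[str]:
    """Prefix cmd with the wrapper tool's argv, if a wrapper was given."""
    if not wrapper:
        return cmd
    # shlex.split honours shell-style quoting inside the wrapper string.
    return shlex.split(wrapper) + cmd


# Hypothetical invocation; the binary and file names are placeholders.
print(wrap_command(["./dav1d", "-i", "stream.obu"], "valgrind --error-exitcode=1"))
```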
2024-09-06 20:32:45 +00:00
Kyle Siefring and Martin Storsjö 79db162487 AArch64: New method for calculating sgr table
For the 3x3 part, double the width of the vertical loop. This is done to
provide more latency in the new sgr calculation.

Initial (master):  Cortex A53        A55        A72        A73       A76   Apple M1
sgr_3x3_8bpc_neon:   387702.8   383154.2   295742.4   302100.1  185420.7   472.2
sgr_5x5_8bpc_neon:   261725.1   256919.8   194205.1   197585.6  128311.3   332.9
sgr_mix_8bpc_neon:   628085.0   593664.2   453551.8   450553.8  281956.0   711.2

Current:
sgr_3x3_8bpc_neon:   368331.4   363949.7   275499.0   272056.3  169614.4   432.7
sgr_5x5_8bpc_neon:   257866.7   255265.5   195962.5   199557.8  120481.3   319.2
sgr_mix_8bpc_neon:   598234.1   572896.4   418500.4   438910.7  258977.7   659.3

Include a minor improvement that gets rid of a dup instruction.
2024-09-06 09:04:24 +00:00
Arpad Panyik and Martin Storsjö ec5c3052cf AArch64: Optimize lane load/store in MC functions
Partial register writes can create long dependency chains, which can
reduce performance on out-of-order CPUs. This patch removes most of
these kinds of problems in MC functions by filling the full register
before other lane loading instructions.

Most lane extracting stores can also be optimized using FP scalar
stores when the 0th lane would be extracted.

Relative runtime of micro benchmarks after this patch on some Neoverse
and Cortex CPU cores:

8bpc neon                V2      V1      X3      X1    A715     A78     A76
 avg        w8:       0.942x  1.030x  0.936x  0.935x  1.000x  0.877x  0.976x
 w_avg      w8:       0.908x  0.913x  0.919x  0.914x  0.999x  0.905x  0.910x
 mask       w8:       0.937x  0.905x  0.929x  0.907x  1.009x  0.921x  0.868x
 w_mask 420 w4:       0.969x  0.968x  0.951x  0.962x  0.995x  0.976x  0.958x
 w_mask 420 w8:       0.979x  0.935x  0.936x  0.935x  0.996x  0.948x  0.959x
 blend      w4:       0.721x  0.841x  0.764x  0.822x  0.772x  0.826x  0.883x
 blend      w8:       0.692x  0.733x  0.686x  0.730x  0.828x  0.723x  0.762x
 blend    h w2:       0.738x  0.776x  0.746x  0.775x  0.683x  0.827x  0.851x
 blend    h w4:       0.858x  0.942x  0.880x  0.933x  0.784x  0.924x  0.965x
 blend    h w8:       0.804x  0.807x  0.806x  0.805x  0.814x  0.810x  0.748x
 blend    v w2:       0.898x  0.931x  0.903x  0.949x  0.784x  0.867x  0.875x
 blend    v w4:       0.935x  0.905x  0.933x  0.922x  0.763x  0.777x  0.807x
 blend    v w8:       0.803x  0.802x  0.804x  0.815x  0.674x  0.677x  0.678x

16bpc neon               V2      V1      X3      X1    A715     A78     A76
 avg        w4:       0.899x  0.967x  0.897x  0.948x  1.002x  0.901x  0.884x
 w_avg      w4:       0.952x  0.951x  0.936x  0.946x  0.997x  0.937x  0.925x
 mask       w4:       0.893x  0.958x  0.887x  0.948x  1.003x  0.938x  0.934x
 w_mask 420 w4:       0.933x  0.932x  0.932x  0.939x  1.000x  0.910x  0.955x
 w_mask 420 w8:       0.966x  0.962x  0.967x  0.961x  1.000x  0.990x  1.010x
 blend      w4:       0.367x  0.361x  0.370x  0.352x  0.418x  0.394x  0.476x
 blend    h w2:       0.365x  0.445x  0.369x  0.437x  0.416x  0.576x  0.699x
 blend    h w4:       0.343x  0.402x  0.342x  0.398x  0.418x  0.525x  0.603x
 blend    v w2:       0.464x  0.460x  0.460x  0.447x  0.494x  0.446x  0.503x
 blend    v w4:       0.432x  0.424x  0.437x  0.416x  0.433x  0.427x  0.534x
 blend    v w8:       0.936x  0.847x  0.949x  0.848x  1.007x  0.811x  0.785x

bilinear 8bpc neon       V2      V1      X3      X1    A715     A78     A76
 mct     w4  0:       0.982x  0.983x  0.955x  1.029x  0.784x  0.817x  0.814x
 mc      w2  h:       0.277x  0.333x  0.275x  0.325x  0.299x  0.435x  0.518x
 mct     w4  h:       0.835x  0.862x  0.814x  0.887x  1.074x  0.899x  0.884x
 mc      w2  v:       0.887x  0.966x  0.894x  0.945x  0.808x  0.953x  0.997x
 mc      w4  v:       0.762x  0.899x  0.766x  0.867x  0.695x  0.915x  1.017x
 mct     w4  v:       0.700x  0.812x  0.740x  0.777x  0.777x  0.824x  0.853x
 mc      w2 hv:       0.928x  0.985x  0.929x  0.978x  0.789x  0.969x  1.010x
 mct     w4 hv:       0.887x  0.913x  0.912x  0.920x  1.001x  0.922x  0.937x

bilinear 16bpc neon      V2      V1      X3      X1    A715     A78     A76
 mc      w2  0:       0.991x  1.032x  0.993x  0.970x  0.878x  0.925x  0.999x
 mct     w4  0:       0.811x  0.730x  0.797x  0.680x  0.808x  0.711x  0.805x
 mc      w4  h:       0.885x  0.901x  0.895x  0.905x  1.003x  0.909x  0.910x
 mct     w4  h:       0.902x  0.914x  0.898x  0.896x  1.000x  0.897x  0.934x
 mc      w2  v:       0.888x  0.966x  0.913x  0.955x  0.824x  0.958x  1.005x
 mc      w4  v:       0.897x  0.894x  0.903x  0.902x  1.001x  0.895x  0.895x
 mct     w4  v:       0.924x  0.908x  0.921x  0.901x  1.001x  0.904x  0.918x
 mc      w4 hv:       0.927x  0.925x  0.924x  0.933x  1.000x  0.936x  0.959x
 mct     w4 hv:       0.923x  0.944x  0.923x  0.944x  0.999x  0.931x  0.956x

8tap 8bpc neon           V2      V1      X3      X1    A715     A78     A76
 mct regular w4  0:   0.829x  0.854x  0.735x  0.861x  0.769x  0.766x  0.840x
 mc  regular w2  h:   0.984x  1.008x  0.983x  1.012x  0.986x  0.989x  0.995x
 mc  sharp   w2  h:   0.987x  1.008x  0.986x  1.011x  0.985x  0.989x  0.995x
 mc  regular w4  h:   0.907x  0.911x  0.916x  0.908x  0.997x  0.936x  0.932x
 mc  sharp   w4  h:   0.916x  0.914x  0.918x  0.913x  0.999x  0.939x  0.905x
 mct regular w4  h:   0.992x  0.979x  0.993x  0.971x  1.000x  0.986x  0.976x
 mct sharp   w4  h:   0.991x  0.979x  0.989x  0.984x  1.001x  0.979x  0.983x
 mc  regular w2  v:   1.002x  1.001x  1.005x  1.000x  1.000x  0.998x  0.983x
 mc  sharp   w2  v:   1.005x  1.001x  1.009x  0.998x  0.994x  0.997x  0.989x
 mc  regular w4  v:   0.985x  0.998x  0.991x  0.998x  1.000x  1.000x  0.983x
 mc  sharp   w4  v:   1.005x  1.002x  1.006x  1.002x  0.998x  0.991x  0.999x
 mct regular w4  v:   0.966x  0.967x  0.961x  0.974x  0.996x  0.954x  0.982x
 mct sharp   w4  v:   0.970x  0.944x  0.967x  0.944x  0.997x  0.951x  0.966x
 mc  regular w2 hv:   0.993x  0.993x  0.994x  0.987x  0.993x  0.985x  0.999x
 mc  sharp   w2 hv:   0.994x  0.996x  0.992x  0.998x  0.997x  0.999x  0.999x
 mc  regular w4 hv:   0.964x  0.958x  0.964x  0.960x  0.982x  0.938x  0.958x
 mc  sharp   w4 hv:   0.982x  0.981x  0.980x  0.982x  0.995x  0.986x  0.941x
 mct regular w4 hv:   0.993x  0.994x  0.992x  0.994x  0.996x  0.992x  0.988x
 mct sharp   w4 hv:   0.993x  0.996x  0.991x  0.996x  0.954x  0.992x  1.011x

8tap 16bpc neon          V2      V1      X3      X1    A715     A78     A76
 mc  regular w2  0:   0.869x  1.059x  0.874x  0.956x  0.883x  0.932x  1.000x
 mct regular w4  0:   0.348x  0.369x  0.354x  0.377x  0.560x  0.409x  0.648x
 mc  regular w2  h:   0.996x  0.988x  0.992x  0.985x  0.989x  0.991x  1.006x
 mc  sharp   w2  h:   0.996x  0.989x  0.979x  0.991x  0.987x  0.988x  0.997x
 mc  regular w4  h:   0.957x  0.937x  0.957x  0.948x  0.961x  0.927x  0.994x
 mc  sharp   w4  h:   0.966x  0.940x  0.962x  0.954x  0.985x  0.929x  0.970x
 mct regular w4  h:   0.922x  0.942x  0.932x  0.933x  1.007x  0.938x  0.905x
 mct sharp   w4  h:   0.919x  0.943x  0.919x  0.931x  0.971x  0.943x  0.929x
 mc  regular w2  v:   1.000x  0.997x  1.001x  1.003x  1.001x  0.999x  0.984x
 mc  sharp   w2  v:   1.000x  0.999x  1.000x  0.999x  1.000x  1.000x  0.993x
 mc  regular w4  v:   0.936x  0.941x  0.936x  0.939x  0.999x  0.928x  0.981x
 mc  sharp   w4  v:   0.955x  0.961x  0.949x  0.956x  0.999x  0.947x  0.953x
 mct regular w4  v:   0.977x  0.966x  0.979x  0.968x  0.990x  0.972x  0.972x
 mct sharp   w4  v:   0.973x  0.965x  0.981x  0.963x  0.994x  0.977x  0.974x
 mc  regular w2 hv:   0.995x  1.001x  0.995x  0.995x  0.995x  1.000x  0.981x
 mc  sharp   w2 hv:   0.993x  1.012x  0.993x  0.988x  0.996x  0.992x  1.008x
 mc  regular w4 hv:   0.938x  0.943x  0.939x  0.943x  0.986x  0.943x  0.997x
 mc  sharp   w4 hv:   0.969x  0.959x  0.970x  0.974x  0.986x  0.993x  0.997x
 mct regular w4 hv:   0.942x  0.970x  0.951x  0.960x  0.977x  0.958x  1.018x
 mct sharp   w4 hv:   0.923x  0.958x  0.934x  0.955x  0.973x  0.946x  0.986x
2024-09-06 11:40:46 +03:00
Arpad Panyik and Martin Storsjö a992a9bede AArch64: Optimize Armv8.0 Neon path of SBD H/HV 6-tap filters
The 6-tap horizontal filters and the horizontal parts of the 6-tap HV
subpel filters can be further improved by some pointer arithmetic and by
saving some instructions (EXTs) in their data rearrangement code.

Relative runtime of micro benchmarks after this patch on Cortex CPU
cores:

SBD mct h         X1     A78     A76     A72     A55
 regular  w8:  0.878x  0.894x  0.990x  0.923x  0.944x
 regular w16:  0.962x  0.931x  0.943x  0.949x  0.949x
 regular w32:  0.937x  0.937x  0.972x  0.938x  0.947x
 regular w64:  0.920x  0.965x  0.992x  0.936x  0.944x

SBD mct hv        X1     A78     A76     A72     A55
 regular  w8:  0.931x  0.970x  0.951x  0.950x  0.971x
 regular w16:  0.940x  0.971x  0.941x  0.952x  0.967x
 regular w32:  0.943x  0.972x  0.946x  0.961x  0.974x
 regular w64:  0.943x  0.973x  0.952x  0.944x  0.975x
2024-09-06 08:08:08 +00:00
Arpad Panyik and Martin Storsjö 2d808de191 AArch64: Optimize Armv8.0 Neon path of HBD HV 6-tap filters
The horizontal parts of the 6-tap HV subpel filters can be further
improved by some pointer arithmetic and by saving some instructions
(EXTs) in their data rearrangement code.

Relative runtime of micro benchmarks after this patch on Cortex CPU
cores:

HBD mct hv        X1     A78     A76     A72     A55
 regular  w8:  0.952x  0.989x  0.924x  0.973x  0.976x
 regular w16:  0.961x  0.993x  0.928x  0.952x  0.971x
 regular w32:  0.964x  0.996x  0.930x  0.973x  0.972x
 regular w64:  0.963x  0.997x  0.930x  0.969x  0.974x
2024-09-06 07:50:38 +00:00
Arpad Panyik and Martin Storsjö 93339ce857 AArch64: Optimize Armv8.0 Neon path of HBD horizontal 6-tap filters
The 6-tap horizontal subpel filters can be further improved by some
pointer arithmetic and by saving some instructions (EXTs) in their data
rearrangement code.

Relative runtime of micro benchmarks after this patch on some Cortex
CPU cores:

regular:     X1      A78      A76      A55
 mc  w8:  0.915x   0.937x   0.900x   0.982x
 mc w16:  0.917x   0.947x   0.911x   0.971x
 mc w32:  0.914x   0.938x   0.873x   0.961x
 mc w64:  0.918x   0.932x   0.882x   0.964x
2024-09-06 07:38:18 +00:00
Arpad Panyik and Martin Storsjö 109b24277b AArch64: Optimize Armv8.0 Neon path of HBD horizontal filters
The reduction parts of the horizontal HBD MC filters use SRSHL+SQXTUN+
SRSHL instruction sequences. In the horizontal case this can be
rewritten using a single SQSHRUN instruction with an additional
rounding value (34 for 10-bit and 40 for 12-bit).
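
The arithmetic behind merging two rounding shifts into one can be checked with a scalar sketch. It assumes that the two shifts are by `6 - intermediate_bits` and `intermediate_bits`, with `intermediate_bits = 14 - bitdepth`; those shift amounts are an assumption not stated in this message. It also ignores the saturation that the Neon instructions perform:

```python
def two_step(acc: int, bitdepth: int) -> int:
    """Rounding shift by (6 - intermediate_bits), then by intermediate_bits."""
    intermediate_bits = 14 - bitdepth  # assumed: 4 for 10-bit, 2 for 12-bit
    s1 = 6 - intermediate_bits
    x = (acc + (1 << (s1 - 1))) >> s1  # first rounding shift
    # Second rounding shift brings the total shift to 6 bits.
    return (x + (1 << (intermediate_bits - 1))) >> intermediate_bits


def one_step(acc: int, bitdepth: int) -> int:
    """Single shift by 6 after adding the combined rounding constant."""
    rnd = {10: 34, 12: 40}[bitdepth]  # constants quoted in the commit message
    return (acc + rnd) >> 6


# Python's >> on ints is an arithmetic (floor) shift, also for negatives.
for bitdepth in (10, 12):
    assert all(
        two_step(a, bitdepth) == one_step(a, bitdepth)
        for a in range(-4096, 1 << 16)
    )
print("two-step and one-step rounding agree")
```

Under these assumptions the combined constant is the first rounding term plus 32: 2 + 32 = 34 for 10-bit and 8 + 32 = 40 for 12-bit.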

Relative runtime of micro benchmarks after this patch on some Cortex
CPU cores:

regular:     X1      A78      A76      A55
 mc  w2:  0.847x   0.864x   0.822x   0.859x
 mc  w4:  0.889x   0.994x   0.868x   0.917x
 mc  w8:  0.857x   0.911x   0.915x   0.978x
 mc w16:  0.890x   0.982x   0.868x   0.974x
 mc w32:  0.904x   0.991x   0.873x   0.967x
 mc w64:  0.919x   1.003x   0.860x   0.970x
2024-09-06 07:38:18 +00:00
Cameron Cawley and Ronald S. Bultje d268788467 Support using C11 aligned_alloc for dav1d_alloc_aligned 2024-09-05 12:36:00 +00:00