Unfortunately, this is a bit slower than the MMX version because
memory operands cannot be used with paddw here.
The situation would reverse if ff_dctB_mmx() had
to issue emms.

dctB_c:     3.7 ( 1.00x)
dctB_mmx:   3.3 ( 1.13x)
dctB_sse2:  3.6 ( 1.03x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>