diff options
author | Eric Biggers <ebiggers@kernel.org> | 2025-07-06 16:11:00 -0700 |
---|---|---|
committer | Greg Kroah-Hartman <gregkh@linuxfoundation.org> | 2025-08-20 18:40:48 +0200 |
commit | 497a2bf6b694f82b33110ac811527be6302e6f23 (patch) | |
tree | 22b8bcb8dac8845aeb30a1001040afc1353535e3 | |
parent | 197c62be7ee5c1deee880f79d4326140027f4def (diff) |
lib/crypto: x86/poly1305: Fix performance regression on short messages
commit 9f65592b7e1f24569bb6ced064df5b4319f725ce upstream.
Restore the len >= 288 condition on using the AVX implementation, which
was incidentally removed by commit 318c53ae02f2 ("crypto: x86/poly1305 -
Add block-only interface"). This check took into account the overhead
in key power computation, kernel-mode "FPU", and tail handling
associated with the AVX code. Indeed, restoring this check slightly
improves performance for len < 256 as measured using poly1305_kunit on
an "AMD Ryzen AI 9 365" (Zen 5) CPU:
Length Before After
====== ========== ==========
1 30 MB/s 36 MB/s
16 516 MB/s 598 MB/s
64 1700 MB/s 1882 MB/s
127 2265 MB/s 2651 MB/s
128 2457 MB/s 2827 MB/s
200 2702 MB/s 3238 MB/s
256 3841 MB/s 3768 MB/s
511 4580 MB/s 4585 MB/s
512 5430 MB/s 5398 MB/s
1024 7268 MB/s 7305 MB/s
3173 8999 MB/s 8948 MB/s
4096 9942 MB/s 9921 MB/s
16384 10557 MB/s 10545 MB/s
While the optimal threshold for this CPU might be slightly lower than
288 (see the len == 256 case), other CPUs would need to be tested too,
and these sorts of benchmarks can underestimate the true cost of
kernel-mode "FPU". Therefore, for now just restore the 288 threshold.
Fixes: 318c53ae02f2 ("crypto: x86/poly1305 - Add block-only interface")
Cc: stable@vger.kernel.org
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250706231100.176113-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-rw-r--r-- | arch/x86/lib/crypto/poly1305_glue.c | 8 |
1 files changed, 8 insertions, 0 deletions
diff --git a/arch/x86/lib/crypto/poly1305_glue.c b/arch/x86/lib/crypto/poly1305_glue.c index 968d84677631..856d48fd422b 100644 --- a/arch/x86/lib/crypto/poly1305_glue.c +++ b/arch/x86/lib/crypto/poly1305_glue.c @@ -98,7 +98,15 @@ void poly1305_blocks_arch(struct poly1305_block_state *state, const u8 *inp, BUILD_BUG_ON(SZ_4K < POLY1305_BLOCK_SIZE || SZ_4K % POLY1305_BLOCK_SIZE); + /* + * The AVX implementations have significant setup overhead (e.g. key + * power computation, kernel FPU enabling) which makes them slower for + * short messages. Fall back to the scalar implementation for messages + * shorter than 288 bytes, unless the AVX-specific key setup has already + * been performed (indicated by ctx->is_base2_26). + */ if (!static_branch_likely(&poly1305_use_avx) || + (len < POLY1305_BLOCK_SIZE * 18 && !ctx->is_base2_26) || unlikely(!irq_fpu_usable())) { convert_to_base2_64(ctx); poly1305_blocks_x86_64(ctx, inp, len, padbit); |