It’s odd that MKL doesn’t give you a big improvement over OpenBLAS with the 7900X.
Testing Julia with OpenBLAS vs. MKL on an i9-9940X:
OpenBLAS
julia> using LinearAlgebra
julia> BLAS.vendor()
:openblas64
julia> X = rand(10^4,10^4);
julia> BLAS.set_num_threads(14)
julia> @time inv(X);
4.893179 seconds (830.89 k allocations: 808.104 MiB, 0.36% gc time)
julia> @time inv(X);
4.756498 seconds (11 allocations: 767.899 MiB, 1.33% gc time)
julia> @time inv(X);
4.745121 seconds (11 allocations: 767.899 MiB, 1.14% gc time)
julia> @time foreach(inv, (X for _ ∈ 1:10));
46.807347 seconds (17.29 k allocations: 7.500 GiB, 0.04% gc time)
MKL
julia> using LinearAlgebra
julia> BLAS.vendor()
:mkl
julia> X = rand(10^4,10^4);
julia> BLAS.set_num_threads(14)
julia> @time inv(X);
3.024960 seconds (837.05 k allocations: 854.885 MiB, 0.14% gc time)
julia> @time inv(X);
2.763249 seconds (11 allocations: 814.286 MiB, 0.74% gc time)
julia> @time inv(X);
2.736513 seconds (11 allocations: 814.286 MiB, 1.77% gc time)
julia> @time foreach(inv, (X for _ ∈ 1:10));
27.015827 seconds (17.28 k allocations: 7.953 GiB, 0.20% gc time)
MKL was about 75% faster for matrix inversion (roughly 4.75 s vs. 2.74 s on the repeated runs, a ≈1.73× speedup).
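For reference, that speedup figure is just the ratio of the steady-state (second/third-run) timings quoted above:

```julia
# Ratio of the repeated-run inv(X) timings reported above (OpenBLAS / MKL).
openblas_t = 4.745121  # seconds, third OpenBLAS run
mkl_t      = 2.736513  # seconds, third MKL run
speedup = openblas_t / mkl_t
println(speedup)  # ≈ 1.73, i.e. MKL roughly 73–75% faster
```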
The complex matrix multiplication benchmarks (see here for all the code needed to reproduce this) show a similar pattern:
OpenBLAS
julia> using BenchmarkTools  # needed for @btime
julia> for N in 2 .^ (1:10)
           A = randn(N,N) + im*randn(N,N)
           B = randn(N,N) + im*randn(N,N)
           C = randn(N,N) + im*randn(N,N)
           println(N)
           @btime gemm!('N', 'N', one(ComplexF64), $A, $B, zero(ComplexF64), $C)
           @btime BLAS.gemm!('N', 'N', one(ComplexF64), $A, $B, zero(ComplexF64), $C)
       end
2
160.143 ns (0 allocations: 0 bytes)
117.131 ns (0 allocations: 0 bytes)
4
240.743 ns (0 allocations: 0 bytes)
146.918 ns (0 allocations: 0 bytes)
8
718.537 ns (0 allocations: 0 bytes)
364.333 ns (0 allocations: 0 bytes)
16
3.127 μs (0 allocations: 0 bytes)
1.599 μs (0 allocations: 0 bytes)
32
19.333 μs (0 allocations: 0 bytes)
8.293 μs (0 allocations: 0 bytes)
64
132.353 μs (0 allocations: 0 bytes)
27.014 μs (0 allocations: 0 bytes)
128
999.595 μs (0 allocations: 0 bytes)
144.694 μs (0 allocations: 0 bytes)
256
8.010 ms (0 allocations: 0 bytes)
643.952 μs (0 allocations: 0 bytes)
512
4.730 ms (0 allocations: 0 bytes)
2.997 ms (0 allocations: 0 bytes)
1024
33.351 ms (0 allocations: 0 bytes)
16.163 ms (0 allocations: 0 bytes)
julia> N = 1024; A = randn(N,N); B = randn(N,N); C = similar(A);
julia> @btime mul!($C,$A,$B);
4.356 ms (0 allocations: 0 bytes)
MKL
julia> using BenchmarkTools  # needed for @btime
julia> for N in 2 .^ (1:10)
           A = randn(N,N) + im*randn(N,N)
           B = randn(N,N) + im*randn(N,N)
           C = randn(N,N) + im*randn(N,N)
           println(N)
           @btime gemm!('N', 'N', one(ComplexF64), $A, $B, zero(ComplexF64), $C)
           @btime BLAS.gemm!('N', 'N', one(ComplexF64), $A, $B, zero(ComplexF64), $C)
       end
2
113.691 ns (0 allocations: 0 bytes)
165.543 ns (0 allocations: 0 bytes)
4
254.173 ns (0 allocations: 0 bytes)
339.594 ns (0 allocations: 0 bytes)
8
352.398 ns (0 allocations: 0 bytes)
462.586 ns (0 allocations: 0 bytes)
16
2.467 μs (0 allocations: 0 bytes)
909.077 ns (0 allocations: 0 bytes)
32
3.087 μs (0 allocations: 0 bytes)
2.578 μs (0 allocations: 0 bytes)
64
9.303 μs (0 allocations: 0 bytes)
5.005 μs (0 allocations: 0 bytes)
128
26.714 μs (0 allocations: 0 bytes)
18.382 μs (0 allocations: 0 bytes)
256
230.296 μs (0 allocations: 0 bytes)
117.873 μs (0 allocations: 0 bytes)
512
2.028 ms (0 allocations: 0 bytes)
922.646 μs (0 allocations: 0 bytes)
1024
19.121 ms (0 allocations: 0 bytes)
8.961 ms (0 allocations: 0 bytes)
julia> N = 1024; A = randn(N,N); B = randn(N,N); C = similar(A);
julia> @btime mul!($C,$A,$B);
1.727 ms (0 allocations: 0 bytes)
That last dgemm example (real Float64 mul!) was a 2.5× difference (4.36 ms vs. 1.73 ms).
Broadly, I see about a 2x performance improvement from MKL over OpenBLAS (of course, only on the actual BLAS/LAPACK operations!).
That is consistent with MKL using AVX-512 kernels while OpenBLAS doesn’t really take advantage of them. Temperatures and observed clock speeds support that too (i.e., you can set AVX2 and AVX-512 clock offsets in the BIOS; the CPU runs at AVX-512 clocks while using MKL but runs at AVX2 clocks while using OpenBLAS).
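One quick way to sanity-check whether a machine even advertises AVX-512 is to look at the kernel’s CPU feature flags. This is a sketch that assumes Linux (where the flags live in /proc/cpuinfo; the flag names such as avx512f are the kernel’s):

```julia
# Sketch, Linux-only: check the kernel-reported CPU feature flags
# for AVX2 and AVX-512 Foundation (avx512f) support.
flags = isfile("/proc/cpuinfo") ? read("/proc/cpuinfo", String) : ""
has_avx2   = occursin("avx2", flags)
has_avx512 = occursin("avx512f", flags)  # AVX-512 Foundation flag
println("AVX2: $has_avx2, AVX-512F: $has_avx512")
```

On non-Linux systems the flags string is empty here, so both checks simply report false rather than erroring.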