R studio bundle performance

Hi all,
I am trying Clear Linux for the first time, and I started by using this R script as a benchmark:
https://mac.r-project.org/benchmarks/R-benchmark-25.R

I am really impressed by the performance of the standard R bundle that comes with Clear Linux. Here are the benchmark results:

Intel i9-7900X system, 4.5 GHz, 10 cores / 20 threads

| Configuration | Benchmark time |
|---|---|
| Clear Linux + stock R 3.6.1 bundle | 2.02 seconds |
| Clear Linux + manual build of R 3.6.1 with MKL for BLAS and LAPACK backend | 25 seconds |
| Clear Linux + manual build of R 3.6.1 without MKL | 27 seconds |
| Windows + stock R 3.6.1 | 27 seconds |
| Clear Linux + MKL + explicitly defined number of threads = 10 | 2.2 seconds |

How is the stock R 3.6.1 bundle so much faster than the MKL build? I would have thought the MKL build using this guide would be faster: https://software.intel.com/en-us/articles/using-intel-mkl-with-r

I would love to know how to manually build R from source so that it gives me the same performance as the R bundle that comes with Clear Linux.

Update: I rebuilt with MKL, but this time I explicitly set OMP_NUM_THREADS and MKL_NUM_THREADS to 10. I had assumed they would default to 10 based on my physical core count. With them set explicitly, I now get much better performance.

I would say something is not working correctly. I would expect the MKL build to be significantly faster than the standard R build (which, as you’ve seen, is horribly slow!). If it were still using, say, the reference LAPACK that I think comes in the tarball, that would explain why it is so slow. I wouldn’t necessarily expect it to be any faster than what CL ships by default, but it should be much closer than the results you’re showing.

You would need to profile the various builds to determine what’s going on, and/or check /proc/${PID}/maps to confirm that R is loading the correct libraries.
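As a rough sketch of the /proc/${PID}/maps check, the Python snippet below scans a maps listing and reports the mapped files whose names look like BLAS/LAPACK libraries. The helper names (`blas_libraries`, `blas_libraries_for_pid`) and the substring list are assumptions made for illustration; they are not defined by R, MKL, or Clear Linux, and the list is not exhaustive.

```python
import os

# Substrings that suggest a BLAS/LAPACK implementation (assumed, non-exhaustive).
BLAS_HINTS = ("mkl", "openblas", "blas", "lapack")


def blas_libraries(maps_text):
    """Return sorted, de-duplicated paths of mapped files that look like BLAS/LAPACK libraries."""
    found = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping with no backing file
        path = fields[5].strip()
        name = os.path.basename(path).lower()
        if any(hint in name for hint in BLAS_HINTS):
            found.add(path)
    return sorted(found)


def blas_libraries_for_pid(pid):
    """Read /proc/<pid>/maps (Linux only) and report the BLAS-like libraries it maps."""
    with open(f"/proc/{pid}/maps") as handle:
        return blas_libraries(handle.read())
```

With an R session running, calling `blas_libraries_for_pid(<pid of R>)` should list an MKL library for the MKL build and an OpenBLAS library for the stock bundle; if only a reference BLAS shows up, the build is not linked the way you intended.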

I ran perf on both versions and noticed that my MKL build didn’t look like it was multithreading properly.

The left image is the Clear Linux standard package, and the right is the MKL build from source without the thread count explicitly defined.

I rebuilt with MKL, but this time I explicitly set OMP_NUM_THREADS and MKL_NUM_THREADS to 10. I had assumed they would default to 10 based on my physical core count. I now get much better performance, about 2.2 seconds (close to Intel’s Clear Linux package, but not quite as fast).
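For anyone wanting to reproduce this, here is a minimal sketch of launching the benchmark with both variables pinned explicitly. It assumes `Rscript` is on the PATH and that the benchmark script has been downloaded locally as `R-benchmark-25.R`; the helper names `threaded_env` and `run_benchmark` are hypothetical.

```python
import os
import shutil
import subprocess


def threaded_env(num_threads, base_env=None):
    """Return a copy of the environment with both thread-count variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(num_threads)
    env["MKL_NUM_THREADS"] = str(num_threads)
    return env


def run_benchmark(script="R-benchmark-25.R", num_threads=10):
    """Run the R benchmark script with an explicit thread count (assumes Rscript is installed)."""
    rscript = shutil.which("Rscript")
    if rscript is None:
        raise RuntimeError("Rscript not found on PATH")
    return subprocess.run([rscript, script], env=threaded_env(num_threads), check=True)
```

Setting the variables in the child environment rather than inside R means they are in place before the BLAS library is loaded.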

That seems more in line with what one would expect. I would posit that one of the main reasons MKL exists is that most distros use the reference BLAS (which is excruciatingly slow) and build for a 12-year-old processor baseline (no AVX instructions). So you could get 10x the performance just by switching out the BLAS library. The 27 s you saw with the default build may well be what you would get on some distros.

However, as you can see, CL ships with OpenBLAS and AVX-optimized builds, which should perform about the same as MKL (making the MKL effort unnecessary), with one caveat: there are still a couple of slow functions in OpenBLAS, and if your program hits them, MKL becomes much faster (someone reported a couple of bugs on the tracker about this). If you don’t hit this slower code (as in this R case), there’s likely not much difference.
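To see whether a given machine can benefit from AVX-optimized builds at all, one option on x86 Linux is to read the `flags` line in /proc/cpuinfo. The sketch below is only an illustration: the helper names and the choice of flags to report (`avx`, `avx2`, `avx512f`) are assumptions, and it says nothing about which code path a particular BLAS build actually selects at run time.

```python
def simd_support(cpuinfo_text, wanted=("avx", "avx2", "avx512f")):
    """Map each wanted flag to True/False based on the first 'flags' line found."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            present = set(value.split())
            return {flag: flag in present for flag in wanted}
    # No flags line at all (for example, a non-x86 system).
    return {flag: False for flag in wanted}


def read_cpuinfo(path="/proc/cpuinfo"):
    """Return the raw text of /proc/cpuinfo (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

On a Linux box, `print(simd_support(read_cpuinfo()))` prints one True/False entry per flag.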

You can tweak each implementation here and there to get a small difference, but you’d have to be quite performance-sensitive to really care. There’s also the fact that the CL build uses PGO (profile-guided optimization), which may reduce the time R spends in the non-math parts of the program and so explain the remaining difference between the MKL build and the CL default.


That’s odd that MKL doesn’t give you a big improvement over OpenBLAS with the 7900X.

Testing Julia with OpenBLAS vs MKL with an i9 9940X…
OpenBLAS

julia> using LinearAlgebra

julia> BLAS.vendor()
:openblas64

julia> X = rand(10^4,10^4);

julia> BLAS.set_num_threads(14)

julia> @time inv(X);
  4.893179 seconds (830.89 k allocations: 808.104 MiB, 0.36% gc time)

julia> @time inv(X);
  4.756498 seconds (11 allocations: 767.899 MiB, 1.33% gc time)

julia> @time inv(X);
  4.745121 seconds (11 allocations: 767.899 MiB, 1.14% gc time)

julia> @time foreach(inv, (X for _ ∈ 1:10));
 46.807347 seconds (17.29 k allocations: 7.500 GiB, 0.04% gc time)

MKL

julia> using LinearAlgebra

julia> BLAS.vendor()
:mkl

julia> X = rand(10^4,10^4);

julia> BLAS.set_num_threads(14)

julia> @time inv(X);
  3.024960 seconds (837.05 k allocations: 854.885 MiB, 0.14% gc time)

julia> @time inv(X);
  2.763249 seconds (11 allocations: 814.286 MiB, 0.74% gc time)

julia> @time inv(X);
  2.736513 seconds (11 allocations: 814.286 MiB, 1.77% gc time)

julia> @time foreach(inv, (X for _ ∈ 1:10));
 27.015827 seconds (17.28 k allocations: 7.953 GiB, 0.20% gc time)

MKL was about 75% faster for matrix inversion.

For complex matrix multiplication benchmarks (see here for all the code needed to reproduce this), I get a similar pattern:
OpenBLAS

julia> for N in 2 .^(1:10)
           A = randn(N,N) + im*randn(N,N)
           B = randn(N,N) + im*randn(N,N)
           C = randn(N,N) + im*randn(N,N)
           println(N)
           @btime gemm!('N', 'N', one(ComplexF64), $A, $B, zero(ComplexF64), $C)
           @btime BLAS.gemm!('N', 'N', one(ComplexF64), $A, $B, zero(ComplexF64), $C)
       end
2
  160.143 ns (0 allocations: 0 bytes)
  117.131 ns (0 allocations: 0 bytes)
4
  240.743 ns (0 allocations: 0 bytes)
  146.918 ns (0 allocations: 0 bytes)
8
  718.537 ns (0 allocations: 0 bytes)
  364.333 ns (0 allocations: 0 bytes)
16
  3.127 μs (0 allocations: 0 bytes)
  1.599 μs (0 allocations: 0 bytes)
32
  19.333 μs (0 allocations: 0 bytes)
  8.293 μs (0 allocations: 0 bytes)
64
  132.353 μs (0 allocations: 0 bytes)
  27.014 μs (0 allocations: 0 bytes)
128
  999.595 μs (0 allocations: 0 bytes)
  144.694 μs (0 allocations: 0 bytes)
256
  8.010 ms (0 allocations: 0 bytes)
  643.952 μs (0 allocations: 0 bytes)
512
  4.730 ms (0 allocations: 0 bytes)
  2.997 ms (0 allocations: 0 bytes)
1024
  33.351 ms (0 allocations: 0 bytes)
  16.163 ms (0 allocations: 0 bytes)

julia> N = 1024; A = randn(N,N); B = randn(N,N); C = similar(A);

julia> @btime mul!($C,$A,$B);
  4.356 ms (0 allocations: 0 bytes)

MKL

julia> for N in 2 .^(1:10)
           A = randn(N,N) + im*randn(N,N)
           B = randn(N,N) + im*randn(N,N)
           C = randn(N,N) + im*randn(N,N)
           println(N)
           @btime gemm!('N', 'N', one(ComplexF64), $A, $B, zero(ComplexF64), $C)
           @btime BLAS.gemm!('N', 'N', one(ComplexF64), $A, $B, zero(ComplexF64), $C)
       end
2
  113.691 ns (0 allocations: 0 bytes)
  165.543 ns (0 allocations: 0 bytes)
4
  254.173 ns (0 allocations: 0 bytes)
  339.594 ns (0 allocations: 0 bytes)
8
  352.398 ns (0 allocations: 0 bytes)
  462.586 ns (0 allocations: 0 bytes)
16
  2.467 μs (0 allocations: 0 bytes)
  909.077 ns (0 allocations: 0 bytes)
32
  3.087 μs (0 allocations: 0 bytes)
  2.578 μs (0 allocations: 0 bytes)
64
  9.303 μs (0 allocations: 0 bytes)
  5.005 μs (0 allocations: 0 bytes)
128
  26.714 μs (0 allocations: 0 bytes)
  18.382 μs (0 allocations: 0 bytes)
256
  230.296 μs (0 allocations: 0 bytes)
  117.873 μs (0 allocations: 0 bytes)
512
  2.028 ms (0 allocations: 0 bytes)
  922.646 μs (0 allocations: 0 bytes)
1024
  19.121 ms (0 allocations: 0 bytes)
  8.961 ms (0 allocations: 0 bytes)

julia> N = 1024; A = randn(N,N); B = randn(N,N); C = similar(A);

julia> @btime mul!($C,$A,$B);
  1.727 ms (0 allocations: 0 bytes)

That last dgemm example was a 2.5x difference.

Broadly, I see about a 2x performance improvement from MKL over OpenBLAS (of course, only on the actual BLAS/LAPACK operations!).
That is consistent with MKL supporting AVX-512 while OpenBLAS doesn’t really. Temperatures and observed clock speeds support that too (i.e., you can set AVX2 and AVX-512 clock offsets in the BIOS; the CPU runs at AVX-512 speeds while using MKL but at AVX2 speeds while using OpenBLAS).
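For reference, the ratios behind those figures can be recomputed from the timings in the transcripts above. The snippet is just arithmetic on numbers copied from this post (the third timed `inv` run, the 10-call loop, and the N = 1024 rows); the dictionary keys and the `speedup` helper are my own labels.

```python
# (OpenBLAS seconds, MKL seconds), copied from the transcripts above.
timings = {
    "inv, single call": (4.745121, 2.736513),
    "inv, 10 calls": (46.807347, 27.015827),
    "complex BLAS.gemm!, N=1024": (16.163e-3, 8.961e-3),
    "real mul!, N=1024": (4.356e-3, 1.727e-3),
}


def speedup(openblas_s, mkl_s):
    """How many times faster MKL ran than OpenBLAS for the same operation."""
    return openblas_s / mkl_s


ratios = {name: speedup(*pair) for name, pair in timings.items()}
for name, ratio in ratios.items():
    print(f"{name}: MKL ran {ratio:.2f}x as fast as OpenBLAS")
```

The inversion cases come out a little above 1.7x and the real `mul!` case at about 2.5x, which is where the "about 2x" summary comes from.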


They are certainly very different benchmarks, and my comments are largely based on experience with R (as that’s what I’m familiar with). The R benchmark covers a variety of functions, and not all of them require BLAS. R itself is also single-threaded; it is the calls out to BLAS that can use multiple threads, and that provides most of the performance benefit over the default build. I would estimate the single-threaded computation time at roughly 1.5 s in R and roughly 22 s in BLAS. Threading cuts the BLAS portion down to about 1 s, meaning the benchmark then spends more time outside BLAS than in it. The more threads you throw at it, the less the BLAS performance matters.
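To make that estimate concrete, here is a small back-of-the-envelope model in the spirit of Amdahl’s law. The 1.5 s and 22 s figures are the rough estimates from this post; the 22x threading speedup is simply what takes 22 s down to about 1 s, and the extra 2x is a hypothetical faster BLAS. None of these are measurements.

```python
NON_BLAS_S = 1.5   # estimated time in R itself (not sped up by BLAS threading)
BLAS_1T_S = 22.0   # estimated single-threaded time inside BLAS


def total_time(non_blas_s, blas_single_thread_s, blas_speedup):
    """Total run time when only the BLAS portion is sped up."""
    return non_blas_s + blas_single_thread_s / blas_speedup


single_threaded = total_time(NON_BLAS_S, BLAS_1T_S, 1)    # 23.5 s
threaded = total_time(NON_BLAS_S, BLAS_1T_S, 22)          # 2.5 s
threaded_2x_blas = total_time(NON_BLAS_S, BLAS_1T_S, 44)  # 2.0 s

print(f"single-threaded BLAS:      {single_threaded:.1f} s")
print(f"threaded BLAS (22x):       {threaded:.1f} s")
print(f"threaded + 2x faster BLAS: {threaded_2x_blas:.1f} s")
```

Under these assumptions, doubling BLAS speed on top of threading only trims the total from 2.5 s to 2.0 s, because the 1.5 s spent in R itself does not shrink.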

The full integration of R in CL, with its PGO build, will certainly help reduce the time spent in R itself, whereas the MKL benchmark used a self-compiled build (which may be slower in the non-BLAS parts).

For the performance difference to matter, you would need a sufficiently large dataset. If performance is crucial, you would probably want to use a different program altogether!
