This documents a fairly neat hack to speed up linear algebra transparently under a number of Fedora/EPEL packages and demonstrates the need for a linear algebra policy like Debian’s.

The Issue

Fedora contains several packages that implement the BLAS (and possibly also LAPACK) interfaces. You can end up with packages linked against several of these at once, with unclear results. Apart from that, they are generally slower than the state of the art by a factor of several; that goes even for ATLAS, let alone the reference BLAS that various packages, such as R, use. In fact, there are multiple versions of ATLAS for different SIMD architectures, which adds to the complication, although there’s no AVX version packaged.[1]

At least on x86(_64), OpenBLAS is essentially optimal (about the same speed as Intel MKL for the most important operations — specifically GEMM) and there’s no reason to have anything else installed apart from testing, e.g. against the reference.[2]

The whole business is a mess. On the other hand, Debian has a policy on linear algebra libraries, which mandates interchangeable libraries (with a common ELF soname) that implement the BLAS/LAPACK interface. Normally this will mean having OpenBLAS installed on x86 via the alternatives system.

In the absence of a similar policy and implementation for Fedora, we can use a hack to subvert the other libraries with OpenBLAS and achieve the same sort of effect as the Debian policy.

Effect

This will transparently improve the overall efficiency of installed packages doing linear algebra (e.g. on an HPC cluster), though obviously by an amount determined by how those packages are used, which I’m not in a position to measure on a representative system.

For orientation, on Sandy Bridge with vanilla RHEL6.6/EPEL packages, and the single-threaded LINPACK benchmark, the openblas library is asymptotically around nine times faster than reference lapack/blas and four times faster than atlas-sse3 (the best variant for SB that is packaged). OpenBLAS also comfortably beat an AVX version of ATLAS, but I don’t have the numbers now. See below for an example.

Workaround Implementation

The openblas-compat Fedora/EPEL package does the job. It provides trivial shared LAPACK/BLAS library shims which can replace those from the reference blas, lapack, atlas, gsl, and R packages at run time so that binaries dynamically linked against those automatically get the benefit of the generally-faster OpenBLAS routines. It can be installed in parallel with the reference versions and others. (This is obviously restricted to the architectures supported by OpenBLAS, currently x86, x86_64, arm, and ppc.)

It overrides others via ldconfig configuration. (Note that the older atlas in EPEL already overrides the reference versions in the same way.) You can still use LD_LIBRARY_PATH to pick up the original versions for testing, of course.

The implementation is as follows. We loop over names $n of the relevant libraries, and for each of them, do something like

cc -shared -o lib$n.so.$sover -Wl,-soname,lib$n.so.$sover -lopenblas

where $sover is currently 3. (It might not be obvious that linking like that, with just the library and no input objects, will work.) Then we drop a .conf file into /etc/ld.so.conf.d that refers to the openblas-compat library sub-directory. Experimentally, that file needs to sort before the others in the directory; I couldn’t find documentation of whether the first or last matching entry wins. Obviously the same tactic could be used in other cases where the same ABI is presented.
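As a concrete sketch of that loop (the library names are a subset for illustration, and libm stands in for -lopenblas so the example links without OpenBLAS installed; none of this is the exact packaging script):

```shell
# Build empty shims: with no input objects, each link produces a shared
# object whose only payload is a DT_NEEDED entry for the target library,
# so the dynamic linker resolves the BLAS/LAPACK symbols through it.
sover=3
for n in blas lapack Rblas Rlapack; do
  cc -shared -o lib$n.so.$sover -Wl,-soname,lib$n.so.$sover -lm
done
# The package installs the shims in a sub-directory and points ldconfig
# at it; here we just stage that locally (illustrative paths).
mkdir -p compat && mv lib*.so.$sover compat/
echo "$PWD/compat" > openblas-compat.conf  # destined for /etc/ld.so.conf.d
```

In the real package the loop covers all the overridden names listed below, links against -lopenblas, and the .conf file is installed under /etc/ld.so.conf.d.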

After installing this, we can see, for instance

$ ldd /usr/lib64/R/bin/exec/R | grep blas
     libRblas.so => /usr/lib64/openblas-compat/libRblas.so (0x00007f052f458000)
     libopenblas.so.0 => /usr/lib64/libopenblas.so.0 (0x00007f052b5b5000)

as opposed to vanilla

$ ldd /usr/lib64/R/bin/exec/R | grep blas
     libRblas.so => /usr/lib64/R/lib/libRblas.so (0x00007fbfee05e000)

The current list of overridden libraries is: libRblas.so, libRlapack.so (from R); libblas.so.3, liblapack.so.3, liblapacke.so.3 (reference); libblas64.so.3, liblapack64.so.3 (64-bit integer reference); libcblas.so.3 (C-callable version of reference); libclapack.so.3, libf77blas.so.3 (ATLAS); libgslcblas.so.0 (from GSL). There are also substitutes for the ATLAS threaded libraries and OpenMP-based substitutes for the serial ones, accessed with, e.g.

LD_LIBRARY_PATH=/usr/lib64/openblas-compat-openmp

to provide potential parallel acceleration for otherwise-serial programs if linear algebra is their bottleneck.
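For instance, an otherwise-serial, dynamically linked program (represented by /bin/true here purely so the sketch runs anywhere) would be launched as:

```shell
# /bin/true stands in for a real BLAS-using binary; the directory is the
# one installed by openblas-compat's OpenMP variant.  With the shims in
# place the loader resolves libblas/liblapack through OpenBLAS, and
# OMP_NUM_THREADS controls its thread count.
OMP_NUM_THREADS=4 \
  LD_LIBRARY_PATH=/usr/lib64/openblas-compat-openmp /bin/true
```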

Addendum: Fedora Proposal

There is now a proposal for a sensible policy which is roughly what I should have written earlier, but couldn’t summon the energy to try to push through. (Unfortunately, it turns out Orion doesn’t have time to pursue it, so someone else needs to….)

Addendum: Example

Note
The figures for the Fedora R package are now obsolete, since version 3.3.2 is built with openblas (in the testing repositories at the time of writing), a change influenced by these figures; but they seem worth preserving for comparison.

I was referred to a blog entry, referenced from another by Mike Croucher, which advocates Intel-lectual rebuilds of R, specifically using Intel MKL for BLAS/LAPACK. (Elsewhere, Microsoft advertises a GNU/Linux(!) binary distribution using MKL,[3] and pushes it with similar benchmarks.)

Unfortunately, measurements and reference to the MKL licence conditions don’t make much impression on the mythology around the proprietary Intel tools. However, here are results of redoing Mike Croucher’s example to demonstrate this technique, OpenBLAS AVX performance, and how Fedora technical computing is missing out on linear algebra performance and PR.

The following results were obtained with EPEL6 (R-3.3.1, blas/lapack-3.2.1, openblas-0.2.18) using openblas-compat and a similar mkl-compat package​[4] which linked MKL from a distribution labelled ‘composer_xe_2015.2.164’. The system was a 2.2 GHz E5-2660 (Sandy Bridge). An example interactive run for the vanilla version is as follows, and the explicit binding would be elided in batch:

R_LD_LIBRARY_PATH=/usr/lib64/R/lib hwloc-bind core:8 Rscript \
  linear_algebra_bench.r x

The numbers are elapsed time in seconds, apart from the speed-up ratio of the first two rows, and their variance over multiple runs is ≲1%. They’re similar to the ones from Sheffield (which I assume were obtained on a 2.6 GHz Ivy Bridge), except that the reference BLAS is somewhat slower and the MKL results don’t scale with that assumed clock frequency:

                     Matmul   Chol    SVD     PCA     LDA
Vanilla EPEL (s)      149.7   21.9    51.5    201.4   148.0
OpenBLAS (s)           11.6    2.13    9.32    26.0    33.2
OB speed-up ratio      12.9   10.2     5.5      7.7     4.4
MKL (s)                12.2    2.18    8.65    25.7    32.3

OpenBLAS scales similarly to MKL with multiple threads. For example,[5] using

OMP_NUM_THREADS=4 R_LD_LIBRARY_PATH=/usr/lib64/openblas-compat-openmp

for the first case:

                     Matmul   Chol    SVD     PCA     LDA
OpenBLAS (s)           3.22   0.74    3.73    9.82    22.4
MKL (s)                3.38   1.54    3.00    8.50    21.3

Conclusion

So in most cases you don’t need to spend time rebuilding and linking R against proprietary libraries until linear algebra in Fedora is sorted out; just install the normal R packages and openblas-compat.[6] On an AVX512 system, you can hold your nose and pull the same trick with mkl-compat if the proprietary library is available and you have a linear algebra bottleneck.


1. There’s no really satisfactory way to deal with architecture-specific libraries like ATLAS, as opposed to those like OpenBLAS which dispatch on the architecture at run time; that deserves an essay itself.
2. Unfortunately, that now doesn’t hold for Knights Landing and future systems supporting AVX512 — MKL is ∼3 times faster on DGEMM than OpenBLAS, which currently supports only up to AVX2; ATLAS has unreleased AVX512 support for Knights Corner which may or may not be adaptable, but OpenBLAS substantially out-performed ATLAS tuned for AVX when I compared them, which isn’t encouraging.
3. I’m not sure how that complies with R’s GPL licence.
4. There’s no built mkl-compat — see comments in the spec file.
5. R’s ldpaths file is modified by openblas-compat to make R_LD_LIBRARY_PATH work.
6. Of course, R functions that don’t depend on BLAS might gain from rebuilding the package with better optimization options, but that’s not relevant here.