This documents a fairly neat hack to speed up linear algebra transparently under a number of Fedora/EPEL packages, and demonstrates the need for a linear algebra policy like Debian’s.
The Issue
Fedora contains several packages that implement the BLAS (and maybe LAPACK) interfaces. You can end up with packages linked against more than one of them, with unclear results. Apart from that, they are generally slow by a factor of several relative to the best available — even ATLAS, let alone the reference BLAS which various packages, such as R, use. In fact, there are multiple versions of ATLAS for different SIMD architectures, which increases the complication, although there’s no AVX version packaged.^{[1]}
At least on x86(_64), OpenBLAS is essentially optimal (about the same speed as Intel MKL for the most important operations, specifically GEMM), and there’s no reason to have anything else installed except for testing, e.g. against the reference implementation.^{[2]}
The whole business is a mess. On the other hand, Debian has a policy on linear algebra libraries, which mandates interchangeable libraries (with a common ELF soname) that implement the BLAS/LAPACK interface. Normally this will mean having OpenBLAS installed on x86 via the alternatives system.
In the absence of a similar policy and implementation for Fedora, we can use a hack to subvert the other libraries with OpenBLAS and achieve the same sort of effect as the Debian policy.
Effect
This will transparently improve the overall efficiency of installed packages doing linear algebra (e.g. on an HPC cluster), though obviously by an amount determined by how those packages are used, which I’m not in a position to measure on a representative system.
For orientation, on Sandy Bridge with vanilla RHEL 6.6/EPEL packages and the single-threaded LINPACK benchmark, the openblas library is asymptotically around nine times faster than the reference lapack/blas and four times faster than atlas-sse3 (the best variant for Sandy Bridge that is packaged). OpenBLAS also comfortably beat an AVX version of ATLAS, but I don’t have the numbers now. See below for an example.
Workaround Implementation
The openblas-compat Fedora/EPEL package does the job. It provides trivial shared LAPACK/BLAS library shims which can replace those from the reference blas, lapack, atlas, gsl, and R packages at run time, so that binaries dynamically linked against those automatically get the benefit of the generally faster OpenBLAS routines. It can be installed in parallel with the reference versions and others. (This is obviously restricted to the architectures supported by OpenBLAS, currently x86, x86_64, arm, and ppc.)
It overrides the others via ldconfig configuration. (Note that the older atlas in EPEL already overrides the reference versions in the same way.) You can still use LD_LIBRARY_PATH to pick up the original versions for testing, of course.
The implementation is as follows. We loop over the names $n of the relevant libraries, and for each of them do something like

cc -shared -o lib$n.so.$sover -Wl,-soname,lib$n.so.$sover -lopenblas

where $sover is currently 3. (It might not be obvious that linking like that, with just the library and no objects, will work.) Then we drop a .conf file into /etc/ld.so.conf.d that refers to the openblas-compat library subdirectory. Experimentally, it needs to sort to the front of that directory — I couldn’t find documentation of whether the first or last matching entry is used. Obviously the same tactic might be used in other cases where the same ABI is presented.
After installing this, we can see, for instance,

$ ldd /usr/lib64/R/bin/exec/R | grep blas
	libRblas.so => /usr/lib64/openblas-compat/libRblas.so (0x00007f052f458000)
	libopenblas.so.0 => /usr/lib64/libopenblas.so.0 (0x00007f052b5b5000)

as opposed to vanilla

$ ldd /usr/lib64/R/bin/exec/R | grep blas
	libRblas.so => /usr/lib64/R/lib/libRblas.so (0x00007fbfee05e000)
The current list of overridden libraries is: libRblas.so, libRlapack.so (from R); libblas.so.3, liblapack.so.3, liblapacke.so.3 (reference); libblas64.so.3, liblapack64.so.3 (64-bit integer reference); libcblas.so.3 (C-callable version of reference); libclapack.so.3, libf77blas.so.3 (ATLAS); libgslcblas.so.0 (from GSL). There are also substitutes for the ATLAS threaded libraries, and OpenMP-based substitutes for the serial ones, accessed with, e.g.,

LD_LIBRARY_PATH=/usr/lib64/openblas-compat-openmp

to provide potential parallel acceleration for otherwise-serial programs if linear algebra is their bottleneck.
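That environment setting can be wrapped in a small shell function for convenience; the function name and example program are hypothetical, and the shim directory is the one named above:

```shell
# Run a command with the OpenMP BLAS shims ahead of the default
# search path, defaulting to four threads if OMP_NUM_THREADS is unset.
run_with_openmp_blas() {
    OMP_NUM_THREADS=${OMP_NUM_THREADS:-4} \
    LD_LIBRARY_PATH=/usr/lib64/openblas-compat-openmp${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH} \
        "$@"
}

# e.g.:
# run_with_openmp_blas ./serial_solver input.dat
```

The setting only lasts for the one run, so serial and threaded BLAS can be compared side by side.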
Addendum: Fedora Proposal
There is now a proposal for a sensible policy, which is roughly what I should have written earlier but couldn’t summon the energy to try to push through. (Unfortunately, it turns out Orion doesn’t have time to pursue it, so someone else needs to….)
Addendum: Example
Note — The R package figures below are obsolete for EPEL7 and Fedora, since from version 3.3.2 R is built with OpenBLAS there (influenced by these figures). However, that’s not the case for EPEL6, for some reason, and they seem worth preserving for comparison anyway.
I was referred to a blog entry, referenced from another by Mike Croucher, which advocates rebuilds of R using Intel MKL for BLAS/LAPACK. (Elsewhere, Microsoft advertises a GNU/Linux(!) binary distribution using MKL,^{[3]} and pushes it with similar benchmarks.)
Unfortunately, measurements and reference to the MKL licence conditions don’t make much impression on the mythology around the proprietary Intel tools. However, here are results of redoing Mike Croucher’s example to demonstrate this technique, OpenBLAS AVX performance, and how Fedora technical computing is missing out on linear algebra performance and PR.
The following results were obtained with EPEL6 (R 3.3.1, blas/lapack 3.2.1, openblas 0.2.18), using openblas-compat and a similar mkl-compat package^{[4]} which linked MKL from a distribution labelled ‘composer_xe_2015.2.164’. The system was a 2.2 GHz E5-2660 (Sandy Bridge). An example interactive run for the vanilla version is as follows (the explicit binding would be elided in batch):

R_LD_LIBRARY_PATH=/usr/lib64/R/lib hwloc-bind core:8 Rscript \
    linear_algebra_bench.r x
The numbers are elapsed time in seconds, apart from the speed-up ratio of the first two rows, and their variance over multiple runs is ≲1%. They’re similar to the ones from Sheffield — which I assume were obtained on a 2.6 GHz Ivy Bridge — except that the reference BLAS is somewhat slower and the MKL results don’t scale with that assumed clock frequency:
                    Matmul   Chol    SVD    PCA    LDA
Vanilla EPEL (s)     149.7   21.9   51.5  201.4  148.0
OpenBLAS (s)          11.6    2.13   9.32   26.0   33.2
OB speed-up ratio     12.9   10.2    5.5    7.7    4.4
MKL (s)               12.2    2.18   8.65   25.7   32.3
OpenBLAS scales similarly to MKL with multiple threads. For example,^{[5]} using

OMP_NUM_THREADS=4 R_LD_LIBRARY_PATH=/usr/lib64/openblas-compat-openmp

for the first case:
               Matmul   Chol    SVD    PCA    LDA
OpenBLAS (s)     3.22   0.74   3.73   9.82   22.4
MKL (s)          3.38   1.54   3.00   8.50   21.3
Conclusion
So in most cases you don’t need to spend time rebuilding and linking R with proprietary things until linear algebra in Fedora is sorted out: just install the normal R packages and openblas-compat.^{[6]} On an AVX-512 system, you can hold your nose and pull the same trick with mkl-compat if the proprietary library is available and you have a linear algebra bottleneck.