This documents a fairly neat hack to speed up linear algebra transparently under a number of Fedora/EPEL packages and demonstrates the need for a linear algebra policy like Debian’s.
The Issue
Fedora contains several packages that implement the BLAS (and possibly LAPACK) interfaces. You can end up with packages linking more than one of these, with unclear results. Apart from that, they are generally slow by a factor of several: even ATLAS, let alone the reference BLAS which various packages use, such as R. In fact, there are multiple versions of ATLAS for different SIMD architectures, which increases the complication, although no AVX version is packaged.[1]
At least on x86(_64), OpenBLAS is essentially optimal (about the same speed as Intel MKL for the most important operations — specifically GEMM) and there’s probably no reason to have anything else installed apart from for testing, e.g. against the reference.[2]
The whole business is a mess. Debian, on the other hand, has a policy providing a certain amount of interchangeability between linear algebra libraries that implement the BLAS/LAPACK interface (with a common ELF soname). Normally this means having OpenBLAS installed on x86 via the alternatives system, although only a pthreaded version of it is packaged.
In the absence of a similar policy and implementation for Fedora, we can use a hack to subvert the other libraries with OpenBLAS and achieve the same sort of effect as the Debian policy.
Effect
This will transparently improve the overall efficiency of installed packages doing linear algebra (e.g. on an HPC cluster), but obviously by an amount determined by how those packages are used, which I am not in a position to measure on a representative system.
For orientation, on Sandy Bridge with vanilla RHEL6.6/EPEL packages and the single-threaded LINPACK benchmark, the openblas library is asymptotically around nine times faster than the reference lapack/blas and four times faster than atlas-sse3 (the best variant packaged for Sandy Bridge). OpenBLAS also comfortably beat an AVX build of ATLAS, but I no longer have the numbers. See below for an example.
Workaround Implementation
The openblas-compat Fedora/EPEL package does the job. It provides trivial shared LAPACK/BLAS library shims which can replace those from the reference blas, lapack, atlas, gsl, and R packages at run time, so that binaries dynamically linked against those automatically get the benefit of the generally-faster OpenBLAS routines. It can be installed in parallel with the reference versions and others. (This is obviously restricted to the architectures supported by OpenBLAS, currently x86, x86_64, arm, and ppc.)
It overrides the others via ldconfig configuration. (Note that the older atlas package in EPEL already overrides the reference versions in the same way.) You can still use LD_LIBRARY_PATH or LD_PRELOAD to pick up the original versions for testing, of course.
The implementation is as follows. We loop over the names $n of the relevant libraries and, for each of them, do something like

cc -shared -o lib$n.so.$sover -Wl,-soname,lib$n.so.$sover -lopenblas

where $sover is currently 3. (It might not be obvious that linking like that, with no object files of our own, will work: the result is an empty shared object whose soname matches the original's and which carries a run-time dependency on libopenblas, so the dynamic linker resolves the BLAS/LAPACK symbols from there.) Then we drop a .conf file into /etc/ld.so.conf.d that refers to the openblas-compat library sub-directory. Experimentally, it needs to go at the front of the directory; I couldn't find documentation of whether the first or last matching entry is used. Obviously the same tactic might be used in other cases where the same ABI is presented.
After installing this, we can see, for instance
$ ldd /usr/lib64/R/bin/exec/R | grep blas
	libRblas.so => /usr/lib64/openblas-compat/libRblas.so (0x00007f052f458000)
	libopenblas.so.0 => /usr/lib64/libopenblas.so.0 (0x00007f052b5b5000)
as opposed to vanilla
$ ldd /usr/lib64/R/bin/exec/R | grep blas
	libRblas.so => /usr/lib64/R/lib/libRblas.so (0x00007fbfee05e000)
The current list of overridden libraries is: libRblas.so, libRlapack.so (from R); libblas.so.3, liblapack.so.3, liblapacke.so.3 (reference); libblas64.so.3, liblapack64.so.3 (64-bit integer reference); libcblas.so.3 (C-callable version of reference); libclapack.so.3, libf77blas.so.3 (ATLAS); libgslcblas.so.0 (from GSL). There are also substitutes for the ATLAS threaded libraries, and OpenMP-based substitutes for the serial ones, accessed with, e.g.

LD_LIBRARY_PATH=/usr/lib64/openblas-compat-openmp

to provide potential parallel acceleration for otherwise-serial programs if linear algebra is their bottleneck.
Addendum: Fedora Proposal
There is now a proposal for a sensible policy which is roughly what I should have written earlier, but couldn’t summon the energy to try to push through. (Unfortunately, it turns out Orion doesn’t have time to pursue it, so someone else needs to….)
Addendum: Example
Note: The R package figures are obsolete as far as EPEL7 and Fedora are concerned, since from version 3.3.2 R is built with OpenBLAS (influenced by these figures). However, that's not the case for EPEL6, for some reason, and they seem worth preserving for comparison anyway.
I was referred to a blog entry, referenced from another by Mike Croucher, which advocates Intel-lectual rebuilds of R, specifically using Intel MKL for BLAS/LAPACK. (Elsewhere, Microsoft advertises a GNU/Linux(!) binary distribution using MKL,[3] and pushes it with similar benchmarks.)
Unfortunately, measurements and reference to the MKL licence conditions don’t make much impression on the mythology around the proprietary Intel tools. However, here are results of redoing Mike Croucher’s example to demonstrate this technique, OpenBLAS AVX performance, and how Fedora technical computing is missing out on linear algebra performance and PR.
The following results were obtained with EPEL6 (R-3.3.1, blas/lapack-3.2.1, openblas-0.2.18) using openblas-compat and a similar mkl-compat package[4] which linked MKL from a distribution labelled ‘composer_xe_2015.2.164’.
The system was a 2.2 GHz E5-2660 (Sandy Bridge). An example interactive run for the vanilla version is as follows (the explicit binding would be elided in batch):

R_LD_LIBRARY_PATH=/usr/lib64/R/lib hwloc-bind core:8 Rscript \
    linear_algebra_bench.r x
The numbers are elapsed time in seconds, apart from the speed-up ratio of the first two rows, and their variance over multiple runs is ≲1%. They’re similar to the ones from Sheffield — which I assume were obtained on 2.6 GHz Ivy Bridge — except that the reference BLAS is somewhat slower and the MKL results don’t scale with that assumed clock frequency:
                    Matmul    Chol     SVD     PCA     LDA
Vanilla EPEL (s)     149.7    21.9    51.5   201.4   148.0
OpenBLAS (s)          11.6    2.13    9.32    26.0    33.2
OB speed-up ratio     12.9    10.2     5.5     7.7     4.4
MKL (s)               12.2    2.18    8.65    25.7    32.3
OpenBLAS scales similarly to MKL with multiple threads. For example,[5] using
OMP_NUM_THREADS=4 R_LD_LIBRARY_PATH=/usr/lib64/openblas-compat-openmp
for the first case:
               Matmul    Chol     SVD     PCA     LDA
OpenBLAS (s)     3.22    0.74    3.73    9.82    22.4
MKL (s)          3.38    1.54    3.00    8.50    21.3
Conclusion
So in most cases you don’t need to spend time rebuilding and linking R
with proprietary things until linear algebra in Fedora is sorted out –
just install the normal R packages and openblas-compat
.[6] On an AVX512 system, you can hold your nose and
pull the same trick with mkl-compat
if the proprietary library is
available and you have a linear algebra bottleneck.