Authors: Ondřej Čertík, Brian Beckman
In this blog post I am announcing fastGPT, fast GPT-2 inference written in Fortran. In it, I show:

- Fortran has speed at least as good as default PyTorch on Apple M1 Max.
- Fortran code has statically typed arrays, making maintenance of the code easier than with Python.
- It seems that the bottleneck algorithm in GPT-2 inference is matrix-matrix multiplication. For physicists like us, matrix-matrix multiplication is very familiar, unlike other aspects of AI and ML. Finding this familiar ground inspired us to approach GPT-2 like any other numerical computing problem.
- We fixed an unintentional single-to-double conversion that slowed down the original Python.
- I am asking others to take over and parallelize fastGPT on CPU and offload it to the GPU, and see how fast you can make it.
About one month ago, I read the blog post GPT in 60 Lines of NumPy, and it piqued my curiosity. I looked at the corresponding code (picoGPT) and was absolutely amazed, for two reasons. First, I hadn’t known it could be so simple to implement GPT-2 inference. Second, it looked just like a typical computational physics code, similar to many that I have developed and maintained throughout my career.
I immediately downloaded picoGPT to test it out, and indeed it worked! It was slow, as advertised, but it worked and it gave exactly the same answer as PyTorch. Then I studied the source code more, and indeed it seemed like a clean, full, self-contained implementation of GPT-2.

The next step was obvious: this is just a numerical array-oriented algorithm, so if we want it to look like NumPy but be fast like PyTorch, let’s rewrite it in Fortran!
Following picoGPT as a reference, I straightforwardly rewrote one function at a time in Fortran and checked against picoGPT that my Fortran gives exactly the same answer. The job took about two afternoons. Both picoGPT and PyTorch (from conda-forge) use OpenBLAS to run in parallel on Apple M1, so I linked my Fortran against OpenBLAS as well to get fast matrix-matrix multiplies. Without any other optimizations, my Fortran gave faster inference than PyTorch!
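To give a feel for what those matrix-matrix multiplies look like from Fortran, here is a minimal, self-contained sketch of the kind of call involved: the heavy lifting reduces to dense matrix-matrix products, which map onto a single BLAS sgemm call provided by OpenBLAS. The shapes and names below are purely illustrative, not fastGPT’s actual code:

```fortran
program matmul_bottleneck
! Illustrative only: the core cost of GPT-2 inference is C = A*B,
! which maps onto a single BLAS sgemm call (here supplied by OpenBLAS).
implicit none
integer, parameter :: sp = kind(1.0)
integer, parameter :: m = 768, k = 768, n = 64   ! e.g. an n_embd x n_embd weight times an n_embd x n_seq activation
real(sp) :: a(m,k), b(k,n), c(m,n)
call random_number(a)
call random_number(b)
! C := 1.0*A*B + 0.0*C
call sgemm('N', 'N', m, n, k, 1.0_sp, a, m, b, k, 0.0_sp, c, m)
print *, c(1,1)
end program matmul_bottleneck
```

Built with something like `gfortran matmul_bottleneck.f90 -lopenblas`, calls of this form are where the bulk of the inference time is spent.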
While porting picoGPT to fastGPT, I noticed that picoGPT accidentally casts the computation from single to double precision. I sent a PR to picoGPT that fixes that, speeding it up 5x for me. I use the faster version below.
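As an aside, this class of bug is hard to introduce in the Fortran version: the storage kind of every array is fixed by its declaration, so a stray double-precision constant cannot silently promote the whole computation. A small illustration of the general behaviour (not the picoGPT bug itself):

```fortran
program precision_sketch
implicit none
integer, parameter :: sp = kind(1.0)
real(sp) :: x(4)
x = [1.0_sp, 2.0_sp, 3.0_sp, 4.0_sp]
! The right-hand side mixes single and double precision, so the expression
! is evaluated in double, but it is converted back to real(sp) on assignment:
! the array itself stays single precision.
x = x * sqrt(2.0d0 / acos(-1.0d0))
print *, kind(x), x
end program precision_sketch
```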
I also implemented a kv-cache, which greatly speeds up token generation compared to the first version of fastGPT. Below, “no cache” means the kv-cache is turned off.
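For readers unfamiliar with the technique, the idea behind the kv-cache is that during generation each new token only computes its own key and value, appends them to a cache, and attends over the cached keys and values, instead of re-running attention over the whole sequence from scratch. A minimal single-head sketch (hypothetical shapes and names, not fastGPT’s actual routines):

```fortran
module kv_cache_sketch
implicit none
integer, parameter :: sp = kind(1.0)
contains
subroutine attend_one_token(n_embd, n_past, q, k_new, v_new, k_cache, v_cache, y)
! Single-head attention for one new token against a growing kv-cache.
! No multi-head split or output projection, for brevity.
integer, intent(in) :: n_embd, n_past
real(sp), intent(in) :: q(n_embd), k_new(n_embd), v_new(n_embd)
real(sp), intent(inout) :: k_cache(n_embd,n_past+1), v_cache(n_embd,n_past+1)
real(sp), intent(out) :: y(n_embd)
real(sp) :: scores(n_past+1)
integer :: t
! Append the new token's key and value to the cache
k_cache(:,n_past+1) = k_new
v_cache(:,n_past+1) = v_new
! Attention scores of the new query against all cached keys
do t = 1, n_past+1
    scores(t) = dot_product(q, k_cache(:,t)) / sqrt(real(n_embd, sp))
end do
scores = exp(scores - maxval(scores))
scores = scores / sum(scores)
! Weighted sum of the cached values
y = matmul(v_cache, scores)
end subroutine attend_one_token
end module kv_cache_sketch
```

Without the cache, every generated token re-runs the full forward pass over the whole sequence; with it, each step only processes the new token.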
Let’s look at the benchmarks on my laptop. On an Apple M1 Max we run GPT-2 124M model inference with 19 input tokens, generating 20 more tokens (see the README for more details). The following two lines are the most fair comparison against PyTorch: just the inference itself, excluding all initialization; using the same backend (OpenBLAS); using caching (the default in PyTorch); all compiler optimizations on, but no special-purpose code in fastGPT. In our opinion this gives the maximum possible advantage to PyTorch, and we are still faster on all core counts (1-8):
| Code | 1 core | 2 cores | 4 cores | 8 cores |
|---|---|---|---|---|
| fastGPT (OpenBLAS) | 0.837s | 0.514s | 0.341s | 0.339s |
| PyTorch (OpenBLAS) | 0.873s | 0.539s | 0.386s | 0.392s |
In the second table we introduce two improvements: a faster implementation of the tanh function and use of the Accelerate framework on macOS. The results are now about 3x faster on a single core:
| Code | 1 core | 2 cores | 4 cores | 8 cores |
|---|---|---|---|---|
| fastGPT (Accelerate, fast_tanh) | 0.288s | | | |
| fastGPT (Accelerate) | 0.299s | | | |
| fastGPT (OpenBLAS) | 0.837s | 0.514s | 0.341s | 0.339s |
| PyTorch (OpenBLAS) | 0.873s | 0.539s | 0.386s | 0.392s |
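The tanh in question sits inside GELU, which is applied element-wise to large activation arrays, so even a modest per-element saving shows up in the totals above. As a rough illustration of what a fast tanh can look like (a crude low-order rational approximation, used here only to show the idea; fastGPT’s actual fast_tanh may use a different, more accurate formula):

```fortran
module fast_tanh_sketch
implicit none
integer, parameter :: sp = kind(1.0)
contains
! Illustrative only: a cheap rational approximation of tanh, clamped to
! [-1, 1]. It trades some accuracy for speed compared to the intrinsic.
elemental function fast_tanh(x) result(y)
real(sp), intent(in) :: x
real(sp) :: y
y = x * (27.0_sp + x*x) / (27.0_sp + 9.0_sp*x*x)
y = max(-1.0_sp, min(1.0_sp, y))
end function fast_tanh
end module fast_tanh_sketch
```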
In the third table we also compare against picoGPT, which does not have caching implemented, so we turn off caching in fastGPT and PyTorch and again use the same backend (OpenBLAS) and no special optimizations in fastGPT, for fair comparison:
| Code | 1 core | 2 cores | 4 cores | 8 cores |
|---|---|---|---|---|
| fastGPT (OpenBLAS, no cache) | 2.343s | 1.603s | 1.209s | 1.018s |
| PyTorch (OpenBLAS, no cache) | 2.356s | 1.520s | 1.104s | 0.997s |
| picoGPT (OpenBLAS, no cache) | 2.427s | 1.645s | 1.272s | 1.081s |
The above benchmarks only compare the time for the inference itself, excluding loading the data (for all codes) and Python import times (for picoGPT and PyTorch). With IO optimized for Fortran arrays, the results are truly dramatic, up to 12x faster. Total run (includes loading the model and Python imports):
| Code | Time |
|---|---|
| fastGPT (Accelerate, fast_tanh) | 0.401s |
| picoGPT (8 cores) | 3.445s |
| PyTorch (OpenBLAS, 4 cores) | 4.867s |
As you can see, fastGPT is slightly faster than PyTorch in as fair a comparison as we can make (both using OpenBLAS as a backend and both using caching, the default in PyTorch). You can also see that fastGPT loads the model very quickly and runs immediately, while both PyTorch and picoGPT take a long time both to load the model and to import all the Python libraries.
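As the numbers suggest, the fast startup comes largely from the IO being optimized for Fortran arrays: once the weights are stored as raw arrays, loading is essentially one contiguous binary read per array, with no parsing and no Python interpreter to start. A minimal sketch of the idea (hypothetical file name and layout, not fastGPT’s exact format):

```fortran
program fast_load_sketch
implicit none
integer, parameter :: sp = kind(1.0)
real(sp), allocatable :: wte(:,:)
integer :: u
! Token embedding table of GPT-2 124M: (n_embd, n_vocab)
allocate(wte(768, 50257))
open(newunit=u, file="model.dat", form="unformatted", access="stream", status="old")
read(u) wte            ! one contiguous binary read, no parsing
close(u)
print *, wte(1,1)
end program fast_load_sketch
```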
This matches my past experience with Fortran. Every time I rewrite NumPy code in Fortran, it looks almost the same, but I get very competitive performance. Until now I have not been interested in machine learning / AI, because it seemed to me like very large fits to data, the results were not even very impressive to me, and the algorithms themselves did not seem similar to computational physics. But after implementing a Fortran version of GPT-2, I can say without any doubt that the algorithm is exactly analogous to many computational physics codes that I have worked with. Consequently, I think exactly the same performance techniques apply here.
Using a language like Fortran, which is oriented toward the fastest possible array computations, makes it possible to write code that is highly performant but still readable, which matters because things get complicated and one must be able to maintain the code. (The GPT-2 inference algorithm is actually quite simple compared to most physics codes.)
Both maintainability and speed are achieved by array declarations with static types. Compare the original Python:
```python
def mha(x, c_attn, c_proj, n_head):  # [n_seq, n_embd] -> [n_seq, n_embd]
    ...
```
and Fortran:
```fortran
function mha(n_seq, n_embd, x, attn_w, attn_b, proj_w, proj_b, n_head) result(y)
integer, intent(in) :: n_seq, n_embd, n_head
real(sp), intent(in) :: x(n_embd,n_seq), &
    attn_w(3*n_embd,n_embd), attn_b(3*n_embd), &
    proj_w(n_embd,n_embd), proj_b(n_embd)
real(sp) :: y(n_embd,n_seq)
...
```
In picoGPT one must use comments to keep track of the dimensions, and sometimes there are mistakes, which is inevitable. In Fortran the compiler itself ensures all the dimensions are correct, with compile-time and runtime checks. It is great for both documentation and speed. The Python version actually accepts c_attn, which is a dictionary of arrays. For performance I do not recommend that, so we pass all the underlying arrays directly. Besides these declarations, the Fortran code is almost identical to the original NumPy code.
If you like these results so far, please help us parallelize fastGPT on CPU as well as offload it to the GPU. We have very good single-core CPU performance (but we should still try to speed it up further), and it provides a great foundation for parallelization. Let’s see how fast we can make it!
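As one possible starting point for CPU parallelization (purely a sketch under an assumed data layout, not fastGPT’s actual code structure), the attention heads are independent of each other, so the per-head work can be spread across cores with OpenMP:

```fortran
program omp_heads_sketch
use omp_lib, only: omp_get_max_threads
implicit none
integer, parameter :: sp = kind(1.0)
integer, parameter :: n_head = 12, n_seq = 19, d_head = 64
real(sp) :: q(d_head,n_seq,n_head), k(d_head,n_seq,n_head)
real(sp) :: scores(n_seq,n_seq,n_head)
integer :: h
call random_number(q)
call random_number(k)
print *, "threads:", omp_get_max_threads()
! Each head's score matrix is independent, so the loop parallelizes cleanly.
!$omp parallel do
do h = 1, n_head
    scores(:,:,h) = matmul(transpose(k(:,:,h)), q(:,:,h)) / sqrt(real(d_head, sp))
end do
!$omp end parallel do
print *, scores(1,1,1)
end program omp_heads_sketch
```

Compiled with `gfortran -fopenmp`, this kind of loop-level parallelism could complement the threading that OpenBLAS already provides inside the matrix products.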