fastGPT: Faster than PyTorch in 300 lines of Fortran

March 14, 2023

Authors: Ondřej Čertík, Brian Beckman

In this blog post I am announcing fastGPT, fast GPT-2 inference written in Fortran. In it, I show

  1. Fortran has speed at least as good as default PyTorch on Apple M1 Max.

  2. Fortran code has statically typed arrays, making maintenance of the code easier than with Python.

  3. It seems that the bottleneck algorithm in GPT-2 inference is matrix-matrix multiplication. For physicists like us, matrix-matrix multiplication is very familiar, unlike other aspects of AI and ML. Finding this familiar ground inspired us to approach GPT-2 like any other numerical computing problem (see the illustrative sketch after this list).

  4. I fixed an unintentional single-to-double conversion that slowed down the original Python code.

  5. I am asking others to help parallelize fastGPT on the CPU and offload it to the GPU, to see how fast we can make it.

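To give a concrete flavor of point 3, here is a minimal Fortran sketch of single-head attention (illustrative only, not the actual fastGPT code, and with the causal mask omitted). Essentially all of the work is in the two matmul calls, exactly the kind of operation the linked BLAS library (OpenBLAS or Accelerate) executes very efficiently:

! Illustrative single-head attention: the feature dimension comes first,
! following the same array convention as the mha() declaration later in this post.
function attention(head_dim, n_seq, q, k, v) result(y)
integer, intent(in) :: head_dim, n_seq
real, intent(in) :: q(head_dim,n_seq), k(head_dim,n_seq), v(head_dim,n_seq)
real :: y(head_dim,n_seq)
real :: scores(n_seq,n_seq)
integer :: j
! Attention scores: K^T Q, scaled -- the first matrix-matrix multiply
scores = matmul(transpose(k), q) / sqrt(real(head_dim))
! Softmax over the keys for each query (causal mask omitted for brevity)
do j = 1, n_seq
    scores(:,j) = exp(scores(:,j) - maxval(scores(:,j)))
    scores(:,j) = scores(:,j) / sum(scores(:,j))
end do
! Weighted sum of the values -- the second matrix-matrix multiply
y = matmul(v, scores)
end function attention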
About one month ago, I read the blog post GPT in 60 Lines of NumPy, and it piqued my curiosity. I looked at the corresponding code (picoGPT) and was absolutely amazed, for two reasons. First, I hadn't known it could be so simple to implement GPT-2 inference. Second, the code looks just like a typical computational physics code, similar to many that I have developed and maintained throughout my career.

I immediately downloaded picoGPT to test it out and indeed it worked! It was slow, as advertised, but it worked and it gave exactly the same answer as PyTorch. Then I studied the source code more and indeed it seemed like a clean, full, self-contained implementation of GPT-2.

The next step is obvious: this is just a numerical array-oriented algorithm, so if we want it to look like NumPy but be fast like PyTorch, let's rewrite it in Fortran!

Following picoGPT as a reference, I straightforwardly rewrote it in Fortran one function at a time, checking against picoGPT that my Fortran gives exactly the same answer. The job took about two afternoons. Both picoGPT and PyTorch (from conda-forge) use OpenBLAS to run in parallel on the Apple M1, so I linked my Fortran against OpenBLAS as well to get fast matrix-matrix multiplies. Without any other optimizations, my Fortran gave faster inference than PyTorch!
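For readers who have not linked Fortran against a BLAS library before, the idea is simply to call the standard sgemm routine and link with OpenBLAS (gfortran can also map the matmul intrinsic to BLAS via -fexternal-blas). A minimal sketch, not the actual fastGPT build setup:

! Minimal sketch of routing a single-precision matrix-matrix multiply
! through BLAS. Build with something like: gfortran demo.f90 -lopenblas
program matmul_demo
implicit none
integer, parameter :: m = 4, k = 3, n = 2
real :: a(m,k), b(k,n), c(m,n)
call random_number(a)
call random_number(b)
! c = 1.0*matmul(a,b) + 0.0*c, using the standard BLAS sgemm interface
call sgemm('N', 'N', m, n, k, 1.0, a, m, b, k, 0.0, c, m)
print *, c
end program matmul_demo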

While porting picoGPT to fastGPT, I noticed that picoGPT accidentally casts the computation from single to double precision. I sent a PR to picoGPT that fixes that, speeding it up 5x for me. I use the faster version below.
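The actual fix was on the Python/NumPy side, but the mechanism is easy to show in Fortran as well: a single double-precision constant in an expression silently promotes the whole computation to double precision. An illustrative example, not the actual picoGPT fix:

! Illustration of accidental promotion to double precision (the analogous
! issue in picoGPT was on the NumPy side; this is not the actual fix).
program precision_demo
implicit none
integer, parameter :: sp = kind(1.0), dp = kind(1.0d0)
real(sp) :: x(4) = [1.0, 2.0, 3.0, 4.0]
print *, kind(x * sqrt(2.0_dp))   ! the dp constant promotes the whole expression
print *, kind(x * sqrt(2.0_sp))   ! stays in single precision
end program precision_demo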

I also implemented a kv-cache, which greatly speeds up token generation compared to the first version of fastGPT (a sketch of the kv-cache idea appears after the first table below). In the tables, “no cache” means the kv-cache is turned off.

Let's look at the benchmarks on my laptop. On an Apple M1 Max we run inference of the GPT-2 124M model with 19 input tokens, generating 20 more tokens (see the README for more details). The following table is the fairest comparison against PyTorch: just the inference itself, excluding all initialization; the same backend (OpenBLAS) for both; caching enabled (the default in PyTorch); all compiler optimizations on, but no special-purpose code in fastGPT. In our opinion this gives the maximum possible advantage to PyTorch, and we are still faster on all core counts (1-8):

Code                  1 core    2 cores    4 cores    8 cores
fastGPT (OpenBLAS)    0.837s    0.514s     0.341s     0.339s
PyTorch (OpenBLAS)    0.873s    0.539s     0.386s     0.392s

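As promised, here is a simplified sketch of what the kv-cache does for one attention layer: the keys and values of all previously processed tokens are stored, the new token's key and value are appended, and only the new query attends against the cache. The names and sizes below are illustrative, not taken from fastGPT:

! Simplified illustration of a kv-cache for one attention layer.
module kv_cache_demo
implicit none
integer, parameter :: head_dim = 64, max_seq = 1024
type :: layer_cache
    real :: kc(head_dim, max_seq) = 0.0   ! cached keys
    real :: vc(head_dim, max_seq) = 0.0   ! cached values
    integer :: n = 0                      ! number of cached tokens
end type layer_cache
contains
subroutine attend_one_token(cache, k_new, v_new, q_new, y)
    type(layer_cache), intent(inout) :: cache
    real, intent(in) :: k_new(head_dim), v_new(head_dim), q_new(head_dim)
    real, intent(out) :: y(head_dim)
    real :: w(cache%n + 1)
    ! Append the new token's key and value; everything older is reused as-is.
    cache%n = cache%n + 1
    cache%kc(:, cache%n) = k_new
    cache%vc(:, cache%n) = v_new
    ! The single new query attends against the whole cache: one matrix-vector
    ! product per step instead of recomputing the full n_seq x n_seq attention.
    w = matmul(transpose(cache%kc(:, 1:cache%n)), q_new) / sqrt(real(head_dim))
    w = exp(w - maxval(w))
    w = w / sum(w)
    y = matmul(cache%vc(:, 1:cache%n), w)
end subroutine attend_one_token
end module kv_cache_demo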

In the second table we introduce two improvements: a faster implementation of the tanh function and the Accelerate framework on macOS. The results are now about 3x faster on a single core (a sketch of a fast tanh approximation follows the table):

Code                               1 core    2 cores    4 cores    8 cores
fastGPT (Accelerate, fast_tanh)    0.288s
fastGPT (Accelerate)               0.299s
fastGPT (OpenBLAS)                 0.837s    0.514s     0.341s     0.339s
PyTorch (OpenBLAS)                 0.873s    0.539s     0.386s     0.392s

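In GPT-2 the tanh sits inside the GELU activation, which is evaluated over every element of every MLP block, so a cheaper tanh pays off. As an illustration of the technique only (the actual fast_tanh in fastGPT may use a different formula), a clamped low-order rational approximation looks like this:

! Illustrative fast tanh: a low-order rational approximation clamped to
! [-1, 1]. This demonstrates the idea; it is not necessarily the exact
! formula used by fastGPT's fast_tanh.
elemental function fast_tanh(x) result(y)
real, intent(in) :: x
real :: y, x2
x2 = x*x
y = x*(945.0 + 105.0*x2 + x2*x2) / (945.0 + 420.0*x2 + 15.0*x2*x2)
y = max(-1.0, min(1.0, y))
end function fast_tanh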

In the third table we also compare against picoGPT, which does not have caching implemented, so we turn off caching in fastGPT and PyTorch and again use the same backend (OpenBLAS) with no special optimizations in fastGPT, for a fair comparison:

Code                            1 core    2 cores    4 cores    8 cores
fastGPT (OpenBLAS, no cache)    2.343s    1.603s     1.209s     1.018s
PyTorch (OpenBLAS, no cache)    2.356s    1.520s     1.104s     0.997s
picoGPT (OpenBLAS, no cache)    2.427s    1.645s     1.272s     1.081s


The above benchmarks only compare the time of the inference itself, excluding loading the model (for all codes) and Python import times (for picoGPT and PyTorch). With the model I/O optimized for Fortran arrays (sketched after the table below), the results are truly dramatic: up to 12x faster. Total run times, including loading the model and Python imports:

Code                               Time
fastGPT (Accelerate, fast_tanh)    0.401s
picoGPT (8 cores)                  3.445s
PyTorch (OpenBLAS, 4 cores)        4.867s

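The fast startup comes from storing the weights as plain arrays that Fortran can read in one shot with unformatted stream I/O, with no parsing and no interpreter to start. A hedged sketch of the idea; the file name and layout are made up for illustration and are not fastGPT's actual format:

! Sketch: loading one weight matrix from a raw unformatted stream file.
program load_weights_demo
implicit none
integer, parameter :: n_embd = 768, n_vocab = 50257   ! GPT-2 124M sizes
real, allocatable :: wte(:,:)                         ! token embedding table
integer :: u
allocate(wte(n_embd, n_vocab))
open(newunit=u, file="model.dat", form="unformatted", access="stream", &
    status="old", action="read")
read(u) wte                                           ! one read fills the whole array
close(u)
print *, "loaded", size(wte), "numbers"
end program load_weights_demo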

As you can see, fastGPT is slightly faster than PyTorch when doing as fair a comparison as we can (both using OpenBLAS as the backend and both using caching, the default in PyTorch). You can also see that fastGPT loads the model very quickly and runs immediately, while both PyTorch and picoGPT take a long time both to load the model and to import all the Python libraries.

This matches my past experience with Fortran. Every time I rewrite NumPy code in Fortran, it looks almost the same, but I get very competitive performance. Until now I have not been interested in machine learning / AI, because it seemed to me like fitting very large models to data, the results did not seem very impressive to me, and the algorithms themselves did not seem similar to computational physics. But after implementing a Fortran version of GPT-2, I can say without any doubt that the algorithm is exactly analogous to many computational physics codes that I have worked with. Consequently, I think exactly the same performance techniques apply here.

Using a language like Fortran, which is oriented toward the fastest possible array computations, allows one to write code that is highly performant yet still readable, which matters because things get complicated and one must be able to maintain the code. (The GPT-2 inference algorithm is actually quite simple compared to most physics codes.)

Both maintainability and speed are achieved by array declarations with static types. Compare the original Python:

def mha(x, c_attn, c_proj, n_head):  # [n_seq, n_embd] -> [n_seq, n_embd]
    ...

and Fortran:

function mha(n_seq, n_embd, x, attn_w, attn_b, proj_w, proj_b, n_head) result(y)
integer, intent(in) :: n_seq, n_embd, n_head
real(sp), intent(in) :: x(n_embd,n_seq), &
    attn_w(3*n_embd,n_embd), attn_b(3*n_embd), &
    proj_w(n_embd,n_embd), proj_b(n_embd)
real(sp) :: y(n_embd,n_seq)
...

In picoGPT one must use comments to keep track of the dimensions, and sometimes there are mistakes, which is inevitable. In Fortran the compiler itself ensures that all the dimensions are correct, with compile-time and runtime checks (see the example below). It is great for both documentation and speed. The Python version actually accepts c_attn, which is a dictionary of arrays; for performance I do not recommend that, so we pass all the underlying arrays directly. Besides these declarations, the Fortran code is almost identical to the original NumPy code.
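As a small example of what the compiler catches (exact diagnostics vary by compiler; this shows gfortran behavior, and flags such as -fcheck=bounds add further runtime checking), the following mismatched call is rejected at compile time because the actual array is smaller than the declared dummy argument:

module check_demo
implicit none
contains
    subroutine needs_five(x)
        real, intent(in) :: x(5)
        print *, sum(x)
    end subroutine needs_five
end module check_demo

program main
use check_demo
implicit none
real :: a(3)
a = 0.0
call needs_five(a)   ! compile-time error: actual argument has too few elements
end program main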

If you like these results so far, please help us parallelize fastGPT on the CPU as well as offload it to the GPU. We have very good single-core CPU performance (though we should still try to speed it up further), and it provides a great foundation for parallelization. Let's see how fast we can make it!

