I have searched for months for a way to do Deep Learning (DL) inference with Rust on GPU, and I finally did it!!✨👏✨ This blog post tries to answer whether Rust is a good fit for the job!
I have put an annexe at the end with definitions of the Deep Learning terms used.
My setup
I am using a Hugging Face tokenizer and a custom BERT model from PyTorch that I converted to ONNX so it can run with onnxruntime-rs.
I have tweaked onnxruntime-rs to do Deep Learning on GPU with CUDA 11 and onnxruntime 1.8. You can check it out on my git: https://github.com/haixuanTao/onnxruntime-rs
Hardware-side, I have a 6-core/12-thread CPU and a GTX 1050 GPU.
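Loading the model with the fork looks roughly like this. This is a minimal sketch based on the upstream onnxruntime-rs API; the model path is a placeholder, and the exact call that enables the CUDA provider in my fork may differ, so treat the commented `use_cuda` line as an assumption:

```rust
use onnxruntime::{environment::Environment, GraphOptimizationLevel, LoggingLevel};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build the ONNX Runtime environment once per process.
    let environment = Environment::builder()
        .with_name("bert")
        .with_log_level(LoggingLevel::Warning)
        .build()?;

    // Create a session from the exported ONNX file (placeholder path).
    let _session = environment
        .new_session_builder()?
        // Hypothetical fork-specific call to enable the CUDA 11 provider:
        // .use_cuda(0)?
        .with_optimization_level(GraphOptimizationLevel::Basic)?
        .with_number_threads(1)?
        .with_model_from_file("model.onnx")?;

    Ok(())
}
```

The session is then fed the tokenized inputs for each phrase.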
Unit Results
1. Inference time
| | Inference time per phrase | Speedup |
|---|---|---|
| (Rust or Python) ONNX CPU | ~125ms | |
| (Rust or Python) ONNX GPU | ~10ms | x12🔥 |
DL inference using Onnxruntime will not be faster in Rust, because both languages wrap the same underlying C++ engine. What makes the difference is the GPU.
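The x12 in the table is just the ratio of the two per-phrase latencies, rounded:

```rust
fn main() {
    let cpu_ms = 125.0; // ~125 ms per phrase, ONNX on CPU
    let gpu_ms = 10.0;  // ~10 ms per phrase, ONNX on GPU
    println!("speedup: x{:.1}", cpu_ms / gpu_ms); // x12.5, i.e. roughly x12
}
```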
2. Preprocessing time
| | Tokenizing time per phrase | Speedup |
|---|---|---|
| Python BertTokenizer | 1000μs | |
| Python BertTokenizerFast | 200-600μs | x2.5🔥 |
| Rust Tokenizer | 50-150μs | x4🔥 |
Gains can be made on DL preprocessing! You can tokenize about 4 times faster in Rust than in Python, using the same Hugging Face Tokenizers library.
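The Rust numbers above come from calling the Hugging Face `tokenizers` crate directly, which is the same engine that backs the Python "fast" tokenizers. A minimal sketch, assuming a tokenizer exported from Python to `tokenizer.json` (the file name and phrase are placeholders):

```rust
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load a tokenizer previously saved from Python with
    // `tokenizer.save("tokenizer.json")` (placeholder file name).
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // `true` adds special tokens such as [CLS] and [SEP].
    let encoding = tokenizer.encode("Rust is fast", true)?;
    println!("ids: {:?}", encoding.get_ids());
    Ok(())
}
```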
Case Study Results
Looking at those results alone is not enough. To dig a little further, I built a DL data pipeline for batch inference and a DL server, to see what using Rust for DL could look like on a daily basis.
1. Deep Learning batch inference: Running BERT on a CSV
Let's say you want to run a BERT model on one column of a 10,000-line CSV.
On my setup, I got those timings:
| 10k phrases | Python | Rust |
|---|---|---|
| Booting time | 4s | 1s |
| Encoding time | 0.7s | 0.3s |
| DL inference time | 75s | 75s |
| Total time | 80s | 76s |
| Memory usage | 1 GiB | 0.7 GiB |
As DL inference takes the majority of the time, Rust will not noticeably increase performance here. For large batches of inference, I would not bother with Rust and would stay with Python. 👍🐍
Git: https://github.com/haixuanTao/bert-onnx-rs-pipeline
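Stripped to its skeleton, the batch job reads one text column out of the CSV, then feeds the phrases to the tokenizer and ONNX session. A dependency-free sketch of the column-extraction step (the real pipeline in the repo should be preferred; a production version would use a proper CSV crate to handle quoting):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

/// Collect one column of a comma-separated file, skipping the header row.
/// Naive split: does not handle quoted fields containing commas.
fn read_column(path: &str, col: usize) -> std::io::Result<Vec<String>> {
    let reader = BufReader::new(File::open(path)?);
    let mut phrases = Vec::new();
    for line in reader.lines().skip(1) {
        let line = line?;
        if let Some(field) = line.split(',').nth(col) {
            phrases.push(field.trim().to_string());
        }
    }
    Ok(phrases)
}
```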
2. ONNX Server: Serving BERT as an API
Let's say you want to serve a BERT-like model through an API endpoint.
On my setup, I got those metrics:
| | Python FastAPI | Rust Actix Web | Speedup |
|---|---|---|---|
| Encoding time | 400μs | 100μs | |
| ONNX inference time | ~10ms | ~10ms | |
| API overhead time | ~2ms | ~1ms | |
| Mean latency | 12.8ms | 10.4ms | -20%⏰ |
| Requests/sec | 77.5 #/s | 95 #/s | +22%🍾 |
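As a sanity check, for a mostly sequential benchmark the requests/sec figure is roughly the inverse of the mean latency, which lines up with the measured numbers:

```rust
fn main() {
    let python_latency = 0.0128; // 12.8 ms mean latency
    let rust_latency = 0.0104;   // 10.4 ms mean latency
    let python_rps = 1.0 / python_latency; // ~78 req/s (measured: 77.5)
    let rust_rps = 1.0 / rust_latency;     // ~96 req/s (measured: 95)
    println!("throughput gain: +{:.0}%", (rust_rps / python_rps - 1.0) * 100.0);
}
```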
The gain in performance comes from moving from Python libraries that are already considered “fast” to their Rust equivalents: FastAPI -> Actix Web, BertTokenizerFast -> the Rust Tokenizer.
Thus, as Rust libraries tend to be faster than their Python counterparts, the more functionality you add around the model, the more speedup you will see with Rust when serving Deep Learning.
That’s why, for performance-centric Deep Learning applications such as real-time Deep Learning, embedded Deep Learning, and large-scale AI servers, I can definitely see Rust being a good fit! ❤️🦀
Git: https://github.com/haixuanTao/bert-onnx-rs-server
In conclusion, should you use Rust for Deep Learning?
- Like the rest of the Rust ecosystem, use it if it’s the best tool for the job. If you really need performance 🏎️ and resilience 🛡️, and you are OK with having a stack in Rust, go ahead 🚀! But be aware that making Rust fast is not easy!
- If you need quick prototyping in a data-scientist-friendly language, you are better off with Python!
There will be a follow-up blog post about the actual implementation of the DL pipeline and server, so make sure to follow along 😀:
- On able: https://able.bio/haixuanTao
- On github: https://github.com/haixuanTao
- On linkedin: https://www.linkedin.com/in/haixuan-xavier-tao-7460b1102/
Git reference
Git of my tweaked onnxruntime-rs library with onnxruntime 1.8 and GPU support through CUDA 11: https://github.com/haixuanTao/onnxruntime-rs
Git of the BERT onnxruntime-rs pipeline: https://github.com/haixuanTao/bert-onnx-rs-pipeline
Git of the BERT onnxruntime-rs Actix server: https://github.com/haixuanTao/bert-onnx-rs-server
Annexe
ONNX is an open format built to represent machine learning models. You can convert PyTorch, TensorFlow, and scikit-learn models into the ONNX format and then run them with ONNX Runtime.
ONNX Runtime is the inference and optimized training engine that can read and run ONNX models. It is written in C++. There are official wrappers for Python, JS, Java, C, and C++.
Onnxruntime-rs is the ONNX Runtime wrapper for Rust. It is a community-developed solution and does not wrap all the features of the C++ engine. I tweaked the initial repo to support version 1.8 of ONNX Runtime with the possibility of running on GPU. My version is here: https://github.com/haixuanTao/onnxruntime-rs on branch onnx1.8.
A Tokenizer transforms words into indices given a vocabulary.
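Conceptually, for a whitespace tokenizer (a toy illustration, not how the Hugging Face tokenizers actually segment text):

```rust
use std::collections::HashMap;

/// Map each whitespace-separated word to its vocabulary index;
/// unknown words fall back to id 0 ([UNK]).
fn tokenize(vocab: &HashMap<&str, u32>, text: &str) -> Vec<u32> {
    text.split_whitespace()
        .map(|word| *vocab.get(word).unwrap_or(&0))
        .collect()
}

fn main() {
    let vocab = HashMap::from([("[UNK]", 0), ("deep", 1), ("learning", 2)]);
    println!("{:?}", tokenize(&vocab, "deep learning rocks")); // [1, 2, 0]
}
```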
BERT is a Deep Learning model used for natural language processing.
A GPU is a graphics processing unit for parallel computing. You will need CUDA, and therefore an NVIDIA GPU, to run Onnxruntime-rs on GPU.