dfdx v0.11.0: ergonomic GPU accelerated deep learning ENTIRELY in rust!

Join the dfdx discord to stay in the loop!

What is dfdx?

An ergonomic deep learning library in rust (similar to pytorch in python). dfdx has all the standard tensor operations, neural network modules, and optimizers. It also knows tensor sizes across all operations, meaning if you use compile-time dimensions, you get shape checking for free! No more kicking off a training run and getting a runtime shape error. It’s easy to use, and the internals are much easier to understand than something like pytorch or tensorflow.
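
For example, here is a minimal sketch (assuming the Cpu device and the matmul op from dfdx's prelude) of what that shape checking buys you:

use dfdx::prelude::*;

let dev: Cpu = Default::default();
let a: Tensor<Rank2<2, 3>, f32, _> = dev.zeros();
let b: Tensor<Rank2<3, 4>, f32, _> = dev.zeros();

// The inner dims line up, so this compiles and
// produces a (2, 4) tensor...
let _c: Tensor<Rank2<2, 4>, f32, _> = a.matmul(b);

// ...while multiplying a Rank2<2, 3> by a Rank2<4, 5>
// would be rejected at compile time.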

Check out dfdx/examples for more examples, but here is an api preview:

// Declare our model structure.
// Tuples represent sequential models!
type Model = (
    (Linear<5, 10>, ReLU),
    Residual<(Linear<10, 10>, Tanh)>,
    Linear<10, 2>,
);

// Devices let you allocate tensors/nns
let dev: Cuda = Default::default();

// Create our model with f32 dtype.
// f64 is supported & f16 is coming in the future!
let mut model = dev.build_module::<Model, f32>();

// Create random input data for simplicity.
// `x` will have shape `(10, 5)`.
let x: Tensor<(usize, Const<5>), f32, _> =
    dev.sample_normal_like(&(10, Const));

// A simple forward pass.
let pred = model.forward(x);

See dfdx docs for more info.

What is cudarc?

cudarc is a safe and ergonomic rust wrapper around the CUDA toolkit. It aims to be as safe as possible, given that interacting with CUDA involves FFI calls. dfdx leverages cudarc for all of its CUDA support.

Check out cudarc/examples for more examples, but here is an api preview:

// Creates a handle to device ordinal 0.
let dev = CudaDevice::new(0)?;

// You can load pre-compiled kernels:
let ptx = Ptx::from_file("./examples/sin.ptx");
dev.load_ptx(ptx, "sin_module", &["sin_kernel"])?;

// Then load a function from the kernel with `get_func`:
let f = dev.get_func("sin_module", "sin_kernel").unwrap();

// Copying `Vec` to the device is done asynchronously.
let a_dev = dev.htod_copy(vec![1.0, 2.0, 3.0])?;
let mut b_dev = a_dev.clone();

// And launching the kernels is a straightforward
// method call.
let cfg = LaunchConfig::for_num_elems(3);
let params = (&mut b_dev, &a_dev, 3i32);
unsafe { f.launch(cfg, params) }?;
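
// Copying results back to the host is not shown above;
// a minimal sketch assuming cudarc's `dtoh_sync_copy`,
// which blocks until the kernel has finished:
let b_host = dev.dtoh_sync_copy(&b_dev)?;
assert_eq!(b_host.len(), 3);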

See cudarc docs for more info.

dfdx new features at a glance

Introducing the insanely fast Cuda device

This release introduced both the Cpu and Cuda devices. Both support all operations/apis in dfdx, so it's easy to switch between them!

- type Device = Cpu;
+ type Device = Cuda;
let dev: Device = Default::default();
let x: Tensor<Rank2<2, 3>, f32, _> = dev.zeros();
let m = dev.build_module::<Linear<5, 10>, f32>();

Plus the Cuda device is absurdly fast.
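
Because both devices implement the same traits, you can also write functions that are generic over the device. Here is a minimal sketch, assuming the Device<f32> trait bound from the prelude (the helper name is hypothetical):

use dfdx::prelude::*;

// Works with either Cpu or Cuda.
fn zeros_2x3<D: Device<f32>>(dev: &D) -> Tensor<Rank2<2, 3>, f32, D> {
    dev.zeros()
}

let cpu: Cpu = Default::default();
let _t = zeros_2x3(&cpu);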

Shapes made up of both compile-time and run-time dimensions

Tensor is now generic over Shape, Dtype, and Device, meaning you can now represent tensors with different shapes, dtypes, and devices.

Here are some example tensors, and you can find out more about how to use these in dfdx/examples/01-tensor.rs.

// A 3d CPU tensor with shape (1, 2, 3).
Tensor<Rank3<1, 2, 3>, f32, Cpu>

// A 2d CUDA tensor with shape (?, ?).
Tensor<(usize, usize), f64, Cuda>

// A 3d CUDA tensor with shape (3, ?, 5).
Tensor<(Const<3>, usize, Const<5>), usize, Cuda>

// A 1d CUDA tensor with shape (10,).
Tensor<Rank1<10>, u16, Cuda>

// A 0d CPU tensor.
Tensor<(), bool, Cpu>
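
To construct a tensor with one of these mixed shapes, you pass the shape explicitly. Here is a minimal sketch, assuming the zeros_like constructor (which, like sample_normal_like above, takes a shape reference):

use dfdx::prelude::*;

let dev: Cpu = Default::default();

// Shape (3, ?, 5): the middle dim is only known at runtime.
let t: Tensor<(Const<3>, usize, Const<5>), f32, _> =
    dev.zeros_like(&(Const, 4, Const));
assert_eq!(t.shape().1, 4);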

And for those curious, check out the internals of the struct here (hint: we use GATs!).

Rust iterators for Dataset/DataLoader functionality

I’ve added a number of iterator extension methods to take the place of dataloaders from other frameworks. One of the best parts is that you can use these with normal rust iterators! Here’s how they all work together (this is ripped from examples/06-mnist.rs):

// A preprocessing function that is applied
// to each item in the dataset
let mut preprocess = |(img, lbl)| { ... };

dataset
    // From `dfdx::data::ExactSizeDataset`.
    .shuffled(&mut rng)

    // Just a normal iterator map.
    .map(preprocess)

    // Turns `Item` into `[Item; BATCH_SIZE]`.
    // `.batch(BATCH_SIZE)` would create `Vec<Item>`.
    .batch(Const::<BATCH_SIZE>)

    // Turns `Vec<(A, B)>` into `(Vec<A>, Vec<B>)`.
    // Also works for array items!
    .collate()

    // Stacks Vecs/Arrays of tensors together
    // into a tensor with an additional dim
    .stack()
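
For a self-contained variant that skips the Dataset type, the same extensions work on plain iterators of tensors. Here is a minimal sketch, assuming the batch/collate/stack extension traits are in scope via dfdx::data:

use dfdx::{data::*, prelude::*};

let dev: Cpu = Default::default();

// A toy "dataset": 32 (input, label) pairs.
let pairs: Vec<(Tensor<Rank1<5>, f32, Cpu>, Tensor<Rank1<2>, f32, Cpu>)> =
    (0..32).map(|_| (dev.sample_normal(), dev.sample_normal())).collect();

for (inputs, labels) in pairs.into_iter().batch(Const::<8>).collate().stack() {
    // inputs: Tensor<Rank2<8, 5>, f32, Cpu>
    // labels: Tensor<Rank2<8, 2>, f32, Cpu>
    let _ = (inputs, labels);
}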

See the dfdx::data docs for more info.

Pre-allocated gradients & gradient accumulation

Gradient accumulation is often used to train with a larger effective batch size than your hardware can run in a single forward pass. This is trivial to implement now thanks to some updates to the tracing and gradients APIs.

Additionally, pre-allocating gradients for your model significantly speeds up forward and backward passes, since you only have to allocate once.

// Allocate our gradients ahead of time.
let mut grads = model.alloc_grads();

// Do the first forward pass.
let pred = model.forward_mut(x1.traced(grads));
grads = loss(pred, truth).backward();

// Accumulate by doing the same thing again.
let pred = model.forward_mut(x2.traced(grads));
grads = loss(pred, truth).backward();

// Zero the gradients when done accumulating.
model.zero_grads(&mut grads);
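
One piece not shown above is the optimizer step: an update would slot in right before the zero_grads call. Here is a minimal sketch, assuming the Sgd optimizer from dfdx::optim and that update borrows the gradients:

use dfdx::optim::{Optimizer, Sgd, SgdConfig};

let mut opt = Sgd::new(&model, SgdConfig::default());

// Apply the accumulated gradients; zero them
// afterwards as shown above.
opt.update(&mut model, &grads)?;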

See dfdx::nn::ZeroGrads docs

Exponential moving average of neural networks

A common technique in many domains of DL is to keep a second model which is an exponential moving average of your main model. The update formula looks like model_ema = model_ema * decay + model * (1 - decay).

With some new internal techniques, we are able to get this for free on any module that implements our trait TensorCollection!

let mut model = dev.build_module::<Model, f32>();
let mut model_ema = dev.build_module::<Model, f32>();
model_ema.ema(&model, 0.001);
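
In a training loop you would typically refresh the EMA copy once per optimizer step; a minimal sketch (the step body is elided):

for _step in 0..1000 {
    // ... forward, backward, optimizer update on model ...
    model_ema.ema(&model, 0.001);
}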

See dfdx::nn::ModelEMA docs

What’s next for dfdx?

Here’s a high level view of what’s on my roadmap for new features in dfdx (you can also check out the issues page for more insight):

In parallel, I’ll build out separate crates for the following applications:

Thanks

A big thanks to all my sponsors, including: @jackvial, @TimerErTim, @paholg, @scooter, @zeroows, @iExalt, @quietlychris, @skinner, @Michael P, @Alex R.

Additionally, a big shout out to all the contributors who helped with this release! Find the full release notes.