dfdx v0.9.0

Links: github, crates.io, docs.rs (the discord link is in the github readme)

Release notes: https://github.com/coreylowman/dfdx/releases/tag/v0.9.0

Thank you to all first-time contributors and generous sponsors!

What is dfdx?

An ergonomic deep learning library in rust (similar to pytorch in python). dfdx heavily utilizes const generics in all of the neural network, tensor, and autodiff/backprop features, so all tensor shapes are known and enforced at compile time! It’s easy to use, and its implementation is much easier to understand than something like pytorch or tensorflow.
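To give a flavor of what compile-time shapes buy you, here's a tiny standalone sketch of the idea (illustrative types only, not dfdx's actual API):

```rust
// Illustrative only -- a hand-rolled matrix type, not dfdx's tensors.
struct Matrix<const M: usize, const N: usize>([[f32; N]; M]);

// Matmul is only defined when the inner dimensions agree: (M x K) * (K x N).
fn matmul<const M: usize, const K: usize, const N: usize>(
    a: &Matrix<M, K>,
    b: &Matrix<K, N>,
) -> Matrix<M, N> {
    let mut out = Matrix([[0.0; N]; M]);
    for i in 0..M {
        for j in 0..N {
            for k in 0..K {
                out.0[i][j] += a.0[i][k] * b.0[k][j];
            }
        }
    }
    out
}

fn main() {
    let a = Matrix::<2, 3>([[1.0; 3]; 2]);
    let b = Matrix::<3, 4>([[1.0; 4]; 3]);
    let _c: Matrix<2, 4> = matmul(&a, &b); // ok
    // matmul(&b, &a) would be a *compile* error: the inner dimensions
    // (4 vs 2) don't line up.
}
```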

Nightly features

A bunch of new things were added with nightly support. These can be enabled by running cargo commands with cargo +nightly .... Under the hood this enables a feature named “nightly” in dfdx.
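I haven't dug into dfdx's exact build script, but auto-detecting the toolchain like this is a common pattern; here's a minimal sketch of how it can work (not necessarily dfdx's actual build.rs):

```rust
// build.rs -- a sketch of the common "detect nightly" pattern.
use std::process::Command;

fn main() {
    // Cargo sets RUSTC to the compiler it is about to invoke.
    let rustc = std::env::var("RUSTC").unwrap_or_else(|_| "rustc".into());
    let output = Command::new(rustc)
        .arg("--version")
        .output()
        .expect("failed to run rustc --version");
    let version = String::from_utf8_lossy(&output.stdout);

    // e.g. "rustc 1.66.0-nightly (...)": enable the nightly-only code paths.
    if version.contains("nightly") {
        println!("cargo:rustc-cfg=feature=\"nightly\"");
    }
}
```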

These features are inherently unstable since they depend on the nightly rust compiler, so be aware of that. Just a few weeks ago the Conv2D layers were causing some linker errors and wouldn’t compile!

The nightly features are:

Breaking changes

There are a number of breaking changes in this release. You can find the PRs/issues in the release notes, but here are more details:

Interesting tidbits

Naive convolutions with const generics are surprisingly fast

When implementing convolutional layers, there are a lot of tricks to get them as fast as possible. https://sahnimanas.github.io/post/anatomy-of-a-high-performance-convolution/ is a great resource for understanding this more.

However, the backward version of these optimized variants (i.e. the pass that computes the gradients) is really hard to understand, and I was unable to find resources that covered all the cases (batched & strided & padded). See https://github.com/coreylowman/dfdx/issues/1 for some of the resources I did find.

On a whim, I just implemented the “naive” version of convolutions, which is just 6 nested for loops (sketched below). And because of const generics it was actually somewhat competitive with pytorch’s convolutions! Const generics mean the compiler knows the bounds of each for loop and the sizes of the images at compile time. Yay for compiler optimizations!
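Here's roughly what that looks like, as a minimal sketch assuming a single image, stride 1, and no padding (illustrative names, not dfdx's actual conv_forward signature; the caller supplies OH = H - K + 1 and OW = W - K + 1):

```rust
/// Naive 2D convolution (really cross-correlation, as in most DL libraries).
/// All six loop bounds are const generics, so the compiler knows them at
/// compile time and can unroll/vectorize aggressively.
fn conv_forward<
    const C_IN: usize,  // input channels
    const C_OUT: usize, // output channels
    const K: usize,     // kernel size
    const H: usize,     // input height
    const W: usize,     // input width
    const OH: usize,    // output height, = H - K + 1
    const OW: usize,    // output width,  = W - K + 1
>(
    img: &[[[f32; W]; H]; C_IN],
    weight: &[[[[f32; K]; K]; C_IN]; C_OUT],
    out: &mut [[[f32; OW]; OH]; C_OUT],
) {
    for o in 0..C_OUT {
        for y in 0..OH {
            for x in 0..OW {
                for c in 0..C_IN {
                    for ky in 0..K {
                        for kx in 0..K {
                            out[o][y][x] += weight[o][c][ky][kx] * img[c][y + ky][x + kx];
                        }
                    }
                }
            }
        }
    }
}
```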

For actual timings here’s a small benchmark on my dinky laptop for forward & backward of convolutions:

              dfdx     pytorch    pytorch, 1 core [*]
forward       3.2ms    1.8ms      2.5ms
backward      3.5ms    3.1ms      3.8ms

[*] by default pytorch will try to use all available cores. I added torch.set_num_threads(1) and torch.set_num_interop_threads(1) to see its performance on 1 core.

I think this is pretty impressive considering the simplicity of the conv_forward() and conv_backward() functions. They are much easier to read/understand than pytorch’s, and therefore easier to maintain! That’s a win in my book!
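For completeness, here's what the matching naive backward pass can look like under the same simplifying assumptions (single image, stride 1, no padding; a sketch, not dfdx's actual conv_backward). The same six loops run again, but now they scatter the output gradient back into the image and weight gradients:

```rust
/// Naive backward pass for the conv_forward sketch above: accumulates
/// gradients w.r.t. the image and the weights from the output gradient.
fn conv_backward<
    const C_IN: usize,
    const C_OUT: usize,
    const K: usize,
    const H: usize,
    const W: usize,
    const OH: usize,
    const OW: usize,
>(
    img: &[[[f32; W]; H]; C_IN],
    weight: &[[[[f32; K]; K]; C_IN]; C_OUT],
    grad_out: &[[[f32; OW]; OH]; C_OUT],
    grad_img: &mut [[[f32; W]; H]; C_IN],
    grad_weight: &mut [[[[f32; K]; K]; C_IN]; C_OUT],
) {
    for o in 0..C_OUT {
        for y in 0..OH {
            for x in 0..OW {
                let g = grad_out[o][y][x];
                for c in 0..C_IN {
                    for ky in 0..K {
                        for kx in 0..K {
                            // d(out)/d(weight) is the input patch; d(out)/d(img) is the weight.
                            grad_weight[o][c][ky][kx] += g * img[c][y + ky][x + kx];
                            grad_img[c][y + ky][x + kx] += g * weight[o][c][ky][kx];
                        }
                    }
                }
            }
        }
    }
}
```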

I also experimented with rayon for parallelizing, which does speed up the batched convolutions, but at the cost of readability; see https://github.com/coreylowman/dfdx/issues/145 and the sketch below. Any optimizations in this area are welcome!
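The batch dimension is embarrassingly parallel, so the core idea (hedged; not the exact code from that issue) is to hand each image in the batch to a rayon task, reusing the conv_forward sketch from above:

```rust
use rayon::prelude::*;

/// Parallelize over the batch: each image/output pair is independent,
/// so rayon can farm them out across cores.
fn batched_conv_forward<
    const B: usize,
    const C_IN: usize,
    const C_OUT: usize,
    const K: usize,
    const H: usize,
    const W: usize,
    const OH: usize,
    const OW: usize,
>(
    imgs: &[[[[f32; W]; H]; C_IN]; B],
    weight: &[[[[f32; K]; K]; C_IN]; C_OUT],
    outs: &mut [[[[f32; OW]; OH]; C_OUT]; B],
) {
    outs.par_iter_mut()
        .zip(imgs.par_iter())
        .for_each(|(out, img)| conv_forward(img, weight, out));
}
```

The inner kernel is untouched here; in the real library the parallelism has to thread through the generic tensor machinery, which is presumably where the readability cost comes in.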

Macros for the win

When implementing broadcasting, reductions, and select for the CPU device, macros helped a ton. You can see my usage of them in these files:

These methods are very recursive & repetitive in nature, and having 5 different tensor types means there’s a lot of repetition. Now combine those 5 tensor types with each axis they have and you have a lot of combinations: broadcast, for example, has 30 combinations to account for! (A sketch of the macro pattern is below.)
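To illustrate the pattern (a toy sketch, not dfdx's actual macros or tensor types): one macro_rules! definition stamps out an impl per tensor rank, with the const dimensions passed in as a list, so adding a rank is one line instead of another copy-pasted impl block:

```rust
// Toy tensor types standing in for the library's fixed-rank tensors.
struct Tensor1D<const M: usize>([f32; M]);
struct Tensor2D<const M: usize, const N: usize>([[f32; N]; M]);
struct Tensor3D<const M: usize, const N: usize, const O: usize>([[[f32; O]; N]; M]);

trait Fill {
    fn fill_with(&mut self, value: f32);
}

// One macro body covers every rank: the const dims are spliced into both
// the impl generics and the element count.
macro_rules! impl_fill {
    ($Tensor:ident, [$($D:ident),+]) => {
        impl<$(const $D: usize),+> Fill for $Tensor<$($D),+> {
            fn fill_with(&mut self, value: f32) {
                // Nested fixed-size arrays are contiguous f32s, so view the
                // whole storage as one flat mutable slice.
                let len = 1 $(* $D)+;
                let ptr = self.0.as_mut_ptr() as *mut f32;
                unsafe { std::slice::from_raw_parts_mut(ptr, len) }.fill(value);
            }
        }
    };
}

impl_fill!(Tensor1D, [M]);
impl_fill!(Tensor2D, [M, N]);
impl_fill!(Tensor3D, [M, N, O]);
```

Scale that up across broadcast/reduce/select, every rank, and every axis choice, and the savings compound.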

Of course, if anyone figures out how to do all this without macros, let me know.