Links: github, crates.io, docs.rs, discord link is in the github readme
Release notes: https://github.com/coreylowman/dfdx/releases/tag/v0.9.0
Thank you to all first time contributors and generous sponsors!
What is dfdx?
An ergonomic deep learning library in rust (similar to pytorch in python). dfdx heavily utilizes const generics in all of the neural network, tensor, and autodiff/backprop features, so all tensor shapes are known and enforced at compile time! It's easy to use, and the library implementation is much easier to understand than something like pytorch or tensorflow.
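Here's a rough sketch of what the compile-time shape checking looks like. This is an approximation: the type and method names (Linear, ReLU, Tensor1D, zeros, forward) are written from memory of the 0.9-era API and may not match the docs exactly.

```rust
use dfdx::prelude::*;

// A 2-layer MLP described entirely at the type level. If the inner 32s didn't
// match, this would fail to compile instead of erroring at runtime.
// NOTE: names here are from memory of the 0.9-era API and may differ slightly.
type Mlp = (Linear<10, 32>, ReLU, Linear<32, 2>);

fn main() {
    let model: Mlp = Default::default();
    let x: Tensor1D<10> = Tensor1D::zeros();
    // The output shape is also known (and checked) at compile time.
    let _y: Tensor1D<2> = model.forward(x);
}
```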
Nightly features
A bunch of new things were added with nightly support. These can be enabled by running cargo commands with the nightly toolchain (cargo +nightly ...). Under the hood this enables a feature named "nightly" in dfdx.
These features are inherently unstable since they depend on the nightly rust compiler to work, so be aware of that. Just a few weeks ago the Conv2D layers were causing some linker errors and wouldn’t compile!
The nightly features are:
- nn::Conv2D (yay, convolution support!)
- nn::TransformerEncoder/Decoder blocks, an awesome contribution from https://github.com/jafioti!
- tensor_ops::Reshape
- nn::FlattenImage
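As a rough sketch, a tiny image model using a couple of these nightly layers could look like the type below (assuming Conv2D defaults to stride 1 and no padding; the Linear layer and the exact FlattenImage output size are my own back-of-the-envelope math):

```rust
use dfdx::prelude::*;

// Hedged sketch of a nightly-only model (assumes Conv2D defaults to stride 1
// and no padding). For a (4, 28, 28) input: conv -> (8, 26, 26), then
// FlattenImage -> 8 * 26 * 26 = 5408 features.
type SmallConvNet = (
    Conv2D<4, 8, 3>, // 4 input channels, 8 output channels, 3x3 kernel
    ReLU,
    FlattenImage,
    Linear<5408, 10>,
);
```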
Breaking changes
There are a number of breaking changes in this release. You can find the PRs/issues in the release notes, but here are more details:
- CanUpdateWithGradients::update now has a 2nd parameter that you have to pass to sub-modules. (related to Optimizer breaking changes)
- Optimizer::update now returns a Result<(), UnusedParamsError>, which is the Err branch if any of your neural network's parameters didn't have a gradient associated with them (see the sketch after this list).
    - This can happen if you have a bug, or are just fiddling around with custom network structures and don't use some of the params.
    - Previously this would just panic! Now you can decide at the top level whether to panic (.expect()) or ignore the error.
- As part of the broadcast/reductions rework, all *_last_dim functions (e.g. sum_last_dim) were replaced with *_axis functions (e.g. tensor_ops::sum_axis()). Additionally, the *_broadcast_rhs_last methods were removed.
    - Now you can reduce any axis of a tensor (e.g. with a 3d tensor you can sum the 1st, 2nd, or 3rd axis to get a 2d tensor).
    - You can also broadcast up to 4 axes of a tensor! Most useful so far seems to be broadcasting a single axis with tensor_ops::Broadcast1.
    - See https://docs.rs/dfdx/latest/dfdx/tensor_ops/index.html#reductions, https://docs.rs/dfdx/latest/dfdx/tensor_ops/index.html#broadcasts, and https://docs.rs/dfdx/latest/dfdx/tensor_ops/index.html#selectsindexing.
- gather_last_dim was replaced by tensor_ops::Select1. You can now select sub-tensors with usizes from any axis of a tensor!
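For the Optimizer::update change, the top-level handling ends up looking roughly like this sketch (Sgd, trace, mean, backward, and the exact update signature are from memory and may differ slightly; the Result handling is the point):

```rust
use dfdx::prelude::*;

fn main() {
    // NOTE: names/signatures here (Sgd, trace, mean, backward, update) are from
    // memory of the 0.9-era API and may differ slightly; the point is just how
    // the new Result from Optimizer::update gets handled.
    let mut model: Linear<4, 2> = Default::default();
    let mut opt: Sgd = Default::default();

    let x: Tensor1D<4> = Tensor1D::zeros();
    let loss = model.forward(x.trace()).mean(); // a silly "loss", just for illustration
    let gradients = loss.backward();

    // Optimizer::update now returns Result<(), UnusedParamsError>.
    // Either treat unused params as a bug...
    opt.update(&mut model, gradients).expect("some params never got a gradient");
    // ...or deliberately ignore them:
    // let _ = opt.update(&mut model, gradients);
}
```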
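To make the axis bookkeeping concrete, here's a plain-Rust (non-dfdx) illustration of what summing axis 1 of a (2, 3, 4) tensor means:

```rust
// Plain Rust, not dfdx: summing over axis 1 of a (2, 3, 4) "tensor" collapses
// the middle dimension, leaving a (2, 4) result.
fn sum_axis_1<const A: usize, const B: usize, const C: usize>(
    t: &[[[f32; C]; B]; A],
) -> [[f32; C]; A] {
    let mut out = [[0.0; C]; A];
    for a in 0..A {
        for b in 0..B {
            for c in 0..C {
                out[a][c] += t[a][b][c];
            }
        }
    }
    out
}

fn main() {
    let t = [[[1.0f32; 4]; 3]; 2]; // shape (2, 3, 4), all ones
    let s = sum_axis_1(&t);        // shape (2, 4)
    assert_eq!(s[0][0], 3.0);      // 3 ones summed along axis 1
}
```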
Interesting tidbits
Naive convolutions with const generics are surprisingly fast
When implementing convolutional layers, there are a lot of tricks to get them as fast as possible. https://sahnimanas.github.io/post/anatomy-of-a-high-performance-convolution/ is a great resource for understanding this more.
However, the backward versions of these tricks (i.e. the part that computes the gradients) are really hard to understand, and I was unable to find resources that covered all the cases (batched & strided & padded). See https://github.com/coreylowman/dfdx/issues/1 for some of the resources I found.
On a whim I just implemented the "naive" version of convolutions, which is just 6 nested for loops. And because of const generics it was actually somewhat competitive with pytorch's convolutions! Const generics mean the compiler knows the bounds of each for loop and the sizes of the images at compile time. Yay for compiler optimizations!
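To show what "naive" means here, this is a minimal const-generic sketch of the idea (single image, stride 1, no padding). It's an illustration, not dfdx's actual conv_forward, which also handles batching, strides, and padding:

```rust
// A minimal sketch of the "naive" 6-loop convolution idea with const generics
// (single image, stride 1, no padding). Not dfdx's actual conv_forward.
fn naive_conv2d<
    const IN: usize,  // input channels
    const OUT: usize, // output channels
    const K: usize,   // kernel size
    const H: usize,   // input height
    const W: usize,   // input width
    const OH: usize,  // output height, expected to be H - K + 1
    const OW: usize,  // output width, expected to be W - K + 1
>(
    img: &[[[f32; W]; H]; IN],
    weight: &[[[[f32; K]; K]; IN]; OUT],
    out: &mut [[[f32; OW]; OH]; OUT],
) {
    // Every loop bound below is a compile-time constant, which gives the
    // optimizer a lot to work with (unrolling, vectorization, etc.).
    for o in 0..OUT {
        for y in 0..OH {
            for x in 0..OW {
                let mut acc = 0.0;
                for i in 0..IN {
                    for ky in 0..K {
                        for kx in 0..K {
                            acc += weight[o][i][ky][kx] * img[i][y + ky][x + kx];
                        }
                    }
                }
                out[o][y][x] = acc;
            }
        }
    }
}

fn main() {
    // 1 input channel, 1 output channel, 3x3 kernel over a 5x5 image of ones.
    let img = [[[1.0f32; 5]; 5]; 1];
    let weight = [[[[1.0f32; 3]; 3]; 1]; 1];
    let mut out = [[[0.0f32; 3]; 3]; 1];
    naive_conv2d(&img, &weight, &mut out);
    assert_eq!(out[0][0][0], 9.0); // a 3x3 window of ones sums to 9
}
```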
For actual timings here’s a small benchmark on my dinky laptop for forward & backward of convolutions:
- input size: (64, 4, 28, 28), so a batch of 64 28x28 images with 4 channels
- dfdx conv: Conv2D<4, 8, 4>, pytorch conv: Conv2d(4, 8, 4)
- ran forward & backward 1000 times, dropped the first 100 runs
- mean times are reported
|  | dfdx | pytorch | pytorch 1 core [*] |
|---|---|---|---|
| forward | 3.2ms | 1.8ms | 2.5ms |
| backward | 3.5ms | 3.1ms | 3.8ms |
[*] by default pytorch will try to use all available cores. I added torch.set_num_threads(1) and torch.set_num_interop_threads(1) to see its performance on 1 core.
I think this is pretty impressive considering the simplicity of the conv_forward() and conv_backward() functions. They are much easier to read/understand than pytorch's, and therefore easier to maintain! That's a win in my book!
I also experimented with rayon for parallelizing, which does speed up the batched convolutions, but at the cost of readability (see https://github.com/coreylowman/dfdx/issues/145). Any optimizations in this area are welcome!
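For reference, a rayon version of the naive sketch above could look like this (my own illustration, not the code from that issue): one task per image in the batch, each running the sequential kernel.

```rust
use rayon::prelude::*;

// Hypothetical sketch (not dfdx's actual code): parallelize the naive
// convolution over the batch dimension, one rayon task per image.
fn naive_conv2d_batched<
    const IN: usize,
    const OUT: usize,
    const K: usize,
    const H: usize,
    const W: usize,
    const OH: usize,
    const OW: usize,
>(
    imgs: &[[[[f32; W]; H]; IN]],
    weight: &[[[[f32; K]; K]; IN]; OUT],
    outs: &mut [[[[f32; OW]; OH]; OUT]],
) {
    outs.par_iter_mut()
        .zip(imgs.par_iter())
        .for_each(|(out, img)| naive_conv2d(img, weight, out));
}
```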
Macros for the win
When implementing broadcasting, reductions, and select for the CPU device, macros helped a ton. You can see my usage of them in these files:
- https://github.com/coreylowman/dfdx/blob/main/src/devices/broadcast.rs
- https://github.com/coreylowman/dfdx/blob/main/src/devices/select.rs
- https://github.com/coreylowman/dfdx/blob/main/src/devices/reduce_axis.rs
These methods are very recursive & repetitive in nature, and having 5 different tensor types means writing essentially the same impl over and over. Now combine those 5 tensor types with each axis they have and you get a lot of combinations. For example, broadcast has 30 combinations to account for!
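To give a flavor of the pattern (with made-up trait names, not dfdx's actual ones): each macro invocation below stamps out one (axis, output shape) combination of a 2d reduction, which is exactly the kind of repetition that would otherwise have to be hand-written dozens of times.

```rust
// Made-up illustration of the macro pattern, not dfdx's actual traits: one
// macro invocation per (axis, surviving index, output length) combination.
trait SumAxis<const AXIS: usize> {
    type Output;
    fn sum_axis(&self) -> Self::Output;
}

macro_rules! impl_sum_axis_2d {
    ($axis:literal, [$m:ident, $n:ident], loops ($i:ident, $j:ident), keep $keep:ident, len $len:ident) => {
        impl<const $m: usize, const $n: usize> SumAxis<$axis> for [[f32; $n]; $m] {
            type Output = [f32; $len];
            fn sum_axis(&self) -> Self::Output {
                let mut out = [0.0f32; $len];
                for $i in 0..$m {
                    for $j in 0..$n {
                        // the index named by `keep` survives the reduction
                        out[$keep] += self[$i][$j];
                    }
                }
                out
            }
        }
    };
}

// Axis 0 collapses the M rows (keep j, output length N);
// axis 1 collapses the N columns (keep i, output length M).
impl_sum_axis_2d!(0, [M, N], loops (i, j), keep j, len N);
impl_sum_axis_2d!(1, [M, N], loops (i, j), keep i, len M);

fn main() {
    let t = [[1.0f32, 2.0, 3.0], [4.0, 5.0, 6.0]]; // shape (2, 3)
    assert_eq!(SumAxis::<0>::sum_axis(&t), [5.0, 7.0, 9.0]);
    assert_eq!(SumAxis::<1>::sum_axis(&t), [6.0, 15.0]);
}
```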
Of course, if anyone figures out how to do all this without macros, let me know.