Lately I’ve been playing around with the Deeplearning4j suite, and in particular ND4J. ND4J is a Java library for doing linear algebra. I’ve been evaluating it as a replacement for EJML in my little machine learning library. What got me excited about ND4J is that it has built-in capabilities to leverage CUDA, which means harnessing the speed of GPUs. After extensive experimentation, the results are mixed.
First, I will say that from a purely “aesthetic” perspective, I like ND4J better. I found the syntax to be more succinct. I never did find a nice way to do broadcasting or vector operations in EJML, so the ND4J code seemed more readable. The main question, though, is: how does it perform?
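To make that concrete, here’s the kind of broadcasting one-liner I’m talking about. This is just a minimal sketch with illustrative shapes, not code from my actual library:

```java
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class BroadcastDemo {
    public static void main(String[] args) {
        INDArray batch = Nd4j.rand(32, 784);          // 32 samples stored as rows
        INDArray mean = batch.mean(0);                // per-feature mean, shape [784]
        INDArray centered = batch.subRowVector(mean); // broadcast the subtraction across all 32 rows
        System.out.println(centered.shapeInfoToString());
    }
}
```

In EJML I never found a one-liner for that kind of thing and ended up writing the row loop myself.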
I compared the execution times using my mnist-images-test app, as well as chess4j. For the mnist-images test, I processed 60,000 training images for 50 epochs with a learning rate of 3.0 and batch sizes of 16 and 32. A batch size of 16 was slightly more accurate, but 32 was faster for both libraries. Using EJML, it took 89 seconds. Using ND4J without vectorizing, it took 1045 seconds! After vectorizing some loops away, that came down somewhat, to 810 seconds, but that’s still a nearly 10x loss. Further, the CUDA build didn’t improve that any. So, clearly ND4J isn’t the winner here.
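For context, by “vectorizing” I mean replacing per-sample Java loops with single batched ND4J operations. A hedged before/after sketch, with hypothetical layer shapes rather than the actual trainer code:

```java
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class VectorizeDemo {
    public static void main(String[] args) {
        int batchSize = 32;
        INDArray weights = Nd4j.rand(30, 784);  // hypothetical layer: 784 inputs, 30 neurons
        INDArray bias = Nd4j.rand(30, 1);
        INDArray x = Nd4j.rand(784, batchSize); // batch stored as columns

        // Before: one matrix-vector multiply per sample, driven by a Java loop.
        INDArray z1 = Nd4j.zeros(30, batchSize);
        for (int i = 0; i < batchSize; i++) {
            INDArray xi = x.getColumn(i).reshape(784, 1);
            z1.putColumn(i, weights.mmul(xi).add(bias));
        }

        // After: one matrix-matrix multiply over the whole batch.
        INDArray z2 = weights.mmul(x).addColumnVector(bias);

        System.out.println(z1.equalsWithEps(z2, 1e-6)); // same result, far fewer ND4J calls
    }
}
```

The loop version pays ND4J’s per-call overhead once per sample; the batched version pays it once per batch, which is presumably part of why vectorizing paid off so much more with chess4j’s large batches below.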
In chess4j, however, the situation was different. I ran just one epoch over 10M+ positions with a batch size of 8192. The EJML build took 1109 seconds. The ND4J build without vectorization took 1920 seconds! However, vectorizing helped tremendously in this case: it reduced the execution time to 586 seconds. Sadly, the CUDA build didn’t improve on that either. So, in summary, ND4J is the winner here by a factor of about 2x, but there’s still no benefit from using GPUs, which seems odd.
I’m going to speculate that it’s the “sparseness” of the operations that made the difference; the network inputs in chess4j would be much more sparse. As for the CUDA runs, perhaps it’s because I’m running Ubuntu on WSL2? More testing is needed here.
All that said, I’m thinking of splitting the trainer out into a separate application anyway. If I do, it will probably be a Python app that leverages PyTorch. Technically there’s no need to do this, but I think it would keep the chess app itself cleaner and its focus “pure.” Also, the extra practice with PyTorch would be beneficial. Even if I go that route, I’d still need to run the forward-pass code in the chess application, so I want that code to execute as fast as possible as well.
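That inference path is pretty small, though: the chess app only needs to load trained weights and evaluate positions, with no autograd or optimizer involved. A minimal sketch of what I mean, assuming a simple two-layer network with a tanh hidden layer (illustrative shapes and names, not actual chess4j code):

```java
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.ops.transforms.Transforms;

public class Evaluator {
    // Weights trained elsewhere (e.g. the PyTorch trainer), loaded at startup.
    private final INDArray w1, b1, w2, b2;

    public Evaluator(INDArray w1, INDArray b1, INDArray w2, INDArray b2) {
        this.w1 = w1; this.b1 = b1; this.w2 = w2; this.b2 = b2;
    }

    // Forward pass only: score a single encoded position (column vector).
    public double evaluate(INDArray position) {
        INDArray hidden = Transforms.tanh(w1.mmul(position).add(b1));
        return w2.mmul(hidden).add(b2).getDouble(0);
    }
}
```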
I have built up a library of around 1 million games, some played against other opponents and some self-play games with random starting positions. At some point I’ll turn those games into training positions.