Lately I’ve been playing around with the Deeplearning4j suite, and in particular ND4J. ND4J is a Java library for doing linear algebra. I’ve been evaluating it as a replacement for EJML in my little machine learning library . What got me excited about ND4J is that it has built in capabilities to leverage CUDA , which means harnessing the speed of GPUs. After extensive experimentation, the results are mixed.
First, I will say that from a purely “aesthetic” perspective, I like ND4J better. I found the syntax to be more succinct. I never did find a nice way to do broadcasting or vector operations in EJML. So, the code seemed more readable. The main question though is, how does it perform?
I compared the execution times using my mnist-images-test app, as well as chess4j . For the mnist-images test, I processed 60,000 training images for 50 epochs, a learning rate of 3.0, and batch sizes of 16 and 32. Batch size 16 was slightly more accurate, but Batch size 32 was faster for both libraries. Using EJML, it took 89 seconds. Using ND4J without vectorizing, it took 1045 seconds! After vectorizing some loops away, that reduced somewhat to 810 seconds, but that’s still a nearly 10x loss. Further, the CUDA build didn’t improve that any. So, clearly ND4J isn’t the winner here.
In chess4j however, the situation was different. I ran just 1 epoch of 10m+ positions, with a batch size of 8192. The EJML build took 1109 seconds. The ND4J without vectorization build took 1920 seconds! However, vectorizing did help tremendously in this case. With vectorization, the execution time was reduced to 586 seconds. Sadly, the CUDA build didn’t improve that any either. So, in summary, ND4J is a winner in this case by a factor of about 2x, but still no benefit from using GPUs which seems odd.
I’m going to speculate that the difference is the “sparseness” of the operations that made the difference. The network inputs in chess4j would be more sparse. As for the CUDA runs though- perhaps because I’m running Ubuntu on WSL2? More testing is needed here.
All that said, I’m thinking of separating the trainer into a separate application anyway. If I do, it will probably be a Python app that leverages Pytorch. Technically there’s no need to do this, but I think it would keep the chess app itself cleaner; to keep its focus “pure.” Also, the extra practice with Pytorch would be beneficial. Even if I go that route I’d need to run forward pass code in the chess application, so I want that code to execute as fast as possible as well.
I have built up a library of around 1 million games, some played with other opponents and some self play games with random starting positions. At some point I’ll turn those games into training positions.
Over the last month I’ve continued to experiment with different network structures with little to show for it. It’s been a little frustrating. I can’t even get a model that fits the Hand Crafted Evaluation. 🙁 This just reinforces what I had concluded in my last update though- I just don’t have enough training data. I’m still using the ~10m position dataset by Andrew Grant. From reading online (such as this post on talkchess.com), hundreds of millions or even a billion training positions is not uncommon. So, how do I get my hands on that many training positions?
I could most likely find a source online if I looked around; perhaps some other authors would share. But, that doesn’t seem very satisfying to me. I’d rather generate them myself, if that’s practical. So, just to do some napkin math: let’s say I want a set of 100m training positions. Let’s also assume that each training game will yield 5 training positions. That means I’d need on the order of 20m unique games! If I wanted a set of 250m training positions (which doesn’t seem to be an unreasonable number at all), I’d need something like 50m unique games. That’s a lot of games!
I’m not philosophically opposed to scraping *games* from online sources, because the games themselves still have to be translated into training positions, and even in that there seems to be some room for experimentation. Another possibility is to play very fast games from random starting positions. I’m not entirely sure how fast I can produce these on my hardware without some experimentation though. I think I will do some experiments before deciding on my “game gathering” strategy.
Another area that needs some work: my home grown machine learning library is far too slow for this. Judging by its performance on the MNIST image classification problem, the library works just fine, but chess seems to be a very extreme use case. I don’t think there’s a way around using hardware level instructions (SIMD, AVX) to have speeds that are acceptable for a chess engine. I think I’m a pretty good bit twiddler, but hardware level stuff isn’t my specialty. If it were just a matter of training faster, I could look into using some other training library (maybe nnue-pytorch, I’m not sure). But, the engine still needs to be able to run a forward pass through the network to make predictions during the game, so it seems there’s really no way around it — I need to figure this out. On the Java front, there is JEP 438 which is part of the latest JDK 20 release. That seems like a promising direction for chess4j. Or, perhaps there is already some Java based linear algebra library that would handle that job for me. I don’t think I’ll ever port the training code to Prophet, but I would like to do the forward pass code, so I’ll have to figure that out in C as well.
Lots to do! In summary, I think the plan for now is to get my machine working on building training data, then while it works I’ll do my homework on making my Java based machine learning library more accurate and faster.
I’ve been sidetracked with some coursework lately on Coursera, but still carving out a little time here and there to work on implementing a neural network in chess4j. I’ve written this post to capture some early experiments and the results.
My first attempt was inspired by the AlphaZero writeup in the “Neural Networks for Chess” book by Dominik Klein. My first encoding wasn’t as elaborate as that, but I did adopt the “planes” encoding of the current position of the pieces (basically a one-hot encoding). The data was one of Andrew Grant’s sets of 10m positions, with the labels being the same as was used for the logistic regression experiment – 0.0 for a white loss, 0.5 for draw, and 1.0 for a white win. I don’t remember the exact network shape, but I think it was something like 100 neurons in the hidden layer followed by a single neuron output layer. The idea was to predict the probability of a win. After training, just to get a feel for how it did, I evaluated each move from the root position and sorted by score. The results were very, very bizarre. It was obvious that wasn’t going to work out.
I thought I must be confused about the input encoding, and tried some experiments where the inputs were full bitboards (e.g. the entire 64 bit word describing the position of white rooks), and the results were tragic. Back to “one hot” encoding – 768 inputs (64 squares x 6 piece types x 2 colors).
After some reading on Talkchess I learned that the approach most people take is to label the positions using the evaluation output (in centipawns) to some fixed depth or fixed node count, and train to that. Again using Andrew’s dataset, I evaluated each position to a fixed depth of 3, and trained the network using 3 layers (not counting the input) – 8x8x1. The root position evals still didn’t seem to be making sense.
I enlarged the network to 32x32x1 and saved the trained network. Then, I added a depth=4 eval to the dataset and trained against those. I ran a 2000 game match between the depth=3 eval and depth=4 eval, and the depth=4 eval just edged out the win with 51% (from memory), +7 ELO. Not very convincing, and well within error bars, but possibly a sign of improvement.
Following that I enlarged the network again to 64x32x1, and ran a match between that and the 32x32x1 (depth4) network to see if there is any evidence of improvement. 705-537-757 (0.542), +29.3 ELO +/- 12.0. So it appears doubling the size of the first hidden layer had a positive effect. What if I ran that strictly as a fixed depth match? Would the results be even stronger in favor of the larger net? I tried a fixed depth=6 match. Result: 3659-3666-2675 (0.500). Disappointing.
At that point I decided to take a step back and try to create a neural network to fit the evaluation function. That is completely useless of course – it would be far faster to actually run the eval function than take a pass through a neural network that approximates it. The point though is to prove that it can be done; to build a foundation to work upon. Once I am successful in that task, then I can move onto building networks that approximate the output of a search, and eventually onto NNUE. The best result I’ve gotten so far has been with a 64x32x1 network, training 1000 iterations over Andrew’s 10m positions, with a learning rate of 3.0. After doing that I listed the moves from the root position, along with the HCE score and the net’s score for comparison:
That is not bad perhaps but feels like we should be a little closer. The question is, what to try next? Is this a data issue? Would using some form of regularization help? Should I use a larger model? I remember some advice from Andrew Ng in one of the courses I took on this topic. His advice was to first determine if you have a bias or variance problem. The error from the test data tracks very closely with the error from the training data, so I don’t think I have a variance problem (if so it would perform significantly better on the training data than on the test data — e.g. fail to generalize). Could I have a bias problem? I’m not sure. The typical solutions to this are to train longer or create a bigger network. The problem with creating a bigger network is that , it’s already taking around 6 days to train!
I think I’ll try to *reduce* the dataset by half. If there is no impact on the overall error, then that would be confirmation that I don’t have a variance problem and I should try a larger network. If it does have an impact, I may want to look at regularization or the data itself.
I remember when I thought that things would be so much simpler with machine learning.
Edit 4/14/23: halving the data did have a negative impact, so I’m going to focus on (1) regularization and (2) increasing the size of the data set.
A good part of the development effort in the previous year was spent optimizing evaluation weights using logistic regression, first using the Texel Tuning Method (simple and slow), and later using a more standard gradient descent approach (faster, but more complicated). The ultimate goal, however, has been to develop a solution that doesn’t require domain knowledge at all. What if there were a way to just let the machine figure out for itself what is “good” about a chess position by examining enough training positions? A supervised machine learning algorithm such as neural networks would be perfect here.
One of the key challenges in using a neural network based evaluation within an alpha-beta search is the loss of speed. There is no getting around the fact that a neural network based evaluator is going to be slower than a traditional “hand crafted” evaluation. The key question is, can it be performed efficiently enough that the speed loss is offset by the gain in evaluation precision? Enter “efficiently updatable neural networks (NNUE).”
To begin the journey towards NNUE, I first had to learn about neural networks. My thoughts were to first learn about neural networks, and then worry about the “efficiently updatable” part, which is specific to computer chess. To that end I’ve spent the last several months reading, taking some online courses, and even writing my own simple machine learning library. Though I’m still learning and experimenting with neural nets in general, I’m at the point where I’m thinking about how to apply them to computer chess and have begun that work.
My path towards NNUE looks like this:
Learning the fundamentals (neural networks) without worrying about NNUE. I’ve learned a lot from Andrew Ng’s Deep Learning courses on Coursera. Michael Nielson’s fantastic (and free!) online book has also been a fantastic resource.
Write some code! The “hello world” of neural networks seems to be writing an image classifier, so I did that.
How would it look to train a network that replaces a traditional chess evaluator? This is the stage I’m at now, and the work is already underway. I’ve decided again to not worry about the “efficiently updatable” stuff here, though I have done some studying around NNUE. I’ll focus on fixed depth searches in this stage, so I can focus more on the mechanics of the network itself. How do I encode a chess position? How should the network be structured? What hyperparameters seem to work better?
Finally, how do I do all this efficiently to minimize speed loss?
There are shorter paths than the DIY approach I’m taking. I could probably rip some code or use an existing library to get something working much faster, but I’m OK with the long road. To me, the journey is as important as the end result. So far it’s been fun.
If you need a Mac build for Prophet, please look here. (Thank you Darius!)
The focus of this release was to continue to improve the evaluation. Prophet 4.3 and chess4j 5.1 have the exact same change log:
Passed pawn by rank (was a single value)
Non-linear mobility (was a single value)
Trapped bishop penalty
These changes are worth about +50 ELO in Prophet (which I expect will bring it very close to the 2500 mark on the CCRL Blitz List). I attempted a “supported rook” term, meaning the rook on an open file is connected to another rook, but surprisingly it actually cost a few ELO. Seems that should work though, so I’ve left the code in place but have it commented out.
I had planned on doing some pawn and basic endgame work in this line, and perhaps I still will, but right now I feel the time is right to begin work on neural networks. I’m pausing development for a while to study the literature. Hopefully by spring I’m ready to begin the implementation.
Prophet and chess4j now understand knight outposts. An outpost, as implemented in Prophet, is a square that cannot be attacked by an enemy pawn. Putting a knight on an outpost can be a strong advantage, particularly if that knight is supported by a friendly pawn.
In the following diagram, the knight on D4 is on an outpost square, but the knight on E4 is not since it may be run off by the F7 pawn at some point.
The bonus (or penalty) given for an outpost varies by square. An additional bonus is given if the outpost is supported, such as the knight on the D4 square above. The “supported” bonus also varies by square. This is possibly overkill, but with an auto-tuner , I reasoned the more knobs and dials it has to minimize error the better. (Or at least, it can’t hurt as long as we guard against over-fitting.)
As expected this feature isn’t a huge gain in terms of ELO, but it did net a few points. It also puts the latest development version at +50 ELO over Prophet 4.2, which was my goal before doing a new release. Before doing a release I’m going to test a couple more terms, both expected to be minor gains at most, but after that I’m going to switch gears so it’d be good to clear them from the board. Those terms are “trapped bishop” and “supported rook on open file.”
Since I released Prophet 4.2 I’ve made a couple of additional evaluation changes:
The passed pawn bonus has been made more granular. Where it used to be a simple bonus for a passed pawn, now it varies depending on the pawn’s rank. 40,000 bullet games says that change was worth about 14 ELO.
Bishop and queen mobility has been made non-linear. This change was inspired by Erik Madsen’s MadChess blog – https://www.madchess.net/2014/12/16/madchess-2-0-beta-build-29-piece-mobility/ . The idea is to encourage piece development. I had originally plugged Erik’s values in verbatim, but they didn’t mesh well with existing weights and testing showed it weakened the program. After running the auto-tuner, this change brought in an additional 22 ELO.
In my first attempt at running the auto-tuner, I just started with the previously tuned weights, plus Erik’s values for bishop and queen mobility, but the tuner couldn’t seem to find any improvements. The error bounced around a little, going up and down, and not making any progress. I eventually decided to do a complete reset. I set the piece values to the traditional 1/3/3/5/9 values, and everything else to 0. Then I re-tuned and validated with some bullet games. The learning curve:
Fresh off the heals of these improvements, Prophet played in an informal online engine blitz tourney today. Unfortunately it was a pretty rough outing, placing just 16/20 with 2.5 points out of 9. It was a very strong field though. Even the 10th place finisher is nearly 3000 ELO on CCRL’s 40/2 list.
Next up- I’m going to continue with the mobility theme a little longer by testing rook mobility, then knight outposts, trapped bishops, and connected rooks on open files. I don’t expect any of those will be big points by themselves but cumulatively they might be worth a bit.
I’ve long believed that one of the biggest potential areas of improvement in my chess programs was tuning of the evaluation parameters. chess4j‘s evaluation function is rather simple; there are gaps in its knowledge, but the issue I’m talking about here are the values assigned to the parameters it does have relative to each other. Automated tuning addresses that problem.
Automated tuning in computer chess is not a new concept. In 2014, Peter Österlund (author of the chess program Texel), wrote about his approach of using logistic regression to optimize evaluation parameters. This approach has since been dubbed The Texel Tuning Method . While Peter does get credit for popularizing the idea, it goes back even further – at least as far back as 2009. I won’t rigorously describe the algorithm here as it’s already done in the link referenced above, but I will try to give some intuition without getting bogged down in the math.
Get a bunch of labeled positions. By “labeled,” I mean – each position is associated with an outcome. It’s convenient to think of the outcome from the perspective of the white player: win=1, draw=0.5, loss=0.
Given a chess position, formulate a hypothesis. The hypothesis should be a value in the closed interval [0, 1]. From the perspective of the white player, the hypothesis is the probability of a win.
The error (for a set of evaluation parameters) is a measure of difference between the actual outcome and the hypothesis. Measure the error for each position in your test set and take the average. This is your initial error.
Find a set of evaluation parameters that minimizes the error.
Imagine the error (the cumulative difference between your hypothesis and the actual outcome) as a landscape with hills and valleys. Some hills may be taller than others, and some valleys lower than others. We want to settle into the lowest spot in a valley. Ideally, in the LOWEST valley, but that’s really hard. We’ll settle for ANY valley.
There are various approaches to go about minimizing the error with varying degrees of complexity. I’ve experimented with a local search and gradient descent, which I’ll discuss below.
I don’t mean to fully explain how logistic regression or gradient descent works, but I will share a few details about my experience and my implementation. Most of the code can be found in the tuner package of chess4j.
Good data is crucial to a successful outcome. This point can’t be overstated – the end result is limited by the quality of the data you’re training with. In fact, the data is even more important than the algorithm! If you think about it this should be obvious.
Building a good data set isn’t an overly difficult task, but it is a tedious and time consuming one. Fortunately, others have already done the work and have been generous enough to share with the community. The well known Zuri Chess set from Alexandru Moșoi is a common one, and was my starting point. It contains around 725,000 labeled “quiet” positions. Andrew Grant (Ethereal) has shared a few larger data sets containing 10 million or so positions each.
I did have slightly better results using Andrew’s data. I don’t know if that’s because of the number of positions in the set, or the quality of the data itself, or possibly a combination of both.
In his paper Evaluation & Tuning in Chess Engines, Andrew describes his approach to generating datasets. (I’m assuming the same methodology was used in the shared ones.) He sampled positions from self play games, then performed high depth searches from each of those positions to produce a principal variation. The terminal position of the principal variation was the position added to the dataset. A major benefit of this approach is that the evaluation function can be applied directly, without first running a quiescence search. This is a major time saver.
The basic idea behind almost any learning algorithm is to minimize the error between “truth” and what you think will happen (your “hypothesis”). In the case of a chess game, the outcome is one of three things: white wins, black wins (white loses), or a draw. Let’s give our outcome a name, say “y”. Let y=1 be the case where white wins, y=0 be the case where white loses, and y=0.5 be the case where the game is drawn.
Now, our hypothesis, call it “h(x)” where x is a vector of our evaluation parameter values, is just the probability of a white win. We can map the output of the evaluation function to a hypothesis using a standard sigmoid function. If the eval says 0, our hypothesis is 0.5 (draw). The larger the eval score in the positive direction, the closer the hypothesis moves towards 1 (white wins). As the eval gets more negative, our hypothesis moves towards 0.
Note: Texel did not use the standard logistic sigmoid function, but a more general sigmoid function that uses a scaling constant K he used to minimize the error. Texel used a K of -1.13, so I just did the same. It’s possible with some tuning the tuner might converge a little faster but I haven’t attempted this nor do I plan to.
The goal of training is to minimize error, which is where the cost function comes in. We want our cost function to produce large values when our predictions are wrong, and smaller and smaller values as our predictions are closer to the actual outcome. The simplest way to do this is to take the absolute value of the difference: |y-h| . I used a common variation of that: (y-h)^2. Squaring has the effect of “amplifying” errors.
The overall cost (error) associated with an evaluation parameter vector is simply the average cost over all the training data.
MINIMIZING ERROR WITH LOCAL SEARCH
I used local search in my initial attempts. The idea is very simple – adjust the evaluation parameters, one at a time, each time re-measuring the error. Please reference the pseudo code in the Texel Tuning Wiki article for the details, but this is essentially a very slow walk (more like a stumble) down a hill. The stumble continues until you are at the bottom of the hill.
Pros: This approach is guaranteed to find a local minimum. Indeed, I used this approach to derive the eval parameters in Prophet 4.1, which was a 100 elo improvement over Prophet 4.0. Another major pro is that it’s easy to implement. It will likely require only minimal changes to your evaluation function.
Cons: it is a naive approach, and consequently VERY slow. As in, hours or even days.
Using a local search is a great start, but if you are actively developing and would like to be able to iterate quickly, you’ll likely want a faster approach. That is where gradient descent comes in.
Gradient descent is a commonly used machine learning algorithm to minimize some (differentiable) function by moving in the area of steepest descent. Where a local search takes a drunken stumble down the hill, gradient descent takes a more direct path. Where local search takes hours (or days!), gradient descent takes minutes. Unfortunately, this “steeper walk” also means a “steeper learning curve.” Understanding gradient descent does require a little calculus and linear algebra.
The biggest challenge for me was to start thinking about the evaluation differently than I had been for the last 20+ years. You’ll have to think about the evaluation in terms of features and weights. Your final evaluation is a sum of products:
The beauty of gradient descent is that it adjusts the individual weights according to how active the feature was. Consider a position that your program evaluates incorrectly. If the position didn’t have any queens on the board, why adjust any of the terms related to queens?
This will almost surely require refactoring your evaluation function. In chess4j, I augmented the standard evaluation functions with “feature extraction” functions. See EvalRook for an example. Note that tapered evaluations make this slightly more complex. Instead of a feature being “on or off,” you’ll have to assign it a value indicating how active it is according to the phase of the game.
For each training session, the test data was shuffled and then split into two subsets. The first subset contained 80% of the records and was used for the actual training. The other 20% were used for validation (technically the “test set”). Ideally, the training error should decrease after every iteration, but with gradient descent (particularly stochastic gradient descent) that may not be the case. The important point is that the error trends downward (this is tough if you’re a perfectionist).
Learning curves are tremendously useful to know when to STOP training. If you go too far, you risk a serious problem – overfitting. This is when your model fits the training data “too well” and fails to generalize. In the sample learning curve below I plotted the error curve for the training data vs the error curve for the test (validation) data. In this case the valley around iteration 800 is probably the best place to stop. Things could get interesting if the two curves begin to diverge, but with a large enough data set that shouldn’t be an issue.
After every training session I would validate the results by running a gauntlet of 10,000 bullet games (1 second + 0.5 second increment) against a collection of 8 sparring partners. Periodically I would also run self play matches.
There are several hyper parameters that influence the outcome of a training session. Finding the ideal values for each took a little trial and error. (Actually it is possible to automate this using a validation set, but it didn’t seem worth it here.)
Alpha (the learning rate). This of this as the size of your step as you walk towards down the hill towards the bottom of the valley. A small alpha will take a very long time to converge. Too big, and you may “overstep” the bottom. I ended up with alpha=25.
Batch size. With millions of test positions, it’s not computationally feasible to train against all positions all at once. A typical approach is to choose a small subset, running many iterations with different subsets each time. I tested batch sizes of 10 up to 10,000. The learning curves seemed similar, just more noise with lower batch sizes. I settled on 10,000.
Lambda. This is a decay parameter. The idea is to taper down the size of your steps down the hill over time. Anything I tried seemed worse than lambda=0 so I eventually disabled it altogether.
Regularization. The idea behind regularization is to penalize the size of the parameters. That is, not only do we want the eval weights to be in the proper ratios to each other, but we want the smallest values that satisfy those ratios. In practice I never observed a difference so I eventually disabled it.
This exercise brought in around 120 ELO for Prophet. The difference between Prophet 4.0 and 4.1 was around 100 ELO and can be fully attributed to optimizing evaluation parameters. The difference between Prophet 4.1 and 4.2 is around 50 ELO, but some of that is due to the addition of pawn hashing and other eval terms.
I find it instructive to visualize the piece square tables as heat maps. The following are the pawn piece square tables before and after training. Note in 4.0 there was only a single pawn table, but by 4.2 I had split those into middle and endgame tables. A couple of things that really stand out to me are how undervalued a pawn on the 7th rank was, and just how much the danger of pushing a pawn around the king was underestimated.
In the near term, the focus will be on adding some additional evaluation terms, particularly for pawns. The nice thing about having an auto-tuning capability is that I can focus on correctly implementing a term without worrying too much about how to assign a value to the term. Just rerun the tuner and verify with some games.
Longer term (perhaps Spring 2023) I plan to delve into neural networks and my own implementation of NNUE.
I’m happy to announce updates to both chess engines. Prophet 4.2 is approximately 50 elo stronger than 4.1, and 150 elo stronger than 4.0. (I missed a release announcement or two while this development blog was offline.) The most significant change, and the reason the chess4j major version number has been incremented, is that chess4j now includes an auto tuner! The tuner uses logistic regression with gradient descent to optimize evaluation terms. I’ll write more detail about that in a separate post. Those optimized weights have been added into Prophet, so it benefits from that work as well. Tapered evaluation has been fully implemented as well which added a few elo. I say “fully” because the king evaluation was already tapered, but now both programs fully evaluate the position with a middle game and an endgame score, and weight them based on material on the board. Some concept of mobility has been added as well – a simple count of available squares for both bishops and queens.
Here is how Prophet 4.2 stacks up against its current sparring partners in 1+0.5 games: