I’ve been sidetracked lately with some coursework on Coursera, but I’m still carving out a little time here and there to work on implementing a neural network in chess4j. I’ve written this post to capture some early experiments and their results.
My first attempt was inspired by the AlphaZero writeup in the “Neural Networks for Chess” book by Dominik Klein. My first encoding wasn’t as elaborate as that, but I did adopt the “planes” encoding of the current position of the pieces (basically a one-hot encoding). The data was one of Andrew Grant’s sets of 10m positions, with the labels being the same as those used for the logistic regression experiment: 0.0 for a white loss, 0.5 for a draw, and 1.0 for a white win. I don’t remember the exact network shape, but I think it was something like 100 neurons in the hidden layer followed by a single-neuron output layer. The idea was to predict the probability of a win. After training, just to get a feel for how it did, I evaluated each move from the root position and sorted by score. The results were very, very bizarre. It was obvious that this wasn’t going to work out.
I thought I must be confused about the input encoding, so I tried some experiments where the inputs were full bitboards (e.g. the entire 64-bit word describing the position of the white rooks), and the results were tragic. Back to “one hot” encoding: 768 inputs (64 squares x 6 piece types x 2 colors).
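To make the 768-input encoding concrete, here is a minimal Python sketch that builds it from the piece-placement field of a FEN string. The plane order (white pieces first, then black) and the square numbering (a8 = 0 through h1 = 63) are assumptions chosen for illustration, and `encode_fen_board` is a hypothetical helper, not the actual chess4j code.

```python
# One plane of 64 squares per (color, piece type) pair: 12 * 64 = 768 inputs.
# The plane order below is an assumption made for this sketch.
PIECES = "PNBRQKpnbrqk"


def encode_fen_board(board_field):
    """Return a 768-element list of 0/1 values for a FEN piece-placement field."""
    inputs = [0] * (len(PIECES) * 64)
    square = 0  # 0 = a8, 63 = h1, following the order FEN lists the squares
    for ch in board_field:
        if ch == "/":
            continue  # rank separator, carries no square
        if ch.isdigit():
            square += int(ch)  # a run of empty squares
        else:
            plane = PIECES.index(ch)
            inputs[plane * 64 + square] = 1
            square += 1
    return inputs


start = encode_fen_board("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR")
print(len(start), sum(start))  # 768 inputs, one bit set per piece on the board
```

Each piece sets exactly one input, so the vector is very sparse. A bitboard input, by contrast, packs a whole plane into a single number.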
After some reading on Talkchess I learned that the approach most people take is to label the positions using the evaluation output (in centipawns) to some fixed depth or fixed node count, and train to that. Again using Andrew’s dataset, I evaluated each position to a fixed depth of 3, and trained the network using 3 layers (not counting the input) – 8x8x1. The root position evals still didn’t seem to be making sense.
I enlarged the network to 32x32x1 and saved the trained network. Then, I added a depth=4 eval to the dataset and trained against those labels. I ran a 2000 game match between the network trained on the depth=3 evals and the one trained on the depth=4 evals, and the depth=4 network just edged out the win with 51% (from memory), +7 Elo. Not very convincing, and well within error bars, but possibly a sign of improvement.
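For a sense of scale, here is a quick count of the trainable parameters in the network shapes that come up in this post. It assumes fully connected layers with one bias per neuron and the 768 one-hot inputs. That is an assumption for illustration, not necessarily the exact layout in chess4j, and `parameter_count` is a hypothetical helper.

```python
def parameter_count(inputs, layers):
    """Count weights and biases in a fully connected network."""
    total = 0
    fan_in = inputs
    for width in layers:
        total += fan_in * width + width  # weights plus one bias per neuron
        fan_in = width
    return total


for shape in ([8, 8, 1], [32, 32, 1], [64, 32, 1]):
    print(shape, parameter_count(768, shape))
# [8, 8, 1] 6233
# [32, 32, 1] 25697
# [64, 32, 1] 51329
```

Almost all of the parameters sit in the first layer, because that is where the 768 inputs fan out.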
Following that I enlarged the network again to 64x32x1, and ran a match between that and the 32x32x1 (depth4) network to see if there was any evidence of improvement. Result: 705-537-757 (wins-losses-draws, score 0.542), +29.3 Elo +/- 12.0. So it appears doubling the size of the first hidden layer had a positive effect. What if I ran that strictly as a fixed depth match? Would the results be even stronger in favor of the larger net? I tried a fixed depth=6 match. Result: 3659-3666-2675 (0.500). Disappointing.
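For anyone who wants to check the arithmetic behind those match results, here is a small Python sketch that turns a wins-losses-draws record into a score, an Elo difference, and an approximate 95% margin. It uses the standard logistic Elo formula and a normal approximation for the error. The match tool may compute its margin slightly differently, so treat this as an approximation; `elo_diff` and `match_summary` are hypothetical helper names.

```python
import math


def elo_diff(score):
    """Elo difference implied by an expected score strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def match_summary(wins, losses, draws, z=1.96):
    """Return (score, elo, margin) for a match; z=1.96 gives roughly 95%."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    variance = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / games
    stderr = math.sqrt(variance / games)
    low, high = score - z * stderr, score + z * stderr
    margin = (elo_diff(high) - elo_diff(low)) / 2.0
    return score, elo_diff(score), margin


for record in ((705, 537, 757), (3659, 3666, 2675)):
    score, elo, margin = match_summary(*record)
    print(f"{record}: score={score:.3f} elo={elo:+.1f} +/- {margin:.1f}")
```

The first record comes out close to the 0.542 and +29.3 +/- 12.0 reported above. The fixed-depth match lands within a fraction of an Elo point of zero, which is to say no measurable difference.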
At that point I decided to take a step back and try to create a neural network to fit the evaluation function. That is completely useless of course: it would be far faster to actually run the eval function than to take a pass through a neural network that approximates it. The point though is to prove that it can be done; to build a foundation to work upon. Once I am successful in that task, I can move on to building networks that approximate the output of a search, and eventually on to NNUE. The best result I’ve gotten so far has been with a 64x32x1 network, training 1000 iterations over Andrew’s 10m positions, with a learning rate of 3.0. After doing that I listed the moves from the root position, along with the HCE score and the net’s score for comparison:
train 3.0 1000 net.cfg
final cost: 0.02422496889794358
e2e4 46 52
g1f3 43 47
e2e3 42 33
b1c3 38 32
d2d4 37 31
b2b4 23 24
c2c4 12 21
c2c3 7 19
b2b3 15 18
a2a3 9 16
a2a4 13 16
g1h3 10 15
d2d3 33 13
h2h3 2 11
f2f4 8 9
g2g3 7 8
f2f3 -2 8
b1a3 10 7
h2h4 1 4
g2g4 4 2
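One way to put a number on how close the two columns are is the mean absolute difference between the HCE score and the net's score over these 20 root moves. The Python sketch below simply re-reads the listing above (move, HCE score, net score). It is a rough sanity check on a single position, not a substitute for the training cost.

```python
# The root-move listing from above: move, HCE score, net score.
listing = """\
e2e4 46 52
g1f3 43 47
e2e3 42 33
b1c3 38 32
d2d4 37 31
b2b4 23 24
c2c4 12 21
c2c3 7 19
b2b3 15 18
a2a3 9 16
a2a4 13 16
g1h3 10 15
d2d3 33 13
h2h3 2 11
f2f4 8 9
g2g3 7 8
f2f3 -2 8
b1a3 10 7
h2h4 1 4
g2g4 4 2
"""

rows = [line.split() for line in listing.splitlines()]
diffs = {move: abs(int(hce) - int(net)) for move, hce, net in rows}

mean_abs_diff = sum(diffs.values()) / len(diffs)
worst_move = max(diffs, key=diffs.get)
print(f"mean abs diff: {mean_abs_diff:.1f}, worst: {worst_move} ({diffs[worst_move]})")
# mean abs diff: 6.0, worst: d2d3 (20)
```

So the net is off by 6 points on average from the root, with d2d3 the clear outlier.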
That is not bad perhaps, but it feels like we should be a little closer. The question is, what to try next? Is this a data issue? Would using some form of regularization help? Should I use a larger model? I remember some advice from Andrew Ng in one of the courses I took on this topic: first determine whether you have a bias or a variance problem. The error from the test data tracks very closely with the error from the training data, so I don’t think I have a variance problem (if I did, the network would perform significantly better on the training data than on the test data, i.e. fail to generalize). Could I have a bias problem? I’m not sure. The typical solutions to a bias problem are to train longer or create a bigger network. The problem with creating a bigger network is that it’s already taking around 6 days to train!
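The bias/variance check can be written down as a rough rule of thumb. The Python sketch below is only an illustration of that reasoning. The `diagnose` helper, the 10% gap tolerance, the target error, and the test error in the example are all hypothetical values picked for the sketch; only the training cost of about 0.0242 comes from the run above.

```python
def diagnose(train_error, test_error, target_error, gap_tolerance=0.10):
    """Rough heuristic: label a model as having a bias and/or variance problem.

    target_error is the error you would be happy with; gap_tolerance is how far
    (relative to the training error) the test error may exceed the training error.
    Both are judgment calls, not fixed rules.
    """
    problems = []
    if train_error > target_error:
        problems.append("bias")  # the model can't even fit the training data well
    if test_error - train_error > gap_tolerance * train_error:
        problems.append("variance")  # fits training data but fails to generalize
    return problems


# Training cost from the run above; test and target errors are made up.
print(diagnose(train_error=0.0242, test_error=0.0245, target_error=0.01))
# ['bias']
```

With a test error that tracks the training error this closely, the heuristic never flags variance, so the only open question is whether the training error itself is too high.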
I think I’ll try to *reduce* the dataset by half. If there is no impact on the overall error, then that would be confirmation that I don’t have a variance problem and I should try a larger network. If it does have an impact, I may want to look at regularization or the data itself.
I remember when I thought that things would be so much simpler with machine learning.
Edit 4/14/23: halving the data did have a negative impact, so I’m going to focus on (1) regularization and (2) increasing the size of the data set.