I’ve long believed that one of the biggest potential areas of improvement in my chess programs was tuning of the evaluation parameters. chess4j‘s evaluation function is rather simple; there are gaps in its knowledge, but the issue I’m talking about here are the values assigned to the parameters it does have relative to each other. Automated tuning addresses that problem.
Automated tuning in computer chess is not a new concept. In 2014, Peter Österlund (author of the chess program Texel), wrote about his approach of using logistic regression to optimize evaluation parameters. This approach has since been dubbed The Texel Tuning Method . While Peter does get credit for popularizing the idea, it goes back even further – at least as far back as 2009. I won’t rigorously describe the algorithm here as it’s already done in the link referenced above, but I will try to give some intuition without getting bogged down in the math.
- Get a bunch of labeled positions. By “labeled,” I mean – each position is associated with an outcome. It’s convenient to think of the outcome from the perspective of the white player: win=1, draw=0.5, loss=0.
- Given a chess position, formulate a hypothesis. The hypothesis should be a value in the closed interval [0, 1]. From the perspective of the white player, the hypothesis is the probability of a win.
- The error (for a set of evaluation parameters) is a measure of difference between the actual outcome and the hypothesis. Measure the error for each position in your test set and take the average. This is your initial error.
- Find a set of evaluation parameters that minimizes the error.
Imagine the error (the cumulative difference between your hypothesis and the actual outcome) as a landscape with hills and valleys. Some hills may be taller than others, and some valleys lower than others. We want to settle into the lowest spot in a valley. Ideally, in the LOWEST valley, but that’s really hard. We’ll settle for ANY valley.
There are various approaches to go about minimizing the error with varying degrees of complexity. I’ve experimented with a local search and gradient descent, which I’ll discuss below.
I don’t mean to fully explain how logistic regression or gradient descent works, but I will share a few details about my experience and my implementation. Most of the code can be found in the tuner package of chess4j.
Good data is crucial to a successful outcome. This point can’t be overstated – the end result is limited by the quality of the data you’re training with. In fact, the data is even more important than the algorithm! If you think about it this should be obvious.
Building a good data set isn’t an overly difficult task, but it is a tedious and time consuming one. Fortunately, others have already done the work and have been generous enough to share with the community. The well known Zuri Chess set from Alexandru Moșoi is a common one, and was my starting point. It contains around 725,000 labeled “quiet” positions. Andrew Grant (Ethereal) has shared a few larger data sets containing 10 million or so positions each.
I did have slightly better results using Andrew’s data. I don’t know if that’s because of the number of positions in the set, or the quality of the data itself, or possibly a combination of both.
In his paper Evaluation & Tuning in Chess Engines, Andrew describes his approach to generating datasets. (I’m assuming the same methodology was used in the shared ones.) He sampled positions from self play games, then performed high depth searches from each of those positions to produce a principal variation. The terminal position of the principal variation was the position added to the dataset. A major benefit of this approach is that the evaluation function can be applied directly, without first running a quiescence search. This is a major time saver.
The basic idea behind almost any learning algorithm is to minimize the error between “truth” and what you think will happen (your “hypothesis”). In the case of a chess game, the outcome is one of three things: white wins, black wins (white loses), or a draw. Let’s give our outcome a name, say “y”. Let y=1 be the case where white wins, y=0 be the case where white loses, and y=0.5 be the case where the game is drawn.
Now, our hypothesis, call it “h(x)” where x is a vector of our evaluation parameter values, is just the probability of a white win. We can map the output of the evaluation function to a hypothesis using a standard sigmoid function. If the eval says 0, our hypothesis is 0.5 (draw). The larger the eval score in the positive direction, the closer the hypothesis moves towards 1 (white wins). As the eval gets more negative, our hypothesis moves towards 0.
Note: Texel did not use the standard logistic sigmoid function, but a more general sigmoid function that uses a scaling constant K he used to minimize the error. Texel used a K of -1.13, so I just did the same. It’s possible with some tuning the tuner might converge a little faster but I haven’t attempted this nor do I plan to.
The goal of training is to minimize error, which is where the cost function comes in. We want our cost function to produce large values when our predictions are wrong, and smaller and smaller values as our predictions are closer to the actual outcome. The simplest way to do this is to take the absolute value of the difference: |y-h| . I used a common variation of that: (y-h)^2. Squaring has the effect of “amplifying” errors.
The overall cost (error) associated with an evaluation parameter vector is simply the average cost over all the training data.
MINIMIZING ERROR WITH LOCAL SEARCH
I used local search in my initial attempts. The idea is very simple – adjust the evaluation parameters, one at a time, each time re-measuring the error. Please reference the pseudo code in the Texel Tuning Wiki article for the details, but this is essentially a very slow walk (more like a stumble) down a hill. The stumble continues until you are at the bottom of the hill.
Pros: This approach is guaranteed to find a local minimum. Indeed, I used this approach to derive the eval parameters in Prophet 4.1, which was a 100 elo improvement over Prophet 4.0. Another major pro is that it’s easy to implement. It will likely require only minimal changes to your evaluation function.
Cons: it is a naive approach, and consequently VERY slow. As in, hours or even days.
Using a local search is a great start, but if you are actively developing and would like to be able to iterate quickly, you’ll likely want a faster approach. That is where gradient descent comes in.
MINIMIZING ERROR WITH GRADIENT DESCENT
Gradient descent is a commonly used machine learning algorithm to minimize some (differentiable) function by moving in the area of steepest descent. Where a local search takes a drunken stumble down the hill, gradient descent takes a more direct path. Where local search takes hours (or days!), gradient descent takes minutes. Unfortunately, this “steeper walk” also means a “steeper learning curve.” Understanding gradient descent does require a little calculus and linear algebra.
The biggest challenge for me was to start thinking about the evaluation differently than I had been for the last 20+ years. You’ll have to think about the evaluation in terms of features and weights. Your final evaluation is a sum of products:
eval = F0 * W0 + F1 * W1 + F2 * W2 + .... + FN * WN
The beauty of gradient descent is that it adjusts the individual weights according to how active the feature was. Consider a position that your program evaluates incorrectly. If the position didn’t have any queens on the board, why adjust any of the terms related to queens?
This will almost surely require refactoring your evaluation function. In chess4j, I augmented the standard evaluation functions with “feature extraction” functions. See EvalRook for an example. Note that tapered evaluations make this slightly more complex. Instead of a feature being “on or off,” you’ll have to assign it a value indicating how active it is according to the phase of the game.
For each training session, the test data was shuffled and then split into two subsets. The first subset contained 80% of the records and was used for the actual training. The other 20% were used for validation (technically the “test set”). Ideally, the training error should decrease after every iteration, but with gradient descent (particularly stochastic gradient descent) that may not be the case. The important point is that the error trends downward (this is tough if you’re a perfectionist).
Learning curves are tremendously useful to know when to STOP training. If you go too far, you risk a serious problem – overfitting. This is when your model fits the training data “too well” and fails to generalize. In the sample learning curve below I plotted the error curve for the training data vs the error curve for the test (validation) data. In this case the valley around iteration 800 is probably the best place to stop. Things could get interesting if the two curves begin to diverge, but with a large enough data set that shouldn’t be an issue.
After every training session I would validate the results by running a gauntlet of 10,000 bullet games (1 second + 0.5 second increment) against a collection of 8 sparring partners. Periodically I would also run self play matches.
There are several hyper parameters that influence the outcome of a training session. Finding the ideal values for each took a little trial and error. (Actually it is possible to automate this using a validation set, but it didn’t seem worth it here.)
- Alpha (the learning rate). This of this as the size of your step as you walk towards down the hill towards the bottom of the valley. A small alpha will take a very long time to converge. Too big, and you may “overstep” the bottom. I ended up with alpha=25.
- Batch size. With millions of test positions, it’s not computationally feasible to train against all positions all at once. A typical approach is to choose a small subset, running many iterations with different subsets each time. I tested batch sizes of 10 up to 10,000. The learning curves seemed similar, just more noise with lower batch sizes. I settled on 10,000.
- Lambda. This is a decay parameter. The idea is to taper down the size of your steps down the hill over time. Anything I tried seemed worse than lambda=0 so I eventually disabled it altogether.
- Regularization. The idea behind regularization is to penalize the size of the parameters. That is, not only do we want the eval weights to be in the proper ratios to each other, but we want the smallest values that satisfy those ratios. In practice I never observed a difference so I eventually disabled it.
This exercise brought in around 120 ELO for Prophet. The difference between Prophet 4.0 and 4.1 was around 100 ELO and can be fully attributed to optimizing evaluation parameters. The difference between Prophet 4.1 and 4.2 is around 50 ELO, but some of that is due to the addition of pawn hashing and other eval terms.
I find it instructive to visualize the piece square tables as heat maps. The following are the pawn piece square tables before and after training. Note in 4.0 there was only a single pawn table, but by 4.2 I had split those into middle and endgame tables. A couple of things that really stand out to me are how undervalued a pawn on the 7th rank was, and just how much the danger of pushing a pawn around the king was underestimated.
In the near term, the focus will be on adding some additional evaluation terms, particularly for pawns. The nice thing about having an auto-tuning capability is that I can focus on correctly implementing a term without worrying too much about how to assign a value to the term. Just rerun the tuner and verify with some games.