Validations – and surprises

I’ve been pretty busy as of late, but in an effort to get some momentum going I picked some items from my to-do list that would take far more processor time than programming time: validating some evaluation terms that have been part of Prophet for many years now. The goal, really, was to confirm what I was sure was true: that these evaluation terms do help, and that removing them would surely cause performance to drop. As you’ll see, it didn’t quite work out that way.

Here’s a list of evaluation terms that were tested, with results below.

  • Knight tropism – the idea that keeping your knight as close as possible to the enemy’s king is generally a good thing.
  • Rooks on open files, or even half open files.
  • Passed pawns – pawns with no enemy pawn ahead of them on the same file or on an adjacent file should be rewarded, since the likelihood of promotion increases dramatically.
  • Isolated pawns – pawns with no friendly pawns on an adjacent file should be penalized, since they are weakened considerably without the supporting pawns.
  • Doubled pawns – pawns that occupy the same file as another friendly pawn are a (small) positional weakness.
  • Major pieces (rooks, queens) on the 7th rank with the enemy king on the back rank are usually very deadly, especially when “connected.”

Knight Tropism

This term works by penalizing a knight by 2 x distance_from_enemy_king centipawns. Distance from the enemy king is the max of difference in ranks and difference in files.
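A minimal sketch of that calculation, assuming squares are represented as hypothetical (file, rank) tuples with both values in 0..7. The function and constant names are illustrative only and are not taken from Prophet or chess4j.

```python
KNIGHT_TROPISM_WEIGHT = 2  # centipawns per square of distance, as described above


def chebyshev_distance(sq_a, sq_b):
    """Max of the file difference and the rank difference between two squares."""
    return max(abs(sq_a[0] - sq_b[0]), abs(sq_a[1] - sq_b[1]))


def knight_tropism_penalty(knight_sq, enemy_king_sq):
    """Score adjustment (zero or negative) for a single knight."""
    return -KNIGHT_TROPISM_WEIGHT * chebyshev_distance(knight_sq, enemy_king_sq)


# Knight on f3 = (5, 2), enemy king on g8 = (6, 7):
# file difference 1, rank difference 5, so the distance is 5.
print(knight_tropism_penalty((5, 2), (6, 7)))  # -10
```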

I actually started out running this one with chess4j, the Java engine, before deciding to use Prophet4 going forward. As explained in a previous post, the overhead of restarting the JVM between matches is just too high, and not restarting the engine between games isn’t great either, so doing these fast tests with a lightweight executable that can be quickly restarted seems preferable. At any rate, the outcome was nearly the same so I’ve combined the results into one table.

Also, since these are validations of existing terms, player A was the “control” player, and player B the “without” player. The hypothesis is that “Player A is better than Player B.”

Wins    Losses  Draws   Pct     Elo     Error
6312    5982    7674    50.8%   5.7     3.8
2343    2122    2766    51.5%   10.6    6.3
5942    5789    8269    50.4%   2.7     3.7
Control vs “without knight tropism”, 5+0.25
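For readers who want to sanity check the Pct, Elo and Error columns, here is a sketch of one common way to derive them from the win/loss/draw counts. It assumes the standard logistic Elo model and a 95% interval (z = 1.96) built from the per-game score variance. The post does not say which tool produced the tables, so treat this as an approximation that happens to land close to the published figures, not as the method actually used.

```python
import math


def elo_from_score(score):
    """Elo difference implied by an expected score strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_error(wins, losses, draws, z=1.96):
    """Return (score, elo, error), where error is half the width of the interval."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    variance = (wins * (1.0 - score) ** 2
                + losses * (0.0 - score) ** 2
                + draws * (0.5 - score) ** 2) / n
    std_err = math.sqrt(variance / n)
    low = elo_from_score(score - z * std_err)
    high = elo_from_score(score + z * std_err)
    return score, elo_from_score(score), (high - low) / 2.0


# First row of the knight tropism table.
score, elo, error = elo_with_error(6312, 5982, 7674)
print(f"{score:.1%}  {elo:.1f} +/- {error:.1f}")
```

Under this reading, a result is only convincing when the Elo estimate is larger in magnitude than its error.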

As you can see, knight tropism is worth a few Elo. Not a game wrecker by any means, but it does have a small positive effect. We’ll check this one off the list.

Rooks on Open Files

Rooks on open files, or even half open files, are known to be a strategic advantage. They allow the player to move their rooks around easily, projecting strength and penetrating into the opponent’s position.

Rooks on files with no other pieces are given a 25 centipawn bonus. Rooks on files with just enemy pieces are given a 15 centipawn bonus.
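A rough sketch of this term follows. One assumption is flagged loudly: it treats "pieces" as pawns, which is the conventional definition of open and half open files, and the engine's actual check may differ. The function name and the set-of-files inputs are hypothetical.

```python
OPEN_FILE_BONUS = 25       # centipawns, file has no pawns at all (assumed meaning)
HALF_OPEN_FILE_BONUS = 15  # centipawns, file has only enemy pawns (assumed meaning)


def rook_file_bonus(rook_file, own_pawn_files, enemy_pawn_files):
    """Bonus for one rook, given the sets of files (0..7) that hold pawns."""
    if rook_file in own_pawn_files:
        return 0  # blocked by a friendly pawn: neither open nor half open
    if rook_file in enemy_pawn_files:
        return HALF_OPEN_FILE_BONUS
    return OPEN_FILE_BONUS


own_files = {0, 1, 2, 5, 6, 7}
enemy_files = {0, 1, 3, 5, 6, 7}
print(rook_file_bonus(4, own_files, enemy_files))  # 25, the e-file is open
print(rook_file_bonus(3, own_files, enemy_files))  # 15, the d-file is half open
print(rook_file_bonus(0, own_files, enemy_files))  # 0
```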

Wins    Losses  Draws   Pct     Elo     Error
573     411     717     54.7%   33.2    12.6
Control vs “No rook on open files”, 5+0.25

This term is obviously doing the job. Check this one off the list.

Passed Pawns

Passed pawns become promoted pawns. Passed pawns are awarded 20 centipawns.
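As an illustration, a passed pawn test can be sketched as below, using the definition from the list at the top of the post. Pawns are hypothetical (file, rank) tuples, and `direction` is +1 for a side moving up the board and -1 for a side moving down. None of these names come from the engine.

```python
PASSED_PAWN_BONUS = 20  # centipawns, as stated above


def is_passed(pawn, enemy_pawns, direction=1):
    """True if no enemy pawn is ahead of `pawn` on its own or an adjacent file."""
    pawn_file, pawn_rank = pawn
    for enemy_file, enemy_rank in enemy_pawns:
        on_nearby_file = abs(enemy_file - pawn_file) <= 1
        is_ahead = (enemy_rank - pawn_rank) * direction > 0
        if on_nearby_file and is_ahead:
            return False
    return True


def passed_pawn_score(own_pawns, enemy_pawns, direction=1):
    """Total bonus for all passed pawns of one side."""
    return PASSED_PAWN_BONUS * sum(
        is_passed(p, enemy_pawns, direction) for p in own_pawns
    )


white = {(4, 4), (0, 1)}   # e5 and a2
black = {(3, 3), (1, 6)}   # d4 (already behind the e5 pawn) and b7
print(passed_pawn_score(white, black))  # 20: e5 is passed, a2 is not
```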

Wins    Losses  Draws   Pct     Elo     Error
2143    1930    2656    51.6%   11.0    6.5
Control vs “no passed pawn bonus”, 5+0.25

Another one validated. That’s not to say it’s tuned correctly, but at least we can say it does help.

Isolated Pawns

Isolated pawns don’t have any friendly pawns on adjacent files to give them any support. They are a weakness. In Prophet and chess4j, isolated pawns are penalized 20 centipawns.
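A small sketch of the isolated pawn check, again with pawns as hypothetical (file, rank) tuples. It only illustrates the definition and is not the engine's code.

```python
ISOLATED_PAWN_PENALTY = 20  # centipawns, as stated above


def is_isolated(pawn, friendly_pawns):
    """True if no friendly pawn stands on a file next to this pawn's file."""
    pawn_file = pawn[0]
    return not any(abs(other_file - pawn_file) == 1
                   for other_file, _ in friendly_pawns)


def isolated_pawn_score(friendly_pawns):
    """Total (zero or negative) adjustment for one side's isolated pawns."""
    count = sum(is_isolated(p, friendly_pawns) for p in friendly_pawns)
    return -ISOLATED_PAWN_PENALTY * count


# Pawns on a2, b2 and e4: only the e-pawn lacks a neighbor.
print(isolated_pawn_score({(0, 1), (1, 1), (4, 3)}))  # -20
```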

Wins    Losses  Draws   Pct     Elo     Error
502     641     741     46.3%   -25.7   12.2
1385    1532    1847    48.5%   -10.7   7.7
Control vs “no isolated pawn penalty”, 5+0.25

That is NOT a good result! I was in such disbelief after the first test that I ran a second test, and though the result doesn’t seem quite as bad, it’s still bad. As it stands, the isolated pawn penalty is hurting. I haven’t disabled it, because the major focus right now is rewriting the engine before improving it, and I want to be able to compare apples to apples after the rewrite. However, I have put an item on the backlog to study this in more detail. The heuristic should work. Either the implementation isn’t quite right, or it’s too expensive, or the weights aren’t right. I’ll have to get to the bottom of this.

Doubled Pawns

Doubled pawns are also known to be a positional weakness. They are penalized 10 centipawns (note this penalty gets “awarded” to each pawn).
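To make the "each pawn" detail concrete, here is a sketch that counts pawns per file and charges every pawn on a crowded file. Applying the same per-pawn rule to tripled pawns is my assumption, and the names are hypothetical.

```python
from collections import Counter

DOUBLED_PAWN_PENALTY = 10  # centipawns per pawn, as stated above


def doubled_pawn_score(friendly_pawns):
    """Total (zero or negative) adjustment for pawns that share a file."""
    pawns_per_file = Counter(pawn_file for pawn_file, _ in friendly_pawns)
    # Every pawn on a file holding two or more friendly pawns is penalized.
    penalized = sum(n for n in pawns_per_file.values() if n > 1)
    return -DOUBLED_PAWN_PENALTY * penalized


# Pawns on c2, c3 and f2: the two c-pawns are doubled.
print(doubled_pawn_score({(2, 1), (2, 2), (5, 1)}))  # -20
```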

Wins    Losses  Draws   Pct     Elo     Error
593     722     950     47.2%   -19.8   10.9
858     1149    1516    45.9%   -28.8   8.7
Control vs “no doubled pawn penalty”, 5+0.25

Another surprising and disappointing result! Investigating this has also been added to the post-rewrite backlog.

Majors on 7th

This evaluation term rewards rooks and queens on the 7th rank when the enemy king is on the back rank. If so, 50 centipawns are awarded. Additionally, if the piece is connected to another major piece on the 7th rank, a further 80 centipawns are awarded, the idea being that this is likely a deadly / mating attack.
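The sketch below shows one possible reading of this term from White's point of view. Several details are assumptions on my part rather than facts about Prophet: that both bonuses are applied per piece, and that "connected" means no occupied square lies between the two majors on the 7th rank. Squares are hypothetical (file, rank) tuples, so the 7th rank is index 6 and the back rank is index 7.

```python
MAJOR_ON_7TH_BONUS = 50    # centipawns, as stated above
CONNECTED_7TH_BONUS = 80   # centipawns, as stated above


def connected_on_rank(file_a, file_b, rank, occupied):
    """Assumed meaning of 'connected': every square between the two is empty."""
    low, high = sorted((file_a, file_b))
    return all((f, rank) not in occupied for f in range(low + 1, high))


def majors_on_7th_score(majors, occupied, enemy_king, seventh=6, back=7):
    """Bonus for rooks/queens on the 7th while the enemy king sits on the back rank."""
    if enemy_king[1] != back:
        return 0
    files_on_7th = [f for f, r in majors if r == seventh]
    score = 0
    for f in files_on_7th:
        score += MAJOR_ON_7TH_BONUS
        if any(connected_on_rank(f, g, seventh, occupied)
               for g in files_on_7th if g != f):
            score += CONNECTED_7TH_BONUS  # applied per piece (assumption)
    return score


majors = [(0, 6), (3, 6)]                # rooks on a7 and d7
occupied = {(0, 6), (3, 6), (6, 7)}      # the rooks plus the king on g8
print(majors_on_7th_score(majors, occupied, (6, 7)))  # 260
```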

I can’t remember how long this term has been around, but it’s been a looooong time. Sadly, the results aren’t so good.

Wins    Losses  Draws   Pct     Elo     Error
1019    1137    1542    48.4%   -11.1   8.5
3178    3509    4483    48.5%   -10.3   5.0
Control vs “without majors on 7th”, 5+0.25

Clearly, not a good outcome! I wondered what would happen if I disabled just the “connected” part but left the award for having a major piece on the 7th when the enemy king is on the 8th.

Wins    Losses  Draws   Pct     Elo     Error
5705    6024    8271    49.2%   -5.5    3.7
5691    5973    8336    49.3%   -4.9    3.7
Control vs “without CONNECTED term”, 5+0.25

Disabling the “connected” bit may help a little, but still, the heuristic is hurting overall, not helping.

Conclusion

Out of the six evaluation terms to be tested / validated, three of them were found to be helpful and three are actually harmful. The “majors on 7th”, doubled pawn, and isolated pawn heuristics have all been added to the post-rewrite backlog for further study.

The moral of the story here is, do NOT assume anything. Always test! This has been my philosophy for a good while now, but these eval terms were all added at a time when I wasn’t as rigorous in my testing as I am now. (And, frankly, I don’t think many in the computer chess community were until 10-15 years ago.)

If there is any upshot, it’s that there is a guaranteed strength improvement to be had, even if it’s just removing the terms altogether. But, since I know they really should work, I’ve opted to leave them for now. Also, when the Prophet4 rewrite is complete, I really want to be able to run some benchmarks against Prophet3 and be able to make comparisons. The rewritten codebase should “stand on its own.” Improving the evaluator now would cast some doubt on those comparisons.