Open letter to Russ re/engine use

Squelchbelch

Only Chess

28 Nov 08

Kepler

Demon Duck

of Doom!

Joined: 20 Aug 06
Moves: 20099

29 Nov 08

Originally posted by Squelchbelch
A decent player who also uses an engine isn't going to give you smoking gun engine-like moves. They will simply get all the tactics right & rarely if ever make a mistake.

If not overall matchup rates out of database & over time, then how exactly do you propose they are detected & banned?!

Maybe mods should simply ask suspects "are you using an engine?"
I'm sure they'd give an honest answer. 🙄

See my other (much longer) post on the subject of what to test. In short, I think blunder rate may be a better indicator. Unfortunately it seems to fail when asked to distinguish between good pre-computer correspondence players and engines. That may actually be an artifact of the data since pre-computer CC masters tended to play very complex tactical games, hoping to out calculate their opponent, rather than positional games. It is possible that a good modern CC player using an engine would produce no outright blunders but have an increased rate of "engine doesn't understand" blunders.

Kepler

Demon Duck

of Doom!

Joined: 20 Aug 06
Moves: 20099

29 Nov 08

Originally posted by Jie
People can quote statistics when it suits them and unquote them when it doesn't suit them. There was talk of a 25 move game as being "statistically insignificant" but when a certain individual was pressing for the exclusion of a certain player from the site he said "all it takes is a single game to prove 3b" (conveniently forgetting his own 100% match in a single game).

A 25 move game is statistically insignificant on its own. In terms of the test I am performing it is a sample of size one and therefore any results obtained are unreliable at best. The same would apply to a 100 or 200 move game. Taken together, the games produce a sample of size 25 (number of games NOT number of moves in one game), perfectly adequate for the tests I was performing and comparable in size to some of the other data sets currently being waved about as "damning evidence".

Jie

benching

Joined: 17 Jul 08
Moves: 1218

29 Nov 08

1 edit

Originally posted by Kepler
A 25 move game is statistically insignificant on its own. In terms of the test I am performing it is a sample of size one and therefore any results obtained are unreliable at best. The same would apply to a 100 or 200 move game. Taken together, the games produce a sample of size 25 (number of games NOT number of moves in one game), perfectly adequate for the ...[text shortened]... able in size to some of the other data sets currently being waved about as "damning evidence".

I agree. My point was that strong OTB players should be involved in identifying alleged cheats as how does a weak OTB player determine what is a strong human move and what is a computer move if its all greek to him?

Look at the following quote by Daniel Clement Dennett :-

In chess we find several quite crisp distinctions that can also be discerned rather more problematically in the larger game of life. There are, for instance, the "forced moves" in chess. Moves are occasionally forced by the rules of chess: in these instances one finds oneself so boxed in that one and only one legal move is available.... More interesting ... are the forced moves on those occasions when there is more than one legal move, but only one non-idiotic, non-"suicidal" move, which is said for that reason to be forced. It is forced not by the rules of chess, and not by the laws of physics, but by the dictates of reason. It is obviously the only rational thing to do, given one's interest in winning (or just not losing) the game.

If one'e chess level is so low that they do not understand what forced moves are, then they should not be conducting witchhunts in the name of game moderation.

Korch

Chess Warrior

Riga

Joined: 05 Jan 05
Moves: 24932

29 Nov 08

4 edits

Originally posted by Kepler
The main reason I chose first choice rather than top three is simply that no one could give me an adequate reason for preferring top three over any other number. In fact, I received no explanation at all. My conclusion is that it is an arbitrary number chosen (I suspect) because in the past it has given the desired result, namely an incriminating match up rat stinguish between correspondence-style play (few games, long move times etc) and engines.

The main reason I chose first choice rather than top three is simply that no one could give me an adequate reason for preferring top three over any other number. In fact, I received no explanation at all. My conclusion is that it is an arbitrary number chosen (I suspect) because in the past it has given the desired result, namely an incriminating match up rate for a suspect. A good statistician does not modify his methods to give the desired result. Thinking about the whole n-top choice thing, it occurred to me that if engines produce engine match up rates significantly higher than humans then this should be true of the first choice. If we now increase the number of moves considered the engine has less room for improvement than the human and the gap between the two gets narrower. Furthermore, why stop at three? If we were to increase the number of choices considered sufficiently we could "prove" that my neighbour's cat is an engine. I decided to test my idea using some games played by two versions of Glaurung (guaranteed high match up) and some blitz games played by a couple of people I know elsewhere. The match up rates for the engines were high and increased a little with increasing number of moves analysed whereas the match up rates for humans were low but increased markedly with number of moves analysed.

If only first choice is checked out then it would make very easy avoid cheat detection for example in all 100% choosing top 2 or 100% top 3 choice. Your claim that "it is an arbitrary number chosen (I suspect) because in the past it has given the desired result, namely an incriminating match up rate for a suspect" is unbased and only shows your desire against this cheat detection method. Top 3 were choosed because if someone gets high matchup degree in top 3 choices then it leaves no doubt about cheating.

Strong players from 1922 will not produce significantly worse moves than players today...

One of the most absurd statements I`ve ever have read!!! During these 80 years understanding of chess have been developed much more than you can imagine. Kasparov in his "My great precedors" have showed it very well. There are some kind of positions (like standart Queen gambit positions) in which strong players in 1922 would play not so worse, but in many modern positions (like Queens Indian, Robatch, Kings Indian, Benoni, Nimzowitch etc.) players in 1922 would play much worse than modern 2300 rated player.

Actually there are strong players who have stated opinion that Fisher with his chess knowledge in 70ties would be average GM now.

My current thought is that a more reliable way to distinguish humans and engines would be blunder rate.

Would top GMs pass your blunder rate test, taking into account that they make blunders rarely? And is your blunder rate test able to catch cheaters making some moves in their own and maybe blundering with intention to avoid detection?

Modern top GMs have passed test used by mods. And if they have passed could you give valid reason why some suspect`s can`t?

Gatecrasher

Whale watching

33°36'S 26°53'E

Joined: 05 Feb 04
Moves: 41150

29 Nov 08

Looking only at first choice engine move is like reading a book with most of its pages torn out. Why discard such valuable data? It is true that the gap between engine and human play may narrow as more choices are included, but so too does the variance of such data. Statistical inference based on it is just as reliable, if not more so.

Why stop at three? Who stops at three? We don't know what the current team uses. If they use the batch analyzer developed when I was a game mod, then they could go up to four choices. Perhaps they have developed other tools since. The real question is what engine data would a cheat use? To assume they would use only the first choice is naive in the extreme.

I don't know exactly what data selection methods and statistical tests were used to arrive at Kepler's contention that "match up rates are no indicator of engine use", but I would challenge his conclusion most vehemently. It flies full in the face of the mountains of contrary evidence that I and other mods have amassed over the years.

Each n-choice match-up that is calculated provides valuable evidence on which to apply a statistical hypothesis. It is simply more evidence to be assessed and weighed up. One does not reach a better conclusion by over simplification and selective exclusion. All one achieves wearing blinkers is bad science. One has to look at all evidence no matter how inconvenient it may be to one's preconceptions.

On the other hand, it has to be stated clearly that game modding is not as simple as arriving at a "gold standard" for match-up rates as the OP asserts. While I was a game mod there was no such arbitrary hurdle above which a suspect was automatically considered a cheat. Such an approach ignores other valuable evidence that may be available. It also ignores the nature of the games being analyzed. For example, a sample dominated by sharp tactical games will give much higher match-up rates than a sample dominated by tight positional games.

The more useful application of a benchmark would be to the confidence intervals applied to appropriate hypothesis testing. Even then there are a myriad other factors that may need to be considered before all reasonable doubt can be excluded. Sometimes those factors will support the analysis, and sometimes they detract. Each case should be an individual exercise.

If game modding was as simple as applying a benchmark to match-up rates, there would be no need for game moderators. The admins could simply program an automaton to sift through all our games, and then ban players who are above some arbitrary cut-off. For reasons too many to mention, that approach would certainly increase the likelihood of humans being wrongly banned as engines, and similarly, engines flying under the radar, wrongly passed over as human.

Match-up analysis is a very useful and often decisive tool, but it should not the be-all and end-all of game moderation. I hope the current team realize that.

I'm afraid there is a lot of barking going on in this thread but as far as I can see, it is mostly up the wrong tree.

Kepler

Demon Duck

of Doom!

Joined: 20 Aug 06
Moves: 20099

29 Nov 08

2 edits

Originally posted by Korch
[b]The main reason I chose first choice rather than top three is simply that no one could give me an adequate reason for preferring top three over any other number. In fact, I received no explanation at all. My conclusion is that it is an arbitrary number chosen (I suspect) because in the past it has given the desired result, namely an incriminating match up r ed by mods. And if they have passed could you give valid reason why some suspect`s can`t?

If only first choice is checked out then it would make very easy avoid cheat detection for example in all 100% choosing top 2 or 100% top 3 choice. Your claim that "it is an arbitrary number chosen (I suspect) because in the past it has given the desired result, namely an incriminating match up rate for a suspect" is unbased and only shows your desire against this cheat detection method. Top 3 were choosed because if someone gets high matchup degree in top 3 choices then it leaves no doubt about cheating.

I have no desire to disprove the cheat detection method, quite the opposite in fact. I was trying to demonstrate that it is sound. That is why I find the results so shocking.

A player could only avoid cheat detection by avoiding the top engine choice if he knew in advance what engine was going to be used for detection purposes and he was using the same engine. My results indicate that different engines match each other's top choices approximately 60% of the time but get an almost 100% match up if they are matching moves from the same engine. If the cheat were using Fritz and I was using Shredder how would he know that his 2nd choice was noit the same as my first choice?

If top three match ups leave no doubt about cheating why not use top four or top five? Surely that would be even more indicative of cheating? I still await any kind of reasoned argument in favour of using top three choices and until I receive such a reasoned argument I will assume it is an arbitrary number chosen for no particularly good reason.

Would top GMs pass your blunder rate test, taking into account that they make blunders rarely? And is your blunder rate test able to catch cheaters making some moves in their own and maybe blundering with intention to avoid detection?

I don't know. I am still investigating the blunder rate idea. I don't suppose that a cheat is going to deliberately blunder since they are playing to win! I suppose a clever cheat who has different agenda to most of the obvious cheats who have been banned could actually tweak his engine to make plausible blunders at a reasonable rate. If such creatures are common I suspect they are here already and are evading detection by such methods. After all, a blunder is not an engine match so deliberate blunders evade top three match up and blunder rate methods.

Modern top GMs have passed test used by mods. And if they have passed could you give valid reason why some suspect`s can`t?

My concern is that the test may be flawed. Erroneous bannings might be a concern but of equal, or likely greater, concern to at least a few people must be the possibility of failing to ban a cheat.

It is possible that there are other factors at work here. I am very puzzled by the low match up rates the engine tournament got. I was confident that an all-engine event should produce a significantly higher math up rate than humans. Having given this some thought, I am going to revisit the whole thing with different analysis times. The reason for this is because I already discovered that different analysis times can produce strange, apparently chaotic, variations in match up rates. It is possible that Glaurung running on my particular hardware hits a low match up rate at 30 seconds but gives results more in line with expectation at other times. However, even if I do get a result more in keeping with expectation, the first results would just provide confirmation that the combination of hardware, engine and analysis time is critical.

luctruc

Joined: 28 Jan 04
Moves: 3570

29 Nov 08

On the contrary, an "automaton" is exactly what we're after. The ideal test doesn't need deep-browed chess nestors making the calls. Can a test like that be devised? I'm prepared to offer a crisp, new one dollar bill to anyone who can devise it. Site management welcome to chip in.

Jie

benching

Joined: 17 Jul 08
Moves: 1218

29 Nov 08

1 edit

Kepler can you please put [ i ] or [ b ] tags ^^^ so that is clear what you are replying to.

Thanks.

Korch

Chess Warrior

Riga

Joined: 05 Jan 05
Moves: 24932

29 Nov 08

4 edits

Originally posted by Kepler
If only first choice is checked out then it would make very easy avoid cheat detection for example in all 100% choosing top 2 or 100% top 3 choice. Your claim that "it is an arbitrary number chosen (I suspect) because in the past it has given the desired result, namely an incriminating match up rate for a suspect" is unbased and only shows your desire aga rovide confirmation that the combination of hardware, engine and analysis time is critical.

I have no desire to disprove the cheat detection method, quite the opposite in fact. I was trying to demonstrate that it is sound. That is why I find the results so shocking.

It seems to me that your statements does not match with these words....

A player could only avoid cheat detection by avoiding the top engine choice if he knew in advance what engine was going to be used for detection purposes and he was using the same engine. My results indicate that different engines match each other's top choices approximately 60% of the time but get an almost 100% match up if they are matching moves from the same engine. If the cheat were using Fritz and I was using Shredder how would he know that his 2nd choice was noit the same as my first choice?

I agree that different engines does not match 100% with each other, but how can it lead to unbased bans? Different engine used to analyse cheaters moves would give lower matchup results which will favor suspect. And if matchup of current engine is overhelming high then it does not matter which exactly engine was used. and suspect are usually analysed using more than one engine.

I don't suppose that a cheat is going to deliberately blunder since they are playing to win!

They can "sacrifice" some less important games (or even tournaments) to win the most important ones (like English Tal did). Also blunder in winning position may not lead to loss. There can be tactical oversights not making too much harm.

If top three match ups leave no doubt about cheating why not use top four or top five? Surely that would be even more indicative of cheating? I still await any kind of reasoned argument in favour of using top three choices and until I receive such a reasoned argument I will assume it is an arbitrary number chosen for no particularly good reason.

Top 4 could be useful too as humans are unable to reach 100% top 4 choice too. As Gatecrasher have pointed out game mods can investigate also top 4. I did not tell it before not to give extra information to potential cheaters. But could you explain why overhelming matchup of top3 can become invalid evidence, because it was not analysed top 4 or 4 choices? According to your logic if person has overhelming high top1 matchup (unreachable for human players) and not so overhelming top3 matchup, we can make absurd conclusion that "normal" top 3 matchup "neutralises" overhelming high top 1 matchup.

My concern is that the test may be flawed. Erroneous bannings might be a concern but of equal, or likely greater, concern to at least a few people must be the possibility of failing to ban a cheat.

From your statements I dont see logic to claim that erroneous bannings might be a concern. About failing to ban a cheat - as Gatecrasher have pointed out - matchup is not only possible indication of cheating.

It is possible that there are other factors at work here. I am very puzzled by the low match up rates the engine tournament got. I was confident that an all-engine event should produce a significantly higher math up rate than humans. Having given this some thought, I am going to revisit the whole thing with different analysis times. The reason for this is because I already discovered that different analysis times can produce strange, apparently chaotic, variations in match up rates. It is possible that Glaurung running on my particular hardware hits a low match up rate at 30 seconds but gives results more in line with expectation at other times. However, even if I do get a result more in keeping with expectation, the first results would just provide confirmation that the combination of hardware, engine and analysis time is critical.

I dont know how did you get your low matchup but could I have these engine tournament games? It would be interesting to analyse them myself to check out if really they are giving much lower results than top GMs...

But I presume that some weaker engines can give lower matchup with moves of stronger engine than top GMs.

Kepler

Demon Duck

of Doom!

Joined: 20 Aug 06
Moves: 20099

29 Nov 08

Originally posted by Gatecrasher
Looking only at first choice engine move is like reading a book with most of its pages torn out. Why discard such valuable data? It is true that the gap between engine and human play may narrow as more choices are included, but so too does the variance of such data. Statistical inference based on it is just as reliable, if not more so.

Why stop at th ...[text shortened]... arking going on in this thread but as far as I can see, it is mostly up the wrong tree.

The data in my two samples are the moves made in the games. I have single moves since no one in 1922 indicated their top three choices and the game scores of the 16th WCCC do not indicate top three choices. The only data I have available are the moves actually made. To compare that to top three choices from the analysis engine is asking for speculation from the engine. Would Reti have even considered that second choice move? Did he even think about the first choice move if he chose instead the second choice or nth choice? We have no way of knowing. Similarly, we have no way of knowing what the engines' second and third choices were. All we know is that at the moment the move was made that move was the player's (human or engine) first choice. This is not over simplification, this is just accepting the data as it is. Adding unknown and unknowable factors is just adding complexity for no good reason. If the data does not provide any evidence to reject a hypothesis we should not manipulate it just to make it do so.

If, as some assert, engines give a clear signal of "engineness" via the moves they make that signal should be evident in the moves they make NOT in the second and third choices that another engine comes up with. Manipulating the data until it matches preconceived notions or desires is definitely bad science, which is the reason I have left it alone rather than fiddling with the analysis until it matches my expectation of a clear difference between the two samples. I have still not had any reasoned argument for choosing top three choices over any other number so will continue to consider it no more than an arbitrary choice.

I am encouraged that the previous game mods did not take match up rates as the only evidence required. However, it has to be a matter of concern that non-game mods consider that it is the only evidence required to ban without further investigation.

Korch

Chess Warrior

Riga

Joined: 05 Jan 05
Moves: 24932

29 Nov 08

1 edit

Originally posted by Kepler
The data in my two samples are the moves made in the games. I have single moves since no one in 1922 indicated their top three choices and the game scores of the 16th WCCC do not indicate top three choices. The only data I have available are the moves actually made. To compare that to top three choices from the analysis engine is asking for speculation from t me mods consider that it is the only evidence required to ban without further investigation.

I am encouraged that the previous game mods did not take match up rates as the only evidence required. However, it has to be a matter of concern that non-game mods consider that it is the only evidence required to ban without further investigation.

You should understand that disclosing ALL fineses of used methods will help cheats to avoid detection.....

gambit05

Mad Murdock

I forgot

Joined: 05 May 05
Moves: 20526

29 Nov 08

Slightly off topic:

How are book moves defined?

I guess by comparing the moves with a suitable database. However, a certain player could earn specific books on a given opening that others do not have, or could even have analysed many lines beyond accepted opening theory. This might even include an engine, which is also used by many authors of opening books.

Ragnorak

For RHP addons...

tinyurl.com/yssp6g

Joined: 16 Mar 04
Moves: 15013

29 Nov 08

Originally posted by Carterson
i hope i don't get into trouble again for posting this, i ruffled a few feathers in that other forum with this info.

"info"

D

Kepler

Demon Duck

of Doom!

Joined: 20 Aug 06
Moves: 20099

29 Nov 08

1 edit

Originally posted by Korch
I dont know how did you get your low matchup but could I have these engine tournament games? It would be interesting to analyse them myself to check out if really they are giving much lower results than top GMs...

But I presume that some weaker engines can give lower matchup with moves of stronger engine than top GMs.

All the games from the tournament are available in pgn here:
http://www.chessbase.com/news/2008/games/wccc08.pgn

I could give you my copy of this but then you could not know whether I had manipulated the game scores or not. Better that you look at the original data and make your own conclusions. I selected 25 games from the 45 available by removing all the games involving Mobile Chess (if I had not the match up rate would have been much lower!) and then choosing 25 games at random from those remaining.

Your comment about weaker engines giving a lower match up rate is of interest. I am wondering if it is actually possible that the the opposite is true. Could it be that a very strong engine, optimised for the championship, running at tournament times on specialised hardware could produce lower match up rates? It is possible that the games from the 16th World Chess Championship are not really the games I should be looking at. I think I shall have to investigate the possibility that modern engines running on ordinary hardware at more reasonable time controls actually do produce greater match up rates.

Kepler

Demon Duck

of Doom!

Joined: 20 Aug 06
Moves: 20099

29 Nov 08

Originally posted by Jie
Kepler can you please put [ i ] or [ b ] tags ^^^ so that is clear what you are replying to.

Thanks.

I tried that but for some reason it didn't come through when I hit post. I shan't try anything that complicated again.