Originally posted by Kepler:
You may recall I posted in another thread now long gone stating that I was applying statistical analysis to two samples of games. One sample was taken from a tournament held in Vienna in 1922 and featured the likes of Reti, Gruenfeld and Rubinstein. The other sample was taken from the 16th World Computer Chess Championship, which was recently won by Rybka. ...[text shortened]... it has taken so long to ban some alleged cheats, match-up rates are no indicator of engine use!

Surely the reason for the low match-up rate from the World Computer Championship is the fact that there were at least two or three absolutely dreadful programs in the competition? These would drag down the average match-up rate for all the games in this tournament.
http://www.grappa.univ-lille3.fr/icga/tournament.php?id=178
Did your analysis take book moves into account? The human game which you said had an 80% match-up was a 25-move draw, so probably too short a game to be statistically significant.
http://www.chessgames.com/perl/chessgame?gid=1148874
For individual games between players on this site we have seen 90% match-ups with Fritz's first choice.
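Whether a 25-move game can carry any statistical weight is really a sample-size question. Here is a minimal sketch of that calculation, assuming a hypothetical 55% baseline first-choice match rate for a strong human - an invented figure, purely for illustration - and treating moves as independent, which forced lines and book moves make doubtful:

```python
# Rough one-sided significance check: how surprising is 20/25
# first-choice matches if the player were an ordinary strong human?
from scipy.stats import binomtest

BASELINE = 0.55          # assumed human first-choice match rate (invented)
matches, moves = 20, 25  # the 80% match-up over a 25-move game

result = binomtest(matches, moves, BASELINE, alternative="greater")
print(f"match-up = {matches / moves:.0%}, p = {result.pvalue:.4f}")
# Whatever number this prints, it leans entirely on the assumed
# baseline and on move independence - both shaky, as the thread notes,
# once book moves and forced lines are left in the sample.
```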
Originally posted by alexstclaire:
who is the player in question?! is it numero uno

i have been really itching to ask the same question but thought it inappropriate, as the thread may be removed as a consequence; however, if someone would like to send any analysis privately i would really like to give it some consideration - regards Robbie.
Originally posted by Kepler:
[stuff I'm not responding to cut out]
A little preliminary work on the top three issue suggested that this actually narrows the gap between engine and human, making the difference harder to detect, whereas I wanted the opposite.

I'd wondered about the choice of top 3 moves myself. Can you expand on how and why you decided that testing beyond the first-choice engine move makes the difference between engines and humans harder to detect (other than the obvious "they've got three chances to match" aspect)? I think that this is a critical issue. For that matter, I'd be interested to hear one of the former games mods comment on this.

I'm sympathetic to what you are trying to do, as it is vital that the innocent are not found guilty. However, it's also annoying that people use engines at all - I'd like their behaviour modified. While I'm aware that you are an expert in statistics, which I'm not, I feel that your implied conclusion - that it is impossible to find a statistical test which can reliably detect engine use - is pessimistic.

I think you are right that a major problem is finding a good way of eliminating moves that both a human and an engine would make: forced lines and automatic recaptures, or situations where there are a small number of sensible moves and the rest are obviously losing. We seem to agree that eliminating these moves from the sample should increase the difference between genuine players and GenuineIntel players.
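One way the "eliminate forced moves" idea might be mechanised - a sketch only, assuming python-chess with a Stockfish binary on the PATH, and an arbitrary 150-centipawn gap as the "forced" threshold; neither assumption comes from the posters:

```python
import chess
import chess.engine

FORCED_GAP_CP = 150  # assumed threshold - not a figure from the thread

def is_forced(board, engine, depth=15):
    """Treat a position as 'forced' if there is only one legal move, or
    the engine's best move beats its second choice by a wide margin
    (automatic recaptures and only-moves both tend to trip this)."""
    if board.legal_moves.count() <= 1:
        return True
    infos = engine.analyse(board, chess.engine.Limit(depth=depth), multipv=2)
    if len(infos) < 2:
        return True
    best = infos[0]["score"].relative.score(mate_score=100_000)
    second = infos[1]["score"].relative.score(mate_score=100_000)
    return best - second >= FORCED_GAP_CP

# Usage: keep only the genuine-choice positions before computing match-ups.
# with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
#     free_choice = [b for b in positions if not is_forced(b, engine)]
```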
Originally posted by Kepler:
...
This is an extremely disturbing result. If anyone has been banned on the basis of match-up rates alone, I consider that there is at least a 50% chance that they were wrongly banned. I hope that match-up rates have only been used as an indicator to suggest further scrutiny and that further tests have then been applied. This also suggests a reason why it has taken so long to ban some alleged cheats: match-up rates are no indicator of engine use!

A decent player who also uses an engine isn't going to give you smoking-gun, engine-like moves. They will simply get all the tactics right and rarely, if ever, make a mistake.
If not overall match-up rates out of the database and over time, then how exactly do you propose they are detected and banned?!
Maybe mods should simply ask suspects "are you using an engine?"
I'm sure they'd give an honest answer.
Originally posted by DeepThought:
...Forced lines and automatic recaptures, or situations where there are a small number of sensible moves and the rest are obviously losing. We seem to agree that eliminating these moves from the sample should increase the difference between genuine players and GenuineIntel players.

The methods I use compare the overall match-ups between both players, so this takes forcing lines into account.
If you have a game between two evenly rated players with forcing lines and get results like this:
White: 2300(a)
Top 1 Match: 19/31 (61.3%)
Top 2 Match: 27/31 (87.1%)
Top 3 Match: 31/31 (100.0%)

Black: 2300(b)
Top 1 Match: 13/30 (43.3%)
Top 2 Match: 22/30 (73.3%)
Top 3 Match: 23/30 (76.7%)
then who out of 2300(a) and 2300(b) are you going to investigate further?
I honestly think that it is naive to discount forcing moves.
What next? Should we discount all tactics/combinations where the engine user forces the other player into a losing position with a wonderful 10-move coup d'état?
I think games mods would still be investigating the first suspect on the site if we followed your discounted-moves criterion to its logical conclusion!
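For what it's worth, a tally of the kind shown in that report is simple to compute once the engine's candidate moves are in hand. A sketch, assuming the top-three lists per position have already been collected (say from a multipv=3 search); this is not a claim about how the site's mods actually produce these figures:

```python
from typing import List, Tuple

def match_report(moves: List[Tuple[str, List[str]]]) -> None:
    """moves: (move_played, engine_top3) pairs for one player, where
    each engine_top3 list is ordered best-first (e.g. from multipv=3)."""
    total = len(moves)
    for k in (1, 2, 3):
        hits = sum(1 for played, top in moves if played in top[:k])
        print(f"Top {k} Match: {hits}/{total} ({hits / total:.1%})")

# Hypothetical usage, feeding in Black's 30 moves from the example:
# match_report(black_moves)  ->  Top 1 Match: 13/30 (43.3%)  etc.
```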
This discussion is taking the well-trodden path of the thread that was removed earlier.
Let us take the amazing game below:
If forced moves and recaptures should not be factored out, then database moves should also not be factored out. Taking all the moves into account, what are the match-ups for the above game? This also illustrates that a credible "game cheat hunter" should be at least 2200 OTB, not some prima donna loudmouth punk.
Originally posted by Fat Lady:
Surely the reason for the low match-up rate from the World Computer Championship is the fact that there were at least two or three absolutely dreadful programs in the competition? These would drag down the average match-up rate for all the games in this tournament.
http://www.grappa.univ-lille3.fr/icga/tournament.php?id=178
Did your analysis take book moves into ...[text shortened]... ividual games between players on this site we have seen 90% match-ups with Fritz's first choice.

Yes, I took all that into account. I removed all the games played by Mobile Chess, for instance, because it actually plays worse than I do when drunk. Opening moves were discarded, even though opening theory was not as advanced in 1922 as it is now. In fact, the 1922 Vienna tournament player list is like looking at an opening book; most of the players' names are now opening or variation names.
I mentioned the two individual match-up rates to show the range of results involved. We may not like a 25-move draw, but it is part of the data and there is no good statistical reason to discard it, unlike the games played by Mobile Chess.
Originally posted by BlitzNewbie:
Some time ago I posted in the OTB players club forum that I found simple match-up rates to be too one-dimensional to be used as conclusive evidence of engine use. Kepler seems to back up my point of view. The nature of the games analyzed needs to somehow be taken into account, though I have no idea how this is done in practice...

People can quote statistics when it suits them and unquote them when it doesn't suit them. There was talk of a 25-move game as being "statistically insignificant", but when a certain individual was pressing for the exclusion of a certain player from the site he said "all it takes is a single game to prove 3b" (conveniently forgetting his own 100% match in a single game).
Originally posted by DeepThought:
I'd wondered about the choice of top 3 moves myself. Can you expand on how and why you decided that testing beyond the first choice engine move makes the difference ...[text shortened]... from the sample should increase the difference between genuine players and GenuineIntel players.

The main reason I chose first choice rather than top three is simply that no one could give me an adequate reason for preferring top three over any other number. In fact, I received no explanation at all. My conclusion is that it is an arbitrary number, chosen (I suspect) because in the past it has given the desired result, namely an incriminating match-up rate for a suspect. A good statistician does not modify his methods to give the desired result.

Thinking about the whole n-top-choice thing, it occurred to me that if engines produce match-up rates significantly higher than humans, then this should be true of the first choice. If we now increase the number of moves considered, the engine has less room for improvement than the human, and the gap between the two gets narrower. Furthermore, why stop at three? If we were to increase the number of choices considered sufficiently, we could "prove" that my neighbour's cat is an engine.

I decided to test my idea using some games played by two versions of Glaurung (guaranteed high match-up) and some blitz games played by a couple of people I know elsewhere. The match-up rates for the engines were high and increased a little with the number of moves analysed, whereas the match-up rates for humans were low but increased markedly with the number of moves analysed.
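Kepler's experiment is straightforward to reproduce in outline. A sketch, assuming python-chess, Stockfish on the PATH, and placeholder PGN file names; a fixed search depth stands in here for the 30-second budget mentioned in the thread, and book-move trimming is omitted for brevity:

```python
import chess
import chess.engine
import chess.pgn

def match_curve(pgn_path, max_n=5, depth=12):
    """Fraction of played moves appearing in the engine's top n
    choices, for n = 1..max_n, over every game in a PGN file."""
    hits = [0] * max_n
    total = 0
    with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
        with open(pgn_path) as f:
            while (game := chess.pgn.read_game(f)) is not None:
                board = game.board()
                for move in game.mainline_moves():
                    infos = engine.analyse(
                        board, chess.engine.Limit(depth=depth), multipv=max_n
                    )
                    choices = [i["pv"][0] for i in infos if "pv" in i]
                    for n in range(max_n):
                        if move in choices[: n + 1]:
                            hits[n] += 1
                    total += 1
                    board.push(move)
    return [h / total for h in hits]

# If Kepler's observation holds, the curve for engine games (e.g.
# match_curve("glaurung_vs_glaurung.pgn")) starts high and rises only
# slightly with n, while the curve for human blitz games starts low and
# climbs steeply - so widening to "top three" narrows the gap.
```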
I do not think that a reliable statistical test for engine use is impossible, just that the current test - top three engine choices over 30 seconds - may well be unable to distinguish accurately between engine and human. It occurs to me that this is not actually a surprising result. Strong players from 1922 will not produce significantly worse moves than players today and were probably playing better than the majority of players on this site. I suspect they may actually have been capable of better play than all the players on this site. Similarly, a strong engine will produce good moves. If that were not the case, strong engines would lose regularly to humans. I am also unconvinced by the idea that engines somehow produce moves with some characteristic whiff of silicon. Modern engines may produce dubious moves at times, but so do modern GMs.
There are many ways that a good player can distinguish between engine and human. Unfortunately, most of them are subjective, and we require an objective test of "engineness". My current thought is that a more reliable way to distinguish humans from engines would be blunder rate. This is quite easy to check if we use an interface that sticks ? or ?? next to moves it considers downright bad. Now, it is possible that the move is really bad (it loses material for no compensation) or that it is actually good but the engine does not understand it (many positional sacrifices are "bad" to an engine). Whether it is objectively good or bad does not matter; the fact is that moves which get a ? or ?? would not be played by the analysis engine and are therefore unlikely to be played by other engines. Preliminary investigation of the same sample games indicates that blunder rate is a very accurate method of distinguishing between strong OTB players and engines. It should be possible to use this to distinguish between anyone playing in OTB fashion (many games, short move times, etc.) and an engine, but it may be more difficult to distinguish between correspondence-style play (few games, long move times) and engines.
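A blunder count of the kind Kepler describes could look something like the sketch below, assuming python-chess with Stockfish and invented 100/300-centipawn thresholds for ? and ?? - no annotation interface is consulted here, just raw centipawn loss:

```python
import chess
import chess.engine

MISTAKE_CP, BLUNDER_CP = 100, 300  # assumed ?/?? thresholds (invented)

def blunder_counts(board, game_moves, engine, depth=15):
    """Count moves losing at least MISTAKE_CP ('?') and at least
    BLUNDER_CP ('??') centipawns versus the engine's preferred
    continuation. Plays game_moves out on board as it goes."""
    mistakes = blunders = 0
    limit = chess.engine.Limit(depth=depth)
    for move in game_moves:
        before = engine.analyse(board, limit)["score"].relative.score(
            mate_score=100_000)
        board.push(move)
        # Evaluation after the move, negated back to the mover's view.
        after = -engine.analyse(board, limit)["score"].relative.score(
            mate_score=100_000)
        loss = before - after
        if loss >= MISTAKE_CP:
            mistakes += 1
        if loss >= BLUNDER_CP:
            blunders += 1
    return mistakes, blunders
```

The intuition matches Kepler's: engines essentially never hand back 300 centipawns, so a long run of games with a near-zero blunder count is the suspicious signature, while a normal human rate of ? and ?? moves is hard to square with engine use.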