Open letter to Russ re/engine use

Squelchbelch

Only Chess

28 Nov 08

Gatecrasher

Whale watching

33°36'S 26°53'E

Joined: 05 Feb 04
Moves: 41150

30 Nov 08

Gausdal Classic GM-A 2008

{ Kaidanov, Gregory - ELO 2596 (Games: 9) }
{ Top 1 Match: 151/274 ( 55.1% )
{ Top 2 Match: 203/274 ( 74.1% )
{ Top 3 Match: 232/274 ( 84.7% )
{ Top 4 Match: 249/274 ( 90.9% )

{ Gopal, Geetha Narayanan - ELO 2562 (Games: 9) }
{ Top 1 Match: 165/306 ( 53.9% )
{ Top 2 Match: 212/306 ( 69.3% )
{ Top 3 Match: 243/306 ( 79.4% )
{ Top 4 Match: 261/306 ( 85.3% )

{ Kotronias, Vasilios - ELO 2611 (Games: 9) }
{ Top 1 Match: 180/320 ( 56.3% )
{ Top 2 Match: 238/320 ( 74.4% )
{ Top 3 Match: 257/320 ( 80.3% )
{ Top 4 Match: 271/320 ( 84.7% )

{ Sandipan, Chanda - ELO 2585 (Games: 9) }
{ Top 1 Match: 164/310 ( 52.9% )
{ Top 2 Match: 212/310 ( 68.4% )
{ Top 3 Match: 242/310 ( 78.1% )
{ Top 4 Match: 260/310 ( 83.9% )

{ Macieja, Bartlomiej - ELO 2599 (Games: 9) }
{ Top 1 Match: 183/329 ( 55.6% )
{ Top 2 Match: 243/329 ( 73.9% )
{ Top 3 Match: 274/329 ( 83.3% )
{ Top 4 Match: 292/329 ( 88.8% )

{ Ikonnikov, Vyacheslav - ELO 2578 (Games: 9) }
{ Top 1 Match: 135/286 ( 47.2% )
{ Top 2 Match: 193/286 ( 67.5% )
{ Top 3 Match: 207/286 ( 72.4% )
{ Top 4 Match: 237/286 ( 82.9% )

{ Lie, Kjetil A - ELO 2558 (Games: 9) }
{ Top 1 Match: 161/289 ( 55.7% )
{ Top 2 Match: 217/289 ( 75.1% )
{ Top 3 Match: 236/289 ( 81.7% )
{ Top 4 Match: 252/289 ( 87.2% )

{ Krush, Irina - ELO 2479 (Games: 9) }
{ Top 1 Match: 202/362 ( 55.8% )
{ Top 2 Match: 258/362 ( 71.3% )
{ Top 3 Match: 291/362 ( 80.4% )
{ Top 4 Match: 314/362 ( 86.7% )

{ Hole, Oystein - ELO 2387 (Games: 9) }
{ Top 1 Match: 199/409 ( 48.7% )
{ Top 2 Match: 273/409 ( 66.7% )
{ Top 3 Match: 315/409 ( 77.0% )
{ Top 4 Match: 340/409 ( 83.1% )

{ Moskow, Eric - ELO 2229 (Games: 9) }
{ Top 1 Match: 171/354 ( 48.3% )
{ Top 2 Match: 229/354 ( 64.7% )
{ Top 3 Match: 263/354 ( 74.3% )
{ Top 4 Match: 287/354 ( 81.1% )

{ All Players }
{ Top 1 Match: 1711/3239 ( 52.8% )
{ Top 2 Match: 2278/3239 ( 70.3% )
{ Top 3 Match: 2560/3239 ( 79.0% )
{ Top 4 Match: 2763/3239 ( 85.3% )
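The "All Players" figures appear to be pooled counts (total matches over total moves) rather than an average of the individual percentages. A minimal sketch of that arithmetic for the Top 1 row, using the counts listed above; the function names are illustrative only and not part of any analysis tool:

```python
# Per-player (matches, moves) pairs for "Top 1 Match", copied from the table above.
top1 = {
    "Kaidanov": (151, 274),
    "Gopal": (165, 306),
    "Kotronias": (180, 320),
    "Sandipan": (164, 310),
    "Macieja": (183, 329),
    "Ikonnikov": (135, 286),
    "Lie": (161, 289),
    "Krush": (202, 362),
    "Hole": (199, 409),
    "Moskow": (171, 354),
}


def pooled_rate(counts):
    """Sum matches and moves over all players, then divide once."""
    matched = sum(m for m, _ in counts.values())
    total = sum(n for _, n in counts.values())
    return matched, total, 100.0 * matched / total


def mean_of_rates(counts):
    """Unweighted average of the individual percentages, for comparison."""
    return sum(100.0 * m / n for m, n in counts.values()) / len(counts)


matched, total, pct = pooled_rate(top1)
print(f"Pooled: {matched}/{total} ( {pct:.1f}% )")
print(f"Mean of per-player rates: {mean_of_rates(top1):.1f}%")
```

The pooled figure weights players with longer games more heavily, so it need not equal the unweighted mean of the ten percentages.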

Gatecrasher

30 Nov 08

World Chess Championship Candidate Matches 1971

{ Fischer R (Games: 21) }
{ Top 1 Match: 378/602 ( 62.8% )
{ Top 2 Match: 495/602 ( 82.2% )
{ Top 3 Match: 534/602 ( 88.7% )
{ Top 4 Match: 553/602 ( 91.9% )

{ Petrosian T (Games: 26) }
{ Top 1 Match: 301/545 ( 55.2% )
{ Top 2 Match: 390/545 ( 71.6% )
{ Top 3 Match: 437/545 ( 80.2% )
{ Top 4 Match: 464/545 ( 85.1% )

{ Larsen B (Games: 15) }
{ Top 1 Match: 239/418 ( 57.2% )
{ Top 2 Match: 316/418 ( 75.6% )
{ Top 3 Match: 356/418 ( 85.2% )
{ Top 4 Match: 377/418 ( 90.2% )

{ Korchnoi V (Games: 18) }
{ Top 1 Match: 207/393 ( 52.7% )
{ Top 2 Match: 272/393 ( 69.2% )
{ Top 3 Match: 306/393 ( 77.9% )
{ Top 4 Match: 321/393 ( 81.7% )

{ Taimanov M (Games: 6) }
{ Top 1 Match: 131/239 ( 54.8% )
{ Top 2 Match: 173/239 ( 72.4% )
{ Top 3 Match: 195/239 ( 81.6% )
{ Top 4 Match: 212/239 ( 88.7% )

{ Geller E (Games: 8) }
{ Top 1 Match: 86/174 ( 49.4% )
{ Top 2 Match: 111/174 ( 63.8% )
{ Top 3 Match: 128/174 ( 73.6% )
{ Top 4 Match: 141/174 ( 81.0% )

{ Uhlmann W (Games: 9) }
{ Top 1 Match: 140/255 ( 54.9% )
{ Top 2 Match: 185/255 ( 72.5% )
{ Top 3 Match: 200/255 ( 78.4% )
{ Top 4 Match: 217/255 ( 85.1% )

{ Huebner R (Games: 7) }
{ Top 1 Match: 71/133 ( 53.4% )
{ Top 2 Match: 94/133 ( 70.7% )
{ Top 3 Match: 105/133 ( 78.9% )
{ Top 4 Match: 111/133 ( 83.5% )

{ All Players }
{ Top 1 Match: 1553/2759 ( 56.3% )
{ Top 2 Match: 2036/2759 ( 73.8% )
{ Top 3 Match: 2261/2759 ( 81.9% )
{ Top 4 Match: 2396/2759 ( 86.8% )

Gatecrasher

30 Nov 08

6th Correspondence Chess World Cup Final 1968-1971

{ Rittner, H. (Games: 15) }
{ Top 1 Match: 203/337 ( 60.2% )
{ Top 2 Match: 254/337 ( 75.4% )
{ Top 3 Match: 276/337 ( 81.9% )
{ Top 4 Match: 293/337 ( 86.9% )

{ Zagorovsky, V. (Games: 15) }
{ Top 1 Match: 201/378 ( 53.2% )
{ Top 2 Match: 267/378 ( 70.6% )
{ Top 3 Match: 311/378 ( 82.3% )
{ Top 4 Match: 328/378 ( 86.8% )

{ Estrin, Y. (Games: 15) }
{ Top 1 Match: 201/339 ( 59.3% )
{ Top 2 Match: 263/339 ( 77.6% )
{ Top 3 Match: 291/339 ( 85.8% )
{ Top 4 Match: 301/339 ( 88.8% )

{ Thiele, E. (Games: 15) }
{ Top 1 Match: 302/511 ( 59.1% )
{ Top 2 Match: 386/511 ( 75.5% )
{ Top 3 Match: 417/511 ( 81.6% )
{ Top 4 Match: 443/511 ( 86.7% )

{ Top 4 Players}
{ Top 1 Match: 907/1565 ( 58.0% )
{ Top 2 Match: 1170/1565 ( 74.8% )
{ Top 3 Match: 1298/1565 ( 82.9% )
{ Top 4 Match: 1365/1565 ( 87.2% )

Gatecrasher

30 Nov 08

Morelia-Linares 2007

{ Anand, Viswanathan - ELO 2779 (Games: 14) }
{ Top 1 Match: 224/355 ( 63.1% )
{ Top 2 Match: 270/355 ( 76.1% )
{ Top 3 Match: 304/355 ( 85.6% )
{ Top 4 Match: 314/355 ( 88.5% )

{ Carlsen, Magnus - ELO 2690 (Games: 14) }
{ Top 1 Match: 238/400 ( 59.5% )
{ Top 2 Match: 307/400 ( 76.8% )
{ Top 3 Match: 338/400 ( 84.5% )
{ Top 4 Match: 354/400 ( 88.5% )

{ Morozevich, Alexander - ELO 2741 (Games: 14) }
{ Top 1 Match: 272/465 ( 58.5% )
{ Top 2 Match: 347/465 ( 74.6% )
{ Top 3 Match: 382/465 ( 82.2% )
{ Top 4 Match: 412/465 ( 88.6% )

{ Svidler, Peter - ELO 2728 (Games: 14) }
{ Top 1 Match: 189/311 ( 60.8% )
{ Top 2 Match: 239/311 ( 76.8% )
{ Top 3 Match: 262/311 ( 84.2% )
{ Top 4 Match: 275/311 ( 88.4% )

{ Aronian, Levon - ELO 2744 (Games: 14) }
{ Top 1 Match: 188/306 ( 61.4% )
{ Top 2 Match: 239/306 ( 78.1% )
{ Top 3 Match: 262/306 ( 85.6% )
{ Top 4 Match: 275/306 ( 89.9% )

{ Ivanchuk, Vassily - ELO 2750 (Games: 14) }
{ Top 1 Match: 254/460 ( 55.2% )
{ Top 2 Match: 334/460 ( 72.6% )
{ Top 3 Match: 375/460 ( 81.5% )
{ Top 4 Match: 402/460 ( 87.4% )

{ Leko, Peter - ELO 2749 (Games: 14) }
{ Top 1 Match: 215/407 ( 52.8% )
{ Top 2 Match: 300/407 ( 73.7% )
{ Top 3 Match: 339/407 ( 83.3% )
{ Top 4 Match: 363/407 ( 89.2% )

{ Topalov, Veselin - ELO 2783 (Games: 14) }
{ Top 1 Match: 264/463 ( 57.0% )
{ Top 2 Match: 348/463 ( 75.2% )
{ Top 3 Match: 393/463 ( 84.9% )
{ Top 4 Match: 416/463 ( 89.8% )

{ All Players }
{ Top 1 Match: 1844/3167 ( 58.2% )
{ Top 2 Match: 2384/3167 ( 75.3% )
{ Top 3 Match: 2655/3167 ( 83.8% )
{ Top 4 Match: 2811/3167 ( 88.8% )

Kepler

Demon Duck

of Doom!

Joined: 20 Aug 06
Moves: 20099

30 Nov 08

Originally posted by Yuga
I am not exactly sure how Kepler conducted his methodology and obtained his results but logic is sufficient to deduce the basic possibilities and implications of human-computer match up analyses.

I strongly expect a statistically significant difference between the subjects of the human and engine sample in comparing the match up rates of these subjects with ...[text shortened]... s, the presence of engine-shaped mistakes, and the consistent failure to use human-shaped ideas.

Here, we do not have a third entity that is sufficiently different from humans and engines to be distinguishable. Glaurung 2.1 is a member of the population of engines but not that of humans. It is not the third entity, Glaurung, that I was testing. I was actually performing a test to see if there was any evidence that might cause me to reject the hypothesis that the two samples of games had the same mean match up rate. There was no evidence to reject that hypothesis. That does not mean I have "proved" that engines and humans are the same or cannot be distinguished. All I have done is fail to distinguish between the two samples.

Having done a little more work, I have evidence that supports rejecting the hypothesis, which implies that the two samples are from different populations. This time I used the number of question marks awarded during the game, again only using post-opening moves.

In both cases I was looking at a test statistic calculated from the t distribution. In the case of match ups the test statistic was high in value, which essentially tells me I have no reason to reject the hypothesis of equal means. In the other case the test statistic was very low, telling me that it is very unlikely that the two samples have equal means and hence are unlikely to be drawn from the same population. You will notice that statistics does not provide evidence to prove something; rather, it indicates the possible falseness of a hypothesis.

So, yes, there was no statistically significant difference in the match up rates of the two samples. However, there was a statistically significant difference between the rates at which Glaurung awarded question marks to human and engine moves. Same two samples and "third entity" but completely different results.

Gatecrasher

30 Nov 08

16th World Computer Chess Championship 2008

{ Rybka - Cluster 40 cores (Games: 9) } { Rybka (Games: 9) }
{ Top 1 Match: 249/385 ( 64.7% )
{ Top 2 Match: 315/385 ( 81.8% )
{ Top 3 Match: 346/385 ( 89.9% )
{ Top 4 Match: 359/385 ( 93.2% )

{ Hiarcs - Intel Skulltrail, 8 x 4Ghz (Games: 9) }
{ Top 1 Match: 230/356 ( 64.6% )
{ Top 2 Match: 289/356 ( 81.2% )
{ Top 3 Match: 314/356 ( 88.2% )
{ Top 4 Match: 328/356 ( 92.1% )

{ Junior - Intel Dunnington, 12 x 2.67Ghz (Games: 9) }
{ Top 1 Match: 281/469 ( 59.9% )
{ Top 2 Match: 366/469 ( 78.0% )
{ Top 3 Match: 405/469 ( 86.4% )
{ Top 4 Match: 427/469 ( 91.0% )

{ Cluster Toga - Cluster, 24 cores (Games: 9) }
{ Top 1 Match: 275/469 ( 58.6% )
{ Top 2 Match: 372/469 ( 79.3% )
{ Top 3 Match: 408/469 ( 87.0% )
{ Top 4 Match: 424/469 ( 90.4% )

{ Shredder - Intel Core 2, 8 x 3.16Ghz (Games: 9) }
{ Top 1 Match: 271/467 ( 58.0% )
{ Top 2 Match: 344/467 ( 73.7% )
{ Top 3 Match: 374/467 ( 80.1% )
{ Top 4 Match: 399/467 ( 85.4% )

{ Falcon - Intel Core 2, 2 x 2.1Ghz (Games: 9) }
{ Top 1 Match: 353/484 ( 72.9% )
{ Top 2 Match: 426/484 ( 88.0% )
{ Top 3 Match: 443/484 ( 91.5% )
{ Top 4 Match: 455/484 ( 94.0% )

{ Jonny - Cluster, 16 cores (Games: 9) }
{ Top 1 Match: 307/476 ( 64.5% )
{ Top 2 Match: 391/476 ( 82.1% )
{ Top 3 Match: 430/476 ( 90.3% )
{ Top 4 Match: 442/476 ( 92.9% )

{ Sjeng - Intel Core 2, 4 x 2.8Ghz (Games: 9) }
{ Top 1 Match: 266/459 ( 58.0% )
{ Top 2 Match: 364/459 ( 79.3% )
{ Top 3 Match: 389/459 ( 84.7% )
{ Top 4 Match: 415/459 ( 90.4% )

{ The Baron - AMD Opteron 270, 4 x 2Ghz (Games: 9) }
{ Top 1 Match: 218/327 ( 66.7% )
{ Top 2 Match: 271/327 ( 82.9% )
{ Top 3 Match: 286/327 ( 87.5% )
{ Top 4 Match: 300/327 ( 91.7% )

{ Mobile Chess - Nokia 6120c (Games: 9) }
{ Top 1 Match: 133/239 ( 55.6% )
{ Top 2 Match: 170/239 ( 71.1% )
{ Top 3 Match: 185/239 ( 77.4% )
{ Top 4 Match: 197/239 ( 82.4% )

{ All Engines }
{ Top 1 Match: 2583/4131 ( 62.5% )
{ Top 2 Match: 3308/4131 ( 80.1% )
{ Top 3 Match: 3580/4131 ( 86.7% )
{ Top 4 Match: 3746/4131 ( 90.7% )

{ All Engines excluding Mobile Chess}
{ Top 1 Match: 2450/3892 ( 62.9% )
{ Top 2 Match: 3138/3892 ( 80.6% )
{ Top 3 Match: 3395/3892 ( 87.2% )
{ Top 4 Match: 3548/3892 ( 91.2% )

Kepler

30 Nov 08

Blimey, that's a fair sized data set. You wouldn't happen to have the figures for each individual game would you?

Gatecrasher

30 Nov 08

Originally posted by Kepler
Blimey, that's a fair sized data set. You wouldn't happen to have the figures for each individual game would you?

Yes, but I wouldn't post it here. If you PM me an email address you can have it.

no1marauder

Naturally Right

Somewhere Else

Joined: 22 Jun 04
Moves: 42677

30 Nov 08

Originally posted by DeepThought
What is required is confidence in the system. Kepler, and to a lesser extent I, are not confident that a straightforward comparison of match-up rates is sufficient to decide that someone is cheating (except in the more blatant cases). Gatecrasher made quite a long post earlier in the thread that put my mind at rest on that point, as he stated that matc ...[text shortened]... right. So why are you resistant to the idea?

Edit: Written before I read Kepler's last post.

There is plenty of confidence in the system, but a few people here want to undermine it for whatever reason. I have a good amount of confidence in the system that has been developed over the years here by the Game Mods. I have little confidence that the ultimate decision makers don't take into account things that should be non-factors in deciding whether someone should be banned for engine use. Both of those conclusions are based on years of experience with this issue here on RHP.

You can, and just did, argue that no system can ever get every possible case right and that there is always a very slim possibility of false positives. So what? The benefit of every reasonable doubt has been given to everyone who has ever been banned from this site for engine use. Wringing your hands and waiting for a method of detecting engine cheats that is 100% never wrong, to the satisfaction of Deep Thought's ironic user name, is a quixotic quest. If Berliner and Rittner, playing a limited number of correspondence games under the very liberal time limits of CC in decades past, can't get past 82% match ups on a regular basis, neither can Joe Blow who turns on his computer and plays 100 games at a time without using an engine. Deal with it.

Yuga

Renaissance

OnceInALifetime

Joined: 24 Sep 05
Moves: 30579

01 Dec 08

Originally posted by Kepler
Here, we do not have a third entity that is sufficiently different from humans and engines to be distinguishable. Glaurung 2.1 is a member of the population of engines but not that of humans. It is not the third entity, Glaurung, that I was testing. I was actually performing a test to see if there was any evidence that might cause me to reject the hypothesis ...[text shortened]... human and engine moves. Same two samples and "third entity" but completely different results.

An expression that is used where I live - I do not think that we are completely on the same page. 🙂

I have a rudimentary understanding of how test statistics and hypotheses function since I have to use them in science and I have taken a statistics class. 🙂

When comparing human X’s moves directly against computer Z’s moves for match up rates we will get a percent match up.

When comparing computer Y’s moves directly against computer Z’s moves for match up rates we will get a percent match up.

Then we compare X-Z and Y-Z match-ups.

^^^ This is what I believe happened in Kepler’s analysis. [Edit: Gatecrasher just posted data for a comparison for X-Z and Y-Z matchups.]

There was no statistically significant difference in [X-Z and Y-Z matchups]; I am not surprised by this result since your analysis was not a direct comparison between human X’s moves and computer Y’s moves. I inferred reasons for this result in the second and third sentences of my previous post.

Then you did a follow-up, using the same procedure as above but noting the number of question marks awarded during the game. Then you deduced that you have evidence that supports the rejection of the hypothesis which implies that the two samples are from different populations. I do not know what you mean to state by this but I will surmise that you found that there WAS a statistically significant difference in X-Z and Y-Z matchups when question marks were considered. I do not understand exactly how you obtained a result that is seemingly contradictory to your prior study.

So, “there was a statistically significant difference between the rates at which Z awarded question marks to human and engine moves. Same two samples and "third entity" but completely different results.”

So now I assume in your analysis that you considered the humans and engines individually and also collectively? The reason for the difference in question marks given for humans as a collective and engines as a collective is a result of differences in collective human and engine strength.

DeepThought

Losing the Thread

Quarantined World

Joined: 27 Oct 04
Moves: 87415

01 Dec 08

Originally posted by no1marauder
There is plenty of confidence in the system, but a few people here want to undermine it for whatever reason. I have a good amount of confidence in the system that has been developed over the years here by the Game Mods. I have little confidence that the ultimate decision makers don't take into account things that should be non-factors in deciding some ...[text shortened]... turns on his computer and plays 100 games at a time without using an engine. Deal with it.

The ultimate decision makers are part of the system. I was happy with the previous system when I knew who the games mods were and could make my own judgement about their temperaments, their ability to judge whether a high match up rate was an unfortunate quirk of playing style, and their willingness to dig deeper when needed. The new anonymous system is not reassuring to me.

I am happy with 99% confidence that a user is using an engine. But bear in mind that although this means that in any individual case the decision is almost certainly correct, once 100 people have been banned the chance that at least one of them has been incorrectly banned is about 63%, since 1 - 0.99^100 is roughly 0.63 if each decision carries an independent 1% chance of error (see birthday paradox on Wikipedia for a similar compounding effect). The human element involving Tebb, Korch, and Gatecrasher, in my estimation, reduced the chance of this happening, but you and others seem to be arguing that match-up rates alone are sufficient evidence in themselves.
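How a small per-case error rate compounds over many bans can be checked directly. This is a minimal sketch that assumes each decision is wrong independently with the same probability; the function name and the ban counts in the loop are illustrative only:

```python
def prob_at_least_one_false_ban(n_bans, per_case_error):
    """P(at least one wrongful ban), assuming each ban is independently
    wrong with probability per_case_error."""
    return 1.0 - (1.0 - per_case_error) ** n_bans


# Illustrative ban counts at 99% per-case confidence (1% error).
for n in (1, 10, 100, 500):
    print(n, round(prob_at_least_one_false_ban(n, 0.01), 3))
```

If errors are correlated, for example because the same reviewers or the same thresholds are applied throughout, the independence assumption no longer holds and the figure changes.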

Once again you place the substance of your argument in between two insults: the vague "undermine it for whatever reason" followed by my "ironic" user name. You have to demonstrate significance. To do this you can either give a p-value or quote a confidence interval. If you ever did that I might stop arguing with you; while you use insults instead, all you can realistically hope for is a flame war.
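As an illustration of what quoting a confidence interval for a match-up rate could look like, here is a sketch of a Wilson score interval applied to one of the rates posted in this thread (Fischer's Top 1 match of 378/602). It treats every move as an independent trial, which is an assumption and probably an optimistic one; the function name is hypothetical:

```python
import math


def wilson_interval(matches, moves, z=1.96):
    """Approximate 95% Wilson score interval for a match-up proportion.

    Assumes each move is an independent trial with the same match
    probability, which is a simplification for chess games.
    """
    p = matches / moves
    denom = 1.0 + z * z / moves
    centre = (p + z * z / (2.0 * moves)) / denom
    half = z * math.sqrt(p * (1.0 - p) / moves + z * z / (4.0 * moves * moves)) / denom
    return centre - half, centre + half


# Fischer, Top 1 match in the 1971 Candidates matches (figures from this thread).
low, high = wilson_interval(378, 602)
print(f"Top 1 match rate {378 / 602:.3f}, interval ({low:.3f}, {high:.3f})")
```

An interval like this describes the uncertainty in one player's rate; it does not by itself say where a threshold for suspicion should sit.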

For the record my reason for arguing this is two-fold. First I really disapprove of naming people in the forums which recently happened in a couple of subsequently removed or closed threads, and I doubt that formal statistical criteria were applied by those naming names in those instances. Second I am concerned that people could be incorrectly banned. While this is clearly nowhere near on the same scale as my examples of faulty murder convictions, being thrown off a website for cheating is going to seriously upset someone who is innocent of engine use. I really do not understand why you feel threatened by people who want to satisfy themselves that a "match up" test is reliable.

Edit: Modified the sentence about statistical flaws in murder convictions.

Yuga

01 Dec 08

I am posting the interesting part of Gatecrasher’s analysis in one place. I think that I can safely assume that the CPU strength of the engine tested against was very high.

{ All Engines excluding Mobile Chess} [16th WCCC 2008]
{ Top 1 Match: 2450/3892 ( 62.9% )
{ Top 2 Match: 3138/3892 ( 80.6% )
{ Top 3 Match: 3395/3892 ( 87.2% )
{ Top 4 Match: 3548/3892 ( 91.2% )

{ All Players } [Linares-Morelia 2007] [Collective mean ELO 2745]
{ Top 1 Match: 1844/3167 ( 58.2% )
{ Top 2 Match: 2384/3167 ( 75.3% )
{ Top 3 Match: 2655/3167 ( 83.8% )
{ Top 4 Match: 2811/3167 ( 88.8% )

{ Top 4 Players} [6th Correspondence Chess World Cup Final 1968-71]
{ Top 1 Match: 907/1565 ( 58.0% )
{ Top 2 Match: 1170/1565 ( 74.8% )
{ Top 3 Match: 1298/1565 ( 82.9% )
{ Top 4 Match: 1365/1565 ( 87.2% )

{ All Players } [World Chess Championship Candidate Matches 1971]
{ Top 1 Match: 1553/2759 ( 56.3% )
{ Top 2 Match: 2036/2759 ( 73.8% )
{ Top 3 Match: 2261/2759 ( 81.9% )
{ Top 4 Match: 2396/2759 ( 86.8% )

{ All Players } [Gausdal Classic GM-A 2008] [Collective mean ELO 2518]
{ Top 1 Match: 1711/3239 ( 52.8% )
{ Top 2 Match: 2278/3239 ( 70.3% )
{ Top 3 Match: 2560/3239 ( 79.0% )
{ Top 4 Match: 2763/3239 ( 85.3% )
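One rough way to compare two of the pooled Top 1 rates above is a two-proportion z-test. The sketch below uses the engine aggregate (2450/3892) and the Linares-Morelia aggregate (1844/3167). It pools moves and treats each one as independent, which is an assumption that likely overstates the evidence; it is not the per-game test Kepler described earlier in the thread, so the two need not agree:

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """z statistic and two-sided normal p-value for H0: equal proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Engines excluding Mobile Chess vs. Linares-Morelia 2007, Top 1 match counts.
z, p_value = two_proportion_z(2450, 3892, 1844, 3167)
print(f"z = {z:.2f}, p = {p_value:.2g}")
```

Because moves within a game are not independent and game lengths differ, a per-game analysis is the more cautious design; this sketch only shows the mechanics.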

Minor anomalies in 16th WCCC:

{ Falcon - Intel Core 2, 2 x 2.1Ghz (Games: 9) } [scoring 4/9 in 16th WCCC]
{ Top 1 Match: 353/484 ( 72.9% )
{ Top 2 Match: 426/484 ( 88.0% )
{ Top 3 Match: 443/484 ( 91.5% )
{ Top 4 Match: 455/484 ( 94.0% )

{ The Baron - AMD Opteron 270, 4 x 2Ghz (Games: 9) } [scoring 2.5/9 in 16th WCCC]
{ Top 1 Match: 218/327 ( 66.7% )
{ Top 2 Match: 271/327 ( 82.9% )
{ Top 3 Match: 286/327 ( 87.5% )
{ Top 4 Match: 300/327 ( 91.7% )

Winners:

{ Rybka - Cluster 40 cores (Games: 9) } { Rybka (Games: 9) } [scoring 8/9 in 16th WCCC]
{ Top 1 Match: 249/385 ( 64.7% )
{ Top 2 Match: 315/385 ( 81.8% )
{ Top 3 Match: 346/385 ( 89.9% )
{ Top 4 Match: 359/385 ( 93.2% )

{ Fischer R (Games: 21) } [scoring 18.5/21 in 1971 Candidates matches]
{ Top 1 Match: 378/602 ( 62.8% )
{ Top 2 Match: 495/602 ( 82.2% )
{ Top 3 Match: 534/602 ( 88.7% )
{ Top 4 Match: 553/602 ( 91.9% )

{ Anand, Viswanathan - ELO 2779 (Games: 14) } [scoring 8.5/14 in Linares-Morelia 2007]
{ Top 1 Match: 224/355 ( 63.1% )
{ Top 2 Match: 270/355 ( 76.1% )
{ Top 3 Match: 304/355 ( 85.6% )
{ Top 4 Match: 314/355 ( 88.5% )

{ Rittner, H. (Games: 15) } [scoring 12.5/15 in 6th Correspondence Chess World Cup Final]
{ Top 1 Match: 203/337 ( 60.2% )
{ Top 2 Match: 254/337 ( 75.4% )
{ Top 3 Match: 276/337 ( 81.9% )
{ Top 4 Match: 293/337 ( 86.9% )

{ Kaidanov, Gregory - ELO 2596 (Games: 9) }[scoring 7/9 in Gausdal Chess Classics 2008]
{ Top 1 Match: 151/274 ( 55.1% )
{ Top 2 Match: 203/274 ( 74.1% )
{ Top 3 Match: 232/274 ( 84.7% )
{ Top 4 Match: 249/274 ( 90.9% )

Kepler

01 Dec 08

Originally posted by Yuga
An expression that is used where I live - I do not think that we are completely on the same page. 🙂

I have a rudimentary understanding of how test statistics and hypotheses function since I have to use them in science and I have taken a statistics class. 🙂

When comparing human X’s moves directly against computer Z’s moves for match up rates we will get ...[text shortened]... nd engines as a collective is a result of differences in collective human and engine strength.

I agree, we are not completely on the same page. I am not comparing a specific engine or human (X or Y in your post) against the Z entity, Glaurung in this case. What I did do was compare a sample of games known to have been played by strong humans (there was no possibility of engine use in 1922) with a sample of games known to have been played by strong engines. I did not compare these individually at all.

We have two separate groups, populations in statistics parlance, namely games played by humans and games played by engines. I wanted to see if it was possible to distinguish between the two groups of games simply on the basis of whether an engine would agree with those moves or not, the match up rates often quoted on this forum. Obviously there would be a high likelihood of agreement if I analysed HIARCS' games using HIARCS, but engines supposedly produce high match up rates with engines no matter whether the engine used for analysis is the same as the engine that produced the game or not. Just in case anyone is wondering, I did try this and HIARCS can pick out those games in which HIARCS played with great accuracy, hence my use of an engine that did not play in the engine tournament.

Having had Glaurung pronounce on whether it agreed with the moves in the two groups I calculated (actually I asked some software to calculate) the percentage of non-book moves that matched Glaurung's first choice in each game. Those percentages are the numerical data which were used to perform the test. I used a two sample t-test which simply provides an assessment of how likely it is that the means from each group are equal. Equality of means would suggest that we have no reason to say that the two samples were drawn from separate, distinct populations. In other words, there would be no reason to say that the engine games and the human games were produced by different processes or entities. We know that is not the case; an engine does not decide on a move in the same way that a human does, but the applied test does not provide any basis for saying that games played by humans are in any way different to those played by an engine.
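The per-game comparison described above can be sketched as a two-sample t statistic. The version below uses Welch's unequal-variance form, which is an assumption since the post does not say which variant was run, and the per-game percentages are invented purely for illustration:

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    se_a, se_b = var_a / n_a, var_b / n_b  # squared standard errors
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# HYPOTHETICAL per-game Top 1 match percentages, not data from this thread.
human_games = [52.0, 61.5, 48.3, 57.9, 55.0, 63.2, 50.8, 58.4]
engine_games = [60.1, 55.7, 66.4, 58.9, 62.3, 53.5, 64.8, 59.0]

t, df = welch_t(human_games, engine_games)
print(f"t = {t:.2f}, df = {df:.1f}")
```

The p-value would then be read from the t distribution with `df` degrees of freedom; that step is left out here to stay within the standard library.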

That result surprised me, which is the reason I went on to do some further work. If the scores given by the analysis engine to its first choice and the game move differ sufficiently then the GUI used awards a question mark or two to the offending move. I just carried out the same process as above but calculated the percentage of non-book moves receiving a question mark for each game. The same test then says that it is unlikely that the mean question mark percentage is the same for both samples, i.e. we reject the hypothesis that the means are equal. That implies that the samples were drawn from separate groups or populations. That is reassuring because we know already that is the case. This version of the test is still comparing match up rates but only considers those moves that the engine considers to be sufficiently bad compared to its own first choice to be awarded a question mark.
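The question mark percentage for a single game can be sketched along the same lines. The threshold of 0.5 pawns and the evaluation pairs below are hypothetical; the actual cut-off used by the GUI is not stated in the post:

```python
def question_mark_rate(eval_pairs, threshold=0.5):
    """Percentage of non-book moves that would receive a question mark.

    eval_pairs: list of (best_eval, played_eval) in pawns, both from the
    mover's point of view. threshold is a HYPOTHETICAL cut-off.
    """
    if not eval_pairs:
        raise ValueError("need at least one non-book move")
    flagged = sum(1 for best, played in eval_pairs if best - played >= threshold)
    return 100.0 * flagged / len(eval_pairs)


# Invented evaluations for four non-book moves of one game.
game = [(0.30, 0.30), (0.10, -0.60), (1.20, 1.10), (0.00, -0.45)]
print(question_mark_rate(game))
```

One such percentage per game gives the sample values that the same two-sample test would then compare.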

I did not obtain contradictory results. The first result simply says that I have failed to find a significant difference between the two samples. It does not say that such a difference does not exist or that humans are the same as engines. It is possible that I was just looking in the wrong place (the second result suggests that is the case) or that there are other factors that need to be considered. The fact that the second result does find a significant difference in no way contradicts the first. Rather, it is as if I went looking for a black cat in a coal cellar and failed to find it without turning the light on. Once the light is turned on the cat is easy to find but that does not mean it was not there when the light was off. Unless it is one of these fancy superposition of states cat in box thingies that Schrodinger was fond of.

Palynka

Upward Spiral

Halfway

Joined: 02 Aug 04
Moves: 8702

01 Dec 08

Surely the point of engine tournaments is to show that they play at significantly different levels from each other?

In my view, the fatal flaw in your analysis is that you consider low match-up rates (with engine C in a match of engine A vs engine B - call it Fact 1) as evidence against what we can interpret from a high match-up rate. Note that this is a subtly different point than the one Yuga was making.

In a world where Fact 1 occurs, high match-up rates of a human with one type of engine can still be very incriminating, just that we would expect low match-up rates with other types of engines.

Kepler

01 Dec 08

Originally posted by Palynka
Surely the point of engine tournaments is to show that they play at significantly different levels from each other?

In my view, the fatal flaw in your analysis is that you consider low match-up rates (with engine C in a match of engine A vs engine B - call it Fact 1) as evidence against what we can interpret from a high match-up rate. Note that this is a ...[text shortened]... be very incriminating, just that we would expect low match-up rates with other types of engines.

I would suggest the point of human v human tournaments is to show that they play at significantly different levels from each other. By significantly different I assume we both mean there will be a clear winner.

Consider what high match up rates with a particular engine in a particular game tell us. That might be interpreted as evidence that that engine is being used. However, that only helps if we know what engine was playing the original game. Was it just chance or is there a high match up because we are both using the same engine? Of course, if I happen to use Patzer 3.2 to analyse and get a high match up in many games I will be confident that the person whose games I am analysing also used some version of Patzer. If I get a lower match up rate all I can say is that he or she was unlikely to be using Patzer. I know nothing about what they were using to decide on those moves; it could be another engine or even their own brain. That is why I did not use HIARCS as my analysis engine although I have it available. I am confident I could have identified the games HIARCS played in that tournament but that is only telling me what I already knew.

What I wanted was to demonstrate that it is possible to distinguish between human and engine moves in general on the basis of match up rates. That is the assumption on which a lot of cheat accusations and suspicion here and elsewhere are based. Unfortunately I don't see any evidence of a difference. That does not mean there is not a difference, just that I am not seeing it in the samples I used. That may be because there is no difference or because there is but the statistical analysis fails to find it. More work is required.