Open letter to Russ re/engine use

Squelchbelch

Only Chess

28 Nov 08

Kepler

Demon Duck

of Doom!

Joined: 20 Aug 06
Moves: 20099

30 Nov 08

Originally posted by no1marauder
No one said or assumed all players on RHP are "rubbish", did they? But none are quite as good as Rubinstein or Reti, are they? Statistical studies have been done that give a reasonable outer limit for what match ups can be achieved by good human players on RHP without cheating. More tests are unnecessary.

Staistical studies may well have been done on how well humans can match an engine but that wasn't really what I was doing. I was asked elsewhere if there was any way to distinguish between human and engine and replied " yes, high engine match up rates are indicative of engine use". Then I was asked if there was any evidence of a difference in the match up rates between humans and engines. Well, I thought there must be and this would be the place to ask. I asked if anyone had ever done any research into how well engines match engines or comparative work on engines and humans in a couple of the now vanished threads. The only answers I got were of the form "If a player matches x% of top n choices then he is an engine user". Unfortunately that does not answer the question I was asking so I decided to do the work for myself.

The unexpected part of the result was not the 60% human match up rate, that seems reasonable based on earlier work I have done. The real surprise was the low a match up rate the engines got. I posted this result in very controversial manner in order to provoke responses and rebuttals. In effect, I willfully poked a hornet's nest to gain some information. It worked, I now have some possible reasons for the result I obtained and will investigate further.

I say again, I am not trying to prove the cheat detection and removal system does not work. The system plainly does work. Engine users are found and banned. It may not be as efficient as it might be and it may have some faults but it is better than the alternative. I am only interested in whether there is a significant difference between man and machine and how that difference manifests itself. If anything, I am trying to find out why we might fail to detect that difference.

DeepThought

Losing the Thread

Quarantined World

Joined: 27 Oct 04
Moves: 87415

30 Nov 08

2 edits

Originally posted by no1marauder
No one said or assumed all players on RHP are "rubbish", did they? But none are quite as good as Rubinstein or Reti, are they? Statistical studies have been done that give a reasonable outer limit for what match ups can be achieved by good human players on RHP without cheating. More tests are unnecessary.

Thanks for the nitpicking. You and ...[text shortened]... ho has followed this issue (though you actually only started following it), know what I meant.

What is required is confidence in the system. Kepler, and to a lesser extent I, are not confident that a straightforward comparison of match-up rates is sufficient to decide that someone is cheating (except in the more blatant cases). Gatecrasher made quite a long post earlier in the thread that put my mind at rest on that point, as he stated that match-up rates were never their sole criterion. Unfortunately the new games mods are anonymous and we have no way of knowing what criteria they use.

If you and the others complaining about people not being banned quickly enough were satisfied that engine users were banned within a realistic time-frame then you wouldn´t be complaining about it. So neither side in this debate has confidence in the system.

Repeating these tests in a manner that satisfies Kepler would go a long way to making people confident that the people banned for engine use were correctly banned and that future bannings were also correct. This would allow Russ to speed up the system.

Kepler is a trained statistician, you are not. When he abuses the jargon I know that he knows what he means, when you do I can´t tell whether you are just speaking loosely or don´t know what you are on about. This is an important issue, just repeated stating: ¨I´m right, this has already been done¨ sandwiched between some insults really doesn´t cut it. There have been some spectacular miscarriages of justice where people have been convicted of murder on the basis of flawed statistical evidence - eg. Sally Clark and in a similar case Angela Cannings.

And as an afterthought - I dropped out of the site for 6 months for work reasons (and too many games for too long, eventually you just can´t face them any more). This is no basis for dismissing my point. You persistently use these methods when arguing. Really if you are so confident that your statistical method works you should welcome more work as it would prove you right. So why are you resistant to the idea?

Edit: Written before I read Kepler´s last post.

Kepler

Demon Duck

of Doom!

Joined: 20 Aug 06
Moves: 20099

30 Nov 08

Originally posted by DeepThought
Kepler is a trained statistician, you are not.

You might want to be careful with that sort of thing. You have no evidence other than my say so that I am a trained statistician and even less evidence that no1marauder has had no statistical training. I am not saying I am not a statistician or that no1 is a statistician, just that you have no idea how competent either of us is in this field.

smaia

Joined: 09 Aug 06
Moves: 5363

30 Nov 08

Originally posted by no1marauder
No one said or assumed all players on RHP are "rubbish", did they? But none are quite as good as Rubinstein or Reti, are they? Statistical studies have been done that give a reasonable outer limit for what match ups can be achieved by good human players on RHP without cheating. More tests are unnecessary.

Thanks for the nitpicking. You and ...[text shortened]... ho has followed this issue (though you actually only started following it), know what I meant.

How do you know there are no RHP players as good as Rubinstein and Reti? What if there are grandmasters playing in RHP?

smaia

Joined: 09 Aug 06
Moves: 5363

30 Nov 08

You said:
"If game modding was as simple as applying a benchmark to match-up rates, there would be no need for game moderators. The admins could simply program an automaton to sift through all our games, and then ban players who are above some arbitrary cut-off."

Why not creating a system in RHP where every game played is automatically checked for match-ups against common engines and the cases where the rate is too high are flaged? This way caseloads for game mods would be automatically generated.

!~TONY~!

1...c5!

Your Kingside

Joined: 28 Sep 01
Moves: 40665

30 Nov 08

How long does it take you to analyze a game automatically with Fritz?

How many players does RHP have?

How many games does each of those players have?

How many engines do you want to use?

I rest my case. 😀

Carterson

Joined: 22 Nov 08
Moves: 981

30 Nov 08

Originally posted by !~TONY~!
How long does it take you to analyze a game automatically with Fritz?

How many players does RHP have?

How many games does each of those players have?

How many engines do you want to use?

I rest my case. 😀

rhp players = 569,790

or there abouts

that is alot of people, i don't want to even think about how many games that is, which i guess cannot be computed because some are subs and some are not

gambit05

Mad Murdock

I forgot

Joined: 05 May 05
Moves: 20526

30 Nov 08

What about only games including a, let's say 1900+ player? This dramatically decreases the number of games investigated, but still catches most of the engine users (relativly early).

AlboMalapropFoozer

Joined: 02 Feb 07
Moves: 394

30 Nov 08

There is a concept in statistics that states that if one wishes to maximize the percentage of "guilty" people sent to jail, then the error known as "Type I", where innocent people are wrongly convicted, increases. Similarly, if one wishes to maximize the percentage of "innocent" people set free, then the error known as "Type II" increases. It is theoretically impossible to reduce both types of errors to zero. For a good explanation of these ideas in fairly non-technical words, I suggest reading of the following web page.

http://www.intuitor.com/statistics/T1T2Errors.html

Kepler

Demon Duck

of Doom!

Joined: 20 Aug 06
Moves: 20099

30 Nov 08

1 edit

Originally posted by AlboMalapropFoozer
There is a concept in statistics that states that if one wishes to maximize the percentage of "guilty" people sent to jail, then the error known as "Type I", where innocent people are wrongly convicted, increases. Similarly, if one wishes to maximize the percentage of "innocent" people set free, then the error known as "Type II" increases. [b]It i ...[text shortened]... reading of the following web page.

http://www.intuitor.com/statistics/T1T2Errors.html

[/b]Very good. I will be using that on one of the courses I teach.

DeepThought

Losing the Thread

Quarantined World

Joined: 27 Oct 04
Moves: 87415

30 Nov 08

Originally posted by Kepler
You might want to be careful with that sort of thing. You have no evidence other than my say so that I am a trained statistician and even less evidence that no1marauder has had no statistical training. I am not saying I am not a statistician or that no1 is a statistician, just that you have no idea how competent either of us is in this field.

True, but you had no particular reason to lie and seem to know what you are talking about. No1 stated in a thread from ages ago that he was a lawyer (unless my memory is playing tricks), and law degrees aren´t renowned for their statistics content.

DeepThought

Losing the Thread

Quarantined World

Joined: 27 Oct 04
Moves: 87415

30 Nov 08

Originally posted by !~TONY~!
How long does it take you to analyze a game automatically with Fritz?

How many players does RHP have?

How many games does each of those players have?

How many engines do you want to use?

I rest my case. 😀

No, he´s right, Russ should by RoadRunner and analyse all our games with that 🙄

Yuga

Renaissance

OnceInALifetime

Joined: 24 Sep 05
Moves: 30579

30 Nov 08

4 edits

I am not exactly sure how Kepler conducted his methodology and obtained his results but logic is sufficient to deduce the basic possibilities and implications of human-computer match up analyses.

I strongly expect a statistically significant difference between the subjects of the human and engine sample in match up rates of these subjects with another entity (let’s call it Z) when the humans and engine subjects are of sufficiently disparate strength AND the entity being tested (Z) is sufficiently dissimilar in strength and style to either the human or engine subjects and GIVEN that Z is not of strength between that of subjects whose match up rates with Z are being compared. [All edits making sure above sentence is correct.]

So when human and engine subjects are of similar strength, I would expect a similar match up of human and engine subjects to another entity that is sufficiently dissimilar to the human and engine subjects regardless of its playing strength. Please correct me if I am wrong but I believe that this is the conclusion that can be deduced from Kepler’s analysis; there was not a statistically significant difference between human and engine subjects in match-ups with a third entity Z[I think that Glaurung 2.1 was the test standard used in Kepler’s analysis].

The issue in game moderation is whether human X’s moves match up with engine Y’s moves. The game moderators can use statistics to determine the extent of this match-up and concordantly how likely it is that a player is using an engine to make moves. There is no third entity test standard Z involved in game moderation as used in Kepler’s analyses so I do not think that Kepler’s results are relevant to the game moderation system as it functions now.

Regardless, the system employed by the game moderators as I understand it functions to eliminate entities that have obviously used engine assistance well beyond reasonable doubt. There are statistical methods that can detect patterns of engine use that are less obvious than considering match-up rates of human moves with the first few choices suggested by an engine.

Employing engine-shaped chess ideas as well as consistently failing to employ human-shaped chess ideas that are not considered by engines are indicators of engine use.

Based on the games of banned players that I have analyzed, the system has not erroneously banned a player under TOS 3b so in my opinion the system operates well. Although the banning of engine users takes longer than I prefer, the game moderators are volunteers and I suspect that they have to do most of the tedious analysis of engine users themselves. Many thanks to the game moderators if these exist.

In summary, there is a significant difference between man and machine and the difference manifests itself in match-up rates, the presence of engine-shaped mistakes, and the consistent failure to use human-shaped ideas.

smaia

Joined: 09 Aug 06
Moves: 5363

30 Nov 08

Originally posted by gambit05
What about only games including a, let's say 1900+ player? This dramatically decreases the number of games investigated, but still catches most of the engine users (relativly early).

That's a good suggestion.
Any player crossing a certain threshold (1900 or other value) is automatically checked and the data stored in a database available only to game mods.

Gatecrasher

Whale watching

33°36'S 26°53'E

Joined: 05 Feb 04
Moves: 41150

30 Nov 08

I ran many control batches when I was a game mod. Stimulated by Kepler's post, I've dug some out that were run using the same time controls, cpu strength, and engine. I'm not going to make any commentary here either, rather let the numbers speak for themselves. I've included 4 human tournaments:

* Gausdal Classic GM-A 2008 (Modern GMs/IMs ELO 2611 - 2229 ) - All 45 games

* World Chess Championship Candidate Matches 1971 - All 55 games

* 6th Correspondence Chess World Cup Final 1968-1971 - 54 games (all games of the top 4)

* Morelia-Linares 2007 (Modern Super GMs ELO 2783-2690 ) - All 56 games

Finally, using the same time controls, cpu and engine:

* 16th World Computer Chess Championship 2008 - All 45 games

Just take into account that analysis was done on a typical (decent dual core) home computer - I don't have the luxury of a 40 core cluster as enjoyed by the winning Rybka engine...