Recent years have seen much enthusiasm for the wisdom of the crowds notion: the idea that large groups of individuals can together generate a lot of reliable knowledge, even though those individuals are, on average, neither very knowledgeable nor very reliable. Authors like James Surowiecki (The Wisdom of Crowds, 2004) and Cass Sunstein (Infotopia, 2006) have argued that the Internet, by vastly facilitating communication, gives us unprecedented opportunities to create knowledge together. The most conspicuous example of collective knowledge generation is perhaps Wikipedia, but besides that, there has been much discussion of the potential of online collective reputation or rating systems; a topic also discussed at length in the anthology The Reputation Society (2012), which I have reviewed here and here. Proponents of the wisdom of the crowds notion have put high hopes in these rating systems. Here’s a vision Sunstein gives of “a possible future” right at the start of his book:
Collaborative projects, often involving numerous strangers, are growing in both scale and quality, to the benefit of millions of people. Many of these projects are open to every human being on the globe. It is also simple to find, almost instantly, the judgments of “people like you” about almost everything: books, movies, hotels, restaurants, vacation spots, museums, television programs, music, potential romantic partners, doctors, movie stars, and countless other goods and services.
And here is a similar prediction by Craig Newmark:

By the end of this decade, power and influence will have shifted largely to those people with the best reputations and trust networks and away from people with money and nominal power. That is, peer networks will confer legitimacy on people emerging from the grassroots.
Now even though there are, to be sure, quite a few well-functioning rating systems (e.g., Airbnb, Amazon, eBay), we are nowhere near Sunstein’s or Newmark’s visions at present. Many rating systems, including those run by major companies like Google, suffer from a number of problems that severely limit their effectiveness:
- Undersupply of reputational information (also discussed here). For instance, since Google Local doesn’t give people particularly strong incentives to rate restaurants, most visitors don’t. This severely limits the rating system’s accuracy, particularly since very satisfied and very dissatisfied customers are more likely to rate a product or service, which gives you a selection bias.
- Untruthful reporting, i.e. people not reporting their true beliefs or preferences (also discussed here). For instance, restaurant and hotel owners sometimes post hostile fake reviews and ratings of competing businesses in order to lower their average ratings. As another example, dating sites such as OKCupid and Badoo offer you the opportunity to have your pictures rated (by other members) in exchange for your rating a number of other members’ pictures. Since rating goes quicker if it’s done randomly than if it’s done honestly, this system gives a strong incentive to untruthful reporting. Thus it produces a lot of random, noisy ratings (as opposed to the systematically biased ratings given by competing business owners).
- More accurate raters’ reports are not weighted more heavily than others’. If some raters are more reliable than others, it would make sense to give their ratings greater weight. The collective judgement would thus not be a straight average, as it is today, but a weighted average.*
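Concretely, the difference between a straight and a weighted average can be sketched like this (the reliability weights below are of course hypothetical – how to estimate them is precisely the problem discussed in the rest of this post):

```python
def weighted_average(ratings, weights):
    """The collective judgement as a weighted average of individual ratings."""
    return sum(r * w for r, w in zip(ratings, weights)) / sum(weights)

ratings = [6, 7, 8, 9, 10]

# A straight average treats every rater as equally reliable...
print(weighted_average(ratings, [1, 1, 1, 1, 1]))  # 8.0

# ...whereas a weighted average lets (hypothetically) more reliable
# raters count for more.
print(weighted_average(ratings, [1, 1, 1, 2, 3]))  # 8.625
```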
OKCupid’s and Badoo’s policy of requiring members to rate other members’ pictures in order to get their own pictures rated is a way of solving the undersupply problem. Unfortunately, it exacerbates the second problem, since it makes people more disposed to answer quickly and randomly. As a result, the ratings tend to be very inaccurate. What we need to do is to reward serious and honest rating – not just any rating.**
In order to do that, we need to be able to identify the serious and honest raters. Here the subjective nature of the ratings presents us with a problem, however: how can we know whether a man giving a 5 to a woman whom most other men rate a 10 is rating insincerely or just has a non-standard preference?
There are two ways of identifying reliable raters: the multiple-question method and the single-question method. The first method adds one or more control questions, to which we know the answer, to the original question, in order to measure the reliability of the raters (see, e.g., this interesting paper for a smart control question). This method has a number of problems, however: it is cumbersome; there is no guarantee that people who answer the control question reliably answer the original question sincerely; and the raters only have an incentive to rate reliably on the control question, not on the original question.
The single-question method replaces the original question with a proxy question. The goal of this method is to produce the same result as the original question would have yielded under ideal rating, by which I mean what the raters would have said if they had all rated sincerely. For instance, if woman A actually thinks that a man is a 6, B that he is a 7, C an 8, D a 9, and E a 10, then the collective answer (i.e. the weighted average) to the proxy question should be 8.
As it happens, there is a single-question method with a number of attractive features: the Keynesian beauty contest. It is named after the famous economist John Maynard Keynes, who, upon reading about a beauty contest in which you won a prize if you voted for the winner, noted that in such a contest you should not vote for the woman you find most attractive, but rather for the woman you think everybody else finds most attractive. And even that is not quite right, since everybody else is also trying to predict who everybody else thinks is most attractive, in which case you should instead vote for the woman you think that everybody else thinks that everybody else thinks…etc…is most attractive. Unlike in an ordinary beauty contest, a man who gives a 5 to a woman whom everybody else gives a 10 is, in a Keynesian beauty contest, definitely wrong: he has failed to predict how the others would vote.
The question is thus not “How attractive do you think this woman is on a scale from 0 to 10?” but rather “What do you think the weighted average of all users’ ratings of this woman will be, on a scale of attractiveness from 0 to 10?”. Now, will the average answer to this question track the average rating under ideal rating? To answer that, let us first note that a Keynesian beauty contest is a sort of mixed co-ordination and conflict game, in which you win if you consistently manage to co-ordinate with most other users better than most other users do (this might seem contradictory but is not – the point is that if you are always part of the co-ordinating majority, whereas other users are sometimes part of the majority and sometimes not, you win). Such games are often called Schelling games, after another famous economist, Thomas Schelling. Now, as in any Schelling game, it does not matter how you co-ordinate with the other users. The mention of the “scale of attractiveness” is merely a prompt to make a certain focal point more salient. The hope is, of course, that this prompt will make the average score the woman in question would have got under ideal rating the focal point.
This could, however, fail to happen. Keynesian beauty contests can be gamed, just like most other rating systems. For instance, a large sub-group can agree to give every woman with a certain characteristic that is more easily observable than attractiveness (say, blondeness) a certain score, say 7. Alternatively, a tacit convention to the effect that all blonde women are given a 7 can slowly emerge without explicit agreement. In either case, the Keynesian beauty contest will fail to track the ideal rating result. I will discuss how to deal with these problems in the next post on this topic.
If the Keynesian beauty contest does track the ideal rating version of the original question, however, it allows us to identify the reliable raters, which means that we can reward them and give their ratings greater weight. An additional advantage is that if the raters are sufficiently skilled, you converge on the mean value much faster than you do in an ordinary beauty contest (even under ideal rating). To see this, note that in an ordinary beauty contest there is, as in our example above with the women A–E, typically some degree of rating variance. This means that if you are unlucky, the average value can be quite far off at the start, before the law of large numbers starts kicking in. In a Keynesian beauty contest where the raters are skilful, the rating variance will be much lower, which means that you will approach the long-run average value much faster.
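The point about variance can be illustrated with a small simulation (a toy sketch – the noise levels are made up): honest raters with genuinely different tastes scatter widely around the long-run mean, while skilled Keynesian raters, all aiming at the same focal point, scatter much less, so the average of the first handful of ratings lands much closer to it.

```python
import random

random.seed(1)

TRUE_MEAN = 8.0  # the score under ideal rating

def running_average_error(noise_sd, n_raters, trials=2000):
    """Average absolute error of the mean of the first n_raters ratings."""
    total = 0.0
    for _ in range(trials):
        ratings = [random.gauss(TRUE_MEAN, noise_sd) for _ in range(n_raters)]
        total += abs(sum(ratings) / n_raters - TRUE_MEAN)
    return total / trials

# Ordinary contest: honest raters have different tastes -> high variance.
honest_error = running_average_error(noise_sd=1.5, n_raters=5)

# Keynesian contest with skilled raters: everyone aims at the focal
# point -> low variance, so the early average is already close.
keynesian_error = running_average_error(noise_sd=0.3, n_raters=5)

print(honest_error, keynesian_error)  # the Keynesian error is far smaller
```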
In the Keynesian beauty contest, everybody votes without knowing how the others have voted. A close cousin of the Keynesian beauty contest is the peer prediction method; the difference is that where the former is for all intents and purposes synchronic, the latter is diachronic: your ratings are evaluated by their accordance with future ratings. Hence they essentially become predictions of future ratings. Another way of putting this is that future ratings function as meta-ratings of past ratings. Slashdot’s meta-moderation system is an example of such a meta-rating system.
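The diachronic idea can be sketched as follows. (This is a simplification: peer prediction mechanisms in the academic literature use proper scoring rules over probability reports, but the spirit – past ratings scored as predictions of future ones – is the same.)

```python
def peer_prediction_score(my_rating, future_ratings):
    """Score a past rating by how well it predicted later ratings.

    Simplified quadratic loss: 0 is a perfect prediction,
    more negative is worse.
    """
    consensus = sum(future_ratings) / len(future_ratings)
    return -((my_rating - consensus) ** 2)

# A rating close to the later consensus earns a better (higher) score...
print(peer_prediction_score(7, [7, 8, 9]))  # -1.0
# ...than one far away from it.
print(peer_prediction_score(3, [7, 8, 9]))  # -25.0
```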
Though these methods are used to some extent, overall they have not received the attention they deserve from website constructors. Also, while there is some academic research on these matters, most of it is very mathematical and abstract, and concerns how perfectly rational agents would act in such rating systems. While such work is important, we also need more applied research that can provide a link between abstract rational choice theory and the nitty-gritty of making a rating system work online.
To be sure, you cannot use Keynesian beauty contests or peer prediction methods in all situations where you want to elicit ratings. For instance, in situations where people have strong private information signals (i.e. where they have information that they know others don’t have), it is not obvious that we should apply these methods, since they don’t ask how you would rate the person or item in question, but rather how you think the average person would rate him/her/it. They thus encourage you to disregard private information. Such information loss may lead to these methods failing to track the ideal rating average.
I will discuss further aspects of these methods in future posts. Let me finish by giving some examples of possible applications (more will follow in later posts). In the case of the aforementioned Badoo and OKCupid picture ratings, I think that the ratings could become much more accurate if the raters were asked Keynesian beauty contest questions; i.e. how attractive they think the average man (woman) thinks the woman (man) in question is. The whole thing could be turned into a game in which you are allocated points on the basis of how close you come to the weighted average. You could also give people incentives to rate reliably by requiring unreliable raters to rate more pictures to get their own pictures rated.
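Such a game could be scored along the following lines (the point scale and the reliability threshold below are made up purely for illustration):

```python
def contest_points(guess, weighted_average, max_points=10):
    """More points the closer the guess is to the collective answer."""
    return max(0.0, max_points - abs(guess - weighted_average))

def pictures_to_rate(avg_points, base=5, threshold=7):
    """Unreliable raters must rate extra pictures to get their own rated."""
    return base if avg_points >= threshold else base * 2

# A rater whose guesses track the weighted average well...
print(contest_points(8, 8.5))   # 9.5
print(pictures_to_rate(9.5))    # 5

# ...versus one whose guesses do not.
print(contest_points(3, 8.5))   # 4.5
print(pictures_to_rate(4.5))    # 10
```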
Studies show that a good picture is a major advantage on dating sites, so there should be a clear demand for objective ratings of your pictures. Presumably, though, people would find objective peer ratings useful for all sorts of other things as well: comedians could test jokes, applicants their application letters, and so on.
As another example, collective websites could be organized by means of Keynesian beauty contests. To take one example, OKCupid has another section where members can upload self-tests (e.g. the “what is your real age” test). Many of these tests are quite stupid, which is to be expected, since they aren’t heavily moderated. To ensure higher quality, you could launch a Keynesian beauty contest, in which people would have to rate other tests reliably in order to be able to upload their own tests, and in which only the best tests would end up on OKCupid.
* In the classic paper Thirteen Theorems in Search of the Truth, there is a formula for calculating the optimal weights of different voters’ votes from estimates of their reliability, under certain specified circumstances (see p. 274).
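For binary judgements by independent voters, the result referred to makes each voter’s optimal weight proportional to the log odds of his or her individual reliability. A sketch (my paraphrase, not the paper’s notation):

```python
import math

def optimal_weight(p):
    """Optimal voting weight for a voter of reliability p (0 < p < 1):
    proportional to the log odds log(p / (1 - p)) of being right."""
    return math.log(p / (1 - p))

for p in (0.5, 0.6, 0.75, 0.9):
    print(p, round(optimal_weight(p), 3))
```

Note that a coin-flipping voter (p = 0.5) gets weight 0, and a voter who is reliably wrong (p < 0.5) gets negative weight – his vote is informative once inverted.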
** Obviously, untruthful reporting of judgements of attractiveness on OKCupid and Badoo is not the worst social problem (though there could potentially be good money in a rating system that remedied it). It is, however, a salient example of a very common problem. There are many structurally similar situations in which people fail to give serious and honest ratings (see section V), but few are as simple and easy to grasp as this one. Hence why I use it. More examples of this problem will be given in later posts.