Tuesday, March 16, 2010

Meta analysis of metacritic

Metacritic.com is a website that aggregates reviews of all kinds of media, in hopes of giving its users a more balanced perspective on the consensus and diversity of critical opinion. Critics have long tried to quantify the inherently subjective experience of art, with systems of stars, grades and the occasional thumb. Metacritic takes this to the next level, asking its employees to read all the reviews they can find and to quantify each reviewer's opinion on a 0-100 scale. Sometimes this involves some simple multiplication, sometimes something inherently more subjective.

In the end though, they combine all these scores together to form a metascore, presumably a more accurate reflection of the critics' combined assessment. At its core this is an idea seen in many places across the internet. For instance, recent elections have seen the rise of sites like fivethirtyeight.com and pollster.com, where statisticians pool data from a variety of polls to come up with a meta-poll, in hopes of more accurately assessing the mood of the electorate. Elsewhere, sites like digg and reddit try to average across the opinions of their user bases to find content many people will like. From deciding which youtube comments are displayed, to questions asked of the President, this is an idea that now pervades our culture.

It seems at first to be a purely democratic principle, but when you dig a little deeper into each of these systems you often find it's not quite that simple. In analyzing polling data, statisticians have to assess how much to believe this poll of 300 people versus that poll of 1,200. It would seem illogical to weight them equally, right? Or how about whether to adjust for the fact that this Rasmussen poll always seems to give the Republican candidate a slight edge when compared to Gallup? In the context of digg, is it fair that certain users with a track record of finding good content and a large group of followers have more influence on what makes it onto the homepage than others? These questions are fiercely debated within these communities, with strong opinions on both sides, and rightfully so, because they often make the difference in our understanding of whether healthcare reform is favored by a majority or (the admittedly less weighty problem of) which internet meme will be the star of the day.
A quick look through Metacritic's FAQ page shows this is no less true at Metacritic:

"This overall score, or METASCORE, is a weighted average of the individual critic scores. Why a weighted average? When selecting our source publications, we noticed that some critics consistently write better (more detailed, more insightful, more articulate) reviews than others. In addition, some critics and/or publications typically have more prestige and weight in the industry than others. To reflect these factors, we have assigned weights to each publication (and, in the case of film, to individual critics as well), thus making some publications count more in the METASCORE calculations than others."

So, which critics does Metacritic weight more than others? Who are the trusted arbiters of our cultural tastes? Well, Metacritic won't tell you. But with a little math, we think we can.

To start our mathematical understanding of the problem, let's write down a simple example. Let's imagine there were only 5 film critics (A, B, C, D, and E) in the world, and metacritic decided to give them some weights (say of value a, b, c, d and e respectively) but we don't know what those weights are. What we do know is that for a particular movie, critic A gave it a 60, B a 75, C a 25, E an 89, and critic D didn't review this movie. We know this because metacritic publishes its quantitative assessment of each reviewer's review. We also know that metacritic gave it a metascore of 60. What that would mean quantitatively is given in equation (1) below. We multiply each of the scores by the weights, add them all up, and divide by the total amount of weight used for this movie. This is the basic mathematical operation of a weighted average.
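Plugging the numbers from this example into the description above, Equation (1) would read as follows. This is a reconstruction from the text, with a, b, c and e standing for the unknown weights of critics A, B, C and E (critic D is absent because he didn't review the movie):

```latex
\[
\frac{60a + 75b + 25c + 89e}{a + b + c + e} = 60 \tag{1}
\]
```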

Now, we have an equation that describes the relationship between the critics' weights, their scores and the metascore. We can rearrange this equation, multiplying both sides by the sum of the weights (Equation (2)). Then, subtracting the right side from the left side, we get Equation (3). Now it's going to be useful to have an equation which has the same basic form for every movie, so let's imagine that we wanted to have critic D in this equation. If we pretended critic D gave the movie the metascore (60), then we could put him in here (Equation (4)) and it would be the same equation, because 60-60 is zero, and so his term would disappear.
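Carrying the same numbers through those three steps, Equations (2) to (4) would look like this (again a reconstruction from the description, not the original figures):

```latex
\begin{align}
60a + 75b + 25c + 89e &= 60\,(a + b + c + e) \tag{2} \\
(60-60)\,a + (75-60)\,b + (25-60)\,c + (89-60)\,e &= 0 \tag{3} \\
(60-60)\,a + (75-60)\,b + (25-60)\,c + (60-60)\,d + (89-60)\,e &= 0 \tag{4}
\end{align}
```

Every term now has the form (review score minus metascore) times weight, which is what lets us write the same equation for every movie.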

Now if you think about the set of weights that would satisfy an equation like this, it's clear you could multiply each of the weights by the same amount and it would still be true, because you can multiply both sides of the equation by the same amount, and the right hand side is always going to be zero. So, you need to figure out how to pick a scale. What if you decided that weight b was going to be 1 no matter what? We decided that for our analysis we would make this critic Roger Ebert, and thus all the weights would be relative to his weight, or in other words, we would be measuring critics in units of "Eberts". Anyway, if you move Ebert's term over to the right hand side, you have a linear combination of the remaining weights that is equal to a constant. If you had enough of these equations you could solve for a, c, d and e. It would be the world's worst algebra 2 problem but you could do it.
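With b pinned at 1 and Ebert's term moved to the right hand side, the running example would become the following. We number it (5) here only because it sits between Equations (4) and (6); the label is an assumption:

```latex
\begin{align}
(60-60)\,a + (25-60)\,c + (60-60)\,d + (89-60)\,e &= -(75-60) \notag \\
0\,a - 35\,c + 0\,d + 29\,e &= -15 \tag{5}
\end{align}
```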

Written out more formally, where M^i is the metascore of movie i, E^i is Ebert's score of movie i, and R^i_j is the jth critic's review score of movie i (with R^i_j replaced by M^i for all the reviews that don't exist), we would have Equation 6.
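In that notation, and generalizing the single-movie example, Equation 6 would take roughly this form. The symbol w_j for the weight of critic j is our own label for this sketch, and the sum runs over every critic except Ebert:

```latex
\[
\sum_{j \neq \mathrm{Ebert}} \left( R^i_j - M^i \right) w_j \;=\; M^i - E^i
\qquad \text{for every movie } i \tag{6}
\]
```

For the example movie the right hand side is 60 - 75 = -15, matching the constant obtained when Ebert's term was moved across.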

If we rewrite it in matrix form, with an altered metascore vector M'=M-E and an altered review matrix in which the metascore has been subtracted from each review (resulting in zeros where there was no review originally), we would have Equation 7. And so we have reduced the problem to one of linear algebra, which can be solved easily using a computer program such as Matlab.
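As a sketch of that matrix form, Equation 7 would read as below. The names R' for the altered review matrix and w for the column vector of unknown weights are labels we introduce here for illustration:

```latex
\[
R'\,w = M',
\qquad
R'_{ij} = R^i_j - M^i,
\qquad
M'_i = M^i - E^i \tag{7}
\]
```

Each row of R' is one movie and each column is one critic other than Ebert, so solving for w gives every critic's weight in Eberts.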

So we looked at a bunch of movies listed on metacritic (1026 in total) and the review scores of every reviewer metacritic reported for those movies (619 total reviewers, 19185 total reviews) as well as the metascore for each of those movies and constructed these matrices.

But before we took the inverse of R and multiplied it by the altered metascore vector to get the weights, we had to consider one additional factor metacritic hints at in their FAQ:

"In addition, for our film and music sections, all of the weighted averages are normalized before generating the METASCORE. To put it another way that should be familiar to anyone who has taken an exam in high school or college, all of our movies, games, and CDs are graded on a curve. Thus the METASCORE may be higher or lower than the true weighted average of the individual reviews, thanks to this normalization calculation. Normalization causes the scores to be spread out over a wider range, instead of being clumped together. Generally, higher scores are pushed higher, and lower scores are pushed lower."

So our basic formula might not be right... the metascore isn't the true weighted average. Well... we thought maybe we could get a sense of what this normalization function looked like if we just took the unweighted average of all the reviews and plotted it against the metascore. The result is in the figure below.

Two things are evident from the graph. First, the normalization function is most likely just a linear one that we can figure out by fitting this data to a line (shown in blue). With the slope and offset of that line, we can transform the normalized metascore into an unnormalized one for which our equations should be accurate. Second, it seems like whatever weighting they are doing isn't really making a big difference, because the metascore is pretty close to an unweighted average. The weighting seems to be making a difference of at most 5 points or so.
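A minimal sketch of that line-fitting step, in plain Python. The five data points are hypothetical (they are not Metacritic data), and treating the unweighted average as x and the metascore as y is our assumption about how the fit was set up:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + offset."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    offset = mean_y - slope * mean_x
    return slope, offset


def unnormalize(metascore, slope, offset):
    """Invert the fitted line to estimate the pre-normalization average."""
    return (metascore - offset) / slope


# Hypothetical example: a curve that pushes high scores higher and low
# scores lower around a pivot of 60.
unweighted_avg = [40.0, 50.0, 60.0, 70.0, 80.0]
metascores = [36.0, 48.0, 60.0, 72.0, 84.0]

slope, offset = fit_line(unweighted_avg, metascores)
raw = [unnormalize(m, slope, offset) for m in metascores]
print(slope, offset, raw)
```

A slope greater than 1 is what "higher scores are pushed higher, and lower scores are pushed lower" would look like if the curve really is linear.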

So with this data in hand, we can invert the matrix, or, because the matrix is somewhat degenerate (not exactly invertible), we use something called the pseudoinverse and multiply it by our adjusted-metascore-minus-Ebert vector and get the weights! And one other technical point: we threw out reviewers that only reviewed one movie, and all the movies in which those reviewers appeared in the dataset, leaving 870 movies and 444 reviewers.
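The pseudoinverse step can be sketched in Python with NumPy (the original analysis used Matlab, so this is an illustration rather than the code we ran). The function name `solve_weights`, the NaN convention for missing reviews, and the tiny three-critic dataset are all assumptions made for the example:

```python
import numpy as np


def solve_weights(scores, metascores, ref=0):
    """Recover critic weights relative to a reference critic.

    scores:     (movies x critics) array, np.nan where there is no review
    metascores: un-normalized metascore for each movie
    ref:        column of the reference critic, whose weight is fixed at 1
    """
    scores = np.asarray(scores, dtype=float)
    m = np.asarray(metascores, dtype=float)
    # A missing review is replaced by the metascore, so its term vanishes.
    filled = np.where(np.isnan(scores), m[:, None], scores)
    diff = filled - m[:, None]                 # review minus metascore
    others = [j for j in range(scores.shape[1]) if j != ref]
    A = diff[:, others]                        # altered review matrix
    b = -diff[:, ref]                          # metascore minus reference score
    w = np.ones(scores.shape[1])
    w[others] = np.linalg.pinv(A) @ b          # pseudoinverse solution
    return w


# Tiny hypothetical dataset: 8 movies, 3 critics, one missing review.
rng = np.random.default_rng(0)
true_w = np.array([1.0, 0.5, 2.0])
scores = rng.integers(0, 101, size=(8, 3)).astype(float)
scores[0, 2] = np.nan
present = ~np.isnan(scores)
meta = np.nansum(scores * true_w, axis=1) / (present * true_w).sum(axis=1)

w = solve_weights(scores, meta)
print(w)
```

Because the fake metascores here are exact weighted averages, the recovered weights should match `true_w` up to floating-point error; real data is noisier.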

But before we talk about the results of that, it seems prudent to give you some reason to believe that all this fancy math, pseudo inverse nonsense makes some kind of pseudo sense.

So to test this we decided to come up with our own weights, calculate the weighted averages to come up with our own fake metascores, then try to reconstruct our fake weights using only the fake metascore data. This way we could compare our "real fake weights" with the "predicted fake weights" and see how well this whole technique works. The results of that are shown below. We chose our weights to be random and uniformly distributed between .2 and 1.05.
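A scaled-down sketch of this kind of sanity check, in Python with NumPy. Only the uniform range for the fake weights (0.2 to 1.05) comes from the text; the number of movies and critics, the review-score model, the 50% review rate, and the rounding of the fake metascore to an integer are all assumptions made for illustration. `np.linalg.lstsq` returns the same minimum-norm least-squares solution that the pseudoinverse would give:

```python
import numpy as np

rng = np.random.default_rng(1)
n_movies, n_critics = 400, 40

# "Real fake weights": uniform on [0.2, 1.05]; critic 0 plays the role of
# the reference critic, so that weight is pinned at 1.
true_w = rng.uniform(0.2, 1.05, size=n_critics)
true_w[0] = 1.0

# Assumed review model: per-movie quality plus per-critic noise, with each
# critic reviewing roughly half of the movies.
quality = rng.uniform(20, 90, size=(n_movies, 1))
scores = np.clip(quality + rng.normal(0, 15, size=(n_movies, n_critics)), 0, 100)
reviewed = rng.random((n_movies, n_critics)) < 0.5

# Drop any movie that ended up with no reviews at all.
keep = reviewed.any(axis=1)
scores, reviewed = scores[keep], reviewed[keep]

# Fake metascore: weighted average over the reviews that exist, rounded to
# an integer (an assumption, to add some realistic noise).
wr = true_w * reviewed
meta = np.round((scores * wr).sum(axis=1) / wr.sum(axis=1))

# Review minus metascore, with zeros where there was no review.
diff = np.where(reviewed, scores - meta[:, None], 0.0)
A, b = diff[:, 1:], -diff[:, 0]
w_hat, *_ = np.linalg.lstsq(A, b, rcond=None)

error = np.abs(w_hat - true_w[1:])
corr = np.corrcoef(true_w[1:], w_hat)[0, 1]
print(f"max abs error: {error.max():.3f}, correlation: {corr:.3f}")
```

Lowering the review rate or the number of movies in this toy setup is an easy way to see how the reconstruction degrades for critics with few reviews.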

As you can see, the reconstructed weight correlates well with the fake weight, particularly if you look only at the reviewers who had more than 25 reviews in the dataset. The blue line here represents the Y=X line, or what a perfect reconstruction would be.

We can look more carefully at the relationship between the reconstruction error (X axis) and the number of reviews in the data (Y axis) in the following plot.

What you can see here is that the error for reviewers with more than 25 reviews or so is at most .15 or so.

So with those two things in mind (that we can reconstruct reliably only for reviewers with more than 25 reviews, and that the error bar on our weight estimate is likely in the .15 range) we can go ahead and reconstruct the weights for the real metascores. The resulting weights (for reviewers with more than 25 reviews) are all positive and lie in the range of roughly .1 to 1.1.

So this is already promising for a few reasons. First, we are getting mostly positive weights as we imagined we would, even though we didn't tell the computer that they had to be positive. Second, it seems that Roger Ebert, coming in squarely at 1 Ebert, is weighted more highly than most of the other reviewers, and that makes sense. But let's look at the actual list of critics and their weights, ordered from highest to lowest (again, this is only those with more than 25 reviews in the data).

Weight Publication||Critic (number of reviews)

1.114969 The_New_Yorker||Anthony_Lane (67)
1.016225 Washington_Post||Stephen_Hunter (27)
1.015834 Village_Voice||J._Hoberman (97)
1.015043 Washington_Post||Desson_Thomson (39)
1.000000 Chicago_Sun-Times||Roger_Ebert (404)
0.958670 The_New_Yorker||David_Denby (90)
0.924278 The_New_York_Times||Jeannette_Catsoulis (130)
0.909943 New_York_Magazine||David_Edelstein (213)
0.904165 The_Onion_(A.V._Club)||Nathan_Rabin (113)
0.901257 The_Onion_(A.V._Club)||Noel_Murray (128)
0.883370 Los_Angeles_Times||Kenneth_Turan (150)
0.868394 The_Onion_(A.V._Club)||Scott_Tobias (190)
0.866470 Austin_Chronicle||Marjorie_Baumgarten (143)
0.859080 The_Onion_(A.V._Club)||Keith_Phipps (82)
0.858842 Salon.com||Andrew_O'Hehir (130)
0.858452 Newsweek||David_Ansen (38)
0.843536 The_New_York_Times||Nathan_Lee (46)
0.836894 The_New_York_Times||Manohla_Dargis (189)
0.822908 The_New_York_Times||A.O._Scott (215)
0.819977 Variety||Todd_McCarthy (139)
0.816666 Wall_Street_Journal||Joe_Morgenstern (223)
0.811155 Salon.com||Stephanie_Zacharek (186)
0.805415 The_New_York_Times||Stephen_Holden (179)
0.804410 Variety||Joe_Leydon (48)
0.771539 Austin_Chronicle||Kimberley_Jones (93)
0.753435 Rolling_Stone||Peter_Travers (236)
0.749982 Austin_Chronicle||Marc_Savlov (173)
0.746528 The_Globe_and_Mail_(Toronto)||Jennie_Punter (26)
0.738759 The_Globe_and_Mail_(Toronto)||Stephen_Cole (55)
0.718531 Premiere||Glenn_Kenny (30)
0.701622 The_Onion_(A.V._Club)||Sam_Adams (26)
0.699040 Time_Out_New_York||Joshua_Rothkopf (54)
0.698929 Variety||Derek_Elley (37)
0.697741 New_York_Post||V.A._Musetto (167)
0.692951 Time_Out_New_York||David_Fear (35)
0.689861 Los_Angeles_Times||Betsy_Sharkey (99)
0.687799 TV_Guide||Maitland_McDonagh (125)
0.684707 Charlotte_Observer||Lawrence_Toppman (112)
0.668733 Chicago_Tribune||Michael_Phillips (344)
0.666256 Baltimore_Sun||Chris_Kaltenbach (36)
0.663155 Los_Angeles_Times||Michael_Ordona (35)
0.661545 Los_Angeles_Times||Robert_Abele (69)
0.661338 Time||Richard_Corliss (102)
0.660194 Washington_Post||Ann_Hornaday (146)
0.647286 The_Globe_and_Mail_(Toronto)||Rick_Groen (130)
0.645554 Los_Angeles_Times||Gary_Goldstein (60)
0.638082 Village_Voice||Jim_Ridley (28)
0.633965 Variety||Dennis_Harvey (80)
0.625697 Variety||Ronnie_Scheib (79)
0.619469 Village_Voice||Melissa_Anderson (44)
0.618218 Chicago_Reader||Andrea_Gronvall (123)
0.612953 Variety||Justin_Chang (63)
0.606808 Boston_Globe||Wesley_Morris (244)
0.605419 Village_Voice||Nick_Pinkerton (69)
0.601566 Baltimore_Sun||Michael_Sragow (206)
0.600312 San_Francisco_Chronicle||Mick_LaSalle (214)
0.598500 New_York_Daily_News||Elizabeth_Weitzman (295)
0.597790 TV_Guide||Ken_Fox (119)
0.595947 Time_Out_New_York||Keith_Uhlich (60)
0.595609 New_York_Post||Kyle_Smith (254)
0.589138 Orlando_Sentinel||Roger_Moore (26)
0.588722 Los_Angeles_Times||Carina_Chocano (52)
0.588114 Village_Voice||Robert_Wilonsky (53)
0.587784 Entertainment_Weekly||Owen_Gleiberman (258)
0.582131 Village_Voice||Ella_Taylor (112)
0.581034 Village_Voice||Scott_Foundas (73)
0.579652 New_York_Daily_News||Joe_Neumaier (266)
0.578529 Christian_Science_Monitor||Peter_Rainer (294)
0.571597 Los_Angeles_Times||Mark_Olsen (44)
0.567801 Entertainment_Weekly||Lisa_Schwarzbaum (224)
0.553266 Premiere||Jenni_Miller (27)
0.552701 Film_Threat||Matthew_Sorrento (49)
0.540942 Village_Voice||Vadim_Rizov (39)
0.538506 Chicago_Reader||J.R._Jones (273)
0.536262 Washington_Post||Michael_O'Sullivan (55)
0.528692 The_Onion_(A.V._Club)||Tasha_Robinson (79)
0.526989 Boston_Globe||Ty_Burr (275)
0.521606 The_Globe_and_Mail_(Toronto)||Liam_Lacey (143)
0.519646 Austin_Chronicle||Josh_Rosenblatt (68)
0.510877 Miami_Herald||Rene_Rodriguez (197)
0.509730 New_York_Post||Lou_Lumenick (272)
0.500684 Chicago_Reader||Cliff_Doerksen (46)
0.490746 NPR||Bob_Mondello (79)
0.479352 Time||Richard_Schickel (30)
0.463631 Los_Angeles_Times||Glenn_Whipp (31)
0.445547 Village_Voice||Aaron_Hillis (58)
0.414933 The_Hollywood_Reporter||Michael_Rechtshaffen (86)
0.407707 Variety||John_Anderson (74)
0.407217 Seattle_Post-Intelligencer||Bill_White (41)
0.405736 Entertainment_Weekly||Adam_Markovitz (31)
0.390501 Philadelphia_Inquirer||Steven_Rea (219)
0.382638 The_Hollywood_Reporter||Ray_Bennett (44)
0.381773 Seattle_Post-Intelligencer||Sean_Axmaker (85)
0.375513 Film_Threat||Pete_Vonder_Haar (46)
0.373956 Variety||Peter_Debruge (37)
0.354449 Washington_Post||John_Anderson (63)
0.349243 Portland_Oregonian||Shawn_Levy (146)
0.346159 Los_Angeles_Times||Kevin_Thomas (68)
0.343609 St._Louis_Post-Dispatch||Joe_Williams (88)
0.338633 New_Orleans_Times-Picayune||Mike_Scott (66)
0.337026 The_Hollywood_Reporter||Stephen_Farber (51)
0.325974 The_Hollywood_Reporter||Kirk_Honeycutt (189)
0.305710 Philadelphia_Inquirer||Carrie_Rickey (169)
0.287646 The_Hollywood_Reporter||Frank_Scheck (83)
0.275926 San_Francisco_Chronicle||Walter_Addiego (65)
0.264487 ReelViews||James_Berardinelli (325)
0.261244 USA_Today||Claudia_Puig (350)
0.256509 San_Francisco_Chronicle||Ruthe_Stein (43)
0.246452 Seattle_Post-Intelligencer||William_Arnold (72)
0.245677 Variety||Rob_Nelson (29)
0.236771 Portland_Oregonian||Marc_Mohan (77)
0.222712 San_Francisco_Chronicle||Amy_Biancolli (51)
0.220507 Washington_Post||Dan_Kois (29)
0.203427 Portland_Oregonian||M._E._Russell (86)
0.189461 Miami_Herald||Connie_Ogle (84)
0.187527 San_Francisco_Chronicle||Peter_Hartlaub (73)
0.184766 Variety||Leslie_Felperin (34)
0.163462 Slate||Dana_Stevens (150)
0.105279 TV_Guide||Perry_Seibert (32)

So this makes some sense. Critics at The New Yorker, The New York Times, and the AV Club are all high up there. Well known critics like Ebert and David Edelstein are far up there. Keep in mind there is an error bar on the weights of +/- .15 or so. The bottom of the list is of course interesting as well. Notably, Metacritic has Dana Stevens of Slate coming in at less than 1/6th of an Ebert. That being said, there isn't that large a spread in these weights, which is why it makes sense that the unweighted average is just about equal to the weighted one. One Ebert isn't going to overwhelm 12 other reviewers weighing in at .25 Eberts, especially since presumably Ebert doesn't disagree with the other 12 that much. On average there were 18 reviews per movie. A factor of 4 or 5 difference in weight at the extremes just isn't going to move the numbers around that much. If you really wanted the weights to make a difference, you'd need fewer reviewers per movie or much larger weights. Perhaps the weighting system is a relic from when metacritic was in its "garage" stage as a business and couldn't afford to take the time to read the array of reviews they can now. It's interesting that even though the weighting doesn't make much of a difference, our simulations of fake data suggest that we can still pull out the weights with some degree of reliability.

We thought we would look at the way different reviewers' scores correlated with the average score. The correlation coefficient is a metric that measures how reliably one variable tracks another. A correlation coefficient of 1 means that when one variable goes up, the other always goes up; a correlation coefficient of 0 means there is no apparent relationship between the two variables; a coefficient of -1 would mean that whenever one goes up the other goes down, and vice-versa. So when we measure the correlation coefficient between the values of Roger Ebert's reviews and the metascores for those same movies, we are measuring how well Roger Ebert's opinion tracks the opinion of the rest of the critics. In order to do this accurately we need a lot more than 25 data points, so we restricted this analysis to reviewers with more than 75 reviews in the dataset.
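For reference, here is a small plain-Python sketch of the (Pearson) correlation coefficient. We are assuming the standard Pearson definition is the one meant here, and the five score pairs are hypothetical, not taken from the dataset:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical critic whose scores track the metascore closely.
critic_scores = [80, 60, 40, 90, 50]
metascores = [75, 62, 45, 85, 55]

c = pearson(critic_scores, metascores)
print(round(c, 3))
```

Feeding in one critic's review scores and the metascores of the same movies gives one row of the table that follows.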

Correlation Weight Reviewer
0.853635 0.753435 Rolling_Stone||Peter_Travers
0.827816 0.526989 Boston_Globe||Ty_Burr
0.821585 0.816666 Wall_Street_Journal||Joe_Morgenstern
0.817070 0.822908 The_New_York_Times||A.O._Scott
0.814250 0.859080 The_Onion_(A.V._Club)||Keith_Phipps
0.810241 0.490746 NPR||Bob_Mondello
0.809907 0.805415 The_New_York_Times||Stephen_Holden
0.809497 0.749982 Austin_Chronicle||Marc_Savlov
0.809349 0.521606 The_Globe_and_Mail_(Toronto)||Liam_Lacey
0.807484 0.866470 Austin_Chronicle||Marjorie_Baumgarten
0.805233 0.924278 The_New_York_Times||Jeannette_Catsoulis
0.802280 0.771539 Austin_Chronicle||Kimberley_Jones
0.798073 0.381773 Seattle_Post-Intelligencer||Sean_Axmaker
0.792632 0.633965 Variety||Dennis_Harvey
0.788911 0.668733 Chicago_Tribune||Michael_Phillips
0.787197 0.567801 Entertainment_Weekly||Lisa_Schwarzbaum
0.783964 0.509730 New_York_Post||Lou_Lumenick
0.781994 0.904165 The_Onion_(A.V._Club)||Nathan_Rabin
0.777615 0.414933 The_Hollywood_Reporter||Michael_Rechtshaffen
0.776096 0.261244 USA_Today||Claudia_Puig
0.775757 0.836894 The_New_York_Times||Manohla_Dargis
0.774464 0.868394 The_Onion_(A.V._Club)||Scott_Tobias
0.770838 0.287646 The_Hollywood_Reporter||Frank_Scheck
0.769263 0.598500 New_York_Daily_News||Elizabeth_Weitzman
0.764392 0.606808 Boston_Globe||Wesley_Morris
0.760384 0.819977 Variety||Todd_McCarthy
0.758661 0.883370 Los_Angeles_Times||Kenneth_Turan
0.753568 0.661338 Time||Richard_Corliss
0.743028 0.660194 Washington_Post||Ann_Hornaday
0.741849 0.163462 Slate||Dana_Stevens
0.739853 0.343609 St._Louis_Post-Dispatch||Joe_Williams
0.736824 0.625697 Variety||Ronnie_Scheib
0.727285 0.390501 Philadelphia_Inquirer||Steven_Rea
0.727253 0.601566 Baltimore_Sun||Michael_Sragow
0.726622 0.597790 TV_Guide||Ken_Fox
0.724791 0.538506 Chicago_Reader||J.R._Jones
0.720790 0.510877 Miami_Herald||Rene_Rodriguez
0.720746 0.697741 New_York_Post||V.A._Musetto
0.718309 0.203427 Portland_Oregonian||M._E._Russell
0.712691 0.528692 The_Onion_(A.V._Club)||Tasha_Robinson
0.712378 1.000000 Chicago_Sun-Times||Roger_Ebert
0.711276 0.305710 Philadelphia_Inquirer||Carrie_Rickey
0.710998 1.015834 Village_Voice||J._Hoberman
0.710514 0.858842 Salon.com||Andrew_O'Hehir
0.709376 0.189461 Miami_Herald||Connie_Ogle
0.707529 0.236771 Portland_Oregonian||Marc_Mohan
0.705446 0.325974 The_Hollywood_Reporter||Kirk_Honeycutt
0.703620 0.687799 TV_Guide||Maitland_McDonagh
0.690279 0.618218 Chicago_Reader||Andrea_Gronvall
0.689808 0.582131 Village_Voice||Ella_Taylor
0.683692 0.811155 Salon.com||Stephanie_Zacharek
0.681398 0.264487 ReelViews||James_Berardinelli
0.678081 0.578529 Christian_Science_Monitor||Peter_Rainer
0.667241 0.647286 The_Globe_and_Mail_(Toronto)||Rick_Groen
0.662665 0.901257 The_Onion_(A.V._Club)||Noel_Murray
0.637519 0.689861 Los_Angeles_Times||Betsy_Sharkey
0.631693 0.579652 New_York_Daily_News||Joe_Neumaier
0.621878 0.909943 New_York_Magazine||David_Edelstein
0.615932 0.587784 Entertainment_Weekly||Owen_Gleiberman
0.607102 0.349243 Portland_Oregonian||Shawn_Levy
0.574937 0.958670 The_New_Yorker||David_Denby
0.571337 0.684707 Charlotte_Observer||Lawrence_Toppman
0.532293 0.600312 San_Francisco_Chronicle||Mick_LaSalle
0.504908 0.595609 New_York_Post||Kyle_Smith

To illustrate the difference in correlation, let's pick out two prominent members of this list, one near the top and one near the bottom, and plot their review scores versus the metascore.

So here we have David Denby's and A.O. Scott's review scores on the X axis, plotted against the metascore on the Y axis. As you can see there is a clearer relationship between A.O. Scott's review score and the metascore (C=.817) than Denby's (C=.575). Note we've added a little bit of scatter to the critics' review scores so that the points don't all lie on top of one another. Let's for a moment drill down past the mathematical abstractions and look at the specific reviews that are causing this difference. Four red points stick out in the graph, three in the upper left where Denby thought the movie was much worse than everyone else, and one in the lower right, where he thought it was much better than everyone else. The three movies he was more critical of than most were "A Serious Man" (Denby 37, Meta 79), "The Dark Knight" (Denby 50, Meta 82), and "Iron Man" (Denby 50, Meta 79). The one movie he was hot on that others were not was "Hancock" (Denby 90, Meta 49). It's interesting, maybe, that 3 of these 4 are movies about superheroes. Maybe Denby views the genre through a different lens than most others, or maybe he simply doesn't care for the genre, as Hancock is really a sendup of the superhero archetype and, reading Denby's review, he seems to love that idea: "'Hancock' has the grace to acknowledge the audience’s increasing impatience with digital wonder". This example is illustrative, in that I don't think Denby is a bad critic because he sees these movies differently, and there are probably many who read and relate to Denby precisely because he has the perspective that he does.

The above plot shows the relationship between that correlation coefficient and the metacritic weight we calculated above. It appears that there is no strong relationship between the weight metacritic assigns to these critics and how reliably their reviews track the consensus opinion. Denby and Scott are good examples of two highly weighted critics on opposing ends of the correlation spectrum.

It's not clear that we would want the highly weighted reviewers to have the highest correlations with consensus opinion, but it is interesting that there is no relationship in either direction. It's also interesting with respect to our predictions of the weights, as it seems to suggest that correlations between the reviewers' scores and the metascores are not having a strong systematic effect on the weight predictions.
