The peer-review system at Machine Learning conferences has come under much criticism over the last years. One major driver was the infamous 2014 NeurIPS experiment, where a subset of papers were given to two different sets of reviewers. This experiment showed that only about half of all accepted papers were consistently accepted by both committees and demonstrated significant influence of subjectivity. This paper revisits the data from the 2014 experiment and traces the fate of accepted and rejected papers during the 7 years since, and analyzes how well reviewers can assess future impact, among other things.
0:00 - Intro & Overview
1:20 - Recap: The 2014 NeurIPS Experiment
5:40 - How much of reviewing is subjective?
11:00 - Validation via simulation
15:45 - Can reviewers predict future impact?
23:10 - Discussion & Comments
In this paper we revisit the 2014 NeurIPS experiment that examined inconsistency in conference peer review. We determine that 50% of the variation in reviewer quality scores was subjective in origin. Further, with seven years passing since the experiment we find that for accepted papers, there is no correlation between quality scores and impact of the paper as measured as a function of citation count. We trace the fate of rejected papers, recovering where these papers were eventually published. For these papers we find a correlation between quality scores and impact. We conclude that the reviewing process for the 2014 conference was good for identifying poor papers, but poor for identifying good papers. We give some suggestions for improving the reviewing process but also warn against removing the subjective element. Finally, we suggest that the real conclusion of the experiment is that the community should place less onus on the notion of top-tier conference publications when assessing the quality of individual researchers. For NeurIPS 2021, the PCs are repeating the experiment, as well as conducting new ones.
Authors: Corinna Cortes, Neil D. Lawrence