At NeurIPS in December I met Rylan Schaeffer from Stanford, author (with Brando Miranda and Sanmi Koyejo) of this fascinating paper about the benchmarks used to measure the capabilities of LLMs. He and his co-authors found that many of the most common benchmarks use non-linear or discontinuous metrics to measure capabilities that should really be measured with linear metrics. The non-linear metrics show sudden jumps in ability as models get bigger: the so-called emergent abilities. But if you change the metric so it's linear, models show steady, predictable progress as they get bigger. Nothing magical about it.
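To see how the choice of metric alone can produce an apparent jump, here is a minimal sketch in Python. It is my own toy illustration, not code or data from the paper. It assumes a hypothetical task where the answer is 10 tokens long, that each token is right or wrong independently, and that per-token accuracy rises smoothly with model size. The accuracy values are made up. Per-token accuracy plays the role of the linear metric, and exact match (every token correct) plays the role of the non-linear one.

```python
# Toy illustration: the same smooth improvement, scored two ways.
# All numbers are hypothetical and chosen only for illustration.

ANSWER_LENGTH = 10  # assumed number of tokens in the target answer


def exact_match(per_token_accuracy: float, length: int = ANSWER_LENGTH) -> float:
    """Probability that every token is correct, assuming independent tokens."""
    return per_token_accuracy ** length


# Hypothetical per-token accuracies for successively larger models.
per_token_accuracies = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]

# Pair each linear score with the non-linear score it implies.
scores = [(p, exact_match(p)) for p in per_token_accuracies]

print("per-token  exact-match")
for per_token, exact in scores:
    print(f"{per_token:9.2f}  {exact:11.4f}")
```

In the printed table the per-token column climbs evenly, while the exact-match column sits near zero for the smaller models and then shoots up for the largest ones. The underlying improvement is identical in both columns; only the scoring rule differs.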
Click here for a reprint of an article I wrote for American Scientist (March-April 2024, Vol. 112).