If legibility of expertise is a bottleneck to progress and the adequacy of civilization, creating better benchmarks of knowledge and expertise for humans might be a valuable public good. While that seems difficult for aesthetics, it seems easier for engineering? I'd rather listen to a physics PhD who gets Thinking Physics questions right (with good calibration) years into their professional career than one who doesn't.

One way to do that is to force experts to make forecasts, but forecasts take a lot of time to hash out and even more time to resolve.
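
For concreteness, here is a minimal sketch (not part of the original idea) of how calibration on such forecasts could be scored, using the Brier score; the questions, probabilities, and outcomes below are made up:

```python
# Minimal sketch: scoring an expert's probabilistic forecasts with the
# Brier score (lower is better). All numbers are invented.

forecasts = [
    # (probability assigned to "yes", actual outcome: 1 = yes, 0 = no)
    (0.9, 1),
    (0.7, 0),
    (0.2, 0),
    (0.6, 1),
]

brier = sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)
print(f"Brier score: {brier:.3f}")  # 0.0 = perfect; 0.25 = always saying 50%
```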

One idea I just had related to this: the same way we use datasets like MMLU and MMMU to evaluate language models, we could use a similar small dataset for humans. Experts would be allowed to take the test, performance on the test would always be public, and a new test would be created every month or year.
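
To make the mechanics concrete, here is a toy sketch of what the public, per-edition leaderboard could look like; the experts, test IDs, and scores are all invented:

```python
from dataclasses import dataclass

# Toy sketch of a public leaderboard over a rotating test (all data invented).

@dataclass
class Attempt:
    expert: str
    test_id: str   # e.g. "2025-01" for the January edition of the test
    score: float   # fraction of questions answered correctly

attempts = [
    Attempt("expert_a", "2025-01", 0.82),
    Attempt("expert_b", "2025-01", 0.64),
    Attempt("expert_a", "2025-02", 0.78),
]

def public_leaderboard(attempts, test_id):
    """All attempts on a given test edition, best score first."""
    rows = [a for a in attempts if a.test_id == test_id]
    return sorted(rows, key=lambda a: a.score, reverse=True)

for a in public_leaderboard(attempts, "2025-01"):
    print(f"{a.expert}: {a.score:.0%}")
```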

Maybe you could also get some participants to answer these questions in a quiz-show format and put it on YouTube, so the test becomes more popular? I would watch that.

The disadvantage of this method compared to the tests people prepare for in academia is that the data would be quite noisy. On the other hand, this measure could be more robust to Goodharting and fraud (although of course that would become a harder problem once people actually cared about the test). The process would inevitably miss genius hedgehogs, of course, but maybe not their ideas, if the generalists can properly evaluate them.

There are also some obvious issues in choosing what kinds of questions one uses as representative.