Discussion about this post

Jeff Bowers:

Hi Konrad, I agree with your comments here, but how does this fit with the general practice in NeuroAI, and your claim that there is too much hypothesis testing in NeuroAI and neuroscience? Seems to me that most of NeuroAI is focused on making predictions with naturalistic stimuli, and ignoring experiments that manipulate stimuli to test specific hypotheses about the algorithms and mechanisms that support human and ANN performance. When models are tested on these sorts of experiments (typical in psychology), they generally fail.

Even more problematic, reviewers and editors are so committed to reporting better and better predictions that they are not much interested in publishing studies that falsify ANNs. For example, we can’t get our MindSet Vision toolkit (designed to facilitate this alternative experimental approach to NeuroAI) published: twice rejected at NeurIPS, once at ICLR, with comments like “It remains unclear what we should do or not do to improve the designs of these models even after benchmarking models on these experiments”.

Or consider the Centaur model published in Nature that catastrophically fails when tested on experiments that manipulate variables (Bowers et al., 2025). For instance, it can often recall 256 digits correctly in an STM digit span task (humans recall about 7 digits), and can output accurate responses in a serial reaction time task in 1ms. The editor rejected our commentary, writing:

“…we have come to the editorial decision that publication of this exchange is not justified. This is because we feel that the authors have acknowledged that Centaur is not a theory, nor should it be expected to replicate in extreme situations.”

We then submitted our work to eLife, and got the following:

"…as usually with LLM-stuff, and with benchmarks in general, having the criticism hinge on one result that could change with training data/regime is not ideal”.

Yes, some other model might succeed (in two sets of experiments, not one result), but the point is that the model published in Nature does not. And it will be hard to find a better model of the human mind if its flaws are difficult to publish and the focus is on predicting held-out samples. Falsification is not much appreciated in NeuroAI. Bowers et al. (2023) provide multiple reviews where reviewers and editors say it is necessary to provide “solutions” to publish; that is, to show that ANNs are like humans. Or check out my talk at NeurIPS from a few years ago where I go through many more examples of reviewers and editors rejecting falsification because they are more interested in solutions: https://neurips.cc/virtual/2022/63150

I agree with your post, and think that it undermines many of the claims made in NeuroAI. Am I being unfair to NeuroAI?

Jeff Bowers

Biscione et al. (2025). MindSet: Vision. A toolbox for testing DNNs on key psychological experiments. https://openreview.net/forum?id=VkPUQJaoO1

Bowers, J. S., Malhotra, G., Adolfi, F., Dujmović, M., Montero, M. L., Biscione, V., ... & Heaton, R. F. (2023). On the importance of severely testing deep learning models of cognition. Cognitive Systems Research, 82, 101158.

Bowers, J. S., Puebla, G., Thorat, S., Tsetsos, K., & Ludwig, C. J. H. (2025). Centaur: A model without a theory. PsyArXiv. https://doi.org/10.31234/osf.io/v9w37_v2

Randal Koene:

Nice one, thank you.

12 more comments...
