Hi Konrad, I agree with your comments here, but how does this fit with the general practice in NeuroAI, and with your claim that there is too much hypothesis testing in NeuroAI and neuroscience? It seems to me that most of NeuroAI is focused on making predictions with naturalistic stimuli, while ignoring experiments that manipulate stimuli to test specific hypotheses about the algorithms and mechanisms that support human and ANN performance. When models are tested on these sorts of experiments (typical in psychology), they generally fail.
Even more problematic, reviewers and editors are so committed to reporting better and better predictions that they show little interest in publishing studies that falsify ANNs. For example, we can’t get our MindSet Vision toolkit (Biscione et al., 2025), designed to facilitate this alternative experimental approach to NeuroAI, published: it has been rejected twice at NeurIPS and once at ICLR, with comments like “It remains unclear what we should do or not do to improve the designs of these models even after benchmarking models on these experiments”.
Or consider the Centaur model published in Nature, which catastrophically fails when tested on experiments that manipulate variables (Bowers et al., 2025). For instance, it can often recall 256 digits correctly in an STM digit-span task (humans recall about 7 digits), and it can output accurate responses in a serial reaction time task in 1 ms (I sketch this kind of digit-span probe in code at the end of this comment). The editor rejected our commentary, writing:
“…we have come to the editorial decision that publication of this exchange is not justified. This is because we feel that the authors have acknowledged that Centaur is not a theory, nor should it be expected to replicate in extreme situations.”
We then submitted our work to eLife and got the following:
"…as usually with LLM-stuff, and with benchmarks in general, having the criticism hinge on one result that could change with training data/regime is not ideal”.
Yes, some other model might succeed (and our criticism rests on two sets of experiments, not one result), but the point is that the model published in Nature does not. And it will be hard to find a better model of the human mind if its flaws are difficult to publish and the focus stays on predicting held-out samples. Falsification is not much appreciated in NeuroAI. Bowers et al. (2023) provide multiple reviews in which reviewers and editors say it is necessary to provide “solutions” to publish; that is, to show that ANNs are like humans. Or check out my talk at NeurIPS from a few years ago, where I go through many more examples of reviewers and editors rejecting falsification because they are more interested in solutions: https://neurips.cc/virtual/2022/63150
I agree with your post, and I think it undermines many of the claims made in NeuroAI. Am I being unfair to NeuroAI?
Jeff Bowers
Biscione, V., et al. (2025). MindSet: Vision. A toolbox for testing DNNs on key psychological experiments. https://openreview.net/forum?id=VkPUQJaoO1
Bowers, J. S., Malhotra, G., Adolfi, F., Dujmović, M., Montero, M. L., Biscione, V., ... & Heaton, R. F. (2023). On the importance of severely testing deep learning models of cognition. Cognitive Systems Research, 82, 101158.
Bowers, J. S., Puebla, G., Thorat, S., Tsetsos, K., & Ludwig, C. J. H. (2025). Centaur: A model without a theory. PsyArXiv. https://doi.org/10.31234/osf.io/v9w37_v2
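P.S. Since I mention the digit-span manipulation above, here is a minimal sketch of how one might run it. The model ID and prompt wording are hypothetical placeholders, not the protocol from Bowers et al. (2025):

```python
# Minimal sketch of an LLM digit-span manipulation. The model ID and prompt
# wording are hypothetical placeholders, not the protocol from Bowers et al. (2025).
import random
from transformers import pipeline

MODEL_ID = "some-org/centaur-like-model"  # hypothetical; substitute a real checkpoint

def digit_span_trial(generator, span_length):
    """Present a random digit list and check for perfect serial recall."""
    digits = "".join(random.choice("0123456789") for _ in range(span_length))
    prompt = ("You will see a list of digits. Repeat them back in order.\n"
              f"Digits: {digits}\nRecall:")
    out = generator(prompt, max_new_tokens=span_length + 10)[0]["generated_text"]
    recalled = "".join(ch for ch in out[len(prompt):] if ch.isdigit())
    return recalled[:span_length] == digits

generator = pipeline("text-generation", model=MODEL_ID)
# Humans plateau around 7 +/- 2 items; a human-like model should fail long before 256.
for span in (4, 8, 16, 64, 256):
    correct = sum(digit_span_trial(generator, span) for _ in range(20))
    print(f"span {span:>3}: {correct}/20 trials perfectly recalled")
```

On a human-like system, perfect recall should collapse somewhere around a span of 7; that collapse is exactly the manipulation Centaur fails.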
Hi Jeff,
I think we are despairing about the same thing. Good predictions do not mean we understand things. Much of NeuroAI is built on that fallacy. As is much of the rest of science at the moment. And much of the interpretable AI field.
Konrad
Pleased we agree! But it is a broader issue than the current focus on prediction. Prediction serves the purpose of showing that ANNs are like humans, and that is the conclusion that reviewers and editors want to see. It is the incentives in science, with one bandwagon after another changing the standards of evidence required.
I like challenging the status quo, and I've found that in whatever domain, for a variety of reasons, if you are not in the in-crowd you are in for a rough ride. Wearing another hat, I've recently been challenging research on reading instruction and the focus on "phonics instruction". The level of gatekeeping is extraordinary there too. I've written a series of blog posts here: https://jeffbowers.blogs.bristol.ac.uk/blog/. I've paused, but perhaps I should start again with a focus on NeuroAI.
A long time back, I remember reading a series of posts online discussing the dead salmon article, which was rejected over and over again and eventually published in a relatively obscure venue (I don't remember where). People were talking about the new statistical methods required to fix the problem (the finding that theory of mind is located in the tail of a dead salmon). And new methods were indeed needed, just as NeuroAI needs to reject correlations as a method for supporting causal conclusions. But someone at the time made a point that really struck me -- the problem was that many researchers were more focused on their careers than on understanding, and the fMRI methods of the time served careers quite well.
Prediction says precious little about function, and editors and reviewers simply misunderstand this (and many untrained neural networks are better than trained ones!).
The microprocessor article wouldn't have gotten published, I think, if not for the fact that it had found 100k readers by the time PLOS CB got it, and because at least one of the reviewers is a great scientist.
The system has a good immune system. There are so many BS fields that are only propped up by the field reliably rejecting the critical papers.
Shameless plug, but very aligned with this thread: we posted “Illusions of Alignment Between Large Language Models and Brains Emerge From Fragile Methods and Overlooked Confounds” on bioRxiv last year; the revised version was accepted in principle at Nature Communications last week. Our core point is basically the same worry you’re both circling: strong model–brain prediction can arise from fragile evaluation choices and overlooked confounds, so it’s risky to treat correlation / brain-score improvements as mechanistic insight.
Here's the link to the preprint: https://www.biorxiv.org/content/10.1101/2025.03.09.642245v1.abstract
And, to echo the “immune system” point: more time than I’d like to admit went into getting this critique-of-evidence paper to a form that could survive review. To Jeff's point, it’s hard not to notice that skeptical / “this evidence is fragile” papers often seem to face a higher bar than “here’s a new SOTA correlation” papers.
Very cool. Also advertised it on Twitter, Bluesky, and LinkedIn...
Thank you so much; we really appreciate it!
Congratulations! I really liked (and cited) early iterations of this paper from a few years back. A question, if you don't mind: what was the process like of getting this published? Were you hitting a brick wall for some time (which would explain why it has taken years)? I'm so pleased it is coming out at a top venue.
Thanks Jeff! I really appreciate it, and I’m glad you liked (and cited) the earlier iterations.
On the timeline/process: the first version (“What are LLMs mapping to in the brain?…”) went online in May 2024. That paper was a critique of brain-score over-interpretation built around the Schrimpf 2021 benchmark suite, showing how much apparent “alignment” could be explained by simple structure/confounds, and how certain evaluation choices (e.g., shuffled splits) could lead to absurdly high scores from trivial autocorrelation models. The main pushback we got wasn’t “this is wrong,” but that the framing read as too broad and that we weren’t explicit enough about exactly which prior results/practices it did vs. didn’t bear on.
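To make the shuffled-splits point concrete, here's a toy simulation of my own construction (not our paper's analysis code). With a temporally smooth signal, a "model" that just interpolates the training data in time scores near ceiling under a shuffled split and far worse when a contiguous block is held out:

```python
# Toy demonstration (my construction, not the paper's analysis code) of how
# shuffled train/test splits reward a trivial autocorrelation "model" on
# temporally smooth data such as fMRI time series.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
# Smooth, autocorrelated "brain" signal: moving average of white noise.
y = np.convolve(rng.standard_normal(T + 24), np.ones(25) / 25, mode="valid")

def interp_model_score(train_idx, test_idx):
    """'Model' = linear interpolation of the training signal in time."""
    order = np.argsort(train_idx)
    pred = np.interp(test_idx, train_idx[order], y[train_idx][order])
    return np.corrcoef(pred, y[test_idx])[0, 1]

t = np.arange(T)
perm = rng.permutation(T)
shuffled = (perm[:800], np.sort(perm[800:]))        # test points interleaved with train
contiguous = (np.r_[t[:700], t[900:]], t[700:900])  # held-out middle block

print(f"shuffled split r   = {interp_model_score(*shuffled):.2f}")    # near ceiling
print(f"contiguous split r = {interp_model_score(*contiguous):.2f}")  # far lower
```

The point is just that interleaved test points let temporal autocorrelation masquerade as predictivity; the exact numbers will vary with the seed and smoothing window.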
Making the critique more precise meant writing the version we were initially hesitant about: a more targeted critique of Schrimpf et al. 2021’s key results and conclusions, paired with a more streamlined analysis of how simple feature spaces can explain a lot of the apparent predictivity in those three datasets. That became the “Illusions of Alignment” preprint we posted in March 2025.
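And here's a similar toy sketch for the simple-feature point (again my own construction, with made-up numbers): when a simulated voxel is driven by nothing richer than word count, a high-dimensional "embedding" that merely encodes word count plus noise looks about as predictive as word count itself:

```python
# Toy sketch (my construction, with made-up numbers): when a "voxel" is driven
# by a mundane stimulus property, a high-dimensional embedding that merely
# encodes that property looks just as predictive as the property itself.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_sent = 400
word_count = rng.integers(3, 30, size=n_sent).astype(float)  # simple stimulus feature
voxel = 0.8 * word_count + rng.standard_normal(n_sent)       # simulated response

# Hypothetical 512-d "LLM embedding" that carries word count plus noise.
embedding = (np.outer(word_count, rng.standard_normal(512)) / 30
             + rng.standard_normal((n_sent, 512)))

for name, X in [("word count alone", word_count.reshape(-1, 1)),
                ("512-d embedding", embedding)]:
    ridge = RidgeCV(alphas=np.logspace(-2, 4, 13))
    r2 = cross_val_score(ridge, X, voxel, cv=5).mean()
    print(f"{name:>17}: cross-validated R^2 = {r2:.2f}")
```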
For the journal submission, we also went out of our way to carve out the scope carefully, including a table of ~35 papers across labs to be explicit about which claims are plausibly impacted by these failure modes vs. which aren't. I totally understand why reviewers want that kind of clarity for a skeptical paper, but it does mean you end up doing a fairly comprehensive literature audit to make the critique maximally fair and precise.
Minor note: the title for the journal version is different; it won’t include “Illusions” or “fragile.” This was, in my opinion, the greatest tragedy of the revision process.
Nice one, thank you.
A clear case for distinguishing predictive performance from mechanistic understanding and generalizable models.
I will add that causal associations can also fail to accurately predict perturbation outcomes if the causal associations are indirect (e.g., GWAS) and the causal effect depends on other factors that change across cohorts. https://www.youtube.com/watch?v=P0-_gDUNikc&t=3432s
I am pretty confused by this post. Do you consider the examples you mentioned (torcetrapib, homocysteine, and CARET) to be genuine failures to understand the difference between causation and prediction? I interpret all of these as "the observational data indicated these might be connected, so the people involved ran the (very expensive in each case) experiment, and the experiment failed." The failed experiments *were* the experiments that would've allowed mechanistic claims to be made. I'm not an expert here, so perhaps I'm misunderstanding the history in some important way --- can you clarify?
Agreed, and I should have worded it differently.