By Aaron E. Carroll
The medical research grant system in the United States, run through the National Institutes of Health, is intended to fund work that spurs innovation and fosters research careers. In many ways, it may be failing.
It has been getting harder for researchers to obtain grant support. A study published in 2015 in JAMA showed that from 2004 to 2012, research funding in the United States increased only 0.8 percent year to year. It hasn’t kept up with the rate of inflation; officials say the N.I.H. has lost about 23 percent of its purchasing power in a recent 12-year span.
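The purchasing-power figure can be turned into a yearly rate with a little arithmetic. The sketch below assumes, purely for illustration, that the 23 percent loss accumulated at a steady compounding rate across the 12 years; the variable names are invented for this example and the calculation is not taken from the study or from the N.I.H.

```python
# Assumption: the 23 percent loss built up at a constant yearly rate.
total_loss = 0.23   # share of purchasing power lost, from the article
years = 12          # length of the span, from the article

# Fraction of purchasing power retained each year under steady compounding.
annual_factor = (1 - total_loss) ** (1 / years)

# Implied yearly loss, expressed in percent.
annual_loss_percent = (1 - annual_factor) * 100

print(f"Implied loss per year: {annual_loss_percent:.2f} percent")
```

Under that assumption the yearly loss works out to a little over 2 percent.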
Because the money available for research doesn’t go as far as it used to, it now takes longer for scientists to get funding. The average researcher with an M.D. is 45 years old (for a Ph.D. it’s 42 years old) before she or he obtains that first R01 (think “big” grant).
Given that R01-level funding is necessary to obtain promotion and tenure (not to mention its role in the science itself), this means that more promising researchers are washing out than ever before. Only about 20 percent of postdoctoral candidates who aim to earn a tenured position in a university achieve that goal.
This new reality can be justified only if those who are weeded out really aren’t as good as those who remain. Are we sure that those who make it are better than those who don’t?
A recent study suggests the grant-making system may be unreliable at its very purpose: distinguishing between grants that deserve funding and those that get nothing.
When health researchers believe they have a good idea for a research study, they most often submit a proposal to the N.I.H. It’s not easy to do so. Grants are hard to write, take a lot of time, and require a lot of experience to obtain.
After they are submitted, applications are sorted by topic areas and then sent to a group of experts called a study section. If any experts have a conflict of interest, they recuse themselves. Applications are usually first reviewed by three members of the study section and then scored on a number of domains from 1 (best) to 9 (worst).
The scores are averaged. The bottom half of applications receive written comments and scores from reviewers, but they are not discussed in the study section meeting. The top half are presented in the meeting by their reviewers, and then the entire study section votes using the same nine-point scale. The grants are then ranked by score, and the best are funded as far as the available money allows. To be funded, a grant has to land at a percentile better than the “payline,” which today is usually between 10 and 15 percent.
Given that there are far more applications than can be funded, and that only the best ones are even discussed, we would hope that study sections can agree on the grades that applications receive, especially at the top end of the spectrum.
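The review steps described above can be sketched in a few lines of code. This is a simplified illustration, not the N.I.H.'s actual procedure: the function names, the application labels, all of the scores, and the way a percentile is computed here (rank divided by the total number of applications) are assumptions made for the example.

```python
from statistics import mean

def triage(preliminary):
    """Split applications into a discussed top half and an undiscussed rest.

    `preliminary` maps an application label to its assigned reviewers'
    scores, where 1 is best and 9 is worst.
    """
    ranked = sorted(preliminary, key=lambda app: mean(preliminary[app]))
    cutoff = (len(ranked) + 1) // 2  # top half, rounding up
    return ranked[:cutoff], ranked[cutoff:]

def fundable(panel_scores, total_applications, payline_percent):
    """Return discussed applications whose percentile beats the payline.

    Percentile here is simply rank / total applications, an assumption
    made to keep the sketch short.
    """
    ranked = sorted(panel_scores, key=lambda app: mean(panel_scores[app]))
    funded = []
    for rank, app in enumerate(ranked, start=1):
        percentile = 100 * rank / total_applications
        if percentile <= payline_percent:
            funded.append(app)
    return funded

# Hypothetical preliminary scores from three reviewers per application.
preliminary = {
    "A": [2, 3, 2], "B": [5, 6, 4], "C": [1, 2, 2], "D": [7, 8, 6],
    "E": [3, 3, 4], "F": [6, 5, 7], "G": [4, 4, 5], "H": [8, 9, 8],
    "I": [3, 2, 3], "J": [5, 5, 6],
}
discussed, not_discussed = triage(preliminary)

# Hypothetical votes from the full study section on the discussed half.
panel = {
    "A": [1, 2, 1, 2], "C": [2, 2, 3, 2], "I": [3, 3, 3, 3],
    "E": [4, 3, 4, 3], "G": [4, 4, 4, 4],
}
funded = fundable(panel, total_applications=len(preliminary), payline_percent=15)

print("Discussed:", discussed)
print("Funded:", funded)
```

With these made-up numbers, five of the 10 applications reach discussion and only one clears a 15 percent payline, which shows how small differences in scores near the top decide who is funded.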
In this study of the system, researchers obtained 25 funded proposals from the National Cancer Institute. Sixteen of them were considered “excellent,” as they were funded the first time they were submitted. The other nine were funded on resubmission — grant applications can be submitted twice — and so can still be considered “very good.”
They then set up mock study sections. They recruited researchers to serve on them, just as researchers do on actual study sections. They assigned those researchers to grant applications, which were reviewed as they would be for the N.I.H. They brought those researchers together in groups of eight to 10 and had them discuss and then score the proposals as they would if actual funding were at stake.
The intraclass correlation, a statistic that measures how much groups agree, was 0 for the scores assigned. This meant that there was no agreement at all on the quality of any application. Because they were concerned about the reliability of this result, the researchers also computed Krippendorff’s alpha, another statistic of agreement. A value above 0.7 (on a scale where 1 is perfect agreement and 0 is agreement no better than chance) is considered “acceptable.” None of the values met that standard; they were all very close to zero. A final statistic measured the overall similarity of scores and found that scores for the same application were no more similar than scores for different applications.
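To give a feel for what an intraclass correlation measures, here is a minimal sketch of one common textbook form, the one-way random-effects version. It is not necessarily the form the researchers used, and the function name and every score below are hypothetical.

```python
from statistics import mean

def icc_oneway(ratings):
    """One-way random-effects intraclass correlation.

    `ratings` is a list of rows, one row per application, each holding
    the same number of scores. Values near 1 mean raters agree; values
    near or below 0 mean scores for the same application are no more
    alike than scores for different applications.
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 rows, each with the same 2+ scores")

    grand_mean = mean([score for row in ratings for score in row])
    row_means = [mean(row) for row in ratings]

    # Variation between applications (mean square between).
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # Variation among raters of the same application (mean square within).
    ms_within = sum(
        (score - m) ** 2 for row, m in zip(ratings, row_means) for score in row
    ) / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("all scores are identical; the statistic is undefined")
    return (ms_between - ms_within) / denominator

# Hypothetical panel whose raters largely agree on each application.
agreeing = [[1, 1, 2], [5, 5, 4], [8, 9, 8]]
# Hypothetical panel whose raters disagree on every application.
disagreeing = [[1, 5, 9], [9, 1, 5], [5, 9, 1]]

icc_agree = icc_oneway(agreeing)
icc_disagree = icc_oneway(disagreeing)
print(round(icc_agree, 2), round(icc_disagree, 2))
```

In the first made-up panel the statistic comes out close to 1; in the second it is not above zero, which is the kind of result the study reported for its mock study sections.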
There wasn’t even any difference between the scores for those funded immediately and those requiring resubmission.