part of a series it calls DREAM Challenges. The phase-one winners, whose A.I. systems trained with a set of 640,000 breast images, shared a $200,000 prize for the top scores predicting breast cancer. But they now must work together as a team, using nearly 2 million images, to win the final $1 million. And that prize comes with a much higher bar: They must at least match the gold standard of human accuracy. (Neither of the phase-one winners came close.)
Justin Guinney, director of Sage’s computational oncology group, says the software systems will require significant improvement to approach that standard.
According to the U.S. government-funded Breast Cancer Surveillance Consortium, the going rate for radiologists is 87 percent sensitivity, meaning that a radiologist will correctly flag 87 percent of mammograms that actually show cancer, and 89 percent specificity, meaning that 89 percent of mammograms without cancer will be correctly read as cancer-free.
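For readers who want the arithmetic behind those two rates, here is a minimal Python sketch of how sensitivity and specificity fall out of a confusion matrix. The counts are invented for illustration, chosen only so that the results match the 87 and 89 percent benchmarks above.

```python
# Illustrative only: made-up counts, not data from the challenge.
true_positives  = 87   # cancers correctly flagged as suspicious
false_negatives = 13   # cancers that were missed
true_negatives  = 89   # cancer-free scans correctly read as clear
false_positives = 11   # cancer-free scans incorrectly flagged

sensitivity = true_positives / (true_positives + false_negatives)   # 0.87
specificity = true_negatives / (true_negatives + false_positives)   # 0.89

print(f"Sensitivity: {sensitivity:.0%}, Specificity: {specificity:.0%}")
```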
A technical note: Just transferring a portion of the original 640,000 images from donor Kaiser Permanente slowed Kaiser’s system to a crawl, says Guinney. But even triple that—nearly 2 million breast scans—isn’t as much data as it seems, which is another reason these grassroots competitions, and A.I. medical systems, are just scratching the surface. Only 1 in 86 images from the original mammogram data set was cancer-positive. The more images of cancer a system can digest, the more it learns. “You want as many examples as you can get, especially with deep learning,” says Guinney, referring to a type of machine learning that uses many layers of neural networks.
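To make that imbalance concrete, here is a hedged sketch of how a training pipeline might compensate when only 1 in 86 images is cancer-positive. It uses standard PyTorch class weighting; the numbers are back-of-the-envelope, and this is not the challenge teams' actual code.

```python
# Hypothetical illustration of the class imbalance described above.
import torch

total_images = 640_000
positive_rate = 1 / 86                           # roughly 1 in 86 scans shows cancer
n_positive = int(total_images * positive_rate)   # ~7,400 cancer-positive images
n_negative = total_images - n_positive           # ~632,600 cancer-free images

# Up-weight the rare positive class so the network cannot simply learn to
# answer "no cancer" for everything and still look accurate on paper.
pos_weight = torch.tensor([n_negative / n_positive])   # ~85x
loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
```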
The competitors, now working as a team, will have much greater computing resources—like driving a Ferrari instead of a Ford Focus, as Guinney puts it. They’ll be able to add more learning layers to their A.I. algorithm: a bigger brain, in effect. And unlike the Data Science Bowl, the mammography challenge is using patient metadata (on age, cosmetic surgery or implants, family history, and so on) in the training and testing process.
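As a rough illustration of how patient metadata can be folded into an image model, one common pattern is to concatenate the features a convolutional network extracts from a scan with a small vector of patient attributes before the final classification layer. The layer sizes and metadata fields below are assumptions for the sketch, not a description of the contestants' actual systems.

```python
# Sketch of combining scan features with patient metadata (illustrative only).
import torch
import torch.nn as nn

class MammogramWithMetadata(nn.Module):
    def __init__(self, image_feature_dim=512, metadata_dim=4):
        super().__init__()
        # Stand-in for a deep convolutional "brain" that turns a scan
        # into a fixed-length feature vector.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, image_feature_dim), nn.ReLU(),
        )
        # Metadata branch: e.g. age, implants (0/1), family history (0/1), etc.
        self.metadata_encoder = nn.Sequential(
            nn.Linear(metadata_dim, 32), nn.ReLU(),
        )
        # Joint classifier over the concatenated features.
        self.classifier = nn.Linear(image_feature_dim + 32, 1)

    def forward(self, scan, metadata):
        combined = torch.cat(
            [self.image_encoder(scan), self.metadata_encoder(metadata)], dim=1
        )
        return self.classifier(combined)   # logit; sigmoid gives cancer probability

model = MammogramWithMetadata()
logit = model(torch.randn(1, 1, 256, 256), torch.tensor([[52.0, 0.0, 1.0, 0.0]]))
```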
Despite that boost, the human-accuracy goal “will be extremely hard to achieve” in the three months they have, says Olivier Clatz, CEO of French medical imaging firm Therapixel, one winner in the first phase of the competition. But when the contest is over, everyone will have free license to use the final algorithm. With that, Clatz believes Therapixel can eventually match the human-accuracy standard.
Ultimately, any A.I. radiology product that emerges from these two contests, or elsewhere, will have to prove itself in conditions more like the real world. It’s unclear what the parameters of a prospective clinical study—one that recruits new patients and tests the software on their scans, then waits for the patients’ health outcomes—would look like, but it could require images from hundreds of thousands of patients.
NCI’s Farahani, for one, looks forward to the day when an A.I. algorithm has enough promise to merit a real-world test. “It will be interesting to run on a new, fresh set of data,” he says, instead of old sets, which might not present enough complexities. “That’s the drawback of challenges. They train their algorithms on specific collections.”
Lung cancer image courtesy of the National Cancer Institute.