This high schooler created a drug discovery search engine

A change of perspective allowed a high school researcher to better find drug candidates from a small dataset.

June 12, 2021

Between his mom’s place in Manhattan, his dad in Queens, and his high school in the Bronx, Noah Getz is on the subway a lot. It gives him time to read and to think.

Our first coronavirus summer was waning, and he’d been wrestling with a weighty science problem: using machine learning to hunt down tiny molecules that may help treat Alzheimer’s. Thus far, his AI had been spitting out results that were “almost comically bad.”

The problem was that the algorithms Getz was using did their best when they had massive amounts of data to sift through and discover patterns in. Getz’ data set was far smaller; he was working with one lab at Mount Sinai, not a multinational pharmaceutical company with a galaxy-sized drug library.

“It (was) easier for it to assume nothing worked at all than for it to learn any trends,” he says.

Since starting work at the Mount Sinai lab of Charles Mobbs in 2019, the summer before his junior year at Bronx High School of Science, Getz had been on a personal mission. On both sides of his family, he’d seen what Alzheimer’s can do, and he’d reached out to just about every lab in the city looking for a place he could help.

So far, nothing he’d tried to coax his algorithm into giving up molecules worth testing had worked.

As his train sliced through the city, third rail thrumming, Getz was sifting through a pile of computer science papers when he saw one about using machine learning for information retrieval.

At its broadest, information retrieval means finding a specific thing — or, more likely, a group of things, ranked by their relevance to your search. Think Google’s search engine — but for drug discovery. It was exactly what Getz was trying to do!

Using the insight he found in the machine-learning paper, he tweaked his algorithm that night, a small change that made a big difference — and a result that earned the now MIT-bound 17-year-old second place in the Regeneron Science Talent Search.

“It worked. And that was just really, really crazy to see.”

“The Valley of Death”

Drug discovery is science at its most frustrating and messy. From an untold number of potential molecules and compounds, researchers winnow down the most promising candidates, which then go into labs, test tubes, and maybe mice — and maybe, very rarely, people.

“For every 10,000 candidate drugs, only five are ever tested in clinical trials,” Stanford University neurosurgery resident Teresa Purzner wrote in 2018. Those few that make it through that last gauntlet now cost, on average, $2.5 billion, from soup to nuts.

Between discovery and clinical trials, Purzner says, is “the valley of death.”

But, thanks to breakthroughs in machine learning, researchers are now discovering potential drugs more quickly and cheaply. When MIT researchers turned an AI loose on roughly 6,000 compounds, the algorithm helped them discover a powerful new antibiotic, with a structure unlike any other known; when another AI predicted the new drug wouldn’t be toxic in humans, they named it and research began.

University of Pittsburgh assistant professor of biological science Jacob Durrant — whose lab focuses on computer-aided drug design — told The Guardian last year that “any method that can speed early-stage drug discovery has the potential to make a big impact.”

And the biotech and pharmaceutical sectors are paying attention — billions of dollars worth of attention, in fact.

The Hunt for a Small Molecule

Charles Mobbs’ lab at Mount Sinai’s Icahn School of Medicine, where Getz is a research volunteer, studies the mechanisms behind age-related diseases, including Alzheimer’s.

Specifically, Mobbs and his team are searching for molecules that can reduce inflammation in the brain — a factor tied to Alzheimer’s disease — by cutting down a protein called TNF alpha.

But not just any old molecule will do. The brain is protected by the blood-brain barrier, a wall of cells that is extremely selective about what gets inside. Ensconced behind that barrier, the brain is inherently difficult to get drugs into (without boring a hole in the skull, of course).

To have any shot of affecting Alzheimer’s disease, they need molecules that are both able to block TNF alpha and small enough to slip through the barrier.

Mobbs’ lab has already found a few such candidates, but with such long odds against any potential new drug, they knew they needed more. Potentially, a lot more.

“I started doing this machine-learning project as a side project when I got home from the lab,” Getz says.

As it took shape, Getz brought his model to Mobbs, who was excited about the idea.

Getz buried himself in the scientific literature, but assembled his machine-learning model from scratch.

“I made sure I understood it,” Getz says. “The logic as well as the math that would go behind a particular model. And I think doing that, even though it took so much longer than (it) probably would have otherwise, really gave me a pretty decent foundation of knowledge.”

Which would come in handy when the model failed.

A Drug Discovery Search Engine

When she heard about Getz’ revelation on the MTA, Sally Jo Cunningham had one thought: “Wow. That was clever.”

An associate professor of computer science at New Zealand’s University of Waikato, Cunningham says Getz’ original plan to enlist machine learning to retrieve information on drug molecules didn’t seem promising, because his database was just too small.

The pattern-finding abilities that make machine learning such a powerful tool for discovering drugs only work with enough data to find those patterns in. With a small data set, you’d just be “asking an inappropriate question,” Cunningham says.

Those machine learning models treat each compound as a data point, Getz says; he was working with roughly 20 compounds, which means only about 20 data points.

But because an information retrieval algorithm ranks and compares all of the compounds to one another, it creates far more data to work from: each one of those comparisons, instead of just the original 20 compounds by themselves. Twenty compounds, for instance, yield 190 unique pairs.

Suddenly, Getz’ machine-learning model had enough data points to get to work on.
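To make the arithmetic concrete, here is a minimal sketch of how comparing compounds pairwise multiplies the number of training examples. It is not Getz’s actual code, which the article does not describe; it illustrates a generic pairwise learning-to-rank transformation. The compound names, feature values, activity scores, and the `pairwise_examples` helper are all hypothetical placeholders.

```python
from itertools import combinations


def pairwise_examples(compounds):
    """Turn per-compound records into pairwise comparison examples.

    Each record is (name, features, activity). Assumption: a higher
    activity score means the compound performed better in the lab.
    """
    examples = []
    for (name_a, feats_a, act_a), (name_b, feats_b, act_b) in combinations(compounds, 2):
        if act_a == act_b:
            continue  # a tie carries no ordering information
        # Feature difference: what distinguishes compound A from compound B.
        diff = [a - b for a, b in zip(feats_a, feats_b)]
        # Label +1 if A outperformed B, -1 otherwise.
        label = 1 if act_a > act_b else -1
        examples.append((name_a, name_b, diff, label))
    return examples


# Made-up placeholder data: (name, [feature_1, feature_2], activity).
compounds = [
    ("cmpd_A", [0.2, 1.0], 0.9),
    ("cmpd_B", [0.5, 0.3], 0.4),
    ("cmpd_C", [0.1, 0.8], 0.7),
    ("cmpd_D", [0.9, 0.1], 0.1),
]

pairs = pairwise_examples(compounds)
print(len(compounds), "compounds ->", len(pairs), "pairwise examples")

# The same counting applied to a set the size of Getz's:
n = 20
print(n, "compounds ->", n * (n - 1) // 2, "unique pairs")
```

With four compounds the sketch produces six comparisons, and with 20 it would produce 190. A model trained on such pairs learns which of two compounds is likely to be better, and those judgments can then be combined into a ranked list.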

Within a few days of his subway epiphany (with his computer periodically slotted into the freezer as it overheated), Getz had his initial results.

“Seeing the numbers pop up, and seeing how much better it performed than what I was doing before, was just a really, really nice thing to see; just sort of like a great ‘a-ha’ moment,” Getz says.

The machine-learning algorithm was now giving him a list of small molecules it thought might lower TNF alpha levels, ranked by relevance, i.e., by which were most likely to work.

The top two small molecules Getz’ information retrieval model selected almost completely eliminated TNF alpha levels in the lab, “which was really crazy to see.”

“Using Noah’s artificial intelligence platform, we optimized drugs and made predictions of compounds that would be even more protective” than the ones they’d already found, Mobbs says.

Small Sets, Strong Results

Traditionally, information retrieval algorithms (like in search engines) aren’t powered by machine learning, according to Frédéric Dubut, principal PM manager of the Core Ranking Team at Bing. Instead, they are based on ordinary statistics and probabilities.

But there are plenty of smaller, more specific data sets researchers are interested in plumbing, which lack big data and require a different approach. In a way, Getz is taking the field back to its roots — information retrieval pioneer Hans Peter Luhn introduced some of the field’s foundational techniques to retrieve chemical compounds from a database, Dubut says.

Getz is now in the process of designing a more robust version of his information retrieval algorithm, hoping to incorporate multiple factors, like dosage and toxicity, into the mix for ever-sharper results.

Smaller labs, Getz hopes, may look at his project and see the shift to information retrieval as a way that they, too, can use machine learning on their smaller datasets.

“And that opens up lots of smaller drug discovery labs that were previously at a really big disadvantage to larger companies,” Getz says.

It also allows those labs, which may have deeper knowledge of their compounds, to compete more effectively against large companies, which may make up with capital whatever they may lack in expertise.

He will also need to build a user interface so that people don’t need to be computer scientists to use it, a necessary step toward his more ambitious goal: democratized drug discovery.

“That was Noah’s first instinct,” Mobbs says.

“How can I make it so that everybody in the world can use it?”

We’d love to hear from you! If you have a comment about this article or if you have a tip for a future Freethink story, please email us at tips@freethink.com.