TGen saves time and expands research using MemVerge
TGen needed more memory to overcome HPC analysis bottlenecks; the nonprofit used MemVerge's Big Memory software to create larger memory pools and finish research and analysis faster.
When researchers at the Translational Genomics Research Institute sought to accelerate genetic testing on a lung disease, they found the analyses were taking too long -- in one case, several months for a single test -- due to a lack of memory. Needing a way to expand its memory footprint without overrunning its budget, the nonprofit turned to MemVerge and its Big Memory software.
Headquartered in Phoenix, TGen is a nonprofit organization that focuses on improving diagnostics and therapies for diseases including Alzheimer's, Parkinson's, cancer and a lung disease known as idiopathic pulmonary fibrosis.
TGen uses a high-performance computing environment for research and analysis. The goal is to discover gene expression levels in certain cell populations that correlate to, and are likely responsible for, disease states, according to Glen Otero, vice president of scientific computing at TGen and former life sciences architect for Dell's HPC team.
Otero joined TGen four years ago, with the aim of keeping scientific computing for genomic analysis on the cutting edge of performance, he said.
"We test a lot of new technologies, like storage, processors, GPUs, networking, and also look at different types of computation, including cloud computing," Otero said. "[We also test] software that could be useful for compression or encryption."
Otero and his team decided to focus on ways to shorten how long analyses took to run, eventually turning to MemVerge's software.
TGen faced two problems in 2020, Otero said. First, a single analysis that investigated alternative RNA splicing, using data from a general RNA analysis to find gene expression differences, was taking months to finish. The analyses were prone to crashing due to a lack of memory, and each ran on a dedicated server, which meant the equipment was not free to run other analyses.
The RNA splicing analysis looked at all possible RNA for a gene and required a significant amount of memory to provide the necessary throughput. TGen's HPC environment consisted of 100 servers, more than 2,700 CPU cores and eight GPU cards, but the largest memory server at the time only had 750 GB of RAM, Otero said.
"That's one of the reasons the [analysis] was taking so long to run," he said.
HPC environments generally use checkpointing, which saves a snapshot of a job and the state of the system it is running on, so that a failed job can restart quickly from that point instead of from the beginning. TGen didn't use checkpoints at the time.
The second problem for TGen was the uptick in data, which required more computation. This is a growing problem for genomic research, specifically in single-cell RNA sequencing, or RNA-seq, analysis, which allows for the observation of gene expression at a single-cell level, according to Otero.
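The general idea of checkpointing can be shown with a minimal Python sketch. This is not MemVerge's software or any HPC scheduler's checkpoint feature; the function name `run_with_checkpoints`, the file path and the toy workload (summing integers) are all hypothetical, chosen only to show a job saving its state periodically and resuming from the last save after a crash.

```python
import os
import pickle

def run_with_checkpoints(total_steps, checkpoint_path, every=100, crash_at=None):
    """Sum the integers 0..total_steps-1, saving progress periodically.

    Hypothetical toy job: the running total stands in for analysis state.
    """
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "total": 0}

    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            # Write to a temporary file and then rename it, so a crash
            # mid-write never leaves a half-written checkpoint behind.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp_path, checkpoint_path)
    return state["total"]
```

If a run with `every=100` crashes at step 250, calling the function again with the same checkpoint path picks up at step 200 and repeats only 50 steps of work, not all 250.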
For the RNA-seq single cell analysis, "researchers were running applications in the eight-to-nine-hour range, and they were having to manually start and restart different permutations of the program as they tried different parameters for analysis," he said.
TGen had a parallel architecture in its HPC environment, but researchers couldn't take advantage of it. The code used for analysis was developed by the research community, outside of TGen, and didn't support parallel processing, Otero said.
Killing two birds with one snapshot
Intel and Dell Technologies collaborated on the HPC infrastructure that TGen now uses, according to Intel. Otero began investigating the infrastructure's HPC checkpoint feature to alleviate some of the issues researchers were experiencing. He consulted Intel, a TGen partner, but the vendor pointed him down a different path.
Intel suggested that TGen use its Optane storage class memory (SCM) for a larger memory footprint, along with MemVerge software to manage it, Otero said.
Having software manage the SCM was a boon, Otero said. "Otherwise, managing Optane, and its PMem, manually, is a really unwieldy beast," he said.
Optane PMem is Intel's persistent memory module that sits on the memory bus, in DIMM slots. At the time that TGen started using PMem, only technical documentation existed, Otero said. PMem comes in two modes of operation, Memory Mode and App Direct Mode, and correct configuration required several reboots and the installation of additional tools.
Installing MemVerge's Big Memory software went smoothly, taking about 10 minutes. TGen ran into a couple of bugs, mainly in updating and testing, but MemVerge engineers addressed the issues, Otero said.
MemVerge helped to write code that was incorporated into TGen's applications. Written in either R or Python, this code automated snapshot creation and replicated the snapshots for further analysis in parallel.
Looking to address both the long-running splicing analyses and the need to rerun tests for different potential outcomes, TGen turned to MemVerge and its ZeroIO snapshot technology, which uses persistent memory and acts as a cross between checkpointing and storage snapshots.
For data-rich analyses, MemVerge started taking snapshots and backing them up while the analysis was running, Otero said. If the analysis crashed, researchers could begin again from the last snapshot, cutting the analysis time from two to three months down to 13 days.
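As a rough sense of scale, the improvement can be worked out in a few lines of Python. The 30-day month is an assumption, since the article gives the original runtime only as "two to three months."

```python
DAYS_PER_MONTH = 30  # assumption: the article does not give exact day counts

def speedup(before_days, after_days):
    """Return how many times faster the new runtime is than the old one."""
    return before_days / after_days

low = speedup(2 * DAYS_PER_MONTH, 13)   # two months down to 13 days
high = speedup(3 * DAYS_PER_MONTH, 13)  # three months down to 13 days
print(f"{low:.1f}x to {high:.1f}x faster")  # prints "4.6x to 6.9x faster"
```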
The second problem -- rerunning the RNA-seq analysis with different parameters to compare the results -- was solved in a similar fashion. At the point where the parameters change, MemVerge enabled TGen to clone a snapshot taken at that point in the program's execution into four copies and run four different tests simultaneously. Each cloned snapshot occupies a different part of memory, so all four can run at the same time.
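The clone-and-branch pattern can be illustrated with a small Python sketch. It is an analogy only, not MemVerge's API: `shared_setup`, `finish_analysis` and `run_variants` are hypothetical names, `copy.deepcopy` stands in for cloning an in-memory snapshot, and threads are used for simplicity where a real CPU-bound analysis would run in separate processes or on separate nodes.

```python
import copy
from concurrent.futures import ThreadPoolExecutor

def shared_setup():
    """Stand-in for the expensive work common to every variant."""
    return {"values": list(range(10))}

def finish_analysis(state, scale):
    """Continue from a private copy of the state with one parameter choice."""
    state["values"] = [v * scale for v in state["values"]]
    return sum(state["values"])

def run_variants(snapshot, scales):
    """Run one analysis variant per parameter value, all at the same time."""
    # Each variant gets its own deep copy ("clone") of the snapshot,
    # so no variant can alter the state another one started from.
    clones = [copy.deepcopy(snapshot) for _ in scales]
    with ThreadPoolExecutor(max_workers=len(scales)) as pool:
        return list(pool.map(finish_analysis, clones, scales))

snapshot = shared_setup()  # run the common part once
results = run_variants(snapshot, [1, 2, 3, 4])  # four variants from one snapshot
```

The shared setup runs once, and the original snapshot is left untouched, so further variants can be launched from it later.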
"Those four separate analyses, that would normally run one after the other, could actually be run at the same time in parallel," he said. "That gave us a 35% speed-up in the runtime of the job."
While MemVerge was helping alleviate some of the major pain points, it wasn't solving all of them. The RNA-seq analyses run in containers, and while MemVerge could capture snapshots of the application running in a container, it could not capture a snapshot of the entire container. If larger memory pools were needed, the snapshot would have to extend to the entire container so that it could be moved.
MemVerge recently introduced a feature to address the need, which TGen will be testing soon. MemVerge added its ZeroIO snapshot technology to the Distributed MultiThreaded Checkpointing (DMTCP) technology a few months ago. Otero said the technology could help TGen move jobs around more easily, and even to the cloud to take advantage of bursting -- a benefit considering TGen is in the middle of a large cluster upgrade.
Saving time, not necessarily money
The time savings, particularly on the RNA-seq analysis, are critical for a research organization like TGen. Not only can TGen conduct more tests faster, but the savings also make it more competitive in a grant-based industry.
"If we could do one of these analyses, 10 million cells in the same amount of time that someone else can do 1 million cells, the chance of our getting grants is much higher," Otero said.