SARS-CoV-2 was not known to science before the start of the global COVID-19 pandemic and was only sequenced for the first time in January 2020 about a month after its discovery. Since then about 100 organizations worldwide have contributed genomic data to the study of the virus that causes COVID-19, based on infrastructure developed during past efforts to sequence HIV, Ebola, Zika, influenza, Hepatitis C, and other viruses. The Galaxy project is one of the world's largest bioinformatics "gateways", supporting a community of more than 30,000 researchers, and is an important resource for analysis of the structure and stability of the SARS-CoV-2 genome. Galaxy users have access to a wide variety of computers, including TACC's Stampede2 and Jetstream supercomputers for large-scale computations, and the Bridges platform at the Pittsburgh Supercomputing Center (PSC) for genome assembly jobs that require large amounts of shared memory.
Temple University's Sergei Pond has developed a widely used set of tools called HyPhy, designed specifically for selection analysis in infectious diseases. With Galaxy and HyPhy working together, researchers can perform robust, reproducible analysis of SARS-CoV-2 genomic sequences. Because SARS-CoV-2 genome data from outbreaks around the world become available so quickly, researchers can use these tools while the pandemic is still unfolding to evaluate the degree to which the virus is, or is not, mutating over time. Thus far these analyses reveal that the SARS-CoV-2 virus is changing more slowly than influenza because of an enzyme that performs error checking and correction during RNA synthesis and replication. This is important because less stable viruses make it difficult to develop effective vaccines.
Another science gateway being used for COVID-19 research is the I-TASSER gateway for automated protein structure and function prediction, hosted at the San Diego Supercomputer Center (SDSC). Yang Zhang, a professor of computational medicine and bioinformatics at the University of Michigan, has been using the gateway to analyze sequences of the SARS-CoV-2 genome and compare them with coronaviruses in other species. Thus far these results suggest that pangolins, along with bats, may have played a role in the introduction of the virus to humans.
Discovering the Mechanisms of Infection
The first step in treating and potentially preventing COVID-19 infections is understanding how the virus infects its host cells. In general, scientists understand that, once inside an organism, the SARS-CoV-2 virus builds an extendable apparatus from core helical amino acids in its spike protein that latches on to a target host cell, leading to infection. If scientists can refine this general understanding into a complete picture of how the spike protein extends and then binds to its host cell, it may be possible to use the details of the process to disrupt the extension movement of the receptor-binding domain on the spike, preventing the virus from entering the cell and establishing an infection in the first place.
Molecular dynamics simulations play an important role in understanding this behavior, but conventional methods are limited to timescales that are too short to capture the process in detail. Rommie Amaro's lab at the University of California, San Diego, is helping to accelerate the development of new treatments by using the weighted ensemble enhanced sampling method on Frontera [cite marker paper] to reach biologically relevant timescales for the spike protein. As of this writing, TACC's Frontera is the 8th largest supercomputer in the world [cite https://top500.org/lists/top500/list/2020/06/], and its relevance to efforts to fight the pandemic indicates both the difficulty of the scientific challenge and the value of investments in leadership-class supercomputers.
Amaro's simulations have revealed important features of the virus, including the role that glycans play in camouflaging the virus from the immune system and the way the spike protein changes shape to help the virus bind with the ACE2 receptor on human cells.
Mahmoud Moradi from the University of Arkansas is using Frontera for simulations that study how the spike extension apparatus works, beginning with the observation that both SARS-CoV-2 and SARS-CoV (the cause of the 2002-2003 SARS epidemic) have spike proteins. Moradi's work uses experimentally determined high-resolution 3-D structures of spike proteins as the initial structures in simulations to determine the features of both proteins and to investigate how the behavior of the two viruses differs. Thus far, the group has been able to observe significant differences in the dynamics of the binding mechanisms of the two viruses.
These kinds of numerical simulations are difficult and time-consuming, and they reaffirm the unique value of leadership-class supercomputers like Frontera. In the case of Baylor College of Medicine's Numan Oezguen, molecular dynamics simulations like these require 50 days of processing time to simulate one microsecond of viral action.
Disrupting the Ability of the Virus to Copy Itself
Once the virus binds to a host cell it hijacks that cell's replication machinery to make new copies of itself, furthering the infection in the target organism. If this process can be disrupted, it is possible that the duration and severity of infections can be reduced.
Scientists know from previous studies that the antiviral drug remdesivir interrupts the chemical processes the virus uses to copy itself by binding to enzymes responsible for the final assembly of copies. A team of scientists led by Andres Cisneros of the University of North Texas is using the Stampede2 and Frontera supercomputers at TACC to model the mechanisms that SARS-CoV-2 uses to copy itself in hopes of improving the effectiveness of antiviral treatments for COVID-19. His work investigates how remdesivir and other available drugs inhibit the proteins NSP-12 and the main protease, both enzymes the coronavirus needs for replication. NSP-12 puts together the nucleotides that make up viral RNA, building complete sets of genetic material for new coronavirus copies. NSP-12 is part of a larger structure called the RNA-dependent RNA polymerase (RDRP) that copies the complete RNA. Remdesivir binds with RDRP, interrupting the mechanism. The other protein Cisneros is studying is the main protease, which separates a polyprotein produced by SARS-CoV-2 into functional proteins used to build the virus.
The key chemical reactions are simulated using a hybrid method called QM/MM (quantum mechanics/molecular mechanics), which dramatically reduces the time to solution by reserving the expensive quantum mechanical treatment for interactions at the active site and using the more approximate classical molecular mechanics for everything else.
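As a rough illustration of the idea, a common "additive" formulation of QM/MM splits the total energy into three terms. This is the generic textbook form, not necessarily the exact scheme used in the work described here, and the region labels are illustrative assumptions:

```latex
E_{\text{total}} = E_{\text{QM}}(\text{active site})
                 + E_{\text{MM}}(\text{environment})
                 + E_{\text{QM/MM}}(\text{active site} \leftrightarrow \text{environment})
```

Only the first term and the coupling term involve quantum mechanical calculations, so the cost is dominated by the small active-site region rather than by the full enzyme and solvent.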
Drug Discovery
Once SARS-CoV-2 has gained a foothold in an organism and is effectively copying itself to increase viral load, it is time for pharmaceutical intervention. Thomas Cheatham, a professor of medicinal chemistry and director of the Center for High Performance Computing at the University of Utah, is using Longhorn, an IBM/NVIDIA system at TACC, to generate molecular models of compounds relevant for treatment of COVID-19. Once identified, the most promising candidates can then be tested in the lab in collaboration with medicinal chemists. Cheatham is using an approach he developed in 2015 to identify molecules for treatment of Ebola. The workflow, based on known crystal structures, uses software to select promising amino acid chains on a fixed peptide backbone template and then performs molecular dynamics simulations with Amber to optimize the structures, which are then ranked based on free energy estimates. For COVID-19, the researchers are investigating the crystal structure of the SARS-CoV-2 main protease (an enzyme that breaks down proteins and peptides) in the presence of peptide inhibitors. These simulations will then serve as the basis for laboratory-based protease assays, which will test the efficacy of the most promising inhibitors from the simulations.
Understanding the Spread of the Virus at Large and Small Scale
While the scientific community grapples with the urgent and difficult challenge of understanding the biology of the virus, its pathways of infection, and ways to effectively treat and prevent COVID-19, the medical community deals with the devastating effects of the disease in patients on a day-to-day basis. Because the virus is new, effective approaches to manage and treat patients have been developed in real time through trial and error. One area in which supercomputing is helping to fill knowledge gaps is the understanding of transmission pathways in hospitals and other indoor areas. Som Dutta from Utah State University leads a computational fluid dynamics project on Frontera to study how virus-laden droplet clouds are transported and mixed within the indoor environment. These simulations may help scientists develop procedures and guidelines to reduce the droplet-based viral loading in a room, making it safer for health care professionals in contact with COVID-19 patients and for other patients in the same facility. Dutta's work uses high-fidelity multiphase large-eddy simulations (LES) to determine the dynamics of the droplet cloud, to understand how long the pathogen cloud persists, and to find where particles settle in an idealized hospital environment.
On a much larger scale, supercomputer models of virus transmission are an important tool for decision-making by local, state, and national leaders. UT Austin epidemiologist Lauren Ancel Meyers leads the UT Austin COVID-19 Modeling Consortium, whose model is driven by anonymized mobile-phone data and by case count and hospitalization data from Johns Hopkins University. The consortium takes an ensemble approach, combining results from two models to arrive at predictions. The first model fits a regression curve for daily death rates versus mobility data, and then makes extrapolations from that regression. This is not an epidemiological model, as it does not make any attempt to describe the process of disease transmission. To account for this potential shortcoming, an "SEIR" epidemiological model is used as the second partner in the ensemble. The letters stand for the categories of patients tracked in the simulation: Susceptible (S), Exposed (E), Infected (I), and Recovered (R); the model also tracks Dead (D) patients. The key output of this model is each state's transition rate between S and E. In making a prediction for a state, both models are fit to the state's data, and the final prediction is a weighted combination of the two models. Ancel Meyers's model, and others like it, have proved invaluable tools for public health officials and policymakers struggling to contain the spread of the virus and to ensure the availability of critical care facilities to support patients experiencing the most devastating effects of the disease.
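To make the SEIR idea and the weighted ensemble concrete, the following is a minimal Python sketch. It is not the consortium's model: the simple daily update rule, the parameter names (beta, sigma, gamma, mu), the example values, and the helper names are all illustrative assumptions. Here beta plays the role of the transition rate between S and E.

```python
from dataclasses import dataclass


@dataclass
class SEIRDState:
    """Population counts in each compartment (hypothetical container)."""
    S: float  # Susceptible
    E: float  # Exposed
    I: float  # Infected
    R: float  # Recovered
    D: float  # Dead


def seird_step(state, beta, sigma, gamma, mu, dt=1.0):
    """Advance the compartments by one time step of length dt (in days).

    beta  : S -> E transmission rate (assumed constant here)
    sigma : E -> I rate
    gamma : I -> R recovery rate
    mu    : I -> D death rate
    """
    living = state.S + state.E + state.I + state.R
    new_exposed = beta * state.S * state.I / living * dt
    new_infected = sigma * state.E * dt
    new_recovered = gamma * state.I * dt
    new_dead = mu * state.I * dt
    return SEIRDState(
        S=state.S - new_exposed,
        E=state.E + new_exposed - new_infected,
        I=state.I + new_infected - new_recovered - new_dead,
        R=state.R + new_recovered,
        D=state.D + new_dead,
    )


def simulate(initial, beta, sigma, gamma, mu, days):
    """Return the list of states for day 0 through day `days`."""
    states = [initial]
    for _ in range(days):
        states.append(seird_step(states[-1], beta, sigma, gamma, mu))
    return states


def daily_deaths(states):
    """New deaths on each simulated day."""
    return [later.D - earlier.D for earlier, later in zip(states, states[1:])]


def ensemble(pred_a, pred_b, weight_a):
    """Weighted combination of two prediction series of equal length."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]


if __name__ == "__main__":
    start = SEIRDState(S=990.0, E=10.0, I=0.0, R=0.0, D=0.0)
    states = simulate(start, beta=0.5, sigma=0.2, gamma=0.1, mu=0.01, days=30)
    seir_deaths = daily_deaths(states)
    # Stand-in for the regression model's forecast (made-up flat values).
    regression_deaths = [0.5] * len(seir_deaths)
    combined = ensemble(regression_deaths, seir_deaths, weight_a=0.5)
    print(combined[-1])
```

In practice the parameters would be fit to each state's data rather than fixed by hand, and the ensemble weight would be chosen by how well each model has predicted recent observations; both steps are omitted from this sketch.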