When scale-up wins over scale-out computing

william-bout-264826.jpg

Graph databases

deep learning, metagenomics, ...

High Performance Computing has been the bread and butter of scientific computing. Cloud computing has enabled massive scale distributed computing but there are still some applications that benefit from scale-up (aka supercomputer i.e. a single computer with large number of cores or RAM) computing. For example, researchers from Oklahoma State University completed the largest metagenomics assembly to date by sequencing data from a soil metagenome that required 4TB of memory. In another example, the NVidia Pascal GPUs (P100) show deep learning acceleration in recent benchmarks

A biomedical application that is nascent but shows promise for scale-up computing paradigm is use of graph databases for modern biomedical data mining. In complex multi-modal biology (e.g. omics, wearable, imaging, ...), the relationships between datasets are hard to characterize using relational databases. The appropriate paradigm for storing and mining these datasets is a graph database. Graph analytics offers capability to search and identify different characteristics of a graph dataset: nodes connected to each other, communities containing nodes, the most influential nodes, chokepoints in a dataset, and nodes similar to each other. New implementations in industry has shown that using graph algorithms can solve real-world problems such as detecting cyberattacks, creating value from internet of things sensor data, analyze the spread of epidemics (Ebola), and precisely identifying drug interactions faster than ever before. An open source tool, Bio4j, is a graph database framework for protein related information querying and management that integrates most data available in Uniprot KB (SwissProt + Trembl), Gene Ontology (GO), UniRef (50,90,100), NCBI Taxonomy, and Expasy Enzyme DB. NeuroArch is a graph database framework for querying and executing fruit fly brain circuits. Researchers are increasingly looking towards graph database when current data models and schemas will not support research queries and study has a lots of new and disparate data sources that are inherently unstructured.

Scale-up (aka supercomputing) architectures tend to be expensive. For academic researchers, access to these supercomputers are available at large academic centers or supercomputing centers. Here are some of the recent supercomputer installations in news:

  1. May 2017: Department of Genetics at Stanford University has acquired its first supercomputer, a SGI (now part of HPE) UV300 unit, via an NIH S10 Shared Instrumentation Grant. This is a newer and badder version of TGAC system. It has 360 cores, 10 terabytes of RAM, 20 terabytes of flash memory (essentially SSDs with NVMe* storage technology), 4 NVidia Pascal GPUs (P100s are especially suited to deep learning), and 150+ terabytes of local scratch storage. (More)
  2. Aug 2016: Pittsburgh Supercomputing Center (PSC) funded by NSF has two HPE Integrity Superdome Xs, each with 16 CPUs (22 cores per CPU totalling 352 cores), 12TB RAM, and 64TB on-node storage (More)
  3. May 2016: The Genome Analysis Centre (TGAC) has recently procured a set of SGI UV300 supercomputers. TGAC is a UK hub for innovative Bioinformatics and hosts one of the largest computing hardware facilities dedicated to life science research in Europe. Their new TGAC platform comprises two SGI UV 300 systems totalling 24 terabytes (12 terabytes each) of RAM, 512 cores and 64TB NVMe storage. (More)