by Rini Pauly, Cameron Ogle, Cole Mcknight, David Reddick, Justin Presley, Susmit Shannigrahi, Alex Feltus
December 21, 2020
Advanced imaging and DNA sequencing technologies now enable the diverse biology community to routinely generate and analyze terabytes of high-resolution biological data. The community is rapidly heading towards the petascale in single-investigator laboratory settings. As evidence, the NCBI SRA[1] central DNA sequence repository alone contains over 45 petabytes of biological data. Given the geometric growth of this and other genomics repositories, an exabyte of mineable biological data is imminent. The challenges of effectively utilizing these datasets are enormous: they are not only large in size but also stored in geographically distributed repositories such as the National Center for Biotechnology Information (NCBI), the DNA Data Bank of Japan (DDBJ), the European Bioinformatics Institute (EBI), and NASA’s GeneLab. Named Data Networking (NDN)[2] has several useful properties, such as name-based data access, retrieval of data from anywhere, and in-network caching, that can address the data management and cyberinfrastructure challenges faced by the genomics community. This document outlines how we integrated NDN with a contemporary workflow, describes the methods used, and discusses the challenges encountered. This tech report is intended to serve as a starting point for other communities trying to integrate NDN into their workflows.