Manifest-based DRS import: A practical solution for cross-DCC dataset analysis to empower translational discovery using Kids First and GTEx data

posted on 2022-10-03, 17:04 authored by Surya Saha, Michele Mattioni, Milos Trboljevac, Eric Wenger, Allison Heath, Yuankun Zhu, Robert Carter, Michelle Giglio, Suvarna Nadendla, C. Titus Brown, Bailey K Farrow, Daniel J. B. Clarke, Adam Kraya, Kristin Ardlie, Jared Nedzel, Lan Nguyen, Avi Ma'ayan, Owen White, Jack DiGiovanna, Adam Resnick

A key challenge in data discovery is coordinating and assembling datasets from across Common Fund Data Ecosystem (CFDE) Data Coordination Centers (DCCs) in an easy-to-use and meaningful manner that accelerates their use by researchers. We have implemented a manifest-based import on our CAVATICA platform that lets a user create a cross-Common Fund dataset cohort and combine the results with their own data to accelerate platform-based discovery and clinical translation. We propose a manifest with one required field, drs_uri, followed by these optional fields: file_name, study_registration, study_id, participant_id, specimen_id, experimental_strategy, file_format and fhir_document_reference. The study_registration field names the external source of the study_id (e.g. dbGaP). The study_id, participant_id and specimen_id fields are unique identifiers that can be used to retrieve more information. The experimental_strategy and file_format fields are based on the Genomic Data Commons definitions. The fhir_document_reference field points to the FHIR Document Reference, if metadata is available on a FHIR server. This process provides an efficient method to import a list of DRS URIs along with relevant metadata. In this use case, a manifest is created from the Common Fund Data Ecosystem portal with GTEx and Kids First (KF) neuroblastoma RNA sequencing assays and brought into a collaborative CAVATICA workspace. Data authorization is managed by CAVATICA. For controlled-access KF and GTEx datasets, the user’s dbGaP access authorizations are checked, and the data becomes accessible only if the user has proper authorization. Authorized users can choose to run their own pipelines or use a KF standard pipeline to harmonize and analyze the combined dataset.
This use case demonstrates how a user can easily search for and generate a cohort across a federated DCC resource framework, then import it via DRS into a CAVATICA collaborative workspace for democratized access and translational knowledge mining.
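
As an illustration of the manifest fields described above, the following minimal Python sketch reads a manifest and checks it before import. Several details are assumptions rather than part of the proposal: the abstract does not name a file format, so CSV is assumed here; the function name read_manifest is hypothetical; unknown columns are rejected as a design choice of this sketch; and the example hosts and identifiers are placeholders. The only rules taken from the text are that drs_uri is required and the other eight fields are optional. The check that each URI begins with drs:// follows the GA4GH DRS URI scheme.

```python
import csv
import io

# Field names as listed in the abstract.
REQUIRED_FIELDS = ["drs_uri"]
OPTIONAL_FIELDS = [
    "file_name", "study_registration", "study_id", "participant_id",
    "specimen_id", "experimental_strategy", "file_format",
    "fhir_document_reference",
]
ALLOWED_FIELDS = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)


def read_manifest(text):
    """Parse CSV manifest text (assumed format) into a list of row dicts.

    Raises ValueError on a missing required column, an unknown column
    (an assumption of this sketch), or a drs_uri without the drs:// scheme.
    """
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [f for f in REQUIRED_FIELDS if f not in header]
    if missing:
        raise ValueError(f"missing required column(s): {missing}")
    unknown = [f for f in header if f not in ALLOWED_FIELDS]
    if unknown:
        raise ValueError(f"unknown column(s): {unknown}")
    rows = []
    # Data starts on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        uri = (row["drs_uri"] or "").strip()
        if not uri.startswith("drs://"):
            raise ValueError(f"line {line_no}: invalid drs_uri {uri!r}")
        row["drs_uri"] = uri
        rows.append(row)
    return rows


# Hypothetical example rows; hosts and identifiers are placeholders.
EXAMPLE = """drs_uri,file_name,study_registration,study_id,experimental_strategy,file_format
drs://example.org/abc123,sample1.bam,dbGaP,phs000000,RNA-Seq,bam
drs://example.org/def456,sample2.bam,dbGaP,phs000001,RNA-Seq,bam
"""

manifest_rows = read_manifest(EXAMPLE)
```

Validating the header before any rows means a malformed manifest fails early, before any DRS URI is resolved. Authorization checks, which the abstract assigns to CAVATICA, are deliberately outside the scope of this sketch.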


