Indices for NGS Data and Gene Expression Data Registered in Public Databases
As part of the integrated database project in Japan, the Database Center for Life Science (DBCLS) has developed various computational tools for reusing the huge amount of data archived in public repositories (https://dbcls.rois.ac.jp/services-en.html).
Meanwhile, the DNA Data Bank of Japan (DDBJ) has archived and maintained data from high-throughput sequencing platforms as a member of the International Nucleotide Sequence Database Collaboration (INSDC), together with NCBI GenBank and EBI ENA (http://www.insdc.org/). In collaboration with DDBJ, we built DBCLS SRA (http://sra.dbcls.jp/), a search engine for the metadata of the INSDC databases BioProject, BioSample and Sequence Read Archive (SRA). DBCLS SRA is now linked from the DDBJ website, and DDBJ plans to adopt it officially.
With the spread of high-throughput sequencing platforms, tens of thousands of RNA-seq datasets have been archived as transcriptome data in SRA. On the other hand, microarray data still make up the majority of the data in the public gene expression databases NCBI Gene Expression Omnibus (GEO) and EBI ArrayExpress (AE). Furthermore, DDBJ launched another gene expression data repository, the Genomic Expression Archive (GEA), in 2018. Because the relationships among these databases are complex, it is not easy to make new discoveries by comparing transcriptomic datasets across them. We therefore constructed an index for these gene expression data repositories, called All of gene expression (AOE), to integrate the publicly available transcriptomic data in GEO, AE and GEA. The web interface of AOE (https://aoe.dbcls.jp/) supports graphical queries, and an application programming interface is also available. Because AOE also collects RNA-seq gene expression data from SRA, it includes data that are not registered in GEO, AE or GEA.
Both DBCLS SRA and AOE are freely available and require no registration.