The hetnet awakens: understanding complex diseases through data integration and open science

2017-03-04T22:42:12Z (GMT) by Daniel Himmelstein
The PhD Dissertation of Daniel S. Himmelstein.

This is my thesis from my PhD in Biological & Medical Informatics from the University of California, San Francisco.

The versionless DOI for this record is 10.6084/m9.figshare.4724797. The corresponding shortened URL is


dhimmel-thesis-figshare.pdf — PDF version of the dissertation produced specially for figshare. This document is identical to the ProQuest version, except that the ProQuest copyright and UCSF library release pages have been removed. This version also has additional PDF metadata, including a document outline.

The LaTeX source of the thesis, as downloaded from ShareLaTeX. The PDF output from compiling this source was used to create dhimmel-thesis-figshare.pdf. Note that dhimmel-thesis-figshare.pdf contains manually modified metadata and the official UCSF cover page.

dhimmel-ucsf-diploma.pdf — PDF photograph of my diploma from the Regents of the University of California.

Title: The hetnet awakens: understanding complex diseases through data integration and open science

Dates: I submitted my dissertation on June 2, 2016. However, my official graduation date was June 10, 2016.


Human disease is complex. However, the explosion of biomedical data is providing new opportunities to improve our understanding. My dissertation focused on how to harness the biodata revolution. Broadly, I addressed three questions: how to integrate data, how to extract insights from data, and how to make science more open.

To integrate data, we pioneered the hetnet—a network with multiple node and relationship types. After several preludes, we released Hetionet v1.0, which contains 2,250,197 relationships of 24 types. Hetionet encodes the collective knowledge produced by millions of studies over the last half century.
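As a rough illustration of the definition above, a hetnet can be represented as typed nodes plus typed relationships. The sketch below is a minimal Python example under stated assumptions: the node identifiers, the type names, and the helper `count_by_type` are hypothetical toy choices, not data or code from Hetionet.

```python
from collections import Counter

# Illustrative toy data, not taken from Hetionet.
# Each node has an identifier and a node type.
node_types = {
    "compound_a": "Compound",
    "gene_x": "Gene",
    "gene_y": "Gene",
    "disease_d": "Disease",
}

# Each relationship is (source, relationship type, target).
edges = [
    ("compound_a", "binds", "gene_x"),
    ("compound_a", "binds", "gene_y"),
    ("disease_d", "associates", "gene_x"),
    ("compound_a", "treats", "disease_d"),
]


def count_by_type(node_types, edges):
    """Return (nodes per node type, relationships per relationship type)."""
    node_counts = Counter(node_types.values())
    edge_counts = Counter(kind for _, kind, _ in edges)
    return node_counts, edge_counts


node_counts, edge_counts = count_by_type(node_types, edges)
```

Summaries such as "2,250,197 relationships of 24 types" correspond to the total and the number of distinct keys in a tally like `edge_counts`.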

To extract insights from data, we developed a machine learning approach for hetnets. To predict the probability that an unknown relationship exists, our algorithm identifies influential network patterns. We used the approach to prioritize disease–gene associations and drug repurposing opportunities. By evaluating our predictions on withheld knowledge, we demonstrated the systematic success of our method.
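One simple kind of network pattern is a path that follows a fixed sequence of relationship types between two nodes. The sketch below counts such paths in a toy hetnet. It is a simplified illustration only: the summary above does not specify the dissertation's features or model, and the names `build_adjacency`, `count_paths`, and the toy nodes are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Index undirected edges as adjacency[(node, relationship type)] -> neighbors."""
    adjacency = defaultdict(set)
    for source, kind, target in edges:
        adjacency[(source, kind)].add(target)
        adjacency[(target, kind)].add(source)
    return adjacency


def count_paths(adjacency, start, end, pattern):
    """Count paths from start to end whose relationship types follow pattern.

    Paths never revisit a node.
    """
    def walk(node, step, visited):
        if step == len(pattern):
            return 1 if node == end else 0
        total = 0
        for neighbor in adjacency.get((node, pattern[step]), ()):
            if neighbor not in visited:
                total += walk(neighbor, step + 1, visited | {neighbor})
        return total

    return walk(start, 0, {start})


# Illustrative toy data, not taken from Hetionet.
toy_edges = [
    ("compound_a", "binds", "gene_x"),
    ("compound_a", "binds", "gene_y"),
    ("disease_d", "associates", "gene_x"),
    ("disease_d", "associates", "gene_y"),
    ("disease_e", "associates", "gene_y"),
]
toy_adjacency = build_adjacency(toy_edges)

# Paths of the form compound -binds- gene -associates- disease.
paths_to_d = count_paths(toy_adjacency, "compound_a", "disease_d", ["binds", "associates"])
paths_to_e = count_paths(toy_adjacency, "compound_a", "disease_e", ["binds", "associates"])
```

Counts like these could serve as input features for a classifier that scores candidate relationships. Which patterns are influential would then be learned from known relationships rather than chosen by hand.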

After encountering friction that interfered with data integration and rapid communication, I began looking at how to make science more open. The quest led me to explore real-time open notebook science and to expose publishing delays at journals as well as the problematic licensing of publicly funded research data.

Thesis Committee

Sergio E. Baranzini (chair & advisor)
John S. Witte
Andrej Sali

ProQuest Information:

Dissertation/thesis number: 10133408
ProQuest document ID: 1801982909
ISBN: 9781339919881
ISBN: 1339919885
OCLC Number: 970819555