
STEAL

software
posted on 2024-03-25, 11:17 authored by Paolo Mignone
<p dir="ltr">Authors: Paolo Mignone, Gianvito Pio, Michelangelo Ceci.</p><p dir="ltr">Paper title: Distributed Heterogeneous Transfer Learning. Big Data Research. DOI: 10.1016/j.bdr.2024.100456</p><p dir="ltr">Two datasets from this study have been made public.</p><h2><br><b>Human and mouse gene regulatory networks</b></h2><p dir="ltr">Link: <a href="https://data.d4science.net/xQ7P" target="_blank">https://data.d4science.net/xQ7P</a></p><p dir="ltr">The dataset was built from gene expression data for 6 different organs (liver, lung, brain, skin, bone marrow, heart), obtained from control samples available at Gene Expression Omnibus (GEO) (www.ncbi.nlm.nih.gov/geo/). Overall, 161 raw samples were considered for mouse and 174 for human (see the "heterogeneous" folder). All the samples were processed according to the workflow adopted for the DREAM5 challenge (Marbach et al., 2012, DOI: 10.1038/nmeth.2016), which led to a dataset of 5404 mouse genes and a dataset of 15345 human genes. Each gene was also associated with six features (one for each organ), computed by averaging the expression levels measured within the same organ (see the "homogeneous" folder). Finally, the dataset of interactions among genes was built by considering all the possible pairs of genes (excluding self-links), each associated with the concatenation of the feature vectors of the genes involved in the interaction. We extracted 235706 validated human gene interactions and 14613 validated mouse gene interactions from BioGRID (available at <a href="https://thebiogrid.org/" rel="nofollow" target="_blank">https://thebiogrid.org</a>). As regards the unlabeled examples of interactions, we randomly selected, without replacement, a balanced number (i.e. equal to the number of labeled examples) of interactions involving at least one gene that appears in the set of labeled interactions.</p><p><br></p><h2><b>Stroke and sepsis</b></h2><p dir="ltr">Link: <a href="https://data.d4science.net/eEn3" target="_blank">https://data.d4science.net/eEn3</a></p><p dir="ltr">The considered stroke dataset (DOI:10.17632/x8ygrw87jw.1, DOI:10.1016/j.artmed.2019.101723) was pre-processed by removing attributes with more than 30% missing values and by converting categorical descriptive variables into numerical ones through one-hot encoding. After these steps, the dataset is described through 1 id (the first column), 13 descriptive variables, and the target variable (the last column). The dataset contains instances representing patients who did not have a recurrent stroke (the majority class, last_column=0), and instances representing patients who had a recurrent stroke (the minority class, last_column=1). The sepsis dataset (DOI:10.1038/s41598-020-73558-3) is described through 1 id (the first column), 3 descriptive variables, and the target variable (the last column). The task is to predict whether the patient survived sepsis. The dataset contains patients who died because of sepsis (last_column=1) and patients who survived (last_column=0). The target variable of the two datasets was aligned so that the label that indicates a recurrent stroke in the cerebral stroke dataset corresponds to the label indicating people who did not survive in the sepsis dataset. Finally, a reduced version was built by: i) imposing a balanced class distribution (obtained by downsampling the majority class); ii) further reducing the source domain dataset through a 10% stratified random sampling, so that competitor methods can be run on it for a comparative analysis.</p><p><br></p>
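<p dir="ltr">The two reduction steps described for the stroke and sepsis data (class balancing by downsampling, then a 10% stratified random sample) can be sketched as follows. This is only a minimal illustration using the Python standard library, not the authors' code: the function names, the toy rows, and the fixed random seed are all assumptions, and the only detail taken from the description is that the label sits in the last column.</p>

```python
import random
from collections import defaultdict


def group_by_label(rows, label_of):
    """Group rows by class label."""
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    return groups


def downsample_majority(rows, label_of, rng):
    """Balance classes by sampling each class down to the minority size."""
    groups = group_by_label(rows, label_of)
    n_min = min(len(group) for group in groups.values())
    balanced = []
    for label in sorted(groups):
        # Sampling without replacement; the minority class is kept whole.
        balanced.extend(rng.sample(groups[label], n_min))
    return balanced


def stratified_sample(rows, label_of, fraction, rng):
    """Draw the same fraction from every class (without replacement)."""
    groups = group_by_label(rows, label_of)
    sampled = []
    for label in sorted(groups):
        group = groups[label]
        k = max(1, round(len(group) * fraction))
        sampled.extend(rng.sample(group, k))
    return sampled


# Hypothetical toy data: (id, feature, label), label in the last column.
# 100 minority rows (label 1) and 900 majority rows (label 0).
toy_rows = [(i, float(i % 7), 1 if i < 100 else 0) for i in range(1000)]


def label_of(row):
    return row[-1]


rng = random.Random(0)  # fixed seed (an assumption) for repeatability
balanced = downsample_majority(toy_rows, label_of, rng)
reduced = stratified_sample(balanced, label_of, 0.10, rng)
```

<p dir="ltr">On the toy data, the balanced set keeps 100 rows per class and the reduced set keeps 10 per class, so the class distribution stays balanced after the 10% sampling.</p>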
