<p dir="ltr">Authors: Paolo Mignone, Gianvito Pio, Michelangelo Ceci.</p><p dir="ltr">Paper title: Distributed Heterogeneous Transfer Learning. Big Data Research. DOI: 10.1016/j.bdr.2024.100456</p><p dir="ltr">Two datasets from this study have been made public.</p><h2><br><b>Human and mouse gene regulatory networks</b></h2><p dir="ltr">Link: <a href="https://data.d4science.net/xQ7P" target="_blank">https://data.d4science.net/xQ7P</a></p><p dir="ltr">The dataset was built from gene expression data for 6 different organs (liver, lung, brain, skin, bone marrow, heart), obtained from control samples available at Gene Expression Omnibus (GEO) (www.ncbi.nlm.nih.gov/geo/). Overall, 161 and 174 raw samples were considered for the mouse and human organisms, respectively (see the "heterogeneous" folder). All the samples were processed according to the workflow adopted for the DREAM5 challenge (Marbach et al., 2012, DOI: 10.1038/nmeth.2016), which led to a dataset of 5404 mouse genes and a dataset of 15345 human genes. Each gene was also associated with six features (one for each organ), obtained by averaging the expression levels measured within the same organ (see the "homogeneous" folder). Finally, the dataset of interactions among genes was built by considering all possible pairs of genes (excluding self-links), each associated with the concatenation of the feature vectors of the two genes involved in the interaction. We extracted 235706 validated human gene interactions and 14613 validated mouse gene interactions from BioGRID (available at <a href="https://thebiogrid.org/" rel="nofollow" target="_blank">https://thebiogrid.org</a>). For the unlabeled examples of interactions, we randomly selected, without replacement, a balanced number (i.e., 
equal to the number of labeled examples) of interactions involving at least one gene that appears in the set of labeled interactions.</p><p><br></p><h2><b>Stroke and sepsis</b></h2><p dir="ltr">Link: <a href="https://data.d4science.net/eEn3" target="_blank">https://data.d4science.net/eEn3</a></p><p dir="ltr">The considered stroke dataset (DOI:10.17632/x8ygrw87jw.1, DOI:10.1016/j.artmed.2019.101723) was pre-processed by removing attributes with more than 30% missing values and by converting categorical descriptive variables into numerical ones through one-hot encoding. After these steps, the dataset is described through 1 id (the first column), 13 descriptive variables, and the target variable (the last column). The dataset contains instances representing patients who did not have a recurrent stroke (the majority class, last_column=0), and instances representing patients who had a recurrent stroke (the minority class, last_column=1). The sepsis dataset (DOI:10.1038/s41598-020-73558-3) is described through 1 id (the first column), 3 descriptive variables, and the target variable (the last column). The task is to predict whether the patient survived sepsis. The dataset consists of patients who died because of sepsis (last_column=1) and patients who survived (last_column=0). The target variable of the two datasets was aligned so that the label indicating a recurrent stroke in the cerebral stroke dataset corresponds to the label indicating patients who did not survive in the sepsis dataset. Finally, a reduced version was built by: i) imposing a balanced class distribution (obtained by downsampling the majority class); ii) further reducing the source domain dataset through a 10% stratified random sampling, in order to make it easier to run the competitor methods for the comparative analysis.</p><p><br></p>
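<p dir="ltr">The two reduction steps described above (balancing by downsampling the majority class, then a 10% stratified random sample) can be sketched as follows. This is only an illustrative sketch on toy data, not the authors' code: the function names, the random seed, the toy rows, and the choice to apply step ii) to the already balanced data are all assumptions. The only detail taken from the description is that the target variable is the last column.</p>

```python
import random
from collections import defaultdict


def group_by_label(rows):
    """Group rows by their class label, assumed to be the last column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[-1]].append(row)
    return groups


def balance_by_downsampling(rows, rng):
    """Downsample every class to the size of the smallest class."""
    groups = group_by_label(rows)
    n_min = min(len(group) for group in groups.values())
    balanced = []
    for label in sorted(groups):
        balanced.extend(rng.sample(groups[label], n_min))
    return balanced


def stratified_sample(rows, fraction, rng):
    """Keep the same fraction of rows from each class (at least one row)."""
    groups = group_by_label(rows)
    sampled = []
    for label in sorted(groups):
        group = groups[label]
        k = max(1, round(len(group) * fraction))
        sampled.extend(rng.sample(group, k))
    return sampled


# Hypothetical toy table: (id, feature, label) with 900 majority rows
# (label 0) and 100 minority rows (label 1). Not the real data.
rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
toy_rows = [(i, rng.random(), 0) for i in range(900)]
toy_rows += [(i, rng.random(), 1) for i in range(900, 1000)]

balanced_rows = balance_by_downsampling(toy_rows, rng)      # step i)
reduced_rows = stratified_sample(balanced_rows, 0.10, rng)  # step ii)
```

<p dir="ltr">On this toy table, step i) keeps 100 rows per class and step ii) keeps 10 rows per class. Sampling within each class separately is what preserves the class distribution in the reduced set.</p>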