
Encoding of sequence information.

journal contribution
posted on 2024-02-15, 18:26 authored by Julius Ramakers, Christopher Frederik Blum, Sabrina König, Stefan Harmeling, Markus Kollmann

RNA sequences of length L were encoded as unique bit patterns of shape (L × L × 8). (A) First, every nucleotide in an RNA sequence was one-hot encoded: G: (1, 0, 0, 0), C: (0, 1, 0, 0), A: (0, 0, 1, 0), U: (0, 0, 0, 1). Unknown nucleotides, denoted by “N” in the sequence, were encoded by setting all four values to 0.25, i.e. N: (0.25, 0.25, 0.25, 0.25). For the full sequence, these encodings yielded a tensor of shape (L × 4). This encoded sequence was then copied L times (a1) to obtain a tensor of shape (L × L × 4). Then, this tensor and its transpose (a2) were stacked along the last dimension to obtain a tensor of shape (L × L × 8). (B) A sample sequence tensor, which contained a unique bit pattern for each possible pairing as well as directional information. For sequences with L < 100, the sequence tensor was uniformly padded with −1. Red inset: example bit pattern of the tensor at the first three pixels, each with depth 8.
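
The encoding steps in the caption can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code. The function name `encode_sequence`, the parameters `pad_to` and `pad_value`, the choice of which axis holds the copies, reading "transpose" as a swap of the first two axes, and padding at the end of both sequence axes are all assumptions made for this example.

```python
import numpy as np

# One-hot codes as given in the caption; "N" marks an unknown nucleotide.
ONE_HOT = {
    "G": (1.0, 0.0, 0.0, 0.0),
    "C": (0.0, 1.0, 0.0, 0.0),
    "A": (0.0, 0.0, 1.0, 0.0),
    "U": (0.0, 0.0, 0.0, 1.0),
    "N": (0.25, 0.25, 0.25, 0.25),
}


def encode_sequence(seq, pad_to=None, pad_value=-1.0):
    """Encode an RNA sequence of length L as an (L, L, 8) tensor.

    Hypothetical helper: names and the padding layout are assumptions.
    """
    # Step 1: one-hot encode every nucleotide -> shape (L, 4).
    one_hot = np.array(
        [ONE_HOT[nt] for nt in seq.upper()], dtype=np.float32
    ).reshape(-1, 4)
    length = one_hot.shape[0]

    # Step 2 (a1): copy the encoded sequence L times -> shape (L, L, 4).
    # Here tiled[i, j] holds the code of nucleotide j.
    tiled = np.broadcast_to(one_hot, (length, length, 4))

    # Step 3 (a2): swap the two sequence axes, so that
    # transposed[i, j] holds the code of nucleotide i.
    transposed = tiled.transpose(1, 0, 2)

    # Step 4: stack both along the last dimension -> shape (L, L, 8).
    tensor = np.concatenate([tiled, transposed], axis=-1)

    # Optional: pad both sequence axes with a constant (assumed layout:
    # padding appended at the end of each axis).
    if pad_to is not None and length < pad_to:
        extra = pad_to - length
        tensor = np.pad(
            tensor,
            ((0, extra), (0, extra), (0, 0)),
            constant_values=pad_value,
        )
    return tensor


example = encode_sequence("GCAU")
padded = encode_sequence("GCAU", pad_to=100)
```

In this sketch, entry (i, j) concatenates the codes of nucleotides j and i, so entries (i, j) and (j, i) differ whenever the two nucleotides differ. That is one way to read the "directional information" mentioned in the caption.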

(PDF)
