Fixed and variable length generic string graphs for biological sequence defined polymers and human textual tokens, as a method for general dimensionality reduction and compression.

ORAL

Abstract

We present a framework for constructing generic string graphs with fixed- and variable-length nodes, applicable both to biological polymers (DNA, RNA, and amino acid sequences, with reverse-complement awareness for nucleic acids) and to raw ASCII/Unicode text. We also use Google's SentencePiece tokenizer to train tokenization models, including Byte Pair Encoding (BPE) and unigram models, on diverse raw text inputs. We demonstrate implementations and algorithms in several settings: biological sequences (DNA/RNA and amino acids) and human languages, using the official UN transcription database in six languages (English, Spanish, French, Russian, Chinese, and Arabic). We examine how the graph types relate to one another and the accompanying compression ratios achieved through graph embeddings. We aim to show that these graphs support search functions that can correct transcription errors in biological sequencing and offer generative modeling capabilities for synthesizing novel, biologically plausible sequences. We also use Shannon entropy and fuzzy matching to better understand compression and sequence variation. Finally, we discuss the implications of these graph structures for bioinformatics and linguistics, and for future research and applications in both domains.
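
As a rough illustration of the fixed-length case, the sketch below builds a k-mer graph over DNA with reverse-complement awareness and computes the Shannon entropy of its node counts. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`reverse_complement`, `canonical`, `build_kmer_graph`, `shannon_entropy`) are hypothetical, each k-mer is collapsed to the lexicographically smaller of itself and its reverse complement, and edges are stored as unordered pairs, which discards strand orientation.

```python
from collections import Counter
from math import log2

# Translation table for the DNA alphabet (assumption: uppercase A, C, G, T only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Collapse a k-mer and its reverse complement onto one node label."""
    return min(kmer, reverse_complement(kmer))


def build_kmer_graph(sequences, k):
    """Count canonical k-mer nodes and the unordered edges between
    k-mers that are adjacent (overlap by k-1 characters) in the input."""
    nodes = Counter()
    edges = Counter()
    for seq in sequences:
        kmers = [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]
        nodes.update(kmers)
        # Sorting each pair makes the edge independent of read strand.
        edges.update(tuple(sorted(pair)) for pair in zip(kmers, kmers[1:]))
    return nodes, edges


def shannon_entropy(counts) -> float:
    """Shannon entropy, in bits, of the empirical distribution of counts."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


if __name__ == "__main__":
    nodes, edges = build_kmer_graph(["ACGTT"], k=3)
    print(dict(nodes))
    print(dict(edges))
    print(round(shannon_entropy(nodes), 4))
```

With canonical labels, a sequence and its reverse complement produce the same node and edge counts, which is one simple way to make the graph strand-agnostic. A variable-length variant would replace the sliding k-mer window with tokens from a trained tokenizer.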

*Work at the Molecular Foundry was supported by the Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Publication: C. J. Prybol, A. T. Hammack, E. A. Ashley, and M. P. Snyder, A Novel Approach for Accurate Sequence Assembly Using de Bruijn graphs, (2024). bioRxiv 2024.05.29.596541; doi: 10.1101/2024.05.29.596541

Presenters

  • Aeron T Hammack

    • Lawrence Berkeley National Laboratory

Authors

  • Aeron T Hammack

    • Lawrence Berkeley National Laboratory
  • Cameron J Prybol

    • Lawrence Berkeley National Laboratory