IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB 2022)

Compressed Sparse FM-Index:
Fast Sequence Alignment Using Large K-Steps

Rubén Langarita1 Adrià Armejach1,2 Javier Setoain3
Pablo Ibáñez4 Jesús Alastruey-Benedé4 Miquel Moretó1,2
1Barcelona Supercomputing Center (BSC) 2Universitat Politècnica de Catalunya (UPC)
3ARM Cambridge 4Universidad de Zaragoza, I3A

News

Abstract

The FM-index is a data structure used in genomics for exact search of input sequences over large reference genomes. Algorithms based on the FM-index show an irregular memory access pattern, resulting in a memory bound problem. We analyze a recent implementation of the FM-index and highlight existing throughput-memory trade-offs, showing that memory requirements limit implementation of large k-steps. We propose COFI, a COmpressed FM-Index for large K-steps. COFI enables a 15-step FM-index using less than 16~GB for a human genome reference of 3 giga base pairs. An algorithm based on this new layout is evaluated on both a Knights Landing (KNL) and an Skylake-based system (SKX). We achieve average speed-ups of 1.46x and 1.39x, respectively, with respect to an state-of-the-art FM-index implementation that is already well optimized.

Downloads

Bibtex

  @ARTICLE{9109660,
  author={Langarita, Rubén and Armejach, Adrià and Setoain, Javier and Ibáñez-Marín, Pablo and Alastruey-Benedé, Jesús and Moretó, Miquel},
  journal={IEEE/ACM Transactions on Computational Biology and Bioinformatics},
  title={Compressed Sparse FM-Index: Fast Sequence Alignment Using Large K-Steps},
  year={2022},
  volume={19},
  number={1},
  pages={355-368},
  doi={10.1109/TCBB.2020.3000253}
}

Related

Acknowledgements

This work has been partially supported by the Arm-BSC Centre of Excellence initiative, the Spanish Ministry of Economy and Competitiveness (contracts TIN2015-65316-P and TIN2016-76635-C2-1-R (AEI/ERDF, EU)), the Generalitat de Catalunya (contracts 2017-SGR-1328 and 2017-SGR-1414), Gobierno de Aragón (T58_20R research group), ERDF 2014-2020 "Construyendo Europa desde Aragón", and the European Union's Horizon 2020 research and innovation program under the Mont-Blanc 2020 project (grant agreement 779877). M. Moretó has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104. Finally, A. Armejach has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Juan de la Cierva postdoctoral fellowship number IJCI-2017-33945.