The first algorithm to compress genetic sequences

Stéphane Grumbach, INRIA

Universal data compression algorithms fail to compress genetic sequences. This is due to the specificity of this particular kind of "text". We analyze in some detail the properties of the sequences that cause the failure of classical algorithms. We then present a lossless algorithm, biocompress-2, to compress the information contained in DNA and RNA sequences, based on the detection of regularities, such as the presence of palindromes. The algorithm combines substitutional and statistical methods and, to the best of our knowledge, leads to the highest compression of DNA. The results, although not satisfactory, give insight into the necessary correlation between compression and comprehension of genetic sequences.
Stéphane Grumbach, Fariza Tahi:
A New Challenge for Compression Algorithms: Genetic Sequences.
Information Processing and Management 30(6): 875-886 (1994)

Download the paper from the HAL archive
The Algorithm
The C program implementing the Biocompress-2 algorithm can be found here: biocompress
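
To illustrate the kind of regularity mentioned in the abstract, here is a minimal sketch in C of palindrome detection. It is not taken from the Biocompress-2 source. It assumes that "palindrome" is meant in the genetic sense, that is, a segment that matches the reverse complement of an earlier segment, and it uses a naive quadratic search. The function names `complement`, `palindrome_match_len` and `longest_palindrome` are hypothetical and chosen only for this example.

```c
#include <stddef.h>

/* Watson-Crick complement of one DNA base.
   Characters other than A, C, G, T are returned unchanged. */
static char complement(char b) {
    switch (b) {
    case 'A': return 'T';
    case 'T': return 'A';
    case 'C': return 'G';
    case 'G': return 'C';
    default:  return b;
    }
}

/* Length of the reverse-complement match between the text read forward
   from seq[pos] and the text read backward from seq[end] (end < pos). */
static size_t palindrome_match_len(const char *seq, size_t n,
                                   size_t pos, size_t end) {
    size_t len = 0;
    while (pos + len < n && len <= end &&
           seq[pos + len] == complement(seq[end - len]))
        len++;
    return len;
}

/* Naive search over all earlier positions for the longest
   reverse-complement match starting at pos. Returns its length and
   stores the backward starting point in *best_end; *best_end is left
   unchanged when the returned length is 0. */
static size_t longest_palindrome(const char *seq, size_t n,
                                 size_t pos, size_t *best_end) {
    size_t best = 0;
    for (size_t end = 0; end < pos; end++) {
        size_t len = palindrome_match_len(seq, n, pos, end);
        if (len > best) {
            best = len;
            *best_end = end;
        }
    }
    return best;
}
```

For the sequence `ACGGTTTCCGT`, calling `longest_palindrome` at position 7 finds that `CCGT` is the reverse complement of the prefix `ACGG`. A substitutional scheme could then encode those four bases as a reference to the earlier segment instead of spelling them out; how Biocompress-2 actually encodes such references is described in the paper, not here.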