AutoBitArray
(ABA) is a wrapper around the BitArray
module allowing you to store any sequence containing hashable elements using a constant bit size. Originally designed for genomic sequences, this module provides a convenient way to store encode and decode sequence in a fast and reliable way using a minimum memory size. Beside using low memory, bitarray are also mutable, allowing you to mutate your genomic sequence easily. However, if you look to a more convenient way to use mutable sequences, look at the MutableSeq array from BioPython which is not memory efficient but contains lot of usefull methods.
AutoBitArray
takes two arguments, the seq
argument which can be an iterator or a string, and a char_size
which will indicates with how many bites a character should be encoded. Note that the limit of unique characters and the size of the bitarray is defined as 2 ^ char_size
. For example, a char_size
of 2 will allows 2 ^ 2 possibilities (00, 01, 10, 00) while a char_size
of 3 will allows 2 ^ 3 possibilities. The default value of char_size
is set to 3.
from aba import AutoBitArray
seq = AutoBitArray("ATGC", 2)
seq.extend("TA")
seq[0] = "GA"
print (seq) # GATGCTA
To retrieve all the elements, use the decode
function.
seq = AutoBitArray("ATGC" * 10, 2)
print ( "".join(seq.decode()) )
MutableBitArray
can be sliced, allowing to decode only a part of the array.
seq = AutoBitArray("ATGCA", 2)
for i in range(0, len(seq), 2) :
print ("".join(seq[i:i+2].decode()))
If you're not sure of the char_size
value to use :
from aba import BitDict
# If you know the number of unique elements
print (BitDict.char_size_from_nunique(2)) # -> 1
# Or if you have the sequence you want to encode
print (BitDict.char_size_from_nunique("ATGC")) # -> 2
Test using python 3.5.2 on a 64 bits system and a sequence of 10k elements with 4 possible characters.
Method | Bytes size |
---|---|
String | 10.049 |
List | 90.312 |
Array | 40.264 |
BitArray | 2.556 |
To encode and decode bits sequences, MutableBitArray
uses an internal dictionary named BitDict
. This dictionary is filled whenever a new character is observed in the input sequence. This is a requirement for the proper functioning of the instance, however it blocks the possibility to concatenate two bit arrays for which their dictionaries differ. To bypass this constraint, a BitDict
can be initialized prior to the bit array like this :
bd = BitDict()
dna = AutoBitArray("TGG", bit_dict=bd)
dna2 = AutoBitArray("ATT", bit_dict=bd)
dna3 = dna + dna2
print ( "".join(dna3.decode()) )
BioAba contains two different class similar to biopython SeqRecord
and MultipleSeqAlignement
. These class are designed to produce multiple sequences alignment without string objects. The current object contains basic methods to slice the alignment and to save it into several formats (for now only fasta, phylip, structure and nexus are available).
from random import choice
from bioAba import Record, MultipleSeqAlignement
bases = "ATGC"
data = {"name-%i" %j : "".join(choice(bases) for i in range(100)) for j in range(10)}
records = [Record(name, sequence) for name, sequence in data.items()]
msa = MultipleSeqAlignement(records)
# if you want to switch to biopython :
msa_biopython = msa.to_biopython()
outfile = "./myalignment.fasta"
msa.save(outfile, "fasta")