Incorrect embedding dimension after training #26
Comments
@pnpnpn, please take some time to help me; this is important to me. Thanks!
inputs: inputs/E_coli_K12/*.txt
k-low: 2
k-high: 8
vec-dim: 100
epoch: 10
context: 5
out-dir: results/E_coli/
Today I compared the k-mers present in the embedding vectors with the complete k-mer composition, where
So I would like to know whether there is a better way to handle this situation when training the embedding, since some k-mers are missing from the result. Thanks!
It seems to be related to the min_count parameter. But then why doesn't the first dimension of the embeddings obtained with 2<=k<=8 and 3<=k<=8 differ by 16?
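To make the min_count suspicion concrete, here is a minimal sketch of how a frequency cutoff shrinks a vocabulary. It is plain Python, not dna2vec or gensim code; `vocab_after_min_count` and the toy token list are hypothetical names introduced only for illustration. It assumes the usual word2vec convention that tokens seen fewer than min_count times are ignored.

```python
from collections import Counter

def vocab_after_min_count(tokens, min_count):
    """Return the set of tokens that occur at least min_count times.

    Hypothetical helper: it mimics the usual word2vec convention of
    dropping tokens whose total frequency is below min_count.
    """
    counts = Counter(tokens)
    return {tok for tok, n in counts.items() if n >= min_count}

# Toy corpus of k-mers: "GGCA" appears only once.
tokens = ["ACGT", "ACGT", "ACGT", "TTGA", "TTGA", "GGCA"]

kept = vocab_after_min_count(tokens, min_count=2)   # rare k-mer dropped
kept_all = vocab_after_min_count(tokens, min_count=1)  # nothing dropped
```

If rare 8-mers are sampled only a handful of times, a cutoff like this would remove them from the embedding matrix even though they occur in the genome.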
This may be the cause. Reading the source code, you will find that k-mers are not extracted exhaustively from the start to the end of a sequence; the extraction is randomized:

    @staticmethod
    def random_chunks(rng, li, min_chunk, max_chunk):
        """
        Both min_chunk and max_chunk are inclusive
        """
        it = iter(li)
        while True:
            head_it = islice(it, rng.randint(min_chunk, max_chunk + 1))
            nxt = ''.join(head_it)
            # throw out chunks that are not within the kmer range
            if len(nxt) >= min_chunk:
                yield nxt
            else:
                break

Because the human genome is relatively large, random sampling is still likely to cover every possible k-mer. The E. coli genome is much smaller, so the sample is smaller too, and some k-mers may never be drawn; those would then be missing from the counts you mentioned earlier.
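The effect can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the sequence is random rather than a real genome, the chunker is a simplified stand-in written against Python's `random.Random` (whose `randint` is inclusive on both ends), and `sliding_kmers` is a hypothetical helper. It compares the k-mers found by an exhaustive sliding window with those produced by a single pass of random, non-overlapping chunking.

```python
import random
from itertools import islice

def random_chunks(rng, seq, min_chunk, max_chunk):
    """Split seq into consecutive, non-overlapping chunks of random length."""
    it = iter(seq)
    while True:
        # random.Random.randint includes both endpoints
        chunk = "".join(islice(it, rng.randint(min_chunk, max_chunk)))
        if len(chunk) < min_chunk:
            break  # ran out of sequence
        yield chunk

def sliding_kmers(seq, k):
    """Every k-mer in seq, taken with a step of one base."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

rng = random.Random(0)
# Toy random sequence standing in for a small genome.
genome = "".join(rng.choice("ACGT") for _ in range(20000))

k = 4
all_kmers = sliding_kmers(genome, k)
sampled = {c for c in random_chunks(rng, genome, 2, 8) if len(c) == k}

print(len(sampled), "of", len(all_kmers), "distinct", k, "mers were sampled")
```

A single chunking pass can only see a subset of what the sliding window sees, and the gap grows with k because the number of possible k-mers rises as 4^k while the number of sampled chunks stays roughly fixed.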
I want to use dna2vec for the E. coli genome.
When I set 2<=k<=8, I got (86479,100); when I set 3<=k<=8, I got (86614,100). The correct dimension for 3<=k<=8 should be (87360,100), given that $87360+16=4^2+4^3+4^4+4^5+4^6+4^7+4^8$. So I don't know why I got two different results.
I also checked each k from 2 to 8 separately and found that the dimension is correct for k from 2 to 7. However, for k=8 the dimension is (64450,100) rather than (65536,100), and $65536-64450 \neq 87360-86614$.
This is horrible! Nothing matches.
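The figures in this report can be checked with a few lines of arithmetic. The sketch below assumes a four-letter alphabet (A, C, G, T); `full_vocab_size` is a hypothetical helper, and the observed row counts are copied from the report above.

```python
def full_vocab_size(k_low, k_high, alphabet_size=4):
    """Number of distinct k-mers for k_low <= k <= k_high (inclusive)."""
    return sum(alphabet_size ** k for k in range(k_low, k_high + 1))

expected_3_8 = full_vocab_size(3, 8)  # 87360
expected_2_8 = full_vocab_size(2, 8)  # 87376 = 87360 + 16

# Observed first dimensions, as reported in this issue.
missing_2_8 = expected_2_8 - 86479    # 897 k-mers missing
missing_3_8 = expected_3_8 - 86614    # 746 k-mers missing
missing_8_only = 4 ** 8 - 64450       # 1086 8-mers missing

print(expected_3_8, expected_2_8, missing_2_8, missing_3_8, missing_8_only)
```

The three runs are missing different numbers of k-mers (897, 746 and 1086), which is what one would expect if the vocabulary depends on a random sample rather than on an exhaustive scan of the genome.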