Incorrect embedding dimension after training #26

Open
BinchaoPeng opened this issue Jul 6, 2022 · 4 comments

@BinchaoPeng

BinchaoPeng commented Jul 6, 2022

I want to use dna2vec on the E. coli genome.
When I set 2<=k<=8, I got an embedding matrix of shape (86479,100).
When I set 3<=k<=8, I got (86614,100), whereas the full vocabulary would give (87360,100), since $87360 = 4^3+4^4+4^5+4^6+4^7+4^8$, i.e. $87360+16 = 4^2+4^3+4^4+4^5+4^6+4^7+4^8$.
So I don't know why the two results differ the way they do.
I also checked every k from 2 to 8 separately and found that the dimension is correct for k from 2 to 7.
However, for k=8 the shape is (64450,100) rather than (65536,100), and $65536-64450 \neq 87360-86614$.
None of these numbers match each other.
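For reference, the expected sizes quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic; `full_vocab_size` is a made-up helper name, not something from dna2vec, and it assumes a plain four-letter alphabet.

```python
def full_vocab_size(k_low: int, k_high: int, alphabet_size: int = 4) -> int:
    """Number of distinct k-mers over the alphabet for every k in [k_low, k_high]."""
    return sum(alphabet_size ** k for k in range(k_low, k_high + 1))


print(full_vocab_size(3, 8))          # 87360
print(full_vocab_size(2, 8))          # 87376
print(full_vocab_size(8, 8))          # 65536

# Shortfall between the full vocabulary and the reported row counts.
print(full_vocab_size(2, 8) - 86479)  # 897
print(full_vocab_size(3, 8) - 86614)  # 746
print(full_vocab_size(8, 8) - 64450)  # 1086
```

The three shortfalls (897, 746, 1086) are all different, which is the mismatch described in the post.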

BinchaoPeng changed the title from "Embedding dimension" to "Incorrect embedding dimension after training" Jul 6, 2022
@BinchaoPeng
Author

@pnpnpn could you please take some time to help me? This is important to me, thanks!
The E. coli genome can be downloaded from https://regulondb.ccg.unam.mx/menu/download/datasets/files/Gene_sequence.txt.
The config:

inputs: inputs/E_coli_K12/*.txt
k-low: 2
k-high: 8
vec-dim: 100
epoch: 10
context: 5
out-dir: results/E_coli/

@BinchaoPeng
Author

Today I compared the k-mers present in the embedding with the complete set of possible k-mers for 3<=k<=8.
I found two kinds of differences:

  1. some k-mers occur only at low frequency;
  2. some k-mers do not occur at all.

So I'd like to know whether there is a better way to handle this situation when training the embedding, since some k-mers end up without a vector. Thanks!
@pnpnpn @aldro61 @alevenberg
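A comparison like the one described can be sketched as follows. This is an assumption-laden illustration, not dna2vec code: `count_kmers` and `missing_kmers` are hypothetical helpers, the counting uses an exhaustive sliding window (which is not necessarily how dna2vec samples k-mers), and the toy sequence stands in for the real genome.

```python
from collections import Counter
from itertools import product


def count_kmers(seqs, k_low, k_high):
    """Count every k-mer (k_low <= k <= k_high) with an exhaustive sliding window."""
    counts = Counter()
    for seq in seqs:
        for k in range(k_low, k_high + 1):
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
    return counts


def missing_kmers(counts, k_low, k_high, alphabet="ACGT"):
    """All possible k-mers over the alphabet that never appear in `counts`."""
    return [
        "".join(letters)
        for k in range(k_low, k_high + 1)
        for letters in product(alphabet, repeat=k)
        if "".join(letters) not in counts
    ]


toy = ["ACGTACGTTGCA"]  # toy input, not the E. coli genome
counts = count_kmers(toy, 2, 3)
missing = missing_kmers(counts, 2, 3)
print(len(counts), "observed,", len(missing), "missing")
```

Observed plus missing always adds up to the full composition (16 + 64 = 80 for k from 2 to 3), so the same check on the real genome shows how many k-mers are absent before any training happens.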

@BinchaoPeng
Author

BinchaoPeng commented Jul 7, 2022

I found that this seems to be related to the parameter min_count. But then why don't the first dimensions of the embedding matrices obtained with 2<=k<=8 and 3<=k<=8 differ by exactly 16?
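To make the effect of such a cutoff concrete, here is a minimal sketch. It assumes the trainer drops any k-mer seen fewer than min_count times, which is how gensim's Word2Vec documents its `min_count` parameter; `surviving_vocab` and the counts below are invented for illustration.

```python
from collections import Counter


def surviving_vocab(counts, min_count):
    """K-mers that a frequency cutoff of `min_count` would keep."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


# Hypothetical counts, for illustration only.
toy_counts = Counter({"ACGTACGT": 12, "TTTTTTTT": 5, "CCGGCCGG": 4, "GATTACAG": 1})

print(sorted(surviving_vocab(toy_counts, min_count=5)))  # ['ACGTACGT', 'TTTTTTTT']
print(len(surviving_vocab(toy_counts, min_count=1)))     # 4
```

Under that assumption, the number of rows in the embedding is the number of k-mers that clear the cutoff, not the number of possible k-mers, so rare 8-mers in a small genome are the first to disappear.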

@eternal-bug

Your problem may be related to the code below. Reading the source, you will find that k-mers are not extracted exhaustively from the start to the end of each sequence; the extraction involves randomness:

generators.py

    @staticmethod
    def random_chunks(rng, li, min_chunk, max_chunk):
        """
        Both min_chunk and max_chunk are inclusive
        """
        it = iter(li)
        while True:
            head_it = islice(it, rng.randint(min_chunk, max_chunk + 1))
            nxt = ''.join(head_it)

            # throw out chunks that are not within the kmer range
            if len(nxt) >= min_chunk:
                yield nxt
            else:
                break

Because the human genome is relatively large, random sampling is still likely to cover every possible k-mer. The E. coli genome is much smaller, so the sample is smaller too, and some k-mers may simply never be drawn, which would explain the missing k-mers in the statistics you mentioned earlier.
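The difference between random chunking and exhaustive extraction can be seen on a small synthetic sequence. The sketch below adapts the quoted `random_chunks` to Python's standard `random.Random` (whose `randint` is inclusive on both ends, hence no `+ 1`); `sliding_kmers` and the synthetic genome are assumptions made for the comparison, and this is not a claim about which fragmenter a given dna2vec run actually uses.

```python
import random
from itertools import islice


def random_chunks(rng, seq, min_chunk, max_chunk):
    """Disjoint chunks with lengths drawn uniformly from [min_chunk, max_chunk]."""
    it = iter(seq)
    while True:
        # random.Random.randint includes both endpoints.
        chunk = "".join(islice(it, rng.randint(min_chunk, max_chunk)))
        if len(chunk) < min_chunk:
            break  # trailing remainder is too short to be a valid k-mer
        yield chunk


def sliding_kmers(seq, k_low, k_high):
    """Every k-mer (k_low <= k <= k_high) at every position."""
    return {
        seq[i:i + k]
        for k in range(k_low, k_high + 1)
        for i in range(len(seq) - k + 1)
    }


rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(5000))  # synthetic stand-in

sampled = set(random_chunks(rng, genome, 3, 6))
exhaustive = sliding_kmers(genome, 3, 6)
print(len(sampled), "distinct k-mers from random chunks")
print(len(exhaustive), "distinct k-mers from an exhaustive sliding window")
```

Each position contributes to only one chunk of one random length, so the sampled set can never exceed the exhaustive one. Changing the lower bound also changes the chunk-length distribution for every k, which is one plausible reason the 2<=k<=8 and 3<=k<=8 runs do not differ by exactly 16.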
