Incorrect embedding dimension after training #26

BinchaoPeng · 2022-07-06T15:47:04Z

I want to use dna2vec on the E. coli genome.
With 2<=k<=8, I got an embedding matrix of shape (86479,100).
With 3<=k<=8, I got (86614,100), whereas the correct shape should be (87360,100), since $87360+16=4^2+4^3+4^4+4^5+4^6+4^7+4^8$.
I don't know why the two results are inconsistent.
I also checked each k from 2 to 8 separately and found that the shape is correct for k from 2 to 7.
However, for k=8 the shape is (64450,100) rather than (65536,100), and $65536-64450 \neq 87360-86614$.
This is very confusing: none of the numbers match.
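
The expected sizes quoted above can be checked with a few lines of Python. This is only an arithmetic sketch: `expected_vocab_size` is a hypothetical helper, not part of dna2vec, and it assumes a four-letter alphabet (A, C, G, T) with every possible k-mer present.

```python
def expected_vocab_size(k_low, k_high, alphabet_size=4):
    """Number of distinct k-mers for k_low <= k <= k_high (both inclusive)."""
    return sum(alphabet_size ** k for k in range(k_low, k_high + 1))


# Full vocabulary sizes for the ranges discussed in this issue.
print(expected_vocab_size(2, 8))  # 87376
print(expected_vocab_size(3, 8))  # 87360
print(expected_vocab_size(8, 8))  # 65536

# Gaps between the full vocabulary and the reported embedding sizes.
print(expected_vocab_size(2, 8) - 86479)  # 897
print(expected_vocab_size(3, 8) - 86614)  # 746
print(expected_vocab_size(8, 8) - 64450)  # 1086
```

The three gaps differ from one another, which matches the observation that the missing rows do not line up between runs.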


BinchaoPeng · 2022-07-06T15:50:36Z

@pnpnpn could you please take some time to help me? This is important to me. Thanks!
The E. coli genome can be downloaded from https://regulondb.ccg.unam.mx/menu/download/datasets/files/Gene_sequence.txt.
The config:

inputs: inputs/E_coli_K12/*.txt
k-low: 2
k-high: 8
vec-dim: 100
epoch: 10
context: 5
out-dir: results/E_coli/

BinchaoPeng · 2022-07-07T02:02:20Z

Today I compared the k-mers present in the embedding with the complete set of possible k-mers for 3<=k<=8.
I found two kinds of discrepancy:

some k-mers occur only at low frequency;
some k-mers do not occur at all.

So I'd like to know whether there is a better way to handle this situation when training the embedding, since some k-mers end up missing from it. Thanks!
@pnpnpn @aldro61 @alevenberg
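
A comparison of this kind could be sketched as follows. The helpers `count_kmers` and `coverage_report` are hypothetical names, not dna2vec functions, and the sketch counts every overlapping window in the input, which is an assumption and not necessarily how dna2vec samples k-mers during training. The `min_count` argument here is only a reporting threshold for this sketch.

```python
from collections import Counter
from itertools import product


def count_kmers(sequences, k):
    """Count every overlapping k-mer of length k across all sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def coverage_report(sequences, k, min_count=1):
    """Return (absent, rare): k-mers never seen, and k-mers seen fewer
    than min_count times, out of all 4**k possible k-mers."""
    counts = count_kmers(sequences, k)
    absent, rare = [], []
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        seen = counts.get(kmer, 0)
        if seen == 0:
            absent.append(kmer)
        elif seen < min_count:
            rare.append(kmer)
    return absent, rare


# Toy input; a real run would pass the gene sequences read from the input files.
absent, rare = coverage_report(["ACGTACGT"], k=2, min_count=2)
print(len(absent), rare)  # 12 ['TA']
```

Running this per k on the actual input would show how many k-mers are truly absent from the data and how many are merely rare.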

BinchaoPeng · 2022-07-07T03:21:18Z

It seems to be related to the min_count parameter. But why doesn't the first dimension of the embedding matrix differ by 16 between 2<=k<=8 and 3<=k<=8?

eternal-bug · 2023-04-23T04:53:10Z

Your problem may be related to the following. Reading the source code, you will find that k-mers are not extracted exhaustively from the start to the end of a sequence; the extraction is randomized:

generators.py

    @staticmethod
    def random_chunks(rng, li, min_chunk, max_chunk):
        """
        Both min_chunk and max_chunk are inclusive
        """
        it = iter(li)
        while True:
            head_it = islice(it, rng.randint(min_chunk, max_chunk + 1))
            nxt = ''.join(head_it)

            # throw out chunks that are not within the kmer range
            if len(nxt) >= min_chunk:
                yield nxt
            else:
                break

Because the human genome is relatively large, random sampling is likely to yield every possible k-mer. The E. coli genome is much smaller, so the sample is smaller too, and some k-mers may never be drawn. That could explain the missing entries you reported.
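
The effect can be illustrated with a small simulation. This is a simplified re-implementation of the chunking idea quoted above, not the dna2vec code itself: it uses Python's `random.Random` (whose `randint` is inclusive on both ends) instead of the library's own random generator, the "genome" is a random string, and `distinct_kmers_seen` is a hypothetical helper.

```python
import random
from itertools import islice


def random_chunks(rng, seq, min_chunk, max_chunk):
    """Cut seq into consecutive chunks of random length in [min_chunk, max_chunk]."""
    it = iter(seq)
    while True:
        chunk = "".join(islice(it, rng.randint(min_chunk, max_chunk)))
        if len(chunk) >= min_chunk:
            yield chunk
        else:
            break  # leftover tail is shorter than min_chunk, so stop


def distinct_kmers_seen(seq, k_low, k_high, epochs, seed=0):
    """Collect the distinct chunks produced over several random passes."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(epochs):
        seen.update(random_chunks(rng, seq, k_low, k_high))
    return seen


# Toy "genome": 20,000 random bases (an assumption for illustration only).
gen = random.Random(42)
genome = "".join(gen.choice("ACGT") for _ in range(20000))

seen = distinct_kmers_seen(genome, 3, 5, epochs=1)
sliding = {genome[i:i + k] for k in range(3, 6) for i in range(len(genome) - k + 1)}

# Random chunking can only see a subset of what an exhaustive sliding window sees.
print(len(seen), len(sliding), 4**3 + 4**4 + 4**5)
```

Each base is consumed by exactly one chunk per pass, so a single pass yields far fewer k-mer occurrences than a sliding window would. Raising `epochs` or the sequence length increases coverage, which is consistent with the explanation that a small genome may leave some k-mers unseen.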

BinchaoPeng changed the title ~~Embedding dimension~~ Incorrect embedding dimension after training Jul 6, 2022
