The meaning of the divsum file obtained from calcDivergenceFromAlign.pl #290

Open
ltswx opened this issue Oct 14, 2024 · 1 comment

ltswx commented Oct 14, 2024

Hi~ o( ̄▽ ̄)ブ

I would appreciate it if anyone could help me understand, in simple words, what these values (numbers) mean:

I think the numbers in every column except Div represent the total length (in bp) of repeat sequence at each divergence level. But when I added them all up and divided by the genome size, the fraction of bases masked that I calculated was different from the bases masked reported in the *tbl file. The difference is huge: the *tbl file reports 71.01%, while my calculation gives 98%! Where did I go wrong? QAQ

tbl file:
total length: 1333551035 bp (1322312127 bp excl N/X-runs)
GC level: 36.87 %
bases masked: 946989405 bp ( 71.01 %)
——————————————————————————
divsum file:
Div DNA/DTA DNA/DTC DNA/DTH DNA/DTM DNA/DTT DNA/Helitron LTR/Copia LTR/Gypsy LTR/unknown MITE/DTA MITE/DTC MITE/DTH MITE/DTM MITE/DTT
0 2372488 1566743 448260 1399704 1862233 3080962 4090922 14477189 2989812 125802 12497 41037 54060 84787
1 1070198 565642 179819 699623 1098895 2032132 2420914 10119770 2200132 10879 422 4182 3855 8880
2 715860 855177 218688 527116 470136 824655 1995255 11309963 2177667 10787 217 13666 6496 22494
3 1035157 842382 208109 521912 349325 986025 3128677 13492012 2677114 34770 2122 23326 24197 34870
4 781227 794300 166625 727909 508392 1212696 5397677 12978613 3680115 38307 642 28485 42707 45816
5 1406210 769048 173994 906731 624432 1341063 4343808 15285779 4647044 52326 0 27940 63166 54184
6 770654 831030 197007 713560 703585 1545371 4483437 17083630 5650726 92590 2812 42051 40711 60726
7 726316 858708 154043 691578 791870 1792639 4613580 20229613 8154480 112670 1532 42463 33844 86128
8 681396 938817 168054 743208 952169 1854848 5054305 21196254 9894821 140268 2550 51597 43002 107128
9 604952 978478 167630 809106 1004107 2419183 4790173 20091691 11198972 180950 1613 55319 55831 127289
10 639736 990794 224385 941584 1166862 2545176 5273654 19311001 10555231 148188 1665 74280 64342 160882
11 620142 995990 229171 1005233 1330302 2645505 6078967 19495570 10611831 151829 1756 94766 81253 168221
12 596736 979397 262376 954856 1551576 3140240 6313904 18661570 10322872 151098 2935 100498 82606 191351
13 728465 910078 260739 934970 1566219 2568548 5546549 16320060 9657407 161395 1732 87144 69791 205104
14 696420 827986 224541 956190 1570759 2588673 4749201 15004961 9009895 148651 728 90783 81420 198661
15 669376 833180 231964 999698 1651570 2570283 5255747 13781616 7720243 150840 996 97759 90209 182999
16 626601 849532 220414 1187371 1608941 2571378 4553700 12602258 6722391 143630 982 87894 84362 168931
17 606426 820565 211516 1639720 1620009 2476461 4492894 11664936 4912886 141461 1416 95705 73054 169030
18 498113 818626 206386 3778260 1699336 2249800 4037018 10565373 3775558 126250 2077 95020 67744 157855
19 440143 853099 215911 13342788 1875211 2275643 3471026 9478961 3253924 109898 1334 89650 68048 141827
20 392077 921635 193810 42271944 1871490 2216555 2826372 9014891 2963606 106505 1689 75025 63812 136338
21 335578 906159 207703 74051399 1808891 2058996 2201190 8725154 2666056 93689 3071 77575 59535 137876
22 337415 814555 199042 77079780 1669412 2297319 2156443 9062914 2592126 96448 2413 68705 47452 127404
23 323405 841229 220502 50788912 1512184 1951457 2198345 8346377 2718104 81280 2700 61415 41378 118290
24 291419 945761 195468 22965594 1384568 1826849 1968953 7620308 2271787 83316 1625 59250 42139 119858
25 282724 969595 185307 8630729 1259230 1717695 2008782 7822866 1933346 75091 1906 63850 45515 112735
26 258653 1137225 187122 3661803 1227592 1717284 2137466 8457821 1745714 71195 3165 58693 39017 106227
27 219445 1161779 195278 2189846 1125775 1608901 2016681 10016293 1756664 65509 2889 57048 40999 96926
28 231243 899960 220493 1680100 1043300 1454815 1593123 10659534 1419714 61426 1396 51211 35632 90255
29 196252 945695 179355 1347855 944943 1367978 1354270 10214950 931468 54643 1544 45752 35158 79617
30 166707 1050287 151917 1082359 857792 1246436 1140561 10337721 719851 46964 3272 42227 30448 67622
31 148241 1082000 121590 942409 751039 1241058 1172321 11160153 684410 40465 1249 36663 29725 60376
32 120702 1228329 107612 798325 688518 1184030 1591327 11458913 522652 31872 2041 28835 25363 50892
33 93286 1346042 102054 808583 607963 1215827 3500491 15885122 398462 30841 2728 26558 18758 33776
34 88289 1300309 92674 864415 532755 940712 3111923 12654879 318864 23740 1587 25872 13842 34598
35 65974 1239054 88656 717085 457126 895603 1933532 6521390 253536 17706 1642 23493 14958 22788
36 62436 1009345 72722 596948 378460 871168 1322333 4951858 186387 14138 1756 17894 14374 16911
37 59819 964493 61552 598119 342657 909615 1271040 4205225 152514 12546 1540 12958 7619 11667
38 71842 827383 58303 510491 281851 876939 816970 3088871 114682 10054 2013 9517 6057 7412
39 80441 600820 43224 446275 242803 844172 475190 2727504 83851 5793 945 6229 4485 7358
40 85481 532881 31149 452050 199443 785084 411575 2318517 64842 7028 1331 5331 4348 5314
41 71160 394552 23060 450610 197408 745368 383154 1716829 57607 5927 621 4419 3544 5126
42 66245 301210 29843 423033 170501 642513 474986 1313448 40232 2742 1318 3570 3048 2752
43 45227 246635 15450 367600 153636 548760 446599 1091995 28303 2838 1858 1500 2678 970
44 44203 188813 12042 292453 135443 469329 258269 822704 18425 1827 442 993 1723 755
45 35052 125046 8460 275099 70085 369097 165378 600995 19951 974 1237 2323 1396 372
46 26973 83028 4980 176374 57059 247696 88752 456999 14717 923 1121 906 357 618
47 32331 56993 2486 107917 49920 170429 43761 281656 8232 397 629 344 144 90
48 20182 33233 2104 64963 31659 104414 21335 209202 5157 479 455 256 138 0
49 7346 19459 1424 51345 15897 66061 8467 147324 3541 35 0 0 0 241
50 2521 10815 714 18920 6571 46435 4718 76488 2311 0 0 0 0 108
51 2478 5220 401 13105 6192 24298 1617 33519 1250 0 0 0 0 0
52 791 4161 300 9694 5093 19305 1071 15578 798 0 0 170 0 0
53 886 2615 128 2635 984 10098 256 7948 988 168 0 0 0 0
54 440 391 0 2457 925 2453 849 3052 1432 166 0 0 0 0
55 0 217 0 632 0 2198 39 2675 777 0 0 0 0 0
56 0 517 0 245 0 900 0 161 631 0 0 0 0 0
57 0 230 0 0 0 507 78 38 390 0 0 0 0 0
58 0 0 0 0 0 154 151 0 1169 0 0 0 84 0
59 0 0 0 0 0 0 107 0 87 0 0 0 0 0
60 0 0 0 0 0 259 0 338 216 0 0 0 0 0
61 0 0 0 0 0 0 137 159 148 0 0 0 0 0
62 0 0 0 123 0 0 0 17 276 0 0 0 0 0
63 0 0 0 0 0 0 9 0 586 0 0 0 0 0
64 0 0 0 0 0 0 0 0 179 0 0 0 0 0
65 0 0 0 0 0 0 0 0 0 0 0 0 0 0
66 0 0 0 0 0 0 0 0 144 0 0 0 0 0
67 0 0 0 0 0 0 0 0 0 0 0 0 0 0
68 0 0 0 0 0 0 38 0 125 0 0 0 0 0
69 0 0 0 0 0 0 11 0 0 0 0 0 0 0
70 0 0 0 0 0 0 18 0 0 0 0 0 0 0
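
For reference, the calculation described above (sum every per-class column, then divide by genome size) can be sketched as follows. This is only an illustration: it assumes the input is exactly the whitespace-separated table shown in the post, with a header row and `Div` as the first column, and the function name `summed_divsum_fraction` is hypothetical, not part of RepeatMasker. The sample uses the first two rows and first three columns of the table, and the genome size is the total length from the tbl file.

```python
def summed_divsum_fraction(table_text, genome_size):
    """Sum all per-class base counts in a divsum-style table.

    Returns (total_bases, total_bases / genome_size).
    Assumes: first line is a header, first column is the divergence bin.
    """
    lines = [ln for ln in table_text.splitlines() if ln.strip()]
    header = lines[0].split()
    total = 0
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != len(header):
            raise ValueError(f"unexpected column count in row: {ln!r}")
        # fields[0] is the divergence bin; the rest are base counts per class
        total += sum(int(x) for x in fields[1:])
    return total, total / genome_size


# First two rows and first three class columns of the table in the post
sample = """Div DNA/DTA DNA/DTC DNA/DTH
0 2372488 1566743 448260
1 1070198 565642 179819
"""

total, fraction = summed_divsum_fraction(sample, 1333551035)
print(total, f"{fraction:.4%}")
```

Note that a sum computed this way counts a base once for every alignment that covers it, so it is not guaranteed to match a figure that counts each base only once.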

@ltswx ltswx added the question label Oct 14, 2024
Member

rmhubley commented Dec 4, 2024

The *.tbl file is a non-redundant accounting of what is presented in the annotation file (*.out). That is, if two or more annotation ranges cover the same base, that base is counted only once. The calcDivergenceFromAlign.pl script, by contrast, simply sums base counts from the individual alignments, which may or may not overlap. For a highly curated library such as human or mouse, this is not a problem because overlaps are rare. However, if you are using a de novo generated or custom library with high redundancy, RepeatMasker is much more likely to report overlapping annotations, and this script will double-count those bases. Without curation, these sorts of libraries are less effective for concise annotation.
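
To make the distinction concrete, here is a minimal sketch of the two ways of counting. It is not RepeatMasker's actual implementation: the function names and the interval coordinates are hypothetical, and intervals are assumed to be half-open `(start, end)` pairs on a single sequence.

```python
def redundant_bases(intervals):
    """Sum the length of every interval, counting overlapping bases repeatedly."""
    return sum(end - start for start, end in intervals)


def nonredundant_bases(intervals):
    """Count each covered base once by merging overlapping intervals first."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # No overlap with the current merged block: close it, start a new one
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered


# Hypothetical annotations: the first two overlap by 50 bases
annotations = [(0, 100), (50, 150), (200, 250)]

print(redundant_bases(annotations))     # per-alignment sum
print(nonredundant_bases(annotations))  # each base counted once
```

With these made-up intervals the per-alignment sum is 250 while the merged coverage is 200. The more overlapping annotations a redundant library produces, the further the first number drifts above the second.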
