
Training hangs on large datasets #12

Open
yzlnew opened this issue Jan 18, 2024 · 3 comments

yzlnew commented Jan 18, 2024

Training on 20 GB of data works fine, but on 100 GB it hangs at a certain step. bytepiece==0.6.3


Here is the stack trace from one of the threads. I can't tell much from it myself; asking GPT directly, it suggests a multiprocessing issue:

```
#0  0x00007f168f6207a4 in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007f168f620898 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f1683848699 in semlock_acquire ()
   from /opt/rh/rh-python38/root/usr/lib64/python3.8/lib-dynload/_multiprocessing.cpython-38-x86_64-linux-gnu.so
#3  0x00007f168f7ed4e6 in PyCFunction_Call () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#4  0x00007f168f7ac932 in _PyObject_MakeTpCall () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#5  0x00007f168f862c5c in _PyEval_EvalFrameDefault () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#6  0x00007f168f84fe05 in _PyFunction_Vectorcall () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#7  0x00007f168f7ab7bd in PyObject_Call () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#8  0x00007f168f860081 in _PyEval_EvalFrameDefault () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#9  0x00007f168f84fe05 in _PyFunction_Vectorcall () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#10 0x00007f168f85e323 in _PyEval_EvalFrameDefault () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#11 0x00007f168f84fe05 in _PyFunction_Vectorcall () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#12 0x00007f168f85e323 in _PyEval_EvalFrameDefault () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#13 0x00007f168f84fe05 in _PyFunction_Vectorcall () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#14 0x00007f168f8507cb in method_vectorcall () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#15 0x00007f168f7ab7bd in PyObject_Call () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#16 0x00007f168f8ad6d1 in t_bootstrap () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#17 0x00007f168f86bbc4 in pythread_wrapper () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#18 0x00007f168f6174e2 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f168f3f25b3 in clone () from /lib64/libc.so.6
```
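
In case it helps debug the next hang: the Python-level stacks of every thread can be dumped without gdb using the stdlib `faulthandler`, registered before training starts (a sketch; Unix only):

```python
import faulthandler, signal

# Sketch: after this runs, `kill -USR1 <pid>` makes the process dump
# every thread's Python stack to stderr, which is easier to read than
# the native gdb backtrace above.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```

With workers=128 the hang may be inside a child process, so the handler would need to be registered in the workers as well to cover them.
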
bojone (Owner) commented Jan 18, 2024

How much system memory does the machine have? And what parameters are you passing to the Trainer?

yzlnew (Author) commented Jan 18, 2024

Memory is 1.3 TB, and the trainer is roughly as follows. Does the memory peak usually occur during the merge stage?

```python
from bytepiece import Trainer

trainer = Trainer(order=6, max_vocab_size=80000, min_count=32, isolate_digits=True)
trainer.train(corpus_instance, workers=128, batch_size=2000)
```

Update: after the hang under this config, used memory is roughly 300 GB.
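
To pin down which stage owns the peak, a small monitor could log the combined RSS of the trainer and its worker processes (a sketch using psutil, which is not part of bytepiece):

```python
import threading, time, psutil

def log_total_rss(interval=60):
    # Sum RSS over the main process and all children, since
    # workers=128 spreads usage across many processes.
    proc = psutil.Process()
    while True:
        rss = proc.memory_info().rss
        for child in proc.children(recursive=True):
            try:
                rss += child.memory_info().rss
            except psutil.NoSuchProcess:
                pass  # worker exited between listing and sampling
        print(f"total RSS: {rss / 2**30:.1f} GiB", flush=True)
        time.sleep(interval)

# Start as a daemon thread before trainer.train() so each stage's
# peak shows up against the progress bars.
threading.Thread(target=log_total_rss, daemon=True).start()
```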

FlyCarrot commented Jul 5, 2024

> Memory is 1.3 TB, and the trainer is roughly as follows. Does the memory peak usually occur during the merge stage?
>
> ```python
> trainer = Trainer(order=6, max_vocab_size=80000, min_count=32, isolate_digits=True)
> trainer.train(corpus_instance, workers=128, batch_size=2000)
> ```
>
> Update: after the hang under this config, used memory is roughly 300 GB.

I'm seeing the same thing. Tested with 200 GB of WuDao data on a machine with 1.0 TB of memory; peak memory usage reached 100%, so it appears to have run out of memory.
The bytepiece version is commit c50c43ec.
Output log:

```
Count Ngrams: 59132213it [4:28:03, 3676.50it/s]
Merge Ngrams:  23% 15/64 [27:49<1:31:37, 112.19s/it]
Merge Ngrams:  30% 19/64 [35:35<1:26:36, 115.48s/it]
Merge Ngrams:  91% 58/64 [1:53:14<12:12, 122.03s/it]
Merge Ngrams: 100% 64/64 [2:05:33<00:00, 117.71s/it]
Prune Ngrams: 100% 7/7 [07:19<00:00, 62.81s/it]
Count Pieces: 5722348it [41:21:27, 3193.81s/it][1]    1991 killed     python train_tokenizer.py
```

So how should the memory usage be limited?
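
Until there's a built-in limit, two things might lower the peak: feed the corpus as a lazy iterator instead of a pre-loaded list, and scale down workers/batch_size. A sketch, not a bytepiece feature; it assumes the usual `from bytepiece import Trainer` import and that train() accepts any iterable of texts, and the corpus path and JSON field below are made up:

```python
import json
from bytepiece import Trainer

def corpus_stream(path="corpus.jsonl"):  # hypothetical file layout
    # Yield one document at a time so the whole corpus never sits in RAM.
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)["text"]  # hypothetical field name

trainer = Trainer(order=6, max_vocab_size=80000, min_count=32, isolate_digits=True)
# Fewer workers and a smaller batch_size should lower the peak,
# at the cost of a slower run.
trainer.train(corpus_stream(), workers=32, batch_size=500)
```

Alternatively, capping the address space with `resource.setrlimit(resource.RLIMIT_AS, ...)` in the training script would turn a runaway allocation into a MemoryError instead of a silent hang or an OOM kill.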
