Memory leak in Writer? #4

Open
JohnEmhoff opened this issue Jan 26, 2020 · 8 comments

Comments

@JohnEmhoff

JohnEmhoff commented Jan 26, 2020

Hello! Thanks for pyorc; using it has been a pleasure so far, with the exception that we seem to be running into memory issues. I think Writer is leaking memory? Our workload is roughly:

  • Open ~100 writers to different files
  • Iterate over our input rows (in the millions) and send each row to exactly one writer
  • Close all writers
  • Repeat

Memory usage grows without bound between iterations. This, coupled with the fact that lowering the stripe size all the way down to 1M has no effect, makes me suspect a memory leak. Below is a script that will reproduce it -- around iteration 10 it gets to 20G and is then killed by the OOM killer on my machine. Let me know if there's anything I can do to help track it down!

https://gist.github.com/JohnEmhoff/274f6e05cba3f17a16683eb394bfe6b5
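
For readers who cannot open the gist, the workload described above can be sketched roughly as follows. This is not the linked script: the `route` hash, the `OrcFile` wrapper, the file naming and the two-column schema are all assumptions made for illustration. Only `pyorc.Writer(fileo, schema)`, `write` and `close` are taken from pyorc itself.

```python
import zlib


def route(row_key: str, n_writers: int) -> int:
    """Pick exactly one writer for a row (stable hash; an illustrative choice)."""
    return zlib.crc32(row_key.encode("utf-8")) % n_writers


def run_iteration(open_writer, rows, n_writers=100):
    """Open n_writers writers, send each row to exactly one, then close all.

    open_writer(i) is a hypothetical factory returning any object that has
    write(row) and close().
    """
    writers = [open_writer(i) for i in range(n_writers)]
    try:
        for row in rows:
            writers[route(str(row[0]), n_writers)].write(row)
    finally:
        for writer in writers:
            writer.close()


class OrcFile:
    """Hypothetical wrapper pairing a pyorc.Writer with its file object."""

    def __init__(self, path, schema="struct<id:int,name:string>"):
        import pyorc  # third-party; imported lazily, assumed installed

        self._file = open(path, "wb")
        self._writer = pyorc.Writer(self._file, schema)

    def write(self, row):
        self._writer.write(row)

    def close(self):
        # Close the ORC writer first so the footer is written, then the file.
        self._writer.close()
        self._file.close()
```

With pyorc installed, one iteration would look like `run_iteration(lambda i: OrcFile(f"/tmp/part-{i}.orc"), rows)`, repeated in a loop while watching the process's memory.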

@JohnEmhoff
Author

I managed to trim down the script a good bit -- it turns out writing data is unnecessary; the leak happens just from creating writers:

https://gist.github.com/JohnEmhoff/55f562c2de701dfb426643a3e7751ef8
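
A minimal sketch of that kind of check, assuming a Unix system where the standard `resource` module is available. It is not the gist itself: the helper names, the three-column schema and the iteration counts are made up for illustration. Peak RSS is used because Python-level tools such as `tracemalloc` do not see allocations made inside the C++ library.

```python
import io
import resource


def peak_rss() -> int:
    """Peak resident set size of this process (KiB on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def measure(create_and_close, iterations=10, per_iteration=100):
    """Call create_and_close() repeatedly; record peak RSS after each round."""
    readings = []
    for _ in range(iterations):
        for _ in range(per_iteration):
            create_and_close()
        readings.append(peak_rss())
    return readings


def create_pyorc_writer(schema="struct<a:int,b:string,c:double>"):
    """Create and immediately close one writer without writing any rows."""
    import pyorc  # third-party; imported lazily, assumed installed

    writer = pyorc.Writer(io.BytesIO(), schema)
    writer.close()
```

Running `print(measure(create_pyorc_writer))` gives one reading per round. Because peak RSS can never decrease, a flat series is the healthy outcome, while a series that keeps climbing round after round suggests memory is not being released.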

@noirello
Owner

Thank you for reporting it.

I think I have pinpointed the problem: it appears when Writer's constructor builds an orc::Type from the TypeDescription.

I'm still looking for the concrete source of the leak.

@JohnEmhoff
Author

Thanks for looking into it. I think you're right -- I noticed that when my spec in the script above is just a column or two, it leaks much, much more slowly.

@noirello
Owner

noirello commented Feb 9, 2020

Even after bypassing the TypeDescription object, the script still failed to run the iterations to the end. It seems like the orc::Writer object is somehow mishandled. Valgrind is not very helpful (although using it was never my strong suit).

@clynamen

clynamen commented Sep 3, 2020

I have the same problem. I tried to dig a bit, and it seems the source of the leak is the creation of multiple ColumnWriter objects (of any type: string, float or int).
The leak is proportional to the number of columns.
Even more memory is leaked when ZLIB or ZSTD compression is enabled (compression is currently enabled by default).
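
One way to probe both observations is to vary the column count and the compression setting while creating writers. The sketch below is an assumption about how such a comparison could be set up, not the code used for the findings above: `make_schema` and `create_writer` are hypothetical helpers, and only the `compression` argument and the `pyorc.CompressionKind` enum come from pyorc.

```python
import io


def make_schema(n_columns: int, column_type: str = "int") -> str:
    """Build an ORC struct schema string with n_columns columns of one type."""
    fields = ",".join(f"c{i}:{column_type}" for i in range(n_columns))
    return f"struct<{fields}>"


def create_writer(n_columns: int, compressed: bool) -> None:
    """Create and close one writer with the given width and compression."""
    import pyorc  # third-party; imported lazily, assumed installed

    kind = pyorc.CompressionKind.ZLIB if compressed else pyorc.CompressionKind.NONE
    writer = pyorc.Writer(io.BytesIO(), make_schema(n_columns), compression=kind)
    writer.close()
```

Calling `create_writer` in a loop for, say, 2 columns and 50 columns, each with and without compression, and comparing the process's memory growth across the four cases would show whether the growth tracks the number of columns and the compression setting.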

@carlosfvp

carlosfvp commented Jun 30, 2021

I also noticed that the stripe size is not being honored.
The stripe is not being flushed to disk, and the memory is (probably) not being freed either, but this part is handled by the C++ library, which makes it harder to debug :(

@carlosfvp

I found this recommendation.
A method named writeIntermediateFooter will flush the content to the file and free some memory, but it only exists in the Java version of the OrcWriter 😥

https://www.mail-archive.com/[email protected]/msg00225.html

https://orc.apache.org/api/orc-core/org/apache/orc/impl/WriterImpl.html#writeIntermediateFooter--

@pokerc

pokerc commented Jul 27, 2021

Found a similar problem: I cannot flush content to the file manually, and the batch_size parameter of Writer seems to have no effect. Any solutions?

5 participants