Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[improve][misc][WIP] Detect "double release" and "use after release" bugs with recycled objects #22110

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

lhotari
Copy link
Member

@lhotari lhotari commented Feb 23, 2024

Motivation

In Pulsar, users have reported issues that could be caused by "double release" or "use after release" bugs with recycled objects.

Here's are some issues that could potentially be caused by "double release" or "use after release" bugs:
#22035
#21892

Other example of such potential issue: #21421/#21933. It is possible that these issues are fixed by apache/bookkeeper#4196. The other root cause could be a "double release" or "use after release" bug which is corrupting the buffer and causing checksum calculation to fail.

Outside of Java, there's a known bug pattern called "double-free" or "Doubly freeing memory" (with malloc).
The "double release" bug pattern is a bit similar, but happens with the recycled object pattern using Netty's Recycler that Pulsar uses because of performance reasons.

There is also a "use after free" bug pattern. Something similar could be happen with Netty recycled objects that the object instance gets used after releasing.

The solution in this PR attempts to help detect "double release" and "use after release" bugs.

Modifications

  • get rid of any .setRefCnt(1) calls in production code since that could hide real issues
  • add additional detection feature that can be disabled by setting the system property -Dpulsar.refcount.check.on_access=false.
    • the refcount check will add a volatile read. This isn't a performance problem in most use cases. For production usage we could add -Dpulsar.refcount.check.on_access=false into the bin/pulsar script by default if we are afraid of the performance overhead. For all tests, we should be running with checks enabled so that we could find the source of problems.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@lhotari lhotari added this to the 3.3.0 milestone Feb 23, 2024
@lhotari lhotari self-assigned this Feb 23, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Feb 23, 2024
@lhotari lhotari changed the title WIP Detect double release bugs with recycled objects [improve][misc] WIP Detect double release bugs with recycled objects Feb 23, 2024
@lhotari lhotari marked this pull request as ready for review February 23, 2024 16:03
@lhotari lhotari changed the title [improve][misc] WIP Detect double release bugs with recycled objects [improve][misc] WIP Detect "double release" and "use after release" bugs with recycled objects Feb 23, 2024
@lhotari lhotari changed the title [improve][misc] WIP Detect "double release" and "use after release" bugs with recycled objects [improve][misc][WIP] Detect "double release" and "use after release" bugs with recycled objects Feb 23, 2024
@lhotari
Copy link
Member Author

lhotari commented Feb 23, 2024

After making the changes, there are a lot of unit test failures. I haven't had a chance to look into the details. This PR is still a very early proposal about how to start detecting "double release" and "use after release" bugs.

Copy link
Contributor

@BewareMyPower BewareMyPower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recycled objects are widely used in Pulsar, not only for EntryImpl. The 1st concern is that should we apply checks to all these places? For example, the client side could also use recycled objects.

The 2nd concern is, I'm afraid currently Pulsar allows a recycled object is accessed with a "null check". We need to investigate such cases.

@lhotari
Copy link
Member Author

lhotari commented Feb 26, 2024

The recycled objects are widely used in Pulsar, not only for EntryImpl. The 1st concern is that should we apply checks to all these places? For example, the client side could also use recycled objects.

@BewareMyPower Yes that's a good point. It looks like the AbstractCASReferenceCounted/AbstractReferenceCounted base class prevents a lot of the problems since there would be exceptions at some point if there would be execution paths where "double release" or "use after release" bugs existed.

The 2nd concern is, I'm afraid currently Pulsar allows a recycled object is accessed with a "null check". We need to investigate such cases.

Yes, those are cases where the problem is getting hidden.

It would be great to find a better way to track down the issues. That's why I started this PR which is more like an experiment to find some way that would work for detecting "double release" and "use after release" bugs and also raise awareness of such bug patterns with the recycled objects. These are bug patterns that most Java developers have never had to deal with because of Java's garbage collection. With recycled objects and Netty ByteBufs, that all changes.

@lhotari
Copy link
Member Author

lhotari commented Feb 26, 2024

I guess there's also the possibility of "double release" and "use after release" bugs with Netty ByteBufs.

In Netty, there's the leak detector for detecting when you don't release buffers, but there seems to be nothing for detecting the "use after release" bugs. I guess "double release" would be detected with the io.netty.buffer.AbstractReferenceCountedByteBuf base class which will throw an exception on double release.
It leaves the "use after release" bugs undetected.

It seems that the solution might be a Java Agent written with Byte Buddy etc. which would add additional checks with byte code instrumentation when the agent is activated. Thinking something like https://github.com/reactor/BlockHound but for a completely different purpose, to help detect "use after release" bugs.

@lhotari lhotari marked this pull request as draft February 26, 2024 06:03
@lhotari
Copy link
Member Author

lhotari commented Feb 26, 2024

The 2nd concern is, I'm afraid currently Pulsar allows a recycled object is accessed with a "null check". We need to investigate such cases.

Getting back to this one more time. Netty protects against most "use after release" ByteBuf bugs by setting the fields to null and the NPEs would be popping up as a sign of issues. Therefore it's extremely important that NPEs aren't suppressed with null checks.

Copy link
Contributor

@onobc onobc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @lhotari

My comments are mostly questions - 1 minor suggestion on naming. Other than that LGTM.

Also, because it is coupled w/ the "emergency brake" property , it seems harmless to add in .

The remaining piece will be to find out where we are currently doing the null checks and remove them.

return entry;
}

private static EntryImpl getEntryFromRecycler() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻 factoring this out makes it easier to follow (and add other common behaviors in one place later if needed - basic DRY stuff).

setRefCnt(1);
}

public static <T extends AbstractValidatingReferenceCounted> T getAndCheck(Recycler<T> recycler) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[NIT] It is not really doing a "check". In the ACRC it was called getEntryFromRecycler. I think it would be helpful to name them the same thing (sans "entry"), maybe getInstanceFromRecycler.

}

public final void resetRefCnt() {
setRefCnt(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unfortunate that setRefCnt is protected. Otherwise, this could have been done in a single util rather than inserted into the hierarchy (composition over inheritance). Also, the code in both hierarchies is almost identical.

  • The checkOnAccess var is just a static lookup.
  • The checkRefCount and getFromRecycler could be passed in the counter or recycler, respectively

Not the end of the world but my brain is having trouble leaving it alone.

@@ -1374,8 +1375,10 @@ protected ProducerImpl.ChunkedMessageCtx newObject(
};

public static ChunkedMessageCtx get(int totalChunks) {
ChunkedMessageCtx chunkedMessageCtx = RECYCLER.get();
chunkedMessageCtx.setRefCnt(totalChunks);
ChunkedMessageCtx chunkedMessageCtx = getAndCheck(RECYCLER);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is setting ref count to 1 and then retaining N-1 less performant than setting ref count to N?

OR

Is setting ref count to 1 and then retaining N-1 more functionally correct than setting ref count to N?

@lhotari
Copy link
Member Author

lhotari commented May 24, 2024

There are 2 ByteBuf reference count handling issues in Bookkeeper client that have been recently fixed: apache/bookkeeper#4289 and apache/bookkeeper#4293 .
These fixes are expected to be delivered in Bookkeeper 4.16.6 and 4.17.1 so that we could deliver the fixes to Pulsar releases 3.0.6, 3.2.4 and 3.3.1.

@lhotari
Copy link
Member Author

lhotari commented May 24, 2024

I recently discovered another bug pattern with ByteBufs which is due to an incorrect assumption of how Netty ByteBuf reference counting works for derived buffers.

Writing some pseudo code to explain:

ByteBuf buf = PulsarByteBufAllocator.DEFAULT.buffer(16);
...
ByteBuf cachedBuffer = buf.duplicate().retain();
...
buf.release();
...

The assumption could be that the above code is correct. Why is it wrong?

To understand how a duplicate buffer works, one could take a look at the source, in AbstractPooledDerivedByteBuf.
https://github.com/netty/netty/blob/5085cef149134951c94a02a743ed70025c8cdad4/buffer/src/main/java/io/netty/buffer/AbstractPooledDerivedByteBuf.java#L32-L37

The reference count of the duplicated buffer (extends AbstractPooledDerivedByteBuf in the case of pooled buffers) is independent of the parent buffer. It will call .release() on the parent buffer when the reference count of the duplicated buffer reaches 0. Understanding this will help use .retain() and .release() correctly when .duplicate() or .slice() is used.

@lhotari
Copy link
Member Author

lhotari commented May 31, 2024

One potential source of double release bugs is the race conditions in RangeCache implementation used for the broker cache. There are 2 PRs to address the race conditions: #22789 and #22814

@lhotari
Copy link
Member Author

lhotari commented Jun 19, 2024

The Netty 4.1.111.Final upgrade will prevent some problems in this area, more details in #22892

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-not-needed Your PR changes do not impact docs ready-to-test release/4.0.3
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants