
Concurrency optimization for native graph loading #2345

Open · wants to merge 1 commit into main from concurrent-graph-load
Conversation

@Gankris96 commented Dec 19, 2024

Description

Fixes #2265

Refactors the graph load into a two-step approach, detailed in #2265 (comment).

This moves the opening of the IndexInput file outside the synchronized block, so the graph file can be downloaded in parallel even though the graph load and createIndexAllocation remain inside the synchronized block.
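For illustration, a minimal sketch of the two-step pattern. Names like openVectorIndex and createIndexAllocation follow the snippets quoted later in this conversation; the lock shape and method signatures here are assumptions, not the exact diff:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Sketch only: illustrates the two-step load, not the PR's exact code.
class TwoStepGraphLoader {
    private final ConcurrentHashMap<String, ReentrantLock> indexLocks = new ConcurrentHashMap<>();

    long load(String key) {
        // Step 1: open the graph file with no shared lock held, so remote-backed
        // indices can download their files in parallel across threads.
        byte[] graphFile = openVectorIndex(key);

        // Step 2: keep allocation and the native load behind a per-key lock,
        // so each graph is loaded into native memory at most once.
        ReentrantLock lock = indexLocks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock();
        try {
            return createIndexAllocation(graphFile);
        } finally {
            lock.unlock();
        }
    }

    private byte[] openVectorIndex(String key) {
        return new byte[0]; // placeholder: opens (and possibly downloads) the file
    }

    private long createIndexAllocation(byte[] graphFile) {
        return 0L; // placeholder: loads the graph into native memory
    }
}
```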

Related Issues

Resolves #2265

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@Gankris96 force-pushed the concurrent-graph-load branch from 7cb8710 to 8e90b88 on December 19, 2024 01:30
@Gankris96 changed the title from "Concurrency optimization for graph native loading" to "Concurrency optimization for native graph loading" on Dec 19, 2024
@navneet1v (Collaborator)

Please add an entry in the changelog.

@0ctopus13prime (Collaborator)

Hi @Gankris96, thank you for the PR.
I can see this will clearly benefit cases where multiple threads are competing with each other. But just curious: after this fix, how much performance gain do you see?

@Gankris96 (Author) commented Dec 20, 2024

Hi @0ctopus13prime, yes, I am working on getting the benchmarking numbers, primarily testing on a remote-store-backed index to see the performance gains.

Will update with benchmarking numbers soon.

@navneet1v (Collaborator)

@Gankris96 please fix the failing CIs.

@Gankris96 force-pushed the concurrent-graph-load branch 6 times, most recently from f981b83 to 79392bc on December 31, 2024 01:15
@Gankris96 force-pushed the concurrent-graph-load branch from 33afd58 to ecdb8fa on January 9, 2025 01:05
@navneet1v (Collaborator) commented Jan 9, 2025

> The searchable snapshot case shows improvement @navneet1v

Great, this is what was expected.

I think the first query shows the improvement:

| Query # | Without Fix (ms) | With Fix (ms) | Delta (ms) | Improvement |
|---------|------------------|---------------|------------|-------------|
| Query 1 | 3127.25          | 108.58        | -3018.67   | -96.5%      |

because that is when all the graph files are downloaded from the remote store. Once they are downloaded, the gains drop off; this fix targets the first query, or queries for which the graph files have been evicted from disk.

@Vikasht34 (Contributor) left a comment

Could we make our unit and integration tests more robust for the code changes we are making? For example (a sketch of case 1 follows the list):

1. For isIndexGraphFileOpened(): ensure that openVectorIndex does nothing if the index graph file is already opened.
2. Verify that the method extracts the vector file name correctly and proceeds to load the index without errors.
3. Pass an invalid cache key that does not contain a vector file name and verify that the method throws an IllegalStateException with the correct error message.
4. Mock the directory.openInput method to return a valid IndexInput and verify that readStream and indexInputWithBuffer are initialized correctly.
5. Verify that readStream.seek(0) is called successfully.
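A hedged sketch of case 1, using JUnit 4. FakeContext is a hypothetical stand-in for the plugin's NativeMemoryEntryContext; the real test would exercise that class directly:

```java
import static org.junit.Assert.assertEquals;

import java.util.concurrent.atomic.AtomicInteger;

import org.junit.Test;

public class OpenVectorIndexTests {

    // Hypothetical stand-in for NativeMemoryEntryContext.
    static class FakeContext {
        private boolean indexGraphFileOpened = false;
        final AtomicInteger openCalls = new AtomicInteger();

        void openVectorIndex() {
            if (indexGraphFileOpened) {
                return; // case 1: already opened, so do nothing
            }
            openCalls.incrementAndGet(); // stands in for directory.openInput(...)
            indexGraphFileOpened = true;
        }
    }

    @Test
    public void testOpenVectorIndexIsNoOpWhenAlreadyOpened() {
        FakeContext context = new FakeContext();
        context.openVectorIndex();
        context.openVectorIndex(); // second call must be a no-op
        assertEquals(1, context.openCalls.get());
    }
}
```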

```diff
@@ -350,7 +352,11 @@ public NativeMemoryAllocation get(NativeMemoryEntryContext<?> nativeMemoryEntryC
                 return result;
             }
         } else {
-            return cache.get(nativeMemoryEntryContext.getKey(), nativeMemoryEntryContext::load);
+            // open graphFile before load
+            try (nativeMemoryEntryContext) {
```
Contributor:

There could be a case where multiple threads trigger eviction and graph loading concurrently, leading to temporary spikes in memory usage. Can we think of using bounded concurrency for eviction and graph-loading tasks with thread pools?
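For illustration, a hedged sketch of one way to bound load concurrency with a fixed-size executor, as the comment suggests. The pool size, class, and method names are assumptions, not the plugin's code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: routes graph loads through a fixed-size pool so at most
// MAX_LOADS native loads run at once; excess requests queue in the executor
// instead of all allocating native memory simultaneously.
class BoundedGraphLoadExecutor {
    private static final int MAX_LOADS = 4; // assumed bound
    private final ExecutorService pool = Executors.newFixedThreadPool(MAX_LOADS);

    Future<Long> submitLoad(String key) {
        return pool.submit(() -> loadGraph(key));
    }

    private long loadGraph(String key) {
        return 0L; // placeholder for the actual open + native load
    }
}
```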

@Gankris96 (Author):

Will take it up in a separate issue.

Member:

This is a fair callout. I think we need to improve our cache operations in general. The problem right now is that cache operations can be async in nature (cleanup, eviction), whereas we use the cache as a 1:1 reference for the off-heap memory in use. We can create a tracking issue and deal with this separately.

```diff
-            return cache.get(nativeMemoryEntryContext.getKey(), nativeMemoryEntryContext::load);
+            // open graphFile before load
+            try (nativeMemoryEntryContext) {
+                nativeMemoryEntryContext.openVectorIndex();
```
Contributor:

Can we avoid the case where the graph is partially loaded or an error occurs during loading, which ends up leaving the cache in an inconsistent state? Can we ensure atomicity in graph loading and only put an entry in the cache if loading succeeds?

@Gankris96 (Author):

If there is an error in graph loading, the entry will not be in the cache. What would be the scenario where the cache ends up in an inconsistent state?
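(For context, a hedged illustration of that guarantee, using Caffeine as a stand-in for the plugin's actual cache: compute-if-absent only inserts the mapping after the loader returns, so a loader that throws leaves no entry behind.)

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

class FailedLoadLeavesNoEntry {
    public static void main(String[] args) {
        Cache<String, Long> cache = Caffeine.newBuilder().maximumSize(100).build();
        try {
            // The loader throws, so no mapping is ever created for this key.
            cache.get("graph-file", key -> { throw new IllegalStateException("load failed"); });
        } catch (IllegalStateException e) {
            // expected: the failed load propagates to the caller
        }
        // With no mapping created, getIfPresent returns null: no partial entry remains.
        System.out.println(cache.getIfPresent("graph-file")); // prints: null
    }
}
```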

Member:

@Gankris96 Can we wrap this call behind the same lock-based logic above, just to make sure we do not open the same index files concurrently in two different threads?

@Gankris96 (Author):

Wrapping this within a lock still seems to fail some bwc search tests, where we end up getting incorrect results. Even so, it would not really help, because it does not solve the underlying problem of multiple graph files getting loaded at the same time now that the load is no longer synchronized.
This probably requires revisiting in a separate issue where we refactor the whole cache strategy, imo.

Member:

Please create an issue so that we can track it.

@Gankris96 (Author):

It turns out the bwc failure was a different issue, unrelated to this. I added back the locking logic for this as well; it seems to work fine, so we can keep it in.

navneet1v previously approved these changes Jan 13, 2025
@navneet1v self-requested a review January 13, 2025 19:07
@navneet1v (Collaborator)

@Gankris96 can you please fix the CIs

@Gankris96 force-pushed the concurrent-graph-load branch 2 times, most recently from 68c067b to daae55d on January 17, 2025 02:39
@Gankris96 (Author)

@Vikasht34 @kotwanikunal @navneet1v I have updated with some additional locking around openVectorIndex and added some UTs for it. Please take a look.

@Vikasht34 (Contributor) left a comment

Thanks for addressing the comments. Looks good to me.

@Gankris96 force-pushed the concurrent-graph-load branch 2 times, most recently from 9edec4e to 2d90b20 on January 21, 2025 22:15
@Gankris96 force-pushed the concurrent-graph-load branch 7 times, most recently from d8e017a to 4fdd28e on January 22, 2025 22:58
@Gankris96 force-pushed the concurrent-graph-load branch from 4fdd28e to 4dc8449 on January 23, 2025 03:51
@Gankris96 (Author)

@navneet1v @jmazanec15 please take a look and approve if all looks good.

@shatejas (Collaborator) left a comment

Looks good overall. Some comments related to code maintenance.

Comment on lines +352 to +355:

```java
ReentrantLock indexFileLock = indexLocks.computeIfAbsent(key, k -> new ReentrantLock());
indexFileLock.lock();
nativeMemoryEntryContext.openVectorIndex();
indexFileLock.unlock();
```

@shatejas (Collaborator):

Can we please have a private method openIndex() here, so this is taken care of as and when the code changes?
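A hedged sketch of the suggested extraction; the signature is an assumption, and wrapping the call in try/finally additionally guarantees the lock is released if openVectorIndex throws:

```java
// Sketch only: field and method names follow the snippet above.
private void openIndex(String key, NativeMemoryEntryContext<?> nativeMemoryEntryContext) {
    ReentrantLock indexFileLock = indexLocks.computeIfAbsent(key, k -> new ReentrantLock());
    indexFileLock.lock();
    try {
        nativeMemoryEntryContext.openVectorIndex();
    } finally {
        indexFileLock.unlock(); // released even if openVectorIndex throws
    }
}
```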

Comment on lines +360 to +366:

```java
// recheck if another thread already loaded this entry into the cache
result = cache.getIfPresent(key);
if (result != null) {
    accessRecencyQueue.remove(key);
    accessRecencyQueue.addLast(key);
    return result;
}
```

@shatejas (Collaborator):

A private method for this as well? There will be an additional null check, but every time get returns, accessRecency should be updated.
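A hedged sketch of such a helper; the names follow the snippet above, and the signature is an assumption:

```java
// Sketch only: returns null when the key is absent, so callers keep their
// existing null check, but every cache hit updates recency in one place.
private NativeMemoryAllocation getAndUpdateRecency(String key) {
    NativeMemoryAllocation result = cache.getIfPresent(key);
    if (result != null) {
        // move the key to the tail so it counts as most recently used
        accessRecencyQueue.remove(key);
        accessRecencyQueue.addLast(key);
    }
    return result;
}
```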

@navneet1v (Collaborator)

Overall looks good to me, and I agree with @shatejas's comments. Please resolve them so that we can ship this change.

Successfully merging this pull request may close these issues:

[FEATURE] Concurrency optimizations with native memory graph loading and force eviction