Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add thread to periodically perform pending cache maintenance #2308

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

owenhalpert
Copy link

@owenhalpert owenhalpert commented Dec 3, 2024

Description

Adds a ScheduledExecutor class that, for each cache, takes in a Runnable with a call to cleanUp and periodically executes according to the values of KNN_CACHE_ITEM_EXPIRY_TIME_MINUTES and QUANTIZATION_STATE_CACHE_EXPIRY_TIME_MINUTES. This will perform any pending maintenance (such as evicting expired entries) which was previously only performed when the cache was accessed. The maintenance thread is created whenever a NativeMemoryCache or QuantizationStateCache is instantiated or rebuilt and can be shut down with either class's close method. Relevant logic for cleanup was added to some testing base classes.

Related Issues

Resolves #2239

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@kotwanikunal
Copy link
Member

@owenhalpert The changes look good in general. What I'd be interested in is testing under load for a resource constrained system - can you verify if this adds to the latency or impact the performance in any way?

We did implement a force evict before writes with #2015. Can you also enable this feature flag and run the above tests to ensure it behaves well?

@owenhalpert
Copy link
Author

@kotwanikunal I've completed the benchmarking on a single node cluster limited to 3GB of memory.

Before benchmarking my code, I validated the CacheMaintainer was actually running by inspecting the OpenSearch process status in the Docker container and found that after about a minute, the RssAnon value would decrease by about 0.5gb without any interference on my part, signaling the maintainer successfully cleaned up the expired entries. This is aligned with what I saw on the standard maintenance on the clean code (which I triggered manually by accessing the cache).

Results below:

Performance Summary for Resource-Constrained Testing (3GB Memory Limit)

Run 1: Clean 2.18 code

  • Indexing:
    • p50 latency: 16.96 ms
    • p90 latency: 32.47 ms
  • Search:
    • p50 latency: 330.48 ms
    • p90 latency: 437.78 ms

Run 2: Clean 2.18 code, Force evict ON

  • Indexing:
    • p50 latency: 17.27 ms
    • p90 latency 33.82 ms
  • Search:
    • p50 latency 337.43 ms
    • p90 latency 428.70 ms

Run 3: PR changes added, Force evict ON:

  • Indexing:
    • p50 latency: 16.89 ms
    • p90 latency: 32.24 ms
  • Search:
    • p50 latency: 341.38 ms
    • p90 latency: 414.06 ms

Run 4: PR changes added, Force evict OFF:

  • Indexing:
    • p50 latency: 17.43 ms
    • p90 latency: 34.63 ms
  • Search:
    • p50 latency: 346.83 ms
    • p90 latency: 430.04 ms

This suggests there is no significant impact on latency with my code changes. I've included the full results of these test runs here:

https://gist.github.com/owenhalpert/05ad4f5ae9577f717f2c59f2039d52e4

@owenhalpert owenhalpert changed the title Add CacheMaintainer class to perform pending cache maintenance every minute Add thread to perform pending cache maintenance every minute Dec 19, 2024
@owenhalpert owenhalpert force-pushed the cache-maintenance branch 2 times, most recently from dc61f6e to 1ca8dff Compare December 19, 2024 19:20
@owenhalpert owenhalpert marked this pull request as ready for review December 19, 2024 20:09
@owenhalpert owenhalpert changed the title Add thread to perform pending cache maintenance every minute Add thread to periodically perform pending cache maintenance Dec 21, 2024
Copy link
Member

@jmazanec15 jmazanec15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments overall looks good

try {
cache.cleanUp();
} catch (Exception e) {
logger.error("Error cleaning up cache", e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why swallow exception here?

Copy link
Author

@owenhalpert owenhalpert Dec 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any exceptions from Guava cache operations would otherwise halt our scheduled executor. If an exception occurs here, it would be from Guava's internals rather than our logic, so per Dooyong's suggestion above we can log it and continue scheduling cleanup tasks.

This way we can ensure the cache maintenance keeps running even if individual cleanup attempts fail, and we can monitor Guava errors in the logs.


TimeValue interval = KNNSettings.state().getSettingValue(KNNSettings.KNN_CACHE_ITEM_EXPIRY_TIME_MINUTES);

if (threadPool != null) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to log an error in else if its null. I think throwing is an option - @kotwanikunal thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think logging makes more sense here since the operations are unaffected without the maintenance thread.
We don't want the OS process to fail because of an auxiliary process.

Also, please move the runnable definition within the if block, given it would not be used otherwise.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll move the call to startMaintenance into a null check on threadPool because the entire method has no functionality otherwise.

@owenhalpert owenhalpert force-pushed the cache-maintenance branch 2 times, most recently from 3eb58ba to a7e5ff4 Compare January 9, 2025 23:18
@owenhalpert
Copy link
Author

The Github Actions were showing some failures in the Quantization State Cache unit tests, presumably because of concurrent test runs setting/shutting down the thread pool. To address this I added a thread pool that is created and terminated only within the two cache unit test classes to be used for those tests.

Copy link
Collaborator

@shatejas shatejas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall

Signed-off-by: owenhalpert <[email protected]>
Signed-off-by: owenhalpert <[email protected]>
This reverts commit 66829ea.

Signed-off-by: owenhalpert <[email protected]>
spotlessApply

Signed-off-by: owenhalpert <[email protected]>
Signed-off-by: owenhalpert <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Perform cache maintenance in a separate thread
5 participants