Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[improve][fn] Add API endpoint for function-worker for liveness check with configurable flag #23829

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

mukesh154
Copy link
Contributor

@mukesh154 mukesh154 commented Jan 9, 2025

Motivation

This pull request introduces a health check functionality for Kubernetes deployments, specifically adding a liveness probe for the function worker. The liveness probe is crucial for Kubernetes-based applications, enabling automated pod restarts in case of failure. This change ensures that the function worker recovers when a ProducerFencedException occurs, which causes the worker to get stuck and not recover.

For instance, when a client makes a request like:

curl --location --request PUT 'https://localhost:6651/admin/v3/functions/test/test/test' --header 'Authorization: Bearer <token>' --header '...' --form '[email protected]'

For POST, PUT, and DELETE operations, the following error is returned under heavy load:

{"reason":"Internal Error updating function at the leader"}

And, when the following error occurs in the function worker currently:

ERROR org.apache.pulsar.functions.worker.FunctionMetaDataManager - Could not write into Function Metadata topic │
│ org.apache.pulsar.client.api.PulsarClientException$ProducerFencedException: Producer was fenced

The function worker does not recover, leading to an ongoing failure. With this update, the worker will automatically restart with the help of health check with liveliness probe upon encountering this error, ensuring proper recovery and continuity of operations.

Modifications

This update introduces an API endpoint to perform a liveliness check on the function worker pod. The API returns an HTTP status of 200 (OK) when the isLive flag within FunctionImpl is true. If the flag is false, typically after a ProducerFencedException occurs, the API will return a status of 503 (Service Unavailable).

The Kubernetes deployment configuration has been updated to use this new API endpoint in the liveness probe along with existing metrics endpoint, allowing the system to monitor the health and availability of the function worker.

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API: (no)
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)
  • The rest endpoints: (no)
  • The admin cli options: (no)
  • Anything that affects deployment: (no)

Documentation

Check the box below or label this PR directly (if you have committer privilege).

Need to update docs?

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Copy link

github-actions bot commented Jan 9, 2025

@mukesh154 Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

@github-actions github-actions bot added doc-not-needed Your PR changes do not impact docs and removed doc-label-missing labels Jan 9, 2025
@lhotari
Copy link
Member

lhotari commented Jan 9, 2025

The function worker does not recover, leading to an ongoing failure. With this update, the worker will automatically restart with the help of health check with liveliness probe upon encountering this error, ensuring proper recovery and continuity of operations.

Although adding a liveness check could be useful for many reasons, it would be better to primarily address this issue. Are you able to isolate and reproduce the issue? What Pulsar version are you running?

It looks like the leader election for the function worker uses a consumer with consumerEventListener to find who is the leader:

// the leaders service is using a `coordination` topic for leader election.
// we don't produce any messages into this topic, we only use the `failover` subscription
// to elect an active consumer as the leader worker. The leader worker will be responsible
// for scheduling snapshots for FMT and doing task assignment.
consumer = (ConsumerImpl<byte[]>) pulsarClient.newConsumer()
.topic(workerConfig.getClusterCoordinationTopic())
.subscriptionName(COORDINATION_TOPIC_SUBSCRIPTION)
.subscriptionType(SubscriptionType.Failover)
.consumerEventListener(this)
.property(WORKER_IDENTIFIER, consumerName)
.consumerName(consumerName)
.subscribe();

There might be a bug in the Pulsar implementation of notifying the active consumer change:

if (!pickAndScheduleActiveConsumer()) {
// the active consumer is not changed
Consumer currentActiveConsumer = getActiveConsumer();
if (null == currentActiveConsumer) {
if (log.isDebugEnabled()) {
log.debug("Current active consumer disappears while adding consumer {}", consumer);
}
} else {
consumer.notifyActiveConsumerChange(currentActiveConsumer);
}
}

if (closeFuture == null && !consumers.isEmpty()) {
pickAndScheduleActiveConsumer();
return;
}

At least by reading the code, it's hard to see how that could work.

@lhotari
Copy link
Member

lhotari commented Jan 9, 2025

At least by reading the code, it's hard to see how that could work.

Now I can see it. There are multiple references to notifyActiveConsumerChanged in other code locations. It seems that the solution works, but the code is just hard to understand. Found #1818 with some explanations.

@lhotari
Copy link
Member

lhotari commented Jan 9, 2025

How many function worker instances do you have when you encounter this problem?

@mukesh154
Copy link
Contributor Author

Although adding a liveness check could be useful for many reasons, it would be better to primarily address this issue. Are you able to isolate and reproduce the issue? What Pulsar version are you running?

Thanks for the feedback! I understand your point about addressing the primary issue first. Regarding the issue reproduction, I haven’t been able to isolate it myself yet. I'm using Pulsar 3.0, so it could potentially be related to that version.

@mukesh154
Copy link
Contributor Author

How many function worker instances do you have when you encounter this problem?

I have 2 function worker instances when I encounter the problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-not-needed Your PR changes do not impact docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants