Release/3.3.0 #437

Merged
merged 5 commits into master from release/3.3.0 on Jan 7, 2025

Conversation

AlexBenny
Contributor

No description provided.

peel and others added 5 commits January 7, 2025 13:06
Introduces version updates to third-party components, along with the refreshed SLULA license 1.1.
Previously it was possible to send arbitrarily large payloads to the collector.
Now we add two mechanisms for handling large requests.
First, `networking.maxPayloadSize` (defaulting to 1048576 bytes = 1 MB) will stream the
input and generate a SizeViolation bad event.
Second, `networking.dropPayloadSize` (defaulting to 2097152 bytes = 2 MB) will return
HTTP 413 and will not attempt to process the request.

The two mechanisms work hand in hand, allowing for a safe but tolerant
configuration.
If requests exceed `sink.maxBytes`, the collector will attempt to split them into
separate events. However, if the payload is bigger than `maxPayloadSize`, parsing the
body is risky and, in case of an attack, may result in failures. The collector
will not attempt to parse the request and split it up, but will generate a
SizeViolation bad event. If the payload exceeds `dropPayloadSize`, the request
will not be handled at all and no bad event is emitted; instead, HTTP 413 is returned.
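
For illustration, here is a minimal Scala sketch of how the two thresholds are meant to interact for a request body of a given size. The names (`PayloadDecision`, `decide`) are hypothetical and not the collector's actual code; only the configuration keys and defaults come from the description above.

```scala
// Hypothetical sketch of the decision order described above.
sealed trait PayloadDecision
case object ProcessNormally   extends PayloadDecision // may still be split per sink.maxBytes
case object EmitSizeViolation extends PayloadDecision // streamed, not parsed; bad event emitted
case object RejectWith413     extends PayloadDecision // dropped outright, no bad event

def decide(
  contentLength: Long,
  maxPayloadSize: Long = 1048576L,  // networking.maxPayloadSize, 1 MB
  dropPayloadSize: Long = 2097152L  // networking.dropPayloadSize, 2 MB
): PayloadDecision =
  if (contentLength > dropPayloadSize) RejectWith413
  else if (contentLength > maxPayloadSize) EmitSizeViolation
  else ProcessNormally
```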
In #431 we improved performance of the Kafka
sink by calling `producer.send()` on a compute thread rather than on a blocking
thread. This is a great improvement, provided it is always true that
`producer.send()` never blocks.

But `producer.send()` does in fact block if the producer needs to
re-fetch topic metadata. This happens every 5 minutes by default, and
is configured by the Kafka setting `metadata.max.age.ms`. If
`producer.send()` blocks on a compute thread, it can potentially
cause thread starvation and negatively impact the collector's
responsiveness to requests.

This PR changes the collector to execute `send()` on a dedicated thread. From the
point of view of the collector, it is ok if `send()` blocks on the
dedicated thread. It is better than running it inside `Sync[F].blocking`
(which is what we did before), because that tended to create a huge
number of threads under some circumstances.

In theory this shouldn't negatively affect performance much, even though
it's a single thread, because most of the time `send()` does not block;
and when it does block (i.e. once every 5 minutes), it is ok for the
other events to get enqueued as a backlog of Callables on that thread.
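
As a rough illustration, here is a Scala sketch of the dedicated-thread approach, assuming cats-effect 3 and the Kafka Java client; the helper names (`sendEc`, `sendOnDedicatedThread`) are hypothetical and not the collector's actual code.

```scala
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
import cats.effect.{Async, Resource, Sync}
import cats.syntax.all._
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// A single-thread ExecutionContext dedicated to producer.send(): if send()
// blocks on a metadata refresh, later sends queue behind it instead of
// starving the compute pool.
def sendEc[F[_]: Sync]: Resource[F, ExecutionContext] =
  Resource
    .make(Sync[F].delay(Executors.newSingleThreadExecutor()))(es => Sync[F].delay(es.shutdown()))
    .map(ExecutionContext.fromExecutorService)

def sendOnDedicatedThread[F[_]: Async](
  producer: KafkaProducer[String, Array[Byte]],
  record: ProducerRecord[String, Array[Byte]],
  ec: ExecutionContext
): F[Unit] =
  // evalOn shifts only this step onto the dedicated thread; the rest of the
  // request handling stays on the compute pool.
  Async[F].evalOn(Async[F].delay(producer.send(record)).void, ec)
```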

Kafka sink uses a dedicated thread for potentially blocking send: Amendment 1

### Security
- Add limit for allowed payload sizes [snowplow/stream-collector-private#1]
- Security and license updates [snowplow/stream-collector-private#1]
Contributor


We probably don’t need these links?

Contributor Author


I'd like to keep the changes to the content minimal, as it will be hard to manage both origins if the content differs. Unless strictly required, I would keep them there.

@AlexBenny AlexBenny merged commit 0af99f7 into master Jan 7, 2025
3 checks passed
@AlexBenny AlexBenny deleted the release/3.3.0 branch January 7, 2025 15:40