-
Notifications
You must be signed in to change notification settings - Fork 514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
new StateBaseAsyncDoFn with State and Timers #5055
Comments
IMDG = ?
|
Thanks @kellen, I was checking it out too ;) , really interesting indeed, what I meant by IMDG, it is that we are using a Hazelcast cluster on GKE (with some mircoservices), and now prototyping in SCIO a similar data processing app. I am going through https://github.com/spotify/scio/blob/0eece133a773e7ff85fc9e7fdcad1bfc3593fc1d/scio-test/src/test/scala/com/spotify/scio/transforms/AsyncLookupDoFnTest.scala#L273C33-L273C46, based on BaseAsyncLookupDoFn I am trying to figure out an implementation: I do not see many examples but
I was also wondering TTL for each Element, (there would be some uses cases when eviction comes about could be released, e.g: PubSub; with Timers and @ontime we could have some interesting control / output for each Element. Cheers |
I still don't think I know what "IMDG" means here. You are correct that |
IMDG stands for In-Memory Data Grid |
Hi Guys! I was able to have a functional State & Timer, for duplicates control; we could have "in memory/or with a low latency" duplicate checks of +1M events (just when using GlobalWindows for PubSub events, otherwise I have not been able to keep state among calls with FixedWindows and no Triggers):
thus, we would have not to deal with a cache supplier (+1M it could lead to memory issues) and it could be easier to scale (Vertical autoscale it is only available in DataFlow Prime) Thanks for reviewing! P.S.2: I am curently testing in DataFlow with multiple workers P.S.3: I am facing some race conditions when records are mocked at the same time (e.g: scio-8000000246 and beam-8000000246 ): it could be elminated with distinct here |
regarding this potential enhancement S & T for BaseAsyncDoFn, this has been published last week https://medium.com/@serna.alberto.eng/avoid-http-requests-duplicates-in-apache-beam-with-scio-a-custom-baseasyncdofn-and-state-and-2c7d63059ab3 hope it helps @kellen , @RustedBones |
Feature Request, motivated by BaseAsyncDoFn and KV lookups to avoid duplicates (instead of using Redis, BigTable, Hazelcast IMDG, etc)
https://github.com/albertols/scio-db/blob/develop/src/main/scala/com/db/myproject/async/http/state/StateBaseAsyncDoFn.java
protected abstract boolean alreadySent(@StateId("buffer") MapState<InputT, OutputT> buffer, InputT element);
and this should be also abstract: https://github.com/albertols/scio-db/blob/develop/src/main/scala/com/db/myproject/async/http/state/StateBaseAsyncDoFn.java#L166
protected abstract void addIdempotentElementInBuffer(MapState<InputT, OutputT> buffer, InputT input, OutputT output);
NOTE: GlobalWindow is needed (before applying
applyTransform(ParDo.of(new StateAsyncParDoWithAkka(mediationConfig))).map { m =>)
, in order to keepthe state amomng processElementsHappy to hear other alternatives, we are trying to cut IMDG and extenral idempotent lookups? it would be really cool 👍🏼
Thanks SCio Team!!
The text was updated successfully, but these errors were encountered: