[CB] Split token streaming and generation to different threads for all CB based pipelines #1544

Open · wants to merge 30 commits into base: master

Conversation

@iefode (Contributor) commented Jan 14, 2025

@ilya-lavrenov self-assigned this Jan 14, 2025
@ilya-lavrenov added this to the 2025.1 milestone Jan 16, 2025
@github-actions bot added the 'category: tokenizers' (Tokenizer class or submodule update) and 'category: GenAI C++ API' (Changes in GenAI C++ public headers) labels Jan 17, 2025
@iefode marked this pull request as ready for review January 17, 2025 15:25
@@ -38,7 +38,7 @@ class GenerationStream {
     }

     bool can_read() {
-        return !m_output_queue.empty();
+        return !m_output_queue.empty() && !m_output_queue.full();
Contributor:
Could you please explain this?
Logically, we should be able to read from the queue even when it's full.

Contributor Author:

So that each element is read only once. Otherwise we could take the same element several times (i.e. print it more than once).

Contributor:

I still don't understand: if the stream is full and we cannot read from it, it will always stay full.

Shouldn't the original issue be fixed somewhere else?

Contributor Author:

No, it stays full only when the generation stream is not being updated. Please check src/cpp/src/synchronized_queue.hpp for the details.

This is a good place to handle it.
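
For readers following this thread, below is a minimal, self-contained sketch of a bounded synchronized queue with empty()/full() predicates, only to illustrate what can_read() is gating. This is an assumption-based illustration: the real class lives in src/cpp/src/synchronized_queue.hpp and may differ in names, capacity handling and behavior.

// Minimal sketch of a bounded synchronized queue; illustration only.
// The class name, capacity handling and method set here are assumptions,
// not the actual src/cpp/src/synchronized_queue.hpp implementation.
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

template <typename T>
class SynchronizedQueueSketch {
    std::queue<T> m_queue;
    mutable std::mutex m_mutex;
    std::condition_variable m_cv;
    std::size_t m_capacity;

public:
    explicit SynchronizedQueueSketch(std::size_t capacity = 1) : m_capacity(capacity) {}

    bool empty() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_queue.empty();
    }

    bool full() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_queue.size() >= m_capacity;
    }

    // push() does not block on capacity in this sketch; full() is purely advisory.
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push(std::move(value));
        }
        m_cv.notify_one();
    }

    // pull() blocks until an element is available, then removes and returns it,
    // so each element is consumed exactly once.
    T pull() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return !m_queue.empty(); });
        T value = std::move(m_queue.front());
        m_queue.pop();
        return value;
    }
};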

stream_tokens();
});

while (!generation->is_dropped() && has_active_request) {
Contributor:
step() will drop all non-running requests via _free_non_running_requests() (including dropped ones). Similarly, drop_requests() will remove all requests. So !generation->is_dropped() does not seem to be required here.

If it is meant to handle generation->drop() called from the stream_tokens lambda, then it's undefined behavior, because a handle can be dropped right after this condition has passed.

It looks like the handle's status (dropped or not) is really a shared critical resource and must be used safely between the main and streamer threads. So we need to ensure that the current step() works correctly even if the handle is dropped.
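
To make the concern above concrete, here is a small sketch of the race. The names below are hypothetical stand-ins, not the actual GenerationHandle API: even with an atomic dropped flag, a check-then-act in the main thread is only a snapshot, so step() itself has to tolerate a request that becomes dropped mid-iteration.

// Hypothetical sketch of a drop flag shared between the main (generation)
// thread and the streamer thread; not the real GenerationHandle API.
#include <atomic>

class GenerationHandleSketch {
    std::atomic<bool> m_dropped{false};
public:
    void drop() { m_dropped.store(true, std::memory_order_release); }
    bool is_dropped() const { return m_dropped.load(std::memory_order_acquire); }
};

// Even with the atomic flag, this pattern in the main thread is only a snapshot:
//
//     if (!handle.is_dropped()) {
//         step();   // the streamer thread may call handle.drop() right here
//     }
//
// so the loop condition alone cannot guarantee step() never sees a dropped
// handle; step() must stay correct when a request is dropped concurrently.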

Contributor:

FYI, in PR #1594 I've tried to remove the misuse of handle_dropped() (see the num_finished_seqs() method) to ensure step() is not affected by handle_drop() during the schedule, model runner and sampler phases.

Now it is safe to drop() a request and step() will keep working.

@@ -299,8 +305,38 @@ ContinuousBatchingPipeline::ContinuousBatchingImpl::generate(const std::vector<o
     }
     auto all_requests = m_awaiting_requests; // we need to store all requests to get results from them once generation has finished

-    bool continue_generation = true;
-    while (has_non_finished_requests() && continue_generation) {
+    std::atomic<bool> has_active_request = has_non_finished_requests();
Contributor:
Suggested change:
-    std::atomic<bool> has_active_request = has_non_finished_requests();
+    std::atomic<bool> has_active_requests = has_non_finished_requests();

};

// to define streaming thread
std::thread t_stream([&stream_tokens] {
Contributor:
If streamer_ptr is nullptr, we don't need this thread.
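
A minimal sketch of that suggestion, assuming an std::optional<std::thread> wrapper; the streamer type and the stream_tokens body below are placeholders rather than the pipeline's real types. The point is only that the streaming thread is spawned when a streamer is actually set.

// Sketch: create the streaming thread only when a streamer exists.
#include <functional>
#include <iostream>
#include <memory>
#include <optional>
#include <thread>

int main() {
    // Placeholder for the real streamer object; nullptr means no streaming.
    std::shared_ptr<std::function<void(int)>> streamer_ptr;

    auto stream_tokens = [&] {
        // Placeholder body: would pull tokens from the generation stream
        // and feed them to streamer_ptr.
        std::cout << "streaming tokens\n";
    };

    std::optional<std::thread> t_stream;
    if (streamer_ptr) {                  // only spawn the thread when a streamer is set
        t_stream.emplace(stream_tokens);
    }

    // ... generation loop (step() calls) would run here ...

    if (t_stream && t_stream->joinable()) {
        t_stream->join();
    }
}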

for (const auto& gen_token : token.begin()->second.generated_ids) {
if (streamer_ptr->put(gen_token)) {
generation->drop();
cv.notify_all();
Contributor:
Who is notified here? cv.wait() is used only in the current thread, so this notification does not seem to be required.

src/cpp/src/continuous_batching_impl.cpp (outdated comment, resolved)
while (!generation->is_dropped() && (has_active_request || streamer_ptr && generation->can_read())) {
// waiting for any tokens or request finishing
cv.wait(lock, [&generation, &has_active_request]{ return generation->can_read() || !has_active_request; });
if (streamer_ptr && generation->can_read()) {
Contributor:

Suggested change:
-    if (streamer_ptr && generation->can_read()) {
+    if (generation->can_read()) {

Let's avoid creating the thread at all when the streamer is nullptr.

throw;
}
has_active_request = has_non_finished_requests();
cv.notify_all();
Contributor:
Suggested change:
-    cv.notify_all();
+    cv.notify_one();

since we have only one waiting thread
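
For reference, below is a self-contained sketch of the wait/notify pattern under discussion, with simplified names (the real code uses GenerationStream, has_non_finished_requests() and the streamer callback): the generation loop pushes tokens and wakes the single streaming thread, so notify_one() is enough.

// Sketch of the single-producer / single-consumer notify pattern;
// the queue and token ids stand in for the real generation stream.
#include <atomic>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

int main() {
    std::mutex mtx;
    std::condition_variable cv;
    std::queue<int> tokens;                       // produced token ids
    std::atomic<bool> has_active_request{true};

    std::thread t_stream([&] {
        std::unique_lock<std::mutex> lock(mtx);
        while (has_active_request || !tokens.empty()) {
            // wait for new tokens or for generation to finish
            cv.wait(lock, [&] { return !tokens.empty() || !has_active_request; });
            while (!tokens.empty()) {
                std::cout << "token " << tokens.front() << '\n';
                tokens.pop();
            }
        }
    });

    for (int step = 0; step < 5; ++step) {        // stand-in for the step() loop
        {
            std::lock_guard<std::mutex> lock(mtx);
            tokens.push(step);
        }
        cv.notify_one();                           // one waiting thread, so notify_one suffices
    }
    {
        std::lock_guard<std::mutex> lock(mtx);
        has_active_request = false;               // no more active requests
    }
    cv.notify_one();
    t_stream.join();
}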

…de/openvino.genai into cb_streaming_and_generation_threads
@iefode requested a review from ilya-lavrenov January 20, 2025 08:33
@github-actions bot added the 'category: GHA' (CI based on Github actions) label Jan 21, 2025
@ilya-lavrenov marked this pull request as draft January 21, 2025 14:02
@github-actions bot added the 'category: LLM' (LLM pipeline (stateful, static)) and 'category: Python API' (Python API for GenAI) labels and removed the 'category: GHA' (CI based on Github actions) label Jan 22, 2025
@iefode marked this pull request as ready for review January 22, 2025 19:00
Labels:
- category: continuous batching (Continuous batching)
- category: GenAI C++ API (Changes in GenAI C++ public headers)
- category: LLM (LLM pipeline (stateful, static))
- category: prompt lookup
- category: Python API (Python API for GenAI)
- category: samples (GenAI samples)
- category: speculative decoding (Speculative decoding)
- category: tokenizers (Tokenizer class or submodule update)
Projects: None yet

2 participants