Ensure graph is good
GindaChen committed Mar 18, 2024
1 parent 0bfad16 commit b1692e5
Showing 5 changed files with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions content/blogs/distserve/index.md
@@ -73,7 +73,7 @@ Before we dive deeper, let’s revisit the lifecycle of a request in LLM serving

LLM serving systems usually batch prefill and decoding together using a technique called [**iteration-level scheduling**](https://www.usenix.org/conference/osdi22/presentation/yu) or [**continuous batching**](https://www.anyscale.com/blog/continuous-batching-llm-inference#continuous-batching): the GPU processes as large a batch as possible, runs one iteration, and generates one token for each request in the batch. This technique effectively improves the overall throughput (tokens per second) and is widely adopted in popular serving systems such as vLLM and TensorRT-LLM.
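
To make the scheduling policy concrete, here is a minimal sketch of an iteration-level scheduling loop. It is our own illustration with a hypothetical `engine.step` call, not vLLM's or TensorRT-LLM's actual API:

```python
# A minimal sketch (not a real scheduler) of iteration-level scheduling:
# every iteration, the engine rebuilds the batch from whatever requests are
# live, runs one model step, and emits one token per running request.
from collections import deque

def continuous_batching_loop(engine, waiting: deque, max_batch_size: int):
    running = []
    while waiting or running:
        # Admit new requests up to the batch budget; a finished request frees
        # its slot immediately instead of blocking the whole batch.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        # One iteration: newly admitted requests run their prefill, existing
        # ones decode one more token. `engine.step` is a hypothetical call.
        outputs = engine.step(running)

        # Keep only the requests that have not finished generating.
        running = [req for req, out in zip(running, outputs) if not out.finished]
```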

{{< image src="img/distserve-anime-colocate-crop.gif" alt="prefill_decode_process" width="100%" title="Figure 2. How requests get processed.">}}
{{< image src="img/distserve_anime_colocate.gif" alt="prefill_decode_process" width="100%" title="Figure 2. How requests get processed.">}}


However, **the two phases have very distinct computational characteristics.** Prefill is easily compute-bound: a small batch of requests, or even a single long enough request, saturates the GPU's compute very quickly. Decoding, on the other hand, needs a much bigger batch size to hit the compute bound and is more easily limited by the GPU's memory bandwidth.
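
To see why, a rough roofline estimate helps. The hardware numbers below are our own ballpark figures for an A100, not from the post:

```python
# Back-of-the-envelope roofline: a forward pass over T tokens of an
# N-parameter model costs ~2*N*T FLOPs and must read the ~2*N bytes of fp16
# weights at least once, so arithmetic intensity per weight byte is roughly T.
A100_FLOPS = 312e12   # fp16 tensor-core peak, A100 (approximate)
A100_BW    = 2.0e12   # HBM bandwidth in bytes/s, A100-80GB (approximate)

crossover = A100_FLOPS / A100_BW   # ~156 FLOPs per weight byte

prefill_tokens = 512   # one long prompt already exceeds the crossover
decode_tokens  = 1     # one token per request; needs a batch of ~156 requests

print(f"crossover intensity ≈ {crossover:.0f} tokens per weight read")
print("prefill compute-bound:", prefill_tokens > crossover)   # True
print("decode  compute-bound:", decode_tokens  > crossover)   # False
```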
@@ -113,7 +113,7 @@ The intuition is simple: disaggregating prefill and decode into different GPUs a

Figure 5 illustrates how a request is processed in such a disaggregated system. When a request arrives in the system, it first goes to a **prefill worker** and completes its prefill phase. After its intermediate states (mainly [KV Cache](https://medium.com/@joaolages/kv-caching-explained-276520203249)) migrate to a **decode worker,** multiple decode steps are taken to generate subsequent tokens. The request leaves the system once it finishes generation.
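
A minimal sketch of this flow, using hypothetical `prefill_worker` / `decode_worker` methods rather than DistServe's real interfaces, might look like:

```python
# A sketch (hypothetical APIs, not DistServe's actual code) of the
# disaggregated path: prefill on one worker, KV cache transfer, then decode
# on another worker until the request finishes.
def serve_disaggregated(request, prefill_worker, decode_worker):
    # Phase 1: prefill computes over the whole prompt and produces the KV
    # cache plus the first output token.
    first_token, kv_cache = prefill_worker.prefill(request.prompt)

    # The intermediate state (mainly the KV cache) migrates to a decode
    # worker, e.g. over NVLink or the datacenter network.
    decode_worker.load_kv_cache(request.id, kv_cache)

    # Phase 2: the decode worker generates the remaining tokens step by step.
    tokens = [first_token]
    while not request.finished(tokens):
        tokens.append(decode_worker.decode_one_step(request.id, tokens[-1]))
    return tokens
```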

{{< image src="img/distserve_anime-crop.gif" alt="disaggregation" width="100%" title="Figure 5. How requests get processed when prefill/decode is disaggregated.">}}
{{< image src="img/distserve_anime_crop.gif" alt="disaggregation" width="100%" title="Figure 5. How requests get processed when prefill/decode is disaggregated.">}}

Let’s go through a simple experiment to see why disaggregation is beneficial. We serve a 13B LLM on a single A100-80GB GPU with a synthetic workload (input length 512, output length 64) whose requests follow a [Poisson arrival](https://en.wikipedia.org/wiki/Poisson_point_process) process. We gradually increase the request rate (x-axis) and measure how the two latencies (P90 TTFT and P90 TPOT, y-axis) change in Figure 6.
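
For reference, such a workload can be generated with a small script like the following (our own sketch; the post does not include this code):

```python
# Synthetic workload: requests with fixed input/output lengths whose arrival
# times follow a Poisson process, i.e. exponentially distributed gaps.
import random

def poisson_arrivals(rate_rps: float, num_requests: int, seed: int = 0):
    random.seed(seed)
    t = 0.0
    for _ in range(num_requests):
        t += random.expovariate(rate_rps)          # mean gap = 1 / rate
        yield {"arrival_s": t, "input_len": 512, "output_len": 64}

# Example: 100 requests arriving at an average rate of 4 requests/second.
workload = list(poisson_arrivals(rate_rps=4.0, num_requests=100))
```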

File renamed without changes
Binary file removed static/img/distserve_anime_colocate_crop.gif
Binary file not shown.
