Fix small phone display
GindaChen committed Mar 18, 2024
1 parent 5d7897e commit 251706d
Showing 2 changed files with 7 additions and 7 deletions.
14 changes: 7 additions & 7 deletions content/blogs/distserve/index.md
@@ -118,18 +118,18 @@ Figure 5 illustrates how a request is processed in such a disaggregated system.
Let’s go through a simple experiment to see why disaggregation is beneficial. We serve a 13B LLM on a single A100-80GB GPU with a synthetic workload in which every request has an input length of 512 and an output length of 64, and requests arrive following a [Poisson arrival](https://en.wikipedia.org/wiki/Poisson_point_process) process. We gradually increase the request rate (x-axis) and measure how the two latencies (P90 TTFT and P90 TPOT, y-axis) change in **Figure 6**.
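
A synthetic workload like the one described above can be sketched in a few lines of Python. This is only an illustration, not the benchmark script used for the experiment: the function names `poisson_arrival_times` and `synthetic_workload`, the dictionary fields, and the fixed seed are all assumptions made for this example. It relies on the fact that a Poisson arrival process has exponentially distributed inter-arrival times.

```python
import random


def poisson_arrival_times(rate_rps, duration_s, seed=0):
    """Return increasing arrival timestamps (seconds) of a Poisson process.

    Inter-arrival gaps of a Poisson process with rate `rate_rps` are
    exponentially distributed with mean 1 / rate_rps.
    """
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    t = 0.0
    arrivals = []
    while True:
        t += rng.expovariate(rate_rps)
        if t >= duration_s:
            break
        arrivals.append(t)
    return arrivals


def synthetic_workload(rate_rps, duration_s, input_len=512, output_len=64, seed=0):
    """Attach the fixed input/output lengths from the text to each arrival."""
    return [
        {"arrival_s": t, "input_len": input_len, "output_len": output_len}
        for t in poisson_arrival_times(rate_rps, duration_s, seed)
    ]


if __name__ == "__main__":
    workload = synthetic_workload(rate_rps=3.0, duration_s=60.0)
    print(f"{len(workload)} requests in 60 s, first at {workload[0]['arrival_s']:.3f} s")
```

Sweeping `rate_rps` over a range of values and replaying each workload against a server would produce the x-axis of a latency-versus-rate plot such as Figure 6.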

Suppose we set the SLO of P90 TTFT to 0.4 seconds and of P90 TPOT to 0.04 seconds (the horizontal line in **Figure 6**). We observe that the existing system can sustain roughly 3 rps while staying within the TTFT constraint using 1 GPU, whereas for TPOT it sustains only 1.6 rps (**Figure 6a**). Since we need to satisfy both constraints, the goodput of the existing colocated system becomes:
$$
\text{Goodput (colocate) = min(3, 1.6) = 1.6 rps (per GPU)}
$$


Goodput (colocate) = min(3, 1.6) = 1.6 rps (per GPU)



Performance improves significantly after disaggregation. The prefill worker and the decode worker each achieve a higher rps than the colocated system when handling only one phase: as shown in **Figure 6**, one prefill worker achieves roughly 5.6 rps and one decode worker achieves roughly 10 rps.

More importantly, we can now **flexibly** allocate 2 prefill workers to pair with 1 decode worker (denoted 2P1D), 3 GPUs in total. The goodput becomes:
$$
\text{Goodput (2P1D) = min(5.6 x 2, 10) = 10 reqs/s / 3 GPUs ≈ 3.3 reqs/s (per GPU)}
$$

Goodput (2P1D) = min(5.6 x 2, 10) = 10 reqs/s / 3 GPUs ≈ 3.3 reqs/s (per GPU)



**Simply disaggregating, without any parallelism, yields a 2x goodput improvement (3.3 vs. 1.6 rps per GPU).**
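
The goodput arithmetic above can be reproduced with a short Python sketch. The helper names `colocated_goodput` and `disaggregated_goodput` are made up for this example, and the rates are the approximate values quoted in the text (roughly 3 rps under the TTFT SLO, 1.6 rps under the TPOT SLO, 5.6 rps per prefill worker, 10 rps per decode worker). It assumes, as the text does, that throughput scales linearly with the number of workers of each type.

```python
def colocated_goodput(ttft_limit_rps, tpot_limit_rps):
    """Per-GPU goodput of one colocated GPU: both SLOs must hold,
    so the tighter of the two rate limits wins."""
    return min(ttft_limit_rps, tpot_limit_rps)


def disaggregated_goodput(prefill_rps, decode_rps, num_prefill, num_decode):
    """Per-GPU goodput of an xPyD deployment.

    The pipeline is limited by its slower stage; the resulting rate is
    divided by the total number of GPUs (one GPU per worker is assumed).
    """
    system_rps = min(prefill_rps * num_prefill, decode_rps * num_decode)
    return system_rps / (num_prefill + num_decode)


if __name__ == "__main__":
    colocate = colocated_goodput(ttft_limit_rps=3.0, tpot_limit_rps=1.6)
    two_p_one_d = disaggregated_goodput(5.6, 10.0, num_prefill=2, num_decode=1)
    print(f"colocate: {colocate:.1f} rps per GPU")
    print(f"2P1D:     {two_p_one_d:.1f} rps per GPU")
    print(f"speedup:  {two_p_one_d / colocate:.2f}x")
```

Trying other ratios with the same helper shows why 2P1D is a good fit for these numbers: with 1P1D the prefill stage is the bottleneck (5.6 rps over 2 GPUs), while 2P1D nearly balances the two stages.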