Add a stresstest example #844

Open: wants to merge 5 commits into base: master
54 changes: 54 additions & 0 deletions examples/server/stresstest.py
@@ -0,0 +1,54 @@
from openai import OpenAI
import httpx
import textwrap
import json
import time


def log_response(response: httpx.Response):
    request = response.request
    print(f"Request: {request.method} {request.url}")
    print(" Headers:")
    for key, value in request.headers.items():
        if key.lower() == "authorization":
            value = "[...]"
        if key.lower() == "cookie":
            value = value.split("=")[0] + "=..."
        print(f" {key}: {value}")
    print(" Body:")
    try:
        request_body = json.loads(request.content)
        print(textwrap.indent(json.dumps(request_body, indent=2), " "))
    except json.JSONDecodeError:
        print(textwrap.indent(request.content.decode(), " "))
    print(f"Response: status_code={response.status_code}")
    print(" Headers:")
    for key, value in response.headers.items():
        if key.lower() == "set-cookie":
            value = value.split("=")[0] + "=..."
        print(f" {key}: {value}")


client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

# Enable this to log requests and responses
# client._client = httpx.Client(
#     event_hooks={"request": [print], "response": [log_response]}
# )

messages = []

for i in range(1000):
    # The history grows by one user message per iteration, so each request
    # carries a longer prompt than the previous one.
    messages.append({"role": "user", "content": "Hello! How are you? Please write a generic binary search function in Rust."})
    print("Sending", i)
    start = time.time()
    completion = client.chat.completions.create(
        model="mistral",
        messages=messages,
        max_tokens=256,
        frequency_penalty=1.0,
        top_p=0.1,
        temperature=0,
    )
    resp = completion.choices[0].message.content
    print("Done", time.time() - start)
1 change: 1 addition & 0 deletions mistralrs-core/src/engine/mod.rs
@@ -139,6 +139,7 @@ impl Engine {
                let throughput_start = Instant::now();
                let current_completion_ids: Vec<usize> =
                    scheduled.completion.iter().map(|seq| *seq.id()).collect();

                let res = {
                    let mut pipeline = get_mut_arcmutex!(self.pipeline);
                    let pre_op = if !self.no_kv_cache
54 changes: 54 additions & 0 deletions mistralrs/examples/stresstest/main.rs
@@ -0,0 +1,54 @@
use anyhow::Result;
use mistralrs::{IsqType, MemoryUsage, RequestBuilder, TextMessageRole, TextModelBuilder};

const N_ITERS: u64 = 1000;
const BYTES_TO_MB: usize = 1024 * 1024;

const PROMPT: &str = r#"
The Rise of Rust: A New Era in Systems Programming
Introduction

Rust is a modern systems programming language that has garnered significant attention since its inception. Created by Graydon Hoare and backed by Mozilla Research, Rust has quickly risen to prominence due to its ability to balance performance, safety, and concurrency in ways that other languages struggle to achieve. While languages like C and C++ have long dominated systems programming, Rust offers a fresh approach by addressing many of the core issues that have historically plagued these languages—namely, memory safety and concurrency challenges.

In this essay, we will explore the key features of Rust, its advantages over other systems programming languages, and how it is shaping the future of software development, particularly in the realms of performance-critical and safe computing.

The Philosophy Behind Rust
Rust was designed to solve a key problem in systems programming: memory safety without sacrificing performance. Traditionally, languages like C and C++ have offered low-level access to memory, which is essential for writing efficient programs that interact closely with hardware. However, this power comes with significant risks, especially when it comes to bugs such as buffer overflows, null pointer dereferencing, and use-after-free errors. These types of bugs not only cause crashes but also open up security vulnerabilities, which have become a major issue in modern software development.

Rust's approach to memory safety is built on a few key principles:

Ownership and Borrowing: Rust introduces a unique ownership system that ensures memory safety at compile time. Each value in Rust has a single owner, and when the owner goes out of scope, the value is deallocated. This ensures that memory leaks and dangling pointers are virtually impossible. Furthermore, Rust's borrowing system allows references to be shared, but only in ways that are provably safe. For example, mutable references cannot be aliased, preventing many common concurrency issues.

Zero-Cost Abstractions: Rust provides high-level abstractions such as iterators and closures without incurring runtime penalties. This is crucial in systems programming, where performance is paramount. Unlike languages that rely on garbage collection (like Java or Go), Rust's memory model allows developers to write high-performance code while still benefiting from the safety of modern abstractions.
"#;

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct")
        .with_isq(IsqType::Q4K)
        .with_prefix_cache_n(None)
        .with_logging()
        // .with_paged_attn(|| mistralrs::PagedAttentionMetaBuilder::default().build())?
        .build()
        .await?;

    for i in 0..N_ITERS {
        // Send the same long prompt on every iteration.
        let messages = RequestBuilder::new()
            .add_message(TextMessageRole::User, PROMPT)
            .set_sampler_max_len(1000);

        println!("Sending request {}...", i + 1);
        let response = model.send_chat_request(messages).await?;

        // Memory still available on the model's device, divided down from bytes.
        let amount = MemoryUsage.get_memory_available(&model.config().device)? / BYTES_TO_MB;

        println!("{amount}");
        println!("{}", response.usage.total_time_sec);
        println!("{:?}", response.choices[0].message.content);
    }

    Ok(())
}
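
Both examples print one timing per request, which is hard to read across 1000 iterations. The following is a minimal sketch, not part of this PR, of how those timings could be collected and summarized at the end of a run. `summarize_latencies` is a hypothetical helper name, and the nearest-rank p95 is just one reasonable choice of percentile definition.

```python
import statistics


def summarize_latencies(latencies):
    """Summarize per-request durations (in seconds) from a stress-test run.

    Hypothetical helper: it is not part of the PR or of any library.
    """
    if not latencies:
        raise ValueError("no latencies recorded")
    ordered = sorted(latencies)
    n = len(ordered)
    # Nearest-rank 95th percentile: ceil(0.95 * n), computed with integers
    # to avoid floating-point rounding surprises.
    rank = (95 * n + 99) // 100
    return {
        "min": ordered[0],
        "mean": statistics.fmean(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }
```

In `stresstest.py`, this would mean appending `time.time() - start` to a list inside the loop and printing `summarize_latencies(...)` once the loop finishes. A mean or p95 that climbs steadily over the run is the kind of signal a stress test like this is meant to surface.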