This project challenged me to build and deploy my first open-source LLM API solution. It's a simple data analysis chatbot with the personality of Britney Spears that analyzes a small data set of student grades in math and physics. The data set was provided by the Lonely Octopus Program, and the inspiration for the challenge came from Ibraheem Ansari after he taught a super helpful training on open-source LLMs. Thank you to him and the entire LO team for your support!
- 🐍Python: The primary programming language for this project, used for both the backend (FastAPI) and the frontend (Streamlit).
- ⚡FastAPI: A modern, fast (high-performance) web framework for building APIs with Python. In this project, FastAPI is used to create the backend server that handles requests from the Streamlit frontend and communicates with the Hugging Face Serverless Inference API.
- 🤗Hugging Face Serverless Inference API: A cloud-hosted API service that runs open-source machine learning models. For this project, I'm now using the Serverless Inference API with the Meta-Llama-3.1-8B-Instruct model, which powers BritneyBot's responses. The FastAPI backend sends prompts to this API and receives the generated text. The serverless option is more cost-effective than a dedicated endpoint and scales automatically with usage, which makes it a great choice for projects that don't require constant, high-volume inference.
- 🐳Docker: A platform used to develop, ship, and run applications inside containers. In this project, Docker is used to containerize the FastAPI application, ensuring consistent deployment across different environments.
- ☁️Render: A cloud platform used to deploy and host web services. In this project, Render is used to deploy and host the Docker container running the FastAPI backend.
- 🪄Streamlit: An open-source Python library used to create web applications for machine learning and data science projects. In this project, Streamlit is used to create the frontend user interface where users can interact with BritneyBot.
- 🐼pandas: A Python library used for data manipulation and analysis. In this project, pandas is used to load and process the student grades data.
- 📫requests: A Python library used for making HTTP requests. In this project, it is used in both the FastAPI backend (to communicate with the Hugging Face Serverless Inference API) and the Streamlit frontend (to communicate with the FastAPI backend).
Here's how a question flows through the app:
- The user interacts with the Streamlit frontend, entering questions about the student grades data.
- The Streamlit app sends these questions to the FastAPI backend hosted on Render.
- The FastAPI backend processes the request, formats the prompt, and sends it to the Hugging Face Serverless Inference API.
- The Hugging Face Serverless Inference API generates a response using the hosted language model.
- The FastAPI backend receives the response, processes it to ensure it's in Britney's style, and sends it back to the Streamlit frontend.
- The Streamlit frontend displays the response to the user.
This setup keeps the frontend, the backend, and the model hosting separate, so each piece runs in the cloud and can be deployed and scaled on its own.
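To make the flow above more concrete, here's a minimal sketch of the backend step in plain Python. It is not the project's actual main.py: the persona text, the function names (`format_prompt`, `clean_response`, `handle_question`), and the CSV columns are all made up for illustration, and the model call is passed in as a `generate` function so the sketch runs without any network access.

```python
from typing import Callable

# Hypothetical persona text; the real wording in main.py may differ.
PERSONA = "You are BritneyBot, a cheerful data analyst who talks like Britney Spears."


def format_prompt(question: str, grades_csv: str) -> str:
    """Combine the persona, the data, and the user's question into one prompt."""
    return (
        f"{PERSONA}\n\n"
        f"Student grades (CSV):\n{grades_csv}\n\n"
        f"Question: {question}\nAnswer:"
    )


def clean_response(raw_text: str, prompt: str) -> str:
    """Drop the echoed prompt, if the model returned it, and trim whitespace."""
    if raw_text.startswith(prompt):
        raw_text = raw_text[len(prompt):]
    return raw_text.strip()


def handle_question(
    question: str, grades_csv: str, generate: Callable[[str], str]
) -> dict:
    """Backend step: build the prompt, call the model, tidy up the reply."""
    prompt = format_prompt(question, grades_csv)
    raw_text = generate(prompt)
    return {"response": clean_response(raw_text, prompt)}


def fake_generate(prompt: str) -> str:
    """Stand-in for the hosted model: echoes the prompt plus a canned answer."""
    return prompt + " Oops, the math average is 85, baby!"


reply = handle_question(
    "What is the average math grade?",
    "name,math,physics\nAna,85,90",  # invented sample data
    fake_generate,
)
print(reply["response"])
```

In the real app, `generate` would wrap the HTTP call to the Hugging Face API, and `handle_question` would sit behind a FastAPI route that the Streamlit frontend posts to.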
A few limitations to keep in mind:
- The Hugging Face Serverless Inference API has usage limits and may have higher latency than dedicated endpoints, especially on the first request after a period of inactivity (cold start).
- Users might experience slight delays in responses during these cold starts.
- For high-traffic applications, consider monitoring API usage and response times, and be prepared to upgrade to a dedicated endpoint if necessary.
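One way to soften cold starts is to retry with a growing wait. The sketch below is an illustration under assumptions, not code from this project: it assumes the API signals a still-loading model with HTTP 503 and may include an `estimated_time` hint in the JSON body, and the helper name `call_with_retry` and its `send` callable are hypothetical.

```python
import time
from typing import Any, Callable, List, Tuple


def call_with_retry(
    send: Callable[[], Tuple[int, Any]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call send() until it stops returning HTTP 503, waiting between tries.

    send() must return a (status_code, parsed_json_body) pair.
    Any status other than 503 is handed straight back to the caller.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            return body
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        # Exponential backoff: 1s, 2s, 4s, ... unless the server gives a hint.
        delay = base_delay * 2 ** attempt
        if isinstance(body, dict) and "estimated_time" in body:
            delay = float(body["estimated_time"])  # assumed field name
        sleep(delay)
    raise TimeoutError(f"Model still unavailable after {max_attempts} attempts")


# Simulated cold start: two "loading" replies, then a real answer.
fake_replies = iter([
    (503, {"error": "Model is loading", "estimated_time": 2.0}),
    (503, {"error": "Model is loading"}),
    (200, [{"generated_text": "It's Britney, baby!"}]),
])
waits: List[float] = []
result = call_with_retry(lambda: next(fake_replies), sleep=waits.append)
print(result, waits)
```

Passing `sleep` in as a parameter keeps the demo instant; in a deployed backend you would leave it at the default `time.sleep` and have `send` perform the actual HTTP request.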
You can adjust the generation parameters to tweak BritneyBot's personality! Overview:
- For more creative, varied responses: Increase temperature and top_p.
- For more focused, consistent responses: Decrease temperature and top_p.
- For longer or shorter responses: Adjust max_new_tokens.
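If it helps to see what temperature and top_p actually do, here's a toy illustration in pure Python. It is not part of BritneyBot and not how the hosted model is implemented internally; the three logits are invented and the function names are hypothetical. It only shows the general idea: temperature rescales the scores before they become probabilities, and top_p keeps the smallest set of top tokens whose combined probability reaches the threshold.

```python
import math
from typing import Dict, List


def apply_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits divided by temperature (lower = sharper)."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: List[float], top_p: float) -> Dict[int, float]:
    """Keep the most likely tokens until their cumulative mass reaches top_p.

    Returns a mapping from token index to renormalized probability.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


logits = [2.0, 1.0, 0.0]  # invented scores for three candidate tokens

warm = apply_temperature(logits, 1.0)
cool = apply_temperature(logits, 0.5)
print("temperature 1.0:", [round(p, 3) for p in warm])
print("temperature 0.5:", [round(p, 3) for p in cool])

print("top_p 0.9 keeps tokens:", sorted(top_p_filter(warm, 0.9)))
print("top_p 0.5 keeps tokens:", sorted(top_p_filter(warm, 0.5)))
```

Lowering the temperature pushes more probability onto the top token, and lowering top_p shrinks the pool of tokens the model is allowed to pick from.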
Adjust the main.py parameters as desired. See current settings and guide:
- 🌡️ "max_new_tokens": 300
  - Caps the length of each response. The current value aims for about 5-7 sentences to keep answers brief. Increase it for longer responses or decrease it for shorter ones. Changing it affects the response length and potentially the API call cost.
- 🌡️ "temperature": 0.6
  - Controls the randomness of the output. Higher values (e.g., 1.0) make the output more random; lower values (e.g., 0.2) make it more focused and deterministic. The current value attempts to balance creativity with accuracy: it should still allow for Britney's "voice" and emojis while keeping the math correct. Adjust it based on how creative or precise you want the responses to be.
- 🌡️ "top_p": 0.90
  - Sets the threshold for nucleus sampling: the model samples only from the smallest set of most likely tokens whose cumulative probability reaches top_p. Lower values (e.g., 0.5) make the output more focused; higher values (e.g., 0.95) allow for more diversity. The current value focuses the output a bit while still allowing for creative elements. You can adjust it together with temperature to fine-tune the output style.
- 🌡️ "do_sample": True
  - Enables sampling (as opposed to always choosing the most likely next token), which allows for some randomness in the responses. Set this to False if you want more deterministic outputs.
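Putting the settings above together, the request body might look like the sketch below. The `inputs`/`parameters` layout follows the Hugging Face text-generation request format, but whether main.py builds it exactly this way is an assumption, and `GENERATION_PARAMS` and `build_payload` are illustrative names.

```python
import json

# Current settings from this guide.
GENERATION_PARAMS = {
    "max_new_tokens": 300,
    "temperature": 0.6,
    "top_p": 0.90,
    "do_sample": True,
}


def build_payload(prompt: str, **overrides) -> dict:
    """Return a request body, optionally overriding individual settings."""
    # Merge into a new dict so the defaults are never modified in place.
    params = {**GENERATION_PARAMS, **overrides}
    return {"inputs": prompt, "parameters": params}


default_payload = build_payload("What is the average physics grade?")
focused_payload = build_payload(
    "What is the average physics grade?", temperature=0.2, top_p=0.5
)

print(json.dumps(default_payload, indent=2))
print(json.dumps(focused_payload["parameters"], indent=2))
```

Keeping the defaults in one dictionary means you can experiment with a single setting per request without editing the values everyone else relies on.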
- Questions? Contact me at https://github.com/kobrakitty
- Curious? Follow my AI learning journey at www.glitterpile.blog.
🥰xo Kobra Kitty