This package is for storing machine learning models with meta data in Rust so they can be used on the SurrealDB server.
SurrealML is a feature that allows you to store trained machine learning models in a special format called 'surml'. This enables you to run these models in either Python or Rust, and even upload them to a SurrealDB node to run the models on the server
- A basic understanding of Machine Learning: You should be familiar with ML concepts, algorithms, and model training processes.
- Knowledge of Python: Proficiency in Python is necessary as SurrealML involves working with Python-based ML models.
- Familiarity with SurrealDB: Basic knowledge of how SurrealDB operates is required since SurrealML integrates directly with it.
- Python Environment Setup: A Python environment with necessary libraries installed, including SurrealML, PyTorch or SKLearn (depending on your model preference).
- SurrealDB Installation: Ensure you have SurrealDB installed and running on your machine or server
To install SurrealML, make sure you have Python installed. Then, install the SurrealML
library and either PyTorch
or
SKLearn
, based on your model choice. You can install the package with both PyTorch
and SKLearn
with the command
below:
pip install "git+https://github.com/surrealdb/surrealml#egg=surrealml[sklearn,torch]"
If you want to use SurrealML
with sklearn
you will need the following installation:
pip install "git+https://github.com/surrealdb/surrealml#egg=surrealml[sklearn]"
For PyTorch
:
pip install "git+https://github.com/surrealdb/surrealml#egg=surrealml[torch]"
After that, you can train your model and save it in the SurrealML format.
If nothing is configured the crate will compile the ONNX runtime into the binary. This is the default behaviour. However, you have 2 more options:
- If you want to use an ONNX runtime that is installed on your system, you can set the environment variable
ONNXRUNTIME_LIB_PATH
before you compile the crate. This will make the crate use the ONNX runtime that is installed on your system. - If you want to statically compile the library, you can download it from https://github.com/surrealdb/onnxruntime-build/releases/tag/v1.16.3 and then build the crate this way:
$ tar xvf <onnx-archive-file> -C extract-dir
$ ORT_STRATEGY=system ORT_LIB_LOCATION=$(pwd)/extract-dir cargo build
Sk-learn models can also be converted and stored in the .surml
format enabling developers to load them in any
python version as we are not relying on pickle. Metadata in the file also enables other users of the model to use them
out of the box without having to worry about the normalisation of the data or getting the right inputs in order. You
will also be able to load your sk-learn models in Rust and run them meaning you can use them in your SurrealDB server.
Saving a model is as simple as the following:
from sklearn.linear_model import LinearRegression
from surrealml import SurMlFile, Engine
from surrealml.model_templates.datasets.house_linear import HOUSE_LINEAR # click on this HOUSE_LINEAR to see the data
# train the model
model = LinearRegression()
model.fit(HOUSE_LINEAR["inputs"], HOUSE_LINEAR["outputs"])
# package and save the model
file = SurMlFile(model=model, name="linear", inputs=HOUSE_LINEAR["inputs"], engine=Engine.SKLEARN)
# add columns in the order of the inputs to map dictionaries passed in to the model
file.add_column("squarefoot")
file.add_column("num_floors")
# add normalisers for the columns
file.add_normaliser("squarefoot", "z_score", HOUSE_LINEAR["squarefoot"].mean(), HOUSE_LINEAR["squarefoot"].std())
file.add_normaliser("num_floors", "z_score", HOUSE_LINEAR["num_floors"].mean(), HOUSE_LINEAR["num_floors"].std())
file.add_output("house_price", "z_score", HOUSE_LINEAR["outputs"].mean(), HOUSE_LINEAR["outputs"].std())
# save the file
file.save(path="./linear.surml")
# load the file
new_file = SurMlFile.load(path="./linear.surml", engine=Engine.SKLEARN)
# Make a prediction (both should be the same due to the perfectly correlated example data)
print(new_file.buffered_compute(value_map={"squarefoot": 5, "num_floors": 6}))
print(new_file.raw_compute(input_vector=[5, 6]))
First we need to have one script where we create and store the model. In this example we will merely do a linear regression model to predict the house price using the number of floors and the square feet.
We can create some fake data with the following python code:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
squarefoot = np.array([1000, 1200, 1500, 1800, 2000, 2200, 2500, 2800, 3000, 3200], dtype=np.float32)
num_floors = np.array([1, 1, 1.5, 1.5, 2, 2, 2.5, 2.5, 3, 3], dtype=np.float32)
house_price = np.array([200000, 230000, 280000, 320000, 350000, 380000, 420000, 470000, 500000, 520000], dtype=np.float32)
We then get the parameters to perform normalisation to get better convergance with the following"
squarefoot_mean = squarefoot.mean()
squarefoot_std = squarefoot.std()
num_floors_mean = num_floors.mean()
num_floors_std = num_floors.std()
house_price_mean = house_price.mean()
house_price_std = house_price.std()
We then normalise our data with the code below:
squarefoot = (squarefoot - squarefoot.mean()) / squarefoot.std()
num_floors = (num_floors - num_floors.mean()) / num_floors.std()
house_price = (house_price - house_price.mean()) / house_price.std()
We then create our tensors so they can be loaded into our model and stack it together with the following:
squarefoot_tensor = torch.from_numpy(squarefoot)
num_floors_tensor = torch.from_numpy(num_floors)
house_price_tensor = torch.from_numpy(house_price)
X = torch.stack([squarefoot_tensor, num_floors_tensor], dim=1)
We can now define our linear regression model with loss function and an optimizer with the code below:
# Define the linear regression model
class LinearRegressionModel(nn.Module):
def __init__(self):
super(LinearRegressionModel, self).__init__()
self.linear = nn.Linear(2, 1) # 2 input features, 1 output
def forward(self, x):
return self.linear(x)
# Initialize the model
model = LinearRegressionModel()
# Define the loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
We are now ready to train our model on the data we have generated with 100 epochs with the following loop:
num_epochs = 1000
for epoch in range(num_epochs):
# Forward pass
y_pred = model(X)
# Compute the loss
loss = criterion(y_pred.squeeze(), house_price_tensor)
# Backward pass and optimization
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Print the progress
if (epoch + 1) % 100 == 0:
print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")
Our model is now trained and we need some example data to trace the model with the code below:
test_squarefoot = torch.tensor([2800, 3200], dtype=torch.float32)
test_num_floors = torch.tensor([2.5, 3], dtype=torch.float32)
test_inputs = torch.stack([test_squarefoot, test_num_floors], dim=1)
We can now wrap our model in the SurMlFile
object with the following code:
from surrealml import SurMlFile, Engine
file = SurMlFile(model=model, name="linear", inputs=inputs[:1], engine=Engine.PYTORCH)
The name is optional but the inputs and model are essential. We can now add some meta data to the file such as our inputs and outputs with the following code, however meta data is not essential, it just helps with some types of computation:
file.add_column("squarefoot")
file.add_column("num_floors")
file.add_output("house_price", "z_score", house_price_mean, house_price_std)
It must be stressed that the add_column
order needs to be consistent with the input tensors that the model was trained on as these
now act as key bindings to convert dictionary inputs into the model. We need to also add the normalisers for our column but these will
be automatically mapped therefore we do not need to worry about the order they are inputed, again, normalisers are optional, you can
normalise the data yourself:
file.add_normaliser("squarefoot", "z_score", squarefoot_mean, squarefoot_std)
file.add_normaliser("num_floors", "z_score", num_floors_mean, num_floors_std)
We then save the model with the following code:
file.save("./test.surml")
If you have followed the previous steps you should have a .surml
file saved with all our meta data. We load it with the following code:
from surrealml import SurMlFile, Engine
new_file = SurMlFile.load("./test.surml", engine=Engine.PYTORCH)
Our model is now loaded. We can now perform computations.
This is where the computation utilises the data in the header. We can do this by merely passing in a dictionary as seen below:
print(new_file.buffered_compute({
"squarefoot": 1.0,
"num_floors": 2.0
}))
We can upload our trained model with the following code:
url = "http://0.0.0.0:8000/ml/import"
SurMlFile.upload(
path="./linear_test.surml",
url=url,
chunk_size=36864,
namespace="test",
database="test",
username="root",
password="root"
)
With this, we can perform SQL statements in our database. To test this, we can create the following rows:
CREATE house_listing SET squarefoot_col = 500.0, num_floors_col = 1.0;
CREATE house_listing SET squarefoot_col = 1000.0, num_floors_col = 2.0;
CREATE house_listing SET squarefoot_col = 1500.0, num_floors_col = 3.0;
SELECT * FROM (
SELECT
*,
ml::house-price-prediction<0.0.1>({
squarefoot: squarefoot_col,
num_floors: num_floors_col
}) AS price_prediction
FROM house_listing
)
WHERE price_prediction > 177206.21875;
What is happening here is that we are feeding the columns from the table house_listing
into a model we uploaded
called house-price-prediction
with a version of 0.0.1
. We then get the results of that trained ML model as the column
price_prediction
. We then use the calculated predictions to filter the rows giving us the following result:
[
{
"id": "house_listing:7bo0f35tl4hpx5bymq5d",
"num_floors_col": 3,
"price_prediction": 406534.75,
"squarefoot_col": 1500
},
{
"id": "house_listing:8k2ttvhp2vh8v7skwyie",
"num_floors_col": 2,
"price_prediction": 291870.5,
"squarefoot_col": 1000
}
]
We can now load our .surml
file with the following code:
use crate::storage::surml_file::SurMlFile;
let mut file = SurMlFile::from_file("./test.surml").unwrap();
You can have an empty header if you want. This makes sense if you're doing something novel, or complex such as convolutional neural networks for image processing. To perform a raw computation you can merely just do the following:
file.model.set_eval();
let x = Tensor::f_from_slice::<f32>(&[1.0, 2.0, 3.0, 4.0]).unwrap().reshape(&[2, 2]);
let outcome = file.model.forward_ts(&[x]);
println!("{:?}", outcome);
However if you want to use the header you need to perform a buffered computer
This is where the computation utilises the data in the header. We can do this by wrapping our File
struct in a ModelComputation
struct
with the code below:
use crate::execution::compute::ModelComputation;
let computert_unit = ModelComputation {
surml_file: &mut file
};
Now that we have this wrapper we can create a hashmap with values and keys that correspond to the key bindings. We can then pass this into
a buffered_compute
that maps the inputs and applies normalisation to those inputs if normalisation is present for that column with the
following:
let mut input_values = HashMap::new();
input_values.insert(String::from("squarefoot"), 1.0);
input_values.insert(String::from("num_floors"), 2.0);
let outcome = computert_unit.buffered_compute(&mut input_values);
println!("{:?}", outcome);