Inference model server implementation with Intel performance optimizations and TensorFlow Serving API

llandis/OpenVINO-model-server

OpenVINO™ model server

An inference model server implementation that is compatible with the TensorFlow Serving API and uses OpenVINO™ as the execution backend. It provides both gRPC and RESTful API interfaces.

Project overview

“OpenVINO™ Model Server” is a flexible, high-performance inference serving component for artificial intelligence models.
The software makes it easy to deploy new algorithms and AI experiments while keeping the same server architecture and APIs as in TensorFlow Serving.

It provides out-of-the-box integration with models supported by OpenVINO™ and allows frameworks such as AWS Sagemaker to serve AI models with OpenVINO™.

Besides the local filesystem, OpenVINO Model Server supports GCS, S3 and Minio as model storage.

It is implemented as a Python service using the gRPC interface library, the Falcon REST API framework, TensorFlow for data serialization and deserialization, and OpenVINO™ for inference execution. It acts as an integrator and a bridge, exposing the CPU-optimized inference engine over network interfaces.

Review the Architecture concept document for more details.

OpenVINO Model Server, beside CPU, can employ:

Getting it up and running

Using a docker container

Landing on bare metal or virtual machine

Advanced configuration

Custom CPU extensions

Using FPGA (TBD)

gRPC API documentation

The OpenVINO™ Model Server gRPC API is documented in the protocol buffer files in tensorflow_serving_api. Note: Only the Predict, GetModelMetadata and GetModelStatus function calls are currently implemented. These are the most generic function calls and should address most usage scenarios.

The predict function spec has two message definitions: PredictRequest and PredictResponse.

  • PredictRequest specifies the model spec and a map of input data serialized via TensorProto to a string format.
  • PredictResponse includes a map of outputs serialized by TensorProto and information about the model spec that was used.

The get_model_metadata function spec has three message definitions: SignatureDefMap, GetModelMetadataRequest and GetModelMetadataResponse. A GetModelMetadata call accepts model spec information as input and returns Signature Definition content in a format similar to TensorFlow Serving.

The get_model_status function spec can be used to report all exposed versions, including their state in the lifecycle.

Refer to the example client code to learn how to use this API and submit requests over the gRPC interface.

The gRPC interface is recommended for performance reasons because it has a faster implementation of input data deserialization. It can achieve shorter latency, especially for big input messages such as images.

RESTful API documentation

The OpenVINO™ Model Server RESTful API follows the TensorFlow Serving REST API documentation.

Both the row and column request formats are implemented. Note: Just like with gRPC, only the Predict, GetModelMetadata and GetModelStatus function calls are currently implemented.

Only the numerical data types are supported.
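
The row and column request formats can be sketched as follows. This is a minimal illustration using only the Python standard library. It assumes the URL layout and the `instances` / `inputs` keys described in the TensorFlow Serving REST API documentation; the host, port, model name (`resnet`), input name and helper function names are hypothetical.

```python
import json
import urllib.request


def predict_url(host, port, model_name, version=None):
    """Build a Predict URL following the TensorFlow Serving REST layout."""
    base = f"http://{host}:{port}/v1/models/{model_name}"
    if version is not None:
        base += f"/versions/{version}"
    return base + ":predict"


def row_payload(samples):
    """Row format: a list of instances, one entry per sample."""
    return {"instances": samples}


def column_payload(named_inputs):
    """Column format: a map of input name to the whole batch."""
    return {"inputs": named_inputs}


def build_request(url, payload):
    """Prepare (but do not send) a JSON POST request."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical server address and model name, for illustration only.
url = predict_url("localhost", 8000, "resnet", version=1)
request = build_request(url, row_payload([[1.0, 2.0], [3.0, 4.0]]))
# urllib.request.urlopen(request) would submit it to a running server.
```

Only numerical values appear in the payload, in line with the data type restriction above.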

Review the example clients below to learn how to connect and run inference requests.

The REST API is recommended when the primary goal is to reduce the number of client-side Python dependencies and to simplify application code.

Usage examples

Kubernetes deployments

Sagemaker integration

Using Predict function over gRPC and RESTful API with numpy data input

Using GetModelMetadata function over gRPC and RESTful API

Using GetModelStatus function over gRPC and RESTful API

Example script submitting jpeg images for image classification

Jupyter notebook - kubernetes demo

Jupyter notebook - REST API client for age-gender classification

Benchmarking results

Report for resnet models

References

OpenVINO™

TensorFlow Serving

gRPC

RESTful API

Inference at scale in Kubernetes

OpenVINO Model Server boosts AI

Troubleshooting

Server logging

OpenVINO™ model server accepts 3 logging levels:

  • ERROR: Logs information about inference processing errors and server initialization issues.
  • INFO: Presents information about server startup procedure.
  • DEBUG: Stores information about client requests.

The default setting is INFO, which can be changed by setting the environment variable LOG_LEVEL.

The captured logs are displayed on the model server console. When using docker containers or kubernetes, the logs can be examined using the docker logs or kubectl logs commands respectively.

It is also possible to save the logs to a local file system by setting the environment variable LOG_PATH to the absolute path of a log file. See the example below for usage details.

docker run --name ie-serving --rm -d -v /models/:/opt/ml:ro -p 9001:9001 --env LOG_LEVEL=DEBUG --env LOG_PATH=/var/log/ie_serving.log \
 ie-serving-py:latest /ie-serving-py/start_server.sh ie_serving config --config_path /opt/ml/config.json --port 9001
 
docker logs ie-serving 
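
How LOG_LEVEL and LOG_PATH might map onto Python's standard logging module can be sketched as below. This is not the server's actual implementation: the function name `configure_logging` and the logger name are hypothetical, and the fallback to INFO for unrecognized values, as well as the case-insensitive matching, are assumptions.

```python
import logging
import os

ALLOWED_LEVELS = {"ERROR", "INFO", "DEBUG"}


def configure_logging(env=os.environ, logger_name="ie_serving_sketch"):
    """Return a logger configured from LOG_LEVEL and LOG_PATH."""
    level = env.get("LOG_LEVEL", "INFO").upper()
    if level not in ALLOWED_LEVELS:
        level = "INFO"  # assumption: fall back to the documented default

    logger = logging.getLogger(logger_name)
    logger.setLevel(getattr(logging, level))
    logger.handlers.clear()

    # Console output is always enabled; a file is added when LOG_PATH is set.
    logger.addHandler(logging.StreamHandler())
    log_path = env.get("LOG_PATH")
    if log_path:
        logger.addHandler(logging.FileHandler(log_path))
    return logger


# Mirrors the --env LOG_LEVEL=DEBUG flag from the docker command above.
debug_logger = configure_logging({"LOG_LEVEL": "DEBUG"})
```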

Model import issues

OpenVINO™ Model Server loads all defined model versions according to the set version policy. A model version is represented by a numerical directory in a model path, containing OpenVINO model files with .bin and .xml extensions.

Below is an example of an incorrect structure:

models/
├── model1
│   ├── 1
│   │   ├── ir_model.bin
│   │   └── ir_model.xml
│   └── 2
│       ├── somefile.bin
│       └── anotherfile.txt
└── model2
    ├── ir_model.bin
    ├── ir_model.xml
    └── mapping_config.json

In the above scenario, the server will detect only version 1 of model1. Directory 2 does not contain valid OpenVINO model files, so it won't be detected as a valid model version. For model2, the files are correct, but they are not in a numerical directory, so the server will not detect any version of model2.
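
The detection rule described above can be sketched as a small directory check. This is an illustration of the rule as stated in this section, not the server's own code; the function name `detect_versions` is hypothetical.

```python
import tempfile
from pathlib import Path


def detect_versions(model_dir):
    """Return sorted version numbers that look servable in model_dir."""
    versions = []
    for entry in Path(model_dir).iterdir():
        # A version must be a directory with a purely numerical name.
        if not (entry.is_dir() and entry.name.isdecimal()):
            continue
        suffixes = {f.suffix for f in entry.iterdir() if f.is_file()}
        # Both the weights (.bin) and the topology (.xml) must be present.
        if {".bin", ".xml"} <= suffixes:
            versions.append(int(entry.name))
    return sorted(versions)


# Rebuild the layout from the example above in a temporary directory.
LAYOUT = [
    "model1/1/ir_model.bin",
    "model1/1/ir_model.xml",
    "model1/2/somefile.bin",
    "model1/2/anotherfile.txt",
    "model2/ir_model.bin",
    "model2/ir_model.xml",
    "model2/mapping_config.json",
]
with tempfile.TemporaryDirectory() as root:
    for relative in LAYOUT:
        path = Path(root, relative)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    found = {name: detect_versions(Path(root, name)) for name in ("model1", "model2")}

print(found)  # {'model1': [1], 'model2': []}
```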

When a new model version is detected, the server loads the model files and starts serving the new model version. This operation might fail for the following reasons:

  • there is a problem with accessing the model files (e.g. due to network connectivity issues with the remote storage or insufficient permissions)
  • the model files are malformed and cannot be imported by the Inference Engine
  • the model requires a custom CPU extension

In all those situations, the root cause is reported in the server logs or in the response from a call to GetModelStatus function.

A model version that is detected but not loaded will not be served and will report the status LOADING with the error message: Error occurred while loading version. When the model files become accessible or are fixed, the server will try to load them again on the next version update attempt.
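
A client could scan a GetModelStatus response for such versions along these lines. The sketch assumes the JSON field names used by the TensorFlow Serving REST API for model status (`model_version_status`, `version`, `state`, `status`). The sample response, including the `UNKNOWN` error code, is made up for illustration, and the function name is hypothetical.

```python
import json


def unavailable_versions(status_json):
    """Map version -> error message for versions that are not AVAILABLE."""
    report = json.loads(status_json)
    problems = {}
    for entry in report.get("model_version_status", []):
        if entry.get("state") != "AVAILABLE":
            message = entry.get("status", {}).get("error_message", "")
            problems[entry.get("version")] = message
    return problems


# Illustrative response body; the values are made up for this sketch.
sample = json.dumps({
    "model_version_status": [
        {"version": "1", "state": "AVAILABLE",
         "status": {"error_code": "OK", "error_message": ""}},
        {"version": "2", "state": "LOADING",
         "status": {"error_code": "UNKNOWN",
                    "error_message": "Error occurred while loading version"}},
    ]
})
problems = unavailable_versions(sample)
```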

At startup, the server enables the gRPC and REST API endpoints after all configured models and detected model versions are loaded successfully (in the AVAILABLE state).

The server will fail to start if it cannot list the content of the configured model paths.

Client request issues

When the model server starts successfully and all the models are imported, request handling can still fail for several reasons. The failure reason is passed to the gRPC client in the response. It is also logged on the model server in DEBUG mode.

The possible issues include:

  • Incorrect shape of the input data.
  • Incorrect input key name, which does not match the tensor name or the input key name set in mapping_config.json.
  • Incorrectly serialized data on the client side.

Resource allocation

RAM consumption depends on the size and number of the models configured for serving. It should be measured experimentally; however, it can be estimated that each model will consume RAM equal to the size of its weights file (.bin file). Every version of a model creates a separate inference engine object, so it is recommended to mount only the desired model versions.
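
The estimate above can be sketched as a sum of .bin file sizes over all numerical version directories. This is a rough approximation following the rule of thumb in this section, not a measurement; the function name and the demo file sizes are hypothetical.

```python
import tempfile
from pathlib import Path


def estimate_ram_bytes(models_root):
    """Sum the sizes of all .bin files found in numerical version directories."""
    total = 0
    # Expected layout: <models_root>/<model name>/<version>/<file>.bin
    for weights in Path(models_root).glob("*/*/*.bin"):
        if weights.parent.name.isdecimal():
            total += weights.stat().st_size
    return total


# Demo with made-up file sizes in a temporary directory.
DEMO_FILES = (
    ("model1/1/ir_model.bin", 2048),
    ("model1/2/ir_model.bin", 1024),
    ("model2/ir_model.bin", 4096),  # not in a version directory, so ignored
)
with tempfile.TemporaryDirectory() as root:
    for relative, size in DEMO_FILES:
        path = Path(root, relative)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(b"\0" * size)
    estimate = estimate_ram_bytes(root)
```

Because both versions of model1 are counted, the sketch also shows why mounting only the desired versions lowers the estimate.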

OpenVINO™ model server consumes all available CPU resources unless they are restricted by the operating system, docker or kubernetes capabilities.

Performance tuning

When you send input data for inference, try to adjust the numerical data type to reduce the message size. For example, consider sending the image representation as uint8 instead of float data. For REST API calls, it might help to reduce the precision of the numbers in the JSON message with a command similar to np.round(imgs.astype(float), decimals=2). This reduces network bandwidth usage.
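
The effect of both suggestions can be illustrated with the standard library alone. The sketch below uses the `array` module as a stand-in for numpy dtypes and random values as a stand-in for image data; both are assumptions made to keep the example dependency-free.

```python
import json
import random
from array import array

random.seed(0)
pixels = [random.random() * 255 for _ in range(1000)]

# Binary payloads: one byte per value as uint8, four bytes as float32.
as_uint8 = array("B", (int(p) for p in pixels))
as_float32 = array("f", pixels)
binary_ratio = len(as_float32.tobytes()) / len(as_uint8.tobytes())

# JSON payloads: fewer decimals means fewer characters per number.
full_json = json.dumps(pixels)
rounded_json = json.dumps([round(p, 2) for p in pixels])
json_ratio = len(full_json) / len(rounded_json)

print(f"float32 vs uint8 size ratio: {binary_ratio}")
print(f"full vs rounded JSON size ratio: {json_ratio:.2f}")
```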

Usually, there is no need to tune any environment variables according to the allocated resources. In some cases, however, it might be beneficial to adjust the threading parameters to fit the allocated resources optimally. This is especially relevant when multiple services are running on a single node. Another case is horizontal scaling in Kubernetes, where throughput can be increased by employing a large number of small containers.

Example environment settings for 2 scenarios are listed below.

Optimization for latency - 1 container consuming all 80 vCPUs on the node:

OMP_NUM_THREADS=40
KMP_SETTINGS=1
KMP_AFFINITY=granularity=fine,verbose,compact,1,0
KMP_BLOCKTIME=1

Optimization for throughput - 20 containers on the node consuming 4 vCPUs each:

OMP_NUM_THREADS=4
KMP_SETTINGS=1
KMP_AFFINITY=granularity=fine,verbose,compact,1,0
KMP_BLOCKTIME=1
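
One way to keep the two profiles side by side is to store them as dictionaries and render them as `--env` arguments for `docker run`. This is only a convenience sketch; the dictionary and function names are hypothetical, and the values are copied from the two listings above.

```python
LATENCY_ENV = {
    "OMP_NUM_THREADS": "40",
    "KMP_SETTINGS": "1",
    "KMP_AFFINITY": "granularity=fine,verbose,compact,1,0",
    "KMP_BLOCKTIME": "1",
}

# The throughput profile differs only in the thread count per container.
THROUGHPUT_ENV = {**LATENCY_ENV, "OMP_NUM_THREADS": "4"}


def docker_env_args(settings):
    """Render settings as `--env NAME=VALUE` arguments for `docker run`."""
    args = []
    for name, value in settings.items():
        args += ["--env", f"{name}={value}"]
    return args


throughput_args = docker_env_args(THROUGHPUT_ENV)
```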

Usage monitoring

It is possible to track the usage of the models, including processing time, while DEBUG mode is enabled. With this setting, the model server logs store information about all incoming requests. You can parse the logs to analyze the volume of requests, processing statistics and the most used models.

Known limitations and plans

  • Currently, the Predict, GetModelMetadata and GetModelStatus calls are implemented using the TensorFlow Serving API. Classify, Regress and MultiInference are planned to be added.
  • Output_filter is not effective in the Predict call. All outputs defined in the model are returned to the clients.

Contribution

Contribution rules

All contributed code must be compatible with the Apache 2 license.

All changes need to pass style, unit and functional tests.

All new features need to be covered by tests.

Building

A Docker image with OpenVINO Model Server can be built with several options:

  • make docker_build_bin - using Intel Distribution of OpenVINO binary package (ubuntu base image)
  • make docker_build_src_ubuntu - using OpenVINO source code with ubuntu base image
  • make docker_build_src_intelpython - using OpenVINO source code with 'intelpython/intelpython3_core' base image (Intel optimized python distribution with conda and debian)

Testing

make style to run linter tests

make unit to execute unit tests (it requires OpenVINO installation followed by make install)

make test to execute functional tests (it requires building the docker image in advance). Running the tests also requires the preconfigured environment variables GOOGLE_APPLICATION_CREDENTIALS, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_REGION, with permissions to access the models used in the tests. To run only the tests that use locally downloaded models, use the command:

make test_local_only
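
Choosing between the two targets can be automated with a small check of the environment. This is a hypothetical helper, not part of the repository; it only inspects whether the four variables named above are set and does not validate the credentials.

```python
import os

REQUIRED_VARS = (
    "GOOGLE_APPLICATION_CREDENTIALS",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
)


def choose_make_target(env=os.environ):
    """Return the make target to run and the list of missing variables."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    target = "test" if not missing else "test_local_only"
    return target, missing


target, missing = choose_make_target({"AWS_REGION": "us-east-1"})
print(f"make {target}")  # make test_local_only
```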

Contact

Submit a GitHub issue to ask a question, request a feature or report a bug.
