Load Test a KServe Model (via gRPC)

This tutorial shows how to run a load test against a KServe InferenceService using gRPC requests. It uses a scikit-learn (sklearn) model as the example; the same approach works for any model type.

Before you begin
  1. Try your first experiment. Understand the main concepts behind Iter8 experiments.
  2. Ensure that you have the kubectl CLI.
  3. Have access to a cluster running KServe. You can create a KServe Quickstart environment as follows:
    curl -s "" | bash

Deploy an InferenceService

Create an InferenceService which exposes a gRPC port. The following serves the sklearn irisv2 model:

cat <<EOF | kubectl create -f -
apiVersion: ""
kind: "InferenceService"
metadata:
  name: "sklearn-irisv2"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      runtime: kserve-mlserver
      protocolVersion: v2
      storageUri: "gs://seldon-models/sklearn/mms/lr_model"
      ports:
      - containerPort: 9000
        name: h2c
        protocol: TCP
EOF

Launch Experiment

Launch the Iter8 experiment inside the Kubernetes cluster:

GRPC_HOST=$(kubectl get isvc sklearn-irisv2 -o jsonpath='{.status.components.predictor.address.url}' | sed 's#.*//##')
iter8 k launch \
--set "tasks={ready,grpc,assess}" \
--set ready.isvc=sklearn-irisv2 \
--set ready.timeout=180s \
--set grpc.protoURL= \
--set grpc.host=${GRPC_HOST}:${GRPC_PORT} \
--set grpc.call=inference.GRPCInferenceService.ModelInfer \
--set grpc.dataURL= \
--set assess.SLOs.upper.grpc/error-rate=0 \
--set assess.SLOs.upper.grpc/latency/mean=5000 \
--set assess.SLOs.upper.grpc/latency/p'97\.5'=7500 \
--set runner=job

About this experiment

This experiment consists of three tasks, namely, ready, grpc, and assess.

The ready task checks if the sklearn-irisv2 InferenceService exists and is Ready.
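
To run a comparable readiness check by hand, a sketch along these lines may help. It assumes kubectl is pointed at the cluster and namespace where the InferenceService was created. The function name wait_for_isvc is hypothetical and not part of Iter8, and the 180s default simply mirrors the ready.timeout value in the launch command.

```shell
# Hypothetical helper (not part of Iter8): block until an InferenceService
# reports the Ready condition, or fail once the timeout expires.
wait_for_isvc() {
  name="$1"
  timeout="${2:-180s}"   # default mirrors ready.timeout in the launch command
  kubectl wait --for=condition=Ready "inferenceservice/${name}" --timeout="${timeout}"
}

# Example invocation (requires access to the cluster):
# wait_for_isvc sklearn-irisv2 180s
```

The function returns a non-zero status if the InferenceService does not become Ready in time, so it can gate later steps in a script.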

The grpc task sends call requests to the inference.GRPCInferenceService.ModelInfer method of the cluster-local gRPC service with host address ${GRPC_HOST}:${GRPC_PORT}, and collects Iter8's built-in gRPC load test metrics.
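
The host part of that address is the InferenceService's predictor URL with the scheme removed, which is what the sed expression in the launch command does. The sketch below applies the same expression to a made-up URL; the function name strip_scheme and the sample URL are illustrative assumptions, not values read from a real cluster.

```shell
# Hypothetical helper: drop everything up to and including the last "//",
# using the same sed expression as the launch command.
strip_scheme() {
  printf '%s\n' "$1" | sed 's#.*//##'
}

# Made-up URL in the shape of a cluster-local predictor address.
sample_url="http://sklearn-irisv2-predictor.default.svc.cluster.local"
host=$(strip_scheme "$sample_url")
echo "$host"
```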

The assess task verifies whether the app satisfies the specified SLOs: i) there are no errors, ii) the mean latency of the service does not exceed 5000 msec, and iii) the 97.5th percentile latency does not exceed 7500 msec.
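
An upper SLO is satisfied when the observed metric value does not exceed the limit. The following sketch illustrates that comparison only. It is not how Iter8 implements the assess task, the function name within_upper_limit is hypothetical, and the observed values are made up.

```shell
# Hypothetical helper: succeed when value <= limit.
# awk handles the floating-point comparison portably.
within_upper_limit() {
  awk -v v="$1" -v l="$2" 'BEGIN { exit !(v + 0 <= l + 0) }'
}

# Made-up observations checked against the limits from the launch command.
within_upper_limit 0 0         && echo "error-rate: ok"
within_upper_limit 3120.5 5000 && echo "mean latency: ok"
within_upper_limit 8200 7500   || echo "p97.5 latency: violated"
```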

This is a single-loop Kubernetes experiment: each of the tasks above runs once, and then the experiment finishes. Hence, its runner value is set to job.

You can assert experiment outcomes, view an experiment report, and view experiment logs as described in your first experiment.

Some variations and extensions of this experiment
  1. The grpc task can be configured with load-related parameters such as the number of requests, requests per second, or number of concurrent connections.
  2. The assess task can be configured with SLOs for any of Iter8's built-in gRPC load test metrics.

Clean up

iter8 k delete
kubectl delete inferenceservice sklearn-irisv2