Load Test a KServe Model (via HTTP)¶

This tutorial shows how easy it is to run a load test for KServe when using HTTP to make requests. We use a sklearn model to demonstrate. The same approach works for any model type.

Before you begin

Try your first experiment. Understand the main concepts behind Iter8 experiments.
Ensure that you have the kubectl CLI.

Have access to a cluster running KServe. You can create a KServe Quickstart environment as follows:

curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.10/hack/quick_install.sh" | bash

Deploy an InferenceService¶

Create an InferenceService which exposes an HTTP port. The following serves the sklearn irisv2 model:

cat <<EOF | kubectl apply -f -
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-irisv2"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      runtime: kserve-mlserver
      storageUri: "gs://seldon-models/sklearn/mms/lr_model"
EOF

Launch Experiment¶

Launch an Iter8 experiment inside the Kubernetes cluster:

iter8 k launch \
--set "tasks={ready,http,assess}" \
--set ready.isvc=sklearn-irisv2 \
--set ready.timeout=180s \
--set http.url=http://sklearn-irisv2.default.svc.cluster.local/v2/models/sklearn-irisv2/infer \
--set http.payloadURL=https://gist.githubusercontent.com/kalantar/d2dd03e8ebff2c57c3cfa992b44a54ad/raw/97a0480d0dfb1deef56af73a0dd31c80dc9b71f4/sklearn-irisv2-input.json \
--set http.contentType="application/json" \
--set assess.SLOs.upper.http/latency-mean=50 \
--set assess.SLOs.upper.http/error-count=0 \
--set runner=job

About this experiment

This experiment consists of three tasks, namely, ready, http, and assess.

The ready task checks if the sklearn-irisv2 InferenceService exists and is Ready.

The http task sends requests to the cluster-local HTTP service whose URL exposed by the InferenceService, http://sklearn-irisv2.default.svc.cluster.local/v2/models/sklearn-irisv2/infer, and collects Iter8's built-in HTTP load test metrics.

The assess task verifies if the app satisfies the specified SLOs: i) the mean latency of the service does not exceed 50 msec, and ii) there are no errors (4xx or 5xx response codes) in the responses.

This is a single-loop Kubernetes experiment where all the previously mentioned tasks will run once and the experiment will finish. Hence, its runner value is set to job.

You can assert experiment outcomes, view an experiment report, and view experiment logs as described in your first experiment.

Some variations and extensions of this experiment

The http task can be configured with load related parameters such as the number of requests, queries per second, or number of parallel connections.
The assess task can be configured with SLOs for any of Iter8's built-in HTTP load test metrics.

Clean up¶

iter8 k delete
kubectl delete inferenceservice sklearn-irisv2