A/B/n Experiments with Rewards

This tutorial describes how to use Iter8 to evaluate two or more versions of an application or ML model and identify the "best" version according to one or more reward metrics.

A reward metric is a metric that measures the benefit or profit of a version of an application or ML model. Reward metrics are usually application or model specific. User engagement, sales, and net profit are examples.

Assumptions

We assume that you have deployed multiple versions of an application (or ML model) with the following characteristics:

  • There is a way to route user traffic to the deployed versions. This might be done using the Iter8 SDK, the Iter8 traffic control features, or some other mechanism.
  • Metrics, including reward metrics, are being exported to a metrics store such as Prometheus.
  • Metrics can be retrieved from the metrics store by application (model) version.

In this tutorial, we mock a Prometheus service and demonstrate how to write an Iter8 experiment that evaluates reward metrics.

Mock Prometheus

For simplicity, we use mockoon to create a mocked Prometheus service instead of deploying Prometheus itself:

kubectl create deploy prometheus-mock \
--image mockoon/cli:latest \
--port 9090 \
-- mockoon-cli start --daemon-off \
--port 9090 \
--data https://raw.githubusercontent.com/iter8-tools/docs/v0.14.6/samples/abn/model-prometheus-abn-tutorial.json
kubectl expose deploy prometheus-mock --port 9090
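
To confirm that the mock is serving Prometheus-style responses, you can optionally port-forward the service and query it directly. The sketch below assumes the mock exposes the standard /api/v1/query endpoint used by the experiment; since the responses are canned, the query string itself does not matter:

kubectl port-forward svc/prometheus-mock 9090:9090 &
curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up'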

Define template

Create a provider specification that describes how Iter8 should fetch each metric value from the metrics store. The specification provides information about the provider URL, the HTTP method to be used, and any common headers. Furthermore, for each metric, there is:

  • metadata, such as name, type, and description,
  • HTTP query parameters, and
  • a jq expression describing how to extract the metric value from the response.
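
At the top level, such a specification might look like the following sketch; the field names follow Iter8's published provider templates, and the URL assumes the mocked Prometheus service created above in the default namespace (the sample specification referenced below is authoritative):

url: http://prometheus-mock.default.svc.cluster.local:9090/api/v1/query
method: GET
# headers: a map of common HTTP headers, if the provider requires any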

For example, a specification for the mean latency metric from Prometheus can look like the following:

metrics:
- name: latency-mean
  type: gauge
  description: |
    Mean latency
  params:
  - name: query
    value: |
      (sum(last_over_time(revision_app_request_latencies_sum{
        {{- template "labels" . }}
      }[{{ .elapsedTimeSeconds }}s])) or on() vector(0))/(sum(last_over_time(revision_app_request_latencies_count{
        {{- template "labels" . }}
      }[{{ .elapsedTimeSeconds }}s])) or on() vector(0))
  jqExpression: .data.result[0].value[1] | tonumber
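
The jqExpression is applied to the JSON body returned by the provider. A Prometheus instant query response generally has the shape sketched below (the values shown are illustrative); .data.result[0].value[1] selects the sample value, which Prometheus returns as a string, and tonumber converts it to a number:

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      { "metric": {}, "value": [ 1687946395.0, "19.0" ] }
    ]
  }
}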

Note that the template is parameterized. Values are provided by the Iter8 experiment at run time.
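
In particular, the query invokes a "labels" helper template. A sketch of such a helper, which renders the labels supplied by the experiment as Prometheus label selectors, might look like this (the exact definition lives in the sample specification referenced below):

{{- define "labels" }}
  {{- range $key, $value := .labels }}
    {{ $key }}="{{ $value }}",
  {{- end }}
{{- end }}

At run time, the custommetrics task supplies values such as .elapsedTimeSeconds and, for each version, combines the common values with that version's versionValues (both set when the experiment is launched below), so .labels resolves to the labels identifying the version being queried.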

A sample provider specification for Prometheus is provided here; it is the template referenced in the launch command below.

It describes the following metrics:

  • request-count
  • latency-mean
  • profit-mean

Launch experiment

iter8 k launch \
--set "tasks={custommetrics,assess}" \
--set custommetrics.templates.model-prometheus="https://gist.githubusercontent.com/kalantar/80c9efc0fd4cc34572d893cc82bdc4d2/raw/f3629aa62cdc9fd7e39ee2b6b113a8bf7b6b4463/model-prometheus-abn-tutorial.tpl" \
--set custommetrics.values.labels.model_name=wisdom \
--set 'custommetrics.versionValues[0].labels.mm_vmodel_id=wisdom-1' \
--set 'custommetrics.versionValues[1].labels.mm_vmodel_id=wisdom-2' \
--set assess.SLOs.upper.model-prometheus/latency-mean=50 \
--set "assess.rewards.max={model-prometheus/profit-mean}" \
--set runner=cronjob \
--set cronjobSchedule="*/1 * * * *"

This experiment executes in a loop, once every minute. In each iteration, the custommetrics task reads metrics from the (mocked) Prometheus provider, and the assess task verifies that latency-mean is at most 50 msec and identifies the version that provides the greatest reward; that is, the greatest mean profit.
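
Since the runner is a Kubernetes cronjob, you can verify that the loop is firing with standard kubectl commands (assuming the experiment was launched into the current namespace):

kubectl get cronjobs # one cronjob is created for the experiment
kubectl get jobs     # a new job should appear roughly every minute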

Inspect experiment report

iter8 k report

The text report looks like this:

Experiment summary:
*******************

  Experiment completed: true
  No task failures: true
  Total number of tasks: 2
  Number of completed tasks: 2
  Number of completed loops: 2

Whether or not service level objectives (SLOs) are satisfied:
*************************************************************

  SLO Conditions                      | version 0 | version 1
  --------------                      | --------- | ---------
  model-prometheus/latency-mean <= 50 | true      | true


Latest observed values for metrics:
***********************************

  Metric                        | version 0 | version 1 | Best
  -------                       | -----     | -----     | ----
  model-prometheus/latency-mean | 19.00     | 43.00     | n/a
  model-prometheus/profit-mean  | 71.00     | 92.00     | version 1

iter8 k report -o html > report.html # view in a browser

The HTML report looks like this:

(screenshot of the HTML report)

Because the experiment loops, the reported results will change over time.
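
If metric values do not appear as expected, the experiment logs are a good place to look; the Iter8 CLI can fetch them directly:

iter8 k log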


Cleanup

Delete the experiment:

iter8 k delete

Terminate the mocked Prometheus service:

kubectl delete deploy/prometheus-mock svc/prometheus-mock