mab

package module
v0.1.1
Published: Mar 1, 2021 License: Apache-2.0 Imports: 12 Imported by: 3

README

Mab

Multi-Armed Bandits Go Library

Description

What it is

Mab is a library/framework for scalable and customizable multi-armed bandits. It provides efficient pseudo-random implementations of epsilon-greedy and Thompson sampling strategies. Arm-selection strategies are decoupled from reward models, allowing Mab to be used with any reward model whose output can be described as a posterior distribution or point estimate for each arm.

Mab also provides a one-dimensional numerical integration package, numint, which was developed for use by the Mab Thompson sampler but can also be used on its own for numerical integration.

What it isn't

Mab is not concerned with building, training, or updating bandit reward models. It is focused on efficient pseudo-random arm selection given the output of a reward model.

Installation

go get -u github.com/stitchfix/mab

Usage

Bandit

A Bandit consists of three components: a RewardSource, a Strategy, and a Sampler. Mab provides implementations of each of these, but you are encouraged to implement your own as well! Each component is defined by a single-method interface, making it relatively simple to fully customize a Mab bandit.

Example:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/stitchfix/mab"
	"github.com/stitchfix/mab/numint"
)

func main() {

	rewards := map[string][]mab.Dist{
		"us": {
			mab.Beta(40, 474),
			mab.Beta(64, 730),
			mab.Beta(71, 818),
		},
		"uk": {
			mab.Beta(25, 254),
			mab.Beta(100, 430),
			mab.Beta(30, 503),
		},
	}

	bandit := mab.Bandit{
		RewardSource: &mab.ContextualRewardStub{rewards},
		Strategy:     mab.NewThompson(numint.NewQuadrature()),
		Sampler:      mab.NewSha1Sampler(),
	}

	result, err := bandit.SelectArm(context.Background(), "user_id:12345", "us")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(result)
}

SelectArm will get the reward estimates from the RewardSource, compute arm-selection probabilities using the Strategy and select an arm using the Sampler.

There is an unfortunate name collision between Go's context.Context type and the context of a contextual bandit. In Mab, context.Context variables will always be named ctx, while the variables used for bandit context will be called banditContext.

Go's context.Context should be used to pass request-scoped data to the RewardSource, and it is best practice to only use it for cancellation propagation or passing non-controlling data such as request IDs.

The values needed by the contextual bandit to determine the reward estimates should be passed using the last argument, which is named banditContext.

The unit input to SelectArm is a string that is used for enabling deterministic outcomes. This is useful for debugging and testing, but can also be used to ensure that users get a consistent experience in between updates to the bandit reward model. Bandits are expected to always provide the same arm selection for the same set of reward estimates and input unit string.

The output of SelectArm is a struct containing the reward estimates, computed probabilities, and selected arm.

RewardSource

A RewardSource is expected to provide up-to-date reward estimates for each arm, given some context data. Mab provides a basic implementation (HTTPSource) that can be used for requesting rewards from an HTTP service, and some stubs that can be used for testing and development.

type RewardSource interface {
    GetRewards(context.Context, interface{}) ([]Dist, error)
}

A typical RewardSource implementation is expected to get reward estimates from a database, a cache, or via an HTTP request to a dedicated reward service. Since a RewardSource is likely to require a call to some external service, the GetRewards method includes a context.Context argument. This enables Mab bandits to be used in web services that need to pass request-scoped data such as request timeouts and cancellation propagation. The second argument should be used to pass bandit context data to the reward source. The reward source must return one distribution per arm, conditional on the bandit context.
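
For illustration, here is a minimal sketch of a custom RewardSource backed by an in-memory map. The type name CacheSource and the armsTable field are hypothetical; any store that can produce a []mab.Dist per bandit context works the same way.

// CacheSource is a hypothetical RewardSource that looks up pre-computed
// reward distributions keyed by a string bandit context.
type CacheSource struct {
	armsTable map[string][]mab.Dist // e.g. refreshed by a model-update job
}

func (s *CacheSource) GetRewards(ctx context.Context, banditContext interface{}) ([]mab.Dist, error) {
	key, ok := banditContext.(string)
	if !ok {
		return nil, fmt.Errorf("expected string bandit context, got %T", banditContext)
	}
	dists, ok := s.armsTable[key]
	if !ok {
		return nil, fmt.Errorf("no reward estimates for context %q", key)
	}
	return dists, nil
}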

Distributions

Reward estimates are represented as a Dist for each arm.

type Dist interface {
    CDF(x float64) float64
    Mean() float64
    Prob(x float64) float64
    Rand() float64
    Support() (float64, float64)
}

Mab includes implementations of beta, normal, and point distributions. The beta and normal distributions wrap and extend gonum implementations, so they are performant and reliable.

Mab lets you combine any distribution with any strategy, although some combinations don't make sense in practice.

For epsilon greedy, you will most likely use Point distributions, since the algorithm only cares about the mean of the reward estimate. Other distributions can be used, as long as they implement a Mean() that returns well-defined values.

For Thompson sampling, it is recommended to use Normal or Beta distributions. Since Thompson sampling is based on sampling from finite-width distributions, you won't get a useful bandit by using Point distributions with the Thompson strategy.

The Null() function returns a Point distribution at negative infinity (math.Inf(-1)). This indicates to the Strategy that this arm should never be selected. Each Strategy must account for any number of Null distributions and return zero probability for the null arms and the correct set of probabilities for the non-null arms, as if the null arms were not present.
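
As a quick illustration of the distribution constructors (parameter values here are arbitrary):

dists := []mab.Dist{
	mab.Beta(40, 474),      // Beta posterior, e.g. from binary conversion data
	mab.Normal(0.08, 0.02), // Normal posterior with mean 0.08 and sigma 0.02
	mab.Point(0.05),        // point estimate, suitable for epsilon-greedy
	mab.Null(),             // arm that should never be selected
}
for _, d := range dists {
	lo, hi := d.Support()
	fmt.Println(d.Mean(), lo, hi)
}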

Strategy

A Mab Strategy computes arm-selection probabilities from the set of reward estimates.

Mab provides the following strategies:

  • Thompson sampling (mab.Thompson)
  • Epsilon-greedy (mab.EpsilonGreedy)
  • Proportional (mab.Proportional)

Mab also provides a Monte-Carlo based Thompson-sampling strategy (mab.ThompsonMC), but it is much slower and less accurate than mab.Thompson, which is based on numerical integration. It is not recommended to use ThompsonMC in production.

Thompson sampling

The Thompson sampling strategy computes arm-selection probabilities using the following formula:

P(arm i) = ∫ f_i(x) · ∏_{j ≠ i} F_j(x) dx

where f_i is the posterior PDF for arm i and F_j is the posterior CDF for arm j.

That is, the probability of selecting an arm under Thompson sampling is the integral of that arm's posterior PDF times the posterior CDFs of all other arms. The derivation of this formula is left as an exercise for the reader.

Computing these probabilities requires one-dimensional integration, which is provided by the numint subpackage.

The limits of integration are determined by the Support of the arms' distribution, so Point distributions will always get zero probability using Thompson sampling.
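
In code, the integrand for a given arm can be built directly from the Dist interface. The following is a sketch, not the library's internal implementation; it assumes rewards is a []mab.Dist, i is the arm index, and integrator is any Integrator implementation such as numint.NewQuadrature().

// P(arm i) = integral of arm i's PDF times the CDFs of all other arms.
integrand := func(x float64) float64 {
	p := rewards[i].Prob(x)
	for j := range rewards {
		if j != i {
			p *= rewards[j].CDF(x)
		}
	}
	return p
}
lo, hi := rewards[i].Support()
prob, err := integrator.Integrate(integrand, lo, hi)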

Epsilon-greedy

This is the basic epsilon-greedy selection strategy. The probability of selecting an arm under epsilon greedy is readily computed from a closed-form solution without the need for numerical integration. It is based on the Mean of the reward estimate.
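
A minimal sketch of computing epsilon-greedy probabilities directly, using arbitrary Point estimates:

strategy := mab.NewEpsilonGreedy(0.1)
probs, err := strategy.ComputeProbs([]mab.Dist{
	mab.Point(0.05),
	mab.Point(0.12), // highest mean, so it receives the bulk of the probability
	mab.Point(0.07),
})
if err != nil {
	log.Fatal(err)
}
fmt.Println(probs)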

Proportional

The proportional sampler computes arm selection probabilities proportional to some input weights. This is not a real bandit strategy, but exists to allow users to effectively shift the interface between reward sources and bandit strategies. You can create a RewardSource that returns the desired selection weights as Point distributions and then use the Proportional strategy to make sure that the sampler uses the normalized weights as the probability distribution for arm selection.
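
A sketch of that pattern, with hypothetical selection weights returned as Point distributions:

weights := []mab.Dist{mab.Point(2), mab.Point(3), mab.Point(5)}
probs, err := mab.NewProportional().ComputeProbs(weights)
if err != nil {
	log.Fatal(err)
}
fmt.Println(probs) // expected to be proportional to the weights: 0.2, 0.3, 0.5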

Sampler

A Mab Sampler selects an arm given the set of selection probabilities and a string. The default sampler implementation uses the SHA1 hash of the input string (mod 1000) to determine the arm.
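
Because the selection is based on a hash of the unit, repeated calls with the same probabilities and unit always return the same arm:

sampler := mab.NewSha1Sampler()
arm, err := sampler.Sample([]float64{0.1, 0.7, 0.2}, "user_id:12345")
if err != nil {
	log.Fatal(err)
}
fmt.Println(arm) // the same unit string always maps to the same arm index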

Numint

The Thompson sampling strategy depends on an integrator for computing probabilities.

type Integrator interface {
    Integrate(f func(float64) float64, a, b float64) (float64, error)
}
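
Any type that satisfies this interface can be passed to mab.NewThompson. As a rough sketch (far cruder than numint's quadrature), a fixed-step midpoint rule would qualify; the MidpointIntegrator type below is hypothetical:

type MidpointIntegrator struct {
	Steps int
}

func (m MidpointIntegrator) Integrate(f func(float64) float64, a, b float64) (float64, error) {
	h := (b - a) / float64(m.Steps)
	sum := 0.0
	for i := 0; i < m.Steps; i++ {
		sum += f(a + (float64(i)+0.5)*h) // evaluate at the midpoint of each step
	}
	return sum * h, nil
}

// strategy := mab.NewThompson(MidpointIntegrator{Steps: 1000})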

The numint package provides a quadrature-based implementation that can be used for Thompson sampling. It can be used effectively with just the default settings, or can be fully customized by the user.

The default quadrature rule and other parameters for the numint quadrature integrator have been found through trial-and-error to provide a good tradeoff between speed and reliability for a wide range of inputs including many combinations of normal and beta distributions.

See the numint README and documentation for more details.

Documentation

More detailed reference docs can be found on pkg.go.dev.

License

Mab and Numint are licensed under the Apache 2.0 license. See the LICENSE file for terms and conditions for use, reproduction, and distribution.

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Bandit

type Bandit struct {
	RewardSource
	Strategy
	Sampler
}

A Bandit gets reward values from a RewardSource, computes selection probabilities using a Strategy, and selects an arm using a Sampler.

func (*Bandit) SelectArm

func (b *Bandit) SelectArm(ctx context.Context, unit string, banditContext interface{}) (Result, error)

SelectArm gets the current reward estimates, computes the arm selection probabilities, and selects an arm index. Returns a partial result and an error message if an error is encountered at any point. For example, if the reward estimates were retrieved, but an error was encountered during the probability computation, the result will contain the reward estimates, but no probabilities or arm index. There is an unfortunate name collision between a multi-armed bandit context and Go's context.Context type. The context.Context argument should only be used for passing request-scoped data to an external reward service, such as timeouts and cancellation propagation. The banditContext argument is used to pass bandit context features to the reward source for contextual bandits. The unit argument is a string that will be hashed to select an arm with the pseudo-random sampler. SelectArm is deterministic for a fixed unit and set of reward estimates from the RewardSource.

Example
rewards := []Dist{
	Beta(1989, 21290),
	Beta(40, 474),
	Beta(64, 730),
	Beta(71, 818),
	Beta(52, 659),
	Beta(59, 718),
}

b := Bandit{
	RewardSource: &RewardStub{Rewards: rewards},
	Strategy:     NewThompson(numint.NewQuadrature()),
	Sampler:      NewSha1Sampler(),
}

result, err := b.SelectArm(context.Background(), "12345", nil)
if err != nil {
	panic(err)
}
fmt.Println(result.Arm)
Output:

2

type BetaDist

type BetaDist struct {
	distuv.Beta
}

func Beta

func Beta(alpha, beta float64) BetaDist

Beta is a beta distribution for use with any bandit strategy.

func (BetaDist) String

func (b BetaDist) String() string

func (BetaDist) Support

func (b BetaDist) Support() (float64, float64)

type ContextMarshaler added in v0.0.2

type ContextMarshaler interface {
	Marshal(banditContext interface{}) ([]byte, error)
}

ContextMarshaler is called on the banditContext and the result will become the body of the request to the bandit service.

type ContextualRewardStub added in v0.0.2

type ContextualRewardStub struct {
	Rewards map[string][]Dist
}

ContextualRewardStub is a static contextual RewardSource that can be used for testing and development of contextual bandits. It assumes that the context can be specified with a string.

func (*ContextualRewardStub) GetRewards added in v0.0.2

func (c *ContextualRewardStub) GetRewards(ctx context.Context, banditContext interface{}) ([]Dist, error)

GetRewards gets the static rewards for a given banditContext string.

type Dist

type Dist interface {
	// CDF returns the cumulative distribution function evaluated at x.
	CDF(x float64) float64

	// Mean returns the mean of the distribution.
	Mean() float64

	// Prob returns the probability density function or probability mass function evaluated at x.
	Prob(x float64) float64

	// Rand returns a pseudo-random sample drawn from the distribution.
	Rand() float64

	// Support returns the range of values over which the distribution is considered non-zero for the purposes of numerical integration.
	Support() (float64, float64)
}

A Dist represents a one-dimensional probability distribution. Reward estimates are represented as a Dist for each arm. Strategies compute arm-selection probabilities using the Dist interface. This allows for combining different distributions with different strategies.

func BetaFromJSON added in v0.0.2

func BetaFromJSON(data []byte) ([]Dist, error)

BetaFromJSON converts a JSON-encoded array of Beta distributions to a []Dist. Expects the JSON data to be in the form:

`[{"alpha": 123, "beta": 456}, {"alpha": 3.1415, "beta": 9.999}]`

Returns an error if alpha or beta value are missing or less than 1 for any arm. Any additional keys are ignored.
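
For example (arbitrary parameter values):

data := []byte(`[{"alpha": 40, "beta": 474}, {"alpha": 64, "beta": 730}]`)
rewards, err := BetaFromJSON(data)
if err != nil {
	panic(err)
}
fmt.Println(rewards)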

func NormalFromJSON added in v0.0.2

func NormalFromJSON(data []byte) ([]Dist, error)

NormalFromJSON converts a JSON-encoded array of Normal distributions to a []Dist. Expects the JSON data to be in the form:

`[{"mu": 123, "sigma": 456}, {"mu": 3.1415, "sigma": 9.999}]`

Returns an error if mu or sigma value are missing or sigma is less than 0 for any arm. Any additional keys are ignored.

func PointFromJSON added in v0.0.2

func PointFromJSON(data []byte) ([]Dist, error)

PointFromJSON converts a JSON-encoded array of Point distributions to a []Dist. Expects the JSON data to be in the form:

`[{"mu": 123}, {"mu": 3.1415}]`

Returns an error if mu value is missing for any arm. Any additional keys are ignored.

type EpsilonGreedy

type EpsilonGreedy struct {
	Epsilon float64
	// contains filtered or unexported fields
}

EpsilonGreedy implements the epsilon-greedy bandit strategy. The Epsilon parameter must be greater than zero. If any arm has a Null distribution, it will have zero selection probability, and the other arms' probabilities will be computed as if the Null arms are not present. Ties are accounted for, so if multiple arms have the maximum mean reward estimate, they will have equal probabilities.

func NewEpsilonGreedy added in v0.0.2

func NewEpsilonGreedy(e float64) *EpsilonGreedy

func (*EpsilonGreedy) ComputeProbs

func (e *EpsilonGreedy) ComputeProbs(rewards []Dist) ([]float64, error)

ComputeProbs computes the arm selection probabilities from the set of reward estimates, accounting for Nulls and ties. Returns an error if epsilon is less than zero.

type ErrRewardNon2XX added in v0.0.6

type ErrRewardNon2XX struct {
	Url        string
	StatusCode int
	RespBody   string
}

func (*ErrRewardNon2XX) Error added in v0.0.6

func (e *ErrRewardNon2XX) Error() string

type HTTPSource added in v0.0.2

type HTTPSource struct {
	// contains filtered or unexported fields
}

HTTPSource is a basic implementation of RewardSource that gets reward estimates from an HTTP reward service.

func NewHTTPSource added in v0.0.2

func NewHTTPSource(client HttpDoer, url string, parser RewardParser, opts ...HTTPSourceOption) *HTTPSource

NewHTTPSource returns a new HTTPSource given an HttpDoer, a url for the reward service, and a RewardParser. Optionally provide a ContextMarshaler for encoding bandit context. For example, if a reward service running on localhost:1337 provides Beta reward estimates:

client := &http.Client{Timeout: 100 * time.Millisecond}
url := "http://localhost:1337/rewards"
parser := ParseFunc(BetaFromJSON)
marshaler := MarshalFunc(json.Marshal)

source := NewHTTPSource(client, url, parser, WithContextMarshaler(marshaler))

func (*HTTPSource) GetRewards added in v0.0.2

func (h *HTTPSource) GetRewards(ctx context.Context, banditContext interface{}) ([]Dist, error)

GetRewards makes a POST request to the reward URL, and parses the response into a []Dist. If a banditContext is provided, it will be marshaled and included in the body of the request.

type HTTPSourceOption added in v0.0.2

type HTTPSourceOption func(source *HTTPSource)

HTTPSourceOption allows for optional arguments to NewHTTPSource

func WithContextMarshaler added in v0.0.2

func WithContextMarshaler(m ContextMarshaler) HTTPSourceOption

WithContextMarshaler is an optional argument to HTTPSource

type HttpDoer added in v0.0.2

type HttpDoer interface {
	Do(*http.Request) (*http.Response, error)
}

HttpDoer is a basic interface for making HTTP requests. The net/http Client can be used, or you can bring your own. Heimdall is a pretty good alternative client with some nice features: https://github.com/gojek/heimdall

type Integrator

type Integrator interface {
	Integrate(f func(float64) float64, a, b float64) (float64, error)
}

type MarshalFunc added in v0.0.2

type MarshalFunc func(banditContext interface{}) ([]byte, error)

MarshalFunc is an adapter to allow a normal function to be used as a ContextMarshaler

func (MarshalFunc) Marshal added in v0.0.2

func (m MarshalFunc) Marshal(banditContext interface{}) ([]byte, error)

type NormalDist

type NormalDist struct {
	distuv.Normal
}

func Normal

func Normal(mu, sigma float64) NormalDist

Normal is a normal distribution for use with any bandit strategy. For the purposes of Thompson sampling, it is truncated at mean +/- 4*sigma.

func (NormalDist) String

func (n NormalDist) String() string

func (NormalDist) Support

func (n NormalDist) Support() (float64, float64)

type ParseFunc added in v0.0.2

type ParseFunc func([]byte) ([]Dist, error)

ParseFunc is an adapter to allow a normal function to be used as a RewardParser

func (ParseFunc) Parse added in v0.0.2

func (p ParseFunc) Parse(b []byte) ([]Dist, error)

type PointDist

type PointDist struct {
	Mu float64
}

func Null added in v0.0.2

func Null() PointDist

Null returns a PointDist with mean equal to negative infinity. This is a special value that indicates to a Strategy that this arm should get selection probability zero.

func Point

func Point(mu float64) PointDist

Point is used for reward models that just provide point estimates. Don't use with Thompson sampling.

func (PointDist) CDF

func (p PointDist) CDF(x float64) float64

func (PointDist) Mean

func (p PointDist) Mean() float64

func (PointDist) Prob

func (p PointDist) Prob(x float64) float64

func (PointDist) Rand

func (p PointDist) Rand() float64

func (PointDist) String

func (p PointDist) String() string

func (PointDist) Support

func (p PointDist) Support() (float64, float64)

type Proportional

type Proportional struct {
	// contains filtered or unexported fields
}

Proportional is a trivial bandit strategy that returns arm-selection probabilities proportional to the mean reward estimate for each arm. This can be used when a bandit service wants to provide selection weights rather than reward estimates. Proportional treats Point(0) and Null() the same way, assigning them zero selection probability.

func NewProportional added in v0.0.2

func NewProportional() *Proportional

func (*Proportional) ComputeProbs

func (p *Proportional) ComputeProbs(rewards []Dist) ([]float64, error)

ComputeProbs computes probabilities proportional to the mean reward of each arm. Returns an error if any arm has a negative finite mean reward. A mean reward of negative infinity is treated as zero, so that a Null() distribution is treated the same as Point(0).

type Result

type Result struct {
	Rewards []Dist
	Probs   []float64
	Arm     int
}

Result is the return type for a call to Bandit.SelectArm. It will contain the reward estimates provided by the RewardSource, the computed arm selection probabilities, and the index of the selected arm.

type RewardParser added in v0.0.2

type RewardParser interface {
	Parse([]byte) ([]Dist, error)
}

RewardParser will be called to convert the response from the reward service to a slice of distributions.

type RewardSource

type RewardSource interface {
	GetRewards(ctx context.Context, banditContext interface{}) ([]Dist, error)
}

A RewardSource provides the current reward estimates, in the form of a Dist for each arm. There is an unfortunate name collision between a multi-armed bandit context and Go's Context type. The first argument is a context.Context and should only be used for passing request-scoped data to an external reward service. If the RewardSource does not require an external request, this first argument should always be context.Background(). The second argument is used to pass context values to the reward source for contextual bandits. A RewardSource implementation should provide the reward estimates conditioned on the value of banditContext. For non-contextual bandits, banditContext can be nil.

type RewardStub

type RewardStub struct {
	Rewards []Dist
}

RewardStub is a static non-contextual RewardSource that can be used for testing and development.

func (*RewardStub) GetRewards

func (s *RewardStub) GetRewards(context.Context, interface{}) ([]Dist, error)

GetRewards gets the static rewards

type Sampler

type Sampler interface {
	Sample(probs []float64, unit string) (int, error)
}

A Sampler returns a pseudo-random arm index given a set of probabilities and a string to hash. Samplers should always return the same arm index for the same set of probabilities and unit value.

type Sha1Sampler

type Sha1Sampler struct {
	// contains filtered or unexported fields
}

Sha1Sampler is a Sampler that uses the SHA1 hash of input unit to select an arm index with probability proportional to some given weights.

func NewSha1Sampler

func NewSha1Sampler() *Sha1Sampler

func (*Sha1Sampler) Sample

func (s *Sha1Sampler) Sample(weights []float64, unit string) (int, error)

Sample returns the selected arm for a given set of weights and input unit. An error is returned if any negative weight is encountered.

type Strategy

type Strategy interface {
	ComputeProbs([]Dist) ([]float64, error)
}

A Strategy computes arm selection probabilities from a slice of Distributions.

type Thompson

type Thompson struct {
	// contains filtered or unexported fields
}

func NewThompson

func NewThompson(integrator Integrator) *Thompson

func (*Thompson) ComputeProbs

func (t *Thompson) ComputeProbs(rewards []Dist) ([]float64, error)

type ThompsonMC

type ThompsonMC struct {
	NumIterations int
	// contains filtered or unexported fields
}

ThompsonMC is a Monte-Carlo based implementation of the Thompson sampling Strategy. It should not be used in production but is provided only as an example and for comparison with the Thompson Strategy, which is much faster and more accurate.

func NewThompsonMC

func NewThompsonMC(numIterations int) *ThompsonMC

NewThompsonMC returns a new ThompsonMC with numIterations.

func (*ThompsonMC) ComputeProbs

func (t *ThompsonMC) ComputeProbs(rewards []Dist) ([]float64, error)

ComputeProbs estimates the arm-selection probabilities by repeatedly sampling from the Dist for each arm, and counting how many times each arm yields the maximal sampled value.
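
Conceptually, the estimate is a simple argmax count over repeated draws; a sketch of the idea (not necessarily the exact internal code):

counts := make([]int, len(rewards))
for n := 0; n < numIterations; n++ {
	best, bestVal := 0, math.Inf(-1)
	for i, d := range rewards {
		if v := d.Rand(); v > bestVal {
			best, bestVal = i, v
		}
	}
	counts[best]++ // this arm produced the largest sampled value
}
probs := make([]float64, len(rewards))
for i, c := range counts {
	probs[i] = float64(c) / float64(numIterations)
}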

Directories

Path      Synopsis
examples
numint    Package numint provides rules and methods for one-dimensional numerical quadrature
