ilovebandits.mab package

Submodules

ilovebandits.mab.agents module

Core agent classes for the multi-armed bandit (MAB) problem.

class ilovebandits.mab.agents.BaseAgent(q_estimator: QEstMean | QEstFixedStep)

Bases: object

Base class for the agents. This base class is just a blueprint.

reset_agent(): Reset agent parameters such as arm counters and q estimations.

reset_only_estimator(): Reset q_estimator of the agent.

take_action()

Takes one step for the agent.

It takes in a reward and observation and returns the action the agent chooses at that time step.

class ilovebandits.mab.agents.EpsilonGreedyAgent(q_estimator: QEstMean | QEstFixedStep, epsilon: float)

Bases: BaseAgent

Epsilon-greedy agent. Takes the greedy action with probability 1 - epsilon and a random action with probability epsilon.

take_action() → Tuple[int, int, float]

Takes one step for the agent.

It takes in a reward and observation and returns the action the agent chooses at that time step.

Returns

int - the index of the current action.
int - the number of times the current action has been chosen.
float - the probability of selecting the action.
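A minimal sketch of the epsilon-greedy rule, using only the standard library. The function name epsilon_greedy_action and its arguments are hypothetical and are not part of ilovebandits. The sketch assumes that epsilon is a probability in [0, 1] and that ties among greedy arms are broken uniformly at random; the library may compute the reported probability differently.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng):
    """Return (action, probability of having selected that action)."""
    n_arms = len(q_values)
    best = max(q_values)
    greedy_arms = [i for i, q in enumerate(q_values) if q == best]

    if rng.random() < epsilon:
        action = rng.randrange(n_arms)    # explore: uniform over all arms
    else:
        action = rng.choice(greedy_arms)  # exploit: uniform over tied greedy arms

    # Every arm receives epsilon / n_arms from exploration; greedy arms
    # additionally share the remaining 1 - epsilon mass equally.
    prob = epsilon / n_arms
    if action in greedy_arms:
        prob += (1.0 - epsilon) / len(greedy_arms)
    return action, prob


rng = random.Random(0)
action, prob = epsilon_greedy_action([0.1, 0.5, 0.2], epsilon=0.1, rng=rng)
```

With epsilon=0 the sketch reduces to a pure greedy policy, and with epsilon=1 it reduces to a uniformly random policy.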

class ilovebandits.mab.agents.GreedyAgent(q_estimator: QEstMean | QEstFixedStep)

Bases: BaseAgent

Pure greedy agent. Always takes the greedy action.

take_action() → Tuple[int, int, float]

Takes one step for the agent.

It takes in a reward and observation and returns the action the agent chooses at that time step.

Returns

int - the index of the current action.
int - the number of times the current action has been chosen.
float - the probability of selecting the action.

class ilovebandits.mab.agents.RandomAgent(q_estimator: QEstMean | QEstFixedStep)

Bases: BaseAgent

Random Agent. It can be used as a baseline.

take_action() → Tuple[int, int, float]

Takes one step for the agent.

It takes in a reward and observation and returns the action the agent chooses at that time step.

Returns

int - the index of the current action.
int - the number of times the current action has been chosen.
float - the probability of selecting the action.

class ilovebandits.mab.agents.TSAgent(arms: int, a_init: float = 1, b_init: float = 1, samples_for_freq_est: int = 100000)

Bases: BaseAgent

Implements a Thompson Sampling agent for discrete 0-1 rewards. The defaults a_init=1 and b_init=1 correspond to a uniform prior distribution.

reset_agent(): Reset agent parameters such as arm counters and q estimations.

take_action() → Tuple[int, int, float]

Takes one step for the agent.

It takes in a reward and observation and returns the action the agent chooses at that time step.

Returns

int - the index of the current action.
int - the number of times the current action has been chosen.
float - the probability of selecting the action.
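A minimal sketch of Beta-Bernoulli Thompson Sampling, using only the standard library. The function names are hypothetical and are not part of ilovebandits. The Monte Carlo estimate of the selection probability is only a guess at what a parameter such as samples_for_freq_est might be used for; the source does not say.

```python
import random


def thompson_action(alphas, betas, rng):
    """Sample one theta per arm from Beta(alpha, beta) and pick the largest."""
    thetas = [rng.betavariate(a, b) for a, b in zip(alphas, betas)]
    return max(range(len(thetas)), key=thetas.__getitem__)


def estimate_action_probs(alphas, betas, n_samples, rng):
    """Monte Carlo estimate of how often each arm wins the sampling step."""
    wins = [0] * len(alphas)
    for _ in range(n_samples):
        wins[thompson_action(alphas, betas, rng)] += 1
    return [w / n_samples for w in wins]


rng = random.Random(0)
# Arm 0 still has the uniform Beta(1, 1) prior; arm 1 has seen
# 8 successes and 2 failures on top of that prior, giving Beta(9, 3).
alphas, betas = [1.0, 9.0], [1.0, 3.0]
probs = estimate_action_probs(alphas, betas, n_samples=2000, rng=rng)
```

Because arm 1 has the higher posterior mean, its sampled theta wins more often, so its estimated selection probability is the larger of the two.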

class ilovebandits.mab.agents.UCBAgent(q_estimator: QEstMean | QEstFixedStep, c: float)

Bases: BaseAgent

Implements the UCB1 Agent.

reset_agent(): Reset agent parameters such as arm counters and q estimations.

take_action() → Tuple[int, int, float]

Takes one step for the agent.

It takes in a reward and observation and returns the action the agent chooses at that time step.

Returns

int - the index of the current action.
int - the number of times the current action has been chosen.
float - the probability of selecting the action.

ilovebandits.mab.q_estimators module

Q-value estimators and reward scalers for the MAB problem.

class ilovebandits.mab.q_estimators.BaseScaler

Bases: object

Base class for the reward scalers. This base class leaves the rewards unchanged.

scale(reward): Scale reward.

class ilovebandits.mab.q_estimators.BernoulliBinarizationScaler(seed=42)

Bases: BaseScaler

Class to convert rewards from [0,1] domain to {0,1} domain.

scale(reward): Scale reward from [0,1] to {0,1}.

with_proba(epsilon): Bernoulli test. Returns True with probability \(\varepsilon\) and False with probability \(1 - \varepsilon\).
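A minimal sketch of Bernoulli binarization, using only the standard library. BinarizerSketch is a hypothetical stand-in, not the library class. It assumes that scale draws 1 with probability equal to the reward, which keeps the expected value of the binarized reward equal to the original reward.

```python
import random


class BinarizerSketch:
    """Hypothetical stand-in for a [0, 1] -> {0, 1} reward scaler."""

    def __init__(self, seed=42):
        self.rng = random.Random(seed)

    def with_proba(self, epsilon):
        # True with probability epsilon, False with probability 1 - epsilon.
        return self.rng.random() < epsilon

    def scale(self, reward):
        # A reward r in [0, 1] becomes 1 with probability r and 0 otherwise.
        return 1 if self.with_proba(reward) else 0


scaler = BinarizerSketch(seed=42)
draws = [scaler.scale(0.3) for _ in range(10_000)]
mean = sum(draws) / len(draws)  # approaches 0.3 as the number of draws grows
```

This kind of scaling lets an estimator that expects 0-1 rewards, such as a Beta-Bernoulli posterior, consume rewards that lie anywhere in [0, 1].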

class ilovebandits.mab.q_estimators.QEstBase(arms: int, qvals_init=None, reward_scaler: BaseScaler | BernoulliBinarizationScaler | None = None)

Bases: object

Base class for the Q estimators. This base class is just a blueprint.

estimate(action): Base estimate step: updates the arm count for the given action.

reset_arm_counts(): Reset arm count updates.

class ilovebandits.mab.q_estimators.QEstFixedStep(arms: int, step_size: float, qvals_init=None, reward_scaler: BaseScaler | BernoulliBinarizationScaler | None = None)

Bases: QEstBase

Q estimator to estimate reward with constant step size.

estimate(reward, action): Estimate reward for the given arm/action.
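A minimal sketch of a constant step-size update, Q <- Q + step_size * (reward - Q). The function fixed_step_update is hypothetical and not part of ilovebandits; it assumes this standard update rule, which the source does not spell out.

```python
def fixed_step_update(q_values, action, reward, step_size):
    """Move the estimate for `action` a fixed fraction toward `reward`."""
    q_values[action] += step_size * (reward - q_values[action])
    return q_values


q = [0.0, 0.0]
for r in [1.0, 1.0, 0.0]:
    fixed_step_update(q, action=0, reward=r, step_size=0.5)
# q[0] went 0.0 -> 0.5 -> 0.75 -> 0.375; q[1] was never updated
```

A constant step size weights recent rewards more heavily than old ones, which is useful when reward distributions drift over time.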

class ilovebandits.mab.q_estimators.QEstMean(arms: int, qvals_init=None, reward_scaler: BaseScaler | BernoulliBinarizationScaler | None = None)

Bases: QEstBase

Q estimator that estimates rewards using sample averages.

estimate(reward, action): Estimate reward for the given arm/action.
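A minimal sketch of an incremental sample-average update, which increments the arm count N and then applies Q <- Q + (reward - Q) / N. The function sample_average_update is hypothetical and not part of ilovebandits; it assumes the estimator keeps one count per arm.

```python
def sample_average_update(q_values, arm_counts, action, reward):
    """Update the running mean reward of `action` in place."""
    arm_counts[action] += 1
    q_values[action] += (reward - q_values[action]) / arm_counts[action]


q, counts = [0.0, 0.0], [0, 0]
for r in [1.0, 0.0, 1.0, 1.0]:
    sample_average_update(q, counts, action=1, reward=r)
# q[1] now equals the mean of the four rewards, 0.75
```

The incremental form gives the same result as averaging all rewards for the arm, without storing the reward history.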

class ilovebandits.mab.q_estimators.QThompSamp(alphas: array, betas: array, reward_scaler: BaseScaler | BernoulliBinarizationScaler | None = None)

Bases: QEstBase

Q estimator that estimates the reward for the Thompson Sampling agent, assuming the rewards follow a Bernoulli random variable. In this case, the Beta distribution is the prior.

estimate(reward, action): Estimate reward for the given arm/action.

get_expected_values(): Get expected values of the assumed distributions.

sample_thetas(): Sample thetas from current distribution.
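A minimal sketch of the Beta-Bernoulli posterior bookkeeping, using only the standard library. BetaPosteriorSketch is a hypothetical stand-in, not the library class. It assumes the conjugate update in which a reward of 1 increments alpha and a reward of 0 increments beta for the chosen arm.

```python
import random


class BetaPosteriorSketch:
    """Hypothetical per-arm Beta(alpha, beta) posterior for 0-1 rewards."""

    def __init__(self, alphas, betas, seed=0):
        self.alphas = list(alphas)
        self.betas = list(betas)
        self.rng = random.Random(seed)

    def estimate(self, reward, action):
        # Conjugate update: successes go to alpha, failures to beta.
        self.alphas[action] += reward
        self.betas[action] += 1 - reward

    def get_expected_values(self):
        # Mean of Beta(alpha, beta) is alpha / (alpha + beta).
        return [a / (a + b) for a, b in zip(self.alphas, self.betas)]

    def sample_thetas(self):
        return [self.rng.betavariate(a, b)
                for a, b in zip(self.alphas, self.betas)]


post = BetaPosteriorSketch([1, 1], [1, 1])
for r in [1, 1, 0]:
    post.estimate(r, action=0)
means = post.get_expected_values()  # arm 0 is Beta(3, 2), arm 1 stays Beta(1, 1)
```

The expected values summarize the current belief about each arm, while the sampled thetas inject the randomness that drives exploration.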

ilovebandits.mab.utils module

Utils functions for the package.

ilovebandits.mab.utils.argmax(q_values: List) → Tuple[int, float, List[int]]: Takes in a list of q_values and returns the index of the item with the highest value. Breaks ties randomly.

Returns

int - the index of the highest value in q_values.
float - the probability of selecting the action.
list[int] - the list of indices that are tied for the highest value.
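A minimal sketch of an argmax with random tie-breaking, using only the standard library. The function argmax_random_ties is hypothetical and is not the library function. It assumes the reported probability is 1 divided by the number of tied indices, since ties are broken uniformly.

```python
import random


def argmax_random_ties(q_values, rng):
    """Return (chosen index, selection probability, tied indices)."""
    best = max(q_values)
    ties = [i for i, q in enumerate(q_values) if q == best]
    return rng.choice(ties), 1.0 / len(ties), ties


index, prob, ties = argmax_random_ties([0.2, 0.7, 0.7, 0.1], random.Random(0))
# ties is [1, 2], prob is 0.5, and index is one of the tied positions
```

Random tie-breaking matters early in a run, when all estimates are equal and a deterministic argmax would always pick the first arm.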

ilovebandits.mab.utils.find_max_indices(numbers: List) → List: Returns a list with the index of the max number. In case of a tie, the indices of the tied numbers are returned.

ilovebandits.mab.utils.find_max_numbers(numbers: List) → List: Returns a list with the max number. In case of tie, the tied numbers are returned.

ilovebandits.mab.utils.ucb_uncertainty(c: float, arm_count_updates: List[int]) → ndarray: Compute uncertainty given a c value and an arm_count vector.

Arguments

c : c value for UCB.
arm_count_updates : list of arm counts.

Returns

np.ndarray : uncertainty vector.
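A minimal sketch of a UCB1-style exploration bonus, c * sqrt(ln(t) / N_a), using only the standard library. The function ucb_bonus is hypothetical and is not the library function. It assumes that t is the total number of pulls, that unpulled arms get an infinite bonus, and it returns a plain list rather than an ndarray; the library's exact formula is not given in the source.

```python
import math


def ucb_bonus(c, arm_counts):
    """Return one exploration bonus per arm."""
    total = sum(arm_counts)
    bonus = []
    for n in arm_counts:
        if n == 0:
            # Unpulled arms are maximally uncertain, so they get tried first.
            bonus.append(math.inf)
        else:
            bonus.append(c * math.sqrt(math.log(total) / n))
    return bonus


bonus = ucb_bonus(1.0, [1, 9])
# The arm pulled once has a larger bonus than the arm pulled nine times.
```

A UCB agent would add this bonus to its q estimates and pick the arm with the highest sum, so rarely pulled arms are revisited until their uncertainty shrinks.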

Module contents