step_based

a2c

class A2C(save_dir: PathLike, env: Env, policy: Policy, critic: GAE, max_iter: int, min_rollouts: Optional[int] = None, min_steps: Optional[int] = None, vfcn_coeff: float = 0.5, entropy_coeff: float = 0.001, batch_size: int = 32, std_init: float = 1.0, max_grad_norm: Optional[float] = None, num_workers: int = 4, lr: float = 0.0005, lr_scheduler=None, lr_scheduler_hparam: Optional[dict] = None, logger: Optional[StepLogger] = None)[source]

Bases: ActorCritic

Advantage Actor Critic (A2C)

Constructor

Parameters:
  • save_dir – directory to save the snapshots, i.e. the results, in

  • env – the environment in which the policy operates

  • policy – policy to be updated

  • critic – advantage estimation function \(A(s,a) = Q(s,a) - V(s)\)

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • min_rollouts – minimum number of rollouts sampled per policy update batch

  • min_steps – minimum number of state transitions sampled per policy update batch

  • vfcn_coeff – weighting factor of the value function term in the combined loss

  • entropy_coeff – weighting factor of the entropy term in the combined loss

  • batch_size – number of samples per policy update batch

  • std_init – initial standard deviation on the actions for the exploration noise

  • max_grad_norm – maximum L2 norm of the gradients for clipping, set to None to disable gradient clipping

  • num_workers – number of environments for parallel sampling

  • lr – (initial) learning rate for the optimizer which can be modified by the scheduler. By default, the learning rate is constant.

  • lr_scheduler – learning rate scheduler that does one step per epoch (pass through the whole data set)

  • lr_scheduler_hparam – hyper-parameters for the learning rate scheduler

  • logger – logger for every step of the algorithm, if None the default logger will be created
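
For orientation, a minimal instantiation sketch that uses only the A2C and GAE signatures documented on this page. The helpers make_env, make_policy, and make_vfcn are hypothetical placeholders for the experiment-specific setup, and the import paths are inferred from the section headings above:

    from pyrado.algorithms.step_based.a2c import A2C
    from pyrado.algorithms.step_based.gae import GAE

    env = make_env()           # hypothetical helper: any pyrado Env
    policy = make_policy(env)  # hypothetical helper: any pyrado Policy
    vfcn = make_vfcn(env)      # hypothetical helper: value function network for the GAE critic

    critic = GAE(vfcn=vfcn, gamma=0.99, lamda=0.95, num_epoch=10, batch_size=64)
    algo = A2C(
        save_dir="experiments/a2c_demo",
        env=env,
        policy=policy,
        critic=critic,
        max_iter=200,
        min_steps=10000,  # sample at least 10000 transitions per policy update
        vfcn_coeff=0.5,
        entropy_coeff=1e-3,
        batch_size=32,
        num_workers=4,
        lr=5e-4,
    )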

loss_fcn(log_probs: Tensor, adv: Tensor, v_pred: Tensor, v_targ: Tensor)[source]

A2C loss function

Parameters:
  • log_probs – logarithm of the probabilities of the taken actions

  • adv – advantage values

  • v_pred – predicted value function values

  • v_targ – target value function values

Returns:

combined loss value
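
The combined loss follows the standard advantage actor-critic formulation: a policy gradient term plus a weighted value function regression term minus a weighted entropy bonus. A plain-PyTorch sketch for orientation (the entropy argument is an assumption about how the entropy_coeff term enters; it is not part of the documented signature):

    import torch

    def a2c_loss(log_probs, adv, v_pred, v_targ, vfcn_coeff=0.5, entropy_coeff=1e-3, entropy=None):
        # policy gradient term: maximize E[log pi(a|s) * A(s,a)], i.e. minimize its negative
        policy_loss = -(log_probs * adv.detach()).mean()
        # value function term: regress V(s) towards the target values
        vfcn_loss = torch.nn.functional.mse_loss(v_pred, v_targ)
        # optional entropy bonus encourages exploration
        entropy_bonus = entropy.mean() if entropy is not None else torch.tensor(0.0)
        return policy_loss + vfcn_coeff * vfcn_loss - entropy_coeff * entropy_bonus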

name: str = 'a2c'
update(rollouts: Sequence[StepSequence])[source]

Update the actor and critic parameters from the given batch of rollouts.

Parameters:

rollouts – batch of rollouts

actor_critic

class ActorCritic(env: Env, actor: Policy, critic: GAE, save_dir: PathLike, max_iter: int, logger: Optional[StepLogger] = None)[source]

Bases: Algorithm, ABC

Base class of all actor critic algorithms

Constructor

Parameters:
  • env – the environment in which the policy operates

  • actor – policy taking the actions in the environment

  • critic – estimates the value of states (e.g. advantage or return)

  • save_dir – directory to save the snapshots, i.e. the results, in

  • max_iter – maximum number of iterations

  • logger – logger for every step of the algorithm, if None the default logger will be created

property critic: GAE

Get the critic.

property expl_strat: NormalActNoiseExplStrat

Get the algorithm’s exploration strategy.

init_modules(warmstart: bool, suffix: str = '', prefix: Optional[str] = None, **kwargs)[source]

Initialize the algorithm’s learnable modules, e.g. a policy or value function. Overwrite this method if the algorithm uses a learnable module besides the policy, e.g. a value function.

Parameters:
  • warmstart – if True, the algorithm starts learning with a non-random initialization. This can either be a fixed parameter vector, or the loaded results of the previous iteration.

  • suffix – keyword for meta_info when loading from previous iteration

  • prefix – keyword for meta_info when loading from previous iteration

  • kwargs – keyword arguments for initialization, e.g. policy_param_init or valuefcn_param_init

load_snapshot(parsed_args) → Tuple[Env, Policy, dict][source]

Load the state of an experiment, which is specific to the algorithm.

Parameters:

parsed_args – arguments parsed by the argparser

Returns:

environment, policy, and (optional) algorithm-specific output, e.g. value function

reset(seed: Optional[int] = None)[source]

Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.

Parameters:

seed – seed value for the random number generators, pass None for no seeding

property sampler: ParallelRolloutSampler

Get the sampler. For algorithms with multiple samplers, this is the one collecting the training data.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

abstract update(rollouts: Sequence[StepSequence])[source]

Update the actor and critic parameters from the given batch of rollouts.

Parameters:

rollouts – batch of rollouts

dql

class DQL(save_dir: PathLike, env: Env, policy: DiscreteActQValPolicy, memory_size: int, eps_init: float, eps_schedule_gamma: float, gamma: float, max_iter: int, num_updates_per_step: int, target_update_intvl: Optional[int] = 5, num_init_memory_steps: Optional[int] = None, min_rollouts: Optional[int] = None, min_steps: Optional[int] = None, batch_size: int = 256, eval_intvl: int = 100, max_grad_norm: float = 0.5, lr: float = 0.0005, lr_scheduler=None, lr_scheduler_hparam: Optional[dict] = None, num_workers: int = 4, logger: Optional[StepLogger] = None, use_trained_policy_for_refill: bool = False)[source]

Bases: ValueBased

Deep Q-Learning (without bells and whistles)

See also

[1] V. Mnih et al., “Human-level control through deep reinforcement learning”, Nature, 2015

Constructor

Parameters:
  • save_dir – directory to save the snapshots, i.e. the results, in

  • env – the environment in which the policy operates

  • policy – (current) Q-network updated by this algorithm

  • memory_size – number of transitions in the replay memory buffer

  • eps_init – initial value for the probability of taking a random action, constant if eps_schedule_gamma=1

  • eps_schedule_gamma – temporal discount factor for the exponential decay of epsilon (see the sketch after this parameter list)

  • gamma – temporal discount factor for the state values

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • num_updates_per_step – number of (batched) updates per algorithm step

  • target_update_intvl – number of iterations that pass before updating the qfcn_targ network

  • num_init_memory_steps – number of samples used to initially fill the replay buffer with, pass None to fill the buffer completely

  • min_rollouts – minimum number of rollouts sampled per policy update batch

  • min_steps – minimum number of state transitions sampled per policy update batch

  • batch_size – number of samples per policy update batch

  • eval_intvl – interval in which the evaluation rollouts are collected, also the interval in which the logger prints the summary statistics

  • max_grad_norm – maximum L2 norm of the gradients for clipping, set to None to disable gradient clipping

  • lr – (initial) learning rate for the optimizer which can be modified by the scheduler. By default, the learning rate is constant.

  • lr_scheduler – learning rate scheduler that does one step per epoch (pass through the whole data set)

  • lr_scheduler_hparam – hyper-parameters for the learning rate scheduler

  • num_workers – number of environments for parallel sampling

  • logger – logger for every step of the algorithm, if None the default logger will be created

  • use_trained_policy_for_refill – whether to use the trained policy instead of a dummy policy to refill the replay buffer after resets
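
As a sketch of how eps_init and eps_schedule_gamma interact, the standard exponential schedule looks as follows (the library's exact bookkeeping may differ):

    def epsilon(curr_iter: int, eps_init: float, eps_schedule_gamma: float) -> float:
        # probability of taking a random action at iteration curr_iter:
        # constant for eps_schedule_gamma == 1, exponentially decaying for values < 1
        return eps_init * eps_schedule_gamma**curr_iter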

init_modules(warmstart: bool, suffix: str = '', prefix: str = '', **kwargs)[source]

Initialize the algorithm’s learnable modules, e.g. a policy or value function. Overwrite this method if the algorithm uses a learnable module besides the policy, e.g. a value function.

Parameters:
  • warmstart – if True, the algorithm starts learning with a non-random initialization. This can either be a fixed parameter vector, or the loaded results of the previous iteration.

  • suffix – keyword for meta_info when loading from previous iteration

  • prefix – keyword for meta_info when loading from previous iteration

  • kwargs – keyword arguments for initialization, e.g. policy_param_init or valuefcn_param_init

load_snapshot(parsed_args) → Tuple[Env, Policy, dict][source]

Load the state of an experiment, which is specific to the algorithm.

Parameters:

parsed_args – arguments parsed by the argparser

Returns:

environment, policy, and (optional) algorithm-specific output, e.g. value function

static loss_fcn(q_vals: Tensor, expected_q_vals: Tensor) → Tensor[source]

The Huber loss function on the one-step TD error \(\delta = Q(s,a) - (r + \gamma \max_a Q(s^\prime, a))\).

Parameters:
  • q_vals – state-action values \(Q(s,a)\), from policy network

  • expected_q_vals – expected state-action values \(r + \gamma \max_a Q(s^\prime, a)\), from target network

Returns:

loss value
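
A plain-PyTorch sketch of this loss, including how the expected Q-values are typically formed from the target network; q_net and q_net_targ are hypothetical stand-ins for the policy and target Q-networks, not pyrado objects:

    import torch
    import torch.nn.functional as F

    def dql_loss(q_net, q_net_targ, obs, act, rew, next_obs, not_done, gamma=0.99):
        # Q(s, a) of the actions that were actually taken, from the current network
        q_vals = q_net(obs).gather(1, act.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            # r + gamma * max_a' Q_targ(s', a') as the regression target (zero beyond terminal states)
            expected_q_vals = rew + gamma * not_done * q_net_targ(next_obs).max(dim=1).values
        # the Huber loss is less sensitive to outliers in the TD error than the squared loss
        return F.smooth_l1_loss(q_vals, expected_q_vals)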

name: str = 'dql'
reset(seed: Optional[int] = None)[source]

Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.

Parameters:

seed – seed value for the random number generators, pass None for no seeding

property sampler: ParallelRolloutSampler

Get the sampler. For algorithms with multiple samplers, this is the one collecting the training data.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

update()[source]

Update the policy’s and qfcn_targ Q-function’s parameters on transitions sampled from the replay memory.

gae

class GAE(vfcn: Union[Module, Policy], gamma: float = 0.99, lamda: float = 0.95, num_epoch: int = 10, batch_size: int = 64, standardize_adv: bool = True, standardizer: Optional[RunningStandardizer] = None, max_grad_norm: Optional[float] = None, lr: float = 0.0005, lr_scheduler=None, lr_scheduler_hparam: Optional[dict] = None)[source]

Bases: LoggerAware, Module

Generalized Advantage Estimation (GAE)

See also

[1] J. Schulman, P. Moritz, S. Levine, M. Jordan, P. Abbeel, ‘High-Dimensional Continuous Control Using Generalized Advantage Estimation’, ICLR, 2016

Constructor

Parameters:
  • vfcn – value function, which can be an FNN or a Policy

  • gamma – temporal discount factor

  • lamda – regulates the trade-off between bias (max for 0) and variance (max for 1), see [1]

  • num_epoch – number of iterations over all gathered samples during one estimator update

  • batch_size – number of samples per estimator update batch

  • standardize_adv – if True, the advantages are standardized to be \(\sim \mathcal{N}(0,1)\)

  • standardizer – pass None to use stateless standardization, alternatively pass RunningStandardizer() to use a standardizer which keeps track of past values

  • max_grad_norm – maximum L2 norm of the gradients for clipping, set to None to disable gradient clipping

  • lr – (initial) learning rate for the optimizer which can be modified by the scheduler. By default, the learning rate is constant.

  • lr_scheduler – learning rate scheduler that does one step per epoch (pass through the whole data set)

  • lr_scheduler_hparam – hyper-parameters for the learning rate scheduler

gae(concat_ros: StepSequence, v_pred: Optional[Tensor] = None, requires_grad: bool = False) → Tensor[source]

Compute the generalized advantage estimation as described in [1].

Parameters:
  • concat_ros – concatenated rollouts (sequence of steps from potentially different rollouts)

  • v_pred – state-value predictions if already computed, else pass None

  • requires_grad – is the gradient required

Return adv:

tensor of advantages
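
A reference sketch of the recursion from [1] for a single rollout, with one bootstrap value appended to the value predictions; the method above operates on concatenated StepSequence rollouts, so this is conceptual only:

    import torch

    def gae_advantages(rewards: torch.Tensor, v_pred: torch.Tensor, gamma=0.99, lamda=0.95) -> torch.Tensor:
        # rewards: shape (T,); v_pred: shape (T + 1,), including the bootstrap value V(s_T)
        T = rewards.shape[0]
        adv = torch.zeros(T)
        gae = 0.0
        for t in reversed(range(T)):
            # one-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
            delta = rewards[t] + gamma * v_pred[t + 1] - v_pred[t]
            # exponentially weighted sum: A_t = delta_t + gamma * lamda * A_{t+1}
            gae = delta + gamma * lamda * gae
            adv[t] = gae
        return adv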

reset()[source]

Reset the advantage estimator to its initial state. The default implementation resets the learning rate scheduler if there is one.

tdlamda_returns(v_pred: Optional[Tensor] = None, adv: Optional[Tensor] = None, concat_ros: Optional[StepSequence] = None) → Tensor[source]

Compute the TD(\(\lambda\)) returns based on the predictions of the network (introduces a bias).

Parameters:
  • v_pred – state-value predictions if already computed, pass None to compute them from the given rollouts

  • adv – advantages if already computed, pass None to compute them from the given rollouts

  • concat_ros – rollouts to compute predicted values and advantages from if they are not provided

Returns:

exponentially weighted returns based on the value function estimator
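
For GAE-based estimators, these targets are simply the value predictions shifted by the advantages; as a minimal sketch of the relation (not the library method itself):

    import torch

    def tdlamda_returns(v_pred: torch.Tensor, adv: torch.Tensor) -> torch.Tensor:
        # TD(lambda) return: G_t = V(s_t) + A_t, i.e. value prediction plus GAE advantage,
        # which is why the result inherits the bias of the value function estimator
        return v_pred + adv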

update(rollouts: Sequence[StepSequence], use_empirical_returns: bool = False)[source]

Adapt the parameters of the advantage function estimator, minimizing the MSE loss for the given samples.

Parameters:
  • rollouts – batch of rollouts

  • use_empirical_returns – use the returns from the rollouts (True) or the ones from the V-fcn (False)

Return adv:

tensor of advantages after V-function updates

values(concat_ros: StepSequence) → Tensor[source]

Compute the states’ values for all observations.

Parameters:

concat_ros – concatenated rollouts

Returns:

states’ values

property vfcn: Union[Module, Policy]

Get the value function approximator.

ppo

class PPO(save_dir: PathLike, env: Env, policy: Policy, critic: GAE, max_iter: int, min_rollouts: Optional[int] = None, min_steps: Optional[int] = None, num_epoch: int = 3, eps_clip: float = 0.1, batch_size: int = 64, std_init: float = 1.0, num_workers: int = 4, max_grad_norm: Optional[float] = None, lr: float = 0.0005, lr_scheduler=None, lr_scheduler_hparam: Optional[dict] = None, logger: Optional[StepLogger] = None)[source]

Bases: ActorCritic

Proximal Policy Optimization (PPO)

See also

[1] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, “Proximal Policy Optimization Algorithms”, arXiv, 2017

[2] D.P. Kingma, J. Ba, “Adam: A Method for Stochastic Optimization”, ICLR, 2015

Constructor

Parameters:
  • save_dir – directory to save the snapshots, i.e. the results, in

  • env – the environment in which the policy operates

  • policy – policy to be updated

  • critic – advantage estimation function \(A(s,a) = Q(s,a) - V(s)\)

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • min_rollouts – minimum number of rollouts sampled per policy update batch

  • min_steps – minimum number of state transitions sampled per policy update batch

  • num_epoch – number of iterations over all gathered samples during one policy update

  • eps_clip – max/min probability ratio, see [1]

  • batch_size – number of samples per policy update batch

  • std_init – initial standard deviation on the actions for the exploration noise

  • num_workers – number of environments for parallel sampling

  • max_grad_norm – maximum L2 norm of the gradients for clipping, set to None to disable gradient clipping

  • lr – (initial) learning rate for the optimizer which can be modified by the scheduler. By default, the learning rate is constant.

  • lr_scheduler – learning rate scheduler that does one step per epoch (pass through the whole data set)

  • lr_scheduler_hparam – hyper-parameters for the learning rate scheduler

  • logger – logger for every step of the algorithm, if None the default logger will be created

Note

The Adam optimizer computes individual learning rates for all parameters. Thus, the learning rate scheduler schedules the maximum learning rate.

property expl_strat: NormalActNoiseExplStrat

Get the algorithm’s exploration strategy.

loss_fcn(log_probs: Tensor, log_probs_old: Tensor, adv: Tensor) → Tensor[source]

PPO loss function

Parameters:
  • log_probs – logarithm of the probabilities of the taken actions using the updated policy

  • log_probs_old – logarithm of the probabilities of the taken actions using the old policy

  • adv – advantage values

Returns:

loss value
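
A sketch of the clipped surrogate objective from [1] in plain PyTorch, written as a loss to be minimized; details of the actual method (e.g. batching and logging) may differ:

    import torch

    def ppo_loss(log_probs, log_probs_old, adv, eps_clip=0.1):
        # probability ratio r = pi_new(a|s) / pi_old(a|s)
        ratio = torch.exp(log_probs - log_probs_old.detach())
        # clipped surrogate objective: take the pessimistic minimum of the two terms
        surr = torch.min(ratio * adv, torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip) * adv)
        return -surr.mean()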

name: str = 'ppo'
update(rollouts: Sequence[StepSequence])[source]

Update the actor and critic parameters from the given batch of rollouts.

Parameters:

rollouts – batch of rollouts

class PPO2(save_dir: PathLike, env: Env, policy: Policy, critic: GAE, max_iter: int, min_rollouts: Optional[int] = None, min_steps: Optional[int] = None, num_epoch: int = 3, eps_clip: float = 0.1, vfcn_coeff: float = 0.5, entropy_coeff: float = 0.001, batch_size: int = 32, std_init: float = 1.0, num_workers: int = 4, max_grad_norm: Optional[float] = None, lr: float = 0.0005, lr_scheduler=None, lr_scheduler_hparam: Optional[dict] = None, logger: Optional[StepLogger] = None)[source]

Bases: ActorCritic

Variant of Proximal Policy Optimization (PPO). PPO2 differs from PPO by also clipping the value function and standardizing the advantages.

Note

PPO2 refers to the OpenAI version of PPO which is a GPU-enabled implementation. However, this one is not!

See also

[1] OpenAI Stable Baselines Documentation, https://stable-baselines.readthedocs.io/en/master/modules/ppo2.html

[2] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, “Proximal Policy Optimization Algorithms”, arXiv, 2017

[3] D.P. Kingma, J. Ba, “Adam: A Method for Stochastic Optimization”, ICLR, 2015

Constructor

Parameters:
  • save_dir – directory to save the snapshots, i.e. the results, in

  • env – the environment in which the policy operates

  • policy – policy to be updated

  • critic – advantage estimation function \(A(s,a) = Q(s,a) - V(s)\)

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • min_rollouts – minimum number of rollouts sampled per policy update batch

  • min_steps – minimum number of state transitions sampled per policy update batch

  • num_epoch – number of iterations over all gathered samples during one policy update

  • eps_clip – max/min probability ratio, see [1]

  • vfcn_coeff – weighting factor of the value function term in the combined loss, specific to PPO2

  • entropy_coeff – weighting factor of the entropy term in the combined loss, specific to PPO2

  • batch_size – number of samples per policy update batch

  • std_init – initial standard deviation on the actions for the exploration noise

  • num_workers – number of environments for parallel sampling

  • max_grad_norm – maximum L2 norm of the gradients for clipping, set to None to disable gradient clipping

  • lr – (initial) learning rate for the optimizer which can be modified by the scheduler. By default, the learning rate is constant.

  • lr_scheduler – learning rate scheduler that does one step per epoch (pass through the whole data set)

  • lr_scheduler_hparam – hyper-parameters for the learning rate scheduler

  • logger – logger for every step of the algorithm, if None the default logger will be created

Note

The Adam optimizer computes individual learning rates for all parameters. Thus, the learning rate scheduler schedules the maximum learning rate.

property expl_strat: NormalActNoiseExplStrat

Get the algorithm’s exploration strategy.

loss_fcn(log_probs: Tensor, log_probs_old: Tensor, adv: Tensor, v_pred: Tensor, v_pred_old: Tensor, v_targ: Tensor) → Tensor[source]

PPO2 loss function

Parameters:
  • log_probs – logarithm of the probabilities of the taken actions using the updated policy

  • log_probs_old – logarithm of the probabilities of the taken actions using the old policy

  • adv – advantage values

  • v_pred – predicted value function values

  • v_pred_old – predicted value function values using the old value function

  • v_targ – target value function values

Returns:

combined loss value
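
A sketch of the combined PPO2 loss in plain PyTorch, following the clipped-value formulation referenced in [1]; the entropy argument is an assumption about how entropy_coeff enters and is not part of the documented signature:

    import torch

    def ppo2_loss(log_probs, log_probs_old, adv, v_pred, v_pred_old, v_targ,
                  eps_clip=0.1, vfcn_coeff=0.5, entropy_coeff=1e-3, entropy=None):
        # clipped surrogate policy objective, as in plain PPO
        ratio = torch.exp(log_probs - log_probs_old.detach())
        policy_loss = -torch.min(ratio * adv, torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip) * adv).mean()
        # clipped value loss: penalize new predictions that move too far from the old ones
        v_clipped = v_pred_old + torch.clamp(v_pred - v_pred_old, -eps_clip, eps_clip)
        vfcn_loss = torch.max((v_pred - v_targ).pow(2), (v_clipped - v_targ).pow(2)).mean()
        # optional entropy bonus encourages exploration
        entropy_bonus = entropy.mean() if entropy is not None else torch.tensor(0.0)
        return policy_loss + vfcn_coeff * vfcn_loss - entropy_coeff * entropy_bonus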

name: str = 'ppo2'
update(rollouts: Sequence[StepSequence])[source]

Update the actor and critic parameters from the given batch of rollouts.

Parameters:

rollouts – batch of rollouts

sac

class SAC(save_dir: PathLike, env: Env, policy: TwoHeadedPolicy, qfcn_1: Policy, qfcn_2: Policy, memory_size: int, gamma: float, max_iter: int, num_updates_per_step: Optional[int] = None, tau: float = 0.995, ent_coeff_init: float = 0.2, learn_ent_coeff: bool = True, target_update_intvl: int = 1, num_init_memory_steps: Optional[int] = None, standardize_rew: bool = True, rew_scale: Union[int, float] = 1.0, min_rollouts: Optional[int] = None, min_steps: Optional[int] = None, batch_size: Optional[int] = 256, eval_intvl: int = 100, max_grad_norm: float = 5.0, lr: float = 0.0003, lr_scheduler=None, lr_scheduler_hparam: Optional[dict] = None, num_workers: int = 4, logger: Optional[StepLogger] = None, use_trained_policy_for_refill: bool = False)[source]

Bases: ValueBased

Soft Actor-Critic (SAC) variant with a stochastic policy, two Q-functions, and two Q-targets (no V-function)

See also

[1] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”, ICML, 2018

[2] This implementation was inspired by https://github.com/pranz24/pytorch-soft-actor-critic which seems to be based on https://github.com/vitchyr/rlkit

[3] This implementation also borrows (at least a bit) from https://github.com/DLR-RM/stable-baselines3/tree/master/stable_baselines3/sac

[4] https://github.com/MushroomRL/mushroom-rl/blob/dev/mushroom_rl/algorithms/actor_critic/deep_actor_critic/sac.py

Note

The update order of the policy, the Q-functions (and the entropy coefficient) is different in almost every implementation out there. Here we follow the one from [3].

Constructor

Parameters:
  • save_dir – directory to save the snapshots, i.e. the results, in

  • env – the environment in which the policy operates

  • policy – policy to be updated

  • qfcn_1 – state-action value function \(Q(s,a)\), the associated target Q-function is created from a re-initialized copy of this one

  • qfcn_2 – state-action value function \(Q(s,a)\), the associated target Q-function is created from a re-initialized copy of this one

  • memory_size – number of transitions in the replay memory buffer, e.g. 1000000

  • gamma – temporal discount factor for the state values

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • num_updates_per_step – number of (batched) gradient updates per algorithm step

  • tau – interpolation factor used for the averaging (soft update, a.k.a. Polyak update) of the target networks, between 0 and 1

  • ent_coeff_init – initial weighting factor of the entropy term in the loss function

  • learn_ent_coeff – if True, adapt the weighting factor of the entropy term

  • target_update_intvl – number of iterations that pass before updating the target network

  • num_init_memory_steps – number of samples used to initially fill the replay buffer with, pass None to fill the buffer completely

  • standardize_rew – if True, the rewards are standardized to be \(\sim \mathcal{N}(0,1)\)

  • rew_scale – scaling factor for the rewards, defaults to no scaling

  • min_rollouts – minimum number of rollouts sampled per policy update batch

  • min_steps – minimum number of state transitions sampled per policy update batch

  • batch_size – number of samples per policy update batch

  • eval_intvl – interval in which the evaluation rollouts are collected, also the interval in which the logger prints the summary statistics

  • max_grad_norm – maximum L2 norm of the gradients for clipping, set to None to disable gradient clipping

  • lr – (initial) learning rate for the optimizer which can be modified by the scheduler. By default, the learning rate is constant.

  • lr_scheduler – learning rate scheduler type for the policy and the Q-functions that does one step per update() call

  • lr_scheduler_hparam – hyper-parameters for the learning rate scheduler

  • num_workers – number of environments for parallel sampling

  • logger – logger for every step of the algorithm, if None the default logger will be created

  • use_trained_policy_for_refill – whether to use the trained policy instead of a dummy policy to refill the replay buffer after resets

property ent_coeff: Tensor

Get the detached entropy coefficient.

init_modules(warmstart: bool, suffix: str = '', prefix: Optional[str] = None, **kwargs)[source]

Initialize the algorithm’s learnable modules, e.g. a policy or value function. Overwrite this method if the algorithm uses a learnable module besides the policy, e.g. a value function.

Parameters:
  • warmstart – if True, the algorithm starts learning with a non-random initialization. This can either be a fixed parameter vector, or the loaded results of the previous iteration.

  • suffix – keyword for meta_info when loading from previous iteration

  • prefix – keyword for meta_info when loading from previous iteration

  • kwargs – keyword arguments for initialization, e.g. policy_param_init or valuefcn_param_init

load_snapshot(parsed_args) → Tuple[Env, Policy, dict][source]

Load the state of an experiment, which is specific to the algorithm.

Parameters:

parsed_args – arguments parsed by the argparser

Returns:

environment, policy, and (optional) algorithm-specific output, e.g. value function

name: str = 'sac'
reset(seed: Optional[int] = None)[source]

Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.

Parameters:

seed – seed value for the random number generators, pass None for no seeding

property sampler: ParallelRolloutSampler

Get the sampler. For algorithms with multiple samplers, this is the one collecting the training data.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

update()[source]

Update the policy’s and Q-functions’ parameters on transitions sampled from the replay memory.
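
Two ingredients of this update that the constructor parameters control, sketched in plain PyTorch: the clipped double-Q target with entropy term (weighted by ent_coeff) and the Polyak soft update of the target networks with factor tau. This is the standard SAC formulation, not necessarily this exact implementation:

    import torch

    @torch.no_grad()
    def sac_q_target(rew, next_q1_targ, next_q2_targ, next_log_prob, not_done, gamma=0.99, ent_coeff=0.2):
        # clipped double-Q: use the minimum of the two target Q-values and subtract the entropy term
        next_q = torch.min(next_q1_targ, next_q2_targ) - ent_coeff * next_log_prob
        return rew + gamma * not_done * next_q

    @torch.no_grad()
    def polyak_update(target_net, net, tau=0.995):
        # soft target update: target <- tau * target + (1 - tau) * online (tau close to 1 means slow tracking)
        for p_targ, p in zip(target_net.parameters(), net.parameters()):
            p_targ.data.mul_(tau).add_((1.0 - tau) * p.data)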

svpg

class OptimizerHook(particle)[source]

Bases: object

This class partially mocks the optimizer interface in order to intercept the gradient updates of SVPG.

Constructor

Parameters:

particle – blueprint algorithm in which the optimizer is replaced by the mocked one

empty() → bool[source]

Check if the buffer is empty.

Returns:

buffer is empty

iter_steps() → Tuple[List, Dict, Tensor, Tensor][source]

Generate the steps in the buffer queue.

Yield:

the next step in the queue

real_step(*args, **kwargs)[source]

Call the original optimizer with given args and kwargs.

reset()[source]
step(*args, **kwargs)[source]

Store the args of the mocked call in the queue.

zero_grad(*args, **kwargs)[source]
class SVPG(save_dir: PathLike, env: Env, particle: Algorithm, max_iter: int, num_particles: int, temperature: float, horizon: int, logger: Optional[StepLogger] = None)[source]

Bases: Algorithm

Stein Variational Policy Gradient (SVPG)

See also

[1] Yang Liu, Prajit Ramachandran, Qiang Liu, Jian Peng, “Stein Variational Policy Gradient”, arXiv, 2017

Constructor

Parameters:
  • save_dir – directory to save the snapshots, i.e. the results, in

  • env – the environment in which the policy operates

  • particle – the particle to populate with different parameters during training

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • num_particles – number of SVPG particles

  • temperature – SVPG temperature

  • horizon – horizon for each particle

  • logger – defaults to None

property iter_particles

Iterate particles by sequentially loading and yielding them.

kernel(X: Tensor) → Tuple[Tensor, Tensor][source]

Compute the RBF-kernel and the corresponding derivatives.

Parameters:

X – the tensor to compute the kernel from

Returns:

the kernel and its derivatives
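
A sketch of an RBF kernel with median-heuristic bandwidth and the derivative needed for the Stein variational update; names and numerical details are illustrative, not this exact implementation:

    import torch

    def rbf_kernel(X: torch.Tensor):
        # X: (num_particles, num_params) flattened parameters of all particles
        sq_dists = torch.cdist(X, X, p=2).pow(2)  # squared pairwise distances
        # median heuristic for the bandwidth, as in common SVGD reference implementations
        h = sq_dists.median() / torch.log(torch.tensor(float(X.shape[0]) + 1.0))
        K = torch.exp(-sq_dists / (h + 1e-8))  # kernel matrix k(x_j, x_i)
        # sum_j d k(x_j, x_i) / d x_j, i.e. the repulsive term of the SVGD update
        grad_K = (K.sum(dim=1, keepdim=True) * X - K @ X) * 2.0 / (h + 1e-8)
        return K, grad_K

In the SVPG update, the kernel matrix weights the particles' policy gradients (scaled by the temperature), while the kernel derivative acts as a repulsive term that keeps the particles diverse.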

load_particle(idx: int)[source]

Load a specific particle’s state into self.particle.

Parameters:

idx – index of the particle to load

load_snapshot(parsed_args) → Tuple[Env, Policy, dict][source]

Load the state of an experiment, which is specific to the algorithm.

Parameters:

parsed_args – arguments parsed by the argparser

Returns:

environment, policy, and (optional) algorithm-specific output, e.g. value function

name: str = 'svpg'
save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

store_particle()[source]

Save the current particle’s state.

class SVPGBuilder(save_dir, env: Env, hparam: Dict)[source]

Bases: object

Helper class to build an SVPG algorithm instance

Constructor

Parameters:
  • save_dir – directory to save the snapshots, i.e. the results, in

  • env – the environment in which the policy operates

  • hparam – hyper-parameters for SVPG

value_based

class ValueBased(save_dir: PathLike, env: Env, policy: Union[Policy, TwoHeadedPolicy], memory_size: int, gamma: float, max_iter: int, num_updates_per_step: int, target_update_intvl: int, num_init_memory_steps: int, min_rollouts: int, min_steps: int, batch_size: int, eval_intvl: int, max_grad_norm: float, num_workers: int, logger: StepLogger, use_trained_policy_for_refill: bool = False)[source]

Bases: Algorithm, ABC

Base class of all value-based algorithms

Constructor

Parameters:
  • save_dir – directory to save the snapshots, i.e. the results, in

  • env – the environment in which the policy operates

  • policy – policy to be updated

  • memory_size – number of transitions in the replay memory buffer, e.g. 1000000

  • gamma – temporal discount factor for the state values

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • num_updates_per_step – number of (batched) gradient updates per algorithm step

  • target_update_intvl – number of iterations that pass before updating the target network

  • num_init_memory_steps – number of samples used to initially fill the replay buffer with, pass None to fill the buffer completely

  • min_rollouts – minimum number of rollouts sampled per policy update batch

  • min_steps – minimum number of state transitions sampled per policy update batch

  • batch_size – number of samples per policy update batch

  • eval_intvl – interval in which the evaluation rollouts are collected, also the interval in which the logger prints the summary statistics

  • max_grad_norm – maximum L2 norm of the gradients for clipping, set to None to disable gradient clipping

  • num_workers – number of environments for parallel sampling

  • logger – logger for every step of the algorithm, if None the default logger will be created

  • use_trained_policy_for_refill – whether to use the trained policy instead of a dummy policy to refill the replay buffer after resets

property expl_strat: Union[SACExplStrat, EpsGreedyExplStrat]

Get the algorithm’s exploration strategy.

property memory: ReplayMemory

Get the replay memory.

reset(seed: Optional[int] = None)[source]

Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.

Parameters:

seed – seed value for the random number generators, pass None for no seeding

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

abstract update()[source]

Update the policy’s (and value functions’) parameters based on the collected rollout data.
