step_based
a2c
- class A2C(save_dir: PathLike, env: Env, policy: Policy, critic: GAE, max_iter: int, min_rollouts: Optional[int] = None, min_steps: Optional[int] = None, vfcn_coeff: float = 0.5, entropy_coeff: float = 0.001, batch_size: int = 32, std_init: float = 1.0, max_grad_norm: Optional[float] = None, num_workers: int = 4, lr: float = 0.0005, lr_scheduler=None, lr_scheduler_hparam: Optional[dict] = None, logger: Optional[StepLogger] = None)[source]
Bases:
ActorCritic
Advantage Actor Critic (A2C)
Constructor
- Parameters:
save_dir – directory in which to save the snapshots, i.e. the results
env – the environment in which the policy operates
policy – policy to be updated
critic – advantage estimation function \(A(s,a) = Q(s,a) - V(s)\)
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
min_rollouts – minimum number of rollouts sampled per policy update batch
min_steps – minimum number of state transitions sampled per policy update batch
vfcn_coeff – weighting factor of the value function term in the combined loss
entropy_coeff – weighting factor of the entropy term in the combined loss
batch_size – number of samples per policy update batch
std_init – initial standard deviation on the actions for the exploration noise
max_grad_norm – maximum L2 norm of the gradients for clipping, set to None to disable gradient clipping
num_workers – number of environments for parallel sampling
lr – (initial) learning rate for the optimizer which can be modified by the scheduler. By default, the learning rate is constant.
lr_scheduler – learning rate scheduler that does one step per epoch (pass through the whole data set)
lr_scheduler_hparam – hyper-parameters for the learning rate scheduler
logger – logger for every step of the algorithm, if None the default logger will be created
- loss_fcn(log_probs: Tensor, adv: Tensor, v_pred: Tensor, v_targ: Tensor)[source]
A2C loss function
- Parameters:
log_probs – logarithm of the probabilities of the taken actions
adv – advantage values
v_pred – predicted value function values
v_targ – target value function values
- Returns:
combined loss value
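For orientation, the combined objective can be written out as a short sketch. This is illustrative only, not necessarily the exact Pyrado implementation: the helper name a2c_loss_sketch is hypothetical, and the entropy bonus (weighted by entropy_coeff) is omitted because it requires the action distribution, which is not an argument of loss_fcn.
```python
import torch
import torch.nn.functional as F

def a2c_loss_sketch(log_probs, adv, v_pred, v_targ, vfcn_coeff=0.5):
    """Illustrative A2C objective: policy-gradient term plus weighted value-function regression."""
    policy_loss = -(log_probs * adv.detach()).mean()  # maximize the advantage-weighted log-likelihood
    vfcn_loss = F.mse_loss(v_pred, v_targ)            # fit V(s) to the value targets
    return policy_loss + vfcn_coeff * vfcn_loss       # entropy bonus (entropy_coeff) omitted in this sketch
```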
- name: str = 'a2c'
- update(rollouts: Sequence[StepSequence])[source]
Update the actor and critic parameters from the given batch of rollouts.
- Parameters:
rollouts – batch of rollouts
actor_critic
- class ActorCritic(env: Env, actor: Policy, critic: GAE, save_dir: PathLike, max_iter: int, logger: Optional[StepLogger] = None)[source]
Bases:
Algorithm, ABC
Base class of all actor critic algorithms
Constructor
- Parameters:
env – the environment in which the policy operates
actor – policy taking the actions in the environment
critic – estimates the value of states (e.g. advantage or return)
save_dir – directory in which to save the snapshots, i.e. the results
max_iter – maximum number of iterations
logger – logger for every step of the algorithm, if None the default logger will be created
- property expl_strat: NormalActNoiseExplStrat
Get the algorithm’s exploration strategy.
- init_modules(warmstart: bool, suffix: str = '', prefix: Optional[str] = None, **kwargs)[source]
Initialize the algorithm’s learnable modules, e.g. a policy or value function. Overwrite this method if the algorithm uses a learnable module besides the policy, e.g. a value function.
- Parameters:
warmstart – if True, the algorithm starts learning with a non-random initialization. This can either be a fixed parameter vector or the loaded results of the previous iteration.
suffix – keyword for meta_info when loading from previous iteration
prefix – keyword for meta_info when loading from previous iteration
kwargs – keyword arguments for initialization, e.g. policy_param_init or valuefcn_param_init
- load_snapshot(parsed_args) Tuple[Env, Policy, dict] [source]
Load the state of an experiment, which is specific to the algorithm.
- Parameters:
parsed_args – arguments parsed by the argparser
- Returns:
environment, policy, and (optional) algorithm-specific output, e.g. value function
- reset(seed: Optional[int] = None)[source]
Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.
- Parameters:
seed – seed value for the random number generators, pass None for no seeding
- property sampler: ParallelRolloutSampler
Get the sampler. For algorithms with multiple samplers, this is the one collecting the training data.
- save_snapshot(meta_info: Optional[dict] = None)[source]
Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.
- Parameters:
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
- step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]
Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.
- Parameters:
snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
- abstract update(rollouts: Sequence[StepSequence])[source]
Update the actor and critic parameters from the given batch of rollouts.
- Parameters:
rollouts – batch of rollouts
dql
- class DQL(save_dir: PathLike, env: Env, policy: DiscreteActQValPolicy, memory_size: int, eps_init: float, eps_schedule_gamma: float, gamma: float, max_iter: int, num_updates_per_step: int, target_update_intvl: Optional[int] = 5, num_init_memory_steps: Optional[int] = None, min_rollouts: Optional[int] = None, min_steps: Optional[int] = None, batch_size: int = 256, eval_intvl: int = 100, max_grad_norm: float = 0.5, lr: float = 0.0005, lr_scheduler=None, lr_scheduler_hparam: Optional[dict] = None, num_workers: int = 4, logger: Optional[StepLogger] = None, use_trained_policy_for_refill: bool = False)[source]
Bases:
ValueBased
Deep Q-Learning (without bells and whistles)
See also
[1] V. Mnih et al., “Human-level control through deep reinforcement learning”, Nature, 2015
Constructor
- Parameters:
save_dir – directory in which to save the snapshots, i.e. the results
env – the environment in which the policy operates
policy – (current) Q-network updated by this algorithm
memory_size – number of transitions in the replay memory buffer
eps_init – initial value for the probability of taking a random action, constant if eps_schedule_gamma=1
eps_schedule_gamma – temporal discount factor for the exponential decay of epsilon, see the sketch after this parameter list
gamma – temporal discount factor for the state values
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
num_updates_per_step – number of (batched) updates per algorithm step
target_update_intvl – number of iterations that pass before updating the qfcn_targ network
num_init_memory_steps – number of samples used to initially fill the replay buffer; pass None to fill the buffer completely
min_rollouts – minimum number of rollouts sampled per policy update batch
min_steps – minimum number of state transitions sampled per policy update batch
batch_size – number of samples per policy update batch
eval_intvl – interval in which the evaluation rollouts are collected, also the interval in which the logger prints the summary statistics
max_grad_norm – maximum L2 norm of the gradients for clipping, set to None to disable gradient clipping
lr – (initial) learning rate for the optimizer which can be modified by the scheduler. By default, the learning rate is constant.
lr_scheduler – learning rate scheduler that does one step per epoch (pass through the whole data set)
lr_scheduler_hparam – hyper-parameters for the learning rate scheduler
num_workers – number of environments for parallel sampling
logger – logger for every step of the algorithm, if None the default logger will be created
use_trained_policy_for_refill – whether to use the trained policy instead of a dummy policy to refill the replay buffer after resets
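A minimal sketch of the epsilon schedule implied by eps_init and eps_schedule_gamma is given below. The helper name epsilon_at is hypothetical, and the assumption that the decay happens once per algorithm iteration may differ from the exact update step in Pyrado.
```python
def epsilon_at(iteration: int, eps_init: float, eps_schedule_gamma: float) -> float:
    """Exploration probability after the given number of decay steps (assumed: one per iteration)."""
    return eps_init * eps_schedule_gamma**iteration  # constant if eps_schedule_gamma == 1
```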
- init_modules(warmstart: bool, suffix: str = '', prefix: str = '', **kwargs)[source]
Initialize the algorithm’s learnable modules, e.g. a policy or value function. Overwrite this method if the algorithm uses a learnable module besides the policy, e.g. a value function.
- Parameters:
warmstart – if True, the algorithm starts learning with a non-random initialization. This can either be a fixed parameter vector or the loaded results of the previous iteration.
suffix – keyword for meta_info when loading from previous iteration
prefix – keyword for meta_info when loading from previous iteration
kwargs – keyword arguments for initialization, e.g. policy_param_init or valuefcn_param_init
- load_snapshot(parsed_args) Tuple[Env, Policy, dict] [source]
Load the state of an experiment, which is specific to the algorithm.
- Parameters:
parsed_args – arguments parsed by the argparser
- Returns:
environment, policy, and (optional) algorithm-specific output, e.g. value function
- static loss_fcn(q_vals: Tensor, expected_q_vals: Tensor) Tensor [source]
The Huber loss function on the one-step TD error \(\delta = Q(s,a) - (r + \gamma \max_a Q(s^\prime, a))\).
- Parameters:
q_vals – state-action values \(Q(s,a)\), from policy network
expected_q_vals – expected state-action values \(r + \gamma \max_a Q(s^\prime, a)\), from target network
- Returns:
loss value
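Since loss_fcn is documented as the Huber loss on the one-step TD error, the computation can be sketched with PyTorch’s built-in smooth L1 loss. The trailing comment about how expected_q_vals would be formed from the target network is an assumption about the surrounding update code, not part of this static method.
```python
import torch
import torch.nn.functional as F

def huber_td_loss_sketch(q_vals: torch.Tensor, expected_q_vals: torch.Tensor) -> torch.Tensor:
    """Huber loss on the TD error delta = Q(s,a) - (r + gamma * max_a' Q_targ(s', a'))."""
    return F.smooth_l1_loss(q_vals, expected_q_vals)

# Assumed context for the targets (computed with the frozen target network), e.g.:
# expected_q_vals = rewards + gamma * (1 - dones) * q_targ(next_obs).max(dim=1).values
```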
- name: str = 'dql'
- reset(seed: Optional[int] = None)[source]
Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.
- Parameters:
seed – seed value for the random number generators, pass None for no seeding
- property sampler: ParallelRolloutSampler
Get the sampler. For algorithms with multiple samplers, this is the one collecting the training data.
- save_snapshot(meta_info: Optional[dict] = None)[source]
Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.
- Parameters:
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
gae
- class GAE(vfcn: Union[Module, Policy], gamma: float = 0.99, lamda: float = 0.95, num_epoch: int = 10, batch_size: int = 64, standardize_adv: bool = True, standardizer: Optional[RunningStandardizer] = None, max_grad_norm: Optional[float] = None, lr: float = 0.0005, lr_scheduler=None, lr_scheduler_hparam: Optional[dict] = None)[source]
Bases:
LoggerAware, Module
General Advantage Estimation (GAE)
See also
[1] J. Schulman, P. Moritz, S. Levine, M. Jordan, P. Abbeel, ‘High-Dimensional Continuous Control Using Generalized Advantage Estimation’, ICLR 2016
Constructor
- Parameters:
vfcn – value function, which can be a FNN or a Policy
gamma – temporal discount factor
lamda – regulates the trade-off between bias (max for 0) and variance (max for 1), see [1]
num_epoch – number of iterations over all gathered samples during one estimator update
batch_size – number of samples per estimator update batch
standardize_adv – if True, the advantages are standardized to be \(\sim N(0,1)\)
standardizer – pass None to use stateless standardization, alternatively pass RunningStandardizer() to use a standardizer which keeps track of past values
max_grad_norm – maximum L2 norm of the gradients for clipping, set to None to disable gradient clipping
lr – (initial) learning rate for the optimizer which can be modified by the scheduler. By default, the learning rate is constant.
lr_scheduler – learning rate scheduler that does one step per epoch (pass through the whole data set)
lr_scheduler_hparam – hyper-parameters for the learning rate scheduler
- gae(concat_ros: StepSequence, v_pred: Optional[Tensor] = None, requires_grad: bool = False) Tensor [source]
Compute the generalized advantage estimation as described in [1].
- Parameters:
concat_ros – concatenated rollouts (sequence of steps from potentially different rollouts)
v_pred – state-value predictions if already computed, else pass None
requires_grad – whether the gradient is required
- Return adv:
tensor of advantages
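The recursion from [1] behind gae() can be written out as a from-scratch sketch for a single rollout; it is illustrative and not the library routine, and the assumed shapes (rewards of length T, v_pred of length T + 1 including a bootstrap value for the final state) are conventions of this sketch rather than the signature of GAE.gae().
```python
import torch

def gae_advantages_sketch(rewards: torch.Tensor, v_pred: torch.Tensor,
                          gamma: float = 0.99, lamda: float = 0.95) -> torch.Tensor:
    """Generalized advantage estimation for one rollout (rewards: (T,), v_pred: (T + 1,))."""
    T = rewards.numel()
    adv = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * v_pred[t + 1] - v_pred[t]  # one-step TD error
        gae = delta + gamma * lamda * gae                       # exponentially weighted sum of TD errors
        adv[t] = gae
    return adv

# The TD(lambda) returns used as value-function targets then follow as adv + v_pred[:-1].
```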
- reset()[source]
Reset the advantage estimator to its initial state. The default implementation resets the learning rate scheduler if there is one.
- tdlamda_returns(v_pred: Optional[Tensor] = None, adv: Optional[Tensor] = None, concat_ros: Optional[StepSequence] = None) Tensor [source]
Compute the TD(\(\lambda\)) returns based on the predictions of the network (introduces a bias).
- Parameters:
v_pred – state-value predictions if already computed, pass None to compute them from the given rollouts
adv – advantages if already computed, pass None to compute them from the given rollouts
concat_ros – rollouts to compute predicted values and advantages from if they are not provided
- Returns:
exponentially weighted returns based on the value function estimator
- update(rollouts: Sequence[StepSequence], use_empirical_returns: bool = False)[source]
Adapt the parameters of the advantage function estimator, minimizing the MSE loss for the given samples.
- Parameters:
rollouts – batch of rollouts
use_empirical_returns – use the returns from the rollouts (True) or the ones from the value function (False)
- Return adv:
tensor of advantages after V-function updates
- values(concat_ros: StepSequence) Tensor [source]
Compute the states’ values for all observations.
- Parameters:
concat_ros – concatenated rollouts
- Returns:
states’ values
ppo
- class PPO(save_dir: PathLike, env: Env, policy: Policy, critic: GAE, max_iter: int, min_rollouts: Optional[int] = None, min_steps: Optional[int] = None, num_epoch: int = 3, eps_clip: float = 0.1, batch_size: int = 64, std_init: float = 1.0, num_workers: int = 4, max_grad_norm: Optional[float] = None, lr: float = 0.0005, lr_scheduler=None, lr_scheduler_hparam: Optional[dict] = None, logger: Optional[StepLogger] = None)[source]
Bases:
ActorCritic
Proximal Policy Optimization (PPO)
See also
[1] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, “Proximal Policy Optimization Algorithms”, arXiv, 2017
[2] D.P. Kingma, J. Ba, “Adam: A Method for Stochastic Optimization”, ICLR, 2015
Constructor
- Parameters:
save_dir – directory in which to save the snapshots, i.e. the results
env – the environment in which the policy operates
policy – policy to be updated
critic – advantage estimation function \(A(s,a) = Q(s,a) - V(s)\)
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
min_rollouts – minimum number of rollouts sampled per policy update batch
min_steps – minimum number of state transitions sampled per policy update batch
num_epoch – number of iterations over all gathered samples during one policy update
eps_clip – max/min probability ratio, see [1]
batch_size – number of samples per policy update batch
std_init – initial standard deviation on the actions for the exploration noise
num_workers – number of environments for parallel sampling
max_grad_norm – maximum L2 norm of the gradients for clipping, set to None to disable gradient clipping
lr – (initial) learning rate for the optimizer which can be modified by the scheduler. By default, the learning rate is constant.
lr_scheduler – learning rate scheduler that does one step per epoch (pass through the whole data set)
lr_scheduler_hparam – hyper-parameters for the learning rate scheduler
logger – logger for every step of the algorithm, if None the default logger will be created
Note
The Adam optimizer computes individual learning rates for all parameters. Thus, the learning rate scheduler schedules the maximum learning rate.
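A minimal construction sketch tying PPO to its GAE critic is given below, using only arguments documented on this page. The objects env, policy, and vfcn are placeholders assumed to have been created elsewhere, the save directory is hypothetical, the PPO import path is inferred from this module’s heading, and the train() entry point is assumed to be inherited from the Algorithm base class (it is not documented in this section).
```python
from pyrado.algorithms.step_based.gae import GAE  # path taken from the critic type annotation above
from pyrado.algorithms.step_based.ppo import PPO  # path inferred from the "ppo" heading of this module

# Placeholders: `env` (a pyrado Env), `policy` and `vfcn` (pyrado Policy instances) are assumed to
# exist already; their construction depends on the chosen environment and is omitted here.
critic = GAE(vfcn=vfcn, gamma=0.99, lamda=0.95, num_epoch=10, batch_size=64)
algo = PPO(
    save_dir="experiments/ppo_demo",  # hypothetical directory
    env=env,
    policy=policy,
    critic=critic,
    max_iter=300,
    min_steps=10_000,  # sample at least 10k transitions per policy update
    num_epoch=5,
    eps_clip=0.1,
    batch_size=64,
    std_init=1.0,
    num_workers=4,
    lr=5e-4,
)
# Assumed entry point; "latest" is taken to mean storing a snapshot on every iteration (cf. step()).
algo.train(snapshot_mode="latest")
```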
- property expl_strat: NormalActNoiseExplStrat
Get the algorithm’s exploration strategy.
- loss_fcn(log_probs: Tensor, log_probs_old: Tensor, adv: Tensor) Tensor [source]
PPO loss function
- Parameters:
log_probs – logarithm of the probabilities of the taken actions using the updated policy
log_probs_old – logarithm of the probabilities of the taken actions using the old policy
adv – advantage values
- Returns:
loss value
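The clipped surrogate objective from [1] behind loss_fcn can be sketched as follows; this is an illustration of the standard PPO loss, not necessarily byte-for-byte what Pyrado computes, and the helper name is hypothetical.
```python
import torch

def ppo_clip_loss_sketch(log_probs: torch.Tensor, log_probs_old: torch.Tensor,
                         adv: torch.Tensor, eps_clip: float = 0.1) -> torch.Tensor:
    """Negative clipped surrogate objective, averaged over the batch."""
    ratio = torch.exp(log_probs - log_probs_old.detach())  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip) * adv
    return -torch.min(unclipped, clipped).mean()
```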
- name: str = 'ppo'
- update(rollouts: Sequence[StepSequence])[source]
Update the actor and critic parameters from the given batch of rollouts.
- Parameters:
rollouts – batch of rollouts
- class PPO2(save_dir: PathLike, env: Env, policy: Policy, critic: GAE, max_iter: int, min_rollouts: Optional[int] = None, min_steps: Optional[int] = None, num_epoch: int = 3, eps_clip: float = 0.1, vfcn_coeff: float = 0.5, entropy_coeff: float = 0.001, batch_size: int = 32, std_init: float = 1.0, num_workers: int = 4, max_grad_norm: Optional[float] = None, lr: float = 0.0005, lr_scheduler=None, lr_scheduler_hparam: Optional[dict] = None, logger: Optional[StepLogger] = None)[source]
Bases:
ActorCritic
Variant of Proximal Policy Optimization (PPO). PPO2 differs from PPO by also clipping the value function and standardizing the advantages.
Note
PPO2 refers to the OpenAI version of PPO which is a GPU-enabled implementation. However, this one is not!
See also
[1] OpenAI Stable Baselines Documentation, https://stable-baselines.readthedocs.io/en/master/modules/ppo2.html
[2] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, “Proximal Policy Optimization Algorithms”, arXiv, 2017
[3] D.P. Kingma, J. Ba, “Adam: A Method for Stochastic Optimization”, ICLR, 2015
Constructor
- Parameters:
save_dir – directory in which to save the snapshots, i.e. the results
env – the environment in which the policy operates
policy – policy to be updated
critic – advantage estimation function \(A(s,a) = Q(s,a) - V(s)\)
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
min_rollouts – minimum number of rollouts sampled per policy update batch
min_steps – minimum number of state transitions sampled per policy update batch
num_epoch – number of iterations over all gathered samples during one policy update
eps_clip – max/min probability ratio, see [1]
vfcn_coeff – weighting factor of the value function term in the combined loss, specific to PPO2
entropy_coeff – weighting factor of the entropy term in the combined loss, specific to PPO2
batch_size – number of samples per policy update batch
std_init – initial standard deviation on the actions for the exploration noise
num_workers – number of environments for parallel sampling
max_grad_norm – maximum L2 norm of the gradients for clipping, set to None to disable gradient clipping
lr – (initial) learning rate for the optimizer which can be modified by the scheduler. By default, the learning rate is constant.
lr_scheduler – learning rate scheduler that does one step per epoch (pass through the whole data set)
lr_scheduler_hparam – hyper-parameters for the learning rate scheduler
logger – logger for every step of the algorithm, if None the default logger will be created
Note
The Adam optimizer computes individual learning rates for all parameters. Thus, the learning rate scheduler schedules the maximum learning rate.
- property expl_strat: NormalActNoiseExplStrat
Get the algorithm’s exploration strategy.
- loss_fcn(log_probs: Tensor, log_probs_old: Tensor, adv: Tensor, v_pred: Tensor, v_pred_old: Tensor, v_targ: Tensor) Tensor [source]
PPO2 loss function
- Parameters:
log_probs – logarithm of the probabilities of the taken actions using the updated policy
log_probs_old – logarithm of the probabilities of the taken actions using the old policy
adv – advantage values
v_pred – predicted value function values
v_pred_old – predicted value function values using the old value function
v_targ – target value function values
- Returns:
combined loss value
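As a companion to the PPO sketch above, the PPO2 variant additionally clips the value-function update around the old prediction. The following illustrates the combined loss described here; the entropy term weighted by entropy_coeff is omitted because it needs the action distribution, which is not an argument of loss_fcn, and the clipping of the value error with eps_clip is an assumption borrowed from the OpenAI-style implementation in [1].
```python
import torch

def ppo2_loss_sketch(log_probs, log_probs_old, adv, v_pred, v_pred_old, v_targ,
                     eps_clip: float = 0.1, vfcn_coeff: float = 0.5) -> torch.Tensor:
    """Clipped surrogate loss plus clipped value-function loss (entropy bonus omitted)."""
    ratio = torch.exp(log_probs - log_probs_old.detach())
    policy_loss = -torch.min(ratio * adv,
                             torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip) * adv).mean()
    # Clip the value prediction around the old one and take the larger of the two squared errors
    v_clipped = v_pred_old + torch.clamp(v_pred - v_pred_old, -eps_clip, eps_clip)
    vfcn_loss = torch.max((v_pred - v_targ).pow(2), (v_clipped - v_targ).pow(2)).mean()
    return policy_loss + vfcn_coeff * vfcn_loss
```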
- name: str = 'ppo2'
- update(rollouts: Sequence[StepSequence])[source]
Update the actor and critic parameters from the given batch of rollouts.
- Parameters:
rollouts – batch of rollouts
sac
- class SAC(save_dir: PathLike, env: Env, policy: TwoHeadedPolicy, qfcn_1: Policy, qfcn_2: Policy, memory_size: int, gamma: float, max_iter: int, num_updates_per_step: Optional[int] = None, tau: float = 0.995, ent_coeff_init: float = 0.2, learn_ent_coeff: bool = True, target_update_intvl: int = 1, num_init_memory_steps: Optional[int] = None, standardize_rew: bool = True, rew_scale: Union[int, float] = 1.0, min_rollouts: Optional[int] = None, min_steps: Optional[int] = None, batch_size: Optional[int] = 256, eval_intvl: int = 100, max_grad_norm: float = 5.0, lr: float = 0.0003, lr_scheduler=None, lr_scheduler_hparam: Optional[dict] = None, num_workers: int = 4, logger: Optional[StepLogger] = None, use_trained_policy_for_refill: bool = False)[source]
Bases:
ValueBased
Soft Actor-Critic (SAC) variant with a stochastic policy, two Q-functions, and two Q-targets (no V-function)
See also
[1] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”, ICML, 2018
[2] This implementation was inspired by https://github.com/pranz24/pytorch-soft-actor-critic, which seems to be based on https://github.com/vitchyr/rlkit
[3] This implementation also borrows (at least a bit) from https://github.com/DLR-RM/stable-baselines3/tree/master/stable_baselines3/sac
Note
The update order of the policy, the Q-functions (and the entropy coefficient) is different in almost every implementation out there. Here, we follow the one from [3].
Constructor
- Parameters:
save_dir – directory in which to save the snapshots, i.e. the results
env – the environment in which the policy operates
policy – policy to be updated
qfcn_1 – state-action value function \(Q(s,a)\), the associated target Q-function is created from a re-initialized copy of this one
qfcn_2 – state-action value function \(Q(s,a)\), the associated target Q-function is created from a re-initialized copy of this one
memory_size – number of transitions in the replay memory buffer, e.g. 1000000
gamma – temporal discount factor for the state values
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
num_updates_per_step – number of (batched) gradient updates per algorithm step
tau – interpolation factor for averaging the target networks, used for the soft a.k.a. Polyak update, between 0 and 1; see the sketch after this parameter list
ent_coeff_init – initial weighting factor of the entropy term in the loss function
learn_ent_coeff – adapt the weighting factor of the entropy term
target_update_intvl – number of iterations that pass before updating the target network
num_init_memory_steps – number of samples used to initially fill the replay buffer; pass None to fill the buffer completely
standardize_rew – if True, the rewards are standardized to be \(\sim N(0,1)\)
rew_scale – scaling factor for the rewards, defaults to no scaling
min_rollouts – minimum number of rollouts sampled per policy update batch
min_steps – minimum number of state transitions sampled per policy update batch
batch_size – number of samples per policy update batch
eval_intvl – interval in which the evaluation rollouts are collected, also the interval in which the logger prints the summary statistics
max_grad_norm – maximum L2 norm of the gradients for clipping, set to None to disable gradient clipping
lr – (initial) learning rate for the optimizer which can be modified by the scheduler. By default, the learning rate is constant.
lr_scheduler – learning rate scheduler type for the policy and the Q-functions that does one step per update() call
lr_scheduler_hparam – hyper-parameters for the learning rate scheduler
num_workers – number of environments for parallel sampling
logger – logger for every step of the algorithm, if None the default logger will be created
use_trained_policy_for_refill – whether to use the trained policy instead of a dummy policy to refill the replay buffer after resets
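The soft (Polyak) target update controlled by tau can be sketched as below. With the convention implied by the default tau = 0.995, the target parameters move only slightly towards the live network on each update; note that some libraries swap the roles of tau and 1 - tau. This is an illustrative helper, not the library routine.
```python
import torch

@torch.no_grad()
def polyak_update_sketch(target_net: torch.nn.Module, live_net: torch.nn.Module, tau: float = 0.995):
    """theta_targ <- tau * theta_targ + (1 - tau) * theta, applied parameter-wise."""
    for p_targ, p_live in zip(target_net.parameters(), live_net.parameters()):
        p_targ.mul_(tau).add_((1.0 - tau) * p_live)
```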
- property ent_coeff: Tensor
Get the detached entropy coefficient.
- init_modules(warmstart: bool, suffix: str = '', prefix: Optional[str] = None, **kwargs)[source]
Initialize the algorithm’s learnable modules, e.g. a policy or value function. Overwrite this method if the algorithm uses a learnable module besides the policy, e.g. a value function.
- Parameters:
warmstart – if True, the algorithm starts learning with a non-random initialization. This can either be a fixed parameter vector or the loaded results of the previous iteration.
suffix – keyword for meta_info when loading from previous iteration
prefix – keyword for meta_info when loading from previous iteration
kwargs – keyword arguments for initialization, e.g. policy_param_init or valuefcn_param_init
- load_snapshot(parsed_args) Tuple[Env, Policy, dict] [source]
Load the state of an experiment, which is specific to the algorithm.
- Parameters:
parsed_args – arguments parsed by the argparser
- Returns:
environment, policy, and (optional) algorithm-specific output, e.g. value function
- name: str = 'sac'
- reset(seed: Optional[int] = None)[source]
Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.
- Parameters:
seed – seed value for the random number generators, pass None for no seeding
- property sampler: ParallelRolloutSampler
Get the sampler. For algorithms with multiple samplers, this is the one collecting the training data.
- save_snapshot(meta_info: Optional[dict] = None)[source]
Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.
- Parameters:
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
svpg
- class OptimizerHook(particle)[source]
Bases:
object
This class partially mocks the optimizer interface to intercept the gradient updates of SVPG.
Constructor
- Parameters:
particle – blueprint algorithm in which the optimizer is replaced by the mocked one
- class SVPG(save_dir: PathLike, env: Env, particle: Algorithm, max_iter: int, num_particles: int, temperature: float, horizon: int, logger: Optional[StepLogger] = None)[source]
Bases:
Algorithm
Stein Variational Policy Gradient (SVPG)
See also
[1] Yang Liu, Prajit Ramachandran, Qiang Liu, Jian Peng, “Stein Variational Policy Gradient”, arXiv, 2017
Constructor
- Parameters:
save_dir – directory in which to save the snapshots, i.e. the results
env – the environment in which the policy operates
particle – the particle to populate with different parameters during training
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
num_particles – number of SVPG particles
temperature – SVPG temperature
horizon – horizon for each particle
logger – logger for every step of the algorithm, if None the default logger will be created
- property iter_particles
Iterate particles by sequentially loading and yielding them.
- kernel(X: Tensor) Tuple[Tensor, Tensor] [source]
Compute the RBF-kernel and the corresponding derivatives.
- Parameters:
X – the tensor to compute the kernel from
- Returns:
the kernel and its derivatives
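The RBF kernel and its derivative used for the Stein variational update (cf. [1]) can be sketched as follows. The helper name is hypothetical and the median-heuristic bandwidth is a common choice but an assumption here, not necessarily what Pyrado uses.
```python
import torch

def rbf_kernel_sketch(X: torch.Tensor):
    """RBF kernel matrix and its derivatives w.r.t. the particles; X has shape (num_particles, dim)."""
    sq_dists = torch.cdist(X, X).pow(2)                       # pairwise squared distances
    n = X.shape[0]
    h = sq_dists.median() / torch.log(torch.tensor(n + 1.0))  # median-heuristic bandwidth (an assumption)
    K = torch.exp(-sq_dists / h)
    # Row i of dK is (2 / h) * sum_j K_ij * (x_i - x_j), i.e. the gradient of the kernel sum w.r.t. x_i
    dK = (2.0 / h) * (K.sum(dim=1, keepdim=True) * X - K @ X)
    return K, dK
```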
- load_particle(idx: int)[source]
Load a specific particle’s state into self.particle.
- Parameters:
idx – index of the particle to load
- load_snapshot(parsed_args) Tuple[Env, Policy, dict] [source]
Load the state of an experiment, which is specific to the algorithm.
- Parameters:
parsed_args – arguments parsed by the argparser
- Returns:
environment, policy, and (optional) algorithm-specific output, e.g. value function
- name: str = 'svpg'
- save_snapshot(meta_info: Optional[dict] = None)[source]
Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.
- Parameters:
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
- step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]
Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.
- Parameters:
snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
value_based
- class ValueBased(save_dir: PathLike, env: Env, policy: Union[Policy, TwoHeadedPolicy], memory_size: int, gamma: float, max_iter: int, num_updates_per_step: int, target_update_intvl: int, num_init_memory_steps: int, min_rollouts: int, min_steps: int, batch_size: int, eval_intvl: int, max_grad_norm: float, num_workers: int, logger: StepLogger, use_trained_policy_for_refill: bool = False)[source]
Bases:
Algorithm, ABC
Base class of all value-based algorithms
Constructor
- Parameters:
save_dir – directory in which to save the snapshots, i.e. the results
env – the environment in which the policy operates
policy – policy to be updated
memory_size – number of transitions in the replay memory buffer, e.g. 1000000
gamma – temporal discount factor for the state values
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
num_updates_per_step – number of (batched) gradient updates per algorithm step
target_update_intvl – number of iterations that pass before updating the target network
num_init_memory_steps – number of samples used to initially fill the replay buffer; pass None to fill the buffer completely
min_rollouts – minimum number of rollouts sampled per policy update batch
min_steps – minimum number of state transitions sampled per policy update batch
batch_size – number of samples per policy update batch
eval_intvl – interval in which the evaluation rollouts are collected, also the interval in which the logger prints the summary statistics
max_grad_norm – maximum L2 norm of the gradients for clipping, set to None to disable gradient clipping
num_workers – number of environments for parallel sampling
logger – logger for every step of the algorithm, if None the default logger will be created
use_trained_policy_for_refill – whether to use the trained policy instead of a dummy policy to refill the replay buffer after resets
- property expl_strat: Union[SACExplStrat, EpsGreedyExplStrat]
Get the algorithm’s exploration strategy.
- property memory: ReplayMemory
Get the replay memory.
- reset(seed: Optional[int] = None)[source]
Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.
- Parameters:
seed – seed value for the random number generators, pass None for no seeding
- save_snapshot(meta_info: Optional[dict] = None)[source]
Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.
- Parameters:
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
- step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]
Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.
- Parameters:
snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm