step_based
a2c
- class A2C(save_dir: PathLike, env: Env, policy: Policy, critic: GAE, max_iter: int, min_rollouts: Optional[int] = None, min_steps: Optional[int] = None, vfcn_coeff: float = 0.5, entropy_coeff: float = 0.001, batch_size: int = 32, std_init: float = 1.0, max_grad_norm: Optional[float] = None, num_workers: int = 4, lr: float = 0.0005, lr_scheduler=None, lr_scheduler_hparam: Optional[dict] = None, logger: Optional[StepLogger] = None)[source]
Bases:
ActorCritic
Advantage Actor Critic (A2C)
Constructor
- Parameters:
save_dir – directory in which to save the snapshots, i.e. the results
env – the environment in which the policy operates
policy – policy to be updated
critic – advantage estimation function \(A(s,a) = Q(s,a) - V(s)\)
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
min_rollouts – minimum number of rollouts sampled per policy update batch
min_steps – minimum number of state transitions sampled per policy update batch
vfcn_coeff – weighting factor of the value function term in the combined loss
entropy_coeff – weighting factor of the entropy term in the combined loss
batch_size – number of samples per policy update batch
std_init – initial standard deviation on the actions for the exploration noise
max_grad_norm – maximum L2 norm of the gradients for clipping, set to None to disable gradient clipping
num_workers – number of environments for parallel sampling
lr – (initial) learning rate for the optimizer which can be modified by the scheduler. By default, the learning rate is constant.
lr_scheduler – learning rate scheduler that does one step per epoch (pass through the whole data set)
lr_scheduler_hparam – hyper-parameters for the learning rate scheduler
logger – logger for every step of the algorithm, if None the default logger will be created
- loss_fcn(log_probs: Tensor, adv: Tensor, v_pred: Tensor, v_targ: Tensor)[source]
A2C loss function
- Parameters:
log_probs – logarithm of the probabilities of the taken actions
adv – advantage values
v_pred – predicted value function values
v_targ – target value function values
- Returns:
combined loss value
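For orientation, the combined objective can be written out as a short sketch. This is illustrative only, not necessarily the exact Pyrado implementation: the helper name a2c_loss_sketch is hypothetical, and the entropy bonus (weighted by entropy_coeff) is omitted because it requires the action distribution, which is not an argument of loss_fcn.
```python
import torch
import torch.nn.functional as F

def a2c_loss_sketch(log_probs, adv, v_pred, v_targ, vfcn_coeff=0.5):
    """Illustrative A2C objective: policy-gradient term plus weighted value-function regression."""
    policy_loss = -(log_probs * adv.detach()).mean()  # maximize the advantage-weighted log-likelihood
    vfcn_loss = F.mse_loss(v_pred, v_targ)            # fit V(s) to the value targets
    return policy_loss + vfcn_coeff * vfcn_loss       # entropy bonus (entropy_coeff) omitted in this sketch
```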
- name: str = 'a2c'
- update(rollouts: Sequence[StepSequence])[source]
Update the actor and critic parameters from the given batch of rollouts.
- Parameters:
rollouts – batch of rollouts
actor_critic
- class ActorCritic(env: Env, actor: Policy, critic: GAE, save_dir: PathLike, max_iter: int, logger: Optional[StepLogger] = None)[source]
Bases:
Algorithm, ABC
Base class of all actor critic algorithms
Constructor
- Parameters:
env – the environment in which the policy operates
actor – policy taking the actions in the environment
critic – estimates the value of states (e.g. advantage or return)
save_dir – directory in which to save the snapshots, i.e. the results
max_iter – maximum number of iterations
logger – logger for every step of the algorithm, if None the default logger will be created
- property expl_strat: NormalActNoiseExplStrat
Get the algorithm’s exploration strategy.
- init_modules(warmstart: bool, suffix: str = '', prefix: Optional[str] = None, **kwargs)[source]
Initialize the algorithm’s learnable modules, e.g. a policy or value function. Overwrite this method if the algorithm uses a learnable module besides the policy, e.g. a value function.
- Parameters:
warmstart – if True, the algorithm starts learning with a non-random initialization. This can either be a fixed parameter vector or the loaded results of the previous iteration.
suffix – keyword for meta_info when loading from previous iteration
prefix – keyword for meta_info when loading from previous iteration
kwargs – keyword arguments for initialization, e.g. policy_param_init or valuefcn_param_init
- load_snapshot(parsed_args) Tuple[Env, Policy, dict] [source]
Load the state of an experiment, which is specific to the algorithm.
- Parameters:
parsed_args – arguments parsed by the argparser
- Returns:
environment, policy, and (optional) algorithm-specific output, e.g. value function
- reset(seed: Optional[int] = None)[source]
Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.
- Parameters:
seed – seed value for the random number generators, pass None for no seeding
- property sampler: ParallelRolloutSampler
Get the sampler. For algorithms with multiple samplers, this is the one collecting the training data.
- save_snapshot(meta_info: Optional[dict] = None)[source]
Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.
- Parameters:
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
- step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]
Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.
- Parameters:
snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
- abstract update(rollouts: Sequence[StepSequence])[source]
Update the actor and critic parameters from the given batch of rollouts.
- Parameters:
rollouts – batch of rollouts
dql
- class DQL(save_dir: PathLike, env: Env, policy: DiscreteActQValPolicy, memory_size: int, eps_init: float, eps_schedule_gamma: float, gamma: float, max_iter: int, num_updates_per_step: int, target_update_intvl: Optional[int] = 5, num_init_memory_steps: Optional[int] = None, min_rollouts: Optional[int] = None, min_steps: Optional[int] = None, batch_size: int = 256, eval_intvl: int = 100, max_grad_norm: float = 0.5, lr: float = 0.0005, lr_scheduler=None, lr_scheduler_hparam: Optional[dict] = None, num_workers: int = 4, logger: Optional[StepLogger] = None, use_trained_policy_for_refill: bool = False)[source]
Bases:
ValueBased
Deep Q-Learning (without bells and whistles)
See also
[1] V. Mnih et al., “Human-level control through deep reinforcement learning”, Nature, 2015
Constructor
- Parameters:
save_dir – directory in which to save the snapshots, i.e. the results
env – the environment in which the policy operates
policy – (current) Q-network updated by this algorithm
memory_size – number of transitions in the replay memory buffer
eps_init – initial value for the probability of taking a random action, constant if eps_schedule_gamma=1
eps_schedule_gamma – temporal discount factor for the exponential decay of epsilon, see the sketch after this parameter list
gamma – temporal discount factor for the state values
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
num_updates_per_step – number of (batched) updates per algorithm step
target_update_intvl – number of iterations that pass before updating the qfcn_targ network
num_init_memory_steps – number of samples used to initially fill the replay buffer; pass None to fill the buffer completely
min_rollouts – minimum number of rollouts sampled per policy update batch
min_steps – minimum number of state transitions sampled per policy update batch
batch_size – number of samples per policy update batch
eval_intvl – interval in which the evaluation rollouts are collected, also the interval in which the logger prints the summary statistics
max_grad_norm – maximum L2 norm of the gradients for clipping, set to None to disable gradient clipping
lr – (initial) learning rate for the optimizer which can be modified by the scheduler. By default, the learning rate is constant.
lr_scheduler – learning rate scheduler that does one step per epoch (pass through the whole data set)
lr_scheduler_hparam – hyper-parameters for the learning rate scheduler
num_workers – number of environments for parallel sampling
logger – logger for every step of the algorithm, if None the default logger will be created
use_trained_policy_for_refill – whether to use the trained policy instead of a dummy policy to refill the replay buffer after resets
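A minimal sketch of the epsilon schedule implied by eps_init and eps_schedule_gamma is given below. The helper name epsilon_at is hypothetical, and the assumption that the decay happens once per algorithm iteration may differ from the exact update step in Pyrado.
```python
def epsilon_at(iteration: int, eps_init: float, eps_schedule_gamma: float) -> float:
    """Exploration probability after the given number of decay steps (assumed: one per iteration)."""
    return eps_init * eps_schedule_gamma**iteration  # constant if eps_schedule_gamma == 1
```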
- init_modules(warmstart: bool, suffix: str = '', prefix: str = '', **kwargs)[source]
Initialize the algorithm’s learnable modules, e.g. a policy or value function. Overwrite this method if the algorithm uses a learnable module besides the policy, e.g. a value function.
- Parameters:
warmstart – if True, the algorithm starts learning with a non-random initialization. This can either be a fixed parameter vector or the loaded results of the previous iteration.
suffix – keyword for meta_info when loading from previous iteration
prefix – keyword for meta_info when loading from previous iteration
kwargs – keyword arguments for initialization, e.g. policy_param_init or valuefcn_param_init
- load_snapshot(parsed_args) Tuple[Env, Policy, dict] [source]
Load the state of an experiment, which is specific to the algorithm.
- Parameters:
parsed_args – arguments parsed by the argparser
- Returns:
environment, policy, and (optional) algorithm-specific output, e.g. value function
- static loss_fcn(q_vals: Tensor, expected_q_vals: Tensor) Tensor [source]
The Huber loss function on the one-step TD error \(\delta = Q(s,a) - (r + \gamma \max_a Q(s^\prime, a))\).
- Parameters:
q_vals – state-action values \(Q(s,a)\), from policy network
expected_q_vals – expected state-action values \(r + \gamma \max_a Q(s^\prime, a)\), from target network
- Returns:
loss value
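Since loss_fcn is documented as the Huber loss on the one-step TD error, the computation can be sketched with PyTorch’s built-in smooth L1 loss. The trailing comment about how expected_q_vals would be formed from the target network is an assumption about the surrounding update code, not part of this static method.
```python
import torch
import torch.nn.functional as F

def huber_td_loss_sketch(q_vals: torch.Tensor, expected_q_vals: torch.Tensor) -> torch.Tensor:
    """Huber loss on the TD error delta = Q(s,a) - (r + gamma * max_a' Q_targ(s', a'))."""
    return F.smooth_l1_loss(q_vals, expected_q_vals)

# Assumed context for the targets (computed with the frozen target network), e.g.:
# expected_q_vals = rewards + gamma * (1 - dones) * q_targ(next_obs).max(dim=1).values
```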
- name: str = 'dql'
- reset(seed: Optional[int] = None)[source]
Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.
- Parameters:
seed – seed value for the random number generators, pass None for no seeding
- property sampler: ParallelRolloutSampler
Get the sampler. For algorithms with multiple samplers, this is the one collecting the training data.
- save_snapshot(meta_info: Optional[dict] = None)[source]
Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.
- Parameters:
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
gae
- class GAE(vfcn: Union[Module, Policy], gamma: float = 0.99, lamda: float = 0.95, num_epoch: int = 10, batch_size: int = 64, standardize_adv: bool = True, standardizer: Optional[RunningStandardizer] = None, max_grad_norm: Optional[float] = None, lr: float = 0.0005, lr_scheduler=None, lr_scheduler_hparam: Optional[dict] = None)[source]
Bases:
LoggerAware, Module
General Advantage Estimation (GAE)
See also
[1] J. Schulman, P. Moritz, S. Levine, M. Jordan, P. Abbeel, ‘High-Dimensional Continuous Control Using Generalized Advantage Estimation’, ICLR 2016
Constructor
- Parameters:
vfcn – value function, which can be a FNN or a Policy
gamma – temporal discount factor
lamda – regulates the trade-off between bias (max for 0) and variance (max for 1), see [1]
num_epoch – number of iterations over all gathered samples during one estimator update
batch_size – number of samples per estimator update batch
standardize_adv – if True, the advantages are standardized to be \(\sim N(0,1)\)
standardizer – pass None to use stateless standardization, alternatively pass RunningStandardizer() to use a standardizer which keeps track of past values
max_grad_norm – maximum L2 norm of the gradients for clipping, set to None to disable gradient clipping
lr – (initial) learning rate for the optimizer which can be modified by the scheduler. By default, the learning rate is constant.
lr_scheduler – learning rate scheduler that does one step per epoch (pass through the whole data set)
lr_scheduler_hparam – hyper-parameters for the learning rate scheduler
- gae(concat_ros: StepSequence, v_pred: Optional[Tensor] = None, requires_grad: bool = False) Tensor [source]
Compute the generalized advantage estimation as described in [1].
- Parameters:
concat_ros – concatenated rollouts (sequence of steps from potentially different rollouts)
v_pred – state-value predictions if already computed, else pass None
requires_grad – whether the gradient is required
- Return adv:
tensor of advantages
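The recursion from [1] behind gae() can be written out as a from-scratch sketch for a single rollout; it is illustrative and not the library routine, and the assumed shapes (rewards of length T, v_pred of length T + 1 including a bootstrap value for the final state) are conventions of this sketch rather than the signature of GAE.gae().
```python
import torch

def gae_advantages_sketch(rewards: torch.Tensor, v_pred: torch.Tensor,
                          gamma: float = 0.99, lamda: float = 0.95) -> torch.Tensor:
    """Generalized advantage estimation for one rollout (rewards: (T,), v_pred: (T + 1,))."""
    T = rewards.numel()
    adv = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * v_pred[t + 1] - v_pred[t]  # one-step TD error
        gae = delta + gamma * lamda * gae                       # exponentially weighted sum of TD errors
        adv[t] = gae
    return adv

# The TD(lambda) returns used as value-function targets then follow as adv + v_pred[:-1].
```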
- reset()[source]
Reset the advantage estimator to its initial state. The default implementation resets the learning rate scheduler if there is one.
- tdlamda_returns(v_pred: Optional[Tensor] = None, adv: Optional[Tensor] = None, concat_ros: Optional[StepSequence] = None) Tensor [source]
Compute the TD(\(\lambda\)) returns based on the predictions of the network (introduces a bias).
- Parameters:
v_pred – state-value predictions if already computed, pass None to compute them from the given rollouts
adv – advantages if already computed, pass None to compute them from the given rollouts
concat_ros – rollouts to compute predicted values and advantages from if they are not provided
- Returns:
exponentially weighted returns based on the value function estimator
- update(rollouts: Sequence[StepSequence], use_empirical_returns: bool = False)[source]
Adapt the parameters of the advantage function estimator, minimizing the MSE loss for the given samples.
- Parameters:
rollouts – batch of rollouts
use_empirical_returns – use the returns from the rollouts (True) or the ones from the value function (False)
- Return adv:
tensor of advantages after V-function updates
- values(concat_ros: StepSequence) Tensor [source]
Compute the states’ values for all observations.
- Parameters:
concat_ros – concatenated rollouts
- Returns:
states’ values
ppo
- class PPO(save_dir: PathLike, env: Env, policy: Policy, critic: GAE, max_iter: int, min_rollouts: Optional[int] = None, min_steps: Optional[int] = None, num_epoch: int = 3, eps_clip: float = 0.1, batch_size: int = 64, std_init: float = 1.0, num_workers: int = 4, max_grad_norm: Optional[float] = None, lr: float = 0.0005, lr_scheduler=None, lr_scheduler_hparam: Optional[dict] = None, logger: Optional[StepLogger] = None)[source]
Bases:
ActorCritic
Proximal Policy Optimization (PPO)
See also
[1] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, “Proximal Policy Optimization Algorithms”, arXiv, 2017
[2] D.P. Kingma, J. Ba, “Adam: A Method for Stochastic Optimization”, ICLR, 2015
Constructor
- Parameters:
save_dir – directory in which to save the snapshots, i.e. the results
env – the environment in which the policy operates
policy – policy to be updated
critic – advantage estimation function \(A(s,a) = Q(s,a) - V(s)\)
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
min_rollouts – minimum number of rollouts sampled per policy update batch
min_steps – minimum number of state transitions sampled per policy update batch
num_epoch – number of iterations over all gathered samples during one policy update
eps_clip – max/min probability ratio, see [1]
batch_size – number of samples per policy update batch
std_init – initial standard deviation on the actions for the exploration noise
num_workers – number of environments for parallel sampling
max_grad_norm – maximum L2 norm of the gradients for clipping, set to None to disable gradient clipping
lr – (initial) learning rate for the optimizer which can be modified by the scheduler. By default, the learning rate is constant.
lr_scheduler – learning rate scheduler that does one step per epoch (pass through the whole data set)
lr_scheduler_hparam – hyper-parameters for the learning rate scheduler
logger – logger for every step of the algorithm, if None the default logger will be created
Note
The Adam optimizer computes individual learning rates for all parameters. Thus, the learning rate scheduler schedules the maximum learning rate.
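A minimal construction sketch tying PPO to its GAE critic is given below, using only arguments documented on this page. The objects env, policy, and vfcn are placeholders assumed to have been created elsewhere, the save directory is hypothetical, the PPO import path is inferred from this module’s heading, and the train() entry point is assumed to be inherited from the Algorithm base class (it is not documented in this section).
```python
from pyrado.algorithms.step_based.gae import GAE  # path taken from the critic type annotation above
from pyrado.algorithms.step_based.ppo import PPO  # path inferred from the "ppo" heading of this module

# Placeholders: `env` (a pyrado Env), `policy` and `vfcn` (pyrado Policy instances) are assumed to
# exist already; their construction depends on the chosen environment and is omitted here.
critic = GAE(vfcn=vfcn, gamma=0.99, lamda=0.95, num_epoch=10, batch_size=64)
algo = PPO(
    save_dir="experiments/ppo_demo",  # hypothetical directory
    env=env,
    policy=policy,
    critic=critic,
    max_iter=300,
    min_steps=10_000,  # sample at least 10k transitions per policy update
    num_epoch=5,
    eps_clip=0.1,
    batch_size=64,
    std_init=1.0,
    num_workers=4,
    lr=5e-4,
)
# Assumed entry point; "latest" is taken to mean storing a snapshot on every iteration (cf. step()).
algo.train(snapshot_mode="latest")
```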
- property expl_strat: NormalActNoiseExplStrat
Get the algorithm’s exploration strategy.
- loss_fcn(log_probs: Tensor, log_probs_old: Tensor, adv: Tensor) Tensor [source]
PPO loss function
- Parameters:
log_probs – logarithm of the probabilities of the taken actions using the updated policy
log_probs_old – logarithm of the probabilities of the taken actions using the old policy
adv – advantage values
- Returns:
loss value
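The clipped surrogate objective from [1] behind loss_fcn can be sketched as follows; this is an illustration of the standard PPO loss, not necessarily byte-for-byte what Pyrado computes, and the helper name is hypothetical.
```python
import torch

def ppo_clip_loss_sketch(log_probs: torch.Tensor, log_probs_old: torch.Tensor,
                         adv: torch.Tensor, eps_clip: float = 0.1) -> torch.Tensor:
    """Negative clipped surrogate objective, averaged over the batch."""
    ratio = torch.exp(log_probs - log_probs_old.detach())  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip) * adv
    return -torch.min(unclipped, clipped).mean()
```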
- name: str = 'ppo'
- update(rollouts: Sequence[StepSequence])[source]
Update the actor and critic parameters from the given batch of rollouts.
- Parameters:
rollouts – batch of rollouts
- class PPO2(save_dir: PathLike, env: Env, policy: Policy, critic: GAE, max_iter: int, min_rollouts: Optional[int] = None, min_steps: Optional[int] = None, num_epoch: int = 3, eps_clip: float = 0.1, vfcn_coeff: float = 0.5, entropy_coeff: float = 0.001, batch_size: int = 32, std_init: float = 1.0, num_workers: int = 4, max_grad_norm: Optional[float] = None, lr: float = 0.0005, lr_scheduler=None, lr_scheduler_hparam: Optional[dict] = None, logger: Optional[StepLogger] = None)[source]
Bases:
ActorCritic
Variant of Proximal Policy Optimization (PPO). PPO2 differs from PPO by also clipping the value function and standardizing the advantages.
Note
PPO2 refers to the OpenAI version of PPO which is a GPU-enabled implementation. However, this one is not!
See also
[1] OpenAI Stable Baselines Documentation, https://stable-baselines.readthedocs.io/en/master/modules/ppo2.html
[2] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, “Proximal Policy Optimization Algorithms”, arXiv, 2017
[3] D.P. Kingma, J. Ba, “Adam: A Method for Stochastic Optimization”, ICLR, 2015
Constructor
- Parameters:
save_dir – directory in which to save the snapshots, i.e. the results
env – the environment in which the policy operates
policy – policy to be updated
critic – advantage estimation function \(A(s,a) = Q(s,a) - V(s)\)
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
min_rollouts – minimum number of rollouts sampled per policy update batch
min_steps – minimum number of state transitions sampled per policy update batch
num_epoch – number of iterations over all gathered samples during one policy update
eps_clip – max/min probability ratio, see [1]
vfcn_coeff – weighting factor of the value function term in the combined loss, specific to PPO2
entropy_coeff – weighting factor of the entropy term in the combined loss, specific to PPO2
batch_size – number of samples per policy update batch
std_init – initial standard deviation on the actions for the exploration noise
num_workers – number of environments for parallel sampling
max_grad_norm – maximum L2 norm of the gradients for clipping, set to None to disable gradient clipping
lr – (initial) learning rate for the optimizer which can be modified by the scheduler. By default, the learning rate is constant.
lr_scheduler – learning rate scheduler that does one step per epoch (pass through the whole data set)
lr_scheduler_hparam – hyper-parameters for the learning rate scheduler
logger – logger for every step of the algorithm, if None the default logger will be created
Note
The Adam optimizer computes individual learning rates for all parameters. Thus, the learning rate scheduler schedules the maximum learning rate.
- property expl_strat: NormalActNoiseExplStrat
Get the algorithm’s exploration strategy.
- loss_fcn(log_probs: Tensor, log_probs_old: Tensor, adv: Tensor, v_pred: Tensor, v_pred_old: Tensor, v_targ: Tensor) Tensor [source]
PPO2 loss function
- Parameters:
log_probs – logarithm of the probabilities of the taken actions using the updated policy
log_probs_old – logarithm of the probabilities of the taken actions using the old policy
adv – advantage values
v_pred – predicted value function values
v_pred_old – predicted value function values using the old value function
v_targ – target value function values
- Returns:
combined loss value
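As a companion to the PPO sketch above, the PPO2 variant additionally clips the value-function update around the old prediction. The following illustrates the combined loss described here; the entropy term weighted by entropy_coeff is omitted because it needs the action distribution, which is not an argument of loss_fcn, and the clipping of the value error with eps_clip is an assumption borrowed from the OpenAI-style implementation in [1].
```python
import torch

def ppo2_loss_sketch(log_probs, log_probs_old, adv, v_pred, v_pred_old, v_targ,
                     eps_clip: float = 0.1, vfcn_coeff: float = 0.5) -> torch.Tensor:
    """Clipped surrogate loss plus clipped value-function loss (entropy bonus omitted)."""
    ratio = torch.exp(log_probs - log_probs_old.detach())
    policy_loss = -torch.min(ratio * adv,
                             torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip) * adv).mean()
    # Clip the value prediction around the old one and take the larger of the two squared errors
    v_clipped = v_pred_old + torch.clamp(v_pred - v_pred_old, -eps_clip, eps_clip)
    vfcn_loss = torch.max((v_pred - v_targ).pow(2), (v_clipped - v_targ).pow(2)).mean()
    return policy_loss + vfcn_coeff * vfcn_loss
```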
- name: str = 'ppo2'
- update(rollouts: Sequence[StepSequence])[source]
Update the actor and critic parameters from the given batch of rollouts.
- Parameters:
rollouts – batch of rollouts
sac
- class SAC(save_dir: PathLike, env: Env, policy: TwoHeadedPolicy, qfcn_1: Policy, qfcn_2: Policy, memory_size: int, gamma: float, max_iter: int, num_updates_per_step: Optional[int] = None, tau: float = 0.995, ent_coeff_init: float = 0.2, learn_ent_coeff: bool = True, target_update_intvl: int = 1, num_init_memory_steps: Optional[int] = None, standardize_rew: bool = True, rew_scale: Union[int, float] = 1.0, min_rollouts: Optional[int] = None, min_steps: Optional[int] = None, batch_size: Optional[int] = 256, eval_intvl: int = 100, max_grad_norm: float = 5.0, lr: float = 0.0003, lr_scheduler=None, lr_scheduler_hparam: Optional[dict] = None, num_workers: int = 4, logger: Optional[StepLogger] = None, use_trained_policy_for_refill: bool = False)[source]
Bases:
ValueBased
Soft Actor-Critic (SAC) variant with a stochastic policy, two Q-functions, and two Q-targets (no V-function)
See also
[1] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”, ICML, 2018
[2] This implementation was inspired by https://github.com/pranz24/pytorch-soft-actor-critic, which seems to be based on https://github.com/vitchyr/rlkit
[3] This implementation also borrows (at least a bit) from https://github.com/DLR-RM/stable-baselines3/tree/master/stable_baselines3/sac
Note
The update order of the policy, the Q-functions (and the entropy coefficient) is different in almost every implementation out there. Here, we follow the one from [3].
Constructor
- Parameters:
save_dir – directory in which to save the snapshots, i.e. the results
env – the environment in which the policy operates
policy – policy to be updated
qfcn_1 – state-action value function \(Q(s,a)\), the associated target Q-function is created from a re-initialized copy of this one
qfcn_2 – state-action value function \(Q(s,a)\), the associated target Q-function is created from a re-initialized copy of this one
memory_size – number of transitions in the replay memory buffer, e.g. 1000000
gamma – temporal discount factor for the state values
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
num_updates_per_step – number of (batched) gradient updates per algorithm step
tau – interpolation factor for averaging the target networks, used for the soft a.k.a. Polyak update, between 0 and 1; see the sketch after this parameter list
ent_coeff_init – initial weighting factor of the entropy term in the loss function
learn_ent_coeff – adapt the weighting factor of the entropy term
target_update_intvl – number of iterations that pass before updating the target network
num_init_memory_steps – number of samples used to initially fill the replay buffer; pass None to fill the buffer completely
standardize_rew – if True, the rewards are standardized to be \(\sim N(0,1)\)
rew_scale – scaling factor for the rewards, defaults to no scaling
min_rollouts – minimum number of rollouts sampled per policy update batch
min_steps – minimum number of state transitions sampled per policy update batch
batch_size – number of samples per policy update batch
eval_intvl – interval in which the evaluation rollouts are collected, also the interval in which the logger prints the summary statistics
max_grad_norm – maximum L2 norm of the gradients for clipping, set to None to disable gradient clipping
lr – (initial) learning rate for the optimizer which can be modified by the scheduler. By default, the learning rate is constant.
lr_scheduler – learning rate scheduler type for the policy and the Q-functions that does one step per update() call
lr_scheduler_hparam – hyper-parameters for the learning rate scheduler
num_workers – number of environments for parallel sampling
logger – logger for every step of the algorithm, if None the default logger will be created
use_trained_policy_for_refill – whether to use the trained policy instead of a dummy policy to refill the replay buffer after resets
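The soft (Polyak) target update controlled by tau can be sketched as below. With the convention implied by the default tau = 0.995, the target parameters move only slightly towards the live network on each update; note that some libraries swap the roles of tau and 1 - tau. This is an illustrative helper, not the library routine.
```python
import torch

@torch.no_grad()
def polyak_update_sketch(target_net: torch.nn.Module, live_net: torch.nn.Module, tau: float = 0.995):
    """theta_targ <- tau * theta_targ + (1 - tau) * theta, applied parameter-wise."""
    for p_targ, p_live in zip(target_net.parameters(), live_net.parameters()):
        p_targ.mul_(tau).add_((1.0 - tau) * p_live)
```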
- property ent_coeff: Tensor
Get the detached entropy coefficient.
- init_modules(warmstart: bool, suffix: str = '', prefix: Optional[str] = None, **kwargs)[source]
Initialize the algorithm’s learnable modules, e.g. a policy or value function. Overwrite this method if the algorithm uses a learnable module besides the policy, e.g. a value function.
- Parameters:
warmstart – if True, the algorithm starts learning with a non-random initialization. This can either be a fixed parameter vector or the loaded results of the previous iteration.
suffix – keyword for meta_info when loading from previous iteration
prefix – keyword for meta_info when loading from previous iteration
kwargs – keyword arguments for initialization, e.g. policy_param_init or valuefcn_param_init
- load_snapshot(parsed_args) Tuple[Env, Policy, dict] [source]
Load the state of an experiment, which is specific to the algorithm.
- Parameters:
parsed_args – arguments parsed by the argparser
- Returns:
environment, policy, and (optional) algorithm-specific output, e.g. value function
- name: str = 'sac'
- reset(seed: Optional[int] = None)[source]
Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.
- Parameters:
seed – seed value for the random number generators, pass None for no seeding
- property sampler: ParallelRolloutSampler
Get the sampler. For algorithms with multiple samplers, this is the one collecting the training data.
- save_snapshot(meta_info: Optional[dict] = None)[source]
Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.
- Parameters:
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
svpg
- class OptimizerHook(particle)[source]
Bases:
object
This class partially mocks the optimizer interface to intercept the gradient updates of SVPG.
Constructor
- Parameters:
particle – blueprint algorithm in which the optimizer is replaced by the mocked one
- class SVPG(save_dir: PathLike, env: Env, particle: Algorithm, max_iter: int, num_particles: int, temperature: float, horizon: int, logger: Optional[StepLogger] = None)[source]
Bases:
Algorithm
Stein Variational Policy Gradient (SVPG)
See also
[1] Yang Liu, Prajit Ramachandran, Qiang Liu, Jian Peng, “Stein Variational Policy Gradient”, arXiv, 2017
Constructor
- Parameters:
save_dir – directory in which to save the snapshots, i.e. the results
env – the environment in which the policy operates
particle – the particle to populate with different parameters during training
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
num_particles – number of SVPG particles
temperature – SVPG temperature
horizon – horizon for each particle
logger – logger for every step of the algorithm, if None the default logger will be created
- property iter_particles
Iterate particles by sequentially loading and yielding them.
- kernel(X: Tensor) Tuple[Tensor, Tensor] [source]
Compute the RBF-kernel and the corresponding derivatives.
- Parameters:
X – the tensor to compute the kernel from
- Returns:
the kernel and its derivatives
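The RBF kernel and its derivative used for the Stein variational update (cf. [1]) can be sketched as follows. The helper name is hypothetical and the median-heuristic bandwidth is a common choice but an assumption here, not necessarily what Pyrado uses.
```python
import torch

def rbf_kernel_sketch(X: torch.Tensor):
    """RBF kernel matrix and its derivatives w.r.t. the particles; X has shape (num_particles, dim)."""
    sq_dists = torch.cdist(X, X).pow(2)                       # pairwise squared distances
    n = X.shape[0]
    h = sq_dists.median() / torch.log(torch.tensor(n + 1.0))  # median-heuristic bandwidth (an assumption)
    K = torch.exp(-sq_dists / h)
    # Row i of dK is (2 / h) * sum_j K_ij * (x_i - x_j), i.e. the gradient of the kernel sum w.r.t. x_i
    dK = (2.0 / h) * (K.sum(dim=1, keepdim=True) * X - K @ X)
    return K, dK
```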
- load_particle(idx: int)[source]
Load a specific particle’s state into self.particle.
- Parameters:
idx – index of the particle to load
- load_snapshot(parsed_args) Tuple[Env, Policy, dict] [source]
Load the state of an experiment, which is specific to the algorithm.
- Parameters:
parsed_args – arguments parsed by the argparser
- Returns:
environment, policy, and (optional) algorithm-specific output, e.g. value function
- name: str = 'svpg'
- save_snapshot(meta_info: Optional[dict] = None)[source]
Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.
- Parameters:
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
- step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]
Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.
- Parameters:
snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
value_based
- class ValueBased(save_dir: PathLike, env: Env, policy: Union[Policy, TwoHeadedPolicy], memory_size: int, gamma: float, max_iter: int, num_updates_per_step: int, target_update_intvl: int, num_init_memory_steps: int, min_rollouts: int, min_steps: int, batch_size: int, eval_intvl: int, max_grad_norm: float, num_workers: int, logger: StepLogger, use_trained_policy_for_refill: bool = False)[source]
Bases:
Algorithm, ABC
Base class of all value-based algorithms
Constructor
- Parameters:
save_dir – directory in which to save the snapshots, i.e. the results
env – the environment in which the policy operates
policy – policy to be updated
memory_size – number of transitions in the replay memory buffer, e.g. 1000000
gamma – temporal discount factor for the state values
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
num_updates_per_step – number of (batched) gradient updates per algorithm step
target_update_intvl – number of iterations that pass before updating the target network
num_init_memory_steps – number of samples used to initially fill the replay buffer; pass None to fill the buffer completely
min_rollouts – minimum number of rollouts sampled per policy update batch
min_steps – minimum number of state transitions sampled per policy update batch
batch_size – number of samples per policy update batch
eval_intvl – interval in which the evaluation rollouts are collected, also the interval in which the logger prints the summary statistics
max_grad_norm – maximum L2 norm of the gradients for clipping, set to None to disable gradient clipping
num_workers – number of environments for parallel sampling
logger – logger for every step of the algorithm, if None the default logger will be created
use_trained_policy_for_refill – whether to use the trained policy instead of a dummy policy to refill the replay buffer after resets
- property expl_strat: Union[SACExplStrat, EpsGreedyExplStrat]
Get the algorithm’s exploration strategy.
- property memory: ReplayMemory
Get the replay memory.
- reset(seed: Optional[int] = None)[source]
Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.
- Parameters:
seed – seed value for the random number generators, pass None for no seeding
- save_snapshot(meta_info: Optional[dict] = None)[source]
Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.
- Parameters:
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
- step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]
Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.
- Parameters:
snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm