algorithms

base

class Algorithm(save_dir: PathLike, max_iter: int, policy: Optional[Policy], logger: Optional[StepLogger] = None, save_name: str = 'algo')[source]

Bases: ABC, LoggerAware

Base class of all algorithms in Pyrado. Algorithms specify how the policy is updated as well as the exploration strategy used to acquire samples.

Constructor

Parameters:
  • save_dir – directory to save the snapshots, i.e. the results, in

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • policy – Pyrado policy (subclass of PyTorch’s Module) to train

  • logger – logger for every step of the algorithm, if None the default logger will be created

  • save_name – name of the algorithm’s pickle file without the file ending; this becomes important if the algorithm is run as a subroutine

static clip_grad(module: Module, max_grad_norm: Optional[float]) float[source]

Clip all gradients of the provided Module (e.g., a policy or an advantage estimator) by their L2 norm value.

Note

The gradient clipping has to be applied between loss.backward() and optimizer.step()

Parameters:
  • module – Module containing parameters

  • max_grad_norm – maximum L2 norm for the gradient

Returns:

total norm of the parameters (viewed as a single vector)
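
Example (a minimal sketch of where clip_grad() fits in an update step; the linear model and dummy loss are placeholders, and the import path is an assumption):

import torch
import torch.nn as nn

from pyrado.algorithms.base import Algorithm  # assumed import path

model = nn.Linear(4, 2)  # stand-in for a policy or advantage estimator
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loss = model(torch.randn(8, 4)).pow(2).mean()  # dummy loss, for illustration only
optimizer.zero_grad()
loss.backward()
# clip between loss.backward() and optimizer.step(); the return value is the
# total L2 norm of the module's parameter gradients
total_norm = Algorithm.clip_grad(model, max_grad_norm=5.0)
optimizer.step()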

property curr_iter: int

Get the current iteration counter.

property expl_strat: Optional[Union[StochasticActionExplStrat, StochasticParamExplStrat]]

Get the algorithm’s exploration strategy.

init_modules(warmstart: bool, suffix: str = '', prefix: Optional[str] = None, **kwargs)[source]

Initialize the algorithm’s learnable modules, e.g. a policy or value function. Overwrite this method if the algorithm uses a learnable module besides the policy, e.g. a value function.

Parameters:
  • warmstart – if True, the algorithm starts learning with a non-random initialization. This can either be a fixed parameter vector or the loaded results of the previous iteration.

  • suffix – keyword for meta_info when loading from previous iteration

  • prefix – keyword for meta_info when loading from previous iteration

  • kwargs – keyword arguments for initialization, e.g. policy_param_init or valuefcn_param_init

iteration_key: str = 'iteration'
load_snapshot(parsed_args) Tuple[Env, Policy, dict][source]

Load the state of an experiment, which is specific to the algorithm.

Parameters:

parsed_args – arguments parsed by the argparser

Returns:

environment, policy, and (optional) algorithm-specific output, e.g. value function

make_snapshot(snapshot_mode: str, curr_avg_ret: Optional[float] = None, meta_info: Optional[dict] = None)[source]

Make a snapshot of the training progress. This method is called from the subclasses and delegates to the custom method save_snapshot().

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • curr_avg_ret – current average return used for the snapshot_mode ‘best’ to trigger save_snapshot()

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

property max_iter: int

Get the maximum number of iterations.

name: str = None
property policy: Policy

Get the algorithm’s policy.

reset(seed: Optional[int] = None)[source]

Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.

Parameters:

seed – seed value for the random number generators, pass None for no seeding

property sample_count: int

Get the total number of samples, i.e. steps of a rollout, used for training so far.

property sampler: Optional[SamplerBase]

Get the sampler. For algorithms with multiple samplers, this is the one collecting the training data.

property save_dir: str

Get the directory where the data is saved to.

property save_name: str

Get the name for saving this algorithm instance, e.g. ‘algo’ if saved to ‘algo.pkl’.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

abstract step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
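
Example (a minimal sketch of the step() contract in a hypothetical subclass; the sampler and the update logic are assumed to be set up in the subclass constructor, and the import path is an assumption):

from typing import Optional

from pyrado.algorithms.base import Algorithm  # assumed import path

class MyAlgo(Algorithm):
    """Hypothetical subclass, only to illustrate how step() is typically structured."""

    name: str = "myalgo"

    def step(self, snapshot_mode: str, meta_info: Optional[dict] = None):
        # 1) acquire samples with the sampler / exploration strategy
        #    (assumed to have been created in the subclass constructor)
        rollouts = self.sampler.sample()

        # 2) update the learnable modules based on the collected data
        self.update(rollouts)

        # 3) store a snapshot; curr_avg_ret would be required for snapshot_mode 'best'
        self.make_snapshot(snapshot_mode, curr_avg_ret=None, meta_info=meta_info)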

property stopping_criterion: StoppingCriterion

Get the stopping criterion.

stopping_criterion_met() bool[source]

Check if the stopping criterion is met.

Note

We need this method because the stopping criterion’s is_met(algo) method requires an instance of the algorithm. If we simply used the is_met(algo) function of the exposed property, we would have to pass the algorithm instance manually. Moreover, if we changed is_met(algo) to not require the algorithm, initializing the latter would require the former and vice versa, which would be a circular dependency.

Returns:

True if the stopping criterion is met, see also StoppingCriterion.is_met(algo)

train(snapshot_mode: str = 'latest', seed: Optional[int] = None, meta_info: Optional[dict] = None)[source]

Train one/multiple policy/policies in a given environment.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new high-score)

  • seed – seed value for the random number generators, pass None for no seeding

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
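
Example (sketch, assuming algo is an instance of a concrete Algorithm subclass that was already constructed with its environment, policy, and save directory):

algo.train(snapshot_mode="best", seed=1001)

# snapshots are written to algo.save_dir; progress can be inspected afterwards
print(algo.curr_iter, algo.max_iter, algo.stopping_criterion_met())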

update(*args: Any, **kwargs: Any)[source]

Update the policy’s (and value functions’) parameters based on the collected rollout data.

class InterruptableAlgorithm(num_checkpoints: int, init_checkpoint: int = 0, *args, **kwargs)[source]

Bases: Algorithm, ABC

A simple checkpoint system to keep track of the algorithm’s progress. The cyclic counter starts at init_checkpoint, counts up to (and including) num_checkpoints, and is then reset to zero.

Constructor

Parameters:
  • num_checkpoints – total number of checkpoints

  • init_checkpoint – initial value of the cyclic counter, defaults to 0; negative values can be used to mark sections that should only be executed once

  • args – positional arguments forwarded to Algorithm’s constructor

  • kwargs – keyword arguments forwarded to Algorithm’s constructor

property curr_checkpoint: int

Get the current checkpoint counter.

reached_checkpoint(meta_info: Optional[dict] = None)[source]

Increase the cyclic counter by 1. When the counter reaches the maximum number of checkpoints, defined in the constructor, it is automatically reset to zero. This method also saves the algorithm instance using save_snapshot(), since otherwise increasing the checkpoint counter would have no effect.

Parameters:

meta_info – information forwarded to save_snapshot()
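
Example (sketch of how the cyclic counter is typically used inside a subclass’s step(); the phase bodies are placeholders, the instance is assumed to have been constructed with a negative init_checkpoint to mark a one-time section, and the import path is an assumption):

from typing import Optional

from pyrado.algorithms.base import InterruptableAlgorithm  # assumed import path

class MyCheckpointedAlgo(InterruptableAlgorithm):
    """Hypothetical subclass, only to illustrate the checkpoint counter pattern."""

    def step(self, snapshot_mode: str, meta_info: Optional[dict] = None):
        if self.curr_checkpoint == -1:
            ...  # hypothetical one-time section (enabled by a negative init_checkpoint)
            self.reached_checkpoint(meta_info)  # also saves a snapshot
        if self.curr_checkpoint == 0:
            ...  # hypothetical data-collection phase
            self.reached_checkpoint(meta_info)
        if self.curr_checkpoint == 1:
            ...  # hypothetical update phase
            self.reached_checkpoint(meta_info)  # counter eventually wraps back to zero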

reset(seed: Optional[int] = None)[source]

Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.

Parameters:

seed – seed value for the random number generators, pass None for no seeding

reset_checkpoint(curr: int = 0)[source]

Explicitly reset the cyclic counter.

Parameters:

curr – value to set the counter to, defaults to 0

utils

class ActionStatistics(act_distr: Distribution, log_probs: Tensor, entropy: Tensor)[source]

Bases: tuple

act_distr: probability distribution at the given policy output values
log_probs: \(\log (p(act|obs, hidden))\) if hidden exists, else \(\log (p(act|obs))\)
entropy: entropy of the action distribution

Create new instance of ActionStatistics(act_distr, log_probs, entropy)

property act_distr

Alias for field number 0

property entropy

Alias for field number 2

property log_probs

Alias for field number 1

class ReplayMemory(capacity: int)[source]

Bases: object

Base class for storing step transitions

Constructor

Parameters:

capacity – number of steps a.k.a. transitions in the memory

avg_reward() float[source]

Compute the average reward for all steps stored in the replay memory.

Returns:

average reward

property isempty: bool

Check if the replay buffer is empty.

property memory: StepSequence

Get the replay buffer.

push(ros: Union[list, StepSequence], truncate_last: bool = True)[source]

Save a sequence of steps and drop steps if the capacity is exceeded.

Parameters:
  • ros – list of rollouts or one concatenated rollout

  • truncate_last – remove the last step from each rollout, forwarded to StepSequence.concat

reset()[source]
sample(batch_size: int) tuple[source]

Sample randomly from the replay memory.

Parameters:

batch_size – number of samples

Returns:

tuple of transition steps and associated next steps
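
Example (sketch; rollouts stands for a list of StepSequence objects, e.g. returned by a sampler, and the import path is an assumption):

from pyrado.algorithms.utils import ReplayMemory  # assumed import path

memory = ReplayMemory(capacity=100000)
memory.push(rollouts)  # rollouts: list of StepSequence objects from a sampler

if not memory.isempty:
    steps, next_steps = memory.sample(batch_size=256)
    print(memory.avg_reward())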

class RolloutSavingWrapper(wrapped_sampler: SamplerBase, rollouts: List[List[StepSequence]] = <factory>)[source]

Bases: object

A wrapper for SamplerBase objects where calls to SamplerBase.sample() are intercepted and the results stored in this wrapper before they are returned to the caller.

Usage:

ros = RolloutSavingWrapper(subroutine.sampler)
subroutine.sampler = ros

reset_rollouts() None[source]

Resets the internal rollout variable. Intended to be called before save_snapshot() in order to reduce the serialized object’s size.

rollouts: List[List[StepSequence]]
sample(*args, **kwargs) List[StepSequence][source]

Like SamplerBase.sample() but keeps a copy of all returned values.

wrapped_sampler: SamplerBase
compute_action_statistics(steps: StepSequence, expl_strat: StochasticActionExplStrat) ActionStatistics[source]

Get the action distribution from the exploration strategy, and compute the log action probabilities and entropy for the given rollout.

Note

Requires the exploration strategy to have a (most likely custom) evaluate() method.

Parameters:
  • steps – recorded rollout data

  • expl_strat – exploration strategy used to generate the data

Returns:

collected action statistics, see ActionStatistics
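
Example (sketch; rollout and expl_strat are assumed to come from the surrounding algorithm, and the import path is an assumption):

from pyrado.algorithms.utils import compute_action_statistics  # assumed import path

act_stats = compute_action_statistics(rollout, expl_strat)

log_probs = act_stats.log_probs          # log-probabilities of the recorded actions
mean_entropy = act_stats.entropy.mean()  # average entropy of the action distribution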

get_grad_via_torch(x_np: ndarray, fcn_to: Callable, *args_to, **kwargs_to) ndarray[source]

Get the gradient of a function operating on PyTorch tensors, by casting the input x_np as well as the resulting gradient to PyTorch.

Parameters:
  • x_np – input vector \(x\)

  • fcn_to – function \(f(x, \cdot)\)

  • args_to – other arguments to the function

  • kwargs_to – other keyword arguments to the function

Returns:

\(\nabla_x f(x, \cdot)\)
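
Example (sketch; the import path is an assumption):

import numpy as np
import torch

from pyrado.algorithms.utils import get_grad_via_torch  # assumed import path

def f(x: torch.Tensor) -> torch.Tensor:
    # f(x) = sum(x_i^2) operates on PyTorch tensors, so the gradient is 2*x
    return torch.sum(torch.pow(x, 2))

x_np = np.array([1.0, -2.0, 3.0])
grad_np = get_grad_via_torch(x_np, f)  # expected: array([ 2., -4.,  6.])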

num_iter_from_rollouts(ros: Optional[Sequence[StepSequence]], concat_ros: Optional[StepSequence], batch_size: int) int[source]

Get the number of iterations from the given rollout data.

Parameters:
  • ros – multiple rollouts

  • concat_ros – concatenated rollouts

  • batch_size – number of samples per batch

Returns:

number of iterations (e.g. used for the progress bar)

until_thold_exceeded(thold: float, max_rep: Optional[int] = None)[source]

Designed to wrap a function and repeat it until the return value exceeds a threshold.

Note

The wrapped function must accept the kwarg cnt_rep. This can be useful to trigger different behavior depending on the number of repetitions already done.

Parameters:
  • thold – threshold

  • max_rep – maximum number of repetitions of the wrapped function, set to None to run the loop relentlessly

Returns:

wrapped function
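
Example (sketch, assuming until_thold_exceeded() is used as a decorator factory whose wrapper passes the repetition counter via the cnt_rep kwarg and repeats until the scalar return value exceeds thold):

from pyrado.algorithms.utils import until_thold_exceeded  # assumed import path

@until_thold_exceeded(thold=0.9, max_rep=5)
def train_and_eval(cnt_rep: int) -> float:
    # cnt_rep counts the repetitions done so far and can be used to change the behavior
    print(f"repetition {cnt_rep}")
    return 0.25 * (cnt_rep + 1)  # dummy stand-in for e.g. an average success rate

final_value = train_and_eval()  # repeats until the return value exceeds 0.9 or 5 repetitions are done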
