algorithms
base
- class Algorithm(save_dir: PathLike, max_iter: int, policy: Optional[Policy], logger: Optional[StepLogger] = None, save_name: str = 'algo')[source]
Bases: ABC, LoggerAware
Base class of all algorithms in Pyrado. Algorithms specify how the policy is updated as well as the exploration strategy used to acquire samples.
Constructor
- Parameters:
save_dir – directory in which the snapshots, i.e. the results, are saved
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
policy – Pyrado policy (subclass of PyTorch’s Module) to train
logger – logger for every step of the algorithm, if None the default logger will be created
save_name – name of the algorithm’s pickle file without the file extension; this becomes important if the algorithm is run as a subroutine
- static clip_grad(module: Module, max_grad_norm: Optional[float]) float [source]
Clip all gradients of the provided Module (e.g., a policy or an advantage estimator) by their L2 norm value.
Note
The gradient clipping has to be applied between loss.backward() and optimizer.step().
- Parameters:
module – Module containing parameters
max_grad_norm – maximum L2 norm for the gradient
- Returns:
total norm of the parameters (viewed as a single vector)
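The following is a minimal sketch of where clip_grad() fits in a PyTorch update. The import path pyrado.algorithms.base, the plain linear module, and the placeholder loss are assumptions for illustration only.
    import torch
    import torch.nn as nn
    from pyrado.algorithms.base import Algorithm  # assumed import path

    module = nn.Linear(4, 2)  # stands in for a policy or advantage estimator
    optimizer = torch.optim.Adam(module.parameters(), lr=1e-3)

    loss = module(torch.randn(8, 4)).pow(2).mean()  # placeholder loss

    optimizer.zero_grad()
    loss.backward()
    # Clipping must happen between loss.backward() and optimizer.step()
    total_norm = Algorithm.clip_grad(module, max_grad_norm=1.0)
    optimizer.step()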
- property curr_iter: int
Get the current iteration counter.
- property expl_strat: Optional[Union[StochasticActionExplStrat, StochasticParamExplStrat]]
Get the algorithm’s exploration strategy.
- init_modules(warmstart: bool, suffix: str = '', prefix: Optional[str] = None, **kwargs)[source]
Initialize the algorithm’s learnable modules, e.g. a policy or value function. Overwrite this method if the algorithm uses a learnable module besides the policy, e.g. a value function.
- Parameters:
warmstart – if True, the algorithm starts learning with a non-random initialization. This can either be a fixed parameter vector or the loaded result of the previous iteration.
suffix – keyword for meta_info when loading from previous iteration
prefix – keyword for meta_info when loading from previous iteration
kwargs – keyword arguments for initialization, e.g. policy_param_init or valuefcn_param_init
- iteration_key: str = 'iteration'
- load_snapshot(parsed_args) Tuple[Env, Policy, dict] [source]
Load the state of an experiment, which is specific to the algorithm.
- Parameters:
parsed_args – arguments parsed by the argparser
- Returns:
environment, policy, and (optional) algorithm-specific output, e.g. value function
- make_snapshot(snapshot_mode: str, curr_avg_ret: Optional[float] = None, meta_info: Optional[dict] = None)[source]
Make a snapshot of the training progress. This method is called from the subclasses and delegates to the custom method save_snapshot().
- Parameters:
snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)
curr_avg_ret – current average return used for the snapshot_mode ‘best’ to trigger save_snapshot()
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
- property max_iter: int
Get the maximum number of iterations.
- name: str = None
- reset(seed: Optional[int] = None)[source]
Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.
- Parameters:
seed – seed value for the random number generators, pass None for no seeding
- property sample_count: int
Get the total number of samples, i.e. steps of a rollout, used for training so far.
- property sampler: Optional[SamplerBase]
Get the sampler. For algorithms with multiple samplers, this is the one collecting the training data.
- property save_dir: str
Get the directory where the data is saved to.
- property save_name: str
Get the name for saving this algorithm instance, e.g. ‘algo’ if saved to ‘algo.pkl’.
- save_snapshot(meta_info: Optional[dict] = None)[source]
Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.
- Parameters:
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
- abstract step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]
Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.
- Parameters:
snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
- property stopping_criterion: StoppingCriterion
Get the stopping criterion.
- stopping_criterion_met() bool [source]
Checks if the stopping criterion is met.
Note
We need this method because the stopping criterion’s is_met(algo) method requires an instance of the algorithm. If we simply used the is_met(algo) function of the exposed property, we would have to pass the algorithm instance manually. Moreover, if we changed the is_met(algo) function to not require the algorithm, initializing the latter would require the former and vice versa, which would be a circular dependency.
- Returns:
True if the stopping criterion is met, see also StoppingCriterion.is_met(algo)
- train(snapshot_mode: str = 'latest', seed: Optional[int] = None, meta_info: Optional[dict] = None)[source]
Train one or multiple policies in a given environment.
- Parameters:
snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new high-score)
seed – seed value for the random number generators, pass None for no seeding
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
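To make the train()/step() contract concrete, here is a schematic sketch of a toy subclass. The import path pyrado.algorithms.base and the use of policy=None are assumptions for illustration only; a real algorithm would receive a Pyrado Policy, collect rollouts in step(), and call make_snapshot() there.
    import tempfile
    from typing import Optional

    from pyrado.algorithms.base import Algorithm  # assumed import path

    class PrintingAlgo(Algorithm):
        """Toy subclass that only illustrates the step()/train() contract."""

        name: str = "printing_algo"  # subclasses are expected to set a name

        def step(self, snapshot_mode: str, meta_info: Optional[dict] = None):
            # A real algorithm would sample data, update the policy, log metrics,
            # and call self.make_snapshot(snapshot_mode, curr_avg_ret, meta_info) here.
            print(f"running iteration {self.curr_iter} of {self.max_iter}")

    # policy=None is only for this illustration
    algo = PrintingAlgo(save_dir=tempfile.mkdtemp(), max_iter=3, policy=None)
    algo.train(snapshot_mode="latest", seed=1)  # calls step() until max_iter or the stopping criterion is met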
- class InterruptableAlgorithm(num_checkpoints: int, init_checkpoint: int = 0, *args, **kwargs)[source]
Bases: Algorithm, ABC
A simple checkpoint system to keep track of the algorithm’s progress. The cyclic counter starts at init_checkpoint, counts up to (and including) num_checkpoints, and is then reset to zero.
Constructor
- Parameters:
num_checkpoints – total number of checkpoints
init_checkpoint – initial value of the cyclic counter, defaults to 0; negative values can be used to mark sections that should only be executed once
args – positional arguments forwarded to Algorithm’s constructor
kwargs – keyword arguments forwarded to Algorithm’s constructor
- property curr_checkpoint: int
Get the current checkpoint counter.
- reached_checkpoint(meta_info: Optional[dict] = None)[source]
Increase the cyclic counter by 1. When the counter reaches the maximum number of checkpoints defined in the constructor, it is automatically reset to zero. This method also saves the algorithm instance using save_snapshot(); otherwise, increasing the checkpoint counter would have no lasting effect.
- Parameters:
meta_info – information forwarded to save_snapshot()
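A schematic sketch of how the checkpoint counter is typically used inside step() of an InterruptableAlgorithm subclass. The import path and the two-phase split (num_checkpoints=1) are assumptions for illustration.
    from typing import Optional

    from pyrado.algorithms.base import InterruptableAlgorithm  # assumed import path

    class TwoPhaseAlgo(InterruptableAlgorithm):
        """Schematic subclass: checkpoint 0 collects data, checkpoint 1 updates the policy."""

        name: str = "two_phase_algo"

        def __init__(self, *args, **kwargs):
            # The counter runs 0, 1, and then wraps around (see the class docstring)
            super().__init__(num_checkpoints=1, *args, **kwargs)

        def step(self, snapshot_mode: str, meta_info: Optional[dict] = None):
            if self.curr_checkpoint == 0:
                ...  # e.g. collect rollouts
                self.reached_checkpoint(meta_info)  # saves a snapshot and advances the counter
            if self.curr_checkpoint == 1:
                ...  # e.g. update the policy
                self.reached_checkpoint(meta_info)  # wraps the counter around for the next iteration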
- reset(seed: Optional[int] = None)[source]
Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.
- Parameters:
seed – seed value for the random number generators, pass None for no seeding
utils
- class ActionStatistics(act_distr: Distribution, log_probs: Tensor, entropy: Tensor)[source]
Bases: tuple
act_distr – probability distribution at the given policy output values
log_probs – \(\log (p(act|obs, hidden))\) if hidden exists, else \(\log (p(act|obs))\)
entropy – entropy of the action distribution
Create new instance of ActionStatistics(act_distr, log_probs, entropy)
- property act_distr
Alias for field number 0
- property entropy
Alias for field number 2
- property log_probs
Alias for field number 1
- class ReplayMemory(capacity: int)[source]
Bases: object
Base class for storing step transitions
Constructor
- Parameters:
capacity – number of steps a.k.a. transitions in the memory
- avg_reward() float [source]
Compute the average reward for all steps stored in the replay memory.
- Returns:
average reward
- property isempty: bool
Check if the replay buffer is empty.
- property memory: StepSequence
Get the replay buffer.
- push(ros: Union[list, StepSequence], truncate_last: bool = True)[source]
Save a sequence of steps and drop the oldest steps if the capacity is exceeded.
- Parameters:
ros – list of rollouts or one concatenated rollout
truncate_last – remove the last step from each rollout, forwarded to StepSequence.concat
- class RolloutSavingWrapper(wrapped_sampler: SamplerBase, rollouts: List[List[StepSequence]] = <factory>)[source]
Bases: object
A wrapper for SamplerBase objects where calls to SamplerBase.sample() are intercepted and the results stored in this wrapper before they are returned to the caller.
- Usage:
    ros = RolloutSavingWrapper(subroutine.sampler)
    subroutine.sampler = ros
- reset_rollouts() None [source]
Reset the internal rollout variable. Intended to be called before save_snapshot() in order to reduce the serialized object’s size.
- rollouts: List[List[StepSequence]]
- sample(*args, **kwargs) List[StepSequence] [source]
Like SamplerBase.sample() but keeps a copy of all returned values.
- wrapped_sampler: SamplerBase
- compute_action_statistics(steps: StepSequence, expl_strat: StochasticActionExplStrat) ActionStatistics [source]
Get the action distribution from the exploration strategy, then compute the log action probabilities and the entropy for the given rollout using that exploration strategy.
Note
Requires the exploration strategy to have a (most likely custom) evaluate() method.
- Parameters:
steps – recorded rollout data
expl_strat – exploration strategy used to generate the data
- Returns:
collected action statistics, see ActionStatistics
- get_grad_via_torch(x_np: ndarray, fcn_to: Callable, *args_to, **kwargs_to) ndarray [source]
Get the gradient of a function operating on PyTorch tensors by casting the input x_np to a PyTorch tensor and casting the resulting gradient back to a NumPy array.
- Parameters:
x_np – input vector \(x\)
fcn_to – function \(f(x, \cdot)\)
args_to – other arguments to the function
kwargs_to – other keyword arguments to the function
- Returns:
\(\nabla_x f(x, \cdot)\)
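A usage sketch, assuming the import path pyrado.algorithms.utils; the function quadratic is made up for this example.
    import numpy as np
    import torch
    from pyrado.algorithms.utils import get_grad_via_torch  # assumed import path

    def quadratic(x: torch.Tensor, scale: float) -> torch.Tensor:
        # f(x) = scale * ||x||^2, hence grad_x f(x) = 2 * scale * x
        return scale * torch.sum(x**2)

    x_np = np.array([1.0, -2.0, 3.0])
    grad_np = get_grad_via_torch(x_np, quadratic, 2.0)  # expect approx. [4., -8., 12.]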
- num_iter_from_rollouts(ros: Optional[Sequence[StepSequence]], concat_ros: Optional[StepSequence], batch_size: int) int [source]
Get the number of iterations from the given rollout data.
- Parameters:
ros – multiple rollouts
concat_ros – concatenated rollouts
batch_size – number of samples per batch
- Returns:
number of iterations (e.g. used for the progress bar)
- until_thold_exceeded(thold: float, max_rep: Optional[int] = None)[source]
Designed to wrap a function and repeat it until the return value exceeds a threshold.
Note
The wrapped function must accept the kwarg cnt_rep. This can be useful to trigger different behavior depending on the number of repetitions already done.
- Parameters:
thold – threshold
max_rep – maximum number of repetitions of the wrapped function, set to None to run the loop relentlessly
- Returns:
wrapped function
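A usage sketch, assuming the import path pyrado.algorithms.utils and that until_thold_exceeded() is applied as a parameterized decorator; the evaluation function is a stand-in that simply improves with each repetition.
    from pyrado.algorithms.utils import until_thold_exceeded  # assumed import path

    @until_thold_exceeded(thold=0.9, max_rep=5)
    def train_and_eval(cnt_rep: int) -> float:
        # The wrapped function must accept the kwarg cnt_rep (number of repetitions done so far)
        return 0.5 + 0.2 * cnt_rep  # stand-in for an evaluation score

    score = train_and_eval()  # repeated until the return value exceeds 0.9, at most 5 times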