meta

adr

class ADR(ex_dir: PathLike, env: Env, subrtn: Algorithm, adr_hp: Dict, svpg_hp: Dict, reward_generator_hp: Dict, max_iter: int, num_discriminator_epoch: int, batch_size: int, svpg_warmup: int = 0, num_workers: int = 4, num_trajs_per_config: int = 8, log_exploration: bool = False, randomized_params: Optional[Sequence[str]] = None, logger: Optional[StepLogger] = None)[source]

Bases: Algorithm

Active Domain Randomization (ADR)

See also

[1] B. Mehta, M. Diaz, F. Golemo, C.J. Pal, L. Paull, “Active Domain Randomization”, arXiv, 2019

Constructor

Parameters:
  • ex_dir – directory to save the snapshots i.e. the results in

  • env – the environment to train in

  • subrtn – algorithm which performs the policy / value-function optimization

  • max_iter – maximum number of iterations

  • svpg_particle_hparam – SVPG particle hyperparameters

  • num_svpg_particles – number of SVPG particles

  • num_discriminator_epoch – epochs in discriminator training

  • batch_size – batch size for training

  • svpg_learning_rate – SVPG particle optimizers’ learning rate

  • svpg_temperature – SVPG temperature coefficient (how strongly the particles influence each other)

  • svpg_evaluation_steps – how many configurations to sample between training

  • svpg_horizon – how many steps until the particles are reset

  • svpg_kl_factor – KL reward coefficient

  • svpg_warmup – number of iterations without SVPG training at the beginning

  • svpg_serial – serial mode (see SVPG)

  • num_workers – number of environments for parallel sampling

  • num_trajs_per_config – number of trajectories to sample from each config

  • max_step_length – maximum change of physics parameters per step

  • randomized_params – which parameters to randomize

  • logger – logger for every step of the algorithm, if None the default logger will be created
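
A minimal construction sketch for the constructor above, not taken from this reference: the environment env, the policy-optimization subroutine subrtn, and the contents of the three hyper-parameter dicts are placeholders (the SVPG-related keys listed above presumably belong to svpg_hp, but their exact names are assumptions).

    from pyrado.algorithms.meta.adr import ADR  # module path assumed from the headings of this page

    # `env` (an Env with a domain randomizer) and `subrtn` (e.g. a PPO instance operating on
    # `env`) are assumed to be constructed beforehand; the dict contents are hypothetical.
    algo = ADR(
        ex_dir="/tmp/adr_demo",
        env=env,
        subrtn=subrtn,
        adr_hp=dict(),                    # hypothetical placeholder, e.g. max_step_length
        svpg_hp=dict(),                   # hypothetical placeholder, see the SVPG parameters above
        reward_generator_hp=dict(),       # hypothetical placeholder, see RewardGenerator below
        max_iter=50,
        num_discriminator_epoch=10,
        batch_size=128,
        num_trajs_per_config=8,
        randomized_params=["ball_mass"],  # hypothetical domain parameter name
    )
    algo.train(snapshot_mode="latest")    # assuming the base Algorithm.train() entry point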

compute_params(sim_instances: Tensor, t: int)[source]

Compute the parameters.

Parameters:
  • sim_instances – Physics configurations rollout

  • t – time step to choose

Returns:

the parameters at the given time step

convert_and_detach(arg0)[source]
name: str = 'adr'
property sample_count: int

Get the total number of samples, i.e. steps of a rollout, used for training so far.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

class RewardGenerator(env_spec: EnvSpec, batch_size: int, reward_multiplier: float, lr: float = 0.003, hidden_size=256, logger: Optional[StepLogger] = None, device: str = 'cpu')[source]

Bases: object

Class for generating the discriminator rewards in ADR. Generates a reward using a trained discriminator network.

Constructor

Parameters:
  • env_spec – environment specification

  • batch_size – batch size for each update step

  • reward_multiplier – factor for the predicted probability

  • lr – learning rate

  • logger – logger for every step of the algorithm, if None the default logger will be created
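
A hedged sketch of how the reward generator above could be used; the environment env (assumed to expose an EnvSpec via env.spec) and the two StepSequence batches ref_traj and rand_traj are assumed to come from rollouts collected elsewhere.

    from pyrado.algorithms.meta.adr import RewardGenerator  # module path assumed

    # `ref_traj` was recorded in the fixed reference environment, `rand_traj` in a randomized one.
    rew_gen = RewardGenerator(env_spec=env.spec, batch_size=64, reward_multiplier=1.0, lr=3e-3)
    loss = rew_gen.train(reference_trajectory=ref_traj, randomized_trajectory=rand_traj, num_epoch=20)
    score = rew_gen.get_reward(rand_traj)  # high if the discriminator deems the trajectory randomized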

get_reward(traj: StepSequence) Tensor[source]

Compute the reward of a trajectory. Trajectories that are deemed not to come from the fixed reference environment yield a high reward.

Parameters:

traj – trajectory to evaluate

Returns:

a score

Return type:

to.Tensor

train(reference_trajectory: StepSequence, randomized_trajectory: StepSequence, num_epoch: int) Tensor[source]
class SVPGAdapter(wrapped_env: Env, parameters: Sequence[DomainParam], inner_policy: Policy, discriminator, num_particles: int, step_length: float = 0.01, horizon: int = 50, num_rollouts_per_config: int = 8, num_workers: int = 4, max_steps: int = 8)[source]

Bases: EnvWrapper, Serializable

Wrapper to encapsulate the domain parameter search as an RL task.

Constructor

Parameters:
  • wrapped_env – the environment to wrap

  • parameters – which physics parameters should be randomized

  • inner_policy – the policy to train the subrtn on

  • discriminator – the discriminator to distinguish reference environments from randomized ones

  • step_length – the step size

  • horizon – the SVPG horizon, i.e. the number of steps until the particles are reset

  • num_rollouts_per_config – number of trajectories to sample per physics configuration

  • num_workers – number of environments for parallel sampling

property act_space: Space

Get the space of the actions.

array_to_dict(arr)[source]
eval_states(states: Sequence[ndarray])[source]

Evaluate the states.

Parameters:

states – the states to evaluate

Returns:

the respective rewards and the corresponding trajectories

nominal()[source]
nominal_dict()[source]
property obs_space: Space

Get the space of the observations (agent’s perception of the environment).

params()[source]
reset(i=None, init_state: Optional[ndarray] = None, domain_param: Optional[dict] = None) ndarray[source]

Reset the environment to its initial state and optionally set different domain parameters.

Parameters:
  • init_state – set explicit initial state if not None

  • domain_param – set explicit domain parameters if not None

Return obs:

initial observation of the state.

step(act: ndarray, i: int) tuple[source]

Perform one time step of the simulation. When a terminal condition is met, the reset function is called.

Parameters:

act – action to be taken in the step

Return tuple of obs, reward, done, and info:

  • obs – current observation of the environment

  • reward – reward depending on the selected reward function

  • done – indicates whether the episode has ended

  • env_info – contains diagnostic information about the environment

preprocess_rollout(rollout: StepSequence) Tensor[source]

Extract observations and actions from a StepSequence and pack them into a PyTorch tensor.

Parameters:

rollout – a StepSequence instance containing a trajectory

Returns:

a PyTorch tensor containing the trajectory

arpl

class ARPL(save_dir: PathLike, env: Union[SimEnv, StateAugmentationWrapper], subrtn: Algorithm, policy: Policy, max_iter: int, logger: Optional[StepLogger] = None)[source]

Bases: Algorithm

Adversarially Robust Policy Learning (ARPL)

See also

A. Mandlekar, Y. Zhu, A. Garg, L. Fei-Fei, S. Savarese, “Adversarially Robust Policy Learning: Active Construction of Physically-Plausible Perturbations”, IROS, 2017

Constructor

Parameters:
  • save_dir – directory to save the snapshots i.e. the results in

  • env – the environment in which the agent should be trained

  • subrtn – algorithm which performs the policy / value-function optimization

  • policy – policy to be updated

  • max_iter – the maximum number of iterations

  • logger – logger for every step of the algorithm, if None the default logger will be created

name: str = 'arpl'
property sample_count: int

Get the total number of samples, i.e. steps of a rollout, used for training so far.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

static wrap_env(env, policy, dynamics=False, process=False, observation=False, dyn_eps: float = 0.01, dyn_phi: float = 0.1, halfspan: float = 0.25, proc_eps: float = 0.01, proc_phi: float = 0.05, torch_observation=None, obs_eps: float = 0.01, obs_phi: float = 0.05)[source]
Parameters:
  • env – the environment in which the agent should be trained

  • policy – policy to be updated

  • dynamics – whether adversarially generated dynamics noise should be applied

  • process – whether adversarially generated process noise should be applied

  • observation – whether adversarially generated observation noise should be applied

  • dyn_eps – the intensity of generated dynamics noise

  • dyn_phi – the probability of applying dynamics noise

  • halfspan – the halfspan of the uniform random distribution used to sample

  • proc_eps – the intensity of generated process noise

  • proc_phi – the probability of applying process noise

  • obs_eps – the intensity of generated observation noise

  • obs_phi – the probability of applying observation noise
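
A short sketch of wrapping an environment with adversarial perturbations via the static method above before handing it to the ARPL constructor; env (a SimEnv or StateAugmentationWrapper) and policy are assumed to exist already, and the import path is assumed from the headings of this page.

    from pyrado.algorithms.meta.arpl import ARPL  # module path assumed

    # enable adversarial dynamics and process noise, keep observations clean
    env_adv = ARPL.wrap_env(
        env,
        policy,
        dynamics=True,
        process=True,
        observation=False,
        dyn_eps=0.01,
        dyn_phi=0.1,
        proc_eps=0.01,
        proc_phi=0.05,
    )
    # `env_adv` is then passed to ARPL(...) together with a policy-optimization subroutine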

bayessim

bayrn

epopt

class EPOpt(env: EnvWrapper, subrtn: Algorithm, skip_iter: int, epsilon: float, gamma: float = 1.0)[source]

Bases: Algorithm

Ensemble Policy Optimization (EPOpt)

This algorithm wraps another algorithm on a shallow level. It replaces the subroutine’s sampler with a CVaRSampler, but does not have its own logger.

See also

[1] A. Rajeswaran, S. Ghotra, B. Ravindran, S. Levine, “EPOpt: Learning Robust Neural Network Policies using Model Ensembles”, ICLR, 2017

Constructor

Parameters:
  • env – same environment as the subroutine runs in. Only used for checking and saving the randomizer.

  • subrtn – algorithm which performs the policy / value-function optimization

  • skip_iter – number of iterations for which all rollouts will be used (see prefix ‘full’)

  • epsilon – quantile of (worst) rollouts that will be kept

  • gamma – discount factor to compute the discounted return, default is 1 (no discount)
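
A minimal sketch of wrapping an existing subroutine with EPOpt as documented above; env (the randomized environment the subroutine runs in) and subrtn (e.g. a PPO instance) are assumptions.

    from pyrado.algorithms.meta.epopt import EPOpt  # module path assumed

    # keep all rollouts for the first 10 iterations, then train on the worst 20 % (CVaR)
    algo = EPOpt(env=env, subrtn=subrtn, skip_iter=10, epsilon=0.2, gamma=0.995)
    algo.train(snapshot_mode="latest")  # assuming the base Algorithm.train() entry point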

load_snapshot(parsed_args) Tuple[Env, Policy, dict][source]

Load the state of an experiment, which is specific to the algorithm.

Parameters:

parsed_args – arguments parsed by the argparser

Returns:

environment, policy, and (optional) algorithm-specific output, e.g. value function

name: str = 'epopt'
reset(seed: Optional[int] = None)[source]

Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.

Parameters:

seed – seed value for the random number generators, pass None for no seeding

property sample_count: int

Get the total number of samples, i.e. steps of a rollout, used for training so far.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

property subroutine: Algorithm

Get the policy optimization subroutine.

iudr

class IUDR(env: DomainRandWrapper, subroutine: Algorithm, max_iter: int, performance_threshold: float, param_adjustment_portion: float = 0.9)[source]

Bases: Algorithm

Incremental Uniform Domain Randomization (IUDR).

This is an ablation of SPDR in the sense that the optimization is omitted and the contextual distribution is naively updated in fixed steps, disregarding the performance information.

Constructor

Parameters:
  • env – environment wrapped in a DomainRandWrapper

  • subroutine – algorithm which performs the policy/value-function optimization; note that this algorithm must be capable of learning a sufficient policy in its maximum number of iterations

  • max_iter – iterations of the IUDR algorithm (not for the subroutine); changing the domain parameter distribution is done by linear interpolation over this many iterations

  • performance_threshold – lower bound for the performance that has to be reached before the domain parameter randomization is changed

  • param_adjustment_portion – what portion of the IUDR iterations should be spent on adjusting the domain parameter distributions; defaults to 90%
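
A hedged construction sketch for IUDR; env (wrapped in a DomainRandWrapper) and subrtn are assumed to be set up beforehand, and the performance threshold value is a hypothetical return.

    from pyrado.algorithms.meta.iudr import IUDR  # module path assumed

    algo = IUDR(
        env=env,                        # DomainRandWrapper around the simulation
        subroutine=subrtn,              # must learn a sufficient policy within its own max_iter
        max_iter=20,
        performance_threshold=500.0,    # hypothetical return threshold
        param_adjustment_portion=0.9,   # spend 90 % of the iterations adjusting the distribution
    )
    algo.train(snapshot_mode="latest")  # assuming the base Algorithm.train() entry point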

name: str = 'iudr'
reset(seed: Optional[int] = None)[source]

Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.

Parameters:

seed – seed value for the random number generators, pass None for no seeding

property sample_count: int

Get the total number of samples, i.e. steps of a rollout, used for training so far.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Perform a step of IUDR. This includes training the subroutine and updating the context distribution accordingly. For a description of the parameters see pyrado.algorithms.base.Algorithm.step.

npdr

pddr

class PDDR(save_dir: str, env: Env, policy: Policy, lr: float = 0.0005, std_init: float = 0.15, min_steps: int = 1500, num_epochs: int = 10, max_iter: int = 500, num_teachers: int = 8, teacher_extra: Optional[dict] = None, teacher_policy: Optional[Policy] = None, teacher_algo: Optional[callable] = None, teacher_algo_hparam: Optional[dict] = None, randomizer: Optional[DomainRandomizer] = None, logger: Optional[StepLogger] = None, num_workers: int = 4)[source]

Bases: InterruptableAlgorithm

Policy Distillation with Domain Randomization (PDDR)

Constructor

Parameters:
  • save_dir – directory to save the snapshots i.e. the results in

  • env – the environment in which the policy operates

  • policy – policy to be updated

  • lr – (initial) learning rate for the optimizer which can be modified by the scheduler. By default, the learning rate is constant.

  • std_init – initial standard deviation on the actions for the exploration noise

  • min_steps – minimum number of state transitions sampled per policy update batch

  • num_epochs – number of epochs (how often we iterate over the same batch)

  • max_iter – number of iterations (policy updates)

  • num_teachers – number of teachers that are used for distillation

  • teacher_extra – extra dict from PDDRTeachers algo. If provided, teachers are loaded from there

  • teacher_policy – policy to be updated (is duplicated for each teacher)

  • teacher_algo – algorithm class to be used for training the teachers

  • teacher_algo_hparam – hyper-params to be used for teacher_algo

  • randomizer – randomizer for sampling the teacher domain parameters; if None, the environment’s default one is used

  • logger – logger for every step of the algorithm, if None the default logger will be created

  • num_workers – number of environments for parallel sampling
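
A hedged sketch of setting up PDDR with freshly trained teachers; env and the student policy are assumed to exist, PPO is used as a stand-in teacher algorithm, and its import path and hyper-parameters are assumptions.

    from pyrado.algorithms.meta.pddr import PDDR  # module path assumed
    from pyrado.algorithms.step_based.ppo import PPO  # assumed location of the teacher algorithm

    algo = PDDR(
        save_dir="/tmp/pddr_demo",
        env=env,
        policy=policy,                           # the student policy to be distilled into
        num_teachers=4,
        teacher_policy=policy,                   # duplicated internally for each teacher
        teacher_algo=PPO,
        teacher_algo_hparam=dict(max_iter=150),  # hypothetical; supply the full PPO hyper-parameter set
        num_workers=4,
    )
    # depending on the setup, the teachers may first need to be trained via algo.train_teachers()
    algo.train(snapshot_mode="latest")           # assuming the base Algorithm.train() entry point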

property expl_strat: NormalActNoiseExplStrat

Get the algorithm’s exploration strategy.

load_snapshot(parsed_args) Tuple[Env, Policy, dict][source]

Load the state of an experiment, which is specific to the algorithm.

Parameters:

parsed_args – arguments parsed by the argparser

Returns:

environment, policy, and (optional) algorithm-specific output, e.g. value function

load_teacher_experiment(exp: Experiment)[source]

Load teachers from PDDRTeachers experiment.

Parameters:

exp – the teacher’s experiment object

load_teachers()[source]

Recursively load all teachers that can be found in the current experiment’s directory.

name: str = 'pddr'
prune_teachers()[source]

Prune teachers to only use the first num_teachers of them.

sample() Tuple[List[List[StepSequence]], array, array][source]

Samples observations from several samplers.

Returns:

list of rollouts per sampler, list of all returns, list of all rollout lengths

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

set_random_envs()[source]

Creates random environments of the given type.

step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Performs a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

train_teachers(snapshot_mode: str = 'latest', seed: Optional[int] = None)[source]

Trains all teachers.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new high-score)

  • seed – seed value for the random number generators, pass None for no seeding

unpack_teachers(extra: dict)[source]

Unpack teachers from PDDRTeachers experiment.

Parameters:

extra – dict with teacher data

update(*args: Any, **kwargs: Any)[source]

Update the policy’s (and value functions’) parameters based on the collected rollout data.

sbi_base

simopt

class SimOpt(save_dir: PathLike, env_sim: MetaDomainRandWrapper, env_real: Union[RealEnv, EnvWrapper], subrtn_policy: Algorithm, subrtn_distr: SysIdViaEpisodicRL, max_iter: int, num_eval_rollouts: int = 5, thold_succ: float = inf, thold_succ_subrtn: float = -inf, warmstart: bool = True, policy_param_init: Optional[Tensor] = None, valuefcn_param_init: Optional[Tensor] = None, subrtn_snapshot_mode: str = 'latest', logger: Optional[StepLogger] = None)[source]

Bases: InterruptableAlgorithm

Simulation Optimization (SimOpt)

Note

A candidate is a set of parameter values for the domain parameter distribution and its value is the discrepancy between the simulated and real observations (based on a weighted metric).

See also

[1] Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N.D. Ratliff, D. Fox, “Closing the Sim-to-Real Loop: Adapting Simulation Randomization with Real World Experience”, ICRA, 2019

Constructor

Note

If you want to continue an experiment, use the load_dir argument for the train call. If you want to initialize each of the policies with pre-trained policy parameters, use policy_param_init.

Parameters:
  • save_dir – directory to save the snapshots i.e. the results in

  • env_sim – randomized simulation environment a.k.a. source domain

  • env_real – real-world environment a.k.a. target domain

  • subrtn_policy – algorithm which performs the optimization of the behavioral policy (and value-function)

  • subrtn_distr – algorithm which performs the optimization of the domain parameter distribution policy

  • max_iter – maximum number of iterations

  • num_eval_rollouts – number of rollouts in the target domain to estimate the return

  • thold_succ – success threshold on the real system’s return for SimOpt, stop the algorithm if exceeded

  • thold_succ_subrtn – success threshold on the simulated system’s return for the subrtn, repeat the subrtn until the threshold is exceeded or for a given number of iterations

  • warmstart – initialize the policy (and value function) parameters with the ones of the previous iteration. This behavior can also be overruled by passing policy_param_init (and valuefcn_param_init) explicitly.

  • policy_param_init – initial policy parameter values for the subrtn, set None to be random

  • valuefcn_param_init – initial value function parameter values for the subrtn, set None to be random

  • subrtn_snapshot_mode – snapshot mode for saving during training of the subrtn

  • logger – logger for every step of the algorithm, if None the default logger will be created
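
A skeletal, hedged sketch of the SimOpt setup described above; env_sim (a MetaDomainRandWrapper), env_real, subrtn_policy (e.g. PPO), and subrtn_distr (a SysIdViaEpisodicRL instance) are all assumed to be constructed elsewhere.

    from pyrado.algorithms.meta.simopt import SimOpt  # module path assumed

    algo = SimOpt(
        save_dir="/tmp/simopt_demo",
        env_sim=env_sim,              # randomized simulation (source domain)
        env_real=env_real,            # real platform or a second simulator (target domain)
        subrtn_policy=subrtn_policy,  # optimizes the behavioral policy
        subrtn_distr=subrtn_distr,    # optimizes the domain parameter distribution
        max_iter=15,
        num_eval_rollouts=5,
        warmstart=True,
    )
    algo.train(snapshot_mode="latest")  # assuming the base Algorithm.train() entry point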

static eval_behav_policy(save_dir: Optional[str], env: Union[RealEnv, SimEnv, MetaDomainRandWrapper], policy: Policy, prefix: str, num_rollouts: int, init_states: Optional[ndarray] = None, seed: int = 1001) Sequence[StepSequence][source]

Evaluate a policy on the target system (real-world platform). This method is static to facilitate evaluation of specific policies in hindsight.

Parameters:
  • save_dir – directory to save the snapshots i.e. the results in, if None nothing is saved

  • env – environment for evaluation, in the sim-2-sim case this is another simulation instance

  • policy – policy to evaluate

  • prefix – to control the saving for the evaluation of an initial policy, None to deactivate

  • num_rollouts – number of rollouts to collect on the target domain

  • init_states – pass the initial states of the real system to sync the simulation (mandatory in this case)

  • seed – seed value for the random number generators, only used when evaluating in simulation

Returns:

rollouts

static eval_ddp_policy(rollouts_real: Sequence[StepSequence], env_sim: MetaDomainRandWrapper, num_rollouts: int, subrtn_distr: SysIdViaEpisodicRL, subrtn_policy: Algorithm) float[source]

Evaluate the policy that fits the domain parameter distribution to the observed rollouts.

Parameters:
  • rollouts_real – recorded real-world rollouts

  • env_sim – randomized simulation environment a.k.a. source domain

  • num_rollouts – number of rollouts to collect on the target domain

  • subrtn_distr – algorithm which performs the optimization of the domain parameter distribution policy

  • subrtn_policy – algorithm which performs the optimization of the behavioral policy (and value-function)

Returns:

average system identification loss

iteration_key: str = 'simopt_iteration'
load_snapshot(parsed_args) Tuple[Env, Policy, dict][source]

Load the state of an experiment, which is specific to the algorithm.

Parameters:

parsed_args – arguments parsed by the argparser

Returns:

environment, policy, and (optional) algorithm-specific output, e.g. value function

name: str = 'simopt'
property sample_count: int

Get the total number of samples, i.e. steps of a rollout, used for training so far.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str = 'latest', meta_info: Optional[dict] = None)[source]

Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

property subroutine_distr: SysIdViaEpisodicRL

Get the system identification subroutine.

property subroutine_policy: Algorithm

Get the policy optimization subroutine.

train_ddp_policy(rollouts_real: Sequence[StepSequence], prefix: str) float[source]

Train and evaluate the policy that parametrizes the domain randomizer, such that the loss given by the instance of SysIdViaEpisodicRL is minimized.

Parameters:
  • rollouts_real – recorded real-world rollouts

  • prefix – set a prefix to the saved file name, use “” for no prefix

Returns:

average system identification loss

train_policy_sim(cand: Tensor, prefix: str, cnt_rep: int) float[source]

Train a policy in simulation for given hyper-parameters from the domain randomizer.

Parameters:
  • cand – hyper-parameters for the domain parameter distribution (must be compatible with the randomizer)

  • prefix – set a prefix to the saved file name, use “” for no prefix

  • cnt_rep – current repetition count, coming from the wrapper function

Returns:

estimated return of the trained policy in the target domain

spdr

class MultivariateNormalWrapper(mean: Tensor, cov_chol: Tensor)[source]

Bases: object

A wrapper for PyTorch’s multivariate normal distribution with diagonal covariance. It is used to get a SciPy optimizer-ready version of the parameters of a distribution, i.e. a vector that can be used as the target variable.

Constructor.

Parameters:
  • mean – mean of the distribution; shape (k,)

  • cov_chol – Cholesky decomposition of the covariance matrix; must be lower triangular; shape (k, k) if it is the actual matrix or shape (k * (k + 1) / 2,) if it is raveled

property cov

Get the covariance matrix, shape (k, k).

property cov_chol: Tensor

Get the Cholesky decomposition of the covariance; shape (k, k).

property cov_chol_tril: Tensor

Get the lower triangular of the Cholesky decomposition of the covariance; shape (k * (k + 1) / 2).

property dim

Get the size (dimensionality) of the random variable.

static from_stacked(dim: int, stacked: ndarray) MultivariateNormalWrapper[source]

Create an instance of this class from the given stacked numpy array, as generated e.g. by get_stacked().

Parameters:
  • dim – dimensionality k of the random variable

  • stacked – array of shape (k + k * (k + 1) / 2,) containing the mean and the Cholesky factor of the covariance, where the first k entries are the mean and the last k * (k + 1) / 2 entries are the lower triangular entries of the Cholesky decomposition of the covariance matrix

Returns:

a MultivariateNormalWrapper with the given mean/cov.

get_stacked() ndarray[source]

Get the numpy representations of the mean and transformed covariance stacked on top of each other.

Returns:

stacked mean and transformed covariance; shape (k + k * (k + 1) / 2,)

property mean

Get the mean.

parameters() Iterator[Tensor][source]

Get the parameters (mean and lower triangular covariance Cholesky) of this distribution.
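
A small example of the stack/unstack round trip described above; the import path is assumed from the spdr heading of this page.

    import torch as to

    from pyrado.algorithms.meta.spdr import MultivariateNormalWrapper  # module path assumed

    # 2-dimensional distribution: mean of shape (2,), lower-triangular Cholesky factor of shape (2, 2)
    mean = to.tensor([1.0, -0.5])
    cov_chol = to.diag(to.tensor([0.2, 0.3]))  # Cholesky factor of the diagonal covariance diag(0.04, 0.09)
    distr = MultivariateNormalWrapper(mean, cov_chol)

    stacked = distr.get_stacked()  # numpy array of shape (2 + 3,): mean followed by the tril entries
    # ... hand `stacked` to a SciPy optimizer as the flat decision variable ...
    restored = MultivariateNormalWrapper.from_stacked(dim=2, stacked=stacked)  # same distribution again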

class SPDR(env: DomainRandWrapper, subroutine: Algorithm, kl_constraints_ub: float, max_iter: int, performance_lower_bound: float, var_lower_bound: Optional[float] = 0.04, kl_threshold: float = 0.1, optimize_mean: bool = True, optimize_cov: bool = True, max_subrtn_retries: int = 1)[source]

Bases: Algorithm

Self-Paced Domain Randomization (SPDR)

This algorithm wraps another algorithm. The main purpose is to apply self-paced RL [1].

See also

[1] P. Klink, H. Abdulsamad, B. Belousov, C. D’Eramo, J. Peters, and J. Pajarinen, “A Probabilistic Interpretation of Self-Paced Learning with Applications to Reinforcement Learning”, arXiv, 2021

Constructor

Parameters:
  • env – environment wrapped in a DomainRandWrapper

  • subroutine – algorithm which performs the policy/value-function optimization, which must expose its sampler

  • kl_constraints_ub – upper bound for the KL-divergence

  • max_iter – maximum number of iterations of the SPDR algorithm (not of the subroutine)

  • performance_lower_bound – lower bound for the performance SPDR tries to stay above during distribution updates

  • var_lower_bound – clipping value for the variance, necessary when using very small target variances; prefer a log-transformation instead

  • kl_threshold – threshold for the KL-divergence until which std_lower_bound is enforced

  • optimize_mean – whether the mean should be changed or considered fixed

  • optimize_cov – whether the (co-)variance should be changed or considered fixed

  • max_subrtn_retries – how often a failed (median performance < 30 % of performance_lower_bound) training attempt of the subroutine should be reattempted
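
A hedged construction sketch for SPDR; env (wrapped in a DomainRandWrapper whose randomizer carries the self-paced domain parameters) and subrtn (which must expose its sampler) are assumed, and the bound values are illustrative only.

    from pyrado.algorithms.meta.spdr import SPDR  # module path assumed

    algo = SPDR(
        env=env,
        subroutine=subrtn,
        kl_constraints_ub=0.8,            # illustrative KL upper bound
        max_iter=30,
        performance_lower_bound=500.0,    # hypothetical return threshold
        optimize_mean=True,
        optimize_cov=True,
    )
    algo.train(snapshot_mode="latest")    # assuming the base Algorithm.train() entry point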

property dim: int
load_snapshot(parsed_args) Tuple[Env, Policy, dict][source]

Load the state of an experiment, which is specific to the algorithm.

Parameters:

parsed_args – arguments parsed by the argparser

Returns:

environment, policy, and (optional) algorithm-specific output, e.g. value function

name: str = 'spdr'
reset(seed: Optional[int] = None)[source]

Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.

Parameters:

seed – seed value for the random number generators, pass None for no seeding

property sample_count: int

Get the total number of samples, i.e. steps of a rollout, used for training so far.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Perform a step of SPDR. This includes training the subroutine and updating the context distribution accordingly. For a description of the parameters see pyrado.algorithms.base.Algorithm.step.

property subrtn_sampler: RolloutSavingWrapper
ravel_tril_elements(A: Tensor) Tensor[source]
unravel_tril_elements(a: Tensor) Tensor[source]

spota

class SPOTA(save_dir: PathLike, env: DomainRandWrapperBuffer, subrtn_cand: Algorithm, subrtn_refs: Algorithm, max_iter: int, alpha: float, beta: float, nG: int, nJ: int, ntau: int, nc_init: int, nr_init: int, sequence_cand: callable, sequence_refs: callable, warmstart_cand: bool = False, warmstart_refs: bool = True, cand_policy_param_init: Optional[Tensor] = None, cand_critic_param_init: Optional[Tensor] = None, num_bs_reps: int = 1000, studentized_ci: bool = False, base_seed: Optional[int] = None, logger: Optional[StepLogger] = None)[source]

Bases: InterruptableAlgorithm

Simulation-based Policy Optimization with Probability Assessment (SPOTA)

Note

We use each domain parameter set \(\xi_{j,nr}\) for \(n_{\tau}\) rollouts. The candidate and the reference policies must have the same architecture!

See also

[1] F. Muratore, M. Gienger, J. Peters, “Assessing Transferability from Simulation to Reality for Reinforcement Learning”, PAMI, 2021

[2] W. Mak, D.P. Morton, and R.K. Wood, “Monte Carlo bounding techniques for determining solution quality in stochastic programs”, Oper. Res. Lett., 1999

Constructor

Parameters:
  • save_dir – directory to save the snapshots i.e. the results in

  • env – the environment in which the policy operates

  • subrtn_cand – the algorithm that is called at every iteration of SPOTA to yield a candidate policy

  • subrtn_refs – the algorithm that is called at every iteration of SPOTA to yield reference policies

  • max_iter – maximum number of iterations that the SPOTA algorithm runs. Each of these iterations includes multiple iterations of the subroutine.

  • alpha – confidence level for the upper confidence bound (UCBOG)

  • beta – optimality gap threshold for training

  • nG – number of reference solutions

  • nJ – number of samples for Monte-Carlo approximation of the optimality gap

  • ntau – number of rollouts per domain parameter set

  • nc_init – initial number of domains for training the candidate solution

  • nr_init – initial number of domains for training the reference solutions

  • sequence_cand – mathematical sequence for the number of domains for training the candidate solution

  • sequence_refs – mathematical sequence for the number of domains for training the reference solutions

  • warmstart_cand – flag if the next candidate solution should be initialized with the previous one

  • warmstart_refs – flag if the reference solutions should be initialized with the current candidate

  • cand_policy_param_init – initial policy parameter values for the candidate, set None to be random

  • cand_critic_param_init – initial critic parameter values for the candidate, set None to be random

  • num_bs_reps – number of replications for the statistical bootstrap

  • studentized_ci – flag if a Student’s t-distribution should be applied for the confidence interval

  • base_seed – seed added to all other seeds in order to make the experiments distinct but repeatable

  • logger – logger for every step of the algorithm, if None the default logger will be created

iteration_key: str = 'spota_iteration'
load_snapshot(parsed_args) Tuple[Env, Policy, dict][source]

Load the state of an experiment, which is specific to the algorithm.

Parameters:

parsed_args – arguments parsed by the argparser

Returns:

environment, policy, and (optional) algorithm-specific output, e.g. value function

name: str = 'spota'
property policy: Policy

Get the algorithm’s policy.

property sample_count: int

Get the total number of samples, i.e. steps of a rollout, used for training so far.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str = 'latest', meta_info: Optional[dict] = None)[source]

Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

property subroutine_cand: Algorithm

Get the candidate subroutine.

udr

class UDR(env: EnvWrapper, subrtn: Algorithm)[source]

Bases: Algorithm

Uniform Domain Randomization (UDR)

This algorithm is a thin wrapper around another algorithm. Its main purpose is to check that the domain randomizer is set up.

Constructor

Parameters:
  • env – same environment as the subroutine runs in. Only used for checking and saving the randomizer.

  • subrtn – algorithm which performs the policy / value-function optimization
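
Since UDR only checks the randomizer and then delegates to its subroutine, the usage sketch is short; env and subrtn are assumed to be set up as for the subroutine alone, and the import path is assumed from the headings of this page.

    from pyrado.algorithms.meta.udr import UDR  # module path assumed

    algo = UDR(env=env, subrtn=subrtn)  # `env` must carry a domain randomizer
    algo.train(snapshot_mode="latest")  # assuming the base Algorithm.train() entry point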

load_snapshot(parsed_args) Tuple[Env, Policy, dict][source]

Load the state of an experiment, which is specific to the algorithm.

Parameters:

parsed_args – arguments parsed by the argparser

Returns:

environment, policy, and (optional) algorithm-specific output, e.g. value function

name: str = 'udr'
reset(seed: Optional[int] = None)[source]

Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.

Parameters:

seed – seed value for the random number generators, pass None for no seeding

property sample_count: int

Get the total number of samples, i.e. steps of a rollout, used for training so far.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

property subroutine: Algorithm

Get the policy optimization subroutine.

Module contents