meta

adr

class ADR(ex_dir: PathLike, env: Env, subrtn: Algorithm, adr_hp: Dict, svpg_hp: Dict, reward_generator_hp: Dict, max_iter: int, num_discriminator_epoch: int, batch_size: int, svpg_warmup: int = 0, num_workers: int = 4, num_trajs_per_config: int = 8, log_exploration: bool = False, randomized_params: Optional[Sequence[str]] = None, logger: Optional[StepLogger] = None)[source]

Bases: Algorithm

Active Domain Randomization (ADR)

See also

[1] B. Mehta, M. Diaz, F. Golemo, C.J. Pal, L. Paull, “Active Domain Randomization”, arXiv, 2019

Constructor

Parameters:
  • ex_dir – directory to save the snapshots i.e. the results in

  • env – the environment to train in

  • subrtn – algorithm which performs the policy / value-function optimization

  • max_iter – maximum number of iterations

  • svpg_particle_hparam – SVPG particle hyperparameters

  • num_svpg_particles – number of SVPG particles

  • num_discriminator_epoch – epochs in discriminator training

  • batch_size – batch size for training

  • svpg_learning_rate – SVPG particle optimizers’ learning rate

  • svpg_temperature – SVPG temperature coefficient (how strongly the particles influence each other)

  • svpg_evaluation_steps – how many configurations to sample between training

  • svpg_horizon – how many steps until the particles are reset

  • svpg_kl_factor – KL reward coefficient

  • svpg_warmup – number of iterations without SVPG training at the beginning

  • svpg_serial – serial mode (see SVPG)

  • num_workers – number of environments for parallel sampling

  • num_trajs_per_config – number of trajectories to sample from each config

  • max_step_length – maximum change of physics parameters per step

  • randomized_params – which parameters to randomize

  • logger – logger for every step of the algorithm, if None the default logger will be created
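
A minimal construction sketch for the constructor above, not taken from this reference: the environment env, the policy-optimization subroutine subrtn, and the contents of the three hyper-parameter dicts are placeholders (the SVPG-related keys listed above presumably belong to svpg_hp, but their exact names are assumptions).

    from pyrado.algorithms.meta.adr import ADR  # module path assumed from the headings of this page

    # `env` (an Env with a domain randomizer) and `subrtn` (e.g. a PPO instance operating on
    # `env`) are assumed to be constructed beforehand; the dict contents are hypothetical.
    algo = ADR(
        ex_dir="/tmp/adr_demo",
        env=env,
        subrtn=subrtn,
        adr_hp=dict(),                    # hypothetical placeholder, e.g. max_step_length
        svpg_hp=dict(),                   # hypothetical placeholder, see the SVPG parameters above
        reward_generator_hp=dict(),       # hypothetical placeholder, see RewardGenerator below
        max_iter=50,
        num_discriminator_epoch=10,
        batch_size=128,
        num_trajs_per_config=8,
        randomized_params=["ball_mass"],  # hypothetical domain parameter name
    )
    algo.train(snapshot_mode="latest")    # assuming the base Algorithm.train() entry point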

compute_params(sim_instances: Tensor, t: int)[source]

Compute the parameters.

Parameters:
  • sim_instances – Physics configurations rollout

  • t – time step to choose

Returns:

the parameters at the given time step

convert_and_detach(arg0)[source]
name: str = 'adr'
property sample_count: int

Get the total number of samples, i.e. steps of a rollout, used for training so far.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

class RewardGenerator(env_spec: EnvSpec, batch_size: int, reward_multiplier: float, lr: float = 0.003, hidden_size=256, logger: Optional[StepLogger] = None, device: str = 'cpu')[source]

Bases: object

Class for generating the discriminator rewards in ADR. Generates a reward using a trained discriminator network.

Constructor

Parameters:
  • env_spec – environment specification

  • batch_size – batch size for each update step

  • reward_multiplier – factor for the predicted probability

  • lr – learning rate

  • logger – logger for every step of the algorithm, if None the default logger will be created
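
A hedged sketch of how the reward generator above could be used; the environment env (assumed to expose an EnvSpec via env.spec) and the two StepSequence batches ref_traj and rand_traj are assumed to come from rollouts collected elsewhere.

    from pyrado.algorithms.meta.adr import RewardGenerator  # module path assumed

    # `ref_traj` was recorded in the fixed reference environment, `rand_traj` in a randomized one.
    rew_gen = RewardGenerator(env_spec=env.spec, batch_size=64, reward_multiplier=1.0, lr=3e-3)
    loss = rew_gen.train(reference_trajectory=ref_traj, randomized_trajectory=rand_traj, num_epoch=20)
    score = rew_gen.get_reward(rand_traj)  # high if the discriminator deems the trajectory randomized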

get_reward(traj: StepSequence) Tensor[source]

Compute the reward of a trajectory. Trajectories that are deemed not to come from the fixed reference environment yield a high reward.

Parameters:

traj – trajectory to evaluate

Returns:

a score

Return type:

to.Tensor

train(reference_trajectory: StepSequence, randomized_trajectory: StepSequence, num_epoch: int) Tensor[source]
class SVPGAdapter(wrapped_env: Env, parameters: Sequence[DomainParam], inner_policy: Policy, discriminator, num_particles: int, step_length: float = 0.01, horizon: int = 50, num_rollouts_per_config: int = 8, num_workers: int = 4, max_steps: int = 8)[source]

Bases: EnvWrapper, Serializable

Wrapper to encapsulate the domain parameter search as an RL task.

Constructor

Parameters:
  • wrapped_env – the environment to wrap

  • parameters – which physics parameters should be randomized

  • inner_policy – the policy to train the subrtn on

  • discriminator – the discriminator to distinguish reference environments from randomized ones

  • step_length – the step size

  • horizon – the SVPG horizon, i.e. the number of steps until the particles are reset

  • num_rollouts_per_config – number of trajectories to sample per physics configuration

  • num_workers – number of environments for parallel sampling

property act_space: Space

Get the space of the actions.

array_to_dict(arr)[source]
eval_states(states: Sequence[ndarray])[source]

Evaluate the states.

Parameters:

states – the states to evaluate

Returns:

the respective rewards and the corresponding trajectories

nominal()[source]
nominal_dict()[source]
property obs_space: Space

Get the space of the observations (agent’s perception of the environment).

params()[source]
reset(i=None, init_state: Optional[ndarray] = None, domain_param: Optional[dict] = None) ndarray[source]

Reset the environment to its initial state and optionally set different domain parameters.

Parameters:
  • init_state – set explicit initial state if not None

  • domain_param – set explicit domain parameters if not None

Return obs:

initial observation of the state.

step(act: ndarray, i: int) tuple[source]

Perform one time step of the simulation. When a terminal condition is met, the reset function is called.

Parameters:

act – action to be taken in the step

Return tuple of obs, reward, done, and info:

  • obs – current observation of the environment

  • reward – reward depending on the selected reward function

  • done – indicates whether the episode has ended

  • env_info – contains diagnostic information about the environment

preprocess_rollout(rollout: StepSequence) Tensor[source]

Extract observations and actions from a StepSequence and pack them into a PyTorch tensor.

Parameters:

rollout – a StepSequence instance containing a trajectory

Returns:

a PyTorch tensor containing the trajectory

arpl

class ARPL(save_dir: PathLike, env: Union[SimEnv, StateAugmentationWrapper], subrtn: Algorithm, policy: Policy, max_iter: int, logger: Optional[StepLogger] = None)[source]

Bases: Algorithm

Adversarially Robust Policy Learning (ARPL)

See also

A. Mandlekar, Y. Zhu, A. Garg, L. Fei-Fei, S. Savarese, “Adversarially Robust Policy Learning: Active Construction of Physically-Plausible Perturbations”, IROS, 2017

Constructor

Parameters:
  • save_dir – directory to save the snapshots i.e. the results in

  • env – the environment in which the agent should be trained

  • subrtn – algorithm which performs the policy / value-function optimization

  • policy – policy to be updated

  • max_iter – the maximum number of iterations

  • logger – logger for every step of the algorithm, if None the default logger will be created

name: str = 'arpl'
property sample_count: int

Get the total number of samples, i.e. steps of a rollout, used for training so far.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

static wrap_env(env, policy, dynamics=False, process=False, observation=False, dyn_eps: float = 0.01, dyn_phi: float = 0.1, halfspan: float = 0.25, proc_eps: float = 0.01, proc_phi: float = 0.05, torch_observation=None, obs_eps: float = 0.01, obs_phi: float = 0.05)[source]
Parameters:
  • env – the environment in which the agent should be trained

  • policy – policy to be updated

  • dynamics – whether adversarially generated dynamics noise should be applied

  • process – whether adversarially generated process noise should be applied

  • observation – whether adversarially generated observation noise should be applied

  • dyn_eps – the intensity of generated dynamics noise

  • dyn_phi – the probability of applying dynamics noise

  • halfspan – the halfspan of the uniform random distribution used to sample

  • proc_eps – the intensity of generated process noise

  • proc_phi – the probability of applying process noise

  • obs_eps – the intensity of generated observation noise

  • obs_phi – the probability of applying observation noise
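
A short sketch of wrapping an environment with adversarial perturbations via the static method above before handing it to the ARPL constructor; env (a SimEnv or StateAugmentationWrapper) and policy are assumed to exist already, and the import path is assumed from the headings of this page.

    from pyrado.algorithms.meta.arpl import ARPL  # module path assumed

    # enable adversarial dynamics and process noise, keep observations clean
    env_adv = ARPL.wrap_env(
        env,
        policy,
        dynamics=True,
        process=True,
        observation=False,
        dyn_eps=0.01,
        dyn_phi=0.1,
        proc_eps=0.01,
        proc_phi=0.05,
    )
    # `env_adv` is then passed to ARPL(...) together with a policy-optimization subroutine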

bayessim

bayrn

epopt

class EPOpt(env: EnvWrapper, subrtn: Algorithm, skip_iter: int, epsilon: float, gamma: float = 1.0)[source]

Bases: Algorithm

Ensemble Policy Optimization (EPOpt)

This algorithm wraps another algorithm on a shallow level. It replaces the subroutine’s sampler with a CVaRSampler, but does not have its own logger.

See also

[1] A. Rajeswaran, S. Ghotra, B. Ravindran, S. Levine, “EPOpt: Learning Robust Neural Network Policies using Model Ensembles”, ICLR, 2017

Constructor

Parameters:
  • env – same environment as the subroutine runs in. Only used for checking and saving the randomizer.

  • subrtn – algorithm which performs the policy / value-function optimization

  • skip_iter – number of iterations for which all rollouts will be used (see prefix ‘full’)

  • epsilon – quantile of (worst) rollouts that will be kept

  • gamma – discount factor to compute the discounted return, default is 1 (no discount)
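
A minimal sketch of wrapping an existing subroutine with EPOpt as documented above; env (the randomized environment the subroutine runs in) and subrtn (e.g. a PPO instance) are assumptions.

    from pyrado.algorithms.meta.epopt import EPOpt  # module path assumed

    # keep all rollouts for the first 10 iterations, then train on the worst 20 % (CVaR)
    algo = EPOpt(env=env, subrtn=subrtn, skip_iter=10, epsilon=0.2, gamma=0.995)
    algo.train(snapshot_mode="latest")  # assuming the base Algorithm.train() entry point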

load_snapshot(parsed_args) Tuple[Env, Policy, dict][source]

Load the state of an experiment, which is specific to the algorithm.

Parameters:

parsed_args – arguments parsed by the argparser

Returns:

environment, policy, and (optional) algorithm-specific output, e.g. value function

name: str = 'epopt'
reset(seed: Optional[int] = None)[source]

Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.

Parameters:

seed – seed value for the random number generators, pass None for no seeding

property sample_count: int

Get the total number of samples, i.e. steps of a rollout, used for training so far.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

property subroutine: Algorithm

Get the policy optimization subroutine.

iudr

class IUDR(env: DomainRandWrapper, subroutine: Algorithm, max_iter: int, performance_threshold: float, param_adjustment_portion: float = 0.9)[source]

Bases: Algorithm

Incremental Uniform Domain Randomization (IUDR).

This is an ablation of SPDR in the sense that the optimization is omitted and the contextual distribution is naively updated in fixed steps, disregarding the performance information.

Constructor

Parameters:
  • env – environment wrapped in a DomainRandWrapper

  • subroutine – algorithm which performs the policy/value-function optimization; note that this algorithm must be capable of learning a sufficient policy in its maximum number of iterations

  • max_iter – iterations of the IUDR algorithm (not for the subroutine); changing the domain parameter distribution is done by linear interpolation over this many iterations

  • performance_threshold – lower bound for the performance that has to be reached before the domain parameter randomization is changed

  • param_adjustment_portion – what portion of the IUDR iterations should be spent on adjusting the domain parameter distributions; defaults to 90%
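
A hedged construction sketch for IUDR; env (wrapped in a DomainRandWrapper) and subrtn are assumed to be set up beforehand, and the performance threshold value is a hypothetical return.

    from pyrado.algorithms.meta.iudr import IUDR  # module path assumed

    algo = IUDR(
        env=env,                        # DomainRandWrapper around the simulation
        subroutine=subrtn,              # must learn a sufficient policy within its own max_iter
        max_iter=20,
        performance_threshold=500.0,    # hypothetical return threshold
        param_adjustment_portion=0.9,   # spend 90 % of the iterations adjusting the distribution
    )
    algo.train(snapshot_mode="latest")  # assuming the base Algorithm.train() entry point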

name: str = 'iudr'
reset(seed: Optional[int] = None)[source]

Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.

Parameters:

seed – seed value for the random number generators, pass None for no seeding

property sample_count: int

Get the total number of samples, i.e. steps of a rollout, used for training so far.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Perform a step of IUDR. This includes training the subroutine and updating the context distribution accordingly. For a description of the parameters see pyrado.algorithms.base.Algorithm.step.

npdr

pddr

class PDDR(save_dir: str, env: Env, policy: Policy, lr: float = 0.0005, std_init: float = 0.15, min_steps: int = 1500, num_epochs: int = 10, max_iter: int = 500, num_teachers: int = 8, teacher_extra: Optional[dict] = None, teacher_policy: Optional[Policy] = None, teacher_algo: Optional[callable] = None, teacher_algo_hparam: Optional[dict] = None, randomizer: Optional[DomainRandomizer] = None, logger: Optional[StepLogger] = None, num_workers: int = 4)[source]

Bases: InterruptableAlgorithm

Policy Distillation with Domain Randomization (PDDR)

Constructor

Parameters:
  • save_dir – directory to save the snapshots i.e. the results in

  • env – the environment in which the policy operates

  • policy – policy to be updated

  • lr – (initial) learning rate for the optimizer which can be modified by the scheduler. By default, the learning rate is constant.

  • std_init – initial standard deviation on the actions for the exploration noise

  • min_steps – minimum number of state transitions sampled per policy update batch

  • num_epochs – number of epochs (how often we iterate over the same batch)

  • max_iter – number of iterations (policy updates)

  • num_teachers – number of teachers that are used for distillation

  • teacher_extra – extra dict from PDDRTeachers algo. If provided, teachers are loaded from there

  • teacher_policy – policy to be updated (is duplicated for each teacher)

  • teacher_algo – algorithm class to be used for training the teachers

  • teacher_algo_hparam – hyper-params to be used for teacher_algo

  • randomizer – randomizer for sampling the teacher domain parameters; if None, the environment’s default one is used

  • logger – logger for every step of the algorithm, if None the default logger will be created

  • num_workers – number of environments for parallel sampling
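
A hedged sketch of setting up PDDR with freshly trained teachers; env and the student policy are assumed to exist, PPO is used as a stand-in teacher algorithm, and its import path and hyper-parameters are assumptions.

    from pyrado.algorithms.meta.pddr import PDDR  # module path assumed
    from pyrado.algorithms.step_based.ppo import PPO  # assumed location of the teacher algorithm

    algo = PDDR(
        save_dir="/tmp/pddr_demo",
        env=env,
        policy=policy,                           # the student policy to be distilled into
        num_teachers=4,
        teacher_policy=policy,                   # duplicated internally for each teacher
        teacher_algo=PPO,
        teacher_algo_hparam=dict(max_iter=150),  # hypothetical; supply the full PPO hyper-parameter set
        num_workers=4,
    )
    # depending on the setup, the teachers may first need to be trained via algo.train_teachers()
    algo.train(snapshot_mode="latest")           # assuming the base Algorithm.train() entry point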

property expl_strat: NormalActNoiseExplStrat

Get the algorithm’s exploration strategy.

load_snapshot(parsed_args) Tuple[Env, Policy, dict][source]

Load the state of an experiment, which is specific to the algorithm.

Parameters:

parsed_args – arguments parsed by the argparser

Returns:

environment, policy, and (optional) algorithm-specific output, e.g. value function

load_teacher_experiment(exp: Experiment)[source]

Load teachers from PDDRTeachers experiment.

Parameters:

exp – the teacher’s experiment object

load_teachers()[source]

Recursively load all teachers that can be found in the current experiment’s directory.

name: str = 'pddr'
prune_teachers()[source]

Prune teachers to only use the first num_teachers of them.

sample() Tuple[List[List[StepSequence]], array, array][source]

Samples observations from several samplers.

Returns:

list of rollouts per sampler, list of all returns, list of all rollout lengths

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

set_random_envs()[source]

Creates random environments of the given type.

step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Performs a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

train_teachers(snapshot_mode: str = 'latest', seed: Optional[int] = None)[source]

Trains all teachers.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new high-score)

  • seed – seed value for the random number generators, pass None for no seeding

unpack_teachers(extra: dict)[source]

Unpack teachers from PDDRTeachers experiment.

Parameters:

extra – dict with teacher data

update(*args: Any, **kwargs: Any)[source]

Update the policy’s (and value functions’) parameters based on the collected rollout data.

sbi_base

simopt

class SimOpt(save_dir: PathLike, env_sim: MetaDomainRandWrapper, env_real: Union[RealEnv, EnvWrapper], subrtn_policy: Algorithm, subrtn_distr: SysIdViaEpisodicRL, max_iter: int, num_eval_rollouts: int = 5, thold_succ: float = inf, thold_succ_subrtn: float = -inf, warmstart: bool = True, policy_param_init: Optional[Tensor] = None, valuefcn_param_init: Optional[Tensor] = None, subrtn_snapshot_mode: str = 'latest', logger: Optional[StepLogger] = None)[source]

Bases: InterruptableAlgorithm

Simulation Optimization (SimOpt)

Note

A candidate is a set of parameter values for the domain parameter distribution and its value is the discrepancy between the simulated and real observations (based on a weighted metric).

See also

[1] Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N.D. Ratliff, D. Fox, “Closing the Sim-to-Real Loop: Adapting Simulation Randomization with Real World Experience”, ICRA, 2019

Constructor

Note

If you want to continue an experiment, use the load_dir argument for the train call. If you want to initialize each of the policies with pre-trained policy parameters, use policy_param_init.

Parameters:
  • save_dir – directory to save the snapshots i.e. the results in

  • env_sim – randomized simulation environment a.k.a. source domain

  • env_real – real-world environment a.k.a. target domain

  • subrtn_policy – algorithm which performs the optimization of the behavioral policy (and value-function)

  • subrtn_distr – algorithm which performs the optimization of the domain parameter distribution policy

  • max_iter – maximum number of iterations

  • num_eval_rollouts – number of rollouts in the target domain to estimate the return

  • thold_succ – success threshold on the real system’s return for SimOpt, stop the algorithm if exceeded

  • thold_succ_subrtn – success threshold on the simulated system’s return for the subrtn, repeat the subrtn until the threshold is exceeded or for a given number of iterations

  • warmstart – initialize the policy (and value function) parameters with the ones of the previous iteration. This behavior can also be overruled by passing policy_param_init (and valuefcn_param_init) explicitly.

  • policy_param_init – initial policy parameter values for the subrtn, set None to be random

  • valuefcn_param_init – initial value function parameter values for the subrtn, set None to be random

  • subrtn_snapshot_mode – snapshot mode for saving during training of the subrtn

  • logger – logger for every step of the algorithm, if None the default logger will be created
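
A skeletal, hedged sketch of the SimOpt setup described above; env_sim (a MetaDomainRandWrapper), env_real, subrtn_policy (e.g. PPO), and subrtn_distr (a SysIdViaEpisodicRL instance) are all assumed to be constructed elsewhere.

    from pyrado.algorithms.meta.simopt import SimOpt  # module path assumed

    algo = SimOpt(
        save_dir="/tmp/simopt_demo",
        env_sim=env_sim,              # randomized simulation (source domain)
        env_real=env_real,            # real platform or a second simulator (target domain)
        subrtn_policy=subrtn_policy,  # optimizes the behavioral policy
        subrtn_distr=subrtn_distr,    # optimizes the domain parameter distribution
        max_iter=15,
        num_eval_rollouts=5,
        warmstart=True,
    )
    algo.train(snapshot_mode="latest")  # assuming the base Algorithm.train() entry point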

static eval_behav_policy(save_dir: Optional[str], env: Union[RealEnv, SimEnv, MetaDomainRandWrapper], policy: Policy, prefix: str, num_rollouts: int, init_states: Optional[ndarray] = None, seed: int = 1001) Sequence[StepSequence][source]

Evaluate a policy on the target system (real-world platform). This method is static to facilitate evaluation of specific policies in hindsight.

Parameters:
  • save_dir – directory to save the snapshots i.e. the results in, if None nothing is saved

  • env – environment for evaluation, in the sim-2-sim case this is another simulation instance

  • policy – policy to evaluate

  • prefix – to control the saving for the evaluation of an initial policy, None to deactivate

  • num_rollouts – number of rollouts to collect on the target domain

  • init_states – pass the initial states of the real system to sync the simulation (mandatory in this case)

  • seed – seed value for the random number generators, only used when evaluating in simulation

Returns:

rollouts

static eval_ddp_policy(rollouts_real: Sequence[StepSequence], env_sim: MetaDomainRandWrapper, num_rollouts: int, subrtn_distr: SysIdViaEpisodicRL, subrtn_policy: Algorithm) float[source]

Evaluate the policy that fits the domain parameter distribution to the observed rollouts.

Parameters:
  • rollouts_real – recorded real-world rollouts

  • env_sim – randomized simulation environment a.k.a. source domain

  • num_rollouts – number of rollouts to collect on the target domain

  • subrtn_distr – algorithm which performs the optimization of the domain parameter distribution policy

  • subrtn_policy – algorithm which performs the optimization of the behavioral policy (and value-function)

Returns:

average system identification loss

iteration_key: str = 'simopt_iteration'
load_snapshot(parsed_args) Tuple[Env, Policy, dict][source]

Load the state of an experiment, which is specific to the algorithm.

Parameters:

parsed_args – arguments parsed by the argparser

Returns:

environment, policy, and (optional) algorithm-specific output, e.g. value function

name: str = 'simopt'
property sample_count: int

Get the total number of samples, i.e. steps of a rollout, used for training so far.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str = 'latest', meta_info: Optional[dict] = None)[source]

Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

property subroutine_distr: SysIdViaEpisodicRL

Get the system identification subroutine.

property subroutine_policy: Algorithm

Get the policy optimization subroutine.

train_ddp_policy(rollouts_real: Sequence[StepSequence], prefix: str) float[source]

Train and evaluate the policy that parametrizes the domain randomizer, such that the loss given by the instance of SysIdViaEpisodicRL is minimized.

Parameters:
  • rollouts_real – recorded real-world rollouts

  • prefix – set a prefix to the saved file name, use “” for no prefix

Returns:

average system identification loss

train_policy_sim(cand: Tensor, prefix: str, cnt_rep: int) float[source]

Train a policy in simulation for given hyper-parameters from the domain randomizer.

Parameters:
  • cand – hyper-parameters for the domain parameter distribution (must be compatible with the randomizer)

  • prefix – set a prefix to the saved file name, use “” for no prefix

  • cnt_rep – current repetition count, coming from the wrapper function

Returns:

estimated return of the trained policy in the target domain

spdr

class MultivariateNormalWrapper(mean: Tensor, cov_chol: Tensor)[source]

Bases: object

A wrapper for PyTorch’s multivariate normal distribution with diagonal covariance. It is used to get a SciPy optimizer-ready version of the parameters of a distribution, i.e. a vector that can be used as the target variable.

Constructor.

Parameters:
  • mean – mean of the distribution; shape (k,)

  • cov_chol – Cholesky decomposition of the covariance matrix; must be lower triangular; shape (k, k) if it is the actual matrix or shape (k * (k + 1) / 2,) if it is raveled

property cov

Get the covariance matrix, shape (k, k).

property cov_chol: Tensor

Get the Cholesky decomposition of the covariance; shape (k, k).

property cov_chol_tril: Tensor

Get the lower triangular of the Cholesky decomposition of the covariance; shape (k * (k + 1) / 2).

property dim

Get the size (dimensionality) of the random variable.

static from_stacked(dim: int, stacked: ndarray) MultivariateNormalWrapper[source]

Create an instance of this class from the given stacked numpy array, as generated e.g. by get_stacked().

Parameters:
  • dim – dimensionality k of the random variable

  • stacked – array of shape (k + k * (k + 1) / 2,) containing the mean and the Cholesky factor of the covariance, where the first k entries are the mean and the last k * (k + 1) / 2 entries are the lower triangular entries of the Cholesky decomposition of the covariance matrix

Returns:

a MultivariateNormalWrapper with the given mean/cov.

get_stacked() ndarray[source]

Get the numpy representations of the mean and transformed covariance stacked on top of each other.

Returns:

stacked mean and transformed covariance; shape (k + k * (k + 1) / 2,)

property mean

Get the mean.

parameters() Iterator[Tensor][source]

Get the parameters (mean and lower triangular covariance Cholesky) of this distribution.
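
A small example of the stack/unstack round trip described above; the import path is assumed from the spdr heading of this page.

    import torch as to

    from pyrado.algorithms.meta.spdr import MultivariateNormalWrapper  # module path assumed

    # 2-dimensional distribution: mean of shape (2,), lower-triangular Cholesky factor of shape (2, 2)
    mean = to.tensor([1.0, -0.5])
    cov_chol = to.diag(to.tensor([0.2, 0.3]))  # Cholesky factor of the diagonal covariance diag(0.04, 0.09)
    distr = MultivariateNormalWrapper(mean, cov_chol)

    stacked = distr.get_stacked()  # numpy array of shape (2 + 3,): mean followed by the tril entries
    # ... hand `stacked` to a SciPy optimizer as the flat decision variable ...
    restored = MultivariateNormalWrapper.from_stacked(dim=2, stacked=stacked)  # same distribution again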

class SPDR(env: DomainRandWrapper, subroutine: Algorithm, kl_constraints_ub: float, max_iter: int, performance_lower_bound: float, var_lower_bound: Optional[float] = 0.04, kl_threshold: float = 0.1, optimize_mean: bool = True, optimize_cov: bool = True, max_subrtn_retries: int = 1)[source]

Bases: Algorithm

Self-Paced Domain Randomization (SPDR)

This algorithm wraps another algorithm. The main purpose is to apply self-paced RL [1].

See also

[1] P. Klink, H. Abdulsamad, B. Belousov, C. D’Eramo, J. Peters, and J. Pajarinen, “A Probabilistic Interpretation of Self-Paced Learning with Applications to Reinforcement Learning”, arXiv, 2021

Constructor

Parameters:
  • env – environment wrapped in a DomainRandWrapper

  • subroutine – algorithm which performs the policy/value-function optimization, which must expose its sampler

  • kl_constraints_ub – upper bound for the KL-divergence

  • max_iter – maximum number of iterations of the SPDR algorithm (not of the subroutine)

  • performance_lower_bound – lower bound for the performance SPDR tries to stay above during distribution updates

  • var_lower_bound – clipping value for the variance, necessary when using very small target variances; prefer a log-transformation instead

  • kl_threshold – threshold for the KL-divergence until which std_lower_bound is enforced

  • optimize_mean – whether the mean should be changed or considered fixed

  • optimize_cov – whether the (co-)variance should be changed or considered fixed

  • max_subrtn_retries – how often a failed (median performance < 30 % of performance_lower_bound) training attempt of the subroutine should be reattempted
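
A hedged construction sketch for SPDR; env (wrapped in a DomainRandWrapper whose randomizer carries the self-paced domain parameters) and subrtn (which must expose its sampler) are assumed, and the bound values are illustrative only.

    from pyrado.algorithms.meta.spdr import SPDR  # module path assumed

    algo = SPDR(
        env=env,
        subroutine=subrtn,
        kl_constraints_ub=0.8,            # illustrative KL upper bound
        max_iter=30,
        performance_lower_bound=500.0,    # hypothetical return threshold
        optimize_mean=True,
        optimize_cov=True,
    )
    algo.train(snapshot_mode="latest")    # assuming the base Algorithm.train() entry point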

property dim: int
load_snapshot(parsed_args) Tuple[Env, Policy, dict][source]

Load the state of an experiment, which is specific to the algorithm.

Parameters:

parsed_args – arguments parsed by the argparser

Returns:

environment, policy, and (optional) algorithm-specific output, e.g. value function

name: str = 'spdr'
reset(seed: Optional[int] = None)[source]

Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.

Parameters:

seed – seed value for the random number generators, pass None for no seeding

property sample_count: int

Get the total number of samples, i.e. steps of a rollout, used for training so far.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Perform a step of SPDR. This includes training the subroutine and updating the context distribution accordingly. For a description of the parameters see pyrado.algorithms.base.Algorithm.step.

property subrtn_sampler: RolloutSavingWrapper
ravel_tril_elements(A: Tensor) Tensor[source]
unravel_tril_elements(a: Tensor) Tensor[source]

spota

class SPOTA(save_dir: PathLike, env: DomainRandWrapperBuffer, subrtn_cand: Algorithm, subrtn_refs: Algorithm, max_iter: int, alpha: float, beta: float, nG: int, nJ: int, ntau: int, nc_init: int, nr_init: int, sequence_cand: callable, sequence_refs: callable, warmstart_cand: bool = False, warmstart_refs: bool = True, cand_policy_param_init: Optional[Tensor] = None, cand_critic_param_init: Optional[Tensor] = None, num_bs_reps: int = 1000, studentized_ci: bool = False, base_seed: Optional[int] = None, logger: Optional[StepLogger] = None)[source]

Bases: InterruptableAlgorithm

Simulation-based Policy Optimization with Probability Assessment (SPOTA)

Note

We use each domain parameter set \(\xi_{j,nr}\) for \(n_{\tau}\) rollouts. The candidate and the reference policies must have the same architecture!

See also

[1] F. Muratore, M. Gienger, J. Peters, “Assessing Transferability from Simulation to Reality for Reinforcement Learning”, PAMI, 2021

[2] W. Mak, D.P. Morton, and R.K. Wood, “Monte Carlo bounding techniques for determining solution quality in stochastic programs”, Oper. Res. Lett., 1999

Constructor

Parameters:
  • save_dir – directory to save the snapshots i.e. the results in

  • env – the environment in which the policy operates

  • subrtn_cand – the algorithm that is called at every iteration of SPOTA to yield a candidate policy

  • subrtn_refs – the algorithm that is called at every iteration of SPOTA to yield reference policies

  • max_iter – maximum number of iterations that the SPOTA algorithm runs. Each of these iterations includes multiple iterations of the subroutine.

  • alpha – confidence level for the upper confidence bound (UCBOG)

  • beta – optimality gap threshold for training

  • nG – number of reference solutions

  • nJ – number of samples for Monte-Carlo approximation of the optimality gap

  • ntau – number of rollouts per domain parameter set

  • nc_init – initial number of domains for training the candidate solution

  • nr_init – initial number of domains for training the reference solutions

  • sequence_cand – mathematical sequence for the number of domains for training the candidate solution

  • sequence_refs – mathematical sequence for the number of domains for training the reference solutions

  • warmstart_cand – flag if the next candidate solution should be initialized with the previous one

  • warmstart_refs – flag if the reference solutions should be initialized with the current candidate

  • cand_policy_param_init – initial policy parameter values for the candidate, set None to be random

  • cand_critic_param_init – initial critic parameter values for the candidate, set None to be random

  • num_bs_reps – number of replications for the statistical bootstrap

  • studentized_ci – flag if a Student’s t-distribution should be applied for the confidence interval

  • base_seed – seed added to all other seeds in order to make the experiments distinct but repeatable

  • logger – logger for every step of the algorithm, if None the default logger will be created

iteration_key: str = 'spota_iteration'
load_snapshot(parsed_args) Tuple[Env, Policy, dict][source]

Load the state of an experiment, which is specific to the algorithm.

Parameters:

parsed_args – arguments parsed by the argparser

Returns:

environment, policy, and (optional) algorithm-specific output, e.g. value function

name: str = 'spota'
property policy: Policy

Get the algorithm’s policy.

property sample_count: int

Get the total number of samples, i.e. steps of a rollout, used for training so far.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str = 'latest', meta_info: Optional[dict] = None)[source]

Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

property subroutine_cand: Algorithm

Get the candidate subroutine.

udr

class UDR(env: EnvWrapper, subrtn: Algorithm)[source]

Bases: Algorithm

Uniform Domain Randomization (UDR)

This algorithm is a thin wrapper around another algorithm. Its main purpose is to check that the domain randomizer is set up.

Constructor

Parameters:
  • env – same environment as the subroutine runs in. Only used for checking and saving the randomizer.

  • subrtn – algorithm which performs the policy / value-function optimization
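
Since UDR only checks the randomizer and then delegates to its subroutine, the usage sketch is short; env and subrtn are assumed to be set up as for the subroutine alone, and the import path is assumed from the headings of this page.

    from pyrado.algorithms.meta.udr import UDR  # module path assumed

    algo = UDR(env=env, subrtn=subrtn)  # `env` must carry a domain randomizer
    algo.train(snapshot_mode="latest")  # assuming the base Algorithm.train() entry point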

load_snapshot(parsed_args) Tuple[Env, Policy, dict][source]

Load the state of an experiment, which is specific to the algorithm.

Parameters:

parsed_args – arguments parsed by the argparser

Returns:

environment, policy, and (optional) algorithm-specific output, e.g. value function

name: str = 'udr'
reset(seed: Optional[int] = None)[source]

Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.

Parameters:

seed – seed value for the random number generators, pass None for no seeding

property sample_count: int

Get the total number of samples, i.e. steps of a rollout, used for training so far.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

property subroutine: Algorithm

Get the policy optimization subroutine.

Module contents