episodic

cem

class CEM(save_dir: PathLike, env: Env, policy: Policy, max_iter: int, pop_size: Optional[int], num_init_states_per_domain: int, num_is_samples: int, expl_std_init: float, expl_std_min: float = 0.01, extra_expl_std_init: float = 0.0, extra_expl_decay_iter: int = 10, num_domains: int = 1, soft_update_factor: float = 1, full_cov: bool = False, symm_sampling: bool = False, num_workers: int = 4, logger: Optional[StepLogger] = None)[source]

Bases: ParameterExploring

Cross-Entropy Method (CEM). This implementation is essentially Algorithm 3.3 in [1] with the addition of decreasing noise [2]. CEM is closely related to PoWER. The most significant differences are that the importance samples are not kept over iterations and that the covariance matrix is not scaled with the returns, which allows for negative returns.

See also

[1] P.T. de Boer, D.P. Kroese, S. Mannor, R.Y. Rubinstein, “A Tutorial on the Cross-Entropy Method”, Annals of Operations Research, 2005

[2] I. Szita, A. Lőrincz, “Learning Tetris Using the Noisy Cross-Entropy Method”, Neural Computation, 2006

Constructor

Parameters:
  • save_dir – directory in which to save the snapshots, i.e. the results

  • env – the environment in which the policy operates

  • policy – policy to be updated

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • pop_size – number of solutions in the population

  • num_init_states_per_domain – number of rollouts to cover the variance over initial states

  • num_domains – number of rollouts due to the variance over domain parameters

  • num_is_samples – number of samples (policy parameter sets & returns) for importance sampling, indirectly specifies the performance quantile \(1 - \rho\) [1]

  • expl_std_init – initial standard deviation for the exploration strategy

  • expl_std_min – minimal standard deviation for the exploration strategy

  • extra_expl_std_init – additional standard deviation for the parameter exploration added to the diagonal entries of the covariance matrix, set to 0 to disable this functionality

  • extra_expl_decay_iter – limit for the linear decay of the additional standard deviation, i.e. last iteration in which the additional exploration noise is applied

  • soft_update_factor – a number between 0 and 1 to linearly scale the updates of the policy; by default full updates are done, i.e. the new policy parameters are the mean of the importance samples

  • full_cov – pass True to compute a full covariance matrix for sampling the next policy parameter values, else a diagonal covariance is used

  • symm_sampling – use an exploration strategy which samples symmetric populations

  • num_workers – number of environments for parallel sampling

  • logger – logger for every step of the algorithm, if None the default logger will be created
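
To make the interplay of num_is_samples, soft_update_factor, and the decaying extra exploration noise concrete, the following NumPy sketch shows one CEM-style iteration for a diagonal covariance. It is an illustration of the method outlined above, not the implementation behind this class; evaluate_candidates is a hypothetical callable that returns the average return of each candidate parameter set.

import numpy as np

def cem_update(mean, std, evaluate_candidates, pop_size=50, num_is_samples=10,
               soft_update_factor=1.0, extra_expl_std=0.0, expl_std_min=0.01):
    """One CEM-style iteration: sample, select elites, refit mean and std (diagonal case)."""
    # Sample a population of policy parameter sets around the current mean
    candidates = mean + std * np.random.randn(pop_size, mean.size)

    # Evaluate every candidate, e.g. by averaging returns over several rollouts (hypothetical helper)
    returns = evaluate_candidates(candidates)

    # Keep only the num_is_samples best candidates (the importance samples)
    elite_idcs = np.argsort(returns)[-num_is_samples:]
    elites = candidates[elite_idcs]

    # New mean: (softly) move towards the mean of the importance samples
    mean = (1 - soft_update_factor) * mean + soft_update_factor * elites.mean(axis=0)

    # New std: std of the importance samples plus decaying extra exploration noise [2]
    # (added to the std here for simplicity)
    std = np.maximum(elites.std(axis=0) + extra_expl_std, expl_std_min)
    return mean, std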

name: str = 'cem'
update(param_results: ParameterSamplingResult, ret_avg_curr: float = None)[source]

Update the policy from the given samples.

Parameters:
  • param_results – Sampled parameters with evaluation

  • ret_avg_curr – Average return for the current parameters

hc

class HC(save_dir: PathLike, env: Env, policy: Policy, max_iter: int, num_init_states_per_domain: int, expl_factor: float, num_domains: int = 1, pop_size: Optional[int] = None, num_workers: int = 4, logger: Optional[StepLogger] = None)[source]

Bases: ParameterExploring

Hill Climbing (HC)

HC is a heuristic-based policy search method that samples a population of policy parameter sets per iteration and evaluates them on multiple rollouts. If one of the new parameter sets performs better than the current one, it is kept. If the exploration parameters grow too large, they are reset.

Constructor

Parameters:
  • save_dir – directory in which to save the snapshots, i.e. the results

  • env – the environment in which the policy operates

  • policy – policy to be updated

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • num_init_states_per_domain – number of rollouts to cover the variance over initial states

  • num_domains – number of rollouts due to the variance over domain parameters

  • expl_factor – scalar value which determines how the exploration strategy adapts its search space

  • pop_size – number of solutions in the population

  • num_workers – number of environments for parallel sampling

  • logger – logger for every step of the algorithm, if None the default logger will be created
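
The scheme above can be condensed into a short NumPy sketch. It is illustrative only: evaluate is a hypothetical callable that averages returns over rollouts, and the concrete adaptation rule of the exploration strategy (grow on failure, shrink on success, reset when too large) is an assumption rather than the exact rule of this class.

import numpy as np

def hill_climbing(params, evaluate, max_iter=100, pop_size=20, expl_factor=1.1,
                  expl_std=0.1, expl_std_max=1.0):
    """Illustrative hill climbing with normally distributed parameter noise."""
    expl_std_init = expl_std
    ret_curr = evaluate(params)
    for _ in range(max_iter):
        # Sample a population of candidate parameter sets around the current parameters
        candidates = params + expl_std * np.random.randn(pop_size, params.size)
        returns = np.array([evaluate(c) for c in candidates])

        best = int(np.argmax(returns))
        if returns[best] > ret_curr:
            # A better candidate was found: keep it and shrink the exploration (assumption)
            params, ret_curr = candidates[best], returns[best]
            expl_std /= expl_factor
        else:
            # No improvement: enlarge the search space
            expl_std *= expl_factor

        # Reset the exploration parameters if they grew too large
        if expl_std > expl_std_max:
            expl_std = expl_std_init
    return params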

name: str = 'hc'
update(param_results: ParameterSamplingResult, ret_avg_curr: float)[source]

Update the policy from the given samples.

Parameters:
  • param_results – Sampled parameters with evaluation

  • ret_avg_curr – Average return for the current parameters

abstract update_expl_strat(rets_avg_ros: ndarray, ret_avg_curr: float)[source]
class HCHyper(*args, **kwargs)[source]

Bases: HC

Hill Climbing variant using an exploration strategy that samples policy parameters from a hyper-sphere

Constructor

Parameters:
  • expl_r_init – initial radius of the hyper-sphere for the exploration strategy

  • args – forwarded to the superclass constructor

  • kwargs – forwarded to the superclass constructor

update_expl_strat(rets_avg_ros: ndarray, ret_avg_curr: float)[source]
class HCNormal(*args, **kwargs)[source]

Bases: HC

Hill Climbing variant using an exploration strategy with normally distributed noise on the policy parameters

Constructor

Parameters:
  • expl_std_init – initial standard deviation for the exploration strategy

  • args – forwarded to the superclass constructor

  • kwargs – forwarded to the superclass constructor

update_expl_strat(rets_avg_ros: ndarray, ret_avg_curr: float)[source]

nes

class NES(save_dir: PathLike, env: Env, policy: Policy, max_iter: int, num_init_states_per_domain: int, expl_std_init: float, expl_std_min: float = 0.01, num_domains: int = 1, pop_size: Optional[int] = None, eta_mean: float = 1.0, eta_std: Optional[float] = None, symm_sampling: bool = False, transform_returns: bool = True, num_workers: int = 4, logger: Optional[StepLogger] = None)[source]

Bases: ParameterExploring

Simplified variant of Natural Evolution Strategies (NES)

See also

[1] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, J. Schmidhuber, “Natural Evolution Strategies”, JMLR, 2014

[2] This implementation was inspired by https://github.com/pybrain/pybrain/blob/master/pybrain/optimization/distributionbased/snes.py

Constructor

Parameters:
  • save_dir – directory in which to save the snapshots, i.e. the results

  • env – the environment in which the policy operates

  • policy – policy to be updated

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • num_init_states_per_domain – number of rollouts to cover the variance over initial states

  • num_domains – number of rollouts due to the variance over domain parameters

  • expl_std_init – initial standard deviation for the exploration strategy

  • expl_std_min – minimal standard deviation for the exploration strategy

  • pop_size – number of solutions in the population

  • eta_mean – step size factor for the mean

  • eta_std – step size factor for the standard deviation

  • symm_sampling – use an exploration strategy which samples symmetric populations

  • transform_returns – use a rank-transformation of the returns to update the policy

  • num_workers – number of environments for parallel sampling

  • logger – logger for every step of the algorithm, if None the default logger will be created

static compute_utilities(pop_size: Optional[int], eta_mean: float, eta_std: float)[source]

Compute the utilities as described in section 3.1 of [1] (a.k.a. Hansen ranking with uniform baseline)

Parameters:
  • pop_size – number of solutions in the population

  • eta_mean – step size factor for the mean

  • eta_std – step size factor for the standard deviation

Returns:

utility coefficient for the mean, and utility coefficient for the standard deviation
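
As a reference, the rank-based utilities from [1, sec. 3.1] are commonly computed as in the NumPy sketch below; the exact scaling used by this method may differ, so treat it as illustrative.

import numpy as np

def compute_utilities(pop_size: int, eta_mean: float, eta_std: float):
    """Rank-based utilities (best to worst) with a uniform baseline, scaled by the step sizes."""
    # Log-linear weighting that puts most weight on the best-ranked samples
    ranks = np.arange(1, pop_size + 1)
    utils = np.maximum(0.0, np.log(pop_size / 2 + 1) - np.log(ranks))
    utils = utils / utils.sum() - 1.0 / pop_size  # zero-mean (uniform baseline)

    # Separate utility coefficients for updating the mean and the standard deviation
    return eta_mean * utils, eta_std * utils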

name: str = 'nes'
update(param_results: ParameterSamplingResult, ret_avg_curr: Optional[float] = None)[source]

Update the policy from the given samples.

Parameters:
  • param_results – Sampled parameters with evaluation

  • ret_avg_curr – Average return for the current parameters

parameter_exploring

class ParameterExploring(save_dir: PathLike, env: Env, policy: Policy, max_iter: int, num_init_states_per_domain: int, num_domains: int, pop_size: Optional[int] = None, num_workers: int = 4, logger: Optional[StepLogger] = None)[source]

Bases: Algorithm

Base for all algorithms that explore directly in the policy parameter space

Constructor

Parameters:
  • save_dir – directory in which to save the snapshots, i.e. the results

  • env – the environment in which the policy operates

  • policy – policy to be updated

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • num_init_states_per_domain – number of rollouts to cover the variance over initial states

  • num_domains – number of rollouts due to the variance over domain parameters

  • pop_size – number of solutions in the population, pass None to use a default that scales logarithmically with the number of policy parameters

  • num_workers – number of environments for parallel sampling

  • logger – logger for every step of the algorithm, if None the default logger will be created
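
All algorithms deriving from this base class follow the same episodic scheme: sample a population of policy parameter sets, evaluate every set on num_init_states_per_domain * num_domains rollouts, average the returns per set, and pass the result to update(). The snippet below only illustrates the averaging step; rollout_return is a hypothetical callable, and the actual data collection is handled by the ParameterExplorationSampler.

import numpy as np

def evaluate_population(candidates, rollout_return, num_init_states_per_domain, num_domains):
    """Average the return of each candidate parameter set over all of its rollouts (sketch)."""
    num_ros = num_init_states_per_domain * num_domains  # rollouts per candidate
    return np.array(
        [np.mean([rollout_return(c) for _ in range(num_ros)]) for c in candidates]
    )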

property env: Env

Get the environment in which the algorithm trains.

property expl_strat: StochasticParamExplStrat

Get the algorithm’s exploration strategy.

reset(seed: Optional[int] = None)[source]

Reset the algorithm to its initial state. This should NOT reset the learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.

Parameters:

seed – seed value for the random number generators, pass None for no seeding

property sampler: ParameterExplorationSampler

Get the sampler. For algorithms with multiple samplers, this is the one collecting the training data.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

abstract update(param_results: ParameterSamplingResult, ret_avg_curr: float)[source]

Update the policy from the given samples.

Parameters:
  • param_results – Sampled parameters with evaluation

  • ret_avg_curr – Average return for the current parameters

pepg

class PEPG(save_dir: PathLike, env: Env, policy: Policy, max_iter: int, num_init_states_per_domain: int, expl_std_init: float, expl_std_min: float = 0.01, num_domains: int = 1, pop_size: Optional[int] = None, clip_ratio_std: float = 0.05, normalize_update: bool = False, transform_returns: bool = True, lr: float = 0.0005, num_workers: int = 4, logger: Optional[StepLogger] = None)[source]

Bases: ParameterExploring

Parameter-Exploring Policy Gradients (PEPG)

See also

[1] F. Sehnke, C. Osendorfer, T. Rueckstiess, A. Graves, J. Peters, J. Schmidhuber, “Parameter-exploring Policy Gradients”, Neural Networks, 2010

Constructor

Parameters:
  • save_dir – directory in which to save the snapshots, i.e. the results

  • env – the environment in which the policy operates

  • policy – policy to be updated

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • pop_size – number of solutions in the population

  • num_init_states_per_domain – number of rollouts to cover the variance over initial states

  • num_domains – number of rollouts due to the variance over domain parameters

  • expl_std_init – initial standard deviation for the exploration strategy

  • expl_std_min – minimal standard deviation for the exploration strategy

  • clip_ratio_std – maximal ratio for the change of the exploration strategy’s standard deviation

  • transform_returns – use a rank-transformation of the returns to update the policy

  • lr – learning rate

  • num_workers – number of environments for parallel sampling

  • logger – logger for every step of the algorithm, if None the default logger will be created
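
A bare-bones PEPG step with symmetric sampling, following the gradient estimators from [1], might look like the sketch below. It omits the rank transformation of the returns, the optional update normalization, and the clipping of the standard-deviation change that this class supports; evaluate is a hypothetical callable returning the average return of a parameter set.

import numpy as np

def pepg_step(mean, std, evaluate, pop_size=50, lr=5e-4, expl_std_min=0.01):
    """One PEPG update of the parameter mean and exploration std with symmetric sampling."""
    half = pop_size // 2
    eps = std * np.random.randn(half, mean.size)            # perturbations
    rets_pos = np.array([evaluate(mean + e) for e in eps])  # returns of theta + eps
    rets_neg = np.array([evaluate(mean - e) for e in eps])  # returns of theta - eps

    # Gradient estimate for the mean (symmetric sampling)
    grad_mean = eps.T @ (rets_pos - rets_neg) / 2.0

    # Gradient estimate for the std, using the mean return as baseline
    baseline = np.mean(np.concatenate([rets_pos, rets_neg]))
    s = (eps ** 2 - std ** 2) / std                         # per-sample std "feature"
    grad_std = s.T @ ((rets_pos + rets_neg) / 2.0 - baseline)

    return mean + lr * grad_mean, np.maximum(std + lr * grad_std, expl_std_min)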

name: str = 'pepg'
update(param_results: ParameterSamplingResult, ret_avg_curr: float = None)[source]

Update the policy from the given samples.

Parameters:
  • param_results – Sampled parameters with evaluation

  • ret_avg_curr – Average return for the current parameters

rank_transform(arr: ndarray, centered=True) ndarray[source]

Transform a 1-dim ndarray with arbitrary scalar values to an array with equally spaced rank values. This is a nonlinear transform.

Parameters:
  • arr – input array

  • centered – whether the transform should be centered around zero

Returns:

transformed array
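
A plausible NumPy version of this transform looks as follows; the exact spacing and normalization of the ranks used by this function may differ, so treat it as a sketch.

import numpy as np

def rank_transform(arr: np.ndarray, centered: bool = True) -> np.ndarray:
    """Map arbitrary scalar values to equally spaced rank values (nonlinear transform)."""
    ranks = np.argsort(np.argsort(arr)).astype(np.float64)  # 0 is the worst, len(arr)-1 the best
    ranks /= len(arr) - 1                                   # equally spaced in [0, 1]
    return ranks - 0.5 if centered else ranks               # optionally centered around zero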

power

class PoWER(save_dir: PathLike, env: Env, policy: Policy, max_iter: int, pop_size: Optional[int], num_init_states_per_domain: int, num_is_samples: int, expl_std_init: float, expl_std_min: float = 0.01, num_domains: int = 1, symm_sampling: bool = False, num_workers: int = 4, logger: Optional[StepLogger] = None)[source]

Bases: ParameterExploring

Return-based variant of Policy learning by Weighting Exploration with the Returns (PoWER)

Note

PoWER was designed for linear policies. PoWER must use positive reward functions (forming an improper probability distribution) [1, p. 10]. The original implementation is tailored to movement primitives like DMPs.

See also

[1] J. Kober and J. Peters, “Policy Search for Motor Primitives in Robotics”, Machine Learning, 2011

Constructor

Parameters:
  • save_dir – directory in which to save the snapshots, i.e. the results

  • env – the environment in which the policy operates

  • policy – policy to be updated

  • pop_size – number of solutions in the population

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • num_init_states_per_domain – number of rollouts to cover the variance over initial states

  • num_domains – number of rollouts due to the variance over domain parameters

  • num_is_samples – number of samples (policy parameter sets & returns) for importance sampling

  • expl_std_init – initial standard deviation for the exploration strategy

  • expl_std_min – minimal standard deviation for the exploration strategy

  • symm_sampling – use an exploration strategy which samples symmetric populations

  • num_workers – number of environments for parallel sampling

  • logger – logger for every step of the algorithm, if None the default logger will be created
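
The return-weighted update described above can be sketched as follows for a diagonal exploration covariance. This is an illustration of the episodic, return-based variant, not the exact code of this class; candidates and returns are assumed to come from the current iteration's population.

import numpy as np

def power_update(mean, candidates, returns, num_is_samples=10, expl_std_min=0.01):
    """Return-weighted (PoWER-style) update of the policy mean and the exploration std."""
    assert np.all(returns >= 0), "PoWER requires non-negative returns"

    # Keep the num_is_samples best candidates as importance samples
    idcs = np.argsort(returns)[-num_is_samples:]
    diffs = candidates[idcs] - mean            # deviations from the current mean
    w = returns[idcs] / returns[idcs].sum()    # normalized return weights

    # Mean: move towards the candidates, weighted by their returns
    new_mean = mean + w @ diffs

    # Exploration std (diagonal case): return-weighted second moment of the deviations
    new_std = np.sqrt(w @ diffs ** 2)
    return new_mean, np.maximum(new_std, expl_std_min)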

name: str = 'power'
reset(seed: Optional[int] = None)[source]

Reset the algorithm to its initial state. This should NOT reset the learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.

Parameters:

seed – seed value for the random number generators, pass None for no seeding

update(param_results: ParameterSamplingResult, ret_avg_curr: float = None)[source]

Update the policy from the given samples.

Parameters:
  • param_results – Sampled parameters with evaluation

  • ret_avg_curr – Average return for the current parameters

predefined_lqr

reps

class REPS(save_dir: PathLike, env: Env, policy: Policy, max_iter: int, eps: float, num_init_states_per_domain: int, pop_size: Optional[int], expl_std_init: float, expl_std_min: float = 0.01, num_domains: int = 1, symm_sampling: bool = False, softmax_transform: bool = False, use_map: bool = True, optim_mode: Optional[str] = 'scipy', num_epoch_dual: int = 1000, lr_dual: float = 0.0005, num_workers: int = 4, logger: Optional[StepLogger] = None)[source]

Bases: ParameterExploring

Episodic variant of Relative Entropy Policy Search (REPS)

Note

REPS [1] was designed for linear policies.

See also

[1] J. Peters, K. Mülling, Y. Altün, “Relative Entropy Policy Search”, AAAI, 2010

[2] A. Abdolmaleki, J.T. Springenberg, J. Degrave, S. Bohez, Y. Tassa, D. Belov, N. Heess, M. Riedmiller, “Relative Entropy Regularized Policy Iteration”, arXiv, 2018

[3] This implementation is inspired by the work of H. Abdulsamad, https://github.com/hanyas/reps/blob/master/reps/ereps.py

Constructor

Parameters:
  • save_dir – directory in which to save the snapshots, i.e. the results

  • env – the environment in which the policy operates

  • policy – policy to be updated

  • eps – bound on the KL divergence between policy updates, e.g. 0.1

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • pop_size – number of solutions in the population

  • num_init_states_per_domain – number of rollouts to cover the variance over initial states

  • num_domains – number of rollouts due to the variance over domain parameters

  • expl_std_init – initial standard deviation for the exploration strategy

  • expl_std_min – minimal standard deviation for the exploration strategy

  • symm_sampling – use an exploration strategy which samples symmetric populations

  • softmax_transform – pass True to use a softmax to transform the returns, else use a shifted exponential

  • use_map – use maximum a-posteriori likelihood (True) or maximum likelihood (False) update rule

  • optim_mode – choose the type of optimizer: ‘torch’ for an SGD-based optimizer or ‘scipy’ for the SLSQP optimizer from scipy (recommended)

  • num_epoch_dual – number of epochs for the minimization of the dual functions, ignored if optim_mode = ‘scipy’

  • lr_dual – learning rate for the dual’s optimizer, ignored if optim_mode = ‘scipy’

  • num_workers – number of environments for parallel sampling

  • logger – logger for every step of the algorithm, if None the default logger will be created

dual_evaluation(eta: Union[Tensor, ndarray], rets: Union[Tensor, ndarray]) Union[Tensor, ndarray][source]

Compute the REPS dual function value for policy evaluation.

Parameters:
  • eta – lagrangian multiplier (optimization variable of the dual)

  • rets – return values per policy sample after averaging over multiple rollouts using the same policy

Returns:

dual loss value
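
For orientation, the standard episodic REPS dual for policy evaluation reads \(g(\eta) = \eta\epsilon + \eta \log \frac{1}{N} \sum_i \exp(R_i/\eta)\). A numerically stabilized NumPy sketch of this formula (which may differ from this method's exact implementation) is:

import numpy as np

def reps_dual_evaluation(eta: float, rets: np.ndarray, eps: float) -> float:
    """Episodic REPS dual g(eta) = eta*eps + eta*log mean(exp(R/eta)), stabilized with max(R)."""
    rets_shifted = rets - rets.max()  # shift by the maximum return for numerical stability
    return eta * eps + rets.max() + eta * np.log(np.mean(np.exp(rets_shifted / eta)))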

dual_improvement(eta: Union[Tensor, ndarray], param_samples: Tensor, w: Tensor) Union[Tensor, ndarray][source]

Compute the REPS dual function value for policy improvement.

Parameters:
  • eta – lagrangian multiplier (optimization variable of the dual)

  • param_samples – all sampled policy parameters

  • w – weights of the policy parameter samples

Returns:

dual loss value

property eta: Tensor

Get the Lagrange multiplier \(\eta\). In [2], \(\eta\) is called \(\alpha\).

minimize(loss_fcn: Callable, rets: Optional[Tensor] = None, param_samples: Optional[Tensor] = None, w: Optional[Tensor] = None)[source]

Minimize the given dual function. This function can be called for the dual evaluation loss or the dual improvement loss.

Parameters:
  • loss_fcn – function to minimize, different for wml() and wmap()

  • rets – return values per policy sample after averaging over multiple rollouts using the same policy

  • param_samples – all sampled policy parameters

  • w – weights of the policy parameter samples

name: Optional[str] = 'reps'
update(param_results: ParameterSamplingResult, ret_avg_curr: Optional[float] = None)[source]

Update the policy from the given samples.

Parameters:
  • param_results – Sampled parameters with evaluation

  • ret_avg_curr – Average return for the current parameters

weights(rets: Tensor) Tensor[source]

Compute the weights which are used to weight the policy samples by their return. As stated in [2, sec. 4.1], the weights could be calculated using any rank-preserving transformation.

Parameters:

rets – return values per policy sample after averaging over multiple rollouts using the same policy

Returns:

weights of the policy parameter samples
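
As documented for softmax_transform, both options amount to an exponential weighting of the returns with temperature \(\eta\). A sketch of this reading (assuming the shifted exponential simply skips the normalization):

import torch

def reps_weights(rets: torch.Tensor, eta: torch.Tensor, softmax_transform: bool) -> torch.Tensor:
    """Return-based weights for the policy parameter samples (illustrative sketch)."""
    if softmax_transform:
        # Softmax of the returns scaled by the temperature eta
        return torch.softmax(rets / eta, dim=0)
    # Shifted exponential: subtract the maximum return for numerical stability
    return torch.exp((rets - rets.max()) / eta)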

wmap(param_samples: Tensor, w: Tensor)[source]

Weighted maximum a-posteriori likelihood update of the policy’s mean and the exploration strategy’s covariance

Parameters:
  • param_samples – all sampled policy parameters

  • w – weights of the policy parameter samples

wml(eta: Tensor, param_samples: Tensor, w: Tensor)[source]

Weighted maximum likelihood update of the policy’s mean and the exploration strategy’s covariance

Parameters:
  • eta – lagrangian multiplier (optimization variable of the dual)

  • param_samples – all sampled policy parameters

  • w – weights of the policy parameter samples

sysid_via_episodic_rl

class SysIdViaEpisodicRL(subrtn: ParameterExploring, behavior_policy: Policy, num_rollouts_per_distr: int, metric: Optional[Callable[[ndarray], ndarray]], obs_dim_weight: Union[list, ndarray], std_obs_filt: int = 5, w_abs: float = 0.5, w_sq: float = 1.0, num_workers: int = 4, base_seed: int = 1001)[source]

Bases: Algorithm

Wrapper to frame black-box system identification as an episodic reinforcement learning problem

Note

This algorithm was designed as a subroutine of SimOpt. However, it could also be used independently.

Constructor

Parameters:
  • subrtn – wrapped algorithm to fit the domain parameter distribution

  • behavior_policy – lower level policy used to generate the rollouts

  • num_rollouts_per_distr – number of rollouts per domain distribution parameter set

  • metric – functional mapping from differences in observations to value

  • obs_dim_weight – (diagonal) weight matrix for the different observation dimensions for the default metric

  • std_obs_filt – number of standard deviations for the Gaussian filter applied to the observations

  • w_abs – weight for the mean absolute errors for the default metric

  • w_sq – weight for the mean squared errors for the default metric

  • num_workers – number of environments for parallel sampling

  • base_seed – seed to set for the parallel sampler in every iteration

iteration_key: str = 'sysiderl_iteration'
loss_fcn(rollout_real: StepSequence, rollout_sim: StepSequence) float[source]

Compute the discrepancy between two time sequences of observations given the metric. Be sure to align and truncate the rollouts beforehand.

Parameters:
  • rollout_real – (concatenated) real-world rollout containing the observations

  • rollout_sim – (concatenated) simulated rollout containing the observations

Returns:

discrepancy cost summed over the observation dimensions

name: str = 'sysiderl'
static override_obs_bounds(bound_lo: ndarray, bound_up: ndarray, labels: ndarray) Tuple[ndarray, ndarray][source]

Default overriding method for the bounds of an observation space. This is necessary when the observations are scaled by their range, e.g. to compare deviations across different kinds of observations like position and angular velocity; infinite bounds are not feasible in that case.

Parameters:
  • bound_lo – lower bound of the observation space

  • bound_up – upper bound of the observation space

  • labels – label for each dimension of the observation space to override

Returns:

clipped lower and upper bound

reset(seed: Optional[int] = None)[source]

Reset the algorithm to its initial state. This should NOT reset the learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.

Parameters:

seed – seed value for the random number generators, pass None for no seeding

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

property subrtn: ParameterExploring

Get the subroutine used for updating the domain parameter distribution.

static truncate_rollouts(rollouts_real: Sequence[StepSequence], rollouts_sim: Sequence[StepSequence], replicate: bool = True) Tuple[Sequence[StepSequence], Sequence[StepSequence]][source]

In case (some of the) rollouts failed or succeeded in one domain but not in the other, we truncate the longer observation sequence. When truncating, we compare each of the M real rollouts to each of the N simulated rollouts, thus replicating the real rollouts N times and the simulated rollouts M times.

Parameters:
  • rollouts_real – M real-world rollouts of different length if replicate = True, else K real-world rollouts of different length

  • rollouts_sim – N simulated rollouts of different length if replicate = True, else K simulated rollouts of different length

  • replicate – if False the i-th rollout from rollouts_real is (only) compared with the i-th rollout from rollouts_sim, in this case the number of rollouts and the initial states have to match

Returns:

MxN real-world rollouts and MxN simulated rollouts of equal length if replicate = True, else K real-world rollouts and K simulated rollouts of equal length
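
The pairing and truncation logic can be illustrated on plain observation arrays; the actual method operates on StepSequence objects, so this is only a sketch of the described behavior.

import numpy as np

def truncate_pairs(obs_real: list, obs_sim: list, replicate: bool = True):
    """Pair real and simulated observation sequences and truncate each pair to equal length."""
    if replicate:
        # Compare every real rollout with every simulated rollout -> M*N pairs
        pairs = [(r, s) for r in obs_real for s in obs_sim]
    else:
        # Compare the i-th real rollout only with the i-th simulated rollout -> K pairs
        assert len(obs_real) == len(obs_sim)
        pairs = list(zip(obs_real, obs_sim))

    # Truncate the longer sequence of every pair to the length of the shorter one
    truncated = [(r[: min(len(r), len(s))], s[: min(len(r), len(s))]) for r, s in pairs]
    real_trunc, sim_trunc = zip(*truncated)
    return list(real_trunc), list(sim_trunc)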

static weighted_l1_l2_metric(err: ndarray, w_abs: float, w_sq: float, obs_dim_weight: ndarray)[source]

Compute the weighted linear combination of the observation error’s MAE and MSE, averaged over time

Note

In contrast to [1], we are using the mean absolute error and the mean squared error instead of the L1 and the L2 norm. The reason for this is that longer time series would be punished otherwise.

Parameters:
  • err – error signal with time steps along the first dimension

  • w_abs – weight for the mean absolute errors

  • w_sq – weight for the mean squared errors

  • obs_dim_weight – (diagonal) weight matrix for the different observation dimensions

Returns:

weighted linear combination of the error’s MAE and MSE, averaged over time
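
Reading the parameters above literally, the cost is roughly \(\sum_d w_d \left[ w_{\mathrm{abs}} \mathrm{MAE}_d + w_{\mathrm{sq}} \mathrm{MSE}_d \right]\) with the means taken over time. A NumPy sketch under this reading, treating obs_dim_weight as the diagonal of the weight matrix:

import numpy as np

def weighted_l1_l2_metric(err: np.ndarray, w_abs: float, w_sq: float,
                          obs_dim_weight: np.ndarray) -> float:
    """Weighted combination of the error's MAE and MSE per observation dimension (sketch)."""
    mae = np.mean(np.abs(err), axis=0)   # mean absolute error over time, per dimension
    mse = np.mean(err ** 2, axis=0)      # mean squared error over time, per dimension
    per_dim = w_abs * mae + w_sq * mse   # combine the two error measures
    return float(np.sum(obs_dim_weight * per_dim))  # weight and sum over the dimensions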

Module contents