episodic

cem

class CEM(save_dir: PathLike, env: Env, policy: Policy, max_iter: int, pop_size: Optional[int], num_init_states_per_domain: int, num_is_samples: int, expl_std_init: float, expl_std_min: float = 0.01, extra_expl_std_init: float = 0.0, extra_expl_decay_iter: int = 10, num_domains: int = 1, soft_update_factor: float = 1, full_cov: bool = False, symm_sampling: bool = False, num_workers: int = 4, logger: Optional[StepLogger] = None)[source]

Bases: ParameterExploring

Cross-Entropy Method (CEM). This implementation is essentially Algorithm 3.3 in [1] with the addition of decreasing noise [2]. CEM is closely related to PoWER. The most significant differences are that the importance samples are not kept over iterations and that the covariance matrix is not scaled with the returns, which allows for negative returns.

See also

[1] P.T. de Boer, D.P. Kroese, S. Mannor, R.Y. Rubinstein, “A Tutorial on the Cross-Entropy Method”, Annals of Operations Research, 2005

[2] I. Szita, A. Lőrincz, “Learning Tetris Using the Noisy Cross-Entropy Method”, Neural Computation, 2006

Constructor

Parameters:
  • save_dir – directory in which to save the snapshots, i.e. the results

  • env – the environment in which the policy operates

  • policy – policy to be updated

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • pop_size – number of solutions in the population

  • num_init_states_per_domain – number of rollouts to cover the variance over initial states

  • num_domains – number of rollouts due to the variance over domain parameters

  • num_is_samples – number of samples (policy parameter sets & returns) for importance sampling, indirectly specifies the performance quantile \(1 - \rho\) [1]

  • expl_std_init – initial standard deviation for the exploration strategy

  • expl_std_min – minimal standard deviation for the exploration strategy

  • extra_expl_std_init – additional standard deviation for the parameter exploration added to the diagonal entries of the covariance matrix, set to 0 to disable this functionality

  • extra_expl_decay_iter – limit for the linear decay of the additional standard deviation, i.e. last iteration in which the additional exploration noise is applied

  • soft_update_factor – a number between 0 and 1 to linearly scale the updates of the policy; by default full updates are done, i.e. the new policy parameters are the mean of the importance samples

  • full_cov – pass True to compute a full covariance matrix for sampling the next policy parameter values, else a diagonal covariance is used

  • symm_sampling – use an exploration strategy which samples symmetric populations

  • num_workers – number of environments for parallel sampling

  • logger – logger for every step of the algorithm, if None the default logger will be created
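
To make the interplay of num_is_samples, soft_update_factor, and the decaying extra exploration noise concrete, the following NumPy sketch shows one CEM-style iteration for a diagonal covariance. It is an illustration of the method outlined above, not the implementation behind this class; evaluate_candidates is a hypothetical callable that returns the average return of each candidate parameter set.

import numpy as np

def cem_update(mean, std, evaluate_candidates, pop_size=50, num_is_samples=10,
               soft_update_factor=1.0, extra_expl_std=0.0, expl_std_min=0.01):
    """One CEM-style iteration: sample, select elites, refit mean and std (diagonal case)."""
    # Sample a population of policy parameter sets around the current mean
    candidates = mean + std * np.random.randn(pop_size, mean.size)

    # Evaluate every candidate, e.g. by averaging returns over several rollouts (hypothetical helper)
    returns = evaluate_candidates(candidates)

    # Keep only the num_is_samples best candidates (the importance samples)
    elite_idcs = np.argsort(returns)[-num_is_samples:]
    elites = candidates[elite_idcs]

    # New mean: (softly) move towards the mean of the importance samples
    mean = (1 - soft_update_factor) * mean + soft_update_factor * elites.mean(axis=0)

    # New std: std of the importance samples plus decaying extra exploration noise [2]
    # (added to the std here for simplicity)
    std = np.maximum(elites.std(axis=0) + extra_expl_std, expl_std_min)
    return mean, std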

name: str = 'cem'
update(param_results: ParameterSamplingResult, ret_avg_curr: float = None)[source]

Update the policy from the given samples.

Parameters:
  • param_results – Sampled parameters with evaluation

  • ret_avg_curr – Average return for the current parameters

hc

class HC(save_dir: PathLike, env: Env, policy: Policy, max_iter: int, num_init_states_per_domain: int, expl_factor: float, num_domains: int = 1, pop_size: Optional[int] = None, num_workers: int = 4, logger: Optional[StepLogger] = None)[source]

Bases: ParameterExploring

Hill Climbing (HC)

HC is a heuristic-based policy search method that samples a population of policy parameter sets per iteration and evaluates them on multiple rollouts. If one of the new parameter sets performs better than the current one, it is kept. If the exploration parameters grow too large, they are reset.

Constructor

Parameters:
  • save_dir – directory in which to save the snapshots, i.e. the results

  • env – the environment in which the policy operates

  • policy – policy to be updated

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • num_init_states_per_domain – number of rollouts to cover the variance over initial states

  • num_domains – number of rollouts due to the variance over domain parameters

  • expl_factor – scalar value which determines how the exploration strategy adapts its search space

  • pop_size – number of solutions in the population

  • num_workers – number of environments for parallel sampling

  • logger – logger for every step of the algorithm, if None the default logger will be created
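
The scheme above can be condensed into a short NumPy sketch. It is illustrative only: evaluate is a hypothetical callable that averages returns over rollouts, and the concrete adaptation rule of the exploration strategy (grow on failure, shrink on success, reset when too large) is an assumption rather than the exact rule of this class.

import numpy as np

def hill_climbing(params, evaluate, max_iter=100, pop_size=20, expl_factor=1.1,
                  expl_std=0.1, expl_std_max=1.0):
    """Illustrative hill climbing with normally distributed parameter noise."""
    expl_std_init = expl_std
    ret_curr = evaluate(params)
    for _ in range(max_iter):
        # Sample a population of candidate parameter sets around the current parameters
        candidates = params + expl_std * np.random.randn(pop_size, params.size)
        returns = np.array([evaluate(c) for c in candidates])

        best = int(np.argmax(returns))
        if returns[best] > ret_curr:
            # A better candidate was found: keep it and shrink the exploration (assumption)
            params, ret_curr = candidates[best], returns[best]
            expl_std /= expl_factor
        else:
            # No improvement: enlarge the search space
            expl_std *= expl_factor

        # Reset the exploration parameters if they grew too large
        if expl_std > expl_std_max:
            expl_std = expl_std_init
    return params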

name: str = 'hc'
update(param_results: ParameterSamplingResult, ret_avg_curr: float)[source]

Update the policy from the given samples.

Parameters:
  • param_results – Sampled parameters with evaluation

  • ret_avg_curr – Average return for the current parameters

abstract update_expl_strat(rets_avg_ros: ndarray, ret_avg_curr: float)[source]
class HCHyper(*args, **kwargs)[source]

Bases: HC

Hill Climbing variant using an exploration strategy that samples policy parameters from a hyper-sphere

Constructor

Parameters:
  • expl_r_init – initial radius of the hyper-sphere for the exploration strategy

  • args – forwarded to the superclass constructor

  • kwargs – forwarded to the superclass constructor

update_expl_strat(rets_avg_ros: ndarray, ret_avg_curr: float)[source]
class HCNormal(*args, **kwargs)[source]

Bases: HC

Hill Climbing variant using an exploration strategy with normally distributed noise on the policy parameters

Constructor

Parameters:
  • expl_std_init – initial standard deviation for the exploration strategy

  • args – forwarded to the superclass constructor

  • kwargs – forwarded to the superclass constructor

update_expl_strat(rets_avg_ros: ndarray, ret_avg_curr: float)[source]

nes

class NES(save_dir: PathLike, env: Env, policy: Policy, max_iter: int, num_init_states_per_domain: int, expl_std_init: float, expl_std_min: float = 0.01, num_domains: int = 1, pop_size: Optional[int] = None, eta_mean: float = 1.0, eta_std: Optional[float] = None, symm_sampling: bool = False, transform_returns: bool = True, num_workers: int = 4, logger: Optional[StepLogger] = None)[source]

Bases: ParameterExploring

Simplified variant of Natural Evolution Strategies (NES)

See also

[1] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, J. Schmidhuber, “Natural Evolution Strategies”, JMLR, 2014

[2] This implementation was inspired by https://github.com/pybrain/pybrain/blob/master/pybrain/optimization/distributionbased/snes.py

Constructor

Parameters:
  • save_dir – directory in which to save the snapshots, i.e. the results

  • env – the environment in which the policy operates

  • policy – policy to be updated

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • num_init_states_per_domain – number of rollouts to cover the variance over initial states

  • num_domains – number of rollouts due to the variance over domain parameters

  • expl_std_init – initial standard deviation for the exploration strategy

  • expl_std_min – minimal standard deviation for the exploration strategy

  • pop_size – number of solutions in the population

  • eta_mean – step size factor for the mean

  • eta_std – step size factor for the standard deviation

  • symm_sampling – use an exploration strategy which samples symmetric populations

  • transform_returns – use a rank-transformation of the returns to update the policy

  • num_workers – number of environments for parallel sampling

  • logger – logger for every step of the algorithm, if None the default logger will be created

static compute_utilities(pop_size: Optional[int], eta_mean: float, eta_std: float)[source]

Compute the utilities as described in section 3.1 of [1] (a.k.a. Hansen ranking with uniform baseline)

Parameters:
  • pop_size – number of solutions in the population

  • eta_mean – step size factor for the mean

  • eta_std – step size factor for the standard deviation

Returns:

utility coefficient for the mean, and utility coefficient for the standard deviation
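
As a reference, the rank-based utilities from [1, sec. 3.1] are commonly computed as in the NumPy sketch below; the exact scaling used by this method may differ, so treat it as illustrative.

import numpy as np

def compute_utilities(pop_size: int, eta_mean: float, eta_std: float):
    """Rank-based utilities (best to worst) with a uniform baseline, scaled by the step sizes."""
    # Log-linear weighting that puts most weight on the best-ranked samples
    ranks = np.arange(1, pop_size + 1)
    utils = np.maximum(0.0, np.log(pop_size / 2 + 1) - np.log(ranks))
    utils = utils / utils.sum() - 1.0 / pop_size  # zero-mean (uniform baseline)

    # Separate utility coefficients for updating the mean and the standard deviation
    return eta_mean * utils, eta_std * utils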

name: str = 'nes'
update(param_results: ParameterSamplingResult, ret_avg_curr: Optional[float] = None)[source]

Update the policy from the given samples.

Parameters:
  • param_results – Sampled parameters with evaluation

  • ret_avg_curr – Average return for the current parameters

parameter_exploring

class ParameterExploring(save_dir: PathLike, env: Env, policy: Policy, max_iter: int, num_init_states_per_domain: int, num_domains: int, pop_size: Optional[int] = None, num_workers: int = 4, logger: Optional[StepLogger] = None)[source]

Bases: Algorithm

Base for all algorithms that explore directly in the policy parameter space

Constructor

Parameters:
  • save_dir – directory in which to save the snapshots, i.e. the results

  • env – the environment in which the policy operates

  • policy – policy to be updated

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • num_init_states_per_domain – number of rollouts to cover the variance over initial states

  • num_domains – number of rollouts due to the variance over domain parameters

  • pop_size – number of solutions in the population, pass None to use a default that scales logarithmically with the number of policy parameters

  • num_workers – number of environments for parallel sampling

  • logger – logger for every step of the algorithm, if None the default logger will be created
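
All algorithms deriving from this base class follow the same episodic scheme: sample a population of policy parameter sets, evaluate every set on num_init_states_per_domain * num_domains rollouts, average the returns per set, and pass the result to update(). The snippet below only illustrates the averaging step; rollout_return is a hypothetical callable, and the actual data collection is handled by the ParameterExplorationSampler.

import numpy as np

def evaluate_population(candidates, rollout_return, num_init_states_per_domain, num_domains):
    """Average the return of each candidate parameter set over all of its rollouts (sketch)."""
    num_ros = num_init_states_per_domain * num_domains  # rollouts per candidate
    return np.array(
        [np.mean([rollout_return(c) for _ in range(num_ros)]) for c in candidates]
    )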

property env: Env

Get the environment in which the algorithm trains.

property expl_strat: StochasticParamExplStrat

Get the algorithm’s exploration strategy.

reset(seed: Optional[int] = None)[source]

Reset the algorithm to its initial state. This should NOT reset the learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.

Parameters:

seed – seed value for the random number generators, pass None for no seeding

property sampler: ParameterExplorationSampler

Get the sampler. For algorithms with multiple samplers, this is the one collecting the training data.

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

abstract update(param_results: ParameterSamplingResult, ret_avg_curr: float)[source]

Update the policy from the given samples.

Parameters:
  • param_results – Sampled parameters with evaluation

  • ret_avg_curr – Average return for the current parameters

pepg

class PEPG(save_dir: PathLike, env: Env, policy: Policy, max_iter: int, num_init_states_per_domain: int, expl_std_init: float, expl_std_min: float = 0.01, num_domains: int = 1, pop_size: Optional[int] = None, clip_ratio_std: float = 0.05, normalize_update: bool = False, transform_returns: bool = True, lr: float = 0.0005, num_workers: int = 4, logger: Optional[StepLogger] = None)[source]

Bases: ParameterExploring

Parameter-Exploring Policy Gradients (PEPG)

See also

[1] F. Sehnke, C. Osendorfer, T. Rueckstiess, A. Graves, J. Peters, J. Schmidhuber, “Parameter-exploring Policy Gradients”, Neural Networks, 2010

Constructor

Parameters:
  • save_dir – directory in which to save the snapshots, i.e. the results

  • env – the environment in which the policy operates

  • policy – policy to be updated

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • pop_size – number of solutions in the population

  • num_init_states_per_domain – number of rollouts to cover the variance over initial states

  • num_domains – number of rollouts due to the variance over domain parameters

  • expl_std_init – initial standard deviation for the exploration strategy

  • expl_std_min – minimal standard deviation for the exploration strategy

  • clip_ratio_std – maximal ratio for the change of the exploration strategy’s standard deviation

  • transform_returns – use a rank-transformation of the returns to update the policy

  • lr – learning rate

  • num_workers – number of environments for parallel sampling

  • logger – logger for every step of the algorithm, if None the default logger will be created
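
A bare-bones PEPG step with symmetric sampling, following the gradient estimators from [1], might look like the sketch below. It omits the rank transformation of the returns, the optional update normalization, and the clipping of the standard-deviation change that this class supports; evaluate is a hypothetical callable returning the average return of a parameter set.

import numpy as np

def pepg_step(mean, std, evaluate, pop_size=50, lr=5e-4, expl_std_min=0.01):
    """One PEPG update of the parameter mean and exploration std with symmetric sampling."""
    half = pop_size // 2
    eps = std * np.random.randn(half, mean.size)            # perturbations
    rets_pos = np.array([evaluate(mean + e) for e in eps])  # returns of theta + eps
    rets_neg = np.array([evaluate(mean - e) for e in eps])  # returns of theta - eps

    # Gradient estimate for the mean (symmetric sampling)
    grad_mean = eps.T @ (rets_pos - rets_neg) / 2.0

    # Gradient estimate for the std, using the mean return as baseline
    baseline = np.mean(np.concatenate([rets_pos, rets_neg]))
    s = (eps ** 2 - std ** 2) / std                         # per-sample std "feature"
    grad_std = s.T @ ((rets_pos + rets_neg) / 2.0 - baseline)

    return mean + lr * grad_mean, np.maximum(std + lr * grad_std, expl_std_min)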

name: str = 'pepg'
update(param_results: ParameterSamplingResult, ret_avg_curr: float = None)[source]

Update the policy from the given samples.

Parameters:
  • param_results – Sampled parameters with evaluation

  • ret_avg_curr – Average return for the current parameters

rank_transform(arr: ndarray, centered=True) ndarray[source]

Transform a 1-dim ndarray with arbitrary scalar values to an array with equally spaced rank values. This is a nonlinear transform.

Parameters:
  • arr – input array

  • centered – whether the transform should be centered around zero

Returns:

transformed array
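
A plausible NumPy version of this transform looks as follows; the exact spacing and normalization of the ranks used by this function may differ, so treat it as a sketch.

import numpy as np

def rank_transform(arr: np.ndarray, centered: bool = True) -> np.ndarray:
    """Map arbitrary scalar values to equally spaced rank values (nonlinear transform)."""
    ranks = np.argsort(np.argsort(arr)).astype(np.float64)  # 0 is the worst, len(arr)-1 the best
    ranks /= len(arr) - 1                                   # equally spaced in [0, 1]
    return ranks - 0.5 if centered else ranks               # optionally centered around zero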

power

class PoWER(save_dir: PathLike, env: Env, policy: Policy, max_iter: int, pop_size: Optional[int], num_init_states_per_domain: int, num_is_samples: int, expl_std_init: float, expl_std_min: float = 0.01, num_domains: int = 1, symm_sampling: bool = False, num_workers: int = 4, logger: Optional[StepLogger] = None)[source]

Bases: ParameterExploring

Return-based variant of Policy learning by Weighting Exploration with the Returns (PoWER)

Note

PoWER was designed for linear policies. PoWER must use positive reward functions (forming an improper probability distribution) [1, p. 10]. The original implementation is tailored to movement primitives like DMPs.

See also

[1] J. Kober and J. Peters, “Policy Search for Motor Primitives in Robotics”, Machine Learning, 2011

Constructor

Parameters:
  • save_dir – directory in which to save the snapshots, i.e. the results

  • env – the environment in which the policy operates

  • policy – policy to be updated

  • pop_size – number of solutions in the population

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • num_init_states_per_domain – number of rollouts to cover the variance over initial states

  • num_domains – number of rollouts due to the variance over domain parameters

  • num_is_samples – number of samples (policy parameter sets & returns) for importance sampling

  • expl_std_init – initial standard deviation for the exploration strategy

  • expl_std_min – minimal standard deviation for the exploration strategy

  • symm_sampling – use an exploration strategy which samples symmetric populations

  • num_workers – number of environments for parallel sampling

  • logger – logger for every step of the algorithm, if None the default logger will be created
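
The return-weighted update described above can be sketched as follows for a diagonal exploration covariance. This is an illustration of the episodic, return-based variant, not the exact code of this class; candidates and returns are assumed to come from the current iteration's population.

import numpy as np

def power_update(mean, candidates, returns, num_is_samples=10, expl_std_min=0.01):
    """Return-weighted (PoWER-style) update of the policy mean and the exploration std."""
    assert np.all(returns >= 0), "PoWER requires non-negative returns"

    # Keep the num_is_samples best candidates as importance samples
    idcs = np.argsort(returns)[-num_is_samples:]
    diffs = candidates[idcs] - mean            # deviations from the current mean
    w = returns[idcs] / returns[idcs].sum()    # normalized return weights

    # Mean: move towards the candidates, weighted by their returns
    new_mean = mean + w @ diffs

    # Exploration std (diagonal case): return-weighted second moment of the deviations
    new_std = np.sqrt(w @ diffs ** 2)
    return new_mean, np.maximum(new_std, expl_std_min)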

name: str = 'power'
reset(seed: Optional[int] = None)[source]

Reset the algorithm to its initial state. This should NOT reset the learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.

Parameters:

seed – seed value for the random number generators, pass None for no seeding

update(param_results: ParameterSamplingResult, ret_avg_curr: float = None)[source]

Update the policy from the given samples.

Parameters:
  • param_results – Sampled parameters with evaluation

  • ret_avg_curr – Average return for the current parameters

predefined_lqr

reps

class REPS(save_dir: PathLike, env: Env, policy: Policy, max_iter: int, eps: float, num_init_states_per_domain: int, pop_size: Optional[int], expl_std_init: float, expl_std_min: float = 0.01, num_domains: int = 1, symm_sampling: bool = False, softmax_transform: bool = False, use_map: bool = True, optim_mode: Optional[str] = 'scipy', num_epoch_dual: int = 1000, lr_dual: float = 0.0005, num_workers: int = 4, logger: Optional[StepLogger] = None)[source]

Bases: ParameterExploring

Episodic variant of Relative Entropy Policy Search (REPS)

Note

REPS [1] was designed for linear policies.

See also

[1] J. Peters, K. Mülling, Y. Altün, “Relative Entropy Policy Search”, AAAI, 2010

[2] A. Abdolmaleki, J.T. Springenberg, J. Degrave, S. Bohez, Y. Tassa, D. Belov, N. Heess, M. Riedmiller, “Relative Entropy Regularized Policy Iteration”, arXiv, 2018

[3] This implementation is inspired by the work of H. Abdulsamad, https://github.com/hanyas/reps/blob/master/reps/ereps.py

Constructor

Parameters:
  • save_dir – directory in which to save the snapshots, i.e. the results

  • env – the environment in which the policy operates

  • policy – policy to be updated

  • eps – bound on the KL divergence between policy updates, e.g. 0.1

  • max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs

  • pop_size – number of solutions in the population

  • num_init_states_per_domain – number of rollouts to cover the variance over initial states

  • num_domains – number of rollouts due to the variance over domain parameters

  • expl_std_init – initial standard deviation for the exploration strategy

  • expl_std_min – minimal standard deviation for the exploration strategy

  • symm_sampling – use an exploration strategy which samples symmetric populations

  • softmax_transform – pass True to use a softmax to transform the returns, else use a shifted exponential

  • use_map – use maximum a-posteriori likelihood (True) or maximum likelihood (False) update rule

  • optim_mode – choose the type of optimizer: ‘torch’ for an SGD-based optimizer or ‘scipy’ for the SLSQP optimizer from scipy (recommended)

  • num_epoch_dual – number of epochs for the minimization of the dual functions, ignored if optim_mode = ‘scipy’

  • lr_dual – learning rate for the dual’s optimizer, ignored if optim_mode = ‘scipy’

  • num_workers – number of environments for parallel sampling

  • logger – logger for every step of the algorithm, if None the default logger will be created

dual_evaluation(eta: Union[Tensor, ndarray], rets: Union[Tensor, ndarray]) Union[Tensor, ndarray][source]

Compute the REPS dual function value for policy evaluation.

Parameters:
  • eta – lagrangian multiplier (optimization variable of the dual)

  • rets – return values per policy sample after averaging over multiple rollouts using the same policy

Returns:

dual loss value
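
For orientation, the standard episodic REPS dual for policy evaluation reads \(g(\eta) = \eta\epsilon + \eta \log \frac{1}{N} \sum_i \exp(R_i/\eta)\). A numerically stabilized NumPy sketch of this formula (which may differ from this method's exact implementation) is:

import numpy as np

def reps_dual_evaluation(eta: float, rets: np.ndarray, eps: float) -> float:
    """Episodic REPS dual g(eta) = eta*eps + eta*log mean(exp(R/eta)), stabilized with max(R)."""
    rets_shifted = rets - rets.max()  # shift by the maximum return for numerical stability
    return eta * eps + rets.max() + eta * np.log(np.mean(np.exp(rets_shifted / eta)))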

dual_improvement(eta: Union[Tensor, ndarray], param_samples: Tensor, w: Tensor) Union[Tensor, ndarray][source]

Compute the REPS dual function value for policy improvement.

Parameters:
  • eta – lagrangian multiplier (optimization variable of the dual)

  • param_samples – all sampled policy parameters

  • w – weights of the policy parameter samples

Returns:

dual loss value

property eta: Tensor

Get the Lagrange multiplier \(\eta\). In [2], \(\eta\) is called \(\alpha\).

minimize(loss_fcn: Callable, rets: Optional[Tensor] = None, param_samples: Optional[Tensor] = None, w: Optional[Tensor] = None)[source]

Minimize the given dual function. This function can be called for the dual evaluation loss or the dual improvement loss.

Parameters:
  • loss_fcn – function to minimize, different for wml() and wmap()

  • rets – return values per policy sample after averaging over multiple rollouts using the same policy

  • param_samples – all sampled policy parameters

  • w – weights of the policy parameter samples

name: Optional[str] = 'reps'
update(param_results: ParameterSamplingResult, ret_avg_curr: Optional[float] = None)[source]

Update the policy from the given samples.

Parameters:
  • param_results – Sampled parameters with evaluation

  • ret_avg_curr – Average return for the current parameters

weights(rets: Tensor) Tensor[source]

Compute the weights which are used to weight the policy samples by their return. As stated in [2, sec. 4.1], the weights could be calculated using any rank-preserving transformation.

Parameters:

rets – return values per policy sample after averaging over multiple rollouts using the same policy

Returns:

weights of the policy parameter samples
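
As documented for softmax_transform, both options amount to an exponential weighting of the returns with temperature \(\eta\). A sketch of this reading (assuming the shifted exponential simply skips the normalization):

import torch

def reps_weights(rets: torch.Tensor, eta: torch.Tensor, softmax_transform: bool) -> torch.Tensor:
    """Return-based weights for the policy parameter samples (illustrative sketch)."""
    if softmax_transform:
        # Softmax of the returns scaled by the temperature eta
        return torch.softmax(rets / eta, dim=0)
    # Shifted exponential: subtract the maximum return for numerical stability
    return torch.exp((rets - rets.max()) / eta)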

wmap(param_samples: Tensor, w: Tensor)[source]

Weighted maximum a-posteriori likelihood update of the policy’s mean and the exploration strategy’s covariance

Parameters:
  • param_samples – all sampled policy parameters

  • w – weights of the policy parameter samples

wml(eta: Tensor, param_samples: Tensor, w: Tensor)[source]

Weighted maximum likelihood update of the policy’s mean and the exploration strategy’s covariance

Parameters:
  • eta – lagrangian multiplier (optimization variable of the dual)

  • param_samples – all sampled policy parameters

  • w – weights of the policy parameter samples

sysid_via_episodic_rl

class SysIdViaEpisodicRL(subrtn: ParameterExploring, behavior_policy: Policy, num_rollouts_per_distr: int, metric: Optional[Callable[[ndarray], ndarray]], obs_dim_weight: Union[list, ndarray], std_obs_filt: int = 5, w_abs: float = 0.5, w_sq: float = 1.0, num_workers: int = 4, base_seed: int = 1001)[source]

Bases: Algorithm

Wrapper to frame black-box system identification as an episodic reinforcement learning problem

Note

This algorithm was designed as a subroutine of SimOpt. However, it could also be used independently.

Constructor

Parameters:
  • subrtn – wrapped algorithm to fit the domain parameter distribution

  • behavior_policy – lower level policy used to generate the rollouts

  • num_rollouts_per_distr – number of rollouts per domain distribution parameter set

  • metric – functional mapping from differences in observations to value

  • obs_dim_weight – (diagonal) weight matrix for the different observation dimensions for the default metric

  • std_obs_filt – number of standard deviations for the Gaussian filter applied to the observations

  • w_abs – weight for the mean absolute errors for the default metric

  • w_sq – weight for the mean squared errors for the default metric

  • num_workers – number of environments for parallel sampling

  • base_seed – seed to set for the parallel sampler in every iteration

iteration_key: str = 'sysiderl_iteration'
loss_fcn(rollout_real: StepSequence, rollout_sim: StepSequence) float[source]

Compute the discrepancy between two time sequences of observations given the metric. Be sure to align and truncate the rollouts beforehand.

Parameters:
  • rollout_real – (concatenated) real-world rollout containing the observations

  • rollout_sim – (concatenated) simulated rollout containing the observations

Returns:

discrepancy cost summed over the observation dimensions

name: str = 'sysiderl'
static override_obs_bounds(bound_lo: ndarray, bound_up: ndarray, labels: ndarray) Tuple[ndarray, ndarray][source]

Default overriding method for the bounds of an observation space. This is necessary when the observations are scaled by their range, e.g. to compare deviations across different kinds of observations like position and angular velocity; infinite bounds are not feasible in that case.

Parameters:
  • bound_lo – lower bound of the observation space

  • bound_up – upper bound of the observation space

  • labels – label for each dimension of the observation space to override

Returns:

clipped lower and upper bound

reset(seed: Optional[int] = None)[source]

Reset the algorithm to its initial state. This should NOT reset the learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.

Parameters:

seed – seed value for the random number generators, pass None for no seeding

save_snapshot(meta_info: Optional[dict] = None)[source]

Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.

Parameters:

meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]

Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.

Parameters:
  • snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)

  • meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm

property subrtn: ParameterExploring

Get the subroutine used for updating the domain parameter distribution.

static truncate_rollouts(rollouts_real: Sequence[StepSequence], rollouts_sim: Sequence[StepSequence], replicate: bool = True) Tuple[Sequence[StepSequence], Sequence[StepSequence]][source]

In case (some of the) rollouts failed or succeeded in one domain but not in the other, we truncate the longer observation sequence. When truncating, we compare each of the M real rollouts to each of the N simulated rollouts, thus replicating the real rollouts N times and the simulated rollouts M times.

Parameters:
  • rollouts_real – M real-world rollouts of different length if replicate = True, else K real-world rollouts of different length

  • rollouts_sim – N simulated rollouts of different length if replicate = True, else K simulated rollouts of different length

  • replicate – if False the i-th rollout from rollouts_real is (only) compared with the i-th rollout from rollouts_sim, in this case the number of rollouts and the initial states have to match

Returns:

MxN real-world rollouts and MxN simulated rollouts of equal length if replicate = True, else K real-world rollouts and K simulated rollouts of equal length
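
The pairing and truncation logic can be illustrated on plain observation arrays; the actual method operates on StepSequence objects, so this is only a sketch of the described behavior.

import numpy as np

def truncate_pairs(obs_real: list, obs_sim: list, replicate: bool = True):
    """Pair real and simulated observation sequences and truncate each pair to equal length."""
    if replicate:
        # Compare every real rollout with every simulated rollout -> M*N pairs
        pairs = [(r, s) for r in obs_real for s in obs_sim]
    else:
        # Compare the i-th real rollout only with the i-th simulated rollout -> K pairs
        assert len(obs_real) == len(obs_sim)
        pairs = list(zip(obs_real, obs_sim))

    # Truncate the longer sequence of every pair to the length of the shorter one
    truncated = [(r[: min(len(r), len(s))], s[: min(len(r), len(s))]) for r, s in pairs]
    real_trunc, sim_trunc = zip(*truncated)
    return list(real_trunc), list(sim_trunc)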

static weighted_l1_l2_metric(err: ndarray, w_abs: float, w_sq: float, obs_dim_weight: ndarray)[source]

Compute the weighted linear combination of the observation error’s MAE and MSE, averaged over time

Note

In contrast to [1], we are using the mean absolute error and the mean squared error instead of the L1 and the L2 norm. The reason for this is that longer time series would be punished otherwise.

Parameters:
  • err – error signal with time steps along the first dimension

  • w_abs – weight for the mean absolute errors

  • w_sq – weight for the mean squared errors

  • obs_dim_weight – (diagonal) weight matrix for the different observation dimensions

Returns:

weighted linear combination of the error’s MAE and MSE, averaged over time
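
Reading the parameters above literally, the cost is roughly \(\sum_d w_d \left[ w_{\mathrm{abs}} \mathrm{MAE}_d + w_{\mathrm{sq}} \mathrm{MSE}_d \right]\) with the means taken over time. A NumPy sketch under this reading, treating obs_dim_weight as the diagonal of the weight matrix:

import numpy as np

def weighted_l1_l2_metric(err: np.ndarray, w_abs: float, w_sq: float,
                          obs_dim_weight: np.ndarray) -> float:
    """Weighted combination of the error's MAE and MSE per observation dimension (sketch)."""
    mae = np.mean(np.abs(err), axis=0)   # mean absolute error over time, per dimension
    mse = np.mean(err ** 2, axis=0)      # mean squared error over time, per dimension
    per_dim = w_abs * mae + w_sq * mse   # combine the two error measures
    return float(np.sum(obs_dim_weight * per_dim))  # weight and sum over the dimensions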

Module contents