episodic
cem
- class CEM(save_dir: PathLike, env: Env, policy: Policy, max_iter: int, pop_size: Optional[int], num_init_states_per_domain: int, num_is_samples: int, expl_std_init: float, expl_std_min: float = 0.01, extra_expl_std_init: float = 0.0, extra_expl_decay_iter: int = 10, num_domains: int = 1, soft_update_factor: float = 1, full_cov: bool = False, symm_sampling: bool = False, num_workers: int = 4, logger: Optional[StepLogger] = None)[source]
Bases:
ParameterExploring
Cross-Entropy Method (CEM)
This implementation is essentially Algorithm 3.3 in [1] with the addition of decreasing noise [2]. CEM is closely related to PoWER. The most significant differences are that the importance samples are not kept over iterations and that the covariance matrix is not scaled with the returns, which allows for negative returns. A minimal sketch of the elite-refit idea follows at the end of this entry.
See also
[1] P.T. de Boer, D.P. Kroese, S. Mannor, R.Y. Rubinstein, “A Tutorial on the Cross-Entropy Method”, Annals OR, 2005
[2] I. Szita, A. Lőrincz, “Learning Tetris Using the Noisy Cross-Entropy Method”, Neural Computation, 2006
Constructor
- Parameters:
save_dir – directory to save the snapshots, i.e. the results, in
env – the environment in which the policy operates
policy – policy to be updated
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
pop_size – number of solutions in the population
num_init_states_per_domain – number of rollouts to cover the variance over initial states
num_domains – number of rollouts due to the variance over domain parameters
num_is_samples – number of samples (policy parameter sets & returns) for importance sampling, indirectly specifies the performance quantile \(1 - \rho\) [1]
expl_std_init – initial standard deviation for the exploration strategy
expl_std_min – minimal standard deviation for the exploration strategy
extra_expl_std_init – additional standard deviation for the parameter exploration added to the diagonal entries of the covariance matrix, set to 0 to disable this functionality
extra_expl_decay_iter – limit for the linear decay of the additional standard deviation, i.e. last iteration in which the additional exploration noise is applied
soft_update_factor – a number between 0 and 1 to linearly scale the updates of the policy; by default full updates are done, i.e. the new policy parameters are the mean of the importance samples
full_cov – pass True to compute a full covariance matrix for sampling the next policy parameter values, else a diagonal covariance is used
symm_sampling – use an exploration strategy which samples symmetric populations
num_workers – number of environments for parallel sampling
logger – logger for every step of the algorithm, if None the default logger will be created
- name: str = 'cem'
- update(param_results: ParameterSamplingResult, ret_avg_curr: float = None)[source]
Update the policy from the given samples.
- Parameters:
param_results – Sampled parameters with evaluation
ret_avg_curr – Average return for the current parameters
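The elite-refit at the core of the update can be summarized in a few lines of NumPy. The following is a minimal sketch under simplifying assumptions (diagonal covariance, no soft update, no symmetric sampling); the function and variable names are illustrative and do not mirror the class internals:

    import numpy as np

    def cem_style_update(params, rets, num_is_samples, extra_std=0.0):
        # Keep the elite (importance) samples and refit a diagonal Gaussian.
        # params: (pop_size, num_policy_params), rets: (pop_size,)
        elite_idcs = np.argsort(rets)[::-1][:num_is_samples]  # highest returns first
        elites = params[elite_idcs]
        new_mean = elites.mean(axis=0)
        new_std = elites.std(axis=0) + extra_std  # extra noise, decayed over iterations as in [2]
        return new_mean, new_std

    # toy usage: returns favor parameters close to zero
    rng = np.random.default_rng(0)
    pop = rng.normal(size=(20, 3))
    rets = -np.sum(pop**2, axis=1)
    mean, std = cem_style_update(pop, rets, num_is_samples=5, extra_std=0.1)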
hc
- class HC(save_dir: PathLike, env: Env, policy: Policy, max_iter: int, num_init_states_per_domain: int, expl_factor: float, num_domains: int = 1, pop_size: Optional[int] = None, num_workers: int = 4, logger: Optional[StepLogger] = None)[source]
Bases:
ParameterExploring
Hill Climbing (HC)
HC is a heuristic-based policy search method that samples a population of policy parameter sets per iteration and evaluates them on multiple rollouts. If one of the new parameter sets is better than the current one, it is kept. If the exploration parameters grow too large, they are reset. A sketch of this accept/reject step is given after the HC variants below.
Constructor
- Parameters:
save_dir – directory to save the snapshots, i.e. the results, in
env – the environment in which the policy operates
policy – policy to be updated
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
num_init_states_per_domain – number of rollouts to cover the variance over initial states
num_domains – number of rollouts due to the variance over domain parameters
expl_factor – scalar value which determines how the exploration strategy adapts its search space
pop_size – number of solutions in the population
num_workers – number of environments for parallel sampling
logger – logger for every step of the algorithm, if None the default logger will be created
- name: str = 'hc'
- update(param_results: ParameterSamplingResult, ret_avg_curr: float)[source]
Update the policy from the given samples.
- Parameters:
param_results – Sampled parameters with evaluation
ret_avg_curr – Average return for the current parameters
- class HCHyper(*args, **kwargs)[source]
Bases:
HC
Hill Climbing variant using an exploration strategy that samples policy parameters from a hyper-sphere
Constructor
- Parameters:
expl_r_init – initial radius of the hyper-sphere for the exploration strategy
args – forwarded to the superclass constructor
kwargs – forwarded to the superclass constructor
- class HCNormal(*args, **kwargs)[source]
Bases:
HC
Hill Climbing variant using an exploration strategy with normally distributed noise on the policy parameters
Constructor
- Parameters:
expl_std_init – initial standard deviation for the exploration strategy
args – forwarded to the superclass constructor
kwargs – forwarded to the superclass constructor
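The accept/reject logic of the HC variants can be illustrated with a short NumPy sketch using normally distributed noise (as in HCNormal). All names are placeholders, and the exploration adaptation is simplified compared to the actual exploration strategies:

    import numpy as np

    def hc_normal_step(mean, std, evaluate, pop_size, expl_factor, std_init, std_max):
        # Sample Gaussian perturbations of the current parameters; keep the best
        # candidate if it beats the incumbent, otherwise widen (and eventually reset)
        # the exploration noise.
        candidates = mean + std * np.random.randn(pop_size, mean.size)
        rets = np.array([evaluate(c) for c in candidates])
        if rets.max() > evaluate(mean):
            return candidates[rets.argmax()], std
        std = std * expl_factor  # no improvement: enlarge the search space
        if std > std_max:
            std = std_init  # exploration grew too large: reset it
        return mean, std

    # toy usage: climb towards the maximum of a concave objective
    evaluate = lambda th: -np.sum((th - 1.0) ** 2)
    params, std = np.zeros(3), 0.1
    for _ in range(50):
        params, std = hc_normal_step(params, std, evaluate, pop_size=10,
                                     expl_factor=1.1, std_init=0.1, std_max=1.0)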
nes
- class NES(save_dir: PathLike, env: Env, policy: Policy, max_iter: int, num_init_states_per_domain: int, expl_std_init: float, expl_std_min: float = 0.01, num_domains: int = 1, pop_size: Optional[int] = None, eta_mean: float = 1.0, eta_std: Optional[float] = None, symm_sampling: bool = False, transform_returns: bool = True, num_workers: int = 4, logger: Optional[StepLogger] = None)[source]
Bases:
ParameterExploring
Simplified variant of Natural Evolution Strategies (NES)
See also
[1] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, J. Schmidhuber, “Natural Evolution Strategies”, JMLR, 2014
[2] This implementation was inspired by https://github.com/pybrain/pybrain/blob/master/pybrain/optimization/distributionbased/snes.py
Constructor
- Parameters:
save_dir – directory to save the snapshots, i.e. the results, in
env – the environment in which the policy operates
policy – policy to be updated
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
num_init_states_per_domain – number of rollouts to cover the variance over initial states
num_domains – number of rollouts due to the variance over domain parameters
expl_std_init – initial standard deviation for the exploration strategy
expl_std_min – minimal standard deviation for the exploration strategy
pop_size – number of solutions in the population
eta_mean – step size factor for the mean
eta_std – step size factor for the standard deviation
symm_sampling – use an exploration strategy which samples symmetric populations
transform_returns – use a rank-transformation of the returns to update the policy
num_workers – number of environments for parallel sampling
logger – logger for every step of the algorithm, if None the default logger will be created
- static compute_utilities(pop_size: Optional[int], eta_mean: float, eta_std: float)[source]
Compute the utilities as described in section 3.1 of [1] (a.k.a. Hansen ranking with uniform baseline)
- Parameters:
pop_size – number of solutions in the population
eta_mean – step size factor for the mean
eta_std – step size factor for the standard deviation
- Returns:
utility coefficient for the mean, and utility coefficient for the standard deviation
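As a rough illustration of this ranking scheme, the sketch below computes zero-mean, rank-based utilities with a uniform baseline; the exact scaling used by the class method may differ:

    import numpy as np

    def nes_utilities_sketch(pop_size, eta_mean, eta_std):
        # Rank-based utilities: the best-ranked samples get positive weight,
        # the rest a small negative weight, summing to zero overall.
        ranks = np.arange(1, pop_size + 1)  # 1 = best candidate
        raw = np.maximum(0.0, np.log(pop_size / 2 + 1) - np.log(ranks))
        utilities = raw / raw.sum() - 1.0 / pop_size
        return eta_mean * utilities, eta_std * utilities

    u_mean, u_std = nes_utilities_sketch(pop_size=10, eta_mean=1.0, eta_std=0.6)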
- name: str = 'nes'
- update(param_results: ParameterSamplingResult, ret_avg_curr: Optional[float] = None)[source]
Update the policy from the given samples.
- Parameters:
param_results – Sampled parameters with evaluation
ret_avg_curr – Average return for the current parameters
parameter_exploring
- class ParameterExploring(save_dir: PathLike, env: Env, policy: Policy, max_iter: int, num_init_states_per_domain: int, num_domains: int, pop_size: Optional[int] = None, num_workers: int = 4, logger: Optional[StepLogger] = None)[source]
Bases:
Algorithm
Base for all algorithms that explore directly in the policy parameter space
Constructor
- Parameters:
save_dir – directory to save the snapshots, i.e. the results, in
env – the environment in which the policy operates
policy – policy to be updated
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
num_init_states_per_domain – number of rollouts to cover the variance over initial states
num_domains – number of rollouts due to the variance over domain parameters
pop_size – number of solutions in the population, pass None to use a default that scales logarithmically with the number of policy parameters
num_workers – number of environments for parallel sampling
logger – logger for every step of the algorithm, if None the default logger will be created
- property expl_strat: StochasticParamExplStrat
Get the algorithm’s exploration strategy.
- reset(seed: Optional[int] = None)[source]
Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.
- Parameters:
seed – seed value for the random number generators, pass None for no seeding
- property sampler: ParameterExplorationSampler
Get the sampler. For algorithms with multiple samplers, this is the one collecting the training data.
- save_snapshot(meta_info: Optional[dict] = None)[source]
Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.
- Parameters:
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
- step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]
Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.
- Parameters:
snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
- abstract update(param_results: ParameterSamplingResult, ret_avg_curr: float)[source]
Update the policy from the given samples.
- Parameters:
param_results – Sampled parameters with evaluation
ret_avg_curr – Average return for the current parameters
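Conceptually, one step() of a ParameterExploring subclass samples a population of parameter sets, evaluates them by averaging returns over rollouts, and passes the result to the algorithm-specific update(). The following outline uses placeholder callables rather than the actual Pyrado sampler and exploration-strategy interfaces:

    import numpy as np

    def parameter_exploring_iteration(nominal_params, sample_population, evaluate, update):
        # sample_population: returns a (pop_size, num_params) array of candidates
        # evaluate: maps one candidate to its return, averaged over init states and domains
        # update: the algorithm-specific refit of the search distribution
        candidates = sample_population(nominal_params)
        avg_rets = np.array([evaluate(c) for c in candidates])
        return update(candidates, avg_rets)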
pepg
- class PEPG(save_dir: PathLike, env: Env, policy: Policy, max_iter: int, num_init_states_per_domain: int, expl_std_init: float, expl_std_min: float = 0.01, num_domains: int = 1, pop_size: Optional[int] = None, clip_ratio_std: float = 0.05, normalize_update: bool = False, transform_returns: bool = True, lr: float = 0.0005, num_workers: int = 4, logger: Optional[StepLogger] = None)[source]
Bases:
ParameterExploring
Parameter-Exploring Policy Gradients (PEPG)
See also
[1] F. Sehnke, C. Osendorfer, T. Rueckstiess, A. Graves, J. Peters, J. Schmidhuber, “Parameter-exploring Policy Gradients”, Neural Networks, 2010
Constructor
- Parameters:
save_dir – directory to save the snapshots, i.e. the results, in
env – the environment in which the policy operates
policy – policy to be updated
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
pop_size – number of solutions in the population
num_init_states_per_domain – number of rollouts to cover the variance over initial states
num_domains – number of rollouts due to the variance over domain parameters
expl_std_init – initial standard deviation for the exploration strategy
expl_std_min – minimal standard deviation for the exploration strategy
clip_ratio_std – maximal ratio for the change of the exploration strategy’s standard deviation
transform_returns – use a rank-transformation of the returns to update the policy
lr – learning rate
num_workers – number of environments for parallel sampling
logger – logger for every step of the algorithm, if None the default logger will be created
- name: str = 'pepg'
- update(param_results: ParameterSamplingResult, ret_avg_curr: float = None)[source]
Update the policy from the given samples.
- Parameters:
param_results – Sampled parameters with evaluation
ret_avg_curr – Average return for the current parameters
- rank_transform(arr: ndarray, centered=True) ndarray [source]
Transform a 1-dim ndarray with arbitrary scalar values to an array with equally spaced rank values. This is a nonlinear transform.
- Parameters:
arr – input array
centered – if the transform should be centered around zero
- Returns:
transformed array
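A minimal NumPy sketch of such a rank transform is given below; the exact value range of the library function may differ, but the rank-preserving, equally spaced nature is the same:

    import numpy as np

    def rank_transform_sketch(arr, centered=True):
        # Replace arbitrary scalar values by equally spaced rank values in [0, 1]
        # (or [-0.5, 0.5] if centered); a nonlinear, rank-preserving transform.
        ranks = np.argsort(np.argsort(arr))  # 0 = smallest value, n-1 = largest
        transformed = ranks / (len(arr) - 1)
        if centered:
            transformed = transformed - 0.5
        return transformed

    print(rank_transform_sketch(np.array([3.0, -10.0, 0.5, 7.0])))  # approx. [0.167, -0.5, -0.167, 0.5]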
power
- class PoWER(save_dir: PathLike, env: Env, policy: Policy, max_iter: int, pop_size: Optional[int], num_init_states_per_domain: int, num_is_samples: int, expl_std_init: float, expl_std_min: float = 0.01, num_domains: int = 1, symm_sampling: bool = False, num_workers: int = 4, logger: Optional[StepLogger] = None)[source]
Bases:
ParameterExploring
Return-based variant of Policy learning by Weighting Exploration with the Returns (PoWER)
Note
PoWER was designed for linear policies. PoWER requires strictly positive reward functions, since the returns are treated as an improper probability distribution [1, p. 10]. The original implementation is tailored to movement primitives like DMPs. A sketch of the return-weighted update idea is given at the end of this entry.
See also
[1] J. Kober and J. Peters, “Policy Search for Motor Primitives in Robotics”, Machine Learning, 2011
Constructor
- Parameters:
save_dir – directory to save the snapshots, i.e. the results, in
env – the environment in which the policy operates
policy – policy to be updated
pop_size – number of solutions in the population
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
num_init_states_per_domain – number of rollouts to cover the variance over initial states
num_domains – number of rollouts due to the variance over domain parameters
num_is_samples – number of samples (policy parameter sets & returns) for importance sampling
expl_std_init – initial standard deviation for the exploration strategy
expl_std_min – minimal standard deviation for the exploration strategy
symm_sampling – use an exploration strategy which samples symmetric populations
num_workers – number of environments for parallel sampling
logger – logger for every step of the algorithm, if None the default logger will be created
- name: str = 'power'
- reset(seed: Optional[int] = None)[source]
Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.
- Parameters:
seed – seed value for the random number generators, pass None for no seeding
- update(param_results: ParameterSamplingResult, ret_avg_curr: float = None)[source]
Update the policy from the given samples.
- Parameters:
param_results – Sampled parameters with evaluation
ret_avg_curr – Average return for the current parameters
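The return-weighted refit that distinguishes PoWER from CEM can be sketched as follows. This is a simplified illustration with made-up names, not the class implementation, which also refits the exploration covariance:

    import numpy as np

    def power_style_update_sketch(params, rets, num_is_samples):
        # Keep the best samples and weight them by their (strictly positive) returns,
        # treating the returns as an improper probability distribution.
        elite_idcs = np.argsort(rets)[::-1][:num_is_samples]
        elites, elite_rets = params[elite_idcs], rets[elite_idcs]
        assert np.all(elite_rets > 0), "PoWER-style updates require positive returns"
        w = elite_rets / elite_rets.sum()
        new_mean = w @ elites
        return new_mean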
predefined_lqr
reps
- class REPS(save_dir: PathLike, env: Env, policy: Policy, max_iter: int, eps: float, num_init_states_per_domain: int, pop_size: Optional[int], expl_std_init: float, expl_std_min: float = 0.01, num_domains: int = 1, symm_sampling: bool = False, softmax_transform: bool = False, use_map: bool = True, optim_mode: Optional[str] = 'scipy', num_epoch_dual: int = 1000, lr_dual: float = 0.0005, num_workers: int = 4, logger: Optional[StepLogger] = None)[source]
Bases:
ParameterExploring
Episodic variant of Relative Entropy Policy Search (REPS)
Note
REPS [1] was designed for linear policies.
See also
[1] J. Peters, K. Mülling, Y. Altün, “Relative Entropy Policy Search”, AAAI, 2010
[2] A. Abdolmaleki, J.T. Springenberg, J. Degrave, S. Bohez, Y. Tassa, D. Belov, N. Heess, M. Riedmiller, “Relative Entropy Regularized Policy Iteration”, arXiv, 2018
[3] This implementation is inspired by the work of H. Abdulsamad
Constructor
- Parameters:
save_dir – directory to save the snapshots, i.e. the results, in
env – the environment in which the policy operates
policy – policy to be updated
eps – bound on the KL divergence between policy updates, e.g. 0.1
max_iter – maximum number of iterations (i.e. policy updates) that this algorithm runs
pop_size – number of solutions in the population
num_init_states_per_domain – number of rollouts to cover the variance over initial states
num_domains – number of rollouts due to the variance over domain parameters
expl_std_init – initial standard deviation for the exploration strategy
expl_std_min – minimal standard deviation for the exploration strategy
symm_sampling – use an exploration strategy which samples symmetric populations
softmax_transform – pass True to use a softmax to transform the returns, else use a shifted exponential
use_map – use maximum a-posteriori likelihood (True) or maximum likelihood (False) update rule
optim_mode – choose the type of optimizer: ‘torch’ for an SGD-based optimizer or ‘scipy’ for the SLSQP optimizer from scipy (recommended)
num_epoch_dual – number of epochs for the minimization of the dual functions, ignored if optim_mode = ‘scipy’
lr_dual – learning rate for the dual’s optimizer, ignored if optim_mode = ‘scipy’
num_workers – number of environments for parallel sampling
logger – logger for every step of the algorithm, if None the default logger will be created
- dual_evaluation(eta: Union[Tensor, ndarray], rets: Union[Tensor, ndarray]) Union[Tensor, ndarray] [source]
Compute the REPS dual function value for policy evaluation.
- Parameters:
eta – lagrangian multiplier (optimization variable of the dual)
rets – return values per policy sample after averaging over multiple rollouts using the same policy
- Returns:
dual loss value
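For reference, the sample-based episodic REPS evaluation dual has the form \(g(\eta) = \eta\epsilon + \eta \log\big(\frac{1}{N}\sum_i \exp(R_i/\eta)\big)\). The sketch below evaluates it with shifted returns for numerical stability; eps is passed explicitly here, whereas the class stores it as an attribute, and additional terms of the actual method may differ:

    import numpy as np

    def reps_dual_eval_sketch(eta, rets, eps):
        # g(eta) = eta*eps + eta*log(mean(exp(R_i/eta))), computed with shifted
        # returns so that the exponentials stay numerically well behaved.
        shifted = rets - rets.max()
        return eta * eps + rets.max() + eta * np.log(np.mean(np.exp(shifted / eta)))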
- dual_improvement(eta: Union[Tensor, ndarray], param_samples: Tensor, w: Tensor) Union[Tensor, ndarray] [source]
Compute the REPS dual function value for policy improvement.
- Parameters:
eta – lagrangian multiplier (optimization variable of the dual)
param_samples – all sampled policy parameters
w – weights of the policy parameter samples
- Returns:
dual loss value
- property eta: Tensor
Get the Lagrange multiplier \(\eta\). In [2], \(\eta\) is called \(\alpha\).
- minimize(loss_fcn: Callable, rets: Optional[Tensor] = None, param_samples: Optional[Tensor] = None, w: Optional[Tensor] = None)[source]
Minimize the given dual function. This function can be called for the dual evaluation loss or the dual improvement loss.
- Parameters:
loss_fcn – function to minimize, different for wml() and wmap()
rets – return values per policy sample after averaging over multiple rollouts using the same policy
param_samples – all sampled policy parameters
w – weights of the policy parameter samples
- name: Optional[str] = 'reps'
- update(param_results: ParameterSamplingResult, ret_avg_curr: Optional[float] = None)[source]
Update the policy from the given samples.
- Parameters:
param_results – Sampled parameters with evaluation
ret_avg_curr – Average return for the current parameters
- weights(rets: Tensor) Tensor [source]
Compute the weights which are used to weight the policy samples by their return. As stated in [2, sec. 4.1], the weights could be calculated using any rank-preserving transformation.
- Parameters:
rets – return values per policy sample after averaging over multiple rollouts using the same policy
- Returns:
weights of the policy parameter samples
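A hedged sketch of a typical return-based weighting, i.e. a shifted exponential that can optionally be normalized to a softmax; the actual method may handle the temperature and normalization differently:

    import numpy as np

    def reps_weights_sketch(rets, eta, softmax_transform=False):
        # Shifted exponential of the returns; optionally normalized to a softmax.
        w = np.exp((rets - rets.max()) / eta)
        return w / w.sum() if softmax_transform else w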
- wmap(param_samples: Tensor, w: Tensor)[source]
Weighted maximum a-posteriori likelihood update of the policy’s mean and the exploration strategy’s covariance
- Parameters:
param_samples – all sampled policy parameters
w – weights of the policy parameter samples
- wml(eta: Tensor, param_samples: Tensor, w: Tensor)[source]
Weighted maximum likelihood update of the policy’s mean and the exploration strategy’s covariance
- Parameters:
eta – lagrangian multiplier (optimization variable of the dual)
param_samples – all sampled policy parameters
w – weights of the policy parameter samples
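The core of a weighted maximum-likelihood refit can be written in a few lines; this sketch omits eta and any regularization or normalization constants that the class method applies:

    import numpy as np

    def wml_sketch(param_samples, w):
        # Weighted maximum-likelihood refit of a Gaussian search distribution.
        w = w / w.sum()
        mean = w @ param_samples                 # weighted mean of the parameter samples
        diffs = param_samples - mean
        cov = (w[:, None] * diffs).T @ diffs     # weighted sample covariance
        return mean, cov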
sysid_via_episodic_rl
- class SysIdViaEpisodicRL(subrtn: ParameterExploring, behavior_policy: Policy, num_rollouts_per_distr: int, metric: Optional[Callable[[ndarray], ndarray]], obs_dim_weight: Union[list, ndarray], std_obs_filt: int = 5, w_abs: float = 0.5, w_sq: float = 1.0, num_workers: int = 4, base_seed: int = 1001)[source]
Bases:
Algorithm
Wrapper to frame black-box system identification as an episodic reinforcement learning problem
Note
This algorithm was designed as a subroutine of SimOpt. However, it could also be used independently.
Constructor
- Parameters:
subrtn – wrapped algorithm to fit the domain parameter distribution
behavior_policy – lower level policy used to generate the rollouts
num_rollouts_per_distr – number of rollouts per domain distribution parameter set
metric – functional mapping from differences in observations to value
obs_dim_weight – (diagonal) weight matrix for the different observation dimensions for the default metric
std_obs_filt – number of standard deviations for the Gaussian filter applied to the observations
w_abs – weight for the mean absolute errors for the default metric
w_sq – weight for the mean squared errors for the default metric
num_workers – number of environments for parallel sampling
base_seed – seed to set for the parallel sampler in every iteration
- iteration_key: str = 'sysiderl_iteration'
- loss_fcn(rollout_real: StepSequence, rollout_sim: StepSequence) float [source]
Compute the discrepancy between two time sequences of observations given the metric. Be sure to align and truncate the rollouts beforehand.
- Parameters:
rollout_real – (concatenated) real-world rollout containing the observations
rollout_sim – (concatenated) simulated rollout containing the observations
- Returns:
discrepancy cost summed over the observation dimensions
- name: str = 'sysiderl'
- static override_obs_bounds(bound_lo: ndarray, bound_up: ndarray, labels: ndarray) Tuple[ndarray, ndarray] [source]
Default overriding method for the bounds of an observation space. This is necessary when the observations are scaled with their range, e.g. to compare a deviation over different kinds of observations like position and angular velocity. Thus, infinite bounds are not feasible.
- Parameters:
bound_lo – lower bound of the observation space
bound_up – upper bound of the observation space
labels – label for each dimension of the observation space to override
- Returns:
clipped lower and upper bound
- reset(seed: Optional[int] = None)[source]
Reset the algorithm to its initial state. This should NOT reset learned policy parameters. By default, this resets the iteration count and the exploration strategy. Be sure to call this function if you override it.
- Parameters:
seed – seed value for the random number generators, pass None for no seeding
- save_snapshot(meta_info: Optional[dict] = None)[source]
Save the algorithm information (e.g., environment, policy, etc.). Subclasses should call the base method to save the policy.
- Parameters:
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
- step(snapshot_mode: str, meta_info: Optional[dict] = None)[source]
Perform a single iteration of the algorithm. This includes collecting the data, updating the parameters, and adding the metrics of interest to the logger. Does not update the curr_iter attribute.
- Parameters:
snapshot_mode – determines when the snapshots are stored (e.g. on every iteration or on new highscore)
meta_info – is not None if this algorithm is run as a subroutine of a meta-algorithm, contains a dict of information about the current iteration of the meta-algorithm
- property subrtn: ParameterExploring
Get the subroutine used for updating the domain parameter distribution.
- static truncate_rollouts(rollouts_real: Sequence[StepSequence], rollouts_sim: Sequence[StepSequence], replicate: bool = True) Tuple[Sequence[StepSequence], Sequence[StepSequence]] [source]
In case (some of the) rollouts failed or succeeded in one domain but not in the other, we truncate the longer observation sequence. When truncating, we compare each of the M real rollouts to each of the N simulated rollouts, thus replicating the real rollouts N times and the simulated rollouts M times.
- Parameters:
rollouts_real – M real-world rollouts of different length if replicate = True, else K real-world rollouts of different length
rollouts_sim – N simulated rollouts of different length if replicate = True, else K simulated rollouts of different length
replicate – if False the i-th rollout from rollouts_real is (only) compared with the i-th rollout from rollouts_sim, in this case the number of rollouts and the initial states have to match
- Returns:
MxN real-world rollouts and MxN simulated rollouts of equal length if replicate = True, else K real-world rollouts and K simulated rollouts of equal length
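The pairing and truncation for replicate = True can be illustrated with plain Python lists standing in for StepSequence objects; the names below are placeholders, and the real method slices StepSequence instances instead:

    from itertools import product

    def truncate_pairs_sketch(rollouts_real, rollouts_sim):
        # Pair every real rollout with every simulated one (M x N pairs) and cut
        # both members of a pair to the length of the shorter one.
        real_out, sim_out = [], []
        for ro_r, ro_s in product(rollouts_real, rollouts_sim):
            n = min(len(ro_r), len(ro_s))
            real_out.append(ro_r[:n])
            sim_out.append(ro_s[:n])
        return real_out, sim_out

    # toy usage with lists of observations
    real = [[1, 2, 3], [1, 2, 3, 4, 5]]
    sim = [[9, 8], [7, 6, 5, 4]]
    r, s = truncate_pairs_sketch(real, sim)  # 2 x 2 = 4 pairs of equal length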
- static weighted_l1_l2_metric(err: ndarray, w_abs: float, w_sq: float, obs_dim_weight: ndarray)[source]
Compute the weighted linear combination of the observation error’s MAE and MSE, averaged over time
Note
In contrast to [1], we are using the mean absolute error and the mean squared error instead of the L1 and the L2 norm. The reason for this is that longer time series would be punished otherwise.
- Parameters:
err – error signal with time steps along the first dimension
w_abs – weight for the mean absolute errors
w_sq – weight for the mean squared errors
obs_dim_weight – (diagonal) weight matrix for the different observation dimensions
- Returns:
weighted linear combination of the error’s MAE and MSE, averaged over time
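A minimal sketch of this metric, assuming the error array has time steps along the first axis and observation dimensions along the second; where exactly the dimension weights enter may differ slightly from the static method:

    import numpy as np

    def weighted_l1_l2_sketch(err, w_abs, w_sq, obs_dim_weight):
        # err: (num_steps, num_obs_dims). Combine per-dimension MAE and MSE, averaged
        # over time, then weight and sum over the observation dimensions.
        mae = np.mean(np.abs(err), axis=0)
        mse = np.mean(err ** 2, axis=0)
        return float(np.sum(obs_dim_weight * (w_abs * mae + w_sq * mse)))

    cost = weighted_l1_l2_sketch(np.random.randn(100, 3), w_abs=0.5, w_sq=1.0,
                                 obs_dim_weight=np.array([1.0, 1.0, 2.0]))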