exploration

normal_noise

class DiagNormalNoise(noise_dim: Union[int, tuple], std_init: Union[float, int, Tensor], std_min: Union[float, Tensor] = 0.001, train_mean: bool = False, learnable: bool = True, use_cuda: bool = False)[source]

Bases: Module

Module for learnable additive Gaussian noise with a diagonal covariance matrix

Constructor

Parameters:
  • noise_dim – number of dimensions

  • std_init – initial standard deviation for the exploration noise

  • std_min – minimal standard deviation for the exploration noise

  • train_mean – True if the noise should have an adaptive nonzero mean, False otherwise

  • learnable – True if the parameters should be tuneable (default), False for shallow use (just sampling)

  • use_cuda – True to move the module to the GPU, False (default) to use the CPU
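
A minimal usage sketch based only on the signatures documented here (the import path follows the normal_noise module heading above; tensor shapes and values are illustrative assumptions):

    import torch as to
    from pyrado.exploration.normal_noise import DiagNormalNoise

    # Learnable diagonal Gaussian noise for a 3-dimensional action or parameter space
    noise = DiagNormalNoise(noise_dim=3, std_init=0.5, std_min=1e-3)

    # Query the noise distribution around a noise-free value and draw a sample
    distr = noise(to.zeros(3))  # a torch.distributions.Normal, see forward() below
    noisy_value = distr.sample()

    # Shrink the standard deviation while leaving the mean untouched
    noise.adapt(std=0.1)
    print(noise.get_entropy())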

adapt(mean: Optional[Tensor] = None, std: Optional[Union[Tensor, float]] = None)[source]

Adapt the mean and the variance of the noise on the action or parameters. Use None to leave one of the parameters at their current value.

Parameters:
  • mean – exploration strategy’s new mean

  • std – exploration strategy’s new standard deviation

property device: str

Get the device (CPU or GPU) on which the policy is stored.

forward(value: Tensor) Normal[source]

Return the noise distribution for a specific noise-free value.

Parameters:

value – value to evaluate the distribution around

Returns:

noise distribution

get_entropy() Tensor[source]

Get the exploration distribution’s entropy. The entropy of a normal distribution is independent of the mean.

Returns:

entropy value

reset_expl_params()[source]

Reset all parameters of the exploration strategy.

property std: Tensor

Get the untransformed standard deviation from the log-transformed.

training: bool
class FullNormalNoise(noise_dim: Union[int, tuple], std_init: Union[float, int, Tensor], std_min: Union[float, Tensor] = 0.001, train_mean: bool = False, learnable: bool = True, use_cuda: bool = False)[source]

Bases: Module

Module for learnable additive Gaussian noise with a full covariance matrix

Constructor

Parameters:
  • noise_dim – number of dimensions

  • std_init – initial standard deviation for the exploration noise

  • std_min – minimal standard deviation for the exploration noise

  • train_mean – True if the noise should have an adaptive nonzero mean, False otherwise

  • learnable – True if the parameters should be tuneable (default), False for shallow use (just sampling)

  • use_cuda – True to move the module to the GPU, False (default) to use the CPU
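
A corresponding sketch for the full-covariance case, again only using the signatures listed here (import path and shapes are assumptions):

    import torch as to
    from pyrado.exploration.normal_noise import FullNormalNoise

    # Learnable Gaussian noise with a full covariance matrix over 2 dimensions
    noise = FullNormalNoise(noise_dim=2, std_init=1.0)

    # The forward pass yields a MultivariateNormal centered at the given value
    distr = noise(to.zeros(2))
    sample = distr.sample()

    # Adapt the covariance while keeping the mean at its current value
    noise.adapt(cov=0.25 * to.eye(2))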

adapt(mean: Optional[Tensor] = None, cov: Optional[Union[Tensor, float]] = None)[source]

Adapt the mean and the variance of the noise on the action or parameters. Use None to leave one of the parameters at their current value.

Parameters:
  • mean – exploration strategy’s new mean

  • cov – exploration strategy’s new covariance matrix

property device: str

Get the device (CPU or GPU) on which the policy is stored.

forward(value: Tensor) MultivariateNormal[source]

Return the noise distribution for a specific noise-free value.

Parameters:

value – value to evaluate the distribution around

Returns:

noise distribution

get_entropy() Tensor[source]

Get the exploration distribution’s entropy. The entropy of a normal distribution is independent of the mean.

Returns:

entropy value

reset_expl_params()[source]

Reset all parameters of the exploration strategy.

property std: Tensor

Get the standard deviations from the internal covariance matrix.

training: bool

stochastic_action

class EpsGreedyExplStrat(policy: Policy, eps: float = 1.0, eps_schedule_gamma: float = 0.99, eps_final: float = 0.05)[source]

Bases: StochasticActionExplStrat

Exploration strategy which selects discrete actions epsilon-greedily

Constructor

Parameters:
  • policy – wrapped policy

  • eps – parameter determining the greediness, can be optimized or scheduled

  • eps_schedule_gamma – temporal discount factor for the exponential decay of epsilon

  • eps_final – minimum value of epsilon
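
A hedged usage sketch of the epsilon-greedy wrapper; discrete_policy and obs are placeholders for an existing pyrado Policy with discrete actions and an environment observation, and the import path is assumed from the stochastic_action module heading above:

    from pyrado.exploration.stochastic_action import EpsGreedyExplStrat

    # Wrap an existing policy with discrete actions (placeholder name)
    expl_strat = EpsGreedyExplStrat(discrete_policy, eps=1.0, eps_schedule_gamma=0.99, eps_final=0.05)

    # During training: sample (possibly random) actions and decay epsilon over time
    act = expl_strat(obs)             # obs is an observation from the environment
    expl_strat.schedule_eps(steps=1)  # exponential decay of eps towards eps_final

    # For evaluation: switch off the exploration entirely
    expl_strat.eval()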

action_dist_at(policy_output: Tensor) Distribution[source]

Return the action distribution for the given output from the wrapped policy.

Parameters:

policy_output – output from the wrapped policy, i.e. the noise-free action values

Returns:

action distribution

eval()[source]

Call PyTorch’s eval function and disable all exploration.

forward(obs: Tensor, *extra) -> Union[Tensor, tuple][source]

Get the action according to the policy and the observations (forward pass).

Parameters:
  • args – inputs, e.g. an observation from the environment or an observation and a hidden state

  • kwargs – inputs, e.g. an observation from the environment or an observation and a hidden state

Returns:

outputs, e.g. an action or an action and a hidden state

schedule_eps(steps: int)[source]
train(mode=True)[source]

Call PyTorch’s train function and re-activate all exploration.

class NormalActNoiseExplStrat(policy: Policy, std_init: Union[float, Tensor], std_min: Union[float, Tensor] = 0.001, train_mean: bool = False, learnable: bool = True)[source]

Bases: StochasticActionExplStrat

Exploration strategy which adds Gaussian noise to the continuous policy actions

Constructor

Parameters:
  • policy – wrapped policy

  • std_init – initial standard deviation for the exploration noise

  • std_min – minimal standard deviation for the exploration noise

  • train_mean – set True if the noise should have an adaptive nonzero mean, False otherwise

  • learnable – True if the parameters should be tuneable (default), False for shallow use (just sampling)
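
A hedged usage sketch; continuous_policy, obs and rollout are placeholders for an existing pyrado Policy with continuous actions, an environment observation and a collected StepSequence, and the import path is assumed from the module heading above:

    from pyrado.exploration.stochastic_action import NormalActNoiseExplStrat

    # Add Gaussian exploration noise to the actions of an existing continuous policy
    expl_strat = NormalActNoiseExplStrat(continuous_policy, std_init=0.3, std_min=1e-3)

    # Sample an exploratory action for a given observation
    act = expl_strat(obs)

    # Re-evaluate a collected rollout to obtain the action distribution with gradient information
    act_distr = expl_strat.evaluate(rollout)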

action_dist_at(policy_output: Tensor) Distribution[source]

Return the action distribution for the given output from the wrapped policy.

Parameters:

policy_output – output from the wrapped policy, i.e. the noise-free action values

Returns:

action distribution

get_entropy(*args, **kwargs)[source]
property mean
property noise: DiagNormalNoise

Get the exploration noise.

reset_expl_params(*args, **kwargs)[source]
property std
class SACExplStrat(policy: Policy)[source]

Bases: StochasticActionExplStrat

State-dependent exploration strategy which adds normal noise to the continuous actions and squashes the result with a tanh.

Note

This exploration strategy is specifically designed for SAC. Due to the tanh transformation, it returns action values within [-1,1].
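
To illustrate the note, a tiny standalone sketch of the tanh squashing idea (this is not the implementation of SACExplStrat, only the underlying transformation):

    import torch as to
    from torch.distributions import Normal

    mean, std = to.zeros(2), to.ones(2)
    raw_act = Normal(mean, std).rsample()  # unbounded Gaussian sample
    squashed_act = to.tanh(raw_act)        # guaranteed to lie within [-1, 1]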

Constructor

Parameters:

policy – wrapped policy

action_dist_at(policy_out_1: Tensor, policy_out_2: Tensor) Distribution[source]

Return the action distribution for the given output from the wrapped policy. This method is made for two-headed policies, e.g. used with SAC.

Parameters:
  • policy_out_1 – first head’s output from the wrapped policy, noise-free action values

  • policy_out_2 – second head’s output from the wrapped policy, state-dependent log std values

Returns:

action distribution at the mean given by policy_out_1

evaluate(rollout: StepSequence, hidden_states_name: str = 'hidden_states') Distribution[source]

Re-evaluate the given rollout using the policy wrapped by this exploration strategy. Use this method to get gradient data on the action distribution. This version is tailored to the two-headed policy architecture used for SAC, since it requires a two-headed policy, where the first head returns the mean action and the second head returns the state-dependent std.

Parameters:
  • rollout – complete rollout

  • hidden_states_name – name of hidden states rollout entry, used for recurrent networks

Returns:

actions with gradient data

forward(obs: Tensor, *extra) -> Union[Tuple[Tensor, Tensor], Tuple[Tensor, Tensor, Tensor]][source]

Get the action according to the policy and the observations (forward pass).

Parameters:
  • args – inputs, e.g. an observation from the environment or an observation and a hidden state

  • kwargs – inputs, e.g. an observation from the environment or an observation and a hidden state

Returns:

outputs, e.g. an action or an action and a hidden state

get_entropy(*args, **kwargs)[source]
property mean
property noise: DiagNormalNoise

Get the exploration noise.

reset_expl_params(*args, **kwargs)[source]
property std
class StochasticActionExplStrat(policy: Policy)[source]

Bases: Policy, ABC

Explore by sampling actions from a distribution.

Constructor

Parameters:

policy – wrapped policy

action_dist_at(policy_output: Tensor) Distribution[source]

Return the action distribution for the given output from the wrapped policy.

Parameters:

policy_output – output from the wrapped policy, i.e. the noise-free action values

Returns:

action distribution
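
Conceptually, the forward pass of such a strategy queries the wrapped noise-free policy, builds the action distribution around its output via action_dist_at, and samples from it. A hedged pseudocode sketch of this pattern (the .policy attribute and the helper name are assumptions, not the actual implementation):

    import torch as to

    def explore_step(expl_strat, obs: to.Tensor) -> to.Tensor:
        # Hypothetical helper illustrating the pattern described above
        clean_act = expl_strat.policy(obs)                # assumes the wrapped policy is exposed as .policy
        act_distr = expl_strat.action_dist_at(clean_act)  # distribution centered on the noise-free action
        return act_distr.sample()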

evaluate(rollout: StepSequence, hidden_states_name: str = 'hidden_states') Distribution[source]

Re-evaluate the given rollout using the policy wrapped by this exploration strategy. Use this method to get gradient data on the action distribution.

Parameters:
  • rollout – complete rollout

  • hidden_states_name – name of hidden states rollout entry, used for recurrent networks

Returns:

actions with gradient data

forward(obs: Tensor, *extra) -> Union[Tensor, tuple][source]

Get the action according to the policy and the observations (forward pass).

Parameters:
  • args – inputs, e.g. an observation from the environment or an observation and a hidden state

  • kwargs – inputs, e.g. an observation from the environment or an observation and a hidden state

Returns:

outputs, e.g. an action or an action and a hidden state

init_hidden(batch_size: Optional[int] = None) Tensor[source]

Provide initial values for the hidden parameters. This should usually be a zero tensor. The default implementation raises an error, which enforces that recurrent policies override this function.

Parameters:

batch_size – number of states to track in parallel

Returns:

Tensor of batch_size x hidden_size

init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

property is_recurrent: bool

Bool to signalise whether the policy has a recurrent architecture.

reset(**kwargs)[source]

Reset the policy’s internal state. This should be called at the start of a rollout. The default implementation does nothing.

class UniformActNoiseExplStrat(policy: Policy, halfspan_init: Union[float, Tensor], halfspan_min: Union[float, list] = 0.01, train_mean: bool = False, learnable: bool = True)[source]

Bases: StochasticActionExplStrat

Exploration strategy which adds uniform noise to the continuous policy actions

Constructor

Parameters:
  • policy – wrapped policy

  • halfspan_init – initial value of the half interval for the exploration noise

  • halfspan_min – minimal value of the half interval for the exploration noise

  • train_mean – set True if the noise should have an adaptive nonzero mean, False otherwise

  • learnable – True if the parameters should be tuneable (default), False for shallow use (just sampling)
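
A brief hedged sketch, analogous to the Gaussian case above (continuous_policy and obs are placeholders, the import path is assumed):

    from pyrado.exploration.stochastic_action import UniformActNoiseExplStrat

    # Perturb the actions of an existing continuous policy with uniform noise
    expl_strat = UniformActNoiseExplStrat(continuous_policy, halfspan_init=0.5, halfspan_min=0.01)
    act = expl_strat(obs)  # action perturbed within the current half interval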

action_dist_at(policy_output: Tensor) Distribution[source]

Return the action distribution for the given output from the wrapped policy.

Parameters:

policy_output – output from the wrapped policy, i.e. the noise-free action values

Returns:

action distribution

get_entropy(*args, **kwargs)[source]
property halfspan
property noise: UniformNoise

Get the exploration noise.

reset_expl_params(*args, **kwargs)[source]

stochastic_params

class HyperSphereParamNoise(param_dim: int, expl_r_init: float = 1.0)[source]

Bases: StochasticParamExplStrat

Sampling parameters from a hyper-sphere of adjustable radius

Constructor

Parameters:
  • param_dim – number of policy parameters

  • expl_r_init – initial radius of the hyper-sphere
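
A hedged usage sketch for parameter-space exploration on a hyper-sphere, based only on the methods listed below (the import path is assumed from the stochastic_params module heading above):

    import torch as to
    from pyrado.exploration.stochastic_params import HyperSphereParamNoise

    expl_strat = HyperSphereParamNoise(param_dim=10, expl_r_init=1.0)

    # Perturb a nominal (flattened) policy parameter vector
    nominal = to.zeros(10)
    candidate = expl_strat.sample_param_set(nominal)

    # Increase the exploration radius, or reset it back to its initial value
    expl_strat.adapt(r=2.0)
    expl_strat.reset_expl_params()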

adapt(r: float)[source]

Set a new radius for the hyper sphere from which the policy parameters are sampled.

property r: float

Get the radius of the hypersphere.

reset_expl_params()[source]

Reset all parameters of the exploration strategy.

sample_param_set(nominal_params: Tensor) Tensor[source]

Sample one set of policy parameters from the current distribution.

Parameters:

nominal_params – parameter set (1-dim tensor) to sample around

Returns:

sampled parameter set (1-dim tensor)

class NormalParamNoise(param_dim: int, full_cov: bool = False, std_init: float = 1.0, std_min: Union[float, Sequence[float]] = 0.01, train_mean: bool = False, use_cuda: bool = False)[source]

Bases: StochasticParamExplStrat

Sampling parameters from a normal distribution

Constructor

Parameters:
  • param_dim – number of policy parameters

  • full_cov – use a full covariance matrix or a diagonal covariance matrix (independent random variables)

  • std_init – initial standard deviation for the noise distribution

  • std_min – minimal standard deviation for the exploration noise

  • train_mean – set True if the noise should have an adaptive nonzero mean, False otherwise

  • use_cuda – True to move the module to the GPU, False (default) to use the CPU
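
A hedged sketch of Gaussian parameter-space exploration, typically applied to a policy’s flattened parameter vector (here replaced by a plain tensor; the import path is assumed from the module heading above):

    import torch as to
    from pyrado.exploration.stochastic_params import NormalParamNoise

    expl_strat = NormalParamNoise(param_dim=8, full_cov=False, std_init=0.5)

    # Draw 20 candidate parameter sets around the nominal parameters,
    # prepending the nominal values as the first row
    nominal = to.zeros(8)
    candidates = expl_strat.sample_param_sets(nominal, num_samples=20, include_nominal_params=True)
    # candidates is a (20 + 1) x 8 tensor, see sample_param_sets() below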

adapt(*args, **kwargs)[source]
property cov
get_entropy(*args, **kwargs)[source]
property noise: Union[FullNormalNoise, DiagNormalNoise]

Get the exploration noise.

reset_expl_params(*args, **kwargs)[source]
sample_param_set(nominal_params: Tensor) Tensor[source]

Sample one set of policy parameters from the current distribution.

Parameters:

nominal_params – parameter set (1-dim tensor) to sample around

Returns:

sampled parameter set (1-dim tensor)

sample_param_sets(nominal_params: Tensor, num_samples: int, include_nominal_params: bool = False) Tensor[source]

Sample multiple sets of policy parameters from the current distribution.

Parameters:
  • nominal_params – parameter set (1-dim tensor) to sample around

  • num_samples – number of parameter sets

  • include_nominal_params – True to include the nominal parameter values as first parameter set

Returns:

policy parameter sets as NxP or (N+1)xP tensor where N is the number of samples and P is the number of policy parameters

property std
class StochasticParamExplStrat(param_dim: int)[source]

Bases: ABC

Exploration strategy which samples policy parameters from a distribution

Constructor

Parameters:

param_dim – number of policy parameters

abstract sample_param_set(nominal_params: Tensor) Tensor[source]

Sample one set of policy parameters from the current distribution.

Parameters:

nominal_params – parameter set (1-dim tensor) to sample around

Returns:

sampled parameter set (1-dim tensor)

sample_param_sets(nominal_params: Tensor, num_samples: int, include_nominal_params: bool = False) Tensor[source]

Sample multiple sets of policy parameters from the current distribution.

Parameters:
  • nominal_params – parameter set (1-dim tensor) to sample around

  • num_samples – number of parameter sets

  • include_nominal_params – True to include the nominal parameter values as first parameter set

Returns:

policy parameter sets as NxP or (N+1)xP tensor where N is the number of samples and P is the number of policy parameters

class SymmParamExplStrat(wrapped: StochasticParamExplStrat)[source]

Bases: StochasticParamExplStrat

Wrap a parameter exploration strategy to enforce symmetric sampling. The function sample_param_sets will always return an even number of parameter sets, and it is guaranteed that ps[:len(ps)//2] == -ps[len(ps)//2:]

Constructor

Parameters:

wrapped – exploration strategy to wrap around
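
A hedged sketch of symmetric (antithetic) parameter sampling by wrapping the Gaussian strategy from above (import path assumed; the assertion restates the guarantee from the class description):

    import torch as to
    from pyrado.exploration.stochastic_params import NormalParamNoise, SymmParamExplStrat

    inner = NormalParamNoise(param_dim=4, std_init=0.2)
    symm = SymmParamExplStrat(inner)

    # Sampling around a zero nominal vector yields mirrored halves
    ps = symm.sample_param_sets(to.zeros(4), num_samples=6)
    half = ps.shape[0] // 2
    assert to.allclose(ps[:half], -ps[half:])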

sample_param_set(nominal_params: Tensor) Tensor[source]

Sample one set of policy parameters from the current distribution.

Parameters:

nominal_params – parameter set (1-dim tensor) to sample around

Returns:

sampled parameter set (1-dim tensor)

sample_param_sets(nominal_params: Tensor, num_samples: int, include_nominal_params: bool = False) Tensor[source]

Sample multiple sets of policy parameters from the current distribution.

Parameters:
  • nominal_params – parameter set (1-dim tensor) to sample around

  • num_samples – number of parameter sets

  • include_nominal_params – True to include the nominal parameter values as first parameter set

Returns:

policy parameter sets as NxP or (N+1)xP tensor where N is the number of samples and P is the number of policy parameters

uniform_noise

class UniformNoise(use_cuda: bool, noise_dim: Union[int, tuple], halfspan_init: Union[float, int, Tensor], halfspan_min: Union[float, Tensor] = 0.01, train_mean: bool = False, learnable: bool = True)[source]

Bases: Module

Module for learnable additive uniform noise

Constructor

Parameters:
  • use_cuda – True to move the module to the GPU, False (default) to use the CPU

  • noise_dim – number of dimensions

  • halfspan_init – initial value of the half interval for the exploration noise

  • halfspan_min – minimal value of the half interval for the exploration noise

  • train_mean – True if the noise should have an adaptive nonzero mean, False otherwise

  • learnable – True if the parameters should be tuneable (default), False for shallow use (just sampling)
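
A minimal hedged sketch mirroring the Gaussian noise modules above (the import path follows the uniform_noise module heading; shapes are assumptions):

    import torch as to
    from pyrado.exploration.uniform_noise import UniformNoise

    noise = UniformNoise(use_cuda=False, noise_dim=2, halfspan_init=0.5)

    # The forward pass yields a Uniform distribution centered on the given value
    distr = noise(to.zeros(2))
    sample = distr.sample()

    # Narrow the interval while keeping the mean at its current value
    noise.adapt(halfspan=0.1)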

adapt(mean: Optional[Tensor] = None, halfspan: Optional[Union[Tensor, float]] = None)[source]

Adapt the mean and the half interval span of the noise on the action or parameters. Use None to leave one of the parameters at their current value.

Parameters:
  • mean – exploration strategy’s new mean

  • halfspan – exploration strategy’s new half interval span

property device: str

Get the device (CPU or GPU) on which the policy is stored.

forward(value: Tensor) Uniform[source]

Return the noise distribution for a specific noise-free value.

Parameters:

value – value to evaluate the distribution around

Returns:

noise distribution

get_entropy() Tensor[source]

Get the exploration distribution’s entropy. The entropy of a uniform distribution is independent of the mean.

Returns:

entropy value

property halfspan: Tensor

Get the untransformed half interval span given the log-transformed one.

reset_expl_params()[source]

Reset all parameters of the exploration strategy.

training: bool

Module contents