exploration
normal_noise
- class DiagNormalNoise(noise_dim: Union[int, tuple], std_init: Union[float, int, torch.Tensor], std_min: Union[float, torch.Tensor] = 0.001, train_mean: bool = False, learnable: bool = True, use_cuda: bool = False)[source]
Bases:
Module
Module for learnable additive Gaussian noise with a diagonal covariance matrix
Constructor
- Parameters:
noise_dim – number of dimensions
std_init – initial standard deviation for the exploration noise
std_min – minimal standard deviation for the exploration noise
train_mean – True if the noise should have an adaptive nonzero mean, False otherwise
learnable – True if the parameters should be tuneable (default), False for shallow use (just sampling)
use_cuda – True to move the module to the GPU, False (default) to use the CPU
- adapt(mean: Optional[Tensor] = None, std: Optional[Union[Tensor, float]] = None)[source]
Adapt the mean and the standard deviation of the noise on the action or parameters. Pass None to leave a parameter at its current value.
- Parameters:
mean – exploration strategy’s new mean
std – exploration strategy’s new standard deviation
- property device: str
Get the device (CPU or GPU) on which the module is stored.
- forward(value: Tensor) Normal [source]
Return the noise distribution for a specific noise-free value.
- Parameters:
value – value to evaluate the distribution around
- Returns:
noise distribution
- get_entropy() Tensor [source]
Get the exploration distribution’s entropy. The entropy of a normal distribution is independent of the mean.
- Returns:
entropy value
- property std: Tensor
Get the untransformed standard deviation from the log-transformed one.
- training: bool
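A minimal usage sketch based on the signatures documented above (dimensions and values are purely illustrative): construct the noise module, query the distribution around a noise-free value, and later shrink the standard deviation via adapt().

```python
import torch as to
from pyrado.exploration.normal_noise import DiagNormalNoise

# Diagonal Gaussian noise over a 2-dimensional quantity (illustrative values)
noise = DiagNormalNoise(noise_dim=2, std_init=0.5, std_min=1e-3)

# Calling the module returns a torch.distributions.Normal centered at the given value
clean = to.tensor([0.1, -0.3])
distr = noise(clean)
noisy = distr.sample()

# Shrink the exploration noise later on and inspect it
noise.adapt(std=to.tensor([0.1, 0.1]))
print(noise.std, noise.get_entropy())
```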
- class FullNormalNoise(noise_dim: Union[int, tuple], std_init: Union[float, int, torch.Tensor], std_min: Union[float, torch.Tensor] = 0.001, train_mean: bool = False, learnable: bool = True, use_cuda: bool = False)[source]
Bases:
Module
Module for learnable additive Gaussian noise with a full covariance matrix
Constructor
- Parameters:
noise_dim – number of dimensions
std_init – initial standard deviation for the exploration noise
std_min – minimal standard deviation for the exploration noise
train_mean – True if the noise should have an adaptive nonzero mean, False otherwise
learnable – True if the parameters should be tuneable (default), False for shallow use (just sampling)
use_cuda – True to move the module to the GPU, False (default) to use the CPU
- adapt(mean: Optional[Tensor] = None, cov: Optional[Union[Tensor, float]] = None)[source]
Adapt the mean and the covariance of the noise on the action or parameters. Pass None to leave a parameter at its current value.
- Parameters:
mean – exploration strategy’s new mean
cov – exploration strategy’s new covariance matrix
- property device: str
Get the device (CPU or GPU) on which the module is stored.
- forward(value: Tensor) MultivariateNormal [source]
Return the noise distribution for a specific noise-free value.
- Parameters:
value – value to evaluate the distribution around
- Returns:
noise distribution
- get_entropy() Tensor [source]
Get the exploration distribution’s entropy. The entropy of a normal distribution is independent of the mean.
- Returns:
entropy value
- property std: Tensor
Get the standard deviations from the internal covariance matrix.
- training: bool
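Analogously, a short sketch for the full-covariance variant: here adapt() receives a covariance matrix and forward() yields a MultivariateNormal (illustrative values).

```python
import torch as to
from pyrado.exploration.normal_noise import FullNormalNoise

noise = FullNormalNoise(noise_dim=2, std_init=0.5)

# MultivariateNormal centered at the given noise-free value
distr = noise(to.tensor([0.0, 0.0]))
sample = distr.sample()

# adapt() takes a full 2x2 covariance matrix for 2-dimensional noise
noise.adapt(cov=to.tensor([[0.04, 0.01], [0.01, 0.09]]))
print(noise.std)  # per-dimension standard deviations from the covariance
```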
stochastic_action
- class EpsGreedyExplStrat(policy: Policy, eps: float = 1.0, eps_schedule_gamma: float = 0.99, eps_final: float = 0.05)[source]
Bases:
StochasticActionExplStrat
Exploration strategy which selects discrete actions epsilon-greedily
Constructor
- Parameters:
policy – wrapped policy
eps – parameter determining the greediness, can be optimized or scheduled
eps_schedule_gamma – temporal discount factor for the exponential decay of epsilon
eps_final – minimum value of epsilon
- action_dist_at(policy_output: Tensor) Distribution [source]
Return the action distribution for the given output from the wrapped policy.
- Parameters:
policy_output – output from the wrapped policy, i.e. the noise-free action values
- Returns:
action distribution
- forward(obs: Tensor, *extra) -> Union[Tensor, tuple][source]
Get the action according to the policy and the observations (forward pass).
- Parameters:
obs – observation from the environment
extra – additional inputs, e.g. a hidden state for recurrent policies
- Returns:
outputs, e.g. an action or an action and a hidden state
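For intuition, a self-contained torch-only sketch of epsilon-greedy selection with exponentially decaying epsilon, mirroring the roles of eps, eps_schedule_gamma, and eps_final above; it does not use the class itself and is not pyrado's implementation.

```python
import torch

def eps_greedy(q_values: torch.Tensor, eps: float) -> int:
    """With probability eps pick a uniformly random action index, otherwise the greedy one."""
    if torch.rand(()).item() < eps:
        return int(torch.randint(q_values.numel(), (1,)).item())
    return int(q_values.argmax().item())

eps, gamma, eps_final = 1.0, 0.99, 0.05  # correspond to eps, eps_schedule_gamma, eps_final
q = torch.tensor([0.2, 1.3, -0.5])
for _ in range(200):
    action = eps_greedy(q, eps)
    eps = max(eps * gamma, eps_final)  # exponential decay, clipped at the minimum epsilon
```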
- class NormalActNoiseExplStrat(policy: Policy, std_init: Union[float, torch.Tensor], std_min: Union[float, torch.Tensor] = 0.001, train_mean: bool = False, learnable: bool = True)[source]
Bases:
StochasticActionExplStrat
Exploration strategy which adds Gaussian noise to the continuous policy actions
Constructor
- Parameters:
policy – wrapped policy
std_init – initial standard deviation for the exploration noise
std_min – minimal standard deviation for the exploration noise
train_mean – set True if the noise should have an adaptive nonzero mean, False otherwise
learnable – True if the parameters should be tuneable (default), False for shallow use (just sampling)
- action_dist_at(policy_output: Tensor) Distribution [source]
Return the action distribution for the given output from the wrapped policy.
- Parameters:
policy_output – output from the wrapped policy, i.e. the noise-free action values
- Returns:
action distribution
- property mean
- property noise: DiagNormalNoise
Get the exploration noise.
- property std
- class SACExplStrat(policy: Policy)[source]
Bases:
StochasticActionExplStrat
State-dependent exploration strategy which adds normal noise, squashed by a tanh, to the continuous actions.
Note
This exploration strategy is specifically designed for SAC. Due to the tanh transformation, it returns action values within [-1,1].
Constructor
- Parameters:
policy – wrapped policy
- action_dist_at(policy_out_1: Tensor, policy_out_2: Tensor) Distribution [source]
Return the action distribution for the given output from the wrapped policy. This method is made for two-headed policies, e.g. used with SAC.
- Parameters:
policy_out_1 – first head’s output from the wrapped policy, noise-free action values
policy_out_2 – second head’s output from the wrapped policy, state-dependent log std values
- Returns:
action distribution at the mean given by policy_out_1
- evaluate(rollout: StepSequence, hidden_states_name: str = 'hidden_states') Distribution [source]
Re-evaluate the given rollout using the policy wrapped by this exploration strategy. Use this method to get gradient data on the action distribution. This version is tailored to the two-headed policy architecture used for SAC: the first head returns the mean action and the second head returns the state-dependent std.
- Parameters:
rollout – complete rollout
hidden_states_name – name of hidden states rollout entry, used for recurrent networks
- Returns:
actions with gradient data
- forward(obs: Tensor, *extra) Union[Tuple[Tensor, Tensor], Tuple[Tensor, Tensor, Tensor]] [source]
Get the action according to the policy and the observations (forward pass).
- Parameters:
obs – observation from the environment
extra – additional inputs, e.g. a hidden state for recurrent policies
- Returns:
outputs, e.g. an action or an action and a hidden state
- property mean
- property noise: DiagNormalNoise
Get the exploration noise.
- property std
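For intuition, a self-contained torch sketch of a tanh-squashed Gaussian like the one SACExplStrat builds from the two policy heads (mean and state-dependent log std); the concrete distribution class used inside pyrado may differ.

```python
import torch
from torch.distributions import Normal, TransformedDistribution
from torch.distributions.transforms import TanhTransform

# Two-headed policy output: mean and state-dependent log std (illustrative values)
mean = torch.tensor([0.3, -0.8])
log_std = torch.tensor([-1.0, -0.5])

base = Normal(mean, log_std.exp())
squashed = TransformedDistribution(base, [TanhTransform(cache_size=1)])

action = squashed.rsample()           # reparametrized sample, always within (-1, 1)
log_prob = squashed.log_prob(action)  # includes the tanh change-of-variables correction
```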
- class StochasticActionExplStrat(policy: Policy)[source]
Bases:
Policy, ABC
Explore by sampling actions from a distribution.
Constructor
- Parameters:
policy – wrapped policy
- action_dist_at(policy_output: Tensor) Distribution [source]
Return the action distribution for the given output from the wrapped policy.
- Parameters:
policy_output – output from the wrapped policy, i.e. the noise-free action values
- Returns:
action distribution
- evaluate(rollout: StepSequence, hidden_states_name: str = 'hidden_states') Distribution [source]
Re-evaluate the given rollout using the policy wrapped by this exploration strategy. Use this method to get gradient data on the action distribution.
- Parameters:
rollout – complete rollout
hidden_states_name – name of hidden states rollout entry, used for recurrent networks
- Returns:
actions with gradient data
- forward(obs: Tensor, *extra) -> Union[Tensor, tuple][source]
Get the action according to the policy and the observations (forward pass).
- Parameters:
obs – observation from the environment
extra – additional inputs, e.g. a hidden state for recurrent policies
- Returns:
outputs, e.g. an action or an action and a hidden state
- init_hidden(batch_size: Optional[int] = None) Tensor [source]
Provide initial values for the hidden parameters. This should usually be a zero tensor. The default implementation will raise an error, to enforce overriding this function for recurrent policies.
- Parameters:
batch_size – number of states to track in parallel
- Returns:
Tensor of batch_size x hidden_size
- init_param(init_values: Optional[Tensor] = None, **kwargs)[source]
Initialize the policy’s parameters. By default the parameters are initialized randomly.
- Parameters:
init_values – tensor of fixed initial policy parameter values
kwargs – additional keyword arguments for the policy parameter initialization
- property is_recurrent: bool
Bool that signals whether the policy has a recurrent architecture.
- class UniformActNoiseExplStrat(policy: Policy, halfspan_init: Union[float, torch.Tensor], halfspan_min: Union[float, list] = 0.01, train_mean: bool = False, learnable: bool = True)[source]
Bases:
StochasticActionExplStrat
Exploration strategy which adds uniform noise to the continuous policy actions
Constructor
- Parameters:
policy – wrapped policy
halfspan_init – initial value of the half interval for the exploration noise
halfspan_min – minimal value of the half interval for the exploration noise
train_mean – set True if the noise should have an adaptive nonzero mean, False otherwise
learnable – True if the parameters should be tuneable (default), False for shallow use (just sampling)
- action_dist_at(policy_output: Tensor) Distribution [source]
Return the action distribution for the given output from the wrapped policy.
- Parameters:
policy_output – output from the wrapped policy, i.e. the noise-free action values
- Returns:
action distribution
- property halfspan
- property noise: UniformNoise
Get the exploration noise.
stochastic_params
- class HyperSphereParamNoise(param_dim: int, expl_r_init: float = 1.0)[source]
Bases:
StochasticParamExplStrat
Sampling parameters from a hyper-sphere
Constructor
- Parameters:
param_dim – number of policy parameters
expl_r_init – initial radius of the hyper-sphere
- adapt(r: float)[source]
Set a new radius for the hyper-sphere from which the policy parameters are sampled.
- property r: float
Get the radius of the hyper-sphere.
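A minimal usage sketch, assuming this class lives in pyrado.exploration.stochastic_params as the section name suggests (values are illustrative).

```python
import torch as to
from pyrado.exploration.stochastic_params import HyperSphereParamNoise

expl = HyperSphereParamNoise(param_dim=5, expl_r_init=1.0)

nominal = to.zeros(5)                       # current policy parameters
candidate = expl.sample_param_set(nominal)  # one perturbed parameter set
expl.adapt(r=0.5)                           # shrink the hyper-sphere radius
print(expl.r)
```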
- class NormalParamNoise(param_dim: int, full_cov: bool = False, std_init: float = 1.0, std_min: Union[float, Sequence[float]] = 0.01, train_mean: bool = False, use_cuda: bool = False)[source]
Bases:
StochasticParamExplStrat
Sampling parameters from a normal distribution
Constructor
- Parameters:
param_dim – number of policy parameters
full_cov – use a full covariance matrix or a diagonal covariance matrix (independent random variables)
std_init – initial standard deviation for the noise distribution
std_min – minimal standard deviation for the exploration noise
train_mean – set True if the noise should have an adaptive nonzero mean, False otherwise
use_cuda – True to move the module to the GPU, False (default) to use the CPU
- property cov
- property noise: Union[pyrado.exploration.normal_noise.FullNormalNoise, pyrado.exploration.normal_noise.DiagNormalNoise]
Get the exploration noise.
- sample_param_set(nominal_params: Tensor) Tensor [source]
Sample one set of policy parameters from the current distribution.
- Parameters:
nominal_params – parameter set (1-dim tensor) to sample around
- Returns:
sampled parameter set (1-dim tensor)
- sample_param_sets(nominal_params: Tensor, num_samples: int, include_nominal_params: bool = False) Tensor [source]
Sample multiple sets of policy parameters from the current distribution.
- Parameters:
nominal_params – parameter set (1-dim tensor) to sample around
num_samples – number of parameter sets
include_nominal_params – True to include the nominal parameter values as first parameter set
- Returns:
policy parameter sets as NxP or (N+1)xP tensor, where N is the number of samples and P is the number of policy parameters
- property std
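A minimal usage sketch, again assuming the module path pyrado.exploration.stochastic_params (values are illustrative).

```python
import torch as to
from pyrado.exploration.stochastic_params import NormalParamNoise

param_noise = NormalParamNoise(param_dim=4, full_cov=False, std_init=0.2)

nominal = to.zeros(4)
sets = param_noise.sample_param_sets(nominal, num_samples=10, include_nominal_params=True)
print(sets.shape)  # (11, 4): the nominal parameter set first, then 10 perturbed sets
print(param_noise.std, param_noise.cov)
```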
- class StochasticParamExplStrat(param_dim: int)[source]
Bases:
ABC
Exploration strategy which samples policy parameters from a distribution
Constructor
- Parameters:
param_dim – number of policy parameters
- abstract sample_param_set(nominal_params: Tensor) Tensor [source]
Sample one set of policy parameters from the current distribution.
- Parameters:
nominal_params – parameter set (1-dim tensor) to sample around
- Returns:
sampled parameter set (1-dim tensor)
- sample_param_sets(nominal_params: Tensor, num_samples: int, include_nominal_params: bool = False) Tensor [source]
Sample multiple sets of policy parameters from the current distribution.
- Parameters:
nominal_params – parameter set (1-dim tensor) to sample around
num_samples – number of parameter sets
include_nominal_params – True to include the nominal parameter values as first parameter set
- Returns:
policy parameter sets as NxP or (N+1)xP tensor, where N is the number of samples and P is the number of policy parameters
- class SymmParamExplStrat(wrapped: StochasticParamExplStrat)[source]
Bases:
StochasticParamExplStrat
Wrap a parameter exploration strategy to enforce symmetric sampling. The function sample_param_sets will always return an even number of parameter sets, and it is guaranteed that ps[:len(ps)//2] == -ps[len(ps)//2:]
Constructor
- Parameters:
wrapped – exploration strategy to wrap around
- sample_param_set(nominal_params: Tensor) Tensor [source]
Sample one set of policy parameters from the current distribution.
- Parameters:
nominal_params – parameter set (1-dim tensor) to sample around
- Returns:
sampled parameter set (1-dim tensor)
- sample_param_sets(nominal_params: Tensor, num_samples: int, include_nominal_params: bool = False) Tensor [source]
Sample multiple sets of policy parameters from the current distribution.
- Parameters:
nominal_params – parameter set (1-dim tensor) to sample around
num_samples – number of parameter sets
include_nominal_params – True to include the nominal parameter values as first parameter set
- Returns:
policy parameter sets as NxP or (N+1)xP tensor, where N is the number of samples and P is the number of policy parameters
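A minimal sketch of the mirrored sampling guarantee, wrapping the NormalParamNoise from above (module path assumed as before; with a zero nominal parameter set the symmetry is directly visible).

```python
import torch as to
from pyrado.exploration.stochastic_params import NormalParamNoise, SymmParamExplStrat

symm = SymmParamExplStrat(NormalParamNoise(param_dim=3, std_init=0.1))

sets = symm.sample_param_sets(to.zeros(3), num_samples=4)
half = sets.shape[0] // 2
print(to.allclose(sets[:half], -sets[half:]))  # True: second half mirrors the first
```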
uniform_noise
- class UniformNoise(use_cuda: bool, noise_dim: Union[int, tuple], halfspan_init: Union[float, int, torch.Tensor], halfspan_min: Union[float, torch.Tensor] = 0.01, train_mean: bool = False, learnable: bool = True)[source]
Bases:
Module
Module for learnable additive uniform noise
Constructor
- Parameters:
use_cuda – True to move the module to the GPU, False (default) to use the CPU
noise_dim – number of dimensions
halfspan_init – initial value of the half interval for the exploration noise
halfspan_min – minimal value of the half interval for the exploration noise
train_mean – True if the noise should have an adaptive nonzero mean, False otherwise
learnable – True if the parameters should be tuneable (default), False for shallow use (just sampling)
- adapt(mean: Optional[Tensor] = None, halfspan: Optional[Union[Tensor, float]] = None)[source]
Adapt the mean and the half interval span of the noise on the action or parameters. Pass None to leave a parameter at its current value.
- Parameters:
mean – exploration strategy’s new mean
halfspan – exploration strategy’s new half interval span
- property device: str
Get the device (CPU or GPU) on which the module is stored.
- forward(value: Tensor) Uniform [source]
Return the noise distribution for a specific noise-free value.
- Parameters:
value – value to evaluate the distribution around
- Returns:
noise distribution
- get_entropy() Tensor [source]
Get the exploration distribution’s entropy. The entropy of a uniform distribution is independent of the mean.
- Returns:
entropy value
- property halfspan: Tensor
Get the untransformed half interval span from the log-transformed one.
- training: bool
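A minimal usage sketch, assuming the module path pyrado.exploration.uniform_noise in analogy to normal_noise above; note that use_cuda is the first constructor argument of this module.

```python
import torch as to
from pyrado.exploration.uniform_noise import UniformNoise

noise = UniformNoise(use_cuda=False, noise_dim=2, halfspan_init=0.5, halfspan_min=0.01)

# torch.distributions.Uniform around the given noise-free value
distr = noise(to.tensor([0.0, 0.0]))
sample = distr.sample()

noise.adapt(halfspan=0.1)  # narrow the interval later on
print(noise.halfspan, noise.get_entropy())
```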