exploration

normal_noise

class DiagNormalNoise(noise_dim: Union[int, tuple], std_init: Union[float, int, Tensor], std_min: Union[float, Tensor] = 0.001, train_mean: bool = False, learnable: bool = True, use_cuda: bool = False)[source]

Bases: Module

Module for learnable additive Gaussian noise with a diagonal covariance matrix

Constructor

Parameters:
  • noise_dim – number of dimensions

  • std_init – initial standard deviation for the exploration noise

  • std_min – minimal standard deviation for the exploration noise

  • train_mean – True if the noise should have an adaptive nonzero mean, False otherwise

  • learnable – True if the parameters should be tuneable (default), False for shallow use (just sampling)

  • use_cuda – True to move the module to the GPU, False (default) to use the CPU
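
A minimal usage sketch based only on the signatures documented here (the import path follows the normal_noise module heading above; tensor shapes and values are illustrative assumptions):

    import torch as to
    from pyrado.exploration.normal_noise import DiagNormalNoise

    # Learnable diagonal Gaussian noise for a 3-dimensional action or parameter space
    noise = DiagNormalNoise(noise_dim=3, std_init=0.5, std_min=1e-3)

    # Query the noise distribution around a noise-free value and draw a sample
    distr = noise(to.zeros(3))  # a torch.distributions.Normal, see forward() below
    noisy_value = distr.sample()

    # Shrink the standard deviation while leaving the mean untouched
    noise.adapt(std=0.1)
    print(noise.get_entropy())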

adapt(mean: Optional[Tensor] = None, std: Optional[Union[Tensor, float]] = None)[source]

Adapt the mean and the variance of the noise on the action or parameters. Use None to leave one of the parameters at their current value.

Parameters:
  • mean – exploration strategy’s new mean

  • std – exploration strategy’s new standard deviation

property device: str

Get the device (CPU or GPU) on which the policy is stored.

forward(value: Tensor) Normal[source]

Return the noise distribution for a specific noise-free value.

Parameters:

value – value to evaluate the distribution around

Returns:

noise distribution

get_entropy() Tensor[source]

Get the exploration distribution’s entropy. The entropy of a normal distribution is independent of the mean.

Returns:

entropy value

reset_expl_params()[source]

Reset all parameters of the exploration strategy.

property std: Tensor

Get the untransformed standard deviation from the log-transformed.

training: bool
class FullNormalNoise(noise_dim: Union[int, tuple], std_init: Union[float, int, Tensor], std_min: Union[float, Tensor] = 0.001, train_mean: bool = False, learnable: bool = True, use_cuda: bool = False)[source]

Bases: Module

Module for learnable additive Gaussian noise with a full covariance matrix

Constructor

Parameters:
  • noise_dim – number of dimensions

  • std_init – initial standard deviation for the exploration noise

  • std_min – minimal standard deviation for the exploration noise

  • train_mean – True if the noise should have an adaptive nonzero mean, False otherwise

  • learnable – True if the parameters should be tuneable (default), False for shallow use (just sampling)

  • use_cuda – True to move the module to the GPU, False (default) to use the CPU
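
A corresponding sketch for the full-covariance case, again only using the signatures listed here (import path and shapes are assumptions):

    import torch as to
    from pyrado.exploration.normal_noise import FullNormalNoise

    # Learnable Gaussian noise with a full covariance matrix over 2 dimensions
    noise = FullNormalNoise(noise_dim=2, std_init=1.0)

    # The forward pass yields a MultivariateNormal centered at the given value
    distr = noise(to.zeros(2))
    sample = distr.sample()

    # Adapt the covariance while keeping the mean at its current value
    noise.adapt(cov=0.25 * to.eye(2))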

adapt(mean: Optional[Tensor] = None, cov: Optional[Union[Tensor, float]] = None)[source]

Adapt the mean and the variance of the noise on the action or parameters. Use None to leave one of the parameters at their current value.

Parameters:
  • mean – exploration strategy’s new mean

  • cov – exploration strategy’s new covariance matrix

property device: str

Get the device (CPU or GPU) on which the policy is stored.

forward(value: Tensor) MultivariateNormal[source]

Return the noise distribution for a specific noise-free value.

Parameters:

value – value to evaluate the distribution around

Returns:

noise distribution

get_entropy() Tensor[source]

Get the exploration distribution’s entropy. The entropy of a normal distribution is independent of the mean.

Returns:

entropy value

reset_expl_params()[source]

Reset all parameters of the exploration strategy.

property std: Tensor

Get the standard deviations from the internal covariance matrix.

training: bool

stochastic_action

class EpsGreedyExplStrat(policy: Policy, eps: float = 1.0, eps_schedule_gamma: float = 0.99, eps_final: float = 0.05)[source]

Bases: StochasticActionExplStrat

Exploration strategy which selects discrete actions epsilon-greedily

Constructor

Parameters:
  • policy – wrapped policy

  • eps – parameter determining the greediness, can be optimized or scheduled

  • eps_schedule_gamma – temporal discount factor for the exponential decay of epsilon

  • eps_final – minimum value of epsilon
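
A hedged usage sketch of the epsilon-greedy wrapper; discrete_policy and obs are placeholders for an existing pyrado Policy with discrete actions and an environment observation, and the import path is assumed from the stochastic_action module heading above:

    from pyrado.exploration.stochastic_action import EpsGreedyExplStrat

    # Wrap an existing policy with discrete actions (placeholder name)
    expl_strat = EpsGreedyExplStrat(discrete_policy, eps=1.0, eps_schedule_gamma=0.99, eps_final=0.05)

    # During training: sample (possibly random) actions and decay epsilon over time
    act = expl_strat(obs)             # obs is an observation from the environment
    expl_strat.schedule_eps(steps=1)  # exponential decay of eps towards eps_final

    # For evaluation: switch off the exploration entirely
    expl_strat.eval()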

action_dist_at(policy_output: Tensor) Distribution[source]

Return the action distribution for the given output from the wrapped policy.

Parameters:

policy_output – output from the wrapped policy, i.e. the noise-free action values

Returns:

action distribution

eval()[source]

Call PyTorch’s eval function and disable all exploration.

forward(obs: Tensor, *extra) -> Union[Tensor, tuple][source]

Get the action according to the policy and the observations (forward pass).

Parameters:
  • args – inputs, e.g. an observation from the environment or an observation and a hidden state

  • kwargs – inputs, e.g. an observation from the environment or an observation and a hidden state

Returns:

outputs, e.g. an action or an action and a hidden state

schedule_eps(steps: int)[source]
train(mode=True)[source]

Call PyTorch’s train function and re-activate all exploration.

class NormalActNoiseExplStrat(policy: Policy, std_init: Union[float, Tensor], std_min: Union[float, Tensor] = 0.001, train_mean: bool = False, learnable: bool = True)[source]

Bases: StochasticActionExplStrat

Exploration strategy which adds Gaussian noise to the continuous policy actions

Constructor

Parameters:
  • policy – wrapped policy

  • std_init – initial standard deviation for the exploration noise

  • std_min – minimal standard deviation for the exploration noise

  • train_mean – set True if the noise should have an adaptive nonzero mean, False otherwise

  • learnable – True if the parameters should be tuneable (default), False for shallow use (just sampling)
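
A hedged usage sketch; continuous_policy, obs and rollout are placeholders for an existing pyrado Policy with continuous actions, an environment observation and a collected StepSequence, and the import path is assumed from the module heading above:

    from pyrado.exploration.stochastic_action import NormalActNoiseExplStrat

    # Add Gaussian exploration noise to the actions of an existing continuous policy
    expl_strat = NormalActNoiseExplStrat(continuous_policy, std_init=0.3, std_min=1e-3)

    # Sample an exploratory action for a given observation
    act = expl_strat(obs)

    # Re-evaluate a collected rollout to obtain the action distribution with gradient information
    act_distr = expl_strat.evaluate(rollout)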

action_dist_at(policy_output: Tensor) Distribution[source]

Return the action distribution for the given output from the wrapped policy.

Parameters:

policy_output – output from the wrapped policy, i.e. the noise-free action values

Returns:

action distribution

get_entropy(*args, **kwargs)[source]
property mean
property noise: DiagNormalNoise

Get the exploration noise.

reset_expl_params(*args, **kwargs)[source]
property std
class SACExplStrat(policy: Policy)[source]

Bases: StochasticActionExplStrat

State-dependent exploration strategy which adds normal noise to the continuous actions and squashes the result with a tanh.

Note

This exploration strategy is specifically designed for SAC. Due to the tanh transformation, it returns action values within [-1,1].
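
To illustrate the note, a tiny standalone sketch of the tanh squashing idea (this is not the implementation of SACExplStrat, only the underlying transformation):

    import torch as to
    from torch.distributions import Normal

    mean, std = to.zeros(2), to.ones(2)
    raw_act = Normal(mean, std).rsample()  # unbounded Gaussian sample
    squashed_act = to.tanh(raw_act)        # guaranteed to lie within [-1, 1]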

Constructor

Parameters:

policy – wrapped policy

action_dist_at(policy_out_1: Tensor, policy_out_2: Tensor) Distribution[source]

Return the action distribution for the given output from the wrapped policy. This method is made for two-headed policies, e.g. used with SAC.

Parameters:
  • policy_out_1 – first head’s output from the wrapped policy, noise-free action values

  • policy_out_2 – second head’s output from the wrapped policy, state-dependent log std values

Returns:

action distribution at the mean given by policy_out_1

evaluate(rollout: StepSequence, hidden_states_name: str = 'hidden_states') Distribution[source]

Re-evaluate the given rollout using the policy wrapped by this exploration strategy. Use this method to get gradient data on the action distribution. This version is tailored to the two-headed policy architecture used for SAC, since it requires a two-headed policy, where the first head returns the mean action and the second head returns the state-dependent std.

Parameters:
  • rollout – complete rollout

  • hidden_states_name – name of hidden states rollout entry, used for recurrent networks

Returns:

actions with gradient data

forward(obs: Tensor, *extra) -> Union[Tuple[Tensor, Tensor], Tuple[Tensor, Tensor, Tensor]][source]

Get the action according to the policy and the observations (forward pass).

Parameters:
  • args – inputs, e.g. an observation from the environment or an observation and a hidden state

  • kwargs – inputs, e.g. an observation from the environment or an observation and a hidden state

Returns:

outputs, e.g. an action or an action and a hidden state

get_entropy(*args, **kwargs)[source]
property mean
property noise: DiagNormalNoise

Get the exploration noise.

reset_expl_params(*args, **kwargs)[source]
property std
class StochasticActionExplStrat(policy: Policy)[source]

Bases: Policy, ABC

Explore by sampling actions from a distribution.

Constructor

Parameters:

policy – wrapped policy

action_dist_at(policy_output: Tensor) Distribution[source]

Return the action distribution for the given output from the wrapped policy.

Parameters:

policy_output – output from the wrapped policy, i.e. the noise-free action values

Returns:

action distribution
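
Conceptually, the forward pass of such a strategy queries the wrapped noise-free policy, builds the action distribution around its output via action_dist_at, and samples from it. A hedged pseudocode sketch of this pattern (the .policy attribute and the helper name are assumptions, not the actual implementation):

    import torch as to

    def explore_step(expl_strat, obs: to.Tensor) -> to.Tensor:
        # Hypothetical helper illustrating the pattern described above
        clean_act = expl_strat.policy(obs)                # assumes the wrapped policy is exposed as .policy
        act_distr = expl_strat.action_dist_at(clean_act)  # distribution centered on the noise-free action
        return act_distr.sample()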

evaluate(rollout: StepSequence, hidden_states_name: str = 'hidden_states') Distribution[source]

Re-evaluate the given rollout using the policy wrapped by this exploration strategy. Use this method to get gradient data on the action distribution.

Parameters:
  • rollout – complete rollout

  • hidden_states_name – name of hidden states rollout entry, used for recurrent networks

Returns:

actions with gradient data

forward(obs: Tensor, *extra) -> Union[Tensor, tuple][source]

Get the action according to the policy and the observations (forward pass).

Parameters:
  • args – inputs, e.g. an observation from the environment or an observation and a hidden state

  • kwargs – inputs, e.g. an observation from the environment or an observation and a hidden state

Returns:

outputs, e.g. an action or an action and a hidden state

init_hidden(batch_size: Optional[int] = None) Tensor[source]

Provide initial values for the hidden parameters. This should usually be a zero tensor. The default implementation raises an error, which enforces that recurrent policies override this function.

Parameters:

batch_size – number of states to track in parallel

Returns:

Tensor of batch_size x hidden_size

init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

property is_recurrent: bool

Bool to signalise whether the policy has a recurrent architecture.

reset(**kwargs)[source]

Reset the policy’s internal state. This should be called at the start of a rollout. The default implementation does nothing.

class UniformActNoiseExplStrat(policy: Policy, halfspan_init: Union[float, Tensor], halfspan_min: Union[float, list] = 0.01, train_mean: bool = False, learnable: bool = True)[source]

Bases: StochasticActionExplStrat

Exploration strategy which adds uniform noise to the continuous policy actions

Constructor

Parameters:
  • policy – wrapped policy

  • halfspan_init – initial value of the half interval for the exploration noise

  • halfspan_min – minimal value of the half interval for the exploration noise

  • train_mean – set True if the noise should have an adaptive nonzero mean, False otherwise

  • learnable – True if the parameters should be tuneable (default), False for shallow use (just sampling)
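
A brief hedged sketch, analogous to the Gaussian case above (continuous_policy and obs are placeholders, the import path is assumed):

    from pyrado.exploration.stochastic_action import UniformActNoiseExplStrat

    # Perturb the actions of an existing continuous policy with uniform noise
    expl_strat = UniformActNoiseExplStrat(continuous_policy, halfspan_init=0.5, halfspan_min=0.01)
    act = expl_strat(obs)  # action perturbed within the current half interval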

action_dist_at(policy_output: Tensor) Distribution[source]

Return the action distribution for the given output from the wrapped policy.

Parameters:

policy_output – output from the wrapped policy, i.e. the noise-free action values

Returns:

action distribution

get_entropy(*args, **kwargs)[source]
property halfspan
property noise: UniformNoise

Get the exploration noise.

reset_expl_params(*args, **kwargs)[source]

stochastic_params

class HyperSphereParamNoise(param_dim: int, expl_r_init: float = 1.0)[source]

Bases: StochasticParamExplStrat

Sampling parameters from a hyper-sphere of adjustable radius

Constructor

Parameters:
  • param_dim – number of policy parameters

  • expl_r_init – initial radius of the hyper-sphere
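
A hedged usage sketch for parameter-space exploration on a hyper-sphere, based only on the methods listed below (the import path is assumed from the stochastic_params module heading above):

    import torch as to
    from pyrado.exploration.stochastic_params import HyperSphereParamNoise

    expl_strat = HyperSphereParamNoise(param_dim=10, expl_r_init=1.0)

    # Perturb a nominal (flattened) policy parameter vector
    nominal = to.zeros(10)
    candidate = expl_strat.sample_param_set(nominal)

    # Increase the exploration radius, or reset it back to its initial value
    expl_strat.adapt(r=2.0)
    expl_strat.reset_expl_params()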

adapt(r: float)[source]

Set a new radius for the hyper sphere from which the policy parameters are sampled.

property r: float

Get the radius of the hypersphere.

reset_expl_params()[source]

Reset all parameters of the exploration strategy.

sample_param_set(nominal_params: Tensor) Tensor[source]

Sample one set of policy parameters from the current distribution.

Parameters:

nominal_params – parameter set (1-dim tensor) to sample around

Returns:

sampled parameter set (1-dim tensor)

class NormalParamNoise(param_dim: int, full_cov: bool = False, std_init: float = 1.0, std_min: Union[float, Sequence[float]] = 0.01, train_mean: bool = False, use_cuda: bool = False)[source]

Bases: StochasticParamExplStrat

Sampling parameters from a normal distribution

Constructor

Parameters:
  • param_dim – number of policy parameters

  • full_cov – use a full covariance matrix or a diagonal covariance matrix (independent random variables)

  • std_init – initial standard deviation for the noise distribution

  • std_min – minimal standard deviation for the exploration noise

  • train_mean – set True if the noise should have an adaptive nonzero mean, False otherwise

  • use_cuda – True to move the module to the GPU, False (default) to use the CPU
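
A hedged sketch of Gaussian parameter-space exploration, typically applied to a policy’s flattened parameter vector (here replaced by a plain tensor; the import path is assumed from the module heading above):

    import torch as to
    from pyrado.exploration.stochastic_params import NormalParamNoise

    expl_strat = NormalParamNoise(param_dim=8, full_cov=False, std_init=0.5)

    # Draw 20 candidate parameter sets around the nominal parameters,
    # prepending the nominal values as the first row
    nominal = to.zeros(8)
    candidates = expl_strat.sample_param_sets(nominal, num_samples=20, include_nominal_params=True)
    # candidates is a (20 + 1) x 8 tensor, see sample_param_sets() below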

adapt(*args, **kwargs)[source]
property cov
get_entropy(*args, **kwargs)[source]
property noise: Union[FullNormalNoise, DiagNormalNoise]

Get the exploration noise.

reset_expl_params(*args, **kwargs)[source]
sample_param_set(nominal_params: Tensor) Tensor[source]

Sample one set of policy parameters from the current distribution.

Parameters:

nominal_params – parameter set (1-dim tensor) to sample around

Returns:

sampled parameter set (1-dim tensor)

sample_param_sets(nominal_params: Tensor, num_samples: int, include_nominal_params: bool = False) Tensor[source]

Sample multiple sets of policy parameters from the current distribution.

Parameters:
  • nominal_params – parameter set (1-dim tensor) to sample around

  • num_samples – number of parameter sets

  • include_nominal_params – True to include the nominal parameter values as first parameter set

Returns:

policy parameter sets as NxP or (N+1)xP tensor where N is the number of samples and P is the number of policy parameters

property std
class StochasticParamExplStrat(param_dim: int)[source]

Bases: ABC

Exploration strategy which samples policy parameters from a distribution

Constructor

Parameters:

param_dim – number of policy parameters

abstract sample_param_set(nominal_params: Tensor) Tensor[source]

Sample one set of policy parameters from the current distribution.

Parameters:

nominal_params – parameter set (1-dim tensor) to sample around

Returns:

sampled parameter set (1-dim tensor)

sample_param_sets(nominal_params: Tensor, num_samples: int, include_nominal_params: bool = False) Tensor[source]

Sample multiple sets of policy parameters from the current distribution.

Parameters:
  • nominal_params – parameter set (1-dim tensor) to sample around

  • num_samples – number of parameter sets

  • include_nominal_params – True to include the nominal parameter values as first parameter set

Returns:

policy parameter sets as NxP or (N+1)xP tensor where N is the number of samples and P is the number of policy parameters

class SymmParamExplStrat(wrapped: StochasticParamExplStrat)[source]

Bases: StochasticParamExplStrat

Wrap a parameter exploration strategy to enforce symmetric sampling. The function sample_param_sets will always return an even number of parameter sets, and it is guaranteed that ps[:len(ps)//2] == -ps[len(ps)//2:]

Constructor

Parameters:

wrapped – exploration strategy to wrap around
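
A hedged sketch of symmetric (antithetic) parameter sampling by wrapping the Gaussian strategy from above (import path assumed; the assertion restates the guarantee from the class description):

    import torch as to
    from pyrado.exploration.stochastic_params import NormalParamNoise, SymmParamExplStrat

    inner = NormalParamNoise(param_dim=4, std_init=0.2)
    symm = SymmParamExplStrat(inner)

    # Sampling around a zero nominal vector yields mirrored halves
    ps = symm.sample_param_sets(to.zeros(4), num_samples=6)
    half = ps.shape[0] // 2
    assert to.allclose(ps[:half], -ps[half:])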

sample_param_set(nominal_params: Tensor) Tensor[source]

Sample one set of policy parameters from the current distribution.

Parameters:

nominal_params – parameter set (1-dim tensor) to sample around

Returns:

sampled parameter set (1-dim tensor)

sample_param_sets(nominal_params: Tensor, num_samples: int, include_nominal_params: bool = False) Tensor[source]

Sample multiple sets of policy parameters from the current distribution.

Parameters:
  • nominal_params – parameter set (1-dim tensor) to sample around

  • num_samples – number of parameter sets

  • include_nominal_params – True to include the nominal parameter values as first parameter set

Returns:

policy parameter sets as NxP or (N+1)xP tensor where N is the number of samples and P is the number of policy parameters

uniform_noise

class UniformNoise(use_cuda: bool, noise_dim: Union[int, tuple], halfspan_init: Union[float, int, Tensor], halfspan_min: Union[float, Tensor] = 0.01, train_mean: bool = False, learnable: bool = True)[source]

Bases: Module

Module for learnable additive uniform noise

Constructor

Parameters:
  • use_cuda – True to move the module to the GPU, False (default) to use the CPU

  • noise_dim – number of dimensions

  • halfspan_init – initial value of the half interval for the exploration noise

  • halfspan_min – minimal value of the half interval for the exploration noise

  • train_mean – True if the noise should have an adaptive nonzero mean, False otherwise

  • learnable – True if the parameters should be tuneable (default), False for shallow use (just sampling)
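
A minimal hedged sketch mirroring the Gaussian noise modules above (the import path follows the uniform_noise module heading; shapes are assumptions):

    import torch as to
    from pyrado.exploration.uniform_noise import UniformNoise

    noise = UniformNoise(use_cuda=False, noise_dim=2, halfspan_init=0.5)

    # The forward pass yields a Uniform distribution centered on the given value
    distr = noise(to.zeros(2))
    sample = distr.sample()

    # Narrow the interval while keeping the mean at its current value
    noise.adapt(halfspan=0.1)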

adapt(mean: Optional[Tensor] = None, halfspan: Optional[Union[Tensor, float]] = None)[source]

Adapt the mean and the half interval span of the noise on the action or parameters. Use None to leave one of the parameters at their current value.

Parameters:
  • mean – exploration strategy’s new mean

  • halfspan – exploration strategy’s new half interval span

property device: str

Get the device (CPU or GPU) on which the policy is stored.

forward(value: Tensor) Uniform[source]

Return the noise distribution for a specific noise-free value.

Parameters:

value – value to evaluate the distribution around

Returns:

noise distribution

get_entropy() Tensor[source]

Get the exploration distribution’s entropy. The entropy of a uniform distribution is independent of the mean.

Returns:

entropy value

property halfspan: Tensor

Get the untransformed half interval span given the log-transformed one.

reset_expl_params()[source]

Reset all parameters of the exploration strategy.

training: bool

Module contents