policies

base

class Policy(spec: EnvSpec, use_cuda: bool)[source]

Bases: Module, ABC

Base class for all policies in Pyrado

Constructor

Parameters:
  • spec – environment specification

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU
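For orientation, a minimal (hypothetical) subclass might look as follows. This is a sketch, not library code: the import paths and the flat_dim attributes of the spaces are assumptions based on this API reference.

    import torch as to
    import torch.nn as nn

    from pyrado.policies.base import Policy      # assumed import path
    from pyrado.utils.data_types import EnvSpec  # assumed import path


    class LinearTorchPolicy(Policy):
        """Hypothetical minimal policy: one linear layer from observations to actions."""

        def __init__(self, spec: EnvSpec, use_cuda: bool = False):
            super().__init__(spec, use_cuda)
            # flat_dim is assumed to yield the flattened size of the spaces
            self.net = nn.Linear(spec.obs_space.flat_dim, spec.act_space.flat_dim)
            self.init_param()

        def init_param(self, init_values: to.Tensor = None, **kwargs):
            if init_values is None:
                nn.init.normal_(self.net.weight, std=0.1)
                nn.init.zeros_(self.net.bias)
            else:
                self.param_values = init_values  # uses the param_values setter documented below

        def forward(self, obs: to.Tensor) -> to.Tensor:
            return self.net(obs)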

property device: str

Get the device (CPU or GPU) on which the policy is stored.

property env_spec: EnvSpec

Get the specification of the environment the policy acts in.

evaluate(rollout: StepSequence, hidden_states_name: str = 'hidden_states') Tensor[source]

Re-evaluate the given rollout and return a differentiable action tensor. The default implementation simply calls forward().

Parameters:
  • rollout – complete rollout

  • hidden_states_name – name of the rollout entry containing the hidden states, used for recurrent networks; defaults to ‘hidden_states’. Change this, e.g., for value functions.

Returns:

actions with gradient data
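A typical use is re-computing the actions of a recorded rollout inside a loss, sketched below. Here, rollout is assumed to come from a sampler and some_loss_fn is a hypothetical placeholder.

    acts = policy.evaluate(rollout)  # differentiable actions, one row per step
    loss = some_loss_fn(acts)        # hypothetical loss built from the actions
    loss.backward()                  # gradients flow into the policy parameters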

abstract forward(*args, **kwargs) Union[Tensor, Tuple[Tensor, Tensor]][source]

Get the action according to the policy and the observations (forward pass).

Parameters:
  • args – inputs, e.g. an observation from the environment or an observation and a hidden state

  • kwargs – inputs, e.g. an observation from the environment or an observation and a hidden state

Returns:

outputs, e.g. an action or an action and a hidden state

init_hidden(batch_size: Optional[int] = None) Tensor[source]

Provide initial values for the hidden parameters. This should usually be a zero tensor. The default implementation raises an error, enforcing that recurrent policies override this function.

Parameters:

batch_size – number of states to track in parallel

Returns:

Tensor of batch_size x hidden_size
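A recurrent policy would typically override it along these lines (a sketch; hidden_size is assumed to be an attribute of the concrete policy):

    def init_hidden(self, batch_size: int = None) -> to.Tensor:
        if batch_size is None:
            return to.zeros(self.hidden_size, device=self.device)
        return to.zeros(batch_size, self.hidden_size, device=self.device)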

abstract init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

property is_recurrent: bool

Flag that signals whether the policy has a recurrent architecture.

name: str = None

property num_param: int

Get the number of policy parameters.

property param_grad: Tensor

Get the gradient of the parameters as a 1-dim tensor. The values are copied; modifying the returned tensor does not propagate to the actual parameter gradients. However, setting this property will change the gradient.

property param_values: Tensor

Get the parameters of the policy as a 1-dim tensor. The values are copied; modifying the returned tensor does not propagate to the actual policy parameters. However, setting this property will change the parameters.

reset(**kwargs)[source]

Reset the policy’s internal state. This should be called at the start of a rollout. The default implementation does nothing.

script() ScriptModule[source]

Create a ScriptModule from this policy. The returned module will always have the signature action = tm(observation). For recurrent networks, it returns a stateful module that keeps the hidden states internally. Such modules have a reset() method to reset the hidden states.
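Typical usage, e.g. to export a trained policy for deployment (a sketch; the file name is illustrative, and obs_space.flat_dim is assumed as above):

    import torch as to

    scripted = policy.script()  # torch.jit.ScriptModule
    act = scripted(to.randn(policy.env_spec.obs_space.flat_dim))  # action = tm(observation)
    scripted.save('policy.pt')  # standard TorchScript serialization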

class TracedPolicyWrapper(module: Policy)[source]

Bases: Module

Wrapper for a traced policy. Mainly used to add input_size and output_size attributes.

Constructor

Parameters:

module – non-recurrent network to wrap, which must not be a script module

forward(obs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

input_size: int

output_size: int

class TwoHeadedPolicy(spec: EnvSpec, use_cuda: bool)[source]

Bases: Policy, ABC

Base class for policies with a shared body and two separate heads.

Constructor

Parameters:
  • spec – environment specification

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU

abstract forward(obs: Tensor) Union[Tensor, Tuple[Tensor, Tensor]][source]

Get the action according to the policy and the observations (forward pass).

Parameters:

obs – observation from the environment

Returns:

outputs, e.g. an action or an action and a hidden state
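A concrete subclass could, for instance, return the mean and standard deviation of a Gaussian action distribution (a hypothetical sketch; shared, head_mean, and head_std are illustrative attribute names):

    from typing import Tuple

    def forward(self, obs: to.Tensor) -> Tuple[to.Tensor, to.Tensor]:
        z = self.shared(obs)                              # shared body
        return self.head_mean(z), self.head_std(z).exp()  # two separate heads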

abstract init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

training: bool

features

class ATan2Feat(idx_sin: int, idx_cos: int)[source]

Bases: object

Feature that computes the atan2 from two dimensions of the given input / observation.

Constructor

Parameters:
  • idx_sin – index of the numerator, i.e. the sin-transformed observation dimension

  • idx_cos – index of the denominator, i.e. the cos-transformed observation dimension

class FeatureStack(*feat_fcns: Sequence[Callable[[Tensor], Any]])[source]

Bases: object

Features are nonlinear transformations of the inputs.

Note

We only consider 1-dim inputs.

Constructor

Parameters:

feat_fcns – feature functions; each maps a multi-dim input to a multi-dim output (e.g. identity_feat, squared_feat; the exception is const_feat)

get_num_feat(inp_flat_dim: int) int[source]

Calculate the number of features which depends on the dimension of the input and the selected feature functions.

Parameters:

inp_flat_dim – flattened dimension input to the feature functions

Returns:

number of feature values
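For example, stacking the identity and the squared feature for a 3-dim input (assuming both preserve the input dimension, as described above):

    feats = FeatureStack(identity_feat, squared_feat)
    num = feats.get_num_feat(3)  # presumably 3 + 3 = 6 feature values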

class MultFeat(idcs: Tuple)[source]

Bases: object

Feature that multiplies two dimensions of the given input / observation

Constructor

Parameters:

idcs – indices of the dimensions to multiply

class RBFFeat(num_feat_per_dim: int, bounds: [Sequence[list], Sequence[tuple], Sequence[numpy.ndarray], Sequence[torch.Tensor], Sequence[float]], scale: Optional[Union[Tensor, float]] = None, state_wise_norm: bool = True, use_cuda: bool = False)[source]

Bases: object

Normalized Gaussian radial basis function features

Constructor

Parameters:
  • num_feat_per_dim – number of radial basis functions, identical for every dimension of the input

  • bounds – lower and upper bound for the Gaussians’ centers, the input dimension is inferred from them

  • scale – scaling factor for the squared distance, if None the factor is determined such that two neighboring RBFs have a value of 0.2 at the other center

  • state_wise_norm – True to apply the normalization across input state dimensions separately (every dimension sums to one), or False to jointly normalize them

  • use_cuda – True to move the module to the GPU, False (default) to use the CPU
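A construction sketch for a 1-dim input covered by 5 RBFs on [-1, 1] (the values are illustrative; calling the instance on a batched input is an assumption consistent with derivative() below):

    import numpy as np
    import torch as to

    rbf = RBFFeat(num_feat_per_dim=5, bounds=(np.array([-1.0]), np.array([1.0])))
    vals = rbf(to.tensor([[0.3]]))  # assumed callable; returns the 5 normalized RBF values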

derivative(inp: Tensor) Tensor[source]

Compute the derivative of the features w.r.t. the inputs.

Note

Only processing of 1-dim input (e.g., no images)! The input can be batched along the first dimension.

Parameters:

inp – input, i.e. the observations in the RL setting

Returns:

values of all feature derivatives given the observations

class RFFeat(inp_dim: int, num_feat_per_dim: int, bandwidth: Union[float, ndarray, Tensor], use_cuda: bool = False)[source]

Bases: object

Random Fourier (RF) features

See also

[1] A. Rahimi and B. Recht “Random Features for Large-Scale Kernel Machines”, NIPS, 2007

Gaussian kernel: \(k(x,y) = \exp\left(-\frac{\sigma^2}{2d} \lVert x-y \rVert^2\right)\)

Sample from \(\mathcal{N}(0,1)\) and scale the result by \(\sigma / \sqrt{2d}\)

Parameters:
  • inp_dim – flat dimension of the inputs i.e. the observations, called \(d\) in [1]

  • num_feat_per_dim – number of random Fourier features, called \(D\) in [1]. In contrast to the RBFFeat class, the output dimensionality, and thus the number of associated policy parameters, is num_feat_per_dim and not num_feat_per_dim * inp_dim.

  • bandwidth – scaling factor for the sampled frequencies. Pass a constant scalar, for example env.obs_space.bound_up. According to [1] and the note above, this should be \(d\). Strictly speaking, it is not a bandwidth, since it is not a frequency.

  • use_cudaTrue to move the module to the GPU, False (default) to use the CPU
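Following [1], such features approximate the kernel via a map of the form \(z(x) = \sqrt{2/D}\,\cos(\Omega x + b)\) with sampled frequencies \(\Omega\) and phases \(b\). A construction sketch (values illustrative; callability is assumed, as for RBFFeat):

    import torch as to

    rff = RFFeat(inp_dim=4, num_feat_per_dim=100, bandwidth=4.0)
    feat_vals = rff(to.randn(1, 4))  # assumed callable; returns 100 feature values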

abs_feat(inp: Tensor)[source]

bell_feat(inp: Tensor, scale: float = 1.0)[source]

const_feat(inp: Tensor)[source]

cos_feat(inp: Tensor)[source]

cubic_feat(inp: Tensor)[source]

identity_feat(inp: Tensor)[source]

sig_feat(inp: Tensor, scale: float = 1.0)[source]

sign_feat(inp: Tensor)[source]

sin_feat(inp: Tensor)[source]

sincos_feat(inp: Tensor)[source]

sinsin_feat(inp: Tensor)[source]

squared_feat(inp: Tensor)[source]
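These are simple element-wise transformations; two plausible implementations, inferred from the names (sketches, not the library source):

    import torch as to

    def squared_feat(inp: to.Tensor) -> to.Tensor:
        return inp.pow(2)  # element-wise square

    def sin_feat(inp: to.Tensor) -> to.Tensor:
        return to.sin(inp)  # element-wise sine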

initialization

init_param(m, **kwargs)[source]

Initialize the parameters of the PyTorch Module / layer / network / cell according to its type.

Parameters:
  • m – PyTorch Module / layer / network / cell to initialize

  • kwargs – optional keyword arguments, e.g. t_max for LSTM’s chrono initialization [2], or uniform_bias

See also

[1] A.M. Saxe, J.L. McClelland, S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks”, 2014, ICLR

[2] C. Tallec, Y. Ollivier, “Can recurrent neural networks warp time?”, 2018, ICLR
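In practice, the function is typically applied recursively over a network’s submodules via nn.Module.apply (a sketch; it is assumed that init_param silently skips module types it does not recognize):

    import torch.nn as nn

    lin = nn.Linear(4, 32)
    lstm = nn.LSTM(input_size=32, hidden_size=32)
    lin.apply(init_param)                            # dispatches on the module's type
    lstm.apply(lambda m: init_param(m, t_max=1000))  # e.g. chrono initialization [2]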
