policies

base

class Policy(spec: EnvSpec, use_cuda: bool)[source]

Bases: Module, ABC

Base class for all policies in Pyrado

Constructor

Parameters:
  • spec – environment specification

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU
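For orientation, a minimal (hypothetical) subclass might look as follows. This is a sketch, not library code: the import paths and the flat_dim attributes of the spaces are assumptions based on this API reference.

    import torch as to
    import torch.nn as nn

    from pyrado.policies.base import Policy      # assumed import path
    from pyrado.utils.data_types import EnvSpec  # assumed import path


    class LinearTorchPolicy(Policy):
        """Hypothetical minimal policy: one linear layer from observations to actions."""

        def __init__(self, spec: EnvSpec, use_cuda: bool = False):
            super().__init__(spec, use_cuda)
            # flat_dim is assumed to yield the flattened size of the spaces
            self.net = nn.Linear(spec.obs_space.flat_dim, spec.act_space.flat_dim)
            self.init_param()

        def init_param(self, init_values: to.Tensor = None, **kwargs):
            if init_values is None:
                nn.init.normal_(self.net.weight, std=0.1)
                nn.init.zeros_(self.net.bias)
            else:
                self.param_values = init_values  # uses the param_values setter documented below

        def forward(self, obs: to.Tensor) -> to.Tensor:
            return self.net(obs)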

property device: str

Get the device (CPU or GPU) on which the policy is stored.

property env_spec: EnvSpec

Get the specification of the environment the policy acts in.

evaluate(rollout: StepSequence, hidden_states_name: str = 'hidden_states') Tensor[source]

Re-evaluate the given rollout and return a differentiable action tensor. The default implementation simply calls forward().

Parameters:
  • rollout – complete rollout

  • hidden_states_name – name of the rollout entry containing the hidden states, used for recurrent networks; defaults to ‘hidden_states’. Change this, e.g., for value functions.

Returns:

actions with gradient data
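A typical use is re-computing the actions of a recorded rollout inside a loss, sketched below. Here, rollout is assumed to come from a sampler and some_loss_fn is a hypothetical placeholder.

    acts = policy.evaluate(rollout)  # differentiable actions, one row per step
    loss = some_loss_fn(acts)        # hypothetical loss built from the actions
    loss.backward()                  # gradients flow into the policy parameters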

abstract forward(*args, **kwargs) Union[Tensor, Tuple[Tensor, Tensor]][source]

Get the action according to the policy and the observations (forward pass).

Parameters:
  • args – inputs, e.g. an observation from the environment or an observation and a hidden state

  • kwargs – inputs, e.g. an observation from the environment or an observation and a hidden state

Returns:

outputs, e.g. an action or an action and a hidden state

init_hidden(batch_size: Optional[int] = None) Tensor[source]

Provide initial values for the hidden parameters. This should usually be a zero tensor. The default implementation raises an error, enforcing that recurrent policies override this function.

Parameters:

batch_size – number of states to track in parallel

Returns:

Tensor of batch_size x hidden_size
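A recurrent policy would typically override it along these lines (a sketch; hidden_size is assumed to be an attribute of the concrete policy):

    def init_hidden(self, batch_size: int = None) -> to.Tensor:
        if batch_size is None:
            return to.zeros(self.hidden_size, device=self.device)
        return to.zeros(batch_size, self.hidden_size, device=self.device)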

abstract init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

property is_recurrent: bool

Flag that signals whether the policy has a recurrent architecture.

name: str = None

property num_param: int

Get the number of policy parameters.

property param_grad: Tensor

Get the gradient of the parameters as a 1-dim tensor. The values are copied; modifying the returned tensor does not propagate to the actual parameter gradients. However, setting this property will change the gradient.

property param_values: Tensor

Get the parameters of the policy as a 1-dim tensor. The values are copied; modifying the returned tensor does not propagate to the actual policy parameters. However, setting this property will change the parameters.

reset(**kwargs)[source]

Reset the policy’s internal state. This should be called at the start of a rollout. The default implementation does nothing.

script() ScriptModule[source]

Create a ScriptModule from this policy. The returned module will always have the signature action = tm(observation). For recurrent networks, it returns a stateful module that keeps the hidden states internally. Such modules have a reset() method to reset the hidden states.
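Typical usage, e.g. to export a trained policy for deployment (a sketch; the file name is illustrative, and obs_space.flat_dim is assumed as above):

    import torch as to

    scripted = policy.script()  # torch.jit.ScriptModule
    act = scripted(to.randn(policy.env_spec.obs_space.flat_dim))  # action = tm(observation)
    scripted.save('policy.pt')  # standard TorchScript serialization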

class TracedPolicyWrapper(module: Policy)[source]

Bases: Module

Wrapper for a traced policy. Mainly used to add input_size and output_size attributes.

Constructor

Parameters:

module – non-recurrent network to wrap, which must not be a script module

forward(obs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

input_size: int

output_size: int

class TwoHeadedPolicy(spec: EnvSpec, use_cuda: bool)[source]

Bases: Policy, ABC

Base class for policies with a shared body and two separate heads.

Constructor

Parameters:
  • spec – environment specification

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU

abstract forward(obs: Tensor) Union[Tensor, Tuple[Tensor, Tensor]][source]

Get the action according to the policy and the observations (forward pass).

Parameters:

obs – observation from the environment

Returns:

outputs, e.g. an action or an action and a hidden state
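A concrete subclass could, for instance, return the mean and standard deviation of a Gaussian action distribution (a hypothetical sketch; shared, head_mean, and head_std are illustrative attribute names):

    from typing import Tuple

    def forward(self, obs: to.Tensor) -> Tuple[to.Tensor, to.Tensor]:
        z = self.shared(obs)                              # shared body
        return self.head_mean(z), self.head_std(z).exp()  # two separate heads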

abstract init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

training: bool

features

class ATan2Feat(idx_sin: int, idx_cos: int)[source]

Bases: object

Feature that computes the atan2 from two dimensions of the given input / observation.

Constructor

Parameters:
  • idx_sin – index of the numerator, i.e. the sin-transformed observation dimension

  • idx_cos – index of the denominator, i.e. the cos-transformed observation dimension

class FeatureStack(*feat_fcns: Sequence[Callable[[Tensor], Any]])[source]

Bases: object

Features are nonlinear transformations of the inputs.

Note

We only consider 1-dim inputs.

Constructor

Parameters:

feat_fcns – feature functions; each maps a multi-dim input to a multi-dim output (e.g. identity_feat, squared_feat; the exception is const_feat)

get_num_feat(inp_flat_dim: int) int[source]

Calculate the number of features which depends on the dimension of the input and the selected feature functions.

Parameters:

inp_flat_dim – flattened dimension input to the feature functions

Returns:

number of feature values
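For example, stacking the identity and the squared feature for a 3-dim input (assuming both preserve the input dimension, as described above):

    feats = FeatureStack(identity_feat, squared_feat)
    num = feats.get_num_feat(3)  # presumably 3 + 3 = 6 feature values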

class MultFeat(idcs: Tuple)[source]

Bases: object

Feature that multiplies two dimensions of the given input / observation

Constructor

Parameters:

idcs – indices of the dimensions to multiply

class RBFFeat(num_feat_per_dim: int, bounds: [Sequence[list], Sequence[tuple], Sequence[numpy.ndarray], Sequence[torch.Tensor], Sequence[float]], scale: Optional[Union[Tensor, float]] = None, state_wise_norm: bool = True, use_cuda: bool = False)[source]

Bases: object

Normalized Gaussian radial basis function features

Constructor

Parameters:
  • num_feat_per_dim – number of radial basis functions, identical for every dimension of the input

  • bounds – lower and upper bound for the Gaussians’ centers, the input dimension is inferred from them

  • scale – scaling factor for the squared distance, if None the factor is determined such that two neighboring RBFs have a value of 0.2 at the other center

  • state_wise_norm – True to apply the normalization across input state dimensions separately (every dimension sums to one), or False to jointly normalize them

  • use_cuda – True to move the module to the GPU, False (default) to use the CPU
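A construction sketch for a 1-dim input covered by 5 RBFs on [-1, 1] (the values are illustrative; calling the instance on a batched input is an assumption consistent with derivative() below):

    import numpy as np
    import torch as to

    rbf = RBFFeat(num_feat_per_dim=5, bounds=(np.array([-1.0]), np.array([1.0])))
    vals = rbf(to.tensor([[0.3]]))  # assumed callable; returns the 5 normalized RBF values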

derivative(inp: Tensor) Tensor[source]

Compute the derivative of the features w.r.t. the inputs.

Note

Only processing of 1-dim input (e.g., no images)! The input can be batched along the first dimension.

Parameters:

inp – input, i.e. the observations in the RL setting

Returns:

values of all feature derivatives given the observations

class RFFeat(inp_dim: int, num_feat_per_dim: int, bandwidth: Union[float, ndarray, Tensor], use_cuda: bool = False)[source]

Bases: object

Random Fourier (RF) features

See also

[1] A. Rahimi and B. Recht “Random Features for Large-Scale Kernel Machines”, NIPS, 2007

Gaussian kernel: \(k(x,y) = \exp\left(-\frac{\sigma^2}{2d} \lVert x-y \rVert^2\right)\)

Sample from \(\mathcal{N}(0,1)\) and scale the result by \(\sigma / \sqrt{2d}\)

Parameters:
  • inp_dim – flat dimension of the inputs i.e. the observations, called \(d\) in [1]

  • num_feat_per_dim – number of random Fourier features, called \(D\) in [1]. In contrast to the RBFFeat class, the output dimensionality, and thus the number of associated policy parameters, is num_feat_per_dim and not num_feat_per_dim * inp_dim.

  • bandwidth – scaling factor for the sampled frequencies. Pass a constant scalar, for example env.obs_space.bound_up. According to [1] and the note above, this should be \(d\). Strictly speaking, it is not a bandwidth, since it is not a frequency.

  • use_cudaTrue to move the module to the GPU, False (default) to use the CPU
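Following [1], such features approximate the kernel via a map of the form \(z(x) = \sqrt{2/D}\,\cos(\Omega x + b)\) with sampled frequencies \(\Omega\) and phases \(b\). A construction sketch (values illustrative; callability is assumed, as for RBFFeat):

    import torch as to

    rff = RFFeat(inp_dim=4, num_feat_per_dim=100, bandwidth=4.0)
    feat_vals = rff(to.randn(1, 4))  # assumed callable; returns 100 feature values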

abs_feat(inp: Tensor)[source]

bell_feat(inp: Tensor, scale: float = 1.0)[source]

const_feat(inp: Tensor)[source]

cos_feat(inp: Tensor)[source]

cubic_feat(inp: Tensor)[source]

identity_feat(inp: Tensor)[source]

sig_feat(inp: Tensor, scale: float = 1.0)[source]

sign_feat(inp: Tensor)[source]

sin_feat(inp: Tensor)[source]

sincos_feat(inp: Tensor)[source]

sinsin_feat(inp: Tensor)[source]

squared_feat(inp: Tensor)[source]
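These are simple element-wise transformations; two plausible implementations, inferred from the names (sketches, not the library source):

    import torch as to

    def squared_feat(inp: to.Tensor) -> to.Tensor:
        return inp.pow(2)  # element-wise square

    def sin_feat(inp: to.Tensor) -> to.Tensor:
        return to.sin(inp)  # element-wise sine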

initialization

init_param(m, **kwargs)[source]

Initialize the parameters of the PyTorch Module / layer / network / cell according to its type.

Parameters:
  • m – PyTorch Module / layer / network / cell to initialize

  • kwargs – optional keyword arguments, e.g. t_max for LSTM’s chrono initialization [2], or uniform_bias

See also

[1] A.M. Saxe, J.L. McClelland, S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks”, 2014, ICLR

[2] C. Tallec, Y. Ollivier, “Can recurrent neural networks warp time?”, 2018, ICLR
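In practice, the function is typically applied recursively over a network’s submodules via nn.Module.apply (a sketch; it is assumed that init_param silently skips module types it does not recognize):

    import torch.nn as nn

    lin = nn.Linear(4, 32)
    lstm = nn.LSTM(input_size=32, hidden_size=32)
    lin.apply(init_param)                            # dispatches on the module's type
    lstm.apply(lambda m: init_param(m, t_max=1000))  # e.g. chrono initialization [2]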
