policies
base
- class Policy(spec: EnvSpec, use_cuda: bool)[source]
Bases: Module, ABC
Base class for all policies in Pyrado
Constructor
- Parameters:
spec – environment specification
use_cuda – True to move the policy to the GPU, False (default) to use the CPU
- property device: str
Get the device (CPU or GPU) on which the policy is stored.
- evaluate(rollout: StepSequence, hidden_states_name: str = 'hidden_states') → Tensor [source]
Re-evaluate the given rollout and return a differentiable action tensor. The default implementation simply calls forward().
- Parameters:
rollout – complete rollout
hidden_states_name – name of hidden states rollout entry, used for recurrent networks. Defaults to ‘hidden_states’. Change for value functions.
- Returns:
actions with gradient data
- abstract forward(*args, **kwargs) → Union[Tensor, Tuple[Tensor, Tensor]] [source]
Get the action according to the policy and the observations (forward pass).
- Parameters:
args – inputs, e.g. an observation from the environment or an observation and a hidden state
kwargs – inputs, e.g. an observation from the environment or an observation and a hidden state
- Returns:
outputs, e.g. an action or an action and a hidden state
Provide initial values for the hidden parameters. This should usually be a zero tensor. The default implementation raises an error, which forces recurrent policies to override this function.
- Parameters:
batch_size – number of states to track in parallel
- Returns:
Tensor of batch_size x hidden_size
- abstract init_param(init_values: Optional[Tensor] = None, **kwargs)[source]
Initialize the policy’s parameters. By default the parameters are initialized randomly.
- Parameters:
init_values – tensor of fixed initial policy parameter values
kwargs – additional keyword arguments for the policy parameter initialization
- property is_recurrent: bool
Boolean flag indicating whether the policy has a recurrent architecture.
- name: str = None
- property num_param: int
Get the number of policy parameters.
- property param_grad: Tensor
Get the gradient of the parameters as a 1-dim array. The values are copied, so modifying the return value does not propagate to the actual policy parameters. However, setting this property will change the gradient.
- property param_values: Tensor
Get the parameters of the policy as a 1-dim array. The values are copied, so modifying the return value does not propagate to the actual policy parameters. However, setting this property will change the parameters.
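The copy-on-read, write-on-assign contract of param_values can be illustrated with a plain-Python stand-in. TinyPolicy and its list-based storage are hypothetical and only mimic the documented behaviour; the real property operates on PyTorch tensors.

```python
class TinyPolicy:
    """Hypothetical stand-in for illustration, not a Pyrado class."""

    def __init__(self):
        self._weights = [0.1, 0.2, 0.3]  # internal parameter storage

    @property
    def param_values(self):
        # Return a copy, so edits to the result never reach the policy
        return list(self._weights)

    @param_values.setter
    def param_values(self, new_values):
        if len(new_values) != len(self._weights):
            raise ValueError("wrong number of parameters")
        # Assigning to the property overwrites the stored parameters
        self._weights = list(new_values)


policy = TinyPolicy()

snapshot = policy.param_values
snapshot[0] = 99.0                     # modifies the copy only
unchanged = policy.param_values        # the policy still holds its old values

policy.param_values = [1.0, 2.0, 3.0]  # the setter does change the policy
updated = policy.param_values
```

To change parameters, assign a complete vector to the property rather than editing the returned value in place.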
- reset(**kwargs)[source]
Reset the policy’s internal state. This should be called at the start of a rollout. The default implementation does nothing.
- script() → ScriptModule [source]
Create a ScriptModule from this policy. The returned module will always have the signature action = tm(observation). For recurrent networks, it returns a stateful module that keeps the hidden states internally. Such modules have a reset() method to reset the hidden states.
- class TracedPolicyWrapper(module: Policy)[source]
Bases: Module
Wrapper for a traced policy. Mainly used to add input_size and output_size attributes.
Constructor
- Parameters:
module – non-recurrent network to wrap, which must not be a script module
- forward(obs)[source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- input_size: int
- output_size: int
- class TwoHeadedPolicy(spec: EnvSpec, use_cuda: bool)[source]
Bases: Policy, ABC
Base class for policies with a shared body and two separate heads.
Constructor
- Parameters:
spec – environment specification
use_cuda – True to move the policy to the GPU, False (default) to use the CPU
- abstract forward(obs: Tensor) → Union[Tensor, Tuple[Tensor, Tensor]] [source]
Get the action according to the policy and the observations (forward pass).
- Parameters:
args – inputs, e.g. an observation from the environment or an observation and a hidden state
kwargs – inputs, e.g. an observation from the environment or an observation and a hidden state
- Returns:
outputs, e.g. an action or an action and a hidden state
- abstract init_param(init_values: Optional[Tensor] = None, **kwargs)[source]
Initialize the policy’s parameters. By default the parameters are initialized randomly.
- Parameters:
init_values – tensor of fixed initial policy parameter values
kwargs – additional keyword arguments for the policy parameter initialization
- training: bool
features
- class ATan2Feat(idx_sin: int, idx_cos: int)[source]
Bases: object
Feature that computes the atan2 from two dimensions of the given input / observation.
Constructor
- Parameters:
idx_sin – index of the numerator, i.e. the sin-transformed observation dimension
idx_cos – index of the denominator, i.e. the cos-transformed observation dimension
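As a minimal sketch of what such a feature computes, the following plain-Python function recovers an angle from its sin and cos entries in a flat observation. The function atan2_feat and the list-based observation are assumptions for illustration; the class itself is meant for tensor inputs.

```python
import math


def atan2_feat(obs, idx_sin, idx_cos):
    """Recover an angle from the sin- and cos-transformed entries of obs."""
    return math.atan2(obs[idx_sin], obs[idx_cos])


theta = 2.5  # some angle in (-pi, pi]
obs = [math.sin(theta), math.cos(theta), 0.1]  # third entry is unrelated
angle = atan2_feat(obs, idx_sin=0, idx_cos=1)
```

Using atan2 instead of dividing sin by cos keeps the quadrant information and avoids a division by zero when the cos entry vanishes.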
- class FeatureStack(*feat_fcns: Sequence[Callable[[Tensor], Any]])[source]
Bases: object
Features are nonlinear transformations of the inputs.
Note
We only consider 1-dim inputs.
Constructor
- Parameters:
feat_fcns – feature functions, each of them maps from a multi-dim input to a multi-dim output (e.g. identity_feat, squared_feat, exception: const_feat)
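The idea of stacking feature functions can be sketched in plain Python. The names identity_feat, squared_feat and const_feat come from the description above, but their bodies here, the helper stack_features, and the choice to concatenate the outputs are assumptions made for this example.

```python
def identity_feat(inp):
    """Pass the input through unchanged."""
    return list(inp)


def squared_feat(inp):
    """Square every input dimension."""
    return [x * x for x in inp]


def const_feat(inp):
    """The exception: always one output, independent of the input size."""
    return [1.0]


def stack_features(inp, feat_fcns):
    """Apply every feature function and concatenate the results."""
    out = []
    for fcn in feat_fcns:
        out.extend(fcn(inp))
    return out


features = stack_features([2.0, -3.0], [const_feat, identity_feat, squared_feat])
```

With a 2-dim input, this stack yields 1 + 2 + 2 = 5 features.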
- class MultFeat(idcs: Tuple)[source]
Bases: object
Feature that multiplies two dimensions of the given input / observation.
Constructor
- Parameters:
idcs – indices of the dimensions to multiply
- class RBFFeat(num_feat_per_dim: int, bounds: [Sequence[list], Sequence[tuple], Sequence[numpy.ndarray], Sequence[torch.Tensor], Sequence[float]], scale: Optional[Union[Tensor, float]] = None, state_wise_norm: bool = True, use_cuda: bool = False)[source]
Bases: object
Normalized Gaussian radial basis function features
Constructor
- Parameters:
num_feat_per_dim – number of radial basis functions, identical for every dimension of the input
bounds – lower and upper bound for the Gaussians’ centers, the input dimension is inferred from them
scale – scaling factor for the squared distance; if None, the factor is chosen such that each RBF takes the value 0.2 at the center of its neighbor
state_wise_norm – True to apply the normalization across input state dimensions separately (every dimension sums to one), or False to jointly normalize them
use_cuda – True to move the module to the GPU, False (default) to use the CPU
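The parameters above can be made concrete with a one-dimensional, plain-Python sketch. It assumes evenly spaced centers between the bounds and derives the default scale from the 0.2 rule; rbf_features and its return values are hypothetical names, and the actual class may place centers or normalize differently.

```python
import math


def rbf_features(x, lower, upper, num_feat):
    """Normalized Gaussian RBF features of a scalar x (requires num_feat >= 2)."""
    # Assumption: centers are evenly spaced on [lower, upper]
    delta = (upper - lower) / (num_feat - 1)
    centers = [lower + i * delta for i in range(num_feat)]
    # Choose the scale so that exp(-scale * delta**2) == 0.2
    scale = -math.log(0.2) / delta ** 2
    raw = [math.exp(-scale * (x - c) ** 2) for c in centers]
    total = sum(raw)
    # Normalize so that the features of this dimension sum to one
    return [r / total for r in raw], centers, scale


# Evaluate exactly at the second center
feats, centers, scale = rbf_features(-0.5, lower=-1.0, upper=1.0, num_feat=5)
neighbor_ratio = feats[0] / feats[1]  # neighbor value relative to the peak
```

Normalizing each dimension separately, as in this sketch, corresponds to the state_wise_norm=True case described above.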
- derivative(inp: Tensor) → Tensor [source]
Compute the derivative of the features w.r.t. the inputs.
Note
Only processing of 1-dim input (e.g., no images)! The input can be batched along the first dimension.
- Parameters:
inp – input i.e. observations in the RL setting
- Returns:
values of all feature derivatives at the given observations
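For a scalar input, the derivative of normalized Gaussian features follows from the quotient rule. The sketch below is an assumption about the underlying math rather than a copy of the library code; it checks the analytic result against central finite differences, and the function names, centers and scale are chosen purely for illustration.

```python
import math


def normalized_rbf(x, centers, scale):
    """Normalized Gaussian RBF features of a scalar x."""
    raw = [math.exp(-scale * (x - c) ** 2) for c in centers]
    total = sum(raw)
    return [r / total for r in raw]


def normalized_rbf_derivative(x, centers, scale):
    """Derivative of every normalized feature with respect to x."""
    raw = [math.exp(-scale * (x - c) ** 2) for c in centers]
    d_raw = [-2.0 * scale * (x - c) * r for c, r in zip(centers, raw)]
    total, d_total = sum(raw), sum(d_raw)
    # Quotient rule applied to raw_i / total
    return [(dr * total - r * d_total) / total ** 2 for r, dr in zip(raw, d_raw)]


centers, scale, x, eps = [-1.0, 0.0, 1.0], 1.5, 0.3, 1e-6
analytic = normalized_rbf_derivative(x, centers, scale)

# Central finite differences as an independent check
upper = normalized_rbf(x + eps, centers, scale)
lower = normalized_rbf(x - eps, centers, scale)
numeric = [(u - l) / (2.0 * eps) for u, l in zip(upper, lower)]
```

Because the normalized features always sum to one, their derivatives sum to zero, which makes a cheap sanity check.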
- class RFFeat(inp_dim: int, num_feat_per_dim: int, bandwidth: Union[float, ndarray, Tensor], use_cuda: bool = False)[source]
Bases: object
Random Fourier (RF) features
See also
[1] A. Rahimi and B. Recht “Random Features for Large-Scale Kernel Machines”, NIPS, 2007
- Gaussian kernel: \(k(x,y) = \exp\left(-\frac{\sigma^2}{2d} \|x-y\|^2\right)\)
Sample from \(\mathcal{N}(0,1)\) and scale the result by \(\sigma / \sqrt{2d}\)
- Parameters:
inp_dim – flat dimension of the inputs i.e. the observations, called \(d\) in [1]
num_feat_per_dim – number of random Fourier features, called \(D\) in [1]. In contrast to the RBFFeat class, the output dimensionality, and thus the number of associated policy parameters, is num_feat_per_dim and not num_feat_per_dim * inp_dim.
bandwidth – scaling factor for the sampled frequencies. Pass a constant scalar value, for example env.obs_space.bound_up. According to [1] and the note above, d should be used here. Strictly speaking, this is not a bandwidth since it is not a frequency.
use_cuda – True to move the module to the GPU, False (default) to use the CPU
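The construction from [1] can be sketched in plain Python: draw Gaussian frequencies and uniform phases once, then map every input to D cosine features whose inner product approximates a Gaussian kernel. The helper names and the parameter freq_std are assumptions of this sketch; freq_std is the standard deviation of the sampled frequencies and is not claimed to match how the class applies its bandwidth argument.

```python
import math
import random


def sample_rff(inp_dim, num_feat, freq_std, rng):
    """Draw frequencies w ~ N(0, freq_std^2 I) and phases b ~ U[0, 2*pi]."""
    freqs = [[rng.gauss(0.0, freq_std) for _ in range(inp_dim)]
             for _ in range(num_feat)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_feat)]
    return freqs, phases


def rff(x, freqs, phases):
    """Map x to the D features sqrt(2/D) * cos(w_i . x + b_i)."""
    norm = math.sqrt(2.0 / len(freqs))
    return [norm * math.cos(sum(wj * xj for wj, xj in zip(w, x)) + b)
            for w, b in zip(freqs, phases)]


rng = random.Random(0)
freq_std = 1.0
freqs, phases = sample_rff(inp_dim=2, num_feat=5000, freq_std=freq_std, rng=rng)

x, y = [0.3, -0.2], [0.1, 0.4]
z_x, z_y = rff(x, freqs, phases), rff(y, freqs, phases)

# The inner product of the features approximates the Gaussian kernel
approx = sum(a * b for a, b in zip(z_x, z_y))
sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
exact = math.exp(-0.5 * freq_std ** 2 * sq_dist)
```

The feature vector has length D regardless of the input dimension, which is why the number of policy parameters does not grow with inp_dim.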
initialization
- init_param(m, **kwargs)[source]
Initialize the parameters of the PyTorch Module / layer / network / cell according to its type.
- Parameters:
m – PyTorch Module / layer / network / cell to initialize
kwargs – optional keyword arguments, e.g. t_max for LSTM’s chrono initialization [2], or uniform_bias
See also
[1] A. M. Saxe, J. L. McClelland, S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks”, 2014
[2] C. Tallec, Y. Ollivier, “Can recurrent neural networks warp time?”, 2018, ICLR
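The chrono initialization from [2] samples the LSTM forget-gate biases as the logarithm of a uniform draw from [1, t_max - 1] and sets the input-gate biases to their negation. The sketch below shows only this sampling step in plain Python; the function name and the list-based output are assumptions, and how init_param writes these values into an LSTM's bias tensors is not shown.

```python
import math
import random


def chrono_gate_biases(hidden_size, t_max, rng):
    """Sample chrono-style gate biases for one LSTM layer.

    Forget-gate biases are log(U[1, t_max - 1]); input-gate biases
    are the negated forget-gate biases.
    """
    forget = [math.log(rng.uniform(1.0, t_max - 1.0)) for _ in range(hidden_size)]
    inp = [-b for b in forget]
    return inp, forget


rng = random.Random(0)
bias_input, bias_forget = chrono_gate_biases(hidden_size=8, t_max=50, rng=rng)
```

Larger values of t_max push the forget-gate biases higher, which biases the cell towards remembering over longer time spans.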