- class Policy(spec: EnvSpec, use_cuda: bool)[source]
Base class for all policies in Pyrado
- Parameters:
spec – environment specification
use_cuda – True to move the policy to the GPU, False (default) to use the CPU
- property device: str
Get the device (CPU or GPU) on which the policy is stored.
- evaluate(rollout: StepSequence, hidden_states_name: str = 'hidden_states') Tensor [source]
Re-evaluate the given rollout and return a differentiable action tensor. The default implementation simply calls forward().
- Parameters:
rollout – complete rollout
hidden_states_name – name of hidden states rollout entry, used for recurrent networks. Defaults to ‘hidden_states’. Change for value functions.
- Returns:
actions with gradient data
- abstract forward(*args, **kwargs) Union[Tensor, Tuple[Tensor, Tensor]] [source]
Get the action according to the policy and the observations (forward pass).
- Parameters:
args – inputs, e.g. an observation from the environment or an observation and a hidden state
kwargs – inputs, e.g. an observation from the environment or an observation and a hidden state
- Returns:
outputs, e.g. an action or an action and a hidden state
Provide initial values for the hidden states. This should usually be a zero tensor. The default implementation raises an error to enforce that recurrent policies override this function.
- Parameters:
batch_size – number of states to track in parallel
- Returns:
Tensor of batch_size x hidden_size
- abstract init_param(init_values: Optional[Tensor] = None, **kwargs)[source]
Initialize the policy’s parameters. By default the parameters are initialized randomly.
- Parameters:
init_values – tensor of fixed initial policy parameter values
kwargs – additional keyword arguments for the policy parameter initialization
- property is_recurrent: bool
Bool to signal if the policy has a recurrent architecture.
- name: str = None
- property num_param: int
Get the number of policy parameters.
- property param_grad: Tensor
Get the gradient of the parameters as a 1-dim array. The values are copied, so modifying the return value does not propagate to the actual policy parameters. However, setting this variable will change the gradient.
- property param_values: Tensor
Get the parameters of the policy as a 1-dim array. The values are copied, so modifying the return value does not propagate to the actual policy parameters. However, setting this variable will change the parameters.
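The copy-on-read, write-through behavior described for param_values can be illustrated with a minimal plain-Python sketch. The class TinyPolicy and its attributes _weights and _bias are hypothetical stand-ins, and lists replace tensors; this is not Pyrado's implementation.

```python
class TinyPolicy:
    """Hypothetical stand-in for a policy with two parameter groups."""

    def __init__(self):
        self._weights = [0.1, 0.2, 0.3]
        self._bias = [0.0]

    @property
    def param_values(self):
        # Return a flat copy, so edits to the result do not reach the policy.
        return list(self._weights) + list(self._bias)

    @param_values.setter
    def param_values(self, values):
        # Assigning to the property writes the values back into the groups.
        n = len(self._weights)
        if len(values) != n + len(self._bias):
            raise ValueError("expected one value per policy parameter")
        self._weights = list(values[:n])
        self._bias = list(values[n:])


policy = TinyPolicy()
flat = policy.param_values
flat[0] = 99.0                       # modifies the copy only
unchanged = policy.param_values[0]   # the policy still holds its old value
policy.param_values = [1.0, 2.0, 3.0, 4.0]  # setting changes the policy
```

Under these assumptions, editing the returned list leaves the policy untouched, while assigning to the property replaces all parameters at once.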
- reset(**kwargs)[source]
Reset the policy’s internal state. This should be called at the start of a rollout. The default implementation does nothing.
- script() ScriptModule [source]
Create a ScriptModule from this policy. The returned module will always have the signature action = tm(observation). For recurrent networks, it returns a stateful module that keeps the hidden states internally. Such modules have a reset() method to reset the hidden states.
- class TracedPolicyWrapper(module: Policy)[source]
Wrapper for a traced policy. Mainly used to add input_size and output_size attributes.
- Parameters:
module – non-recurrent network to wrap, which must not be a script module
- forward(obs)[source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- input_size: int
- output_size: int
- class TwoHeadedPolicy(spec: EnvSpec, use_cuda: bool)[source]
Base class for policies with a shared body and two separate heads.
- Parameters:
spec – environment specification
use_cuda – True to move the policy to the GPU, False (default) to use the CPU
- abstract forward(obs: Tensor) Union[Tensor, Tuple[Tensor, Tensor]] [source]
Get the action according to the policy and the observations (forward pass).
- Parameters:
args – inputs, e.g. an observation from the environment or an observation and a hidden state
kwargs – inputs, e.g. an observation from the environment or an observation and a hidden state
- Returns:
outputs, e.g. an action or an action and a hidden state
- abstract init_param(init_values: Optional[Tensor] = None, **kwargs)[source]
Initialize the policy’s parameters. By default the parameters are initialized randomly.
- Parameters:
init_values – tensor of fixed initial policy parameter values
kwargs – additional keyword arguments for the policy parameter initialization
- training: bool
- class ATan2Feat(idx_sin: int, idx_cos: int)[source]
Feature that computes the atan2 from two dimensions of the given input / observation.
- Parameters:
idx_sin – index of the numerator, i.e. the sin-transformed observation dimension
idx_cos – index of the denominator, i.e. the cos-transformed observation dimension
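As a minimal sketch of what such a feature computes, the following plain-Python function recovers an angle from its sine and cosine entries. The function name atan2_feat and the observation layout are assumptions made for illustration; Pyrado's class operates on tensors.

```python
import math


def atan2_feat(obs, idx_sin, idx_cos):
    """Recover an angle from the sin and cos entries of a flat observation."""
    return math.atan2(obs[idx_sin], obs[idx_cos])


# Hypothetical observation layout: [sin(theta), cos(theta), theta_dot]
theta = 2.5
obs = [math.sin(theta), math.cos(theta), 0.1]
angle = atan2_feat(obs, idx_sin=0, idx_cos=1)
```

Because atan2 takes both components into account, the angle is recovered in the correct quadrant, which a plain arctangent of the ratio would not guarantee.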
- class FeatureStack(*feat_fcns: Sequence[Callable[[Tensor], Any]])[source]
Features are nonlinear transformations of the inputs.
We only consider 1-dim inputs.
- Parameters:
feat_fcns – feature functions, each of them maps from a multi-dim input to a multi-dim output (e.g. identity_feat, squared_feat, exception: const_feat)
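The idea of stacking feature functions can be sketched in plain Python as follows. The helper stack_features is hypothetical, the three feature functions only mimic the names mentioned above, and the assumption that const_feat yields a single constant entry is an interpretation of the "exception" noted in the parameter description.

```python
def identity_feat(inp):
    # Pass the input through unchanged.
    return list(inp)


def squared_feat(inp):
    # Square every input dimension.
    return [x * x for x in inp]


def const_feat(inp):
    # Assumed exception: one constant entry, independent of the input size.
    return [1.0]


def stack_features(inp, feat_fcns):
    """Apply every feature function to the input and concatenate the results."""
    out = []
    for fcn in feat_fcns:
        out.extend(fcn(inp))
    return out


features = stack_features([2.0, -3.0], [const_feat, identity_feat, squared_feat])
```

With a 2-dim input, this stack produces 1 + 2 + 2 = 5 features, which is also the number of weights a linear policy on top of it would need.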
- class MultFeat(idcs: Tuple)[source]
Feature that multiplies two dimensions of the given input / observation.
- Parameters:
idcs – indices of the dimensions to multiply
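A minimal plain-Python sketch of this feature, assuming idcs holds exactly two indices; the function name mult_feat is hypothetical and lists stand in for tensors.

```python
def mult_feat(obs, idcs):
    """Multiply the two observation dimensions selected by idcs."""
    i, j = idcs  # raises a ValueError unless exactly two indices are given
    return obs[i] * obs[j]


# Hypothetical observation with three dimensions
product = mult_feat([2.0, 3.0, 4.0], idcs=(0, 2))
```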
- class RBFFeat(num_feat_per_dim: int, bounds: [Sequence[list], Sequence[tuple], Sequence[numpy.ndarray], Sequence[torch.Tensor], Sequence[float]], scale: Optional[Union[Tensor, float]] = None, state_wise_norm: bool = True, use_cuda: bool = False)[source]
Normalized Gaussian radial basis function features.
- Parameters:
num_feat_per_dim – number of radial basis functions, identical for every dimension of the input
bounds – lower and upper bound for the Gaussians’ centers, the input dimension is inferred from them
scale – scaling factor for the squared distance, if None the factor is determined such that each RBF has a value of 0.2 at the center of its neighboring RBF
state_wise_norm – True to apply the normalization across input state dimensions separately (every dimension sums to one), or False to jointly normalize them
use_cuda – True to move the module to the GPU, False (default) to use the CPU
- derivative(inp: Tensor) Tensor [source]
Compute the derivative of the features w.r.t. the inputs.
Only 1-dim inputs are processed (e.g., no images). The input can be batched along the first dimension.
- Parameters:
inp – input i.e. observations in the RL setting
- Returns:
values of all feature derivatives given the observations
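The feature computation with state-wise normalization can be sketched in plain Python. The function rbf_features is hypothetical, it assumes evenly spaced centers between the bounds, and it reads the 0.2 rule as exp(-scale * delta^2) = 0.2 for a center spacing delta. These are interpretations of the parameter descriptions above, not confirmed details of Pyrado.

```python
import math


def rbf_features(inp, bounds, num_feat_per_dim, scale=None):
    """Normalized Gaussian RBF features, computed separately per input dimension.

    Assumes num_feat_per_dim >= 2 and evenly spaced centers between the bounds.
    """
    lows, highs = bounds
    out = []
    for x, lo, hi in zip(inp, lows, highs):
        delta = (hi - lo) / (num_feat_per_dim - 1)  # spacing between centers
        centers = [lo + k * delta for k in range(num_feat_per_dim)]
        # Default scale: a Gaussian drops to 0.2 at the neighboring center.
        s = -math.log(0.2) / delta ** 2 if scale is None else scale
        acts = [math.exp(-s * (x - c) ** 2) for c in centers]
        total = sum(acts)
        out.extend(a / total for a in acts)  # features of this dimension sum to one
    return out


feats = rbf_features([0.0, 1.0], bounds=([-1.0, -2.0], [1.0, 2.0]), num_feat_per_dim=3)
```

In this sketch the output has num_feat_per_dim * inp_dim entries, and each block of num_feat_per_dim entries sums to one, which corresponds to state_wise_norm=True.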
- class RFFeat(inp_dim: int, num_feat_per_dim: int, bandwidth: Union[float, ndarray, Tensor], use_cuda: bool = False)[source]
Random Fourier (RF) features
See also
[1] A. Rahimi and B. Recht “Random Features for Large-Scale Kernel Machines”, NIPS, 2007
- Gaussian kernel: \(k(x,y) = \exp\left(-\frac{\sigma^2}{2d} \|x-y\|^2\right)\)
Sample from \(\mathcal{N}(0,1)\) and scale the result by \(\sigma / \sqrt{2d}\)
- Parameters:
inp_dim – flat dimension of the inputs i.e. the observations, called \(d\) in [1]
num_feat_per_dim – number of random Fourier features, called \(D\) in [1]. In contrast to the RBFFeat class, the output dimensionality, and thus the number of associated policy parameters, is num_feat_per_dim and not num_feat_per_dim * inp_dim.
bandwidth – scaling factor for the sampled frequencies. Pass a constant scalar value, for example env.obs_space.bound_up. According to [1] and the note above we should use d here. Actually, it is not a bandwidth since it is not a frequency.
use_cuda – True to move the module to the GPU, False (default) to use the CPU
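The general random Fourier feature construction from [1] can be sketched in plain Python. The names make_rff and freq_std are hypothetical, and the sketch does not reproduce how Pyrado maps its bandwidth argument to the frequency scaling. Frequencies are drawn from \(\mathcal{N}(0,1)\) and scaled by freq_std, phase shifts are drawn uniformly from \([0, 2\pi]\), and the inner product of two feature vectors then approximates the Gaussian kernel \(\exp(-\text{freq\_std}^2 \|x-y\|^2 / 2)\).

```python
import math
import random


def make_rff(inp_dim, num_feat, freq_std, seed=0):
    """Build a random Fourier feature map with num_feat outputs."""
    rng = random.Random(seed)
    # Sample frequencies from N(0, 1) and scale them by freq_std.
    freqs = [[rng.gauss(0.0, 1.0) * freq_std for _ in range(inp_dim)]
             for _ in range(num_feat)]
    # Sample one phase shift per feature from U[0, 2*pi].
    shifts = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_feat)]
    norm = math.sqrt(2.0 / num_feat)

    def feat(x):
        return [norm * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(freqs, shifts)]

    return feat


feat = make_rff(inp_dim=2, num_feat=2000, freq_std=1.0)
x, y = [0.3, -0.2], [0.1, 0.4]
# The inner product of the feature vectors approximates the kernel value.
approx = sum(a * b for a, b in zip(feat(x), feat(y)))
exact = math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, y)))
```

Note that the feature vector has num_feat entries regardless of inp_dim, which matches the remark on the output dimensionality above. The approximation error shrinks roughly with the inverse square root of the number of features.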
- init_param(m, **kwargs)[source]
Initialize the parameters of the PyTorch Module / layer / network / cell according to its type.
- Parameters:
m – PyTorch Module / layer / network / cell to initialize
kwargs – optional keyword arguments, e.g. t_max for LSTM’s chrono initialization [2], or uniform_bias
See also
[1] A. M. Saxe, J. L. McClelland, S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks”, 2014
[2] C. Tallec, Y. Ollivier, “Can recurrent neural networks warp time?”, 2018, ICLR