feed_forward

dummy

class DummyPolicy(spec: EnvSpec, use_cuda: bool = False)[source]

Bases: Policy

Simple policy which samples random values from the action space

Constructor

Parameters:
  • spec – environment specification

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU
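
A minimal usage sketch (the import paths and the EnvSpec construction below are assumptions for illustration, not taken from this page):

import torch as to
from pyrado.spaces.box import BoxSpace
from pyrado.utils.data_types import EnvSpec
from pyrado.policies.special.dummy import DummyPolicy  # import path is an assumption

# hypothetical 4-dim observation and 2-dim action spaces
spec = EnvSpec(
    obs_space=BoxSpace(-1.0, 1.0, shape=(4,)),
    act_space=BoxSpace(-1.0, 1.0, shape=(2,)),
)
policy = DummyPolicy(spec)
act = policy(to.zeros(4))  # a random action sampled from spec.act_space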

forward(obs: Optional[Tensor] = None) → Tensor[source]

Get the action according to the policy and the observations (forward pass).

Parameters:
  • obs – observation from the environment; ignored by this policy

Returns:

action to be taken

init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

name: str = 'dummy'
class IdlePolicy(spec: EnvSpec, use_cuda: bool = False)[source]

Bases: Policy

The simplest policy, which does nothing at all

Constructor

Parameters:
  • spec – environment specification

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU

forward(obs: Optional[Tensor] = None) → Tensor[source]

Get the action according to the policy and the observations (forward pass).

Parameters:
  • obs – observation from the environment; ignored by this policy

Returns:

action to be taken

init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

name: str = 'idle'
class RecurrentDummyPolicy(spec: EnvSpec, hidden_size: int, use_cuda: bool = False)[source]

Bases: RecurrentPolicy

Simple recurrent policy which samples random values from the action space and always returns hidden states with value zero

Constructor

Parameters:
  • spec – environment specification

  • hidden_size – size of the mimic hidden layer

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU

evaluate(rollout: StepSequence, hidden_states_name: str = 'hidden_states') → Tensor[source]

Re-evaluate the given rollout and return a differentiable action tensor. This method makes sure that the gradient is propagated through the hidden state.

Parameters:
  • rollout – complete rollout

  • hidden_states_name – name of hidden states rollout entry, used for recurrent networks. Change this string for value functions.

Returns:

actions with gradient data

forward(obs: Optional[Tensor] = None, hidden: Optional[Tensor] = None) → Tuple[Tensor, Tensor][source]

Parameters:
  • obs – observation from the environment

  • hidden – the network’s hidden state. If None, use init_hidden()

Returns:

action to be taken and new hidden state
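
A sketch of the recurrent calling convention, reusing the hypothetical spec from above; init_hidden() is the initializer referenced in the forward() documentation:

policy = RecurrentDummyPolicy(spec, hidden_size=8)
hidden = policy.init_hidden()  # initial hidden state
for _ in range(5):
    # the random action changes each step, the hidden state stays all-zero
    act, hidden = policy(to.zeros(4), hidden)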

property hidden_size: int

Get the number of hidden state variables.

init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

name: str = 'rec_dummy'

playback

class PlaybackPolicy(spec: EnvSpec, act_recordings: List[Union[Tensor, array]], no_reset: bool = False, use_cuda: bool = False)[source]

Bases: Policy

A policy which simply replays a pre-recorded sequence of actions

Constructor

Parameters:
  • spec – environment specification

  • act_recordings – pre-recorded sequence of actions to be played back later

  • no_reset – True to turn reset() into a dummy function

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU
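
A usage sketch; the recording shapes are hypothetical, and that forward() yields the current step of the current recording is inferred from the curr_rec and curr_step properties below:

import numpy as np

rec_a = np.zeros((100, 2))  # 100 steps of a hypothetical 2-dim action
rec_b = np.ones((100, 2))
policy = PlaybackPolicy(spec, act_recordings=[rec_a, rec_b])
act = policy()  # the observation is ignored during playback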

property curr_rec: int

Get the pointer to the current recording.

property curr_step: int

Get the number of the current replay step (0 for the initial step).

forward(obs: Optional[Tensor] = None) → Tensor[source]

Get the action according to the policy and the observations (forward pass).

Parameters:
  • obs – observation from the environment; ignored by this policy

Returns:

action to be taken

init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

name: str = 'pb'
property no_reset: bool

Returns True if the automatic reset is skipped, i.e. the reset has to be controlled manually.

reset(**kwargs)[source]

Reset the policy’s internal state. This should be called at the start of a rollout. The default implementation does nothing.

reset_curr_rec()[source]

Reset the pointer to the current recording.

script() → ScriptModule[source]

Create a ScriptModule from this policy. The returned module will always have the signature action = tm(observation). For recurrent networks, it returns a stateful module that keeps the hidden states internally. Such modules have a reset() method to reset the hidden states.
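
A deployment sketch; the file name is hypothetical:

scripted = policy.script()  # a torch.jit.ScriptModule
act = scripted(to.zeros(4))  # same calling convention: action = tm(observation)
scripted.save("playback_policy.pt")  # can later be loaded, e.g. via torch::jit::load in C++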

poly_time

class PolySplineTimePolicy(spec: EnvSpec, dt: float, t_end: float, cond_lvl: str, cond_final: Optional[Union[Tensor, List[float], List[List[float]]]] = None, cond_init: Optional[Union[Tensor, List[float], List[List[float]]]] = None, t_init: float = 0.0, overtime_behavior: str = 'hold', init_param_kwargs: Optional[dict] = None, use_cuda: bool = False)[source]

Bases: Policy

A purely time-based policy, where the output is determined by a polynomial function satisfying given conditions

Constructor

Parameters:
  • spec – environment specification

  • dt – time step [s]

  • t_end – final time [s], relative to t_init

  • cond_lvl – highest level of the condition; so far, only velocity ('vel') and acceleration ('acc') level conditions on the polynomial are supported. These need to be consistent with the actions.

  • cond_final – final condition for the least squares problem; needs to be of shape [X, dim_act], where X is 2 if cond_lvl == 'vel' and 3 if cond_lvl == 'acc'

  • cond_init – initial condition for the least squares problem; needs to be of shape [X, dim_act], where X is 2 if cond_lvl == 'vel' and 3 if cond_lvl == 'acc'

  • t_init – initial time [s], also used on calling reset(), relative to t_end

  • overtime_behavior – determines how the policy acts when t > t_end, e.g. ‘hold’ to keep the last action

  • init_param_kwargs – additional keyword arguments for the policy parameter initialization

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU
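
A construction sketch for a hypothetical 2-dim action space with conditions up to the velocity level, i.e. cond_final and cond_init of shape [2, dim_act]; all values are placeholders:

policy = PolySplineTimePolicy(
    spec,
    dt=0.01,
    t_end=2.0,
    cond_lvl="vel",
    cond_final=[[0.5, -0.5],  # final positions
                [0.0, 0.0]],  # final velocities
    cond_init=[[0.0, 0.0],  # initial positions
               [0.0, 0.0]],  # initial velocities
)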

forward(obs: Optional[Tensor] = None) → Tensor[source]

Get the action according to the policy and the observations (forward pass).

Parameters:
  • obs – observation from the environment; ignored by this purely time-based policy

Returns:

action to be taken

init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

name: str = 'pst'
reset(**kwargs)[source]

Reset the policy’s internal state. This should be called at the start of a rollout. The default implementation does nothing.

script() → ScriptModule[source]

Create a ScriptModule from this policy. The returned module will always have the signature action = tm(observation). For recurrent networks, it returns a stateful module that keeps the hidden states internally. Such modules have a reset() method to reset the hidden states.

class TraceablePolySplineTimePolicy(spec: EnvSpec, dt: float, t_end: float, cond_lvl: str, cond_final: Union[Tensor, List[float], List[List[float]]], cond_init: Union[Tensor, List[float], List[List[float]]], t_init: float = 0.0, overtime_behavior: str = 'hold')[source]

Bases: Module

A scriptable version of PolySplineTimePolicy.

We could try to make PolySplineTimePolicy itself scriptable, but that won’t work anyway due to Policy not being scriptable. Better to just write another class.

In contrast to PolySplineTimePolicy, this constructor needs to be called with learned / working values for cond_final and cond_init.

Parameters:
  • spec – environment specification

  • dt – time step [s]

  • t_end – final time [s], relative to t_init

  • cond_lvl – highest level of the condition, so far, only velocity ‘vel’ and acceleration ‘acc’ level conditions on the polynomial are supported. These need to be consistent with the actions.

  • cond_final – final condition for the least squares problem; needs to be of shape [X, dim_act], where X is 2 if cond_lvl == 'vel' and 3 if cond_lvl == 'acc'

  • cond_init – initial condition for the least squares problem; needs to be of shape [X, dim_act], where X is 2 if cond_lvl == 'vel' and 3 if cond_lvl == 'acc'

  • t_init – initial time [s], also used on calling reset(), relative to t_end

  • overtime_behavior – determines how the policy acts when t > t_end, e.g. ‘hold’ to keep the last action
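
A sketch of building and compiling the traceable module with concrete condition values; that PolySplineTimePolicy.script() does exactly this internally is an assumption:

import torch as to

tm = TraceablePolySplineTimePolicy(
    spec, dt=0.01, t_end=2.0, cond_lvl="vel",
    cond_final=[[0.5, -0.5], [0.0, 0.0]],  # placeholder values, e.g. from a trained policy
    cond_init=[[0.0, 0.0], [0.0, 0.0]],
)
scripted = to.jit.script(tm)  # a torch.jit.ScriptModule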

act_space_flat_dim: int
act_space_shape: Tuple[int]
compute_coefficients()[source]

Compute the coefficients of the polynomial spline and store them in the internal linear layer.

compute_feats(t: float) → Tensor[source]

Compute the feature matrix depending on the time and the number of conditions.

Parameters:

t – time to evaluate at [s]

Returns:

feature matrix, either of shape [2, 4], or shape [3, 6]
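
The shapes imply a cubic polynomial (4 coefficients) for velocity-level conditions and a quintic (6 coefficients) for acceleration-level ones, with one row per condition level. A plausible reconstruction of the feature rows, not copied from the source:

import torch as to

def feats(t: float, n_cond: int) -> to.Tensor:
    # n_cond = 2 for 'vel', 3 for 'acc'; the polynomial has 2 * n_cond coefficients
    n_coeff = 2 * n_cond
    pos = to.tensor([t**i for i in range(n_coeff)])  # [1, t, t^2, ...]
    vel = to.tensor([i * t ** max(i - 1, 0) for i in range(n_coeff)])  # first derivative
    rows = [pos, vel]
    if n_cond == 3:
        acc = to.tensor([i * (i - 1) * t ** max(i - 2, 0) for i in range(n_coeff)])
        rows.append(acc)  # second derivative
    return to.stack(rows)  # shape [2, 4] or [3, 6]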

dt: float
forward(obs: Optional[Tensor] = None) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

input_size: int
name: str = 'pst'
output_size: int
overtime_behavior: str
reset()[source]

Reset the policy’s internal state.

t_curr: float
t_end: float
t_init: float

time

class TimePolicy(spec: EnvSpec, fcn_of_time: Callable[[float], Sequence[float]], dt: float, use_cuda: bool = False)[source]

Bases: Policy

A purely time-based policy, mainly useful for testing

Constructor

Usage:

import torch as to
from math import sin

# env is an environment instance providing its EnvSpec via env.spec
policy = TimePolicy(env.spec, lambda t: to.tensor([-sin(t) * 0.001]), 0.01)
Parameters:
  • spec – environment specification

  • fcn_of_time – time-dependent function returning actions

  • dt – time step [s]

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU
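
A stepping sketch; that the internal clock starts at t = 0 and advances by dt on every call is inferred from the constructor arguments and the reset() documentation below:

policy.reset()
for _ in range(3):
    act = policy()  # evaluates fcn_of_time at t ≈ 0.00, 0.01, 0.02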

forward(obs: Optional[Tensor] = None) → Tensor[source]

Get the action according to the policy and the observations (forward pass).

Parameters:
  • obs – observation from the environment; ignored by this purely time-based policy

Returns:

action to be taken

init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

name: str = 'time'
reset(**kwargs)[source]

Reset the policy’s internal state. This should be called at the start of a rollout. The default implementation does nothing.

script() → ScriptModule[source]

Create a ScriptModule from this policy. The returned module will always have the signature action = tm(observation). For recurrent networks, it returns a stateful module that keeps the hidden states internally. Such modules have a reset() method to reset the hidden states.

class TraceableTimePolicy(spec: EnvSpec, fcn_of_time: Callable[[float], Sequence[float]], dt: float)[source]

Bases: Module

A scriptable version of TimePolicy.

We could try to make TimePolicy itself scriptable, but that won’t work anyway due to Policy not being scriptable. Better to just write another class.

Constructor

Parameters:
  • spec – environment specification

  • fcn_of_time – time-dependent function returning actions

  • dt – time step [s]
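
A compilation sketch; TorchScript cannot script arbitrary Python callables, so a plain function stands in for the lambda above, and that TimePolicy.script() wraps exactly this class is an assumption:

import torch as to

def fcn_of_time(t: float):
    # hypothetical action: a single command growing linearly in time
    return [0.1 * t]

tm = TraceableTimePolicy(spec, fcn_of_time, dt=0.01)
scripted = to.jit.script(tm)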

dt: float
forward(obs: Optional[Tensor] = None) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

input_size: int
name: str = 'time'
output_size: int
reset()[source]

Reset the policy’s internal state.

t_curr: float
