feed_forward

dummy

class DummyPolicy(spec: EnvSpec, use_cuda: bool = False)[source]

Bases: Policy

Simple policy which samples random values from the action space

Constructor

Parameters:
  • spec – environment specification

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU
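
A minimal usage sketch (the import paths and the EnvSpec construction below are assumptions for illustration, not taken from this page):

import torch as to
from pyrado.spaces.box import BoxSpace
from pyrado.utils.data_types import EnvSpec
from pyrado.policies.special.dummy import DummyPolicy  # import path is an assumption

# hypothetical 4-dim observation and 2-dim action spaces
spec = EnvSpec(
    obs_space=BoxSpace(-1.0, 1.0, shape=(4,)),
    act_space=BoxSpace(-1.0, 1.0, shape=(2,)),
)
policy = DummyPolicy(spec)
act = policy(to.zeros(4))  # a random action sampled from spec.act_space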

forward(obs: Optional[Tensor] = None) → Tensor[source]

Get the action according to the policy and the observations (forward pass).

Parameters:
  • obs – observation from the environment; ignored by this policy

Returns:

action to be taken

init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

name: str = 'dummy'
class IdlePolicy(spec: EnvSpec, use_cuda: bool = False)[source]

Bases: Policy

The simplest policy, which does nothing at all

Constructor

Parameters:
  • spec – environment specification

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU

forward(obs: Optional[Tensor] = None) → Tensor[source]

Get the action according to the policy and the observations (forward pass).

Parameters:
  • obs – observation from the environment; ignored by this policy

Returns:

action to be taken

init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

name: str = 'idle'
class RecurrentDummyPolicy(spec: EnvSpec, hidden_size: int, use_cuda: bool = False)[source]

Bases: RecurrentPolicy

Simple recurrent policy which samples random values from the action space and always returns hidden states with value zero

Constructor

Parameters:
  • spec – environment specification

  • hidden_size – size of the mimic hidden layer

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU

evaluate(rollout: StepSequence, hidden_states_name: str = 'hidden_states') → Tensor[source]

Re-evaluate the given rollout and return a differentiable action tensor. This method makes sure that the gradient is propagated through the hidden state.

Parameters:
  • rollout – complete rollout

  • hidden_states_name – name of hidden states rollout entry, used for recurrent networks. Change this string for value functions.

Returns:

actions with gradient data

forward(obs: Optional[Tensor] = None, hidden: Optional[Tensor] = None) → Tuple[Tensor, Tensor][source]

Parameters:
  • obs – observation from the environment

  • hidden – the network’s hidden state. If None, use init_hidden()

Returns:

action to be taken and new hidden state
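
A sketch of the recurrent calling convention, reusing the hypothetical spec from above; init_hidden() is the initializer referenced in the forward() documentation:

policy = RecurrentDummyPolicy(spec, hidden_size=8)
hidden = policy.init_hidden()  # initial hidden state
for _ in range(5):
    # the random action changes each step, the hidden state stays all-zero
    act, hidden = policy(to.zeros(4), hidden)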

property hidden_size: int

Get the number of hidden state variables.

init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

name: str = 'rec_dummy'

playback

class PlaybackPolicy(spec: EnvSpec, act_recordings: List[Union[Tensor, array]], no_reset: bool = False, use_cuda: bool = False)[source]

Bases: Policy

A policy which simply replays a pre-recorded sequence of actions

Constructor

Parameters:
  • spec – environment specification

  • act_recordings – pre-recorded sequence of actions to be played back later

  • no_reset – True to turn reset() into a dummy function

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU
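
A usage sketch; the recording shapes are hypothetical, and that forward() yields the current step of the current recording is inferred from the curr_rec and curr_step properties below:

import numpy as np

rec_a = np.zeros((100, 2))  # 100 steps of a hypothetical 2-dim action
rec_b = np.ones((100, 2))
policy = PlaybackPolicy(spec, act_recordings=[rec_a, rec_b])
act = policy()  # the observation is ignored during playback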

property curr_rec: int

Get the pointer to the current recording.

property curr_step: int

Get the number of the current replay step (0 for the initial step).

forward(obs: Optional[Tensor] = None) → Tensor[source]

Get the action according to the policy and the observations (forward pass).

Parameters:
  • obs – observation from the environment; ignored by this policy

Returns:

action to be taken

init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

name: str = 'pb'
property no_reset: bool

Returns True if the automatic reset is skipped, i.e. the reset has to be controlled manually.

reset(**kwargs)[source]

Reset the policy’s internal state. This should be called at the start of a rollout. The default implementation does nothing.

reset_curr_rec()[source]

Reset the pointer to the current recording.

script() → ScriptModule[source]

Create a ScriptModule from this policy. The returned module will always have the signature action = tm(observation). For recurrent networks, it returns a stateful module that keeps the hidden states internally. Such modules have a reset() method to reset the hidden states.
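
A deployment sketch; the file name is hypothetical:

scripted = policy.script()  # a torch.jit.ScriptModule
act = scripted(to.zeros(4))  # same calling convention: action = tm(observation)
scripted.save("playback_policy.pt")  # can later be loaded, e.g. via torch::jit::load in C++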

poly_time

class PolySplineTimePolicy(spec: EnvSpec, dt: float, t_end: float, cond_lvl: str, cond_final: Optional[Union[Tensor, List[float], List[List[float]]]] = None, cond_init: Optional[Union[Tensor, List[float], List[List[float]]]] = None, t_init: float = 0.0, overtime_behavior: str = 'hold', init_param_kwargs: Optional[dict] = None, use_cuda: bool = False)[source]

Bases: Policy

A purely time-based policy, where the output is determined by a polynomial function satisfying given conditions

Constructor

Parameters:
  • spec – environment specification

  • dt – time step [s]

  • t_end – final time [s], relative to t_init

  • cond_lvl – highest level of the condition; so far, only velocity ('vel') and acceleration ('acc') level conditions on the polynomial are supported. These need to be consistent with the actions.

  • cond_final – final condition for the least squares problem; needs to be of shape [X, dim_act], where X is 2 if cond_lvl == 'vel' and 3 if cond_lvl == 'acc'

  • cond_init – initial condition for the least squares problem; needs to be of shape [X, dim_act], where X is 2 if cond_lvl == 'vel' and 3 if cond_lvl == 'acc'

  • t_init – initial time [s], also used on calling reset(), relative to t_end

  • overtime_behavior – determines how the policy acts when t > t_end, e.g. ‘hold’ to keep the last action

  • init_param_kwargs – additional keyword arguments for the policy parameter initialization

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU
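
A construction sketch for a hypothetical 2-dim action space with conditions up to the velocity level, i.e. cond_final and cond_init of shape [2, dim_act]; all values are placeholders:

policy = PolySplineTimePolicy(
    spec,
    dt=0.01,
    t_end=2.0,
    cond_lvl="vel",
    cond_final=[[0.5, -0.5],  # final positions
                [0.0, 0.0]],  # final velocities
    cond_init=[[0.0, 0.0],  # initial positions
               [0.0, 0.0]],  # initial velocities
)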

forward(obs: Optional[Tensor] = None) → Tensor[source]

Get the action according to the policy and the observations (forward pass).

Parameters:
  • obs – observation from the environment; ignored by this purely time-based policy

Returns:

action to be taken

init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

name: str = 'pst'
reset(**kwargs)[source]

Reset the policy’s internal state. This should be called at the start of a rollout. The default implementation does nothing.

script() → ScriptModule[source]

Create a ScriptModule from this policy. The returned module will always have the signature action = tm(observation). For recurrent networks, it returns a stateful module that keeps the hidden states internally. Such modules have a reset() method to reset the hidden states.

class TraceablePolySplineTimePolicy(spec: EnvSpec, dt: float, t_end: float, cond_lvl: str, cond_final: Union[Tensor, List[float], List[List[float]]], cond_init: Union[Tensor, List[float], List[List[float]]], t_init: float = 0.0, overtime_behavior: str = 'hold')[source]

Bases: Module

A scriptable version of PolySplineTimePolicy.

We could try to make PolySplineTimePolicy itself scriptable, but that won’t work anyway due to Policy not being scriptable. Better to just write another class.

In contrast to PolySplineTimePolicy, this constructor needs to be called with learned / working values for cond_final and cond_init.

Parameters:
  • spec – environment specification

  • dt – time step [s]

  • t_end – final time [s], relative to t_init

  • cond_lvl – highest level of the condition, so far, only velocity ‘vel’ and acceleration ‘acc’ level conditions on the polynomial are supported. These need to be consistent with the actions.

  • cond_final – final condition for the least squares problem; needs to be of shape [X, dim_act], where X is 2 if cond_lvl == 'vel' and 3 if cond_lvl == 'acc'

  • cond_init – initial condition for the least squares problem; needs to be of shape [X, dim_act], where X is 2 if cond_lvl == 'vel' and 3 if cond_lvl == 'acc'

  • t_init – initial time [s], also used on calling reset(), relative to t_end

  • overtime_behavior – determines how the policy acts when t > t_end, e.g. ‘hold’ to keep the last action
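
A sketch of building and compiling the traceable module with concrete condition values; that PolySplineTimePolicy.script() does exactly this internally is an assumption:

import torch as to

tm = TraceablePolySplineTimePolicy(
    spec, dt=0.01, t_end=2.0, cond_lvl="vel",
    cond_final=[[0.5, -0.5], [0.0, 0.0]],  # placeholder values, e.g. from a trained policy
    cond_init=[[0.0, 0.0], [0.0, 0.0]],
)
scripted = to.jit.script(tm)  # a torch.jit.ScriptModule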

act_space_flat_dim: int
act_space_shape: Tuple[int]
compute_coefficients()[source]

Compute the coefficients of the polynomial spline and store them in the internal linear layer.

compute_feats(t: float) → Tensor[source]

Compute the feature matrix depending on the time and the number of conditions.

Parameters:

t – time to evaluate at [s]

Returns:

feature matrix, either of shape [2, 4], or shape [3, 6]
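
The shapes imply a cubic polynomial (4 coefficients) for velocity-level conditions and a quintic (6 coefficients) for acceleration-level ones, with one row per condition level. A plausible reconstruction of the feature rows, not copied from the source:

import torch as to

def feats(t: float, n_cond: int) -> to.Tensor:
    # n_cond = 2 for 'vel', 3 for 'acc'; the polynomial has 2 * n_cond coefficients
    n_coeff = 2 * n_cond
    pos = to.tensor([t**i for i in range(n_coeff)])  # [1, t, t^2, ...]
    vel = to.tensor([i * t ** max(i - 1, 0) for i in range(n_coeff)])  # first derivative
    rows = [pos, vel]
    if n_cond == 3:
        acc = to.tensor([i * (i - 1) * t ** max(i - 2, 0) for i in range(n_coeff)])
        rows.append(acc)  # second derivative
    return to.stack(rows)  # shape [2, 4] or [3, 6]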

dt: float
forward(obs: Optional[Tensor] = None) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

input_size: int
name: str = 'pst'
output_size: int
overtime_behavior: str
reset()[source]

Reset the policy’s internal state.

t_curr: float
t_end: float
t_init: float

time

class TimePolicy(spec: EnvSpec, fcn_of_time: Callable[[float], Sequence[float]], dt: float, use_cuda: bool = False)[source]

Bases: Policy

A purely time-based policy, mainly useful for testing

Constructor

Usage:

import torch as to
from math import sin

# env is an environment instance providing its EnvSpec via env.spec
policy = TimePolicy(env.spec, lambda t: to.tensor([-sin(t) * 0.001]), 0.01)
Parameters:
  • spec – environment specification

  • fcn_of_time – time-dependent function returning actions

  • dt – time step [s]

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU
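
A stepping sketch; that the internal clock starts at t = 0 and advances by dt on every call is inferred from the constructor arguments and the reset() documentation below:

policy.reset()
for _ in range(3):
    act = policy()  # evaluates fcn_of_time at t ≈ 0.00, 0.01, 0.02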

forward(obs: Optional[Tensor] = None) → Tensor[source]

Get the action according to the policy and the observations (forward pass).

Parameters:
  • obs – observation from the environment; ignored by this purely time-based policy

Returns:

action to be taken

init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

name: str = 'time'
reset(**kwargs)[source]

Reset the policy’s internal state. This should be called at the start of a rollout. The default implementation does nothing.

script() → ScriptModule[source]

Create a ScriptModule from this policy. The returned module will always have the signature action = tm(observation). For recurrent networks, it returns a stateful module that keeps the hidden states internally. Such modules have a reset() method to reset the hidden states.

class TraceableTimePolicy(spec: EnvSpec, fcn_of_time: Callable[[float], Sequence[float]], dt: float)[source]

Bases: Module

A scriptable version of TimePolicy.

We could try to make TimePolicy itself scriptable, but that won’t work anyway due to Policy not being scriptable. Better to just write another class.

Constructor

Parameters:
  • spec – environment specification

  • fcn_of_time – time-dependent function returning actions

  • dt – time step [s]
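
A compilation sketch; TorchScript cannot script arbitrary Python callables, so a plain function stands in for the lambda above, and that TimePolicy.script() wraps exactly this class is an assumption:

import torch as to

def fcn_of_time(t: float):
    # hypothetical action: a single command growing linearly in time
    return [0.1 * t]

tm = TraceableTimePolicy(spec, fcn_of_time, dt=0.01)
scripted = to.jit.script(tm)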

dt: float
forward(obs: Optional[Tensor] = None) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

input_size: int
name: str = 'time'
output_size: int
reset()[source]

Reset the policy’s internal state.

t_curr: float
