- class DummyPolicy(spec: EnvSpec, use_cuda: bool = False)[source]
Simple policy which samples random values from the action space
- Parameters:
spec – environment specification
use_cuda – True to move the policy to the GPU, False (default) to use the CPU
- forward(obs: Optional[Tensor] = None) Tensor [source]
Get the action according to the policy and the observations (forward pass).
- Parameters:
args – inputs, e.g. an observation from the environment or an observation and a hidden state
kwargs – inputs, e.g. an observation from the environment or an observation and a hidden state
- Returns:
outputs, e.g. an action or an action and a hidden state
- init_param(init_values: Optional[Tensor] = None, **kwargs)[source]
Initialize the policy’s parameters. By default the parameters are initialized randomly.
- Parameters:
init_values – tensor of fixed initial policy parameter values
kwargs – additional keyword arguments for the policy parameter initialization
- name: str = 'dummy'
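To illustrate what sampling random values from the action space can look like, the following sketch draws one action uniformly from a box-shaped space. It does not use the library: `BoxSpace` and `sample_uniform_action` are hypothetical names, and the uniform distribution over a bounded box is an assumption rather than something this page states.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class BoxSpace:
    """Hypothetical stand-in for a bounded action space (not the library's class)."""

    lower: List[float]
    upper: List[float]


def sample_uniform_action(space: BoxSpace, rng: random.Random) -> List[float]:
    """Draw one action uniformly at random, independently per dimension."""
    return [rng.uniform(lo, hi) for lo, hi in zip(space.lower, space.upper)]


# Assumed example bounds for a 2-dimensional action
space = BoxSpace(lower=[-1.0, -0.5], upper=[1.0, 0.5])
action = sample_uniform_action(space, random.Random(0))
```

Like a forward pass called with obs=None, the sampler needs no observation.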
- class IdlePolicy(spec: EnvSpec, use_cuda: bool = False)[source]
The simplest policy, which does nothing
- Parameters:
spec – environment specification
use_cuda – True to move the policy to the GPU, False (default) to use the CPU
- forward(obs: Optional[Tensor] = None) Tensor [source]
Get the action according to the policy and the observations (forward pass).
- Parameters:
args – inputs, e.g. an observation from the environment or an observation and a hidden state
kwargs – inputs, e.g. an observation from the environment or an observation and a hidden state
- Returns:
outputs, e.g. an action or an action and a hidden state
- init_param(init_values: Optional[Tensor] = None, **kwargs)[source]
Initialize the policy’s parameters. By default the parameters are initialized randomly.
- Parameters:
init_values – tensor of fixed initial policy parameter values
kwargs – additional keyword arguments for the policy parameter initialization
- name: str = 'idle'
- class RecurrentDummyPolicy(spec: EnvSpec, hidden_size: int, use_cuda: bool = False)[source]
Simple recurrent policy which samples random values from the action space and always returns hidden states with value zero
- Parameters:
spec – environment specification
hidden_size – size of the mimicked hidden layer
use_cuda – True to move the policy to the GPU, False (default) to use the CPU
- evaluate(rollout: StepSequence, hidden_states_name: str = 'hidden_states') Tensor [source]
Re-evaluate the given rollout and return a differentiable action tensor. This method makes sure that the gradient is propagated through the hidden state.
- Parameters:
rollout – complete rollout
hidden_states_name – name of hidden states rollout entry, used for recurrent networks. Change this string for value functions.
- Returns:
actions with gradient data
- forward(obs: Optional[Tensor] = None, hidden: Optional[Tensor] = None) Tuple[Tensor, Tensor] [source]
- Parameters:
obs – observation from the environment
hidden – the network’s hidden state. If None, use init_hidden()
- Returns:
action to be taken and new hidden state
- property hidden_size: int
Get the number of hidden state variables.
- init_param(init_values: Optional[Tensor] = None, **kwargs)[source]
Initialize the policy’s parameters. By default the parameters are initialized randomly.
- Parameters:
init_values – tensor of fixed initial policy parameter values
kwargs – additional keyword arguments for the policy parameter initialization
- name: str = 'rec_cummy'
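The recurrent interface described above (an action plus a new hidden state, with the hidden state always zero) can be sketched in plain Python. `recurrent_dummy_forward` is a hypothetical helper, not part of the library, and uniform sampling within per-dimension bounds is an assumption.

```python
import random
from typing import List, Optional, Tuple


def recurrent_dummy_forward(
    act_lower: List[float],
    act_upper: List[float],
    hidden_size: int,
    rng: random.Random,
    hidden: Optional[List[float]] = None,
) -> Tuple[List[float], List[float]]:
    """Return a random action and an all-zero hidden state (hypothetical helper)."""
    # The incoming hidden state is accepted for interface compatibility but not used.
    action = [rng.uniform(lo, hi) for lo, hi in zip(act_lower, act_upper)]
    new_hidden = [0.0] * hidden_size
    return action, new_hidden


action, hidden = recurrent_dummy_forward([-1.0], [1.0], hidden_size=3, rng=random.Random(1))
```

Returning a hidden state of the right size lets code written for real recurrent policies run unchanged against the dummy.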
- class PlaybackPolicy(spec: EnvSpec, act_recordings: List[Union[Tensor, array]], no_reset: bool = False, use_cuda: bool = False)[source]
A policy which simply replays a sequence of actions. If more actions are requested, the policy
- Parameters:
spec – environment specification
act_recordings – pre-recorded sequence of actions to be played back later
no_reset – True to turn reset() into a dummy function
use_cuda – True to move the policy to the GPU, False (default) to use the CPU
- property curr_rec: int
Get the pointer to the current recording.
- property curr_step: int
Get the number of the current replay step (0 for the initial step).
- forward(obs: Optional[Tensor] = None) Tensor [source]
Get the action according to the policy and the observations (forward pass).
- Parameters:
args – inputs, e.g. an observation from the environment or an observation and a hidden state
kwargs – inputs, e.g. an observation from the environment or an observation and a hidden state
- Returns:
outputs, e.g. an action or an action and a hidden state
- init_param(init_values: Optional[Tensor] = None, **kwargs)[source]
Initialize the policy’s parameters. By default the parameters are initialized randomly.
- Parameters:
init_values – tensor of fixed initial policy parameter values
kwargs – additional keyword arguments for the policy parameter initialization
- name: str = 'pb'
- property no_reset: bool
Returns True if the automatic reset is skipped, i.e. the reset has to be controlled manually.
- reset(**kwargs)[source]
Reset the policy’s internal state. This should be called at the start of a rollout. The default implementation does nothing.
- script() ScriptModule [source]
Create a ScriptModule from this policy. The returned module will always have the signature action = tm(observation). For recurrent networks, it returns a stateful module that keeps the hidden states internally. Such modules have a reset() method to reset the hidden states.
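The interplay of curr_rec, curr_step, reset() and no_reset can be pictured with a simplified stand-in. `PlaybackSketch` is hypothetical and does not reproduce the library's implementation. Two points are assumptions: that reset() rewinds to the first step of the current recording, and that running past the end raises an error (a placeholder only, since the text above does not say what happens in that case).

```python
from typing import List, Sequence


class PlaybackSketch:
    """Hypothetical, simplified replay of pre-recorded action sequences."""

    def __init__(self, act_recordings: Sequence[Sequence[List[float]]], no_reset: bool = False):
        self._recordings = act_recordings
        self._no_reset = no_reset
        self.curr_rec = 0  # pointer to the current recording
        self.curr_step = 0  # 0 for the initial step

    def reset(self) -> None:
        """Rewind to the first step (assumption), unless no_reset turns this into a no-op."""
        if not self._no_reset:
            self.curr_step = 0

    def forward(self) -> List[float]:
        """Return the next recorded action and advance the step counter."""
        recording = self._recordings[self.curr_rec]
        if self.curr_step >= len(recording):
            # Placeholder only: the behavior past the end is not specified in the text.
            raise IndexError("recording exhausted")
        action = recording[self.curr_step]
        self.curr_step += 1
        return action


policy = PlaybackSketch([[[0.1], [0.2], [0.3]]])
replayed = [policy.forward() for _ in range(3)]
policy.reset()
first_again = policy.forward()
```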
- class PolySplineTimePolicy(spec: EnvSpec, dt: float, t_end: float, cond_lvl: str, cond_final: Optional[Union[Tensor, List[float], List[List[float]]]] = None, cond_init: Optional[Union[Tensor, List[float], List[List[float]]]] = None, t_init: float = 0.0, overtime_behavior: str = 'hold', init_param_kwargs: Optional[dict] = None, use_cuda: bool = False)[source]
A purely time-based policy, where the output is determined by a polynomial function satisfying given conditions
- Parameters:
spec – environment specification
dt – time step [s]
t_end – final time [s], relative to t_init
cond_lvl – highest level of the condition. So far, only velocity ‘vel’ and acceleration ‘acc’ level conditions on the polynomial are supported. These need to be consistent with the actions.
cond_final – final condition for the least squares problem, needs to be of shape [X, dim_act] where X is 2 if cond_lvl == ‘vel’ and 4 if cond_lvl == ‘acc’
cond_init – initial condition for the least squares problem, needs to be of shape [X, dim_act] where X is 2 if cond_lvl == ‘vel’ and 4 if cond_lvl == ‘acc’
t_init – initial time [s], also used on calling reset(), relative to t_end
overtime_behavior – determines how the policy acts when t > t_end, e.g. ‘hold’ to keep the last action
init_param_kwargs – additional keyword arguments for the policy parameter initialization
use_cuda – True to move the policy to the GPU, False (default) to use the CPU
- forward(obs: Optional[Tensor] = None) Tensor [source]
Get the action according to the policy and the observations (forward pass).
- Parameters:
args – inputs, e.g. an observation from the environment or an observation and a hidden state
kwargs – inputs, e.g. an observation from the environment or an observation and a hidden state
- Returns:
outputs, e.g. an action or an action and a hidden state
- init_param(init_values: Optional[Tensor] = None, **kwargs)[source]
Initialize the policy’s parameters. By default the parameters are initialized randomly.
- Parameters:
init_values – tensor of fixed initial policy parameter values
kwargs – additional keyword arguments for the policy parameter initialization
- name: str = 'pst'
- reset(**kwargs)[source]
Reset the policy’s internal state. This should be called at the start of a rollout. The default implementation does nothing.
- script() ScriptModule [source]
Create a ScriptModule from this policy. The returned module will always have the signature action = tm(observation). For recurrent networks, it returns a stateful module that keeps the hidden states internally. Such modules have a reset() method to reset the hidden states.
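For intuition, a polynomial that satisfies velocity-level conditions at both ends, combined with the ‘hold’ overtime behavior, can be sketched as follows. This is a worked example under stated assumptions, not the library's code: it handles a single action dimension, fixes t_init at 0, and uses the closed-form cubic coefficients instead of the least squares solve mentioned above. `cubic_coefficients` and `spline_action` are hypothetical names.

```python
from typing import Tuple

Coeffs = Tuple[float, float, float, float]


def cubic_coefficients(p0: float, v0: float, p1: float, v1: float, duration: float) -> Coeffs:
    """Coefficients c0..c3 of p(s) = c0 + c1*s + c2*s^2 + c3*s^3 for s in [0, duration].

    The polynomial matches value p0 and slope v0 at s = 0, and p1 and v1 at s = duration.
    """
    c0 = p0
    c1 = v0
    c2 = (3.0 * (p1 - p0) - (2.0 * v0 + v1) * duration) / duration**2
    c3 = (-2.0 * (p1 - p0) + (v0 + v1) * duration) / duration**3
    return c0, c1, c2, c3


def spline_action(t: float, coeffs: Coeffs, duration: float) -> float:
    """Evaluate the polynomial at time t, holding the final value once t exceeds duration."""
    s = min(max(t, 0.0), duration)  # 'hold': clamp the time instead of extrapolating
    c0, c1, c2, c3 = coeffs
    return c0 + c1 * s + c2 * s**2 + c3 * s**3


# Move from 0 to 1 within 2 s, starting and ending at rest
coeffs = cubic_coefficients(p0=0.0, v0=0.0, p1=1.0, v1=0.0, duration=2.0)
midpoint = spline_action(1.0, coeffs, duration=2.0)
held = spline_action(5.0, coeffs, duration=2.0)
```

Clamping the time is one simple way to realize a hold; other overtime behaviors would replace only that line.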
- class TraceablePolySplineTimePolicy(spec: EnvSpec, dt: float, t_end: float, cond_lvl: str, cond_final: Union[Tensor, List[float], List[List[float]]], cond_init: Union[Tensor, List[float], List[List[float]]], t_init: float = 0.0, overtime_behavior: str = 'hold')[source]
A scriptable version of PolySplineTimePolicy.
We could try to make PolySplineTimePolicy itself scriptable, but that won’t work anyway because Policy is not scriptable. It is better to just write another class.
In contrast to PolySplineTimePolicy, this constructor needs to be called with learned / working values for cond_final and cond_init.
- Parameters:
spec – environment specification
dt – time step [s]
t_end – final time [s], relative to t_init
cond_lvl – highest level of the condition. So far, only velocity ‘vel’ and acceleration ‘acc’ level conditions on the polynomial are supported. These need to be consistent with the actions.
cond_final – final condition for the least squares problem, needs to be of shape [X, dim_act] where X is 2 if cond_lvl == ‘vel’ and 4 if cond_lvl == ‘acc’
cond_init – initial condition for the least squares problem, needs to be of shape [X, dim_act] where X is 2 if cond_lvl == ‘vel’ and 4 if cond_lvl == ‘acc’
t_init – initial time [s], also used on calling reset(), relative to t_end
overtime_behavior – determines how the policy acts when t > t_end, e.g. ‘hold’ to keep the last action
- act_space_flat_dim: int
- act_space_shape: Tuple[int]
- compute_coefficients()[source]
Compute the coefficients of the polynomial spline, and set them into the internal linear layer for storing.
- compute_feats(t: float) Tensor [source]
Compute the feature matrix depending on the time and the number of conditions.
- Parameters:
t – time to evaluate at [s]
- Returns:
feature matrix, either of shape [2, 4], or shape [3, 6]
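One feature layout that is consistent with the shapes [2, 4] and [3, 6] is sketched below: each row holds the monomial basis, or one of its time derivatives, evaluated at t. The monomial basis is an assumption, and `poly_feats` is a hypothetical name rather than the library's method.

```python
from typing import List


def poly_feats(t: float, cond_lvl: str) -> List[List[float]]:
    """Rows are the monomial basis and its time derivatives, evaluated at t."""
    if cond_lvl == "vel":
        # Cubic polynomial: 4 coefficients, rows for position and velocity
        return [
            [1.0, t, t**2, t**3],
            [0.0, 1.0, 2 * t, 3 * t**2],
        ]
    if cond_lvl == "acc":
        # Quintic polynomial: 6 coefficients, rows for position, velocity, acceleration
        return [
            [1.0, t, t**2, t**3, t**4, t**5],
            [0.0, 1.0, 2 * t, 3 * t**2, 4 * t**3, 5 * t**4],
            [0.0, 0.0, 2.0, 6 * t, 12 * t**2, 20 * t**3],
        ]
    raise ValueError("cond_lvl must be 'vel' or 'acc'")


feats_vel = poly_feats(2.0, "vel")
feats_acc = poly_feats(2.0, "acc")
```

Multiplying such a feature matrix with the coefficient vector yields the polynomial's value and derivatives at t, which is what the conditions constrain.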
- dt: float
- forward(obs: Optional[Tensor] = None) Tensor [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- input_size: int
- name: str = 'pst'
- output_size: int
- overtime_behavior: str
- t_curr: float
- t_end: float
- t_init: float
- class TimePolicy(spec: EnvSpec, fcn_of_time: Callable[[float], Sequence[float]], dt: float, use_cuda: bool = False)[source]
A purely time-based policy, mainly useful for testing
- Usage:
policy = TimePolicy(env.spec, lambda t: to.tensor([-sin(t) * 0.001]), 0.01)
- Parameters:
spec – environment specification
fcn_of_time – time-dependent function returning actions
dt – time step [s]
use_cuda – True to move the policy to the GPU, False (default) to use the CPU
- forward(obs: Optional[Tensor] = None) Tensor [source]
Get the action according to the policy and the observations (forward pass).
- Parameters:
args – inputs, e.g. an observation from the environment or an observation and a hidden state
kwargs – inputs, e.g. an observation from the environment or an observation and a hidden state
- Returns:
outputs, e.g. an action or an action and a hidden state
- init_param(init_values: Optional[Tensor] = None, **kwargs)[source]
Initialize the policy’s parameters. By default the parameters are initialized randomly.
- Parameters:
init_values – tensor of fixed initial policy parameter values
kwargs – additional keyword arguments for the policy parameter initialization
- name: str = 'time'
- reset(**kwargs)[source]
Reset the policy’s internal state. This should be called at the start of a rollout. The default implementation does nothing.
- script() ScriptModule [source]
Create a ScriptModule from this policy. The returned module will always have the signature action = tm(observation). For recurrent networks, it returns a stateful module that keeps the hidden states internally. Such modules have a reset() method to reset the hidden states.
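The time-keeping behind a purely time-based policy can be pictured with a small stand-in that works on plain lists. `TimePolicySketch` is hypothetical. It assumes that each forward pass evaluates the function at the current time and then advances the clock by dt, and that reset() sets the clock back to zero.

```python
import math
from typing import Callable, List, Sequence


class TimePolicySketch:
    """Hypothetical stand-in: the action depends only on an internal clock."""

    def __init__(self, fcn_of_time: Callable[[float], Sequence[float]], dt: float):
        self._fcn_of_time = fcn_of_time
        self._dt = dt
        self.t_curr = 0.0

    def reset(self) -> None:
        """Restart the clock at the beginning of a rollout."""
        self.t_curr = 0.0

    def forward(self) -> List[float]:
        """Evaluate the function at the current time, then advance the clock by dt."""
        action = list(self._fcn_of_time(self.t_curr))
        self.t_curr += self._dt
        return action


policy = TimePolicySketch(lambda t: [-math.sin(t) * 0.001], dt=0.01)
a0 = policy.forward()  # evaluated at t = 0.0
a1 = policy.forward()  # evaluated at t = 0.01
policy.reset()
a2 = policy.forward()  # evaluated at t = 0.0 again
```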
- class TraceableTimePolicy(spec: EnvSpec, fcn_of_time: Callable[[float], Sequence[float]], dt: float)[source]
A scriptable version of TimePolicy.
We could try to make TimePolicy itself scriptable, but that won’t work anyway because Policy is not scriptable. It is better to just write another class.
- Parameters:
spec – environment specification
fcn_of_time – time-dependent function returning actions
dt – time step [s]
- dt: float
- forward(obs: Optional[Tensor] = None) Tensor [source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- input_size: int
- name: str = 'time'
- output_size: int
- t_curr: float