special

domain_distribution

class DomainDistrParamPolicy(mapping: Dict[int, Tuple[str, str]], trafo_mask: Union[list, Tensor], prior: Optional[DomainRandomizer] = None, scale_params: bool = False, use_cuda: bool = False)[source]

Bases: Policy

A proxy to the Policy class in order to use the policy’s parameters as domain distribution parameters

Constructor

Parameters:
  • mapping – mapping from subsequent integers to domain distribution parameters, where the first string of the value tuple is the name of the domain parameter (e.g. mass, length) and the second string is the name of the distribution parameter (e.g. mean, std). The integers are the indices of the numpy array that comes from the algorithm; see the example after this parameter list.

  • trafo_mask – every domain parameter that is set to True in this mask will be learned via a ‘virtual’ parameter, i.e. in sqrt-space, and then finally squared to retrieve the domain parameter. This transformation is useful to avoid setting a negative variance.

  • prior – prior belief about the distribution parameters in the form of a DomainRandomizer

  • scale_params – if True, the sqrt-transformed policy parameters are scaled to the range \([-0.5, 0.5]\), which makes parameter-based exploration easier.

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU
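
For illustration, a mapping and transformation mask of this kind could look as follows (the domain parameter names are placeholders, not taken from a specific environment):

# Hypothetical mapping: the integer keys index the flat parameter array coming
# from the algorithm; each value names (domain parameter, distribution parameter).
mapping = {
    0: ("mass", "mean"),
    1: ("mass", "std"),
    2: ("length", "mean"),
    3: ("length", "std"),
}
# Learn the std entries in sqrt-space so they cannot become negative.
trafo_mask = [False, True, False, True]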

forward(obs: Optional[Tensor] = None) → Tensor[source]

Get the action according to the policy and the observations (forward pass).

Parameters:
  • args – inputs, e.g. an observation from the environment or an observation and a hidden state

  • kwargs – inputs, e.g. an observation from the environment or an observation and a hidden state

Returns:

outputs, e.g. an action or an action and a hidden state

init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

property mapping: Dict[int, Tuple[str, str]]

Get the mapping from subsequent integers to domain distribution parameters, where the first string of the value tuple is the name of the domain parameter and the second string is the name of the distribution parameter.

name: str = 'ddp'
transform_to_ddp_space(params: Tensor) → Tensor[source]

Get the transformed domain distribution parameters. Wherever the mask is True, the corresponding policy parameter is learned in sqrt-space. Moreover, the policy parameters can be scaled.

Parameters:

params – policy parameters (can be the sqrt of the actual domain distribution parameter value)

Returns:

policy parameters transformed according to the mask
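
A minimal sketch of this transformation, assuming plain PyTorch and omitting the optional parameter scaling (this is not the library’s implementation):

import torch

def transform_to_ddp_space_sketch(params: torch.Tensor, trafo_mask: torch.Tensor) -> torch.Tensor:
    # Square the entries that were learned in sqrt-space (mask is True there)
    # and pass the remaining entries through unchanged.
    return torch.where(trafo_mask, params.pow(2), params)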

environment_specific

class QBallBalancerPDCtrl(env_spec: EnvSpec, state_des: Tensor = tensor([0.0, 0.0]), kp: Optional[Tensor] = None, kd: Optional[Tensor] = None, use_cuda: bool = False)[source]

Bases: Policy

PD-controller for the Quanser Ball Balancer. The only, but significant, difference between this controller and the other PD-controller is the clipping of the actions.

Note

This class’s desired state specification deviates from that of the Pyrado policies which interact with a Task.

Constructor

Parameters:
  • env_spec – environment specification

  • state_des – tensor of desired x and y ball position [m]

  • kp – 2x2 tensor of constant controller feedback coefficients for error [V/m]

  • kd – 2x2 tensor of constant controller feedback coefficients for error time derivative [Vs/m]

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU

forward(obs: Tensor) → Tensor[source]

Calculate the controller output.

Parameters:

obs – observation from the environment

Return act:

controller output [V]
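
The exact control law is not reproduced here; a generic PD step with action clipping, matching the description above in spirit, might look like this (the state extraction and the actual limits are assumptions):

import torch

def pd_step_with_clipping(err: torch.Tensor, err_dot: torch.Tensor,
                          kp: torch.Tensor, kd: torch.Tensor,
                          act_min: torch.Tensor, act_max: torch.Tensor) -> torch.Tensor:
    # Generic PD law on the position error and its time derivative (2x2 gain
    # matrices as in the constructor), followed by clipping to the action limits.
    act = kp @ err + kd @ err_dot
    return torch.clamp(act, min=act_min, max=act_max)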

init_param(kp: Optional[Tensor] = None, kd: Optional[Tensor] = None, verbose: bool = False, **kwargs)[source]

Initialize controller parameters.

Parameters:
  • kp – 2x2 tensor of constant controller feedback coefficients for error [V/m]

  • kd – 2x2 tensor of constant controller feedback coefficients for error time derivative [Vs/m]

  • verbose – print the controller’s gains

name: str = 'qbb-pd'
reset(**kwargs)[source]

Set the domain parameters defining the controller’s model using a dict called domain_param.

class QCartPoleGoToLimCtrl(init_state: ndarray, positive: bool = True)[source]

Bases: object

Controller for going to one of the joint limits (part of the calibration routine)

Constructor

Parameters:
  • init_state – initial state of the system

  • positive – direction switch

class QCartPoleSwingUpAndBalanceCtrl(env_spec: EnvSpec, long: bool = False, use_cuda: bool = False)[source]

Bases: Policy

Swing-up and balancing controller for the Quanser Cart-Pole

Constructor

Parameters:
  • env_spec – environment specification

  • long – flag for long or short pole

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU

property K_pd
forward(obs: Tensor) → Tensor[source]

Calculate the controller output.

Parameters:

obs – observation from the environment

Return act:

controller output [V]

init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

property k_e
property k_p
name: str = 'qcp-sub'
property u_max
class QQubeEnergyCtrl(env_spec: EnvSpec, ref_energy: float, energy_gain: float, th_gain: float, acc_max: float, reset_domain_param: bool = True, use_cuda: bool = False)[source]

Bases: Policy

Energy-based controller used to swing the Furuta pendulum up

Constructor

Parameters:
  • env_spec – environment specification

  • ref_energy – reference energy level [J]

  • energy_gain – P-gain on the energy [m/s/J]

  • th_gain – P-gain on angle theta

  • acc_max – maximum linear acceleration of the pendulum pivot [m/s**2]

  • reset_domain_param – if True, the domain parameters are reset if they are present as an entry in the kwargs passed to reset(). If False, they are ignored.

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU

property E_gain

Get the energy gain, called \(\mu\) in the Quanser documentation.

property E_ref

Get the reference energy level.

forward(obs: Tensor) → Tensor[source]

Control step of the energy-based controller, which is used within the swing-up controller

Parameters:

obs – observations pre-processed in the forward method of QQubeSwingUpAndBalanceCtrl

Returns:

action
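
The exact Quanser control law is not spelled out in this documentation; a commonly used energy-based swing-up law of the following form illustrates how ref_energy, energy_gain, and acc_max interact (a sketch, not the class’s actual code):

import math

def energy_swing_up_sketch(alpha: float, alpha_dot: float, energy: float,
                           e_ref: float, mu: float, acc_max: float) -> float:
    # Accelerate the pendulum pivot proportionally to the energy error, with the
    # sign chosen from the pendulum's motion, and saturate at the maximum
    # acceleration; the theta-gain term of the real controller is omitted here.
    acc = mu * (e_ref - energy) * math.copysign(1.0, alpha_dot * math.cos(alpha))
    return max(-acc_max, min(acc_max, acc))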

init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

reset(**kwargs)[source]

If desired, set the domain parameters defining the controller’s model using a dict called domain_param.

training: bool
class QQubeGoToLimCtrl(positive: bool = True, cnt_done: int = 250)[source]

Bases: object

Controller for going to one of the joint limits (part of the calibration routine)

Constructor

Parameters:

positive – direction switch

class QQubePDCtrl(env_spec: EnvSpec, pd_gains: Tensor = tensor([4.0, 0.0, 1.0, 0.0]), th_des: float = 0.0, al_des: float = 0.0, tols: Tensor = tensor([0.0262, 0.0087, 0.0017, 0.0017], dtype=torch.float64), use_cuda: bool = False)[source]

Bases: Policy

PD-controller for the Quanser Qube. Drives the Qube to \(x_{des} = [\theta_{des}, \alpha_{des}, 0.0, 0.0]\). The flag done is set when \(|x_{des} - x| < tol\).

Constructor

Parameters:
  • env_spec – environment specification

  • pd_gains – controller gains, the default values stabilize the pendulum at the center hanging down

  • th_des – desired rotary pole angle [rad]

  • al_des – desired pendulum pole angle [rad]

  • tols – tolerances for the desired angles \(\theta\) and \(\alpha\) [rad]

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU
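
A rough sketch of the behavior described above, i.e. PD feedback towards the desired angles plus the done flag once all tolerances are met (assuming the state order \([\theta, \alpha, \dot{\theta}, \dot{\alpha}]\); this is not the class’s actual code):

import torch

def qube_pd_sketch(x: torch.Tensor, x_des: torch.Tensor,
                   pd_gains: torch.Tensor, tols: torch.Tensor):
    # PD feedback on the full state error and the element-wise done condition
    # |x_des - x| < tol; gains and tolerances are the constructor's tensors.
    err = x_des - x
    act = torch.dot(pd_gains, err).unsqueeze(0)
    done = bool(torch.all(err.abs() < tols))
    return act, done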

forward(meas: Tensor) → Tensor[source]

Get the action according to the policy and the observations (forward pass).

Parameters:
  • args – inputs, e.g. an observation from the environment or an observation and a hidden state

  • kwargs – inputs, e.g. an observation from the environment or an observation and a hidden state

Returns:

outputs, e.g. an action or an action and a hidden state

init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

training: bool
class QQubeSwingUpAndBalanceCtrl(env_spec: EnvSpec, ref_energy: float = 0.025, energy_gain: float = 50.0, energy_th_gain: float = 0.4, acc_max: float = 5.0, alpha_max_pd_enable: float = 20.0, pd_gains: Tensor = tensor([-2.0, 35.0, -1.5, 3.0]), reset_domain_param: bool = True, use_cuda: bool = False)[source]

Bases: Policy

Hybrid controller (QQubeEnergyCtrl, QQubePDCtrl) switching based on the pendulum pole angle alpha

Note

Extracted Quanser’s values from q_qube2_swingup.mdl

Constructor

Parameters:
  • env_spec – environment specification

  • ref_energy – reference energy level

  • energy_gain – P-gain on the difference to the reference energy

  • energy_th_gain – P-gain on angle theta for the energy controller. This term does not exist in Quanser’s implementation. Its purpose is to keep the Qube from moving too much around the vertical axis, i.e. to prevent it from bouncing against the mechanical boundaries.

  • acc_max – maximum acceleration

  • alpha_max_pd_enable – angle threshold for enabling the PD-controller [deg]

  • pd_gains – gains for the PD-controller

  • use_cuda – True to move the policy to the GPU, False (default) to use the CPU

Note

The controller’s parameters strongly depend on the frequency at which it is operating.

forward(obs: tensor)[source]

Get the action according to the policy and the observations (forward pass).

Parameters:
  • args – inputs, e.g. an observation from the environment or an observation and a hidden state

  • kwargs – inputs, e.g. an observation from the environment or an observation and a hidden state

Returns:

outputs, e.g. an action or an action and a hidden state

init_param(init_values: Optional[Tensor] = None, **kwargs)[source]

Initialize the policy’s parameters. By default the parameters are initialized randomly.

Parameters:
  • init_values – tensor of fixed initial policy parameter values

  • kwargs – additional keyword arguments for the policy parameter initialization

name: str = 'qq-sub'
pd_enabled(cos_al: Union[float, Tensor]) → bool[source]

Check if the PD-controller should be enabled based on a predefined threshold on the alpha angle.

Parameters:

cos_al – cosine of the pendulum pole angle

Returns:

True if the condition is met, else False
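
The check can be pictured as comparing the cosine of the pendulum pole angle against the cosine of the enable threshold (a sketch assuming \(\alpha = 0\) corresponds to the upright pole):

import math

def pd_enabled_sketch(cos_al: float, alpha_max_pd_enable: float = 20.0) -> bool:
    # The PD-controller takes over once the pole is within the threshold cone
    # around the upright position, i.e. cos(alpha) is large enough.
    return cos_al > math.cos(math.radians(alpha_max_pd_enable))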

reset(**kwargs)[source]

Reset the policy’s internal state. This should be called at the start of a rollout. The default implementation does nothing.

create_mg_joint_pos_policy(env: SimEnv, t_strike_end: float = 0.5) → TimePolicy[source]

Create a policy that executes the strike for mini golf by setting joint position commands. Used in the experiments of [1].

See also

[1] F. Muratore, T. Gruner, F. Wiese, B. Belousov, M. Gienger, J. Peters, “TITLE”, VENUE, YEAR

Parameters:
  • env – mini golf simulation environment

  • t_strike_end – time when to finish the movement [s]

Returns:

policy which executes the strike solely dependent on the time
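
The returned policy maps time to joint position commands; a hypothetical time-to-command function of this kind (the actual strike trajectory, joint count, and target configuration are not documented here) could look like:

import numpy as np

def strike_command_sketch(t: float, t_strike_end: float = 0.5) -> np.ndarray:
    # Hypothetical mapping from time to joint position commands: ramp linearly
    # from a start to an end configuration until t_strike_end, then hold.
    # The joint count and target values below are placeholders.
    q_start = np.zeros(7)
    q_end = np.full(7, 0.3)
    s = min(t / t_strike_end, 1.0)
    return (1.0 - s) * q_start + s * q_end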

create_pend_excitation_policy(env: PendulumSim, num_rollouts: int, f_sin: float = 1.0) → PlaybackPolicy[source]

Create a policy that returns a previously recorded action time series. Used in the experiments of [1].

See also

[1] F. Muratore, T. Gruner, F. Wiese, B. Belousov, M. Gienger, J. Peters, “TITLE”, VENUE, YEAR

Parameters:
  • env – pendulum simulation environment

  • num_rollouts – number of rollouts to store in the policy’s buffer

  • f_sin – frequency of the sine excitation [Hz]

Returns:

policy with recorded action time series
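
A minimal sketch of generating one such recorded action time series, assuming a one-dimensional action and the environment’s step size dt:

import numpy as np

def sine_excitation_sketch(f_sin: float, dt: float, num_steps: int, amp: float = 1.0) -> np.ndarray:
    # One action time series: a sine of frequency f_sin [Hz] sampled at the
    # environment's control rate; the amplitude is a placeholder.
    t = np.arange(num_steps) * dt
    return amp * np.sin(2.0 * np.pi * f_sin * t)[:, np.newaxis]  # shape (num_steps, 1)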

wam_jsp_7dof_sin(t: float, flip_sign: bool = False)[source]

A sine-based excitation function for the 7-DoF WAM, describing a desired joint angle offset and its velocity at every point in time

Parameters:
  • t – time

  • flip_sign – if True, flip the sign

Returns:

joint angle positions and velocities
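
A sketch of such a sine-based excitation; the amplitudes and frequencies below are placeholders, not the library’s constants:

import numpy as np

def wam_sin_excitation_sketch(t: float, flip_sign: bool = False):
    # Desired joint angle offsets and velocities for the 7 joints at time t.
    amp = np.full(7, 0.1)   # placeholder amplitudes [rad]
    freq = np.full(7, 0.5)  # placeholder frequencies [Hz]
    sign = -1.0 if flip_sign else 1.0
    q_des = sign * amp * np.sin(2.0 * np.pi * freq * t)
    qd_des = sign * amp * 2.0 * np.pi * freq * np.cos(2.0 * np.pi * freq * t)
    return q_des, qd_des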

Module contents