How to run an experiment
------------------------

This file provides a step-by-step example of how to write a training script in Pyrado. There are many valid
possibilities to deviate from this scheme. However, the following sequence is battle-tested.

.. code-block:: python

    import pyrado
    from pyrado.algorithms.episodic.hc import HCNormal
    from pyrado.environment_wrappers.action_normalization import ActNormWrapper
    from pyrado.environments.pysim.ball_on_beam import BallOnBeamSim
    from pyrado.logger.experiment import setup_experiment, save_dicts_to_yaml
    from pyrado.policies.features import FeatureStack, identity_feat, sin_feat
    from pyrado.policies.feed_back.linear import LinearPolicy
    from pyrado.sampling.rollout import rollout, after_rollout_query
    from pyrado.utils.data_types import RenderMode
    from pyrado.utils.input_output import print_cbt

First, we create an `Experiment`, which basically is a folder (by default in `Pyrado/data/temp`). The experiments
are stored using the following scheme: `<base_dir>/<env_name>/<algo_name>/<date_time>--<extra_info>`. This rule is
only required for the automatic search for experiments (e.g. used in `sim_policy()`), since this search function
requires the individual experiment folders to start with `date_time`. Aside from this, you can name your experiments
and folders however you like. Use the `load_experiment()` function to later load your results. It will look for an
environment as well as a policy file in the provided path.

.. code-block:: python

    ex_dir = setup_experiment(BallOnBeamSim.name, f'{HCNormal.name}_{LinearPolicy.name}', 'ident-sin')

Additionally, you can set a seed for the random number generators. It is suggested to do so if you want to compare
changes of certain hyper-parameters, since it eliminates the effect of the initial state and the initial policy
parameters (both are sampled randomly in most cases).

.. code-block:: python

    pyrado.set_seed(seed=0, verbose=True)

Set up the environment, a.k.a. the domain to train in. After creating the environment, you can apply various
wrappers, which are modular. Note that the order of the wrappers can matter: for example, wrapping an environment
with an `ObsNormWrapper` and then with a `GaussianObsNoiseWrapper` applies the noise to the normalized observations,
and yields different results than the reverse order of wrapping (a short sketch of both orderings follows the
environment setup below). Environments in Pyrado can be of different types: (i) written in Python only (like the
Quanser simulations or simple OpenAI Gym environments), (ii) wrapped as well as self-designed MuJoCo-based
simulations, or (iii) self-designed robotic environments powered by Rcs using either the Bullet or Vortex physics
engine. None of the simulations includes any computer vision aspects. It is all about dynamics-based interaction and
(continuous) control. The degree of randomization for the environments varies strongly, since it is a lot of work to
randomize them properly (including testing) and I have to graduate after all ;)

.. code-block:: python

    env_hparam = dict(
        dt=1/50.,
        max_steps=300
    )
    env = BallOnBeamSim(**env_hparam)
    env = ActNormWrapper(env)
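To make the point about wrapper ordering concrete, here is a minimal sketch of the two orderings, reusing the
`env_hparam` dict from above. The import paths and the `noise_std` argument are assumptions and may need to be
adapted to the actual signatures in `Pyrado/pyrado/environment_wrappers`.

.. code-block:: python

    import numpy as np

    # Assumed import paths, check Pyrado/pyrado/environment_wrappers for the actual module names
    from pyrado.environment_wrappers.observation_normalization import ObsNormWrapper
    from pyrado.environment_wrappers.observation_noise import GaussianObsNoiseWrapper

    # One standard deviation per observation dimension (the argument name noise_std is an assumption)
    noise_std = 0.01*np.ones(env.obs_space.flat_dim)

    # Variant A: normalize first, then add noise, i.e. the noise acts on the normalized observations
    env_a = ObsNormWrapper(BallOnBeamSim(**env_hparam))
    env_a = GaussianObsNoiseWrapper(env_a, noise_std=noise_std)

    # Variant B: add noise first, then normalize, i.e. the noisy observations are normalized afterwards
    env_b = GaussianObsNoiseWrapper(BallOnBeamSim(**env_hparam), noise_std=noise_std)
    env_b = ObsNormWrapper(env_b)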
Set up the policy after the environment, since the policy needs to know the dimensions of its observation and action
space. There are many different policy architectures available under `Pyrado/pyrado/policies`, which vary
significantly in terms of required hyper-parameters. You can find some examples at `Pyrado/scripts/training`. Note
that all policies must inherit from `Policy`, which inherits from `torch.nn.Module`. Moreover, all `Policy` instances
are deterministic. The exploration is handled separately (see `Pyrado/pyrado/exploration`).

.. code-block:: python

    policy_hparam = dict(
        feats=FeatureStack(identity_feat, sin_feat)
    )
    policy = LinearPolicy(spec=env.spec, **policy_hparam)

Specify the algorithm you want to use for learning the policy parameters. For deterministic sampling, you need to set
`num_workers=1`. If `num_workers>1`, PyTorch's multiprocessing library will be used to parallelize sampling from the
environment on the CPU. The resulting behavior is non-deterministic, i.e. even for the same random seed you will get
different results. Moreover, it is advised to set `num_workers=1` if you want to debug your code. The algorithms can
be categorized into two types: one type randomizes the action at every step (their exploration strategy inherits from
`StochasticActionExplStrat`), and the other type randomizes the policy parameters once per rollout (their exploration
strategy inherits from `StochasticParamExplStrat`). It goes without saying that every algorithm has different
hyper-parameters. However, they all use the same `rollout()` function to generate their data.

.. code-block:: python

    algo_hparam = dict(
        max_iter=10,
        pop_size=20,
        num_rollouts=10,
        expl_factor=1.1,
        expl_std_init=1.,
        num_workers=4,
    )
    algo = HCNormal(ex_dir, env, policy, **algo_hparam)

Save the hyper-parameters in a YAML file before starting the training. This step is not strictly necessary, but it
helps you to later see which hyper-parameters you used, i.e. which setting leads to a successfully trained policy.

.. code-block:: python

    save_dicts_to_yaml([
        dict(env=env_hparam, seed=0),
        dict(policy=policy_hparam),
        dict(algo=algo_hparam, algo_name=algo.name)],
        ex_dir
    )

Finally, start the training. The `train()` function is the same for all algorithms inheriting from the `Algorithm`
base class. It repetitively calls the algorithm's custom `step()` and `update()` functions. You can load and continue
a previous experiment using the algorithm's `load()` method. The `snapshot_mode` argument determines when to save the
current training state, e.g. 'latest' saves after every step of the algorithm, and 'best' only saves if the average
return is a new highscore. Moreover, you can set the random number generator's seed. This second option for setting
the seed comes in handy when you want to continue from a previous experiment multiple times.

.. code-block:: python

    algo.train(snapshot_mode='latest', seed=None)
    input('Finished training. Hit enter to simulate the policy.')

Simulate the learned policy in the environment it has been trained in. The following is a part of
`scripts/sim_policy.py`, which can be executed to simulate any policy given the experiment's directory.

.. code-block:: python

    done, state, param = False, None, None
    while not done:
        ro = rollout(env, policy, render_mode=RenderMode(video=True), eval=True,
                     reset_kwargs=dict(domain_param=param, init_state=state))
        print_cbt(f'Return: {ro.undiscounted_return()}', 'g', bright=True)
        done, state, param = after_rollout_query(env, policy, ro)
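As a follow-up, here is a minimal sketch of how to reload a finished experiment for evaluation, using the
`load_experiment()` function mentioned at the beginning. The module path and the exact return values are assumptions
(it is assumed that `load_experiment()` returns the environment, the policy, and a dict of extra objects); adapt them
if your Pyrado version differs.

.. code-block:: python

    from pyrado.utils.experiments import load_experiment  # assumed module path

    # Re-create the environment and the policy from the files saved in the experiment folder.
    # The third return value (a dict with additionally loaded objects) is not needed here.
    env, policy, _ = load_experiment(ex_dir)

    # Run a single evaluation rollout with rendering
    ro = rollout(env, policy, render_mode=RenderMode(video=True), eval=True)
    print_cbt(f'Return: {ro.undiscounted_return()}', 'g', bright=True)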