
init_noise_std silently ignored when distribution_type='tanh_normal' (default) #661

@AIRJASON50


Summary

When using make_ppo_networks() with the default distribution_type='tanh_normal', the init_noise_std parameter is accepted without any warning but has no effect.

Root Cause

In make_policy_network() (networks.py:393-399), the tanh_normal branch creates a plain MLP:

if distribution_type == 'tanh_normal':
    policy_module = MLP(
        layer_sizes=list(hidden_layer_sizes) + [param_size],
        activation=activation,
        kernel_init=kernel_init,
        layer_norm=layer_norm,
    )

init_noise_std, noise_std_type, and state_dependent_std are all accepted by the function signature but never used in this branch. Instead, the std is determined entirely by the network's output: the second half of the 2*action_size outputs is passed through softplus in NormalTanhDistribution.create_dist(), so the initial std depends only on the random weight initialization.
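To make the tanh_normal path concrete, here is a minimal pure-Python sketch of how the std is derived from the raw network outputs. It assumes NormalTanhDistribution computes softplus(x) + min_std with min_std=0.001 and ignores any variance scaling; the function tanh_normal_std is hypothetical, not Brax's API. The point it illustrates is that init_noise_std never enters the computation.

```python
import math

# Assumption: NormalTanhDistribution adds a small floor to the softplus output.
MIN_STD = 0.001


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def tanh_normal_std(raw_outputs, action_size, init_noise_std=1.0):
    """Hypothetical sketch of the tanh_normal std computation.

    raw_outputs holds the policy MLP's 2 * action_size outputs for one
    observation. init_noise_std is accepted but never read, mirroring the
    behaviour described in this issue.
    """
    if len(raw_outputs) != 2 * action_size:
        raise ValueError("expected 2 * action_size outputs")
    scale_logits = raw_outputs[action_size:]  # second half parameterizes the std
    return [softplus(s) + MIN_STD for s in scale_logits]


# Same network outputs, different init_noise_std: the std never changes.
outputs = [0.1, -0.2, 0.0, 0.0]  # action_size = 2; last two entries feed the std
for noise in (0.1, 0.5, 1.0):
    print(noise, tanh_normal_std(outputs, action_size=2, init_noise_std=noise))
```

With zero scale logits each std is log(2) + 0.001, roughly 0.694, whatever init_noise_std is set to; in the real network the initial logits come from the random weight initialization instead.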

Only the normal branch creates PolicyModuleWithStd, where init_noise_std actually initializes the learnable LogParam/Param.
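For contrast, the normal branch keeps a state-independent std as a trainable parameter whose starting value comes from init_noise_std. The sketch below assumes two parameterizations, log-space (analogous to LogParam) and direct (analogous to Param); the class LearnableStd and the noise_std_type values 'log' and 'scalar' are illustrative assumptions, not Brax's actual implementation.

```python
import math


class LearnableStd:
    """Hypothetical sketch of a state-independent, learnable std.

    init_noise_std sets the initial value of a trainable parameter; the
    network's outputs play no role in the std. Names are illustrative only.
    """

    def __init__(self, action_size, init_noise_std=1.0, noise_std_type="log"):
        if init_noise_std <= 0:
            raise ValueError("init_noise_std must be positive")
        self.noise_std_type = noise_std_type
        if noise_std_type == "log":
            # Store log(std); exp() keeps the std positive during training.
            self.param = [math.log(init_noise_std)] * action_size
        elif noise_std_type == "scalar":
            # Store the std directly (assumed analogue of Param).
            self.param = [float(init_noise_std)] * action_size
        else:
            raise ValueError(f"unknown noise_std_type: {noise_std_type}")

    def std(self):
        """Return the current per-action std."""
        if self.noise_std_type == "log":
            return [math.exp(p) for p in self.param]
        return list(self.param)


log_std = LearnableStd(action_size=3, init_noise_std=0.5, noise_std_type="log")
direct_std = LearnableStd(action_size=3, init_noise_std=0.5, noise_std_type="scalar")
print(log_std.std(), direct_std.std())  # both start at (approximately) 0.5 per dimension
```

The log parameterization keeps the std positive under unconstrained gradient updates, which is a common reason to prefer it over storing the std directly.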

Impact

  • Users who set init_noise_std while using the default tanh_normal receive no indication that the parameter is ignored
  • Any hyperparameter sweep over init_noise_std under tanh_normal produces identical results, wasting compute
  • The default flags in learner.py combine tanh_normal with init_noise_std=1.0, so that flag is dead configuration
  • train_test.py tests tanh_normal + init_noise_std=0.8 without verifying that the parameter has any effect

We discovered this while training dexterous manipulation policies with MJX. We had been tuning init_noise_std under the default tanh_normal for some time before realizing it had no effect.

Suggested Fix (any of)

  1. Raise a warning when init_noise_std is explicitly set with tanh_normal
  2. Document that init_noise_std only applies to distribution_type='normal'
  3. Consider changing the PPO default to distribution_type='normal'. Most PPO implementations in the community (IsaacLab, rl_games, rsl_rl, CleanRL, Stable-Baselines3) use a state-independent std without tanh squashing, consistent with the original PPO formulation by Schulman et al.
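Suggested fix 1 could look roughly like the following. The helper resolve_init_noise_std is hypothetical; it assumes the argument defaults to None so that an explicit setting can be told apart from the default (assumed here to be 1.0), and it only warns without changing behaviour.

```python
import warnings
from typing import Optional

ASSUMED_DEFAULT_NOISE_STD = 1.0  # assumption: the library's default value


def resolve_init_noise_std(
    distribution_type: str = "tanh_normal",
    init_noise_std: Optional[float] = None,
) -> float:
    """Hypothetical check: warn when init_noise_std would be silently ignored."""
    if init_noise_std is not None and distribution_type == "tanh_normal":
        warnings.warn(
            "init_noise_std has no effect when distribution_type='tanh_normal'; "
            "it only applies to distribution_type='normal'.",
            UserWarning,
            stacklevel=2,
        )
    return ASSUMED_DEFAULT_NOISE_STD if init_noise_std is None else init_noise_std


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolve_init_noise_std("tanh_normal", init_noise_std=0.8)  # warns
    resolve_init_noise_std("normal", init_noise_std=0.8)       # silent
    resolve_init_noise_std("tanh_normal")                       # silent: not set
print(len(caught))  # 1
```

Using None rather than 1.0 as the default is what makes "explicitly set" detectable: a user who passes init_noise_std=1.0 is still warned.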
