Meta-Algorithms

MAML-Algorithm (Interface)

class meta_policy_search.meta_algos.MAMLAlgo(policy, inner_lr=0.1, meta_batch_size=20, num_inner_grad_steps=1, trainable_inner_step_size=False)[source]

Bases: meta_policy_search.meta_algos.base.MetaAlgo

Provides implementations shared by all MAML algorithms

Parameters:
  • policy (Policy) – policy object
  • inner_lr (float) – gradient step size used for inner step
  • meta_batch_size (int) – number of meta-learning tasks
  • num_inner_grad_steps (int) – number of gradient updates taken per MAML iteration
  • trainable_inner_step_size (bool) – whether to make the inner step size a trainable variable
build_graph()

Creates meta-learning computation graph

Pseudocode:

for task in meta_batch_size:
    make_vars
    init_dist_info_sym
for step in num_inner_grad_steps:
    for task in meta_batch_size:
        make_vars
        update_dist_info_sym
set objectives for optimizer
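
The pseudocode maps onto the construction sketch below. The MockAlgo class is purely illustrative and not part of meta_policy_search; the helper signatures are assumptions, and only the loop structure (make_vars and init_dist_info_sym per task, then update_dist_info_sym per inner step and task, then the outer objective) mirrors the pseudocode above.

# Illustrative mock of the construction loop in build_graph().
# MockAlgo is NOT part of meta_policy_search; the helper signatures are
# assumptions and only the loop structure follows the pseudocode above.

class MockAlgo:
    def __init__(self, meta_batch_size=2, num_inner_grad_steps=1):
        self.meta_batch_size = meta_batch_size
        self.num_inner_grad_steps = num_inner_grad_steps
        self.calls = []  # records the construction order for inspection

    def make_vars(self, prefix=''):
        self.calls.append('make_vars(prefix=%r)' % prefix)
        return ('obs_phs', 'action_phs', 'adv_phs')  # placeholder stand-ins

    def init_dist_info_sym(self, task, phs):
        self.calls.append('init_dist_info_sym(task=%d)' % task)

    def update_dist_info_sym(self, task, phs):
        self.calls.append('update_dist_info_sym(task=%d)' % task)

    def build_graph(self):
        # pre-update placeholders and initial policy distributions, per task
        for task in range(self.meta_batch_size):
            phs = self.make_vars(prefix='step0_task%d_' % task)
            self.init_dist_info_sym(task, phs)
        # fresh placeholders and adapted distributions per inner step and task
        for step in range(1, self.num_inner_grad_steps + 1):
            for task in range(self.meta_batch_size):
                phs = self.make_vars(prefix='step%d_task%d_' % (step, task))
                self.update_dist_info_sym(task, phs)
        # the algorithm-specific meta-objective would be handed to the
        # outer optimizer here ("set objectives for optimizer")

algo = MockAlgo()
algo.build_graph()
print('\n'.join(algo.calls))
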
make_vars(prefix='')

Parameters:
  • prefix (str) – a string to prepend to the name of each variable
Returns:
  a tuple containing lists of placeholders for each input type and meta task
Return type:
  tuple
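
As a hedged usage sketch (assuming algo is a constructed MAMLAlgo subclass such as the ProMP instance built further below, and assuming the meta_batch_size constructor argument is exposed as an attribute), the returned tuple can be inspected per input type:

# Hedged sketch: the number and names of the placeholder lists are internal,
# but each list is documented to hold one placeholder per meta task.
placeholders = algo.make_vars(prefix='step1_')
for ph_list in placeholders:
    assert len(ph_list) == algo.meta_batch_size  # assumed attribute name
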
optimize_policy(all_samples_data, log=True)

Performs MAML outer step for each task

Parameters:
  • all_samples_data (list) – list of lists of lists of samples (each is a dict) split by gradient update and meta task
  • log (bool) – whether to log statistics
Returns:

None
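
A hedged sketch of the nesting that optimize_policy expects for all_samples_data: the outer index runs over gradient-update steps (presumably num_inner_grad_steps + 1 of them, the pre-update batch plus one per adaptation step), the next index runs over meta tasks, and each leaf is a dict of processed sample arrays. The dict keys and array shapes below are assumptions for illustration, not the library's exact sample-processor output.

import numpy as np

# Assumed structure: all_samples_data[update_step][meta_task] -> dict of arrays
num_inner_grad_steps = 1
meta_batch_size = 20
batch_size, obs_dim, act_dim = 100, 8, 2

def fake_task_samples():
    # illustrative keys and shapes only
    return {
        'observations': np.zeros((batch_size, obs_dim)),
        'actions': np.zeros((batch_size, act_dim)),
        'advantages': np.zeros(batch_size),
    }

all_samples_data = [
    [fake_task_samples() for _ in range(meta_batch_size)]
    for _ in range(num_inner_grad_steps + 1)
]

# algo.optimize_policy(all_samples_data, log=True)  # would run the outer step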

ProMP-Algorithm

class meta_policy_search.meta_algos.ProMP(*args, name='ppo_maml', learning_rate=0.001, num_ppo_steps=5, num_minibatches=1, clip_eps=0.2, target_inner_step=0.01, init_inner_kl_penalty=0.01, adaptive_inner_kl_penalty=True, anneal_factor=1.0, **kwargs)[source]

Bases: meta_policy_search.meta_algos.base.MAMLAlgo

ProMP Algorithm

Parameters:
  • policy (Policy) – policy object
  • name (str) – tf variable scope
  • learning_rate (float) – learning rate for optimizing the meta-objective
  • num_ppo_steps (int) – number of ProMP steps (without re-sampling)
  • num_minibatches (int) – number of minibatches for computing the ppo gradient steps
  • clip_eps (float) – PPO clip range
  • target_inner_step (float) – target inner KL divergence; only used when adaptive_inner_kl_penalty is True
  • init_inner_kl_penalty (float) – initial penalty coefficient for the inner KL divergence
  • adaptive_inner_kl_penalty (bool) – whether to use a fixed or an adaptive KL penalty on the inner gradient update
  • anneal_factor (float) – multiplicative factor for annealing clip_eps. If anneal_factor < 1, clip_eps <- anneal_factor * clip_eps at each iteration
  • inner_lr (float) – gradient step size used for inner step
  • meta_batch_size (int) – number of meta-learning tasks
  • num_inner_grad_steps (int) – number of gradient updates taken per MAML iteration
  • trainable_inner_step_size (bool) – whether to make the inner step size a trainable variable
build_graph()[source]

Creates the computation graph

make_vars(prefix='')

Parameters:
  • prefix (str) – a string to prepend to the name of each variable
Returns:
  a tuple containing lists of placeholders for each input type and meta task
Return type:
  tuple
optimize_policy(all_samples_data, log=True)[source]

Performs MAML outer step

Parameters:
  • all_samples_data (list) – list of lists of lists of samples (each is a dict) split by gradient update and meta task
  • log (bool) – whether to log statistics
Returns:

None
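
A hedged construction sketch for ProMP using the arguments documented above. The policy import and its constructor arguments are assumptions about the companion policies module (paths and signatures may differ in your checkout); in practice the algorithm is driven by the library's training loop rather than called in isolation.

from meta_policy_search.meta_algos import ProMP
# assumed import path and constructor signature for the meta policy:
from meta_policy_search.policies.meta_gaussian_mlp_policy import MetaGaussianMLPPolicy

meta_batch_size = 20

policy = MetaGaussianMLPPolicy(
    name='meta_policy',
    obs_dim=8,                 # illustrative environment dimensions
    action_dim=2,
    meta_batch_size=meta_batch_size,
    hidden_sizes=(64, 64),
)

algo = ProMP(
    policy=policy,
    learning_rate=1e-3,
    num_ppo_steps=5,
    num_minibatches=1,
    clip_eps=0.2,
    target_inner_step=0.01,
    init_inner_kl_penalty=0.01,
    adaptive_inner_kl_penalty=True,
    inner_lr=0.1,              # MAMLAlgo arguments forwarded via **kwargs
    meta_batch_size=meta_batch_size,
    num_inner_grad_steps=1,
)

# Each training iteration then performs the outer (meta) update:
# algo.optimize_policy(all_samples_data, log=True)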

TRPO-MAML-Algorithm

class meta_policy_search.meta_algos.TRPOMAML(*args, name='trpo_maml', step_size=0.01, inner_type='likelihood_ratio', exploration=False, **kwargs)[source]

Bases: meta_policy_search.meta_algos.base.MAMLAlgo

Algorithm for TRPO MAML

Parameters:
  • policy (Policy) – policy object
  • name (str) – tf variable scope
  • step_size (float) – trust region size for the meta-policy optimization via TRPO
  • inner_type (str) – one of 'log_likelihood', 'likelihood_ratio', or 'dice'; selects which inner update objective to use
  • exploration (bool) – whether to use E-MAML (True) or plain MAML (False)
  • inner_lr (float) – gradient step size used for inner step
  • meta_batch_size (int) – number of meta-learning tasks
  • num_inner_grad_steps (int) – number of gradient updates taken per MAML iteration
  • trainable_inner_step_size (bool) – whether to make the inner step size a trainable variable
build_graph()[source]

Creates the computation graph

make_vars(prefix='')

Parameters:
  • prefix (str) – a string to prepend to the name of each variable
Returns:
  a tuple containing lists of placeholders for each input type and meta task
Return type:
  tuple
optimize_policy(all_samples_data, log=True)[source]

Performs MAML outer step

Parameters:
  • all_samples_data (list) – list of lists of lists of samples (each is a dict) split by gradient update and meta task
  • log (bool) – whether to log statistics
Returns:

None
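
A shorter hedged sketch for TRPOMAML, reusing a policy constructed as in the ProMP example above; the shared MAMLAlgo arguments are again forwarded through **kwargs.

from meta_policy_search.meta_algos import TRPOMAML

algo = TRPOMAML(
    policy=policy,               # a Policy instance, e.g. as built above
    step_size=0.01,              # trust region size for the TRPO outer step
    inner_type='dice',           # or 'log_likelihood' / 'likelihood_ratio'
    exploration=True,            # enable the E-MAML pre-update sampling term
    inner_lr=0.1,
    meta_batch_size=20,
    num_inner_grad_steps=1,
)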

VPG-MAML-Algorithm

class meta_policy_search.meta_algos.VPGMAML(*args, name='vpg_maml', learning_rate=0.001, inner_type='likelihood_ratio', exploration=False, **kwargs)[source]

Bases: meta_policy_search.meta_algos.base.MAMLAlgo

Algorithm for VPG MAML (vanilla policy gradient meta-objective)

Parameters:
  • policy (Policy) – policy object
  • name (str) – tf variable scope
  • learning_rate (float) – learning rate for the meta-objective
  • exploration (bool) – whether to use the exploration / pre-update sampling (E-MAML) term
  • inner_type (str) – inner optimization objective; either 'log_likelihood' or 'likelihood_ratio'
  • inner_lr (float) – gradient step size used for inner step
  • meta_batch_size (int) – number of meta-learning tasks
  • num_inner_grad_steps (int) – number of gradient updates taken per MAML iteration
  • trainable_inner_step_size (bool) – whether to make the inner step size a trainable variable
build_graph()[source]

Creates the computation graph

make_vars(prefix='')

Parameters:
  • prefix (str) – a string to prepend to the name of each variable
Returns:
  a tuple containing lists of placeholders for each input type and meta task
Return type:
  tuple
optimize_policy(all_samples_data, log=True)[source]

Performs MAML outer step

Parameters:
  • all_samples_data (list) – list of lists of lists of samples (each is a dict) split by gradient update and meta task
  • log (bool) – whether to log statistics
Returns:

None
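
Analogously, a hedged sketch for VPGMAML with the vanilla-policy-gradient outer step; the policy is again assumed to be an existing Policy instance.

from meta_policy_search.meta_algos import VPGMAML

algo = VPGMAML(
    policy=policy,                   # a Policy instance, e.g. as built above
    learning_rate=1e-3,
    inner_type='likelihood_ratio',   # or 'log_likelihood'
    exploration=False,               # plain MAML (no E-MAML term)
    inner_lr=0.1,
    meta_batch_size=20,
    num_inner_grad_steps=1,
)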