2015-02-26 09:02:31 -06:00

126 lines
6.4 KiB
Plaintext

/**
@page LEVEL2 Trick Master/Slave Synchronization
Master/Slave is a way to do distributed processing in Trick using multiple simulations.
Master/Slave synchronization synchronizes a master simulation to one or more slave
simulations. The slaves may be run on a different computer or computers, independent of
the computer running the master simulation.
While synchronized, the master controls the overall simulation time for all simulations. Slaves
will use the same real-time software frame as the master and will sync with the master's time at the end of each frame.
The master also controls the overall simulation execution mode for all simulations. However, a slave that goes into freeze
mode will cause the master to go into freeze mode (which will propagate freeze mode to all slaves). Also,
a slave that loses sync or terminates unexpectedly can affect the master's mode (see sync_error_terminate below).
Trick provides a connection class for synchronization that uses sockets (MSSocket derived from MSConnect) for master/slave
connection and communication. The user may also provide their own way of synchronization if desired, in which
case you must provide your own class derived from MSConnect.
@see Trick::MSSocket\n Trick::MSConnect
Master/Slave synchronization is an optional class and need not be specified in
the S_define file when building a normal (non-Master/Slave) simulation.
@section LEVEL3 User accessible routines
The following routine is used to create a socket connection for either Master or Slave:
@li Trick::MSSocket::MSSocket() ;
In order for Master/Slave synchronization to take place, it must be enabled:
@li int Trick::Master::enable() ;
@li int Trick::Master::disable() ;
The following are routines and attributes for configuring the Master's interface to each Slave:
@li Trick::SlaveInfo::SlaveInfo() ;
@li int Trick::SlaveInfo::set_connection_type(Trick::MSConnect * in_connection) ;
@li Trick::SlaveInfo * Trick::Master::add_slave(Trick::SlaveInfo * new_slave) ;
@li Trick::SlaveInfo::machine_name
@li Trick::SlaveInfo::machine_display
@li Trick::SlaveInfo::other_args
@li Trick::SlaveInfo::remote_shell
@li Trick::SlaveInfo::remote_shell_args
@li Trick::SlaveInfo::run_input_file
@li Trick::SlaveInfo::S_main_name
@li Trick::SlaveInfo::sim_path
@li Trick::SlaveInfo::sync_error_terminate
@li Trick::SlaveInfo::sync_wait_limit
@li Trick::SlaveInfo::user_remote_shell
The following are routines and attributes for configuring a Slave's interface to the Master:
@li int Trick::Slave::set_connection_type(Trick::MSConnect * in_connection) ;
@li Trick::Slave::sync_error_terminate
@section LEVEL3 Input File Setup
The way to set up Master/Slave synchronization is to use the above routines and attributes
in a Python input file. Here is an example of how to set up a Master and a Slave in their respective input files.
@section LEVEL4 Master Input File Example
@verbatim
new_connection = trick.MSSocket()
new_slave = trick.SlaveInfo()
new_slave.set_connection_type(new_connection)
new_slave.sim_path = "/users/bob/trick_sims/SIM_myslave"
new_slave.run_input_file = "RUN_test/slave.py"
new_slave.sync_error_terminate = 1
new_slave.sync_wait_limit = 0.2
master_slave.master.add_slave(new_slave)
master_slave.master.enable()
@endverbatim
@section LEVEL4 Slave Input File Example
@verbatim
new_connection = trick.MSSocket()
master_slave.slave.set_connection_type(new_connection)
master_slave.slave.sync_error_terminate = 1
@endverbatim
@section LEVEL4 Dumping and Loading a Checkpoint
By default, the Master will command the Slave to dump or load a checkpoint when the Master dumps or loads a checkpoint.
The Master will send its checkpoint file name to the Slave, and the Slave will use that file name for its checkpoint (in
its own RUN_ directory).
The following are attributes for configuring the Master's checkpointing interface to each Slave:
@li Trick::SlaveInfo::chkpnt_dump_auto
@li Trick::SlaveInfo::chkpnt_load_auto
@li Trick::SlaveInfo::chkpnt_binary
@li Trick::SlaveInfo::reconnect_wait_limit
If you do not want the Slave to dump/load a checkpoint when the Master does, you can turn off either feature in the
Master input file like so:
@verbatim
new_slave.chkpnt_dump_auto = 0
new_slave.chkpnt_load_auto = 0
@endverbatim
in which case your Slave would have to have its own model code to perform a checkpoint dump/load.
If the Slave simulation was built with DMTCP enabled (for more information see TBD DMTCP Section), you can set the following variable
in the Master input file, causing the Slave to dump/load a binary DMTCP checkpoint when the Master commands:
@verbatim
new_slave.chkpnt_binary = 1
@endverbatim
Note that when loading a binary DMTCP checkpoint, the checkpoint is a new executable file that must be run. When the
Master commands a checkpoint (and chkpnt_load_auto=1), the Slave will send the Master its checkpoint file name path
and then terminate itself. The Master will then automatically restart the Slave by executing that checkpoint file.
If chkpnt_load_auto=0, the user is responsible for terminating and restarting the Slave.
To run the Slave with DMTCP enabled, you'll have to modify some of the Slave's attributes in the Master input file.
@li S_main_name normally defaults to your Slave's executable, which Trick fills in for you. But you will set it to instead run the dmtcp_checkpoint command.
@li run_input_file should not change from what your Slave would normally run, but it needs the keyword "dmtcp" specified after it.
@li sync_error_terminate needs to be 0 because the Slave must be terminated before the Master restarts it, and we don't want Master to also terminate.
@li reconnect_wait_limit is the time the Master will then wait for the Slave to be restarted and get reconnected with the Master. If the Slave does not
reconnect within this time, the Master will deactivate the Slave for the rest of the simulation.
Here's an example:
@verbatim
new_slave.S_main_name = "/users/bob/dmtcp-1.2.7/bin/dmtcp_checkpoint --quiet --new-coordinator --checkpoint-open-files --ckptdir ./dmtcp_checkpoints ./S_main_Linux_4.4_x86_64.exe"
new_slave.run_input_file = "RUN_test/slave.py dmtcp"
new_slave.sync_error_terminate = 0
new_slave.reconnect_wait_limit = 10.0
@endverbatim
When chkpnt_load_auto=1, the Slave restarting and reconnecting should occur within a second or two. If chkpnt_load_auto=0, the user has
to restart the slave himself (and may even be typing in the checkpoint executable on the command line), so reconnect_wait_limit should be
set accordingly.
*/