Overview
========
MOP is a system for distributing CMS production jobs. There is a MOP
master site where jobs are defined by the CMS production scripts. The
mop_submitter then distributes those jobs to remote sites through
Condor-G and Globus. When the jobs are finished, the output is
collected by GDMP.
Installing the MOP master
=========================
The MOP master only runs at the master site (Fermilab). It does not
need to be installed anywhere else. It is, however, available by
anonymous CVS checkout for anyone interested. To check out the MOP master:
% setenv CVSROOT :pserver:anonymous@cdcvs.fnal.gov:/cvs/cd_read_only
% cvs login
(Logging in to anonymous@cdcvs.fnal.gov)
CVS password: anoncvs
% cvs -Q checkout mop_master
% cd mop_master
% ./install
Checking out impala
Checking out mop_submitter
%
To use the mop_master, source mop_master/setup.(c)sh, then look at the
demo script "doit" in the mop_master/run directory.
To reiterate: mop_master is not needed at remote sites.
Preparing remote sites for MOP jobs
===================================
Remote site overview:
In general, the remote site will need Globus installed on one or more
machines, a Globus job manager for the local batch system and an
installation of GDMP. The following is a detailed description of what
is needed and why. Skip to the section "Remote site summary" to avoid
the details.
Specifically, in order to run jobs at a remote site, MOP needs to:
(1) copy input files to the remote site,
(2) run a local copy of the CMS software on the local batch system,
(3) put the output where the local GDMP server can publish it,
and
(4) tell the local GDMP server to publish the output.
All of these requirements could be met with a single machine. It may
be convenient, however, to spread the tasks across several machines.
(1) In order to copy the input files to the remote site, MOP needs
access to a Globus jobmanager that will run jobs on a machine that has
gsincftp. This job manager will be referred to as the stage-in job
manager. At both Wisconsin and Fermilab we are using the default fork
jobmanager on the batch submission machine for this task. The
downloaded files must be accessible to the nodes in the batch system
for step (2).
(2) Running a local copy of the CMS software requires getting the
DAR distribution of the CMS software from Fermilab and installing
it. MOP then requires access to a Globus job manager for the batch
system. This job manager will be referred to as the run job manager. At Fermilab we are
currently using the Globus PBS job manager on a small test system. At
Wisconsin we are using the Globus Condor job manager. Globus job
managers are available for virtually all popular batch
systems. Installing a new job manager is not difficult, but it is
unfortunately also not well documented. I (Jim Amundson) can provide
assistance if it will help.
(3) Production jobs need to put their output where GDMP can access
it. The current implementation of GDMP requires that all flat files
be put in one directory. Therefore, there needs to be a directory on a
shared file system accessible both to the batch nodes and to the GDMP
server. The remote nodes should be able to execute commands in the
GDMP directory, but they do not need to have Globus access nor do they
need to run the GDMP server.
(4) The GDMP server needs to be notified to publish the files it knows
about. The publish command must be executed on the server node
itself. MOP therefore needs access to a Globus job manager that will
run jobs on the GDMP server node. This job manager will be referred to
as the publish job manager. At Fermilab we are currently using the
Globus default fork job manager.
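The four requirements above map onto three job manager roles: stage-in, run and publish. As a rough illustration only (the SiteRoles class, its field names and the example.org hosts are hypothetical and not part of MOP), the following Python sketch records which job manager serves each role and counts how many distinct machines are involved:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteRoles:
    """Hypothetical record of the job managers a remote site exposes to MOP."""
    stage_in: str  # step (1): copies input files with gsincftp
    run: str       # step (2): submits to the local batch system
    publish: str   # step (4): runs the GDMP publish command on the server node

    def hosts(self) -> set:
        """Return the distinct host names behind the three job managers."""
        names = set()
        for contact in (self.stage_in, self.run, self.publish):
            # A contact such as "host:/jobmanager" or "host/jobmanager-condor"
            # starts with the host name, so cut at the first ':' or '/'.
            names.add(contact.replace(":", "/").split("/", 1)[0])
        return names


# A single machine can serve every role ...
single = SiteRoles(
    stage_in="gate.example.org:/jobmanager",
    run="gate.example.org:/jobmanager-pbs",
    publish="gate.example.org:/jobmanager",
)
# ... or the roles can be spread across several machines.
spread = SiteRoles(
    stage_in="gate.example.org:/jobmanager",
    run="batch.example.org:/jobmanager-condor",
    publish="gdmp.example.org:/jobmanager",
)
print(len(single.hosts()), len(spread.hosts()))
```

Whichever layout is chosen, the files downloaded in step (1) and the output written in step (3) still have to sit on storage that the batch nodes can reach.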
Accounts:
Initial MOP tests have run partially as cmsprod and partially as
amundson. New grid-mapfile entries should be for the cmsprod account
at Fermilab. The certificate subject to map is:
"/O=Grid/O=Globus/OU=fnal.gov/CN=CMS Production"
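The quoted string above is the certificate subject that a grid-mapfile entry maps to a local account. The sketch below assumes the usual two-field grid-mapfile layout (quoted subject, then local account name). The helper names are hypothetical, and the local account name "cmsprod" is an assumption; a site may map the subject to whatever local account it prefers.

```python
import shlex

CMSPROD_SUBJECT = "/O=Grid/O=Globus/OU=fnal.gov/CN=CMS Production"


def format_gridmap_entry(subject: str, account: str) -> str:
    """Build one grid-mapfile line: a quoted subject, then the local account."""
    return f'"{subject}" {account}'


def parse_gridmap_entry(line: str):
    """Split a grid-mapfile line back into (subject, account)."""
    # shlex honours the double quotes, so the space in "CMS Production"
    # does not break the subject into two fields.
    subject, account = shlex.split(line)
    return subject, account


entry = format_gridmap_entry(CMSPROD_SUBJECT, "cmsprod")  # assumed local account
print(entry)
```

The quotes matter because the subject contains a space; without them the entry would be read as three fields.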
Remote site summary:
Remote sites need to install GDMP. If you do not have GDMP installed
and/or do not have GDMP installation instructions, please contact
Shahzad Muzaffar (muzaffar@fnal.gov). Remote sites also need to
install Globus on one or more machines. Finally, the following
information needs to be conveyed:
--------------------------------------------------------------------------
(1-1) stage-in job manager
(1-2) GLOBUS_LOCATION value
(1-3) shared directory for MOP files, if not the home directory
(2-1) run job manager
(2-2) location of the CMS DAR installation. NB: only the path needs to be
provided. The DAR file(s) themselves can be installed through MOP.
(3-1) GDMP install directory
(3-2) GDMP flat file directory
(3-3) GDMP Objectivity file directory
(4-1) publish job manager (on the GDMP server node)
--------------------------------------------------------------------------
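One way to keep track of the items in this checklist is a small record with a completeness check. The following is a minimal sketch, not part of MOP: the RemoteSiteInfo class and its field names are hypothetical, and treating (1-3) and (4-1) as optional is an assumption based on the "use default" entries in the Fermilab example values.

```python
from dataclasses import dataclass, fields
from typing import Optional

# Assumption: these two items may be left unset ("use default").
OPTIONAL_ITEMS = {"shared_directory", "publish_job_manager"}


@dataclass
class RemoteSiteInfo:
    """Hypothetical container for the items a remote site reports."""
    stage_in_job_manager: Optional[str] = None  # (1-1)
    globus_location: Optional[str] = None       # (1-2)
    shared_directory: Optional[str] = None      # (1-3)
    run_job_manager: Optional[str] = None       # (2-1)
    dar_location: Optional[str] = None          # (2-2)
    gdmp_install_dir: Optional[str] = None      # (3-1)
    gdmp_flatfile_dir: Optional[str] = None     # (3-2)
    gdmp_objectivity_dir: Optional[str] = None  # (3-3)
    publish_job_manager: Optional[str] = None   # (4-1)

    def missing(self) -> list:
        """Return the names of required items that have not been filled in."""
        return [
            f.name
            for f in fields(self)
            if getattr(self, f.name) is None and f.name not in OPTIONAL_ITEMS
        ]


# Partially filled record: only the job managers are known so far.
partial = RemoteSiteInfo(
    stage_in_job_manager="gate.example.org:/jobmanager",
    run_job_manager="gate.example.org:/jobmanager-pbs",
)
print(partial.missing())
```

The host names in the usage example are placeholders; the point is only that an incomplete reply is easy to spot before a site is added.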
Example remote site values:
The first "remote" site is at Fermilab. Here are the Fermilab values:
(1-1) droidf.fnal.gov:/jobmanager
(1-2) /opt/globus/globus20
(1-3) use default
(2-1) droidf.fnal.gov:/jobmanager (We are using the fork job manager
for now. You probably do *not* want to do this.)
(2-2) /data/tarballs
(3-1) /data/gdmp/current
(3-2) /data/GDMP_DATA/FlatFiles
(3-3) /data/GDMP_DATA/Objectivity
(4-1) use default
The second remote site example is Wisconsin.
(1-1) beak.cs.wisc.edu:/jobmanager
(1-2) (?)
(1-3) /shared/scratch/amundson
(2-1) beak.cs.wisc.edu:/jobmanager-condor-INTEL-LINUX
(2-2) /shared/scratch/amundson
etc.
The site parameters are stored in the mop_submitter/site-info
directory. Job manager and scratch directory info is in the *.site.*
files. The .vars files hold the following information:
MOP_REMOTE_GLOBUS_LOCATION (1-2)
MOP_REMOTE_DAR_ROOT (2-2)
MOP_REMOTE_CMS_DB (derived from 2-2)
MOP_REMOTE_GDMP_DIR (3-1)
MOP_REMOTE_GDMP_FLATFILE_DIR (3-2)
MOP_REMOTE_GDMP_OBJYFILE_DIR (3-3)
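The exact syntax of the .vars files is not described here. Assuming they hold simple NAME=value lines (an assumption, as are the function names and the sample contents below), a quick check that a file defines every variable in the list could look like this:

```python
EXPECTED_VARS = (
    "MOP_REMOTE_GLOBUS_LOCATION",
    "MOP_REMOTE_DAR_ROOT",
    "MOP_REMOTE_CMS_DB",
    "MOP_REMOTE_GDMP_DIR",
    "MOP_REMOTE_GDMP_FLATFILE_DIR",
    "MOP_REMOTE_GDMP_OBJYFILE_DIR",
)


def parse_vars(text: str) -> dict:
    """Parse NAME=value lines, skipping blanks, comments and malformed lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[name.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(values: dict) -> list:
    """Return the expected variable names that the file does not define."""
    return [name for name in EXPECTED_VARS if name not in values]


# Hypothetical file contents, filled in with the Fermilab example values.
SAMPLE = """\
# example .vars contents (assumed syntax)
MOP_REMOTE_GLOBUS_LOCATION=/opt/globus/globus20
MOP_REMOTE_DAR_ROOT=/data/tarballs
MOP_REMOTE_GDMP_DIR=/data/gdmp/current
MOP_REMOTE_GDMP_FLATFILE_DIR=/data/GDMP_DATA/FlatFiles
MOP_REMOTE_GDMP_OBJYFILE_DIR="/data/GDMP_DATA/Objectivity"
"""

parsed = parse_vars(SAMPLE)
print(missing_vars(parsed))
```

MOP_REMOTE_CMS_DB is deliberately left out of the sample, since how it is derived from (2-2) is not spelled out above.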
Addendum 1:
===========
(1) In the $GLOBUS_LOCATION/etc/globus-job-manager-condor.conf file,
edit the two lines at the bottom according to the comment above
them, adding INTEL and LINUX as arguments, like so:
-condor-arch INTEL
-condor-os LINUX
and remove the two comment lines (appearing just before):
# Edit the following two lines to complete
# the configuration of your condor jobmanager
Then add another line to retain debugging info on errors:
-save-logfile on_errors
Finally, rename the file to globus-job-manager-condor-INTEL-LINUX.conf
(2) Let's use testulix.phys.ufl.edu as an example.
In $GLOBUS_LOCATION/etc/jobmanager-condor, add "-condor-os
LINUX -condor-arch INTEL" to the end of the argument list, and change the
existing -conf and -rdn arguments to refer to
globus-job-manager-condor-INTEL-LINUX.conf and
testulix.phys.ufl.edu/jobmanager-condor-INTEL-LINUX instead of the old names.
Finally, rename the file to jobmanager-condor-INTEL-LINUX.
We thereafter refer to that job manager as
testulix.phys.ufl.edu/jobmanager-condor-INTEL-LINUX instead of just
testulix.phys.ufl.edu/jobmanager-condor (when using globus-job-run, Condor-G, etc).
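The renaming in steps (1) and (2) follows one pattern, which the sketch below spells out. It is an illustration only: the function names are hypothetical, and the "old" argument list in the usage example is an assumed starting point rather than a copy of any real file.

```python
def condor_job_manager_names(host, arch="INTEL", opsys="LINUX"):
    """Derive the file and job manager names used after the rename."""
    suffix = f"condor-{arch}-{opsys}"
    return {
        "conf_file": f"globus-job-manager-{suffix}.conf",
        "service_file": f"jobmanager-{suffix}",
        "contact": f"{host}/jobmanager-{suffix}",
    }


def retarget_arguments(args, conf_path, rdn, arch="INTEL", opsys="LINUX"):
    """Return a copy of args with the -conf and -rdn values replaced
    and the Condor OS and architecture flags appended."""
    replacements = {"-conf": conf_path, "-rdn": rdn}
    out = list(args)
    for i, token in enumerate(out[:-1]):
        if token in replacements:
            out[i + 1] = replacements[token]
    return out + ["-condor-os", opsys, "-condor-arch", arch]


names = condor_job_manager_names("testulix.phys.ufl.edu")

# Assumed argument list before the edit (illustrative values only).
old_args = [
    "-conf", "/usr/local/globus/globus-2.0/etc/globus-job-manager-condor.conf",
    "-type", "condor",
    "-rdn", "jobmanager-condor",
    "-machine-type", "unknown",
    "-publish-jobs",
]
new_args = retarget_arguments(
    old_args,
    "/usr/local/globus/globus-2.0/etc/" + names["conf_file"],
    names["contact"],
)
print(" ".join(new_args))
```

The -rdn value is passed in by the caller rather than computed, so the same helper works whichever naming a site settles on.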
At Florida, the condor job manager is then:
% cat jobmanager-condor-INTEL-LINUX
stderr_log,local_cred -
/usr/local/globus/globus-2.0/libexec/globus-job-manager
globus-job-manager -conf /usr/local/globus/globus-2.0/etc/globus-job-manager-condor-INTEL-LINUX.conf
-type condor -rdn testulix.phys.ufl.edu/condor-INTEL-LINUX
-machine-type unknown -publish-jobs -condor-os LINUX -condor-arch
INTEL
and the condor job manager config file reads as:
% cat globus-job-manager-condor-INTEL-LINUX.conf
-home "/usr/local/globus/globus-2.0"
-e /usr/local/globus/globus-2.0/libexec
-globus-gatekeeper-host testulix.phys.ufl.edu
-globus-gatekeeper-port 2119
-globus-gatekeeper-subject "/O=Grid/O=Globus/CN=testulix.phys.ufl.edu"
-globus-host-cputype i686
-globus-host-manufacturer pc
-globus-host-osname Linux
-globus-host-osversion 2.2.14-5.0smp
-condor-arch INTEL
-condor-os LINUX
-save-logfile on_errors
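The edits to the job manager configuration file described in step (1) can also be scripted. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and the unedited sample text is a guess at what the end of a freshly installed file looks like, not a verbatim copy.

```python
COMMENT_LINES = {
    "# Edit the following two lines to complete",
    "# the configuration of your condor jobmanager",
}


def patch_condor_conf(text, arch="INTEL", opsys="LINUX"):
    """Drop the two instruction comments, set the Condor architecture and OS,
    and make sure log files are kept when errors occur."""
    wanted = {
        "-condor-arch": arch,
        "-condor-os": opsys,
        "-save-logfile": "on_errors",
    }
    out, seen = [], set()
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in COMMENT_LINES:
            continue
        parts = stripped.split()
        key = parts[0] if parts else ""
        if key in wanted:
            # Rewrite the first occurrence, drop any later duplicates.
            if key not in seen:
                out.append(f"{key} {wanted[key]}")
                seen.add(key)
            continue
        out.append(line)
    # Append any option that the original file did not mention at all.
    for key, value in wanted.items():
        if key not in seen:
            out.append(f"{key} {value}")
    return "\n".join(out) + "\n"


# Assumed tail of an unedited configuration file.
UNEDITED = """\
-globus-host-osname Linux
# Edit the following two lines to complete
# the configuration of your condor jobmanager
-condor-arch
-condor-os
"""

patched = patch_condor_conf(UNEDITED)
print(patched, end="")
```

The patched text still has to be written out under the new file name, globus-job-manager-condor-INTEL-LINUX.conf, as described in step (1).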
Addendum 2:
===========
The default Condor installation is configured so that Condor suspends all
jobs when it detects keyboard activity. To fix this, modify "PART 3" of
the $CONDOR_LOCATION/etc/condor_config file, underneath where it says:
#####################################################################
## This where you choose the configuration that you would like to
## use. It has no defaults so it must be defined. We start this
## file off with the UWCS_* policy.
######################################################################
The modifications are:
Original condor_config file:
START = $(UWCS_START)
SUSPEND = $(UWCS_SUSPEND)
CONTINUE = $(UWCS_CONTINUE)
PREEMPT = $(UWCS_PREEMPT)
KILL = $(UWCS_KILL)
Modified condor_config file:
#START = $(UWCS_START)
START = True
#SUSPEND = $(UWCS_SUSPEND)
#CONTINUE = $(UWCS_CONTINUE)
#PREEMPT = $(UWCS_PREEMPT)
SUSPEND = False
CONTINUE = True
PREEMPT = False
#KILL = $(UWCS_KILL)
KILL = $(ActivityTimer) > $(MaxVacateTime)
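For sites with many machines, the same change can be applied with a short script. This is a sketch only, with hypothetical function names: it comments out each original policy line and writes the replacement directly beneath it, so the line order differs slightly from the hand-edited listing above while the active settings are the same.

```python
import re

OVERRIDES = {
    "START": "True",
    "SUSPEND": "False",
    "CONTINUE": "True",
    "PREEMPT": "False",
    "KILL": "$(ActivityTimer) > $(MaxVacateTime)",
}

# Matches an active "NAME = value" line; commented lines start with '#'.
ASSIGNMENT = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=")


def apply_overrides(text, overrides=OVERRIDES):
    """Comment out each overridden setting and add the new value below it."""
    out = []
    for line in text.splitlines():
        match = ASSIGNMENT.match(line)
        if match and match.group(1) in overrides:
            name = match.group(1)
            out.append("#" + line)
            out.append(f"{name} = {overrides[name]}")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


def active_settings(text):
    """Return the uncommented NAME = value pairs found in the text."""
    settings = {}
    for line in text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            settings[match.group(1)] = line.split("=", 1)[1].strip()
    return settings


ORIGINAL = """\
START = $(UWCS_START)
SUSPEND = $(UWCS_SUSPEND)
CONTINUE = $(UWCS_CONTINUE)
PREEMPT = $(UWCS_PREEMPT)
KILL = $(UWCS_KILL)
"""

modified = apply_overrides(ORIGINAL)
print(modified, end="")
```

Keeping the original lines as comments makes it easy to restore the UWCS policy later.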
Then, to apply the new Condor configuration, execute
% condor_reconfig node1 node2 node3 ...
where node1, node2, node3 ... are the Condor compute machines (including the Condor master machine).
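If the node list is kept in a script, the command line can be assembled from it. A minimal sketch, with a hypothetical helper name and the placeholder node names from above:

```python
import shlex


def reconfig_command(nodes):
    """Build the condor_reconfig argument list for the given machines."""
    if not nodes:
        raise ValueError("at least one node name is required")
    return ["condor_reconfig", *nodes]


command = reconfig_command(["node1", "node2", "node3"])
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in command))
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where the Condor tools are on the PATH.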