MOP Installation Instructions
Overview
========

MOP is a system for distributing CMS production jobs. Jobs are defined at a MOP master site by the CMS production scripts. The mop_submitter then distributes those jobs to remote sites through Condor-G and Globus. When the jobs are finished, the output is collected by GDMP.

Installing the MOP master
=========================

The MOP master runs only at the master site (Fermilab). It does not need to be installed anywhere else. It is, however, available by anonymous CVS checkout for those interested. To check out the MOP master:

% setenv CVSROOT :pserver:anonymous@cdcvs.fnal.gov:/cvs/cd_read_only
% cvs login
(Logging in to anonymous@cdcvs.fnal.gov)
CVS password: anoncvs
% cvs -Q checkout mop_master
% cd mop_master
% ./install
Checking out impala
Checking out mop_submitter
%

To use the mop_master, source mop_master/setup.(c)sh, then look at the demo script "doit" in the mop_master/run directory.

To reiterate: mop_master is not needed at remote sites.

Preparing remote sites for MOP jobs
===================================

Remote site overview:

In general, the remote site will need Globus installed on one or more machines, a Globus job manager for the local batch system and an installation of GDMP. The following is a detailed description of what is needed and why. Skip to the section "Remote site summary" to avoid the details.

Specifically, in order to run jobs at a remote site, MOP needs to: (1) copy input files to the remote site, (2) run a local copy of the CMS software on the local batch system, (3) put the output where the local GDMP server can publish it, and (4) tell the local GDMP server to publish the output.

All of these requirements could be met with a single machine. It may be convenient, however, to spread the tasks across several machines.

(1) In order to copy the input files to the remote site, MOP needs access to a Globus job manager that will run jobs on a machine that has gsincftp. This job manager will be referred to as the stage-in job manager. At both Wisconsin and Fermilab we are using the default fork job manager on the batch submission machine for this task. The downloaded files must be accessible to the nodes in the batch system for step (2).

(2) Running a local copy of the CMS software requires getting the DAR distribution of the CMS software from Fermilab and installing it. MOP then requires access to a Globus job manager for the batch system. This job manager will be referred to as the run job manager. At Fermilab we are currently using the Globus PBS job manager on a small test system. At Wisconsin we are using the Globus Condor job manager. Globus job managers are available for virtually all popular batch systems. Installing a new job manager is not difficult, but it is unfortunately not well documented. I (Jim Amundson) can provide assistance if it will help.

(3) Production jobs need to put their output where GDMP can access them. The current implementation of GDMP requires that all flat files be put in one directory. Therefore, there needs to be a directory on a shared file system accessible both to the batch nodes and to the GDMP server. The remote nodes should be able to execute commands in the GDMP directory, but they do not need to have Globus access nor do they need to run the GDMP server.

(4) The GDMP server needs to be notified to publish the files it knows about. The publish command must be executed on the server node itself. MOP therefore needs access to a Globus job manager that will run jobs on the GDMP server node. This job manager will be referred to as the publish job manager. At Fermilab we are currently using the Globus default fork job manager.

Accounts:

Initial MOP tests have run partially as cmsprod and partially as amundson. New grid-mapfile entries should be for the cmsprod account at Fermilab. The contact string is: "/O=Grid/O=Globus/OU=fnal.gov/CN=CMS Production"

Remote site summary:

Remote sites need to install GDMP. If you do not have GDMP installed and/or do not have GDMP installation instructions, please contact Shahzad Muzaffar (muzaffar@fnal.gov). Remote sites also need to install Globus on one or more machines. Finally, the following information needs to be conveyed:

--------------------------------------------------------------------------
(1-1) stage-in job manager
(1-2) GLOBUS_LOCATION value
(1-3) shared directory for mop files if not home directory

(2-1) run job manager
(2-2) location of CMS DAR installation. NB: only the path needs to be provided. The DAR file(s) themselves can be installed through MOP.

(3-1) GDMP install directory
(3-2) GDMP flat file directory
(3-3) GDMP Objectivity file directory

(4-1) publish job manager (on the GDMP server node)
--------------------------------------------------------------------------

Example remote site values:

The first "remote" site is at Fermilab. Here are the Fermilab values:

(1-1) droidf.fnal.gov:/jobmanager
(1-2) /opt/globus/globus20
(1-3) use default

(2-1) droidf.fnal.gov:/jobmanager (We are using the fork job manager for now. You probably do *not* want to do this.)
(2-2) /data/tarballs

(3-1) /data/gdmp/current
(3-2) /data/GDMP_DATA/FlatFiles
(3-3) /data/GDMP_DATA/Objectivity

(4-1) use default

The second remote site example is Wisconsin.

(1-1) beak.cs.wisc.edu:/jobmanager
(1-2) (?)
(1-3) /shared/scratch/amundson

(2-1) beak.cs.wisc.edu:/jobmanager-condor-INTEL-LINUX
(2-2) /shared/scratch/amundson

etc.
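
As a rough illustration, the checklist above can be kept as a small record so that it is easy to see what a site still has to supply. This is only a sketch: the class name, the field names and the choice of which items may be left at their default are assumptions made for the example and are not part of MOP. The values are the ones listed above; Wisconsin items not listed in this document are simply left unset.

```python
from dataclasses import dataclass, fields
from typing import List, Optional

# Items that may be left at their default, judging from the examples above
# (an assumption of this sketch, not a MOP rule).
OPTIONAL_ITEMS = {"shared_dir", "publish_job_manager"}


@dataclass
class RemoteSiteInfo:
    """Hypothetical record of the information a remote site conveys."""
    stage_in_job_manager: Optional[str] = None  # (1-1)
    globus_location: Optional[str] = None       # (1-2)
    shared_dir: Optional[str] = None            # (1-3), None means "use default"
    run_job_manager: Optional[str] = None       # (2-1)
    dar_root: Optional[str] = None              # (2-2)
    gdmp_install_dir: Optional[str] = None      # (3-1)
    gdmp_flatfile_dir: Optional[str] = None     # (3-2)
    gdmp_objy_dir: Optional[str] = None         # (3-3)
    publish_job_manager: Optional[str] = None   # (4-1), None means "use default"


def missing_items(site: RemoteSiteInfo) -> List[str]:
    """Return the names of required items that have not been supplied."""
    return [
        f.name
        for f in fields(site)
        if f.name not in OPTIONAL_ITEMS and getattr(site, f.name) is None
    ]


fermilab = RemoteSiteInfo(
    stage_in_job_manager="droidf.fnal.gov:/jobmanager",
    globus_location="/opt/globus/globus20",
    run_job_manager="droidf.fnal.gov:/jobmanager",
    dar_root="/data/tarballs",
    gdmp_install_dir="/data/gdmp/current",
    gdmp_flatfile_dir="/data/GDMP_DATA/FlatFiles",
    gdmp_objy_dir="/data/GDMP_DATA/Objectivity",
)

# Wisconsin, with only the values listed in this document filled in.
wisconsin = RemoteSiteInfo(
    stage_in_job_manager="beak.cs.wisc.edu:/jobmanager",
    shared_dir="/shared/scratch/amundson",
    run_job_manager="beak.cs.wisc.edu:/jobmanager-condor-INTEL-LINUX",
    dar_root="/shared/scratch/amundson",
)

print("Fermilab still needs:", missing_items(fermilab))
print("Wisconsin still needs:", missing_items(wisconsin))
```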

The site parameters are stored in the mop_submitter/site-info directory. Job manager and scratch directory info is in the *.site.* files. The .vars files hold the following information:

MOP_REMOTE_GLOBUS_LOCATION (1-2)
MOP_REMOTE_GDMP_OBJYFILE_DIR (3-3)
MOP_REMOTE_DAR_ROOT (2-2)
MOP_REMOTE_CMS_DB (derived from 2-2)
MOP_REMOTE_GDMP_DIR (3-1)
MOP_REMOTE_GDMP_FLATFILE_DIR (3-2)
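
To make the correspondence between the checklist numbers and the variable names concrete, here is a minimal sketch that turns a set of site values into variable assignments. The NAME=value syntax, the function name and the dictionary layout are assumptions made for illustration; check an existing .vars file in mop_submitter/site-info for the real format. MOP_REMOTE_CMS_DB is left out because how it is derived from (2-2) is not described here. The values are the Fermilab ones listed above.

```python
from typing import Dict

# Checklist item -> variable name, as listed above.
VARS_BY_ITEM = {
    "1-2": "MOP_REMOTE_GLOBUS_LOCATION",
    "2-2": "MOP_REMOTE_DAR_ROOT",
    "3-1": "MOP_REMOTE_GDMP_DIR",
    "3-2": "MOP_REMOTE_GDMP_FLATFILE_DIR",
    "3-3": "MOP_REMOTE_GDMP_OBJYFILE_DIR",
}


def render_vars(values: Dict[str, str]) -> str:
    """Render NAME=value lines (assumed syntax) for the items supplied."""
    lines = []
    for item, name in VARS_BY_ITEM.items():
        if item in values:
            lines.append(f"{name}={values[item]}")
    return "\n".join(lines) + "\n"


fermilab_values = {
    "1-2": "/opt/globus/globus20",
    "2-2": "/data/tarballs",
    "3-1": "/data/gdmp/current",
    "3-2": "/data/GDMP_DATA/FlatFiles",
    "3-3": "/data/GDMP_DATA/Objectivity",
}

print(render_vars(fermilab_values), end="")
```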

Addendum 1:
===========

(1) In the $GLOBUS_LOCATION/etc/globus-job-manager-condor.conf file, edit the two lines at the bottom according to the comment above them, adding INTEL and LINUX as arguments, like so:

-condor-arch INTEL
-condor-os LINUX

and remove the two comment lines (appearing just before):

# Edit the following two lines to complete
# the configuration of your condor jobmanager

Then add another line to retain debugging info on errors:

-save-logfile on_errors

Finally, rename the file to globus-job-manager-condor-INTEL-LINUX.conf

(2) Let's use testulix.phys.ufl.edu as an example. In $GLOBUS_LOCATION/etc/jobmanager-condor, add "-condor-os LINUX -condor-arch INTEL" to the end of the argument list, and change the existing -conf and -rdn arguments to refer to globus-job-manager-condor-INTEL-LINUX.conf and testulix.phys.ufl.edu/jobmanager-condor-INTEL-LINUX instead of the old names.

Finally, rename the file to jobmanager-condor-INTEL-LINUX.

We thereafter refer to that job manager as
testulix.phys.ufl.edu/jobmanager-condor-INTEL-LINUX instead of just
testulix.phys.ufl.edu/jobmanager-condor (when using globus-job-run, Condor-G, etc).

At Florida, the condor job manager is then:

% cat jobmanager-condor-INTEL-LINUX
stderr_log,local_cred - /usr/local/globus/globus-2.0/libexec/globus-job-manager globus-job-manager -conf /usr/local/globus/globus-2.0/etc/globus-job-manager-condor-INTEL-LINUX.conf -type condor -rdn testulix.phys.ufl.edu/condor-INTEL-LINUX -machine-type unknown -publish-jobs -condor-os LINUX -condor-arch INTEL

and the condor job manager config file reads as:

% cat globus-job-manager-condor-INTEL-LINUX.conf
-home "/usr/local/globus/globus-2.0"
-e /usr/local/globus/globus-2.0/libexec
-globus-gatekeeper-host testulix.phys.ufl.edu
-globus-gatekeeper-port 2119
-globus-gatekeeper-subject "/O=Grid/O=Globus/CN=testulix.phys.ufl.edu"
-globus-host-cputype i686
-globus-host-manufacturer pc
-globus-host-osname Linux
-globus-host-osversion 2.2.14-5.0smp
-condor-arch INTEL
-condor-os LINUX
-save-logfile on_errors
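
For sites that prefer to script the edit in step (1), the change can be sketched as a plain text transformation. This is an illustration only: the function names are made up, and the sketch assumes that the stock file ends with lines beginning with -condor-arch and -condor-os, preceded by the two comment lines quoted above. It works on the file contents as a string and leaves reading, writing and renaming the file to you.

```python
# The two instructional comment lines that step (1) says to remove.
COMMENT_LINES = (
    "# Edit the following two lines to complete",
    "# the configuration of your condor jobmanager",
)


def configure_condor_conf(text: str, arch: str = "INTEL", os_name: str = "LINUX") -> str:
    """Return the edited contents of globus-job-manager-condor.conf."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in COMMENT_LINES:
            continue  # drop the instructional comments
        if stripped.startswith("-condor-arch"):
            line = f"-condor-arch {arch}"
        elif stripped.startswith("-condor-os"):
            line = f"-condor-os {os_name}"
        out.append(line)
    # Keep debugging information when a job fails.
    if not any(l.strip().startswith("-save-logfile") for l in out):
        out.append("-save-logfile on_errors")
    return "\n".join(out) + "\n"


def renamed_conf(arch: str = "INTEL", os_name: str = "LINUX") -> str:
    """File name the edited configuration should be saved under."""
    return f"globus-job-manager-condor-{arch}-{os_name}.conf"


# Assumed shape of the tail of the stock file, for demonstration.
sample = (
    '-home "/usr/local/globus/globus-2.0"\n'
    "# Edit the following two lines to complete\n"
    "# the configuration of your condor jobmanager\n"
    "-condor-arch\n"
    "-condor-os\n"
)

print(configure_condor_conf(sample), end="")
print("save as:", renamed_conf())
```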

Addendum 2:
===========

The default Condor installation is configured so that Condor suspends all jobs when it detects keyboard activity. To change this, modify "PART 3" of the $CONDOR_LOCATION/etc/condor_config file, underneath where it says:

#####################################################################
## This where you choose the configuration that you would like to
## use. It has no defaults so it must be defined. We start this
## file off with the UWCS_* policy.
######################################################################

The modifications are:

Original condor_config file:
START = $(UWCS_START)
SUSPEND = $(UWCS_SUSPEND)
CONTINUE = $(UWCS_CONTINUE)
PREEMPT = $(UWCS_PREEMPT)
KILL = $(UWCS_KILL)

Modified condor_config file:
#START = $(UWCS_START)
START = True
#SUSPEND = $(UWCS_SUSPEND)
#CONTINUE = $(UWCS_CONTINUE)
#PREEMPT = $(UWCS_PREEMPT)
SUSPEND = False
CONTINUE = True
PREEMPT = False
#KILL = $(UWCS_KILL)
KILL = $(ActivityTimer) > $(MaxVacateTime)

Then, to make the new Condor configuration take effect, execute

% condor_reconfig node1 node2 node3 ...

where node1, node2, node3 ... are the different Condor compute machines (including the Condor Master machine).
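
If the same change has to be made on several machines, the edit to PART 3 can be sketched as a small text transformation. This is only an illustration under stated assumptions: the function name is made up, and the sketch assumes the policy lines appear exactly in the NAME = $(UWCS_NAME) form shown above. It comments out each original line and writes the replacement directly below it, which is equivalent to the modified file shown above apart from line order. condor_reconfig still has to be run afterwards.

```python
import re

# Replacement policy values, taken from the modified file shown above.
NEW_POLICY = {
    "START": "True",
    "SUSPEND": "False",
    "CONTINUE": "True",
    "PREEMPT": "False",
    "KILL": "$(ActivityTimer) > $(MaxVacateTime)",
}

# Matches lines such as "START = $(UWCS_START)".
UWCS_LINE = re.compile(r"^\s*(\w+)\s*=\s*\$\(UWCS_\1\)\s*$")


def apply_policy(text: str) -> str:
    """Comment out the UWCS_* policy lines and add the replacements."""
    out = []
    for line in text.splitlines():
        match = UWCS_LINE.match(line)
        if match and match.group(1) in NEW_POLICY:
            name = match.group(1)
            out.append("#" + line)                      # keep the original as a comment
            out.append(f"{name} = {NEW_POLICY[name]}")  # new setting
        else:
            out.append(line)
    return "\n".join(out) + "\n"


original = (
    "START = $(UWCS_START)\n"
    "SUSPEND = $(UWCS_SUSPEND)\n"
    "CONTINUE = $(UWCS_CONTINUE)\n"
    "PREEMPT = $(UWCS_PREEMPT)\n"
    "KILL = $(UWCS_KILL)\n"
)

print(apply_policy(original), end="")
```

Running the function a second time on its own output changes nothing, so it is safe to apply to a file that has already been edited.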