Warning

By default, if you don’t request a specific type of node, your job may be submitted to a GPU node. Be careful, as this can incur unwanted costs.


On Computerome 2.0, there are three types of machines:

  • Fat nodes with 40 CPU cores and 1.5 TB of memory
  • Thin nodes with 40 CPU cores and 192 GB of memory
  • GPU nodes with 40 CPU cores, 192 GB of memory and one NVIDIA Tesla V100 GPU card

You can submit jobs via the commands qsub and msub. We strongly encourage you to take advantage of modules in your pipelines, as this gives you better control of your environment. To submit jobs that will run on a single node, you only have to specify the following resources:

  1. How long you expect the job to run ⇒ '-l walltime=<time>'
  2. How much memory your job requires ⇒ '-l mem=xxxgb'
  3. How many CPUs and GPUs ⇒ '-l nodes=1:ppn=<number of CPUs>:gpus=<number of GPUs>'; CPUs can range from 1 to 40 and GPUs from 0 to 1 (':gpus=...' can be left out if not used).
  4. The <group_NAME> for your current project ⇒ '-W group_list=<group_NAME> -A <group_NAME>'.

To run a job with 23 CPUs and 100 GB of memory, lasting one hour, you can use the command:

Code Block
languagebash
themeMidnight
$ qsub -W group_list=<group_NAME> -A <group_NAME> -l nodes=1:ppn=23,mem=100gb,walltime=3600 <your script>

Same job as above, also using GPU:

Code Block
languagebash
themeMidnight
$ qsub -W group_list=<group_NAME> -A <group_NAME> -l nodes=1:ppn=23:gpus=1,mem=100gb,walltime=3600 <your script>

Example using msub:

Code Block
languagebash
themeMidnight
$ msub -W group_list=<group_NAME> -A <group_NAME> -l nodes=1:ppn=23,mem=100gb,walltime=3600 <your script>
Note

The values for nodes, ppn, and mem are just examples and should be changed to suit your specific job.

Interactive jobs
Anchor
InteractiveJobs
InteractiveJobs

When you want to test something in the batch system, it is strongly recommended to do so in an interactive job, using the following:

Code Block
languagebash
themeRDark
$ qsub -W group_list=<group_NAME> -A <group_NAME> -X -I

This will give you access to a single compute node, where you can perform your testing without affecting other users.

iqsub

Computerome now offers an even more straightforward way to work interactively, the way you do on your own computer or a local Linux server, instead of having to submit everything through the queuing system. Just log in and type iqsub; the system will ask you a few simple questions, after which you will be redirected to a full, private node.

Code Block
languagebash
themeRDark
$ iqsub

[ Interactive job ]

  => [ Select group ]

    => [ Select time needed (non extendable) ]

      => [ Enter number of Processors needed (1-40) ]

         => [ Enter number of GPUs needed (0-1) ]

            => [ Enter amount of memory needed ]
Warning
Under no circumstances should you ever run jobs or scripts on the Computerome login node.

Script file example

A script file to be submitted with qsub might begin with lines like:


Code Block
languagebash
#!/bin/sh
### Note: No commands may be executed until after the #PBS lines
### Account information
#PBS -W group_list=pr_12345 -A pr_12345
### Job name (comment out the next line to get the name of the script used as the job name)
#PBS -N test
### Output files (comment out the next 2 lines to get the job name used instead)
#PBS -e test.err
#PBS -o test.log
### Only send mail when job is aborted or terminates abnormally
#PBS -m n
### Number of nodes
#PBS -l nodes=1:ppn=8 
### Memory
#PBS -l mem=120gb
### Requesting time - format is <days>:<hours>:<minutes>:<seconds> (here, 12 hours)
#PBS -l walltime=12:00:00
 
# Go to the directory from where the job was submitted (initial directory is $HOME)
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR

### Here follows the user commands:
# Define the number of processors (one line per allocated core in $PBS_NODEFILE)
NPROCS=`wc -l < $PBS_NODEFILE`
echo This job has allocated $NPROCS cores

# Load all required modules for the job
module load tools
module load perl/5.20.2
module load <other stuff>

# This is where the work is done
# Make sure that this script is not bigger than 64 kB (~150 lines); otherwise, put the work in a separate script and execute it from here
<your script>

The PBS_* environment variables are set for the batch job by Torque.
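As a quick sketch of what Torque exports, the loop below prints a few of the commonly used PBS variables. The variable names are standard Torque names; the fallback text is ours, since outside a batch job they are all unset:

```shell
#!/bin/bash
# Print a few of the environment variables Torque sets for a batch job.
# Outside a batch job they are unset, so a placeholder is shown instead.
for v in PBS_O_WORKDIR PBS_JOBID PBS_NODEFILE PBS_ARRAYID; do
  echo "$v=${!v:-<unset outside a batch job>}"
done
```

Run at the top of a job script, this is a cheap way to record where and under which job ID a run took place.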

Info

If you have already loaded some modules in your login environment, you do not need to specify them in the job script.

However, we recommend that you do it anyway, since it improves the portability of the jobscript and serves as a reminder of the requirements.

We also strongly advise against the use of the "-V" option, as it makes it hard to debug possible errors during runtime. 

The complete list of variables is documented in Exported batch environment variables. Further examples of Torque batch job submission are documented in Job submission.

Specifying a different project account

If you run jobs under different projects, for instance pr_12345 and pr_54321, you must make sure that each project gets accounted for separately in the system's accounting statistics. You specify the relevant project account (for example, pr_54321) for each individual job by using these flags to the qsub command:

Code Block
languagebash
themeRDark
$ qsub -W group_list=pr_54321 -A pr_54321 ... 

or, in the job script file, add a line like this near the top:

Code Block
languagebash
themeRDark
#PBS -W group_list=pr_54321 -A pr_54321

Please use project names only by agreement with your project owner.

Estimating job resource requirements

The first time you run your script, you may not have a clear picture of its resource requirements. To get a rough estimate, you can submit a job to a full node with a large walltime:

Regular compute node (aka. 'thinnode'):

Code Block
languagebash
themeRDark
$ qsub -W group_list=<group_NAME> -A <group_NAME> -l nodes=1:ppn=40:thinnode,walltime=99:00:00,mem=180gb -m n <script>

Fat node:

Code Block
languagebash
themeRDark
$ qsub -W group_list=<group_NAME> -A <group_NAME> -l nodes=1:ppn=40:fatnode,walltime=99:00:00,mem=1200gb -m n <script>



Info

You can add this line to the bottom of your script

checkjob -v $PBS_JOBID

It will generate something like the following:

Code Block
languagebash
themeRDark
Total Requested Tasks: 20
Total Requested Nodes: 1
Req[0]  TaskCount: 20  Partition: torque
Dedicated Resources Per Task: PROCS: 1  MEM: 12G
Utilized Resources Per Task:  PROCS: 0.37  MEM: 12G  SWAP: 2020M
Avg Util Resources Per Task:  PROCS: 0.37
Max Util Resources Per Task:  PROCS: 0.80  MEM: 12G  SWAP: 2020M
Average Utilized Memory: 10761.74 MB
Average Utilized Procs: 8.47

To calculate what you should use for the "-l mem=" parameter, multiply the number of tasks by the "Max Util Resources Per Task" "MEM:" value. Here it would be 20 * 12 GB = 240 GB.
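The multiplication above is easy to script. A minimal sketch, with the task count and per-task memory hard-coded from the checkjob listing shown here:

```shell
#!/bin/bash
# Figures taken from the checkjob listing above
TASKS=20             # Total Requested Tasks
MAX_MEM_PER_TASK=12  # Max Util Resources Per Task: MEM (in GB)
echo "-l mem=$(( TASKS * MAX_MEM_PER_TASK ))gb"
# prints: -l mem=240gb
```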


To see the actual resource usage, see the output of the qstat command.

Info

You can add this line to the bottom of your script

qstat -f -1 $PBS_JOBID

It will generate something like the following:

Code Block
languagebash
themeRDark
Job Id: <jobid>
    Job_Name = <job_NAME>
    Job_Owner = <user>
    resources_used.cput = 323:00:30
    resources_used.energy_used = 0
    resources_used.mem = 1129928kb
    resources_used.vmem = 3082824kb
    resources_used.walltime = 12:00:35
...
    Resource_List.nodes = 1:ppn=28
    Resource_List.mem = 120gb
    Resource_List.walltime = 12:00:00
    Resource_List.nodect = 1
    Resource_List.neednodes = 1:ppn=28
...

Look at resources_used.xyz for hints.
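The kb figures are easier to judge in GiB. A small sketch converting resources_used.mem from the listing above (the value is hard-coded for illustration):

```shell
#!/bin/bash
# resources_used.mem from the qstat listing above, in kb
mem_kb=1129928
awk -v kb="$mem_kb" 'BEGIN { printf "%.2f GiB\n", kb / (1024 * 1024) }'
# prints: 1.08 GiB
```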



Requesting a maximum memory size

A number of node features can be requested, see the Torque Job Submission page. For example, you may require a minimum physical memory size by requesting:

Code Block
languagebash
themeRDark
$ qsub -W group_list=<group_NAME> -A <group_NAME> -l nodes=2:ppn=16,mem=120gb <your script>

i.e. 2 entire nodes, 16 CPU cores on each, and the total memory across all nodes >= 120 GB RAM.

Info
Do not request the maximum physical amount of RAM, since the RAM memory available to users is slightly less than the physical RAM memory.

To see the available RAM memory sizes on the different nodes types see the Hardware page.

Waiting for specific jobs

It is possible to specify that a job should only run after another job has completed successfully; please see the -W flags on the qsub page. To run <your script> after job 12345 has completed successfully:

Code Block
languagebash
themeRDark
$ qsub -W depend=afterok:12345 <your script>

Be sure that the exit status of job 12345 is meaningful: if it exits with status 0, your second job will run. If it exits with any other status, your second job will be cancelled. It is also possible to run a job if another job fails (``afternotok``), or after another job completes regardless of status (``afterany``). Be aware that the keyword ``after`` (as in ``-W depend=after:12345``) means run after job 12345 has *started*.
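In practice you rarely type the job ID by hand: qsub prints the ID of the job it just submitted, so it can be captured and fed to the next submission. A sketch (the job ID, server suffix, and script names are placeholders; qsub itself is not invoked here):

```shell
#!/bin/bash
# qsub prints the new job's full ID on stdout, e.g. "12345.<server>"
FIRST="12345.server"   # stand-in for: FIRST=$(qsub step1.sh)
# Use the numeric part for the dependency in the second submission
echo "qsub -W depend=afterok:${FIRST%%.*} step2.sh"
# prints: qsub -W depend=afterok:12345 step2.sh
```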

Submitting jobs to 40-CPU fat nodes

The high-memory (1536 GB) nodes are defined to have the node property fatnode. You can submit a batch job as in these examples:

2 entire fatnodes, 40 CPUs each, a total of 80 CPU cores:

Code Block
languagebash
themeRDark
$ qsub -W group_list=<group_NAME> -A <group_NAME> -l nodes=2:ppn=40:fatnode,mem=1200gb <your script>

Explicitly the g-11-f0042 node, 40 CPU cores:

Code Block
languagebash
themeRDark
$ qsub -W group_list=<group_NAME> -A <group_NAME> -l nodes=g-11-f0042:ppn=40,mem=120gb <your script>

2 entire fatnodes, 40 CPUs each, total memory across all nodes >= 2000 GB RAM:

Code Block
languagebash
themeRDark
$ qsub -W group_list=<group_NAME> -A <group_NAME> -l nodes=2:ppn=40:fatnode,mem=2000gb <your script>

Submitting jobs to 40-CPU thin nodes

The standard-memory (192 GB) nodes are defined to have the node property thinnode. You can submit a batch job as in these examples:

2 entire thinnodes, 40 CPUs each, a total of 80 CPU cores:

Code Block
languagebash
themeRDark
$ qsub -W group_list=<group_NAME> -A <group_NAME> -l nodes=2:ppn=40:thinnode,mem=10gb <your script>

Explicitly the g-01-c0052 node, 40 CPU cores:

Code Block
languagebash
themeRDark
$ qsub -W group_list=<group_NAME> -A <group_NAME> -l nodes=g-01-c0052:ppn=40,mem=50gb <your script>

Submitting 1-CPU jobs

You could submit a batch job like in this example:

Code Block
languagebash
themeRDark
$ qsub -W group_list=<group_NAME> -A <group_NAME> -l nodes=1:ppn=1 <your script>

Running parallel jobs using MPI

Code Block
languagebash
#!/bin/sh
### Note: No commands may be executed until after the #PBS lines
### Account information
#PBS -W group_list=pr_12345 -A pr_12345
### Job name (comment out the next line to get the name of the script used as the job name)
#PBS -N test
### Output files (comment out the next 2 lines to get the job name used instead)
#PBS -e test.err
#PBS -o test.log
### Only send mail when job is aborted or terminates abnormally
#PBS -m n
### Number of nodes, request 240 cores from 6 nodes
#PBS -l nodes=6:ppn=40
### Requesting time - 720 hours
#PBS -l walltime=720:00:00

### Here follows the user commands:
# Go to the directory from where the job was submitted (initial directory is $HOME)
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
# NPROCS will be set to 240 (one line per allocated core in $PBS_NODEFILE)
NPROCS=`wc -l < $PBS_NODEFILE`
echo This job has allocated $NPROCS cores
 
module load moab torque openmpi/gcc/64/1.10.2 gromacs/5.1.2-plumed

export OMP_NUM_THREADS=1
# Using 236 cores for MPI threads leaving 4 cores for overhead, '--mca btl_tcp_if_include ib0' forces InfiniBand interconnect for improved latency
mpirun -np 236 $mdrun -s gmx5_double.tpr -plumed plumed2_path_re.dat -deffnm md-DTU -dlb yes -cpi md-DTU -append --mca btl_tcp_if_include ib0
Info

In order to optimize performance, the queuing system is configured to place jobs on nodes connected to the same InfiniBand switch (30 nodes per switch) if possible.


To get nodes close to each other, use procs=<number_of_procs> and leave out nodes= and ppn=. To avoid interference with other jobs, procs= should be a multiple of the number of cores per node (i.e. 28 for mpinode).

Job Arrays
Anchor
jobArrays
jobArrays

Submitting multiple identical jobs can be done using job arrays. Job arrays are created with the -t option in the qsub submission script, which allows many copies of the same script to be submitted at once. Additional information about the -t option can be found in the qsub command reference. Moreover, the PBS_ARRAYID environment variable lets you differentiate between the jobs in the array. The amount of resources requested in the qsub submission script is the amount of resources that each individual job will get.
For instance, adding the line:

Code Block
languagebash
themeRDark
  #PBS -t 0-14%5

in the qsub script will cause the job to run 15 times, with no more than 5 active jobs at any given time.

Note
Please, please, please, use the %# option for limiting the number of active jobs.

PBS_ARRAYID values will run from 0 to 14, as shown below:

Code Block
languagebash
themeRDark
 ( perl process.pl dataset${PBS_ARRAYID} )
  
  perl process.pl dataset0
  perl process.pl dataset1
  perl process.pl dataset2
  ….
  perl process.pl dataset14
Info
Jobs in a job array run independently and not in any specific order.
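Inside each array task, the script body typically uses PBS_ARRAYID to select its own input, as in the process.pl example above. A sketch (the fallback assignment exists only so the snippet also runs outside the batch system):

```shell
#!/bin/bash
# PBS_ARRAYID is set by Torque for each task of the array (-t 0-14%5 gives 0..14).
# The fallback below is only so this sketch runs outside a batch job.
PBS_ARRAYID=${PBS_ARRAYID:-0}
echo "perl process.pl dataset${PBS_ARRAYID}"
# prints (for task 0): perl process.pl dataset0
```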