Implement FlexLm license-aware scheduling with PBS Pro

Altair PBS Pro is a powerful workload manager, essential for the efficient management of a cluster and other computing resources: it allows users of the same group to share resources efficiently by allocating, dispatching and reserving the correct set of hardware for each job. At each scheduling cycle, the resource manager allocates the amount of CPU cores, RAM, scratch space or even GPUs requested by the job, marking them busy for other users. More complex scheduling policies such as fair-share or backfilling can also be implemented. When it comes to CAE software, however, licenses and tokens can become as much of a bottleneck as hardware resources, especially for the many companies that have more engineers than software tokens (except software resellers, maybe!).

Using a centralized computing infrastructure can make software license sharing even harder: imagine that between the time you finish your model and send it to the remote cluster, your favorite colleague decides to check out the required license, causing your precious job to crash instantly for lack of available licenses. Certain programs, like those from ANSYS or Altair, implement a switch to wait for an available license, which prevents the job from crashing in this case. But the job is now running in the eyes of the workload manager, so the hardware resources are reserved and no one else can use them, while the job itself is doing nothing except waiting for a license… Another case of non-optimal resource usage.

Software licensing competition can become difficult to manage when workstations and a cluster in the same company share one license pool. Fortunately, besides common hardware resources, PBS can manage custom resources. A custom resource can be as simple as a Boolean set to true on some machines and false on the others, or a number defining the maximum amount of a consumable resource available across all machines or on a single machine.

From this perspective, the PBS Pro Administration Guide details a step-by-step procedure to implement consumable licenses (tokens) as site-wide consumable resources taking their value from an external script. In our case, the external script, written in Python, will query a FlexLm license server with lmutil for the number of available licenses for a feature. To be valid for PBS, the external script must print a single number representing the amount of free resources.
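The core of such a script is parsing the "Users of FEATURE: (Total of N licenses issued; Total of M licenses in use)" summary lines printed by lmutil lmstat. Here is a minimal sketch of that idea; the argument handling and function names are illustrative and the published script on our GitHub may differ:

```python
import re
import subprocess
import sys

# Matches the lmstat summary line, e.g.:
# "Users of mech_solver:  (Total of 10 licenses issued;  Total of 2 licenses in use)"
USERS_RE = re.compile(
    r"Users of (?P<feature>\S+):\s+\(Total of (?P<issued>\d+) licenses? issued;"
    r"\s+Total of (?P<inuse>\d+) licenses? in use\)"
)

def free_tokens(lmstat_output, feature):
    """Return the number of free tokens for one feature (issued - in use),
    or 0 if the feature is not found in the lmstat output."""
    for match in USERS_RE.finditer(lmstat_output):
        if match.group("feature") == feature:
            return int(match.group("issued")) - int(match.group("inuse"))
    return 0

if __name__ == "__main__" and "-f" in sys.argv and "-c" in sys.argv:
    # Illustrative CLI: lmparser.py -f <feature> -c <port>@<host>
    feature = sys.argv[sys.argv.index("-f") + 1]
    server = sys.argv[sys.argv.index("-c") + 1]
    out = subprocess.run(["lmutil", "lmstat", "-a", "-c", server],
                         capture_output=True, text=True).stdout
    # PBS expects a single number on stdout, nothing else.
    print(free_tokens(out, feature))
```

If the feature is missing or the server is unreachable, printing 0 is the safe fallback: PBS will simply consider no tokens available.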

Implementing licenses as custom resources allows each user to request license features in addition to the commonly used hardware resources. A job will then stay in queue as long as the requested features are not available, preventing it from crashing. However, this implementation has some drawbacks:

  • There is no way to set the maximum amount of tokens for a feature, so your job will stay in queue forever, without any error message, if you request more features than your site provides,
  • The call to the FlexLm server is read-only and punctual: PBS queries the FlexLm server only at each scheduling cycle (every 10 minutes by default, unless an event such as a new job submission triggers one). If licenses become available but get instantly checked out by someone else between scheduling cycles, PBS will not be aware of it and your job will stay in queue. The same goes if the tokens are available during a scheduling cycle but get checked out while your job is being dispatched: there is no way to reserve licenses with lmutil, that would require another mechanism.

Those drawbacks aside, implementing license-aware scheduling will mostly prove a major improvement in job efficiency and reduce the number of crashed jobs. Besides, it is still possible to submit a job without specifying license requirements: adding a feature request only prevents the job from starting without that feature.


You can find a simple Python script to spawn and parse the output of lmutil lmstat for a specific feature on our GitHub. The script takes the license server and the feature name as arguments.

https://github.com/quantumhpc/pbs_flexlm_parser


Step-by-step to implement license features as custom resources

These steps are mostly inspired by the PBS Admin Guide, with some shortcuts to ease the process and some examples. In our configuration, we will take the example of 4 features on 2 different license servers:

  • mech_solver, mech_hpc on 2325@fea_licserver
  • cfd_solver, cfd_hpc on 1999@cfd_licserver
  1. First, source your PBS configuration file to get some variables in your environment

. /etc/pbs.conf

  2. Define the new resources of type long in the server's resourcedef file located in $PBS_HOME/server_priv/resourcedef. This should look like this:
  • mech_solver type=long
  • mech_hpc type=long
  • cfd_solver type=long
  • cfd_hpc type=long

! If you have a long list of features to implement, here is a command that automatically retrieves all the available features and injects them into resourcedef by calling lmutil:

lmutil lmstat -a -c LIC_SERVER | grep Users | awk -F'[ :]' '{print $3}' | xargs -I {} echo {} "type=long" >> $PBS_HOME/server_priv/resourcedef

  3. Append the new resource names to the resources line in $PBS_HOME/sched_priv/sched_config

resources: "ncpus, mem, arch, host, [...], mech_solver, mech_hpc, cfd_solver, cfd_hpc"

! Again, if you have a long list of features, here is a command to get them all on one line:

lmutil lmstat -a -c LIC_SERVER | grep Users | awk -F'[ :]' '{print ","$3}' | tr '\n' ' '
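The same extraction can be done in Python, which is handy if you want to build the resources line from inside a script. This is a sketch assuming the usual "Users of FEATURE:" lmstat summary lines; the function name is illustrative:

```python
import re

def feature_list(lmstat_output):
    """Collect every feature name from lmstat output and return them as
    ", name1, name2, ..." ready to append to the sched_config resources line."""
    names = re.findall(r"Users of (\S+):", lmstat_output)
    return "".join(", " + name for name in names)
```

Feed it the stdout of lmutil lmstat -a -c LIC_SERVER and append the result to the existing resources line.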

  4. Still in the sched_config file, add a server_dyn_res line calling the script for each feature

server_dyn_res: "mech_solver !/path/to/lmparser.py -f mech_solver -c 2325@fea_licserver"
server_dyn_res: "mech_hpc !/path/to/lmparser.py -f mech_hpc -c 2325@fea_licserver"
server_dyn_res: "cfd_solver !/path/to/lmparser.py -f cfd_solver -c 1999@cfd_licserver"
server_dyn_res: "cfd_hpc !/path/to/lmparser.py -f cfd_hpc -c 1999@cfd_licserver"

  5. Restart the server and the scheduler

qterm -t quick
$PBS_EXEC/sbin/pbs_server

and

ps -ef | grep pbs_sched
kill -HUP <Scheduler PID>

or simply if no one is using the server:

service pbs restart

License features are now implemented in the scheduler, and each job requesting features will now be checked against the response of the FlexLm server. Remember that there is no way to see from PBS the total amount of features available, because this information is taken from an external source.


Example:

You can request features in your qsub statement like this:

qsub -l select=1:ncpus=48:mpiprocs=48 -l mech_solver=1 -l mech_hpc=24 script

to request 48 cores with 1 mech_solver token and 24 mech_hpc tokens.

Result:

  • If the resources and licenses are available, your job will start and you will see in the Resource_List section:

    Resource_List.mech_hpc = 24

  • If, for example, licenses are not available, you will see in the comment section:

    comment = Can Never Run: Insufficient amount of server resource: mech_hpc (R: 24 A: 12 T: 12)

meaning that only 12 tokens are available out of the 24 requested. Note that the total number (T) equals the available number (A) because, as mentioned earlier, PBS has no knowledge of the total amount of licenses normally available.

Do not hesitate to comment or fork the Python script to improve it.
