US ATLAS / HTCondor meeting

US/Eastern

USATLAS HTCondor meeting - minutes

(see google doc: https://docs.google.com/document/d/1zTl-HIB07SEWgwB8hLH5O9deX4O-z_ezq1YQI5__VKI/edit?tab=t.0 )

Useful Links:

https://htcondor.org/htcondor/release-plan/



11 DEC 2024:

This is what we are doing at MWT2:

# SIGTERM to kill jobs, SIGKILL after 5 minutes

GRACEFULLY_REMOVE_JOBS = true

MachineMaxVacateTime = 5 * 60

 

# cgroup additions for limiting memory to 1.1x the job request for ATLAS and 3x for non-ATLAS

CGROUP_MEMORY_LIMIT_POLICY = custom

CGROUP_HARD_MEMORY_LIMIT_EXPR = ifThenElse(regexp("usatlas[1-4]", Owner), 1.1 * RequestMemory, 3 * RequestMemory)

CGROUP_SOFT_MEMORY_LIMIT_EXPR = ifThenElse(regexp("usatlas[1-4]", Owner), 1 * RequestMemory, 1.1 * RequestMemory)
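A quick sanity check of the two limit expressions above, sketched in Python rather than ClassAds (owner names other than usatlas1 and the exact request value are just illustrative; 1024 MB matches the test job below):

import re

def cgroup_limits_mb(owner, request_memory_mb):
    # Mimics the ifThenElse(regexp(...)) logic from the config above.
    is_atlas = re.search(r"usatlas[1-4]", owner) is not None
    hard = 1.1 * request_memory_mb if is_atlas else 3 * request_memory_mb
    soft = 1.0 * request_memory_mb if is_atlas else 1.1 * request_memory_mb
    return hard, soft

print(cgroup_limits_mb("usatlas1", 1024))  # hard ~ 1126.4 MB, soft = 1024.0 MB (ATLAS)
print(cgroup_limits_mb("osguser", 1024))   # hard = 3072 MB, soft ~ 1126.4 MB (non-ATLAS)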

 

MWT2 testing of cgroups:

 

  1. sudo su usatlas1

  2. condor_submit memory_allocator.submit


memory_allocator.submit contains:

universe = vanilla

executable = memory_allocator

arguments = 1500 30

request_memory = 1024M

log = memory_job.$(Cluster).$(Process).log

output = memory_job.$(Cluster).$(Process).out

error = memory_job.$(Cluster).$(Process).err

should_transfer_files = yes

when_to_transfer_output = ON_EXIT_OR_EVICT

transfer_executable = True

JobPrio = 100000

 

requirements = regexp("cit2", Machine)

 

queue

The executable arguments are:

  1. 1500 is the number of MiB to allocate.

  2. 30 is the number of seconds to wait after the memory is allocated before exiting.

    1. There is a signal handler that will log a message on SIGINT or SIGTERM and wait the same number of seconds before exiting.

 

The memory request is 1024 MiB.

The requirement to run on a machine with a name containing "cit2" ensures that the job runs on a server that has been updated to HTCondor 24.0.2.
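The actual memory_allocator binary is MWT2's own; as a rough idea of what such a test program can look like, here is a hedged Python sketch with the same command-line interface (MiB to allocate, seconds to wait) and the SIGINT/SIGTERM handler described above:

#!/usr/bin/env python3
# Hypothetical sketch of a memory_allocator-style test job; the real MWT2 executable may differ.
# Usage: memory_allocator <MiB to allocate> <seconds to wait>
import signal
import sys
import time

alloc_mib = int(sys.argv[1])
wait_s = int(sys.argv[2])

def on_signal(signum, frame):
    # Log the signal, wait the same number of seconds, then exit, so the graceful-removal
    # window (SIGTERM, then SIGKILL after MachineMaxVacateTime) can be observed in the job log.
    print(f"received signal {signum}, waiting {wait_s}s before exiting", flush=True)
    time.sleep(wait_s)
    sys.exit(0)

signal.signal(signal.SIGTERM, on_signal)
signal.signal(signal.SIGINT, on_signal)

block = bytearray(alloc_mib * 1024 * 1024)
block[::4096] = b"\x01" * len(block[::4096])  # touch one byte per page so the memory is resident
print(f"allocated {alloc_mib} MiB, waiting {wait_s}s", flush=True)
time.sleep(wait_s)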

 

Judith put the cgroups config related to memory above. The entire contents of /etc/condor/config.d/02-cnode.conf are:

 

use role:execute

use feature:partitionableslot

 

MWT2_CpuUsed  = int((CondorLoadAvg / TotalLoadAvg) * (ifthenelse((TotalLoadAvg < TotalCpus), TotalLoadAvg, TotalCpus)) * 100) / 100.0

MWT2_CpuUsage = ifthenelse(((TotalLoadAvg > 0.0) && (Activity != "Idle")), MWT2_CpuUsed, 0)

MWT2_CpuExceeded  = (MWT2_CpuUsage > (Cpus + 0.8))

MWT2_CpuMemory = int(TotalMemory / TotalCpus)

 

START = TRUE

HAS_CVMFS = TRUE

TRUST_UID_DOMAIN = TRUE

 

STARTD_ATTRS = $(STARTD_ATTRS) HAS_CVMFS MWT2_CpuUsed MWT2_CpuUsage MWT2_CpuExceeded MWT2_CpuMemory

 

# SIGTERM to kill jobs, SIGKILL after 5 minutes

GRACEFULLY_REMOVE_JOBS = true

MachineMaxVacateTime = 5 * 60

 

# cgroup additions for limiting memory to 1.1x the job request for ATLAS and 3x for non-ATLAS

CGROUP_MEMORY_LIMIT_POLICY = custom

CGROUP_HARD_MEMORY_LIMIT_EXPR = ifThenElse(regexp("usatlas[1-4]", Owner), 1.1 * RequestMemory, 3 * RequestMemory)

CGROUP_SOFT_MEMORY_LIMIT_EXPR = ifThenElse(regexp("usatlas[1-4]", Owner), 1 * RequestMemory, 1.1 * RequestMemory)

 

DISABLE_SWAP_FOR_JOB = true

 

IGNORE_LEAF_OOM = false
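For reference, the MWT2_Cpu* startd attributes near the top of the file are plain ClassAd arithmetic. A small Python re-implementation with invented sample numbers (the 64-core / 256,000 MB node and the 8-core slot are assumptions, not MWT2 values) shows what gets published:

def mwt2_cpu_attrs(condor_load, total_load, total_cpus, total_memory_mb, slot_cpus, activity):
    # Same arithmetic as the MWT2_Cpu* expressions in 02-cnode.conf above.
    cpu_used = int((condor_load / total_load) * min(total_load, total_cpus) * 100) / 100.0
    cpu_usage = cpu_used if (total_load > 0.0 and activity != "Idle") else 0
    cpu_exceeded = cpu_usage > (slot_cpus + 0.8)
    cpu_memory = int(total_memory_mb / total_cpus)
    return cpu_used, cpu_usage, cpu_exceeded, cpu_memory

# Invented example: 64-core node with 256,000 MB of memory, 8-core slot, Busy
print(mwt2_cpu_attrs(10.5, 12.0, 64, 256000, 8, "Busy"))  # (10.5, 10.5, True, 4000)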



Opensearch / Adstash

HTCondor 24.0.3 / 23.0.19 include some improvements to the OpenSearch 2.0 support.

  • It is possible to run an HTCondor 24.0.3 VM with adstash to talk to a 23.x cluster.

New for late 23.x and 24+: in addition to the per-slot machine ads, each EP now publishes a single startd daemon ad with an aggregate view of the whole machine.
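As a reminder of what adstash ultimately ships to OpenSearch, here is a short sketch using the HTCondor Python bindings to pull the per-slot machine ads from the collector (the collector hostname is a placeholder; the new per-EP startd daemon ad mentioned above is published in addition to these slot ads):

import htcondor

# Placeholder collector / central manager hostname.
coll = htcondor.Collector("cm.example.org")

# One ad per partitionable or dynamic slot, as before.
slot_ads = coll.query(htcondor.AdTypes.Startd,
                      projection=["Name", "Machine", "Cpus", "Memory", "State", "Activity"])

for ad in slot_ads:
    print(ad.get("Name"), ad.get("State"), ad.get("Activity"))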

Other/ Misc.

Backfill slots can be used for lower-priority jobs, which are evicted when higher-priority jobs come through. Progress on evicted jobs is lost.

 

HTCONDOR release schedule

HTCondor Release Plans


Contact us page for HTCondor - Link to Page

    • 15:00 - 15:10
      Intro 10m

      Introductions / brief status of each site's condor deployment

    • 15:10 - 16:00
      Discussion 50m

      Some topics of discussion:

      - Condor 24 readiness / addressing cgroups issues
      - Mechanism behind job eviction / how it works
      - Condor monitoring / adstash

      Anything else of interest / open discussion