SDCC Liaison Meeting
US/Eastern
3-192 (Bldg 510)
Description
Join via BlueJeans (https://bluejeans.com/819381923). Passcode not required. You can also join via phone (Meeting ID: 819 381 923) by calling one of the numbers below:
+1.408.740.7256 (United States)
+1.888.240.2560 (US Toll Free)
+1.408.317.9253 (Alternate number)
Thursday, August 27, 2020 Liaison Meeting Minutes
Facility News
- SDCC job opening with HPSS (job ID 2303)
- Two remaining upcoming discussion topics: “on-boarding” and collaboration on tools and services
- Password policy changes complying with the BNL CYBER plan will roll out soon
- Long Kerberos passwords (16+ characters) with no expiration
- Implementation of the new policy will be designed to cause minimal disruption to SDCC users
- More information in the coming weeks
Network & Facility Operations
- On-site work has resumed at a steady pace since MINSAFE ended
- OSG storage backend purchase in progress (0.5 PB usable addition to the Lustre storage in row 45S)
- Three new ELK cluster servers have been delivered and are being installed in row 44S
- New 2-frame CSI tape library will be deployed in row 50N (ETA early-to-mid September)
- New HPSS core server replacement purchase is in progress
- New CSI HPSS mover server is racked; its JBOD has just been delivered to the CDCE
- NSLS-II HPC cluster purchase is in progress; fiber provisioning is also in progress
Storage
- ATLAS dCache upgrade completed on Monday/Tuesday with no issues
- Belle-II upgrade to dCache version 6.2 is targeted for next Monday
- Preparations for the new HPSS core server are underway, targeting October to connect the new library
Fabric
- Working through a few issues with liaisons while testing the latest Singularity release
- Retiring some machines soon; planning to boost shared pool resources
General Services
- Look out for announcements on RHEV maintenance and web server reboots
- Asking liaisons for feedback on testing recent NX changes
- Still working out issues with scanning the QR token
- CYBER follow-up on the PHENIX web server incident; moving forward with a 3-phase plan:
- Block external access and restore campus access (read/write)
- Extract some external-facing programs and then restore external access (read-only and no executable scripts)
- Improve/update the full stack of the remaining external-facing programs
Topical Discussion: Compute Resource Allocation Policies and Procedures
HTCondor
- No allocations per se (allocations would sacrifice throughput, which is anathema to HTC)
- Shared pool model: groups buy hardware, and SDCC gives a guaranteed share proportional to the buy-in
- E.g. sPHENIX buys 10,000 cores -> SDCC gives 10,000 cores to “group_sphenix” to be split up by the liaison
- Group quotas enforce the guaranteed usage, but groups can go over their quota if other groups are idle
- Users can request CPUs/memory; limits are enforced via cgroups (see the sketch after this list)
- Defaults of 1 CPU / 1.5 GB RAM, with 50% grace in RAM enforcement
- All monitoring is in Grafana:
- https://monitoring.sdcc.bnl.gov/grafana/d/000000028/condor-overview?orgId=1
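As a minimal sketch of how these requests look from the submitter's side (this is not official SDCC tooling; the executable, file names, and the user name "someuser" are placeholders, and group_sphenix is just the example group above), a short Python script that writes an HTCondor submit description:

    # Sketch only (hypothetical file names and user name): emit an HTCondor
    # submit description that requests resources explicitly instead of relying
    # on the 1 CPU / 1.5 GB defaults, and tags the usage with the group's
    # accounting group so it counts against that group's quota.
    submit_description = "\n".join([
        "executable      = /bin/sleep",           # placeholder payload
        "arguments       = 600",
        "request_cpus    = 4",                    # enforced on the worker via cgroups
        "request_memory  = 6GB",                  # RAM enforcement allows ~50% grace
        "accounting_group      = group_sphenix",  # quota group from the example above
        "accounting_group_user = someuser",       # hypothetical user name
        "output = job.out",
        "error  = job.err",
        "log    = job.log",
        "queue 1",
    ])

    with open("job.sub", "w") as f:               # submit with: condor_submit job.sub
        f.write(submit_description + "\n")

Submitting the resulting file with condor_submit would request 4 CPUs and 6 GB of RAM instead of the defaults, with the usage counted against the group_sphenix quota.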
HPC
- Allocations on the HPC systems are handled through the CSI office, led by Grace Giuffre
- Allocations are calculated in core-hours for each fiscal year (see the sketch after this list)
- Only users with valid allocations are allowed to log in to submit nodes and run jobs
- Jobs run in full-node allocation mode
- Users have access to monitoring pages for Slurm, worker nodes, and allocations:
- https://monitoring.sdcc.bnl.gov/grafana/d/000000018/sdcc-slurm-batch
- https://monitoring.sdcc.bnl.gov/grafana/d/000000029/sdcc-collectd?orgId=1
- https://monitoring.sdcc.bnl.gov/grafana/d/000000021/sdcc-slurm-accounting?orgId=1
- https://monitoring.sdcc.bnl.gov/grafana/d/000000027/sdcc-knl-slurm-batch?orgId=1&refresh=1m
- https://monitoring.sdcc.bnl.gov/grafana/d/000000030/sdcc-knl-collectd?orgId=1&refresh=1m
- https://monitoring.sdcc.bnl.gov/grafana/d/000000032/sdcc-knl-slurm-accounting?orgId=1
- Depending on which resources an allocation has, jobs are limited to cluster-specific hardware (KNL, V100, K80, P100, Skylake)
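As a rough illustration of core-hour accounting under full-node scheduling (a sketch only; the per-node core counts are assumptions for illustration, not the actual SDCC cluster specifications):

    # Sketch only: estimate the core-hours a full-node job charges against a
    # fiscal-year allocation. Core counts per node are illustrative assumptions.
    CORES_PER_NODE = {
        "knl": 64,        # assumed Xeon Phi (KNL) cores per node
        "skylake": 36,    # assumed Skylake cores per node
    }

    def core_hours(hardware: str, nodes: int, wall_hours: float) -> float:
        """In full-node allocation mode every core of each node is charged,
        whether or not the job uses them all."""
        return CORES_PER_NODE[hardware] * nodes * wall_hours

    # Example: a 4-node, 12-hour KNL job charges 64 * 4 * 12 = 3072 core-hours.
    print(core_hours("knl", nodes=4, wall_hours=12))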