SDCC Liaison Meeting

US/Eastern
3-192 (Bldg 510)

Kevin Casella (SDCC), Saroj Kandasamy (BNL), Tony Wong (Brookhaven National Lab (Physics Department))
Description

Join via BlueJeans (https://bluejeans.com/819381923). Passcode not required. You can also join via phone (Meeting ID: 819 381 923), by calling one of the numbers below:

+1.408.740.7256 (United States)
+1.888.240.2560 (US Toll Free)
+1.408.317.9253 (Alternate number)

Thursday August 27, 2020 Liaison Meeting Minutes

Facility News

  • SDCC job opening with HPSS jobID 2303
  • 2 remaining upcoming topics for discussion: “on-boarding” and collaboration on tools and services
  • Password policy changes to comply with the BNL CYBER plan will roll out soon
  • Long Kerberos passwords (16+ char) with no expiration
  • implementation of the new policy will be designed to cause minimal disruption to SDCC users
  • more information in coming weeks

Network & Facility Operations

  • on-site work has resumed at a steady pace since MINSAFE ended
  • OSG storage backend purchase in progress (0.5 PB usable addition to the Lustre in row45S)
  • 3 new ELK cluster servers have been delivered and are being installed in row44S
  • new 2-frame CSI tape library will be deployed in row50N, ETA early to mid September
  • purchase of the replacement HPSS core server is in progress
  • new CSI HPSS mover server is racked; its JBOD has just been delivered to CDCE
  • NSLS-II HPC cluster purchase is in progress; fiber provisioning is in progress

Storage

  • ATLAS dCache upgrade completed on Mon/Tues with no issues
  • Belle-II upgrade to dCache version 6.2 is targeted for next Monday
  • preparations under way for the new HPSS core server; targeting October to connect the new library

Fabric

  • working out a few issues found while testing the latest Singularity release with liaisons
  • retiring some machines soon, planning to boost the shared pool resources

General Services

  • look out for announcements on RHEV maintenance and web server reboots
  • asking liaisons for feedback on testing of recent NX changes
  • still working out issues with scanning the QR token
  • CYBER follow-up on the incident with the PHENIX web server; moving forward with a 3-phase plan:
  1. block external access and restore campus access (read/write)
  2. extract some external facing programs and then restore external access (read-only and no executable scripts)
  3. improve/update the full-stack of the remaining external facing programs

Topical Discussion: Compute Resource Allocation Policies and Procedures

HTCondor

  • No allocations per se (would sacrifice throughput - anathema to HTC)
  • Shared Pool model: groups buy hardware, SDCC gives a guaranteed share proportional to the buy-in
  • e.g., sPHENIX buys 10000 cores -> SDCC gives 10000 cores to “group_sphenix” to be split up by the Liaison
  • Group Quotas enforce guaranteed usage, but groups can go over quota if others are idle
  • Users can request CPUs/memory; limits are enforced via cgroups
  • Defaults of 1 CPU / 1.5 GB RAM, with 50% grace in RAM enforcement
  • Monitoring all in Grafana
  • https://monitoring.sdcc.bnl.gov/grafana/d/000000028/condor-overview?orgId=1
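
The shared-pool arithmetic above can be illustrated with a minimal Python sketch. It is a simplified model, not how HTCondor itself negotiates: the group name "group_other", the function names, and the rule for handing out idle cores (largest quota first) are hypothetical, and reading "50% grace" as a limit of 1.5 times the requested memory is an assumption.

```python
# Simplified model of the shared-pool policy described in the minutes.
# Everything except the numbers quoted in the minutes is an assumption.

DEFAULT_CPUS = 1          # default CPU request (from the minutes)
DEFAULT_MEMORY_GB = 1.5   # default RAM request (from the minutes)
MEMORY_GRACE = 0.5        # "50% grace", assumed to mean limit = request * 1.5


def allocate(quotas, demand):
    """Return cores per group: guaranteed quota first, then idle cores.

    quotas: cores each group bought (its guaranteed share).
    demand: cores each group currently wants.
    """
    # Step 1: each group gets what it asks for, up to its guaranteed quota.
    alloc = {g: min(demand.get(g, 0), q) for g, q in quotas.items()}
    idle = sum(quotas.values()) - sum(alloc.values())

    # Step 2: idle cores go to groups with unmet demand.
    # Ordering by largest quota first is a hypothetical tie-break.
    for g in sorted(quotas, key=quotas.get, reverse=True):
        extra = min(idle, demand.get(g, 0) - alloc[g])
        alloc[g] += extra
        idle -= extra
    return alloc


def memory_limit_gb(request_gb=DEFAULT_MEMORY_GB, grace=MEMORY_GRACE):
    """Enforced memory ceiling for a job, under the assumed grace rule."""
    return request_gb * (1 + grace)


if __name__ == "__main__":
    quotas = {"group_sphenix": 10000, "group_other": 5000}
    demand = {"group_sphenix": 12000, "group_other": 2000}
    print(allocate(quotas, demand))
    print(memory_limit_gb())
```

In this example "group_sphenix" wants more than its 10000-core buy-in and picks up cores that "group_other" leaves idle; if "group_other" later asks for its full share, the first step restores its guarantee.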

HPC
