SDCC Liaison Meeting

US/Eastern
3-192 (Bldg 510)


Kevin Casella (SDCC), Saroj Kandasamy (BNL), Tony Wong (Brookhaven National Lab (Physics Department))
Description

Join via BlueJeans (https://bluejeans.com/819381923). Passcode not required. You can also join via phone (Meeting ID: 819 381 923), by calling one of the numbers below:

+1.408.740.7256 (United States)
+1.888.240.2560 (US Toll Free)
+1.408.317.9253 (Alternate number)

Thursday September 24, 2020 Liaison Meeting Minutes

Facility News

  • Change in BNL status: transition to Phase 2, with additional staff returning to work
  • Change in presentation format: a one-slide summary from each group

Network & Facility Operations

  • CSI tape library now assembled; fiber and copper cabling being provisioned
  • new HPSS core server pair scheduled for 10/5/2020
  • NSLS-II HPC cluster, 2 racks ([33-4], [33-5]) in QCDOC: purchase and cabling in progress
  • NSLS-II Lustre, 2 racks ([33-2], [33-3]) in QCDOC: purchase in progress
  • sPHENIX central storage (5.4 PB), rack [46-8]: purchase in progress
  • migration of RHIC HPSS movers from row 16 to row 44S by mid-Oct. 2020
  • BCF HPSS Silo#2 relocation to CDCE Silo#8 expected in mid-Nov. 2020
  • B515 <=> B725 basement fiber conduit expected sometime in Oct. 2020
  • B515 SciCore Cut-In Intervention tentatively scheduled Dec 1, 2020 06:00 - 18:00

Storage

  • ATLAS:  SRR and QoS enabled after upgrade; adding 2 new storage hardware units with ZFS
  • Belle-II dCache:  SRR enabled after upgrade; calibration user added
  • STAR XRootD:  working on adding user authentication
  • PHENIX dCache:  retiring old write pools; enabling qgp006 as a write pool
  • HPSS:  creating two file families (FF) for PHENIX; relocations to CDCE; new CSI silo; new HPSS core
  • Belle-II RUCIO migration work continues
  • Belle-II Calibration farm setup ongoing
  • PHENIX/sPHENIX:  question about adding new storage so that sPHENIX upgrades are not hindered by legacy storage

Fabric

  • NSLS-II HPC cluster PO dispatched to Supermicro:
    • 2 racks, 30 nodes (12 with 2 x V100S GPUs)
    • 2 x Intel Xeon Gold 6252 Cascade Lake CPUs @ 2.10 GHz (96 logical cores)
    • 768 GB memory (12 x 64 GB DDR4-2933 DIMMs)
    • 1 x EDR InfiniBand
    • 2 x 10 Gbps NICs (initially using 1 port)
  • NSLS-II AD accounts/authentication requirement: working on it with ITD/Centrify
  • HTC/HPC Singularity upgrade: the 3.6 release resolves the ‘singularity shell’ issue
    • Testing 3.6.x on rplay53; updated to 3.6.3 to resolve a security issue
    • Also testing on ATLAS T1; planning to upgrade the entire farm within 2 weeks
  • Password-change interface went live:
    • https://web.sdcc.bnl.gov/apps/passwd
  • closing 85 ATLAS T1 nodes (2015 PO); moving them to the shared pool shortly
  • creating a local Docker registry with pull-through caching
  • discussing sPHENIX VOMS alternatives with SDCC ISSOs & federated auth admins

Tools Services

  • BNLBox:  newly updated BNL usage policy placed in each user's home directory
  • installed and testing anti-virus (ClamAV) on the BNLBox testbed
    • AV scanning has been requested as a component of our cybersecurity profile
  • ELK:  monitoring BNLBox usage (file transfer stats, number of users, etc.); recently extended to Globus usage
  • Digital Repositories:  EIC/Zenodo application port (443) now open to the world
    • EIC digital repo access is restricted to BNL users enrolled in InCommon/COmanage, via the SDCC/BNL InCommon IdPs
    • Cybersecurity policy enabled on the login page
  • Discussion of the EIC Zenodo digital repository community manager (curator)
    • Currently Zenodo allows only one curator per community
  • Discussion of moving the sPHENIX digital repository from the CSI-based custom Invenio app to InvenioRDM
  • VOMS:  coordinating with OSG to provide a VOMS client solution for sPHENIX
    • Service will be configured to allow sPHENIX jobs to run at remote sites via PanDA

General Services

  • planning to retire rssh & atlasgw ssh gateways soon
    • rftpexp gateways are not affected for now; work on them is still to be done
  • 4 ssh.sdcc.bnl.gov SSH gateways now in production (more can be added if load requires)
  • NX testing status: more testers and feedback are needed
    • Close to putting it into production (a better OTP setup system is needed first)
    • Warning: still in testing, not yet in production; more changes may occur
  • questions about a “Log out” button or other feature that lets users log out explicitly
  • A topical discussion on the status of password changes will conclude the meeting

Topical Discussion:  Compute Resource Allocation Policies and Procedures

  • As of 9/24:  358 / 1782 total “active” accounts (~20% in 10 days)
  • 1782 = 1323 active + 459 active with expired principal (password never set since the IPA migration 2 years ago)
  • There is a chance CYBER will require active accounts with an expired principal to be deactivated (no SSH)
  • Please spread the word to maximize conversion during this transition
  • Users who don’t know their current password should submit a ticket to RT useraccts to get a temporary password:
    • RT-RACF-UserAccounts@bnl.gov
  • Download the slides for more detailed and per-group statistics
  • Reminders will go out with increasing frequency as the deadline approaches (10/12/2020)