1 Account Management on a Large-Scale HPC Resource
Brett Bode, Tim Bouvet, Sharif Islam and Jeremy Enos
National Center for Supercomputing Applications, University of Illinois
2 Blue Waters Computing System
[System diagram; labels as extracted]
- Aggregate memory: 1.66 PB
- IB switch
- 10/40/100 Gb Ethernet switch
- External servers
- Scuba Subsystem: storage configuration for user best access
- Sonexion: 26 usable PB
- Spectra Logic: 200 usable PB
- 300+ Gbps WAN
- Bandwidth labels: >1 TB/sec, 120+ GB/sec, 66 GB/sec
HPC Systems Professionals Workshop 2016
3 [System diagram; component labels as extracted]
- Cray XE6/XK7: 288 cabinets, Gemini fabric (HSN)
- XE6 compute nodes: 5,659 blades, 22,636 nodes, 362,176 FP (Bulldozer) cores, 724,352 integer cores
- XK7 GPU nodes: 1,057 blades, 4,228 nodes, 33,824 FP cores, 4,228 GPUs
- Service nodes: DSL 48, Resource Manager (MOM) 64, BOOT 2, SDB 2, RSIP 12, Network GW 8, LNET routers 582, unassigned 74
- esLogin: 4 nodes
- SMW, boot RAID, boot cabinet
- InfiniBand fabric, import/export nodes, HPSS data mover nodes, esServers cabinets
- Sonexion: 25+ usable PB online storage, 36 racks
- Near-line storage: 200+ usable PB
- 10/40/100 Gb Ethernet switch, NCSAnet, cyber protection IDPS, management node, NPCF
- Supporting systems: LDAP, RSA, Portal, JIRA, Globus CA, Bro, test systems, Accounts/Allocations, Wiki
Speaker notes: Blue Waters supersystem. Need to emphasize the unprecedented scale: largest Cray, IE, HPSS, Lustre, network. Hold performance discussion to later slides. Note that all storage sizes are usable storage.
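The blade, node and core totals on this slide are mutually consistent. The short check below is a sketch that derives the per-blade and per-node ratios from the slide's own totals; the ratios themselves are not stated on the slide, and the dictionary keys are names chosen here for illustration.

```python
# Totals exactly as given on the slide.
xe6 = {"blades": 5659, "nodes": 22636, "fp_cores": 362176, "int_cores": 724352}
xk7 = {"blades": 1057, "nodes": 4228, "fp_cores": 33824, "gpus": 4228}


def ratios(part):
    """Divide each total by the blade or node count it belongs to."""
    out = {"nodes_per_blade": part["nodes"] / part["blades"]}
    for key in ("fp_cores", "int_cores", "gpus"):
        if key in part:
            out[f"{key}_per_node"] = part[key] / part["nodes"]
    return out


xe6_ratios = ratios(xe6)
xk7_ratios = ratios(xk7)
total_compute_nodes = xe6["nodes"] + xk7["nodes"]

print(xe6_ratios)
print(xk7_ratios)
print(total_compute_nodes)
```

Every ratio comes out as a whole number, which is what one would expect if the totals were copied correctly from the diagram.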
4 Security Strategy
- Separate user and administrative login points.
- Eliminate privilege escalation on user-accessible hosts.
- Limit administrative access so that it originates from a small number of administrative hosts.
- Administrative access must be one-way!
- Lay out, and get buy-in on, a policy for critical security issues! On Blue Waters, user-exploitable privilege escalation issues warrant emergency maintenance and possibly immediate user lockout.
5 Authentication
- Historically, compromised passwords have been the top vector for intrusions on NCSA HPC systems.
- If using passwords, make sure you have a reasonable policy and that default passwords are either not used or expire quickly.
- For Blue Waters, two-factor one-time passwords (OTP) were specified for both admin and user access.
- This largely solves the compromised-account problem, but adds cost and significant overhead.
6 Network Design
- NCSA operates a very high-bandwidth open network environment.
  - Currently 370 Gbps.
  - No firewalls; active intrusion detection using Bro.
- Even on a firewalled network, administrative hosts should be isolated from user networks.
- Blue Waters has four separate administrative domains!
7 Logical Network Design
[Network diagram]
8 Bastion Hosts
- The Blue Waters bastion hosts provide the only login route to the multiple administrative domains.
- Admins log in using their regular accounts and OTP.
- Host-based access is used internally to reach the administrative servers, restricted through LDAP groups.
- sudo is used to escalate privileges on the admin servers, also restricted by LDAP groups.
- pfSense firewalls allow very limited egress so that "normal" software update tools can function.
9 Administrative Access
- Escalation is only allowed on the administrative hosts. From there, key-based root access is used within that administrative domain.
- One-way access: administrative hosts do not allow user or admin access from a user-accessible host, and bastions do not allow the reverse path from administrative hosts.
- Root cannot cross administrative domains. This allows granting admin rights on a subset of the overall system.
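The one-way rules on this slide can be written down as a small predicate. This is an illustrative model only, not the Blue Waters configuration: the `Host` fields, the role names and the example host names are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    role: str                      # "bastion", "admin" or "managed" (hypothetical labels)
    domain: str = ""               # administrative domain; empty for bastions
    user_accessible: bool = False  # True for hosts ordinary users can log in to


def admin_login_allowed(src: Host, dst: Host) -> bool:
    """Return True if an administrative login from src to dst fits the one-way rules."""
    # No administrative access may originate on a user-accessible host.
    if src.user_accessible:
        return False
    # Bastions lead only to administrative hosts.
    if src.role == "bastion":
        return dst.role == "admin"
    # From an administrative host, root stays inside its own domain
    # and never goes back to a bastion.
    if src.role == "admin":
        return dst.role != "bastion" and dst.domain == src.domain
    return False


bastion = Host("bastion1", "bastion")
admin_a = Host("adm-a", "admin", "A")
login_a = Host("login-a", "managed", "A", user_accessible=True)
node_b = Host("node-b", "managed", "B")

print(admin_login_allowed(bastion, admin_a))  # bastion to admin host
print(admin_login_allowed(admin_a, login_a))  # root within domain A
print(admin_login_allowed(admin_a, node_b))   # root across domains
print(admin_login_allowed(login_a, admin_a))  # from a user-accessible host
print(admin_login_allowed(admin_a, bastion))  # reverse path to the bastion
```

Writing the policy as a single function makes the asymmetry explicit: every allowed path points away from the bastion and stays inside one domain.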
10 User Access Management
- Potentially separate groups have access to logins, Lustre data transfer and nearline data transfer.
- Access is granted based on group membership and the standard Linux /etc/security/access.conf file:
    + : TRAIN_aaaa TRAIN_bbbb : xxx.xx/32
    - : ALL EXCEPT root crayadm globus bw_staff PRAC_cccc ILL_dddd … : ALL
- What about maintenance? It is desirable to have a fast way to restrict access to all nodes in a service class.
- Blue Waters has a centralized monitoring and control workstation (ISC). Using a web portal, admins can quickly add or remove projects from the access list, or switch to a restricted maintenance access list.
- Clients pull a new access.conf file once per minute. (motd and ssh.banner are handled the same way.)
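The slide does not show how the access list is generated or installed. The following is a minimal sketch of one way to do it, assuming the always-allowed names from the slide's example and a set of project groups supplied by the portal. The function names are hypothetical; `os.replace` is used so that a login never reads a half-written file.

```python
import os
import tempfile

# Always-allowed names taken from the example rule on the slide.
STAFF = ["root", "crayadm", "globus", "bw_staff"]


def render_access_conf(projects, maintenance=False):
    """Return a single deny rule covering everyone outside the allow list."""
    allowed = STAFF if maintenance else STAFF + sorted(projects)
    return f"- : ALL EXCEPT {' '.join(allowed)} : ALL\n"


def install_atomically(text, path):
    """Write text to a temp file next to path, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        os.chmod(tmp, 0o644)
        os.replace(tmp, path)  # readers see the old file or the new one, never a partial one
    except BaseException:
        os.unlink(tmp)
        raise


normal_rule = render_access_conf({"PRAC_cccc", "ILL_dddd"})
maintenance_rule = render_access_conf({"PRAC_cccc", "ILL_dddd"}, maintenance=True)
print(normal_rule, end="")
print(maintenance_rule, end="")
```

Switching to maintenance mode then amounts to publishing the shorter rule; clients pick it up on their next pull.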
11 Account and Group Management
- Managing projects and adding or removing users is performed using an external database and web portal, which is outside the scope of this talk.
- Changes need to be pushed to 27,000+ clients quickly and efficiently!
- Our solution was to build our own LDAP infrastructure emphasizing scalability and fault tolerance.
- All changes are made on an external host that is the LDAP master (LDAP is not writable from anywhere else).
- LDAP replica servers are set up in redundant pairs, both external to Blue Waters and inside the high-speed fabric, with a presence on each separate network (administrative, HSN, user private, user public).
- No clients pull from the master directly.
- SSL is used, though LDAP is not used for passwords.
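A client-side view of the redundant replica pairs might look like the sketch below. The host names and the `is_up` callback are hypothetical; the only points taken from the slide are that each network has a pair of replicas and that the master never appears in a client's server list.

```python
# Hypothetical replica names: one redundant pair per network, master deliberately absent.
REPLICAS = {
    "administrative": ["ldap-adm-1", "ldap-adm-2"],
    "hsn": ["ldap-hsn-1", "ldap-hsn-2"],
    "user-private": ["ldap-priv-1", "ldap-priv-2"],
    "user-public": ["ldap-pub-1", "ldap-pub-2"],
}


def pick_replica(network, is_up):
    """Return the first reachable replica of the pair serving this network."""
    for host in REPLICAS[network]:
        if is_up(host):
            return host
    raise RuntimeError(f"no LDAP replica reachable on {network}")


# Both replicas healthy: the first of the pair is used.
print(pick_replica("hsn", lambda host: True))
# First replica down: the client falls back to its partner.
print(pick_replica("hsn", lambda host: host.endswith("-2")))
```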
12 Extending LDAP
- LDAP provides support for standard account and group information. However, it is also quite easy to extend LDAP to provide additional features.
- Blue Waters extends the LDAP schema to include storage quotas, project PI information and gridmap information. All are set at project or user creation on the LDAP master.
- On login, a PAM module checks for the home and scratch directories and creates them if needed. Quotas are also checked and changed if needed.
- The GridFTP daemon was modified to call out to LDAP to look up the gridmap entry rather than relying on the traditional file.
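The login-time check described on this slide could be sketched as follows. This is not the Blue Waters PAM module: the directory entry is modelled as a plain dictionary, `scratchDirectory` and `quotas` are assumed attribute names, and quota handling is delegated to two caller-supplied callbacks so the example runs without a real filesystem quota tool.

```python
import os
import tempfile


def ensure_login_environment(entry, current_quota, set_quota):
    """Create missing directories and reconcile quotas; return the actions taken."""
    actions = []
    for attr in ("homeDirectory", "scratchDirectory"):
        path = entry[attr]
        if not os.path.isdir(path):
            # A real module would also set ownership to the account's uid and gid.
            os.makedirs(path, mode=0o700)
            actions.append(f"created {attr}")
    for filesystem, wanted in entry["quotas"].items():
        if current_quota(filesystem) != wanted:
            set_quota(filesystem, wanted)
            actions.append(f"quota {filesystem} -> {wanted}")
    return actions


with tempfile.TemporaryDirectory() as root:
    entry = {
        "homeDirectory": os.path.join(root, "home", "train01"),
        "scratchDirectory": os.path.join(root, "scratch", "train01"),
        "quotas": {"home": 1000},  # illustrative value and unit
    }
    applied = {}  # stands in for the filesystem's quota table
    first_login = ensure_login_environment(entry, applied.get, applied.__setitem__)
    second_login = ensure_login_environment(entry, applied.get, applied.__setitem__)

print(first_login)
print(second_login)
```

The second call returns an empty list, which is the property that matters: the check is safe to run on every login.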
13 Account/Project Removal
- PIs are given the ability to add and remove users via a web portal, so accounts can be removed at any time.
- When projects end, they are given a 90-day grace period and then are removed. Home and project data are permanently removed!
- Nearline access can be extended up to one year, but is also eventually closed and the data removed.
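The removal timeline is simple date arithmetic. The sketch below assumes that "one year" means 365 days counted from the project end date; the slide does not say whether the nearline extension is measured from the project end or from the end of the grace period, and the example end date is arbitrary.

```python
from datetime import date, timedelta

GRACE = timedelta(days=90)           # grace period from the slide
NEARLINE_MAX = timedelta(days=365)   # "up to one year", assumed to mean 365 days


def removal_dates(project_end):
    """Return the earliest removal date for online data and the latest nearline close."""
    return {
        "home_and_project_removed": project_end + GRACE,
        "nearline_closed_by": project_end + NEARLINE_MAX,
    }


schedule = removal_dates(date(2016, 6, 30))
print(schedule["home_and_project_removed"])  # 2016-09-28
print(schedule["nearline_closed_by"])        # 2017-06-30
```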
14 Education/Training Projects
- The use of Blue Waters for education and training projects is encouraged, but requires an alternative account setup and distribution process.
- Some training projects may use hundreds of accounts spread across multiple remote sites, making the distribution of OTP tokens impractical.
- The Blue Waters solution is to allow limited single-factor logins for certain short-duration projects.
- These accounts are generic (instr*… train*) and get recycled.
- All access is required to go through a single bounce host.
- Each account is assigned a unique password on the bounce host and a self-signed certificate granting access from the bounce host to a regular login node for only the duration of the project.
- The passwords are included in a generated PDF that is encrypted prior to distribution to the instructors for the event. A separate channel is used to distribute the encryption key.
- An admin enables the group for access to the logins (from the bounce host) at the beginning of the course and disables it at the end of the course.
- Since these accounts are not two-factor, they may be disabled without notice in the event of a known security issue.
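Assigning a unique password to each generic account can be sketched with Python's `secrets` module. The zero-padded numbering scheme, the password length and the alphabet are assumptions made for illustration; the slide only states that accounts are generic and that each gets a unique password.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def random_password(length):
    """Draw each character from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def training_credentials(prefix, count, length=16):
    """Map generic account names (e.g. train01) to distinct random passwords."""
    width = len(str(count))
    used = set()
    credentials = {}
    for index in range(1, count + 1):
        password = random_password(length)
        while password in used:  # collisions are vanishingly unlikely, but uniqueness is required
            password = random_password(length)
        used.add(password)
        credentials[f"{prefix}{index:0{width}d}"] = password
    return credentials


credentials = training_credentials("train", 25)
print(sorted(credentials)[:3])
```

The resulting mapping is the sensitive artifact; per the slide, it would be rendered to a PDF and encrypted before it leaves the host, with the key sent over a separate channel.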
15 Conclusions
- A carefully planned administrative network provides secure and effective system administrative access.
- The Blue Waters use of LDAP has enabled very efficient and resilient account and project management changes with a very large client count.
- LDAP has also proven to be very extensible for helping manage a range of quotas and project information.
- The use of OTP can be carefully mixed with limited non-OTP accounts for special purposes.
16 Questions
Acknowledgements
- Mark Klein
- Jason Alt
- NCSA security team
Supported by:
- The National Science Foundation through awards OCI and ACI
- The State and University of Illinois