The School of Computational Science (now DSC), the Center for Ocean-Atmospheric Prediction Studies, and the Office of Technology Integration have developed a plan to replace the current FSU supercomputer with a new multidisciplinary HPC facility.
While the current FSU supercomputers have provided a reliable, leading-edge research platform to the FSU community for the last six years, newer HPC technologies are now available. The School of Computational Science, the Center for Ocean-Atmospheric Prediction Studies, and the Office of Technology Integration have developed a plan to replace the current systems with a new leading-edge multidisciplinary HPC facility on the FSU campus to support research and education at FSU.
The mission of the new FSU HPC facility will be to:
- Support multidisciplinary research. The operating system, software applications, libraries, job management system, and hardware architecture of the HPC resource will be selected to support a broad range of computational research requirements across many disciplines.
- Provide a general-access computing platform to FSU researchers following a simple application process. Access to local compute cycles will encourage curiosity-driven research, novel approaches, and increased productivity. Accounts will include access to high-availability disk storage and to computing cycles based on a fair allocation and scheduling policy.
- Encourage owner-based participation. Scientific and academic units with a need for dedicated computing cycles can purchase machines that will be added to the existing HPC system, and funds from the general University HPC budget will be used to match the number of machines purchased by a unit. Members of the contributing unit will be given priority access on the machines they purchased, as well as those purchased by the university under the matching agreement; however, when resource owners are not utilizing their machines, the cycles will be made available to the general-access research community (see the scheduling sketch following this list).
- Provide a broad base of support and training opportunities. Staff hired to support the HPC facility will work directly with FSU researchers to facilitate access to and use of the resource. Workshops and training sessions will be presented periodically to the University describing new applications and technologies related to HPC research.
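To make the fair-allocation and owner-priority policies above concrete, here is a minimal scheduling sketch in Python. The priority weights, the `Job` fields, and the fair-share decay formula are illustrative assumptions modeled on common schedulers, not a specification of the job management system that will ultimately be selected.

```python
from dataclasses import dataclass

@dataclass
class Job:
    user: str
    group: str              # sponsoring academic unit
    cpu_hours_used: float   # the group's recent usage
    cpu_hours_share: float  # the group's allocated share
    owns_nodes: bool        # True if the group bought into the facility

def fair_share_factor(job: Job) -> float:
    """Fair-share decay (an assumed formula): groups under their
    allocated share float toward 1.0, heavy users sink toward 0.0."""
    return 2.0 ** (-job.cpu_hours_used / max(job.cpu_hours_share, 1e-9))

def priority(job: Job, on_owned_node: bool) -> float:
    """Owner jobs outrank general-access jobs on the machines their
    unit purchased (and the matched machines); elsewhere, fair share
    alone decides."""
    base = fair_share_factor(job)
    if job.owns_nodes and on_owned_node:
        base += 10.0  # owner tier always wins on owned hardware
    return base

def next_job(queue: list[Job], node_owner: str | None) -> Job:
    """Pick the next job for a node. An idle owner-based node still
    runs general-access work, which a returning owner job outranks."""
    return max(queue, key=lambda j: priority(j, node_owner == j.group))
```

Under a policy of this shape, an idle owner-based machine is never wasted: general-access jobs backfill onto it, and an owner job simply outranks (and in practice would preempt) them when the owning unit needs its cycles.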
The system will be built around three guiding principles:
- Modularity - The system will consist of modular commodity computing components, which can easily be added and removed without significant interruption to computing services.
- Scalability - The system will easily accommodate new computing components without requiring significant changes to scientific applications or to systems administration tools.
- Compatibility - The system will support existing scientific code with little or no modification needed. The system architecture and operating system must be compatible with a wide variety of existing scientific applications and libraries.
The emphasis on modularity, scalability, and compatibility forms the basis of a sustainable strategy for maintaining a powerful HPC facility on the FSU campus.
- First Year -- The primary focus of the first year will be to hire technical support personnel and to build the core infrastructure components. In addition, we will start to develop the computational component of the facility, with particular emphasis on acquiring systems that will involve some cost sharing by departments. Core infrastructure components will include a large shared storage system, an InfiniBand fabric, and an IP fabric.
- Second and Third Years -- By the end of the third year the proposed facility will have twice the throughput of the current IBM facility (i.e., 5 TeraFlops) and will provide at least as much storage capacity (450 Terabytes), putting FSU back among the leading academic HPC facilities (see the capacity sketch following this list). The emphasis during years 2 and 3 will be on acquiring the computational components necessary to support competitive and innovative research at FSU. The computational components will be funded by a combination of budget allocations for this facility and contributions from academic units through a system of cost sharing. By the end of the third year, the facility will consist of a large general-access component, designed to support computational research on a first-come, first-served basis, and a large owner-based component, which will provide guaranteed compute cycles to research units with dedicated computing requirements. The general-access and owner-based computational components will be tightly integrated to allow seamless cycle sharing among available resources.
- Fourth Year and Beyond -- Some of the core infrastructure equipment purchased in the first year will be replaced, particularly critical components beyond their warranty. New state-of-the-art computing components will be added to the existing facility. These new computing elements will increase the overall throughput of the HPC facility and will also replace units purchased in the first year that fail beyond their warranty period.
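As a rough illustration of the year-three targets, the arithmetic below sizes the facility under hypothetical per-node and per-shelf figures; these numbers are assumptions for commodity hardware of this class, not vendor quotes.

```python
# Year-three targets from the plan.
TARGET_TFLOPS = 5.0        # twice the current IBM facility's throughput
TARGET_STORAGE_TB = 450.0  # at least the current storage capacity

# Assumed commodity building blocks (illustrative figures only).
GFLOPS_PER_NODE = 12.0     # e.g., a dual-processor commodity node
TB_PER_STORAGE_SHELF = 6.0

nodes_needed = TARGET_TFLOPS * 1000.0 / GFLOPS_PER_NODE
shelves_needed = TARGET_STORAGE_TB / TB_PER_STORAGE_SHELF

print(f"~{nodes_needed:.0f} compute nodes")     # ~417 nodes
print(f"~{shelves_needed:.0f} storage shelves") # ~75 shelves
```

Because of the modular design, these components need not arrive at once; nodes and storage can be added incrementally across years 2 and 3 as budget allocations and cost-sharing contributions arrive.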
Academic and scientific units with a need for dedicated computing cycles will be able to purchase machines that will be added to the existing HPC facility. To encourage owner-based participation, funds from the general University HPC budget will be used to match the number of machines purchased by an academic unit. Members of the contributing unit will be given priority access on the machines they purchased, as well as those purchased by the university under the matching agreement, as long as the machines remain under warranty and are compatible with the existing facility. Owners are responsible for additional warranty/repair costs beyond the basic three-year warranty period. Encouraging cost sharing in this way represents a substantial value to the University. First, owner-based participation provides a cost-effective way for the University to substantially increase its effective HPC power: owner-based systems rarely run at full capacity, and our load distribution systems will be configured to allow jobs to run on any idle machine, whether it is an owner-based or a general-access system. Second, attempting to provide each academic unit with the space, electricity, cooling, and technical support needed by even a small HPC installation is not cost effective. To participate in the cost-sharing incentive an academic unit must contribute a minimum of $25,000. Costs that are not included in the standard account allocation (e.g., costs related to commercial software packages and additional storage capacity) are the responsibility of the researcher or academic unit. To encourage new faculty to contribute to and use the facility, part of their startup package could be made eligible under this cost-sharing agreement. For example, the College of Arts and Sciences might offer a new faculty member $25,000 in computer startup funds and double that amount if the new faculty member were to invest in the HPC facility (see the sketch below for the matching arithmetic).
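The following sketch works through the matching arithmetic described above. The per-node price is a hypothetical figure for illustration; the $25,000 minimum and the one-for-one match come from the plan.

```python
MIN_BUY_IN = 25_000   # minimum contribution to participate (from the plan)
NODE_PRICE = 2_500    # assumed price of one commodity node (illustrative)

def cost_share(contribution: int) -> dict[str, int]:
    """Nodes a unit receives priority access to under the
    node-for-node university match."""
    if contribution < MIN_BUY_IN:
        raise ValueError("contribution is below the $25,000 minimum")
    unit_nodes = contribution // NODE_PRICE
    matched_nodes = unit_nodes  # university matches node-for-node
    return {
        "unit_nodes": unit_nodes,
        "matched_nodes": matched_nodes,
        "priority_nodes": unit_nodes + matched_nodes,
    }

# The startup example from the text: a $25,000 investment is
# effectively doubled by the university match.
print(cost_share(25_000))
# {'unit_nodes': 10, 'matched_nodes': 10, 'priority_nodes': 20}
```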
The long-term success of the proposed HPC resource will require a clear management structure; users should feel a sense of ownership when using the resource, with few bureaucratic obstacles in the way of implementing leading-edge research. Likewise, the policies governing resource allocation should be as transparent as possible, while the underlying complexity of the proposed interdisciplinary HPC resource should be as unobtrusive as possible. These sometimes-conflicting requirements will be met by assigning the management of the resource to an engaged and experienced leadership and by charging an interdisciplinary advisory group with the responsibility of setting and regularly reviewing policies related to account application and acquisition, load balancing, and resource allocation. A supportive working relationship with the FSU Office of Technology Integration will also be important for the effective transition of the existing facility to SCS and for the long-term success of the resource. The specific administrative entities and their general duties are as follows:
- SCS Director -- The SCS director (Dr. Max Gunzburger) will have overall responsibility for the HPC system and will work with the HPC director and systems manager to ensure continuity.
- HPC Director -- The HPC director (Dr. Jim Wilgenbusch) will be responsible for managing and maintaining the resource. A committee of three SCS faculty members, known as the Local Systems Committee (LSC), will assist the director and will regularly convene the HPC Advisory Group.
- UCS Director -- The University Computing Services (a unit of the Office of Technology Integration) director (Dr. J. Michael Barker) will be responsible for providing machine room space for the HPC system. He will work with the HPC systems manager and technical support staff to coordinate the following tasks: installing and providing power and network services, coordinating access to the UCS machine room for HPC staff, and providing after-hours operations services to on-call HPC analysts. UCS will provide a "co-location" service to SCS for the HPC; HPC staff will be responsible for the physical operation of the machine (including power-up, reboots, configuration, end-user support, identity management, troubleshooting at the operating system layer and above, etc.).
- HPC Advisory Group -- The HPC Advisory Group will be composed of members of the University community actively engaged in research programs representing a broad spectrum of the computational research taking place at the University (6 to 8 members including the chair, Dr. Eric Chassignet). This group will be responsible for setting policy pertaining to account application and acquisition, load balancing, and resource allocation. The Advisory Group will also make recommendations with respect to hardware purchases.
- HPC Systems Manager -- The systems manager (Dr. Jeff McDonald) will be responsible for the day-to-day operation of the facility and for the technical support staff assigned to it. In addition, the systems manager will be responsible for implementing procedures and general user guidelines pursuant to the policies set by the HPC Advisory Group. The systems manager will have an advanced degree in a computational science and experience managing a multi-user facility.
- Technical Support Staff -- The SCS recognizes that effective support of research and application development requires more than a traditional approach to systems administration. Three core systems professionals will maintain the basic integrity of the HPC infrastructure, while also focusing on building, testing, developing, and documenting specific research applications and scientific libraries. The support staff will have advanced degrees in the computational sciences or, at a minimum, familiarity with the research taking place on the HPC systems. Support staff will communicate directly with HPC users in order to facilitate effective use of the HPC resource. We also anticipate situations that require only short- to medium-term support; in such cases it may be necessary to recruit professionals for a limited duration.
- Student Assistants -- Student participation in a limited number of HPC systems administration tasks is valuable from at least two perspectives. First, students will gain valuable hands-on training in leading-edge technologies, which is consistent with the mission of the University. Second, students can provide a valuable service for certain tasks, giving core professionals time to work on other, more central HPC issues. For example, students will be assigned to build and test applications, perform systems testing and benchmarking, and replace faulty hardware components. All of these tasks provide students with valuable training opportunities without putting the reliability of the systems in jeopardy. Student assistants will come from academic units and will be assigned tasks and supervised by the HPC systems manager.
Access and Compute Cycle Allocation
FSU researchers will be able to apply for access to the HPC facility by submitting a simple web-based form. General-access accounts will be created for FSU faculty and the members of their research groups. This means that anyone can be given an account on the HPC facility provided a member of the FSU faculty is willing to sponsor them. Sponsored accounts are intended to support the research of the FSU faculty sponsor and may be suspended at the discretion of the HPC director if they are not used as intended. Accounts on owner-based components will be created at the discretion of a designated representative for the owner-based system.
The installation, configuration, and upkeep of the HPC system are the responsibility of the SCS and the HPC systems staff. OTI will provide facility management services during the transition from the existing Teragold and Eclipse systems to the new HPC system. The specific details of the transition and the transition schedule are as follows:
- The deadlines and timeframes given in the transition schedule below will be communicated to the user community by the appropriate SCS authority (e.g., the HPC Director) as soon as possible.
- SCS will identify which user accounts are to be moved from Teragold and/or Eclipse onto the new HPC system; OTI will assist SCS in moving said accounts.
- SCS and OTI will engage in a joint effort to move the identified subset of user data from the Teragold/Eclipse file systems to the new file storage environment (see the migration sketch following this list).
- OTI will be responsible for dismantling Teragold, Eclipse, and peripherals and disposing of the hardware. The IBM 3584 tape subsystem will be transferred to the UCS property inventory. SCS will retain a single Eclipse node and console in the main datacenter to facilitate software development and to troubleshoot code porting to the new HPC facility.
- SCS will provide support for porting code from the Teragold and Eclipse systems onto the new HPC system.
- SCS will provide support for optimizing code for the new HPC platform.
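As a concrete illustration of the joint data-migration step, here is a minimal Python sketch. The mount points, account names, and the use of rsync are assumptions for illustration only; the actual procedure and schedule will be agreed between SCS and OTI.

```python
import subprocess
from pathlib import Path

# Hypothetical mount points; the real locations would be set by SCS/OTI.
OLD_HOME = Path("/teragold/home")
NEW_HOME = Path("/hpc/home")

def migrate_users(accounts: list[str]) -> None:
    """Copy only the identified subset of user directories from the
    Teragold/Eclipse file systems to the new storage environment."""
    for user in accounts:
        src = OLD_HOME / user
        dst = NEW_HOME / user
        if not src.is_dir():
            print(f"skipping {user}: no home directory on the old system")
            continue
        dst.mkdir(parents=True, exist_ok=True)
        # rsync -a preserves permissions and timestamps and can be
        # rerun safely if a transfer is interrupted.
        subprocess.run(["rsync", "-a", f"{src}/", f"{dst}/"], check=True)

# Accounts identified by SCS for migration (illustrative names).
migrate_users(["researcher1", "researcher2"])
```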